Scientific Research - Page 5 - Jinshan Wu's Work and Thoughts

AfterTheDeadline, an AI spell checker

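I just tried installing AfterTheDeadline's atd server, along with atdtool, the command-line tool that talks to the server and returns the check results, hoping to use them to proofread my writing. It did not work, so for now I will keep using the web version: http://www.polishmywriting.com/. When I have time I will work out how to get it installed.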

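This checker is built on a corpus rather than only checking word spelling against a dictionary. It can also flag some grammatical errors and offer style suggestions: a wrong tense, mismatched singular and plural, avoiding complicated and difficult words, preferring the active voice, and so on. I tried it and it works quite well, much faster than reading everything aloud word by word myself. It also proposes corrections; at that point, of course, you have to make the decision yourself.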

How to learn by observing everyday life

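When I was about nine, I made a snack for my parents for the first time (in rural Zhejiang it is customary to have an extra meal a little after two in the afternoon) and brought it to the field where they were working. I still remember that they were delighted but also very surprised. They asked: how did you know how to make pies with meicai (preserved vegetable) filling? Actually, I had forgotten to add the meat; there was only flour and meicai. Still, it was quite an achievement. I said: I like eating them, so I just watched you make them, Mom, and that was how I learned.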

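I also especially love fish, so cooking fish was something I picked up early by watching, and then by trying it myself: catching the fish, cleaning it, cooking it, and eating it, all on my own. So it mainly comes down to watching and trying. Sometimes I also tried things I had not seen but had thought up myself, which could turn out tastier. Usually, of course, it turned out worse, until the experiments led to a certain level of understanding of cooking. Looking back, I have been a foodie since childhood, and also a scientist.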

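As an aside, if I have to say something about the levels of cooking, they are roughly as follows. First, there must be a dominant flavor; this is usually salty, sometimes sweet, occasionally sour, or a combination of these. Second, the ingredient's own flavor must work together with the dominant flavor: while the dominant flavor stays dominant, the ingredient's own taste should stand out. For example, with fish one must reduce the fishy smell but keep the freshness, and not cover it with other flavors. There are exceptions, of course, such as fish with pickled mustard greens (suancai yu). In that dish, though, the pickled greens are really the main ingredient, so their flavor should lead anyway. Third, some experience with ingredients, which most of the time simply means knowing their characteristics, is very useful. For example, mushrooms contain a lot of amino acids and can replace MSG. Good soybean paste and soy sauce can sometimes replace part of the salt, and they are also rich in amino acids. Fourth, experience accumulated by others can be borrowed, but one should think about why it works. Examples: scallion and ginger remove fishy smells; a few drops of lemon juice can be added when whipping eggs; beef and chicken should be massaged to relax the meat while marinating; and then there are recipes (the combination of ingredients and the order and techniques of cooking). Learning from experience is one side; thinking, and then possibly generalizing and borrowing, is the other. All right, if I keep writing, this post will turn into a cooking post. Later, under the influence of both Eastern and Western cuisines, I became even more of a thinking foodie-scientist (an experimenter).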

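Back to the main topic: everyday life also offers many things to learn, and the main principles are "take notice of what you see" and "do experiments". Learning in courses is in fact the same: take notice of how the teacher does it, then try it yourself and experiment. So how does one manage to "take notice"? That is the topic of today's post.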

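Today Xin'er did something that I do often, did not do it well, and I scolded her. Actually I should not have: I never taught her to be observant in daily life and to take notice, nor how to look. So how should one look? Just as in my understanding-oriented reading and understanding-oriented writing, use WHWM, that is, ask four questions: What, How, Why, and Meaningful. What is he doing? How is he doing it? Why is he doing it this way, and why is he doing this at all? What would happen if I did it, does it matter to me, and should I learn from it and think about it?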

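When reading, we ask: What is the main message? How is this message built up (through examples, other concepts, and so on)? Why is this message being conveyed, and why is it built this way? What does it mean to me, and do I like it?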

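When writing, we ask: What is the main message I want to convey? How do I build up this message? Why do I want to convey it, and why in this way? What does it mean for my readers, and is it reasonable?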

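When designing a degree program, a course, or even a single class, we ask: From the standpoint of helping students understand the big picture of the discipline and learn on their own later, what message do I want to convey? How is this message built up, given the background the students already have? Why do I want to convey it, and why in this way? What does it mean for my students, and does it promote understanding of the world, understanding of the discipline, and deeper thinking?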

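Reading comprehension homework: What is the main message of this post? How does the author build up this main message? What are the strengths and flaws in the logic of this construction? Why does the author want to convey this message? Is there a "parent message" above this one, in some sense more important (a message from which this one can to some extent be derived)? What do you think?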

Distinguishing internal and external factors using machine learning

Recently I came across a paper that tries to distinguish the internal and external factors of a dynamical process using machine learning, given only time-series data \(x\left(t-\tau\right), x\left(t-\tau+1\right),\cdots, x\left(t\right)\).

In fact, in the paper only a few values, \(x\left(t-2\right), x\left(t-1\right), x\left(t\right) \), are known. They are used to predict \(x\left(t+1\right)\), and the goal is to determine how much of this \(x\left(t+1\right)\) is due to internal factors, giving \(x_{i}\left(t+1\right)\), and how much is due to external factors, giving the additional \(x_e\left(t+1\right)\), such that the observed \(x\left(t+1\right)\) is explained by combining the two, i.e., \(x\left(t+1\right)=x_{i}\left(t+1\right)+x_{e}\left(t+1\right)\). I have serious doubts about this idea and its working logic, which I will try to explain here.

Assume that \(x\left(t\right)\) is indeed governed by the following dynamics,

\begin{align}
x\left(t+1\right) = f\left[x\left(t\right)\right] + z\left(t\right) \\
z\left(t+1\right) = g\left[z\left(t\right)\right]
\end{align}
Here, however, \(f, g\) are unknown and \(z\) is not observed at all. Not being observed means that we cannot make use of it in constructing our predictive model. We regard \(f\left[x\left(t\right)\right] \) as the internal factor and \(z\left(t\right)\) as the external factor. Only the full \(x\), that is \(x_{real}\), is known; \(z\) is not.

Now our question becomes how to figure out \(f\left[x\left(t\right)\right] \) and \(z\left(t\right)\), the internal and external contributions to \(x\left(t+1\right)\), starting from a time series of \(x\left(t\right)\).
Suppose there is a perfect algorithm that generates \(x\left(t+1\right)\) from a time series of \(x\left(t\right)\) when the dynamics of \(x\left(t\right)\) is fully internal. Let us call this algorithm LP, just for fun. Would this help us solve the problem defined above?
What if we blindly apply it to the observed data \(x\left(t-\tau\right), x\left(t-\tau+1\right),\cdots, x\left(t\right)\)? In principle, we might get
\begin{align}
x\left(t-\tau+1\right) = LP\left[x\left(t-\tau\right)\right] + z\left(t-\tau\right) \\
\cdots \\
x\left(t+1\right) = LP\left[x\left(t-\tau\right), x\left(t-\tau+1\right),\cdots, x\left(t\right)\right] + z\left(t\right)
\end{align}
However, there are both a technical problem and a logical problem in this line of thinking.
First, technically we need LP to be very accurate, in fact so accurate that

\begin{align}
\left|x_{internal}\left(t+1\right) - LP\left[x_{internal}\left(t+1\right)\right]\right| \ll \left|x_{real}\left(t+1\right) - LP\left[x_{internal}\left(t+1\right)\right]\right| \\
= \left|x_{real}\left(t+1\right) - x_{internal}\left(t+1\right) + x_{internal}\left(t+1\right) - LP\left[x_{internal}\left(t+1\right)\right]\right|
\end{align}
Since we are interested in \(x_{real}\left(t+1\right) - x_{internal}\left(t+1\right)\), we have to make sure that the difference between the real value and the prediction is far larger than the difference between the internal value and the prediction.

In practice, providing some data on this accuracy can be done and is very helpful, for example LP's accuracy on generic examples alongside \(\left|x_{real}\left(t+1\right) - LP\left[x_{internal}\left(t+1\right)\right]\right|\) for this specific example. The two have to be very different. If they are close, then what is supposed to be external might simply be due to the inaccuracy of \(LP\left[x_{internal}\left(t+1\right)\right]\).

It would be even better if one could provide a direct comparison between \(\left|x_{internal}\left(t+1\right) - LP\left[x_{internal}\left(t+1\right)\right]\right|\) and \(\left|x_{real}\left(t+1\right) - LP\left[x_{internal}\left(t+1\right)\right]\right|\). This is impossible, however, since there is no data on \(x_{internal}\left(t+1\right)\), only the observed \(x_{real}\left(t+1\right)\). In fact, one can only have \(\left|x_{real}\left(t+1\right) - LP\left[x_{real}\left(t+1\right)\right]\right|\), which is neither of the desired quantities above.

In this case, confidence in the accuracy of LP would rely entirely on previous experience. This is a logical problem: even when LP is accurate on all previous examples, there is no assurance that it will have similar accuracy on this example unless there is a theoretical proof of bounds. For most machine-learning LPs, however, there are no such theoretical bounds, let alone a proof.

Besides the technical problem, I think there is an even more serious logical problem.

Remember that we can only observe \(x_{real}\), but not \(z\) and not \(x_{internal}\). Therefore, we in fact have something like this,
\begin{align}
x_{real}\left(t-\tau+1\right) = LP\left[x_{real}\left(t-\tau\right)\right] + \hat{z}\left(t-\tau\right) \\
\cdots \\
x_{real}\left(t+1\right) = LP\left[x_{real}\left(t-\tau\right), x_{real}\left(t-\tau+1\right),\cdots, x_{real}\left(t\right)\right] + \hat{z}\left(t\right)
\end{align}
There is no guarantee that \(\hat{z}\) is close to, or in any way related to, the external factor \(z\).
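A toy case makes this concern concrete. The sketch below assumes, purely for illustration, that both \(f\) and \(g\) are linear, \(f[x]=a\,x\) and \(g[z]=b\,z\), with hypothetical coefficients; none of this comes from the paper. Under that assumption the observed series obeys the exact two-lag rule \(x(t+1)=(a+b)\,x(t)-ab\,x(t-1)\), so a predictor that sees only the observed \(x\) can reproduce the series with essentially zero residual, even though the external factor \(z\) is far from zero.

```python
def simulate(a, b, x0, z0, steps):
    """Iterate x(t+1) = a*x(t) + z(t) and z(t+1) = b*z(t)."""
    xs, zs = [x0], [z0]
    for _ in range(steps):
        xs.append(a * xs[-1] + zs[-1])
        zs.append(b * zs[-1])
    return xs, zs


# Hypothetical coefficients, chosen only for illustration.
a, b = 0.5, 0.9
steps = 30
xs, zs = simulate(a, b, x0=1.0, z0=1.0, steps=steps)


def predict_from_x(x_prev, x_curr):
    """Two-lag rule that uses observed x only; it never sees z."""
    return (a + b) * x_curr - a * b * x_prev


# Residual of the x-only predictor: this plays the role of z-hat.
residuals = [xs[t + 1] - predict_from_x(xs[t - 1], xs[t]) for t in range(1, steps)]
# True external contribution z(t) to x(t+1) over the same time steps.
external = [zs[t] for t in range(1, steps)]

max_residual = max(abs(r) for r in residuals)
max_external = max(abs(z) for z in external)
print(f"largest |z-hat| = {max_residual:.2e}, largest |z| = {max_external:.2f}")
```

In this toy case the residual is at the level of floating-point rounding while \(z\) is of order one, so reading the residual as "the external factor" would be badly wrong. Real data would of course not be this clean.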

So I conclude that, first, there is a serious technical issue, which can be alleviated to a certain degree by providing data confirming that \(\left|x_{real}\left(t-\tau+1\right) - LP\left[x_{real}\left(t-\tau\right)\right] \right|\) is much bigger than the usual residual of LP. However, even if this holds, there is a chance that LP is simply bad for this specific case, in which case a large inaccuracy is only natural. This is a very tricky problem and in principle impossible to avoid or solve. Second, there is a serious logical issue, since the difference between the real data and the predicted data may not be due to the desired external factors.

The bigphysics Wiki site

Mainly in order to organize our research work and to communicate and share better within the team, we built a research team website: Big Physics.

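The site runs on a MediaWiki backend. Categories organize content with parent-child relations, and hyperlinks organize content with more general relations. The former correspond to hierarchical links in a concept map, the latter to cross links or links within a level. In general, every page except the lowest-level ones is formatted as a category page. In a wiki, a category page is not just a container for organizing content; it can also carry content itself.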

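The categories so far are: literature discussion, research projects (actually research directions; specific projects sit one or two levels below), research results, and data. Categories may overlap. For example, a post sharing a paper belongs under the top-level category "literature discussion", and by content it also needs to go under the appropriate "research project". If it is a paper from our own team, it should also go under "research results".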

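Organized this way, both the big picture and the details of the research become easier to see, and communication and discussion are encouraged. After using it for the past few days, I find that it really does improve work efficiency considerably. For example, papers I have read can be filed right away into the proper logical framework, leaving a good record.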

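Team members need to learn a little MediaWiki syntax: section headings, hyperlinks to external pages, hyperlinks to internal pages, adding a page to its parent category, creating pages, creating categories, lists, mathematical formulas, and so on.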

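Incidentally, the whole research team's IT infrastructure consists of: ownCloud for cloud storage and sync, git for version control, WordPress for the blog network, MediaWiki for the Big Physics site, CmapServer for the concept map server, and oTree for the game-experiment server.
There is also a website for learning Chinese characters.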

What exactly went wrong with the student, or conversely, how should the advisor adjust the way of teaching

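A student ran several groups of comparison experiments and needed to see whether the results differ between groups. If the differences are large, listing the mean of some quantity for each group may be enough. In general, though, the variance should be listed as well, which makes the difference much easier to judge. If the data behind the experiments are normally distributed, then the mean and variance are more or less sufficient: the probability that two samples come from the same distribution can be computed from these means and variances. Statistical tests also offer methods that compute this probability directly from the two sets of numbers. If the two samples do not necessarily come from normal distributions, nonparametric tests should be considered.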
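As a rough illustration of comparing two groups through their means and variances, the sketch below summarizes each group by its mean and standard deviation and then computes a Welch-type t statistic from those summaries. The data are synthetic and the function names (`summarize`, `welch_t`) are hypothetical. The p-value uses a normal approximation, which is only reasonable for large samples such as the roughly 1000 points per group discussed in this post; R's `t.test` uses the t distribution instead.

```python
import math
import random
import statistics


def summarize(sample):
    """Return (n, mean, sample standard deviation)."""
    return len(sample), statistics.mean(sample), statistics.stdev(sample)


def welch_t(sample_x, sample_y):
    """Welch t statistic and a two-sided p-value (normal approximation)."""
    n_x, mean_x, sd_x = summarize(sample_x)
    n_y, mean_y, sd_y = summarize(sample_y)
    standard_error = math.sqrt(sd_x ** 2 / n_x + sd_y ** 2 / n_y)
    t = (mean_x - mean_y) / standard_error
    p_two_sided = 2.0 * (1.0 - statistics.NormalDist().cdf(abs(t)))
    return t, p_two_sided


# Synthetic data: two normal groups whose true means differ by 0.3.
random.seed(0)
group_a = [random.gauss(0.0, 1.0) for _ in range(1000)]
group_b = [random.gauss(0.3, 1.0) for _ in range(1000)]

# Report mean and spread together; a mean without its spread says little.
for name, group in (("A", group_a), ("B", group_b)):
    n, mean, sd = summarize(group)
    print(f"group {name}: n={n}, mean={mean:.3f}, sd={sd:.3f}")

t_stat, p_value = welch_t(group_a, group_b)
print(f"t = {t_stat:.2f}, approximate two-sided p = {p_value:.2g}")
```

The same means and standard deviations are what an error bar on a comparison plot would show.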

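In fact, R already implements all of these tests, specifically as the functions t.test, wilcox.test, and ks.test. However, ks works directly with the cumulative distribution function, and a larger cumulative distribution function means the sample is shifted further to the left (more probability at small values). Therefore the "greater" argument of ks.test actually means that the first sample x is smaller than the second sample y, which is exactly the opposite of t.test and wilcox.test.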
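The direction convention can be checked by hand with empirical cumulative distribution functions. The sketch below is plain Python, not R, and the two small samples and the names (`ecdf`, `one_sided_ks`) are made up for illustration. It computes the two one-sided Kolmogorov-Smirnov distances, D+ = max(F_x - F_y) and D- = max(F_y - F_x). To my understanding, D+ is the quantity behind the "greater" alternative in R's `ks.test`; the R documentation is the authority on that.

```python
def ecdf(sample, value):
    """Empirical CDF: fraction of the sample that is <= value."""
    return sum(1 for s in sample if s <= value) / len(sample)


def one_sided_ks(sample_x, sample_y):
    """Return (D_plus, D_minus) = (max(F_x - F_y), max(F_y - F_x))."""
    grid = sorted(set(sample_x) | set(sample_y))
    diffs = [ecdf(sample_x, v) - ecdf(sample_y, v) for v in grid]
    # Both distances are at least 0, because both CDFs are 0 far to the left.
    d_plus = max(0.0, *diffs)
    d_minus = max(0.0, *(-d for d in diffs))
    return d_plus, d_minus


# Made-up samples in which x is clearly the smaller one.
x = [1, 2, 3, 4, 5]
y = [3, 4, 5, 6, 7]

d_plus, d_minus = one_sided_ks(x, y)
print(f"mean(x) = {sum(x) / len(x)}, mean(y) = {sum(y) / len(y)}")
print(f"D+ = {d_plus:.2f}, D- = {d_minus:.2f}")
```

Here x has the smaller mean, yet it is D+, the "CDF of x lies above" distance, that is large. A one-sided KS test in the "greater" direction therefore lines up with a "less" alternative in a test that compares location.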

That is the background of the problem.

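The student's work was as follows. First, the comparison plots showed only means, with no variance (standard deviation). Second, the statistical tests showed that some groups differ, but further tests of whether one of these groups is larger than the other showed no difference. The sample size is already about 1000, so in principle it should not be a matter of insufficient samples.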

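The student gave me this analysis and said it was unclear how to draw the conclusions. First, of course, you write as conclusions whatever the research found; a negative result (no difference) is also meaningful and interesting. More important are the next two points. Second, I have never seen a statistical plot of experimental results without variance. For anyone who has done experiments, error is extremely important. Third, the test results above obviously contradict each other. Admittedly, I only told the student the names of these tests and did not teach what each test is or how it works. But R has its own documentation and example code, and those are much better than my explanation would be.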

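Because I noticed this contradiction, and because I strongly doubted that the sample was too small, I ran the tests myself. I found that t.test, wilcox.test, and ks.test give exactly the same conclusion, as long as "greater" is used in the first two and "less" in the last. So the root problem is that the student did not read the documentation and simply used the functions on assumption.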

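I do not quite understand how this happened. I hope the student can analyze this for me, to clear up my confusion and help me improve. I want to know the deeper reasons for not drawing error bars on experimental plots, not reading the documentation, and not noticing a self-contradiction. Of course, the student could take the lazy way out and answer: I am just stupid, I am just lazy. I hope for thinking that goes deeper than that answer, which would help with future improvement.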

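Going one step further: given that there are three possible methods, why not think about how the three are related and how they differ?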

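Only by understanding this can I guide students better. Of course, I never directly tell a student the answer to a question. I give the question, give clues for thinking, and break it into smaller questions, but I do not give the answer, whether or not I already know it. Is this way of advising wrong? Should I give answers directly? Giving the answer directly takes away the student's chance to think and grow. With a direction in hand, the student gathers information independently, makes some mistakes, resolves them, and then solves the problem; that experience is very important. Not every teacher gives students the time and opportunity to make mistakes. Telling students exactly what to do, the way one would instruct hired laborers, is the simplest approach, but then there is no process of exploration.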