internal factors – 吴金闪的工作和思考 (Jinshan Wu's Work and Thoughts)

Recently I came across a paper that uses machine learning to distinguish the internal and external factors of a dynamical process, using only time-series data \(x\left(t-\tau\right), x\left(t-\tau+1\right),\cdots, x\left(t\right)\).

In fact, in the paper only a few values, \(x\left(t-2\right), x\left(t-1\right), x\left(t\right) \), are known. They are used to predict \(x\left(t+1\right)\), and the goal is to determine how much of this \(x\left(t+1\right)\) is due to internal factors, giving \(x_{i}\left(t+1\right)\), and how much is due to external factors, giving the additional \(x_e\left(t+1\right)\), such that the observed \(x\left(t+1\right)\) is explained by combining the two, i.e., \(x\left(t+1\right)=x_{i}\left(t+1\right)+x_{e}\left(t+1\right)\). I have serious doubts about this idea and its working logic, which I will try to explain here.

Assume that \(x\left(t\right)\) is indeed governed by the following dynamics,

\begin{align}
x\left(t+1\right) = f\left[x\left(t\right)\right] + z\left(t\right) \\
z\left(t+1\right) = g\left[z\left(t\right)\right]
\end{align}
Here, however, \(f\) and \(g\) are unknown and \(z\) is not observed at all. Not being observed means that we cannot make use of it in constructing our predictive model. We regard \(f\left[x\left(t\right)\right] \) as the internal factor and \(z\left(t\right)\) as the external factor. Only the full \(x\), that is \(x_{real}\), is known; \(z\) is not.
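To make the setup concrete, here is a minimal sketch that iterates the two equations above. The particular \(f\) and \(g\) (a linear contraction and a sign flip) are made-up choices for illustration only; in the actual problem both are unknown, and only `xs` would be visible to the modeller.

```python
def simulate(f, g, x0, z0, steps):
    """Iterate x(t+1) = f[x(t)] + z(t) and z(t+1) = g[z(t)].

    Returns the observed series xs and the hidden series zs.
    """
    xs, zs = [x0], [z0]
    for _ in range(steps):
        x_next = f(xs[-1]) + zs[-1]   # internal part plus external part
        z_next = g(zs[-1])            # the external factor has its own dynamics
        xs.append(x_next)
        zs.append(z_next)
    return xs, zs


# Hypothetical choices of f and g, not taken from the paper.
def f(x):
    return 0.5 * x


def g(z):
    return -z


xs, zs = simulate(f, g, x0=1.0, z0=1.0, steps=3)
internal_parts = [f(x) for x in xs[:-1]]   # f[x(t)], the internal contribution to x(t+1)
```

In this toy run the split into `internal_parts` and `zs` is available only because we wrote down \(f\) and \(g\) ourselves.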

Now our question becomes how to figure out \(f\left[x\left(t\right)\right] \) and \(z\left(t\right)\), the two contributions to \(x\left(t+1\right)\), starting from a time series of \(x\left(t\right)\).
Assume there is a perfect algorithm that generates \(x\left(t+1\right)\) from a time series of \(x\left(t\right)\) when the dynamics of \(x\left(t\right)\) is fully internal. Let us call this algorithm LP, just for fun. Would this help us solve the problem defined above?
What if we blindly apply LP to the observed data: \(x\left(t-\tau\right), x\left(t-\tau+1\right),\cdots, x\left(t\right)\)? In principle, we might get
\begin{align}
x\left(t-\tau+1\right) = LP\left[x\left(t-\tau\right)\right] + z\left(t-\tau\right) \\
\cdots \\
x\left(t+1\right) = LP\left[x\left(t-\tau\right), x\left(t-\tau+1\right),\cdots, x\left(t\right)\right] + z\left(t\right)
\end{align}
However, there is a technical problem and a logical problem in this line of thinking.
First, technically we need LP to be very accurate, in fact so accurate that

\begin{align}
\left|x_{internal}\left(t+1\right) - LP\left[x_{internal}\left(t+1\right)\right]\right| \ll \left|x_{real}\left(t+1\right) - LP\left[x_{internal}\left(t+1\right)\right]\right| \\
= \left|x_{real}\left(t+1\right) - x_{internal}\left(t+1\right) + x_{internal}\left(t+1\right) - LP\left[x_{internal}\left(t+1\right)\right]\right|
\end{align}
Since we are interested in \(x_{real}\left(t+1\right) - x_{internal}\left(t+1\right)\), we have to make sure that the difference between the real value and the predicted value is far larger than the difference between the internal value and the predicted value.
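One way to spell this requirement out is the following, where \(r\) and \(\epsilon\) are shorthand introduced here for illustration (they are not notation from the paper) and \(x_e = x_{real} - x_{internal}\) is the external part defined at the beginning:
\begin{align}
r\left(t+1\right) &\equiv x_{real}\left(t+1\right) - LP\left[x_{internal}\left(t+1\right)\right], \\
\epsilon\left(t+1\right) &\equiv x_{internal}\left(t+1\right) - LP\left[x_{internal}\left(t+1\right)\right], \\
r\left(t+1\right) - x_{e}\left(t+1\right) &= \epsilon\left(t+1\right).
\end{align}
So reading the residual \(r\) as the external part \(x_e\) is wrong by exactly the prediction error \(\epsilon\), and the reading is meaningful only when \(\left|\epsilon\right| \ll \left|r\right|\).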

In practice, providing some data on this accuracy is possible and very helpful: for example, LP's accuracy on generic examples, together with \(\left|x_{real}\left(t+1\right) - LP\left[x_{internal}\left(t+1\right)\right]\right|\) for this specific example. The two have to be very different. If they are close, then what is supposed to be external might simply be due to the inaccuracy of \(LP\left[x_{internal}\left(t+1\right)\right]\).
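As a rough sketch of such a check, one could compare the mean absolute residual on the case under study with LP's mean absolute error on reference cases. The function name and all numbers below are invented for illustration, and how large a ratio counts as "very different" is left open here.

```python
def residual_to_baseline_ratio(residuals, baseline_errors):
    """Mean |residual| on the case under study divided by
    mean |error| of LP on reference (generic) examples."""
    mean_residual = sum(abs(r) for r in residuals) / len(residuals)
    mean_baseline = sum(abs(e) for e in baseline_errors) / len(baseline_errors)
    return mean_residual / mean_baseline


# Made-up numbers, for illustration only.
baseline_errors = [0.01, -0.02, 0.015, -0.005]   # LP errors on generic examples
residuals = [0.5, -0.4, 0.6, 0.5]                # x_real - LP on this example

ratio = residual_to_baseline_ratio(residuals, baseline_errors)
```

A ratio close to 1 would suggest the supposed external part is indistinguishable from LP's ordinary error. A large ratio is necessary for the interpretation but does not establish it.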

It would be even better if one could provide a direct comparison between \(\left|x_{internal}\left(t+1\right) - LP\left[x_{internal}\left(t+1\right)\right]\right|\) and \(\left|x_{real}\left(t+1\right) - LP\left[x_{internal}\left(t+1\right)\right]\right|\). This is, however, impossible, since there is no data on \(x_{internal}\left(t+1\right)\), only the observed \(x_{real}\left(t+1\right)\). In fact, one can only have \(\left|x_{real}\left(t+1\right) - LP\left[x_{real}\left(t+1\right)\right]\right|\), which is neither of the two desired quantities above.

In this case, our confidence in the accuracy of LP would rely entirely on previous experience. This is a logical problem: even when LP is accurate on all previous examples, it is not certain that it will have similar accuracy on this example unless there are theoretically proven bounds. However, for most machine-learning LPs, there are no such theoretical bounds, let alone a proof.

Besides the technical problem, I think there is an even more serious logical problem.

Remember that we can only observe \(x_{real}\), not \(z\) and not \(x_{internal}\). Therefore, what we in fact have is something like this,
\begin{align}
x_{real}\left(t-\tau+1\right) = LP\left[x_{real}\left(t-\tau\right)\right] + \hat{z}\left(t-\tau\right) \\
\cdots \\
x_{real}\left(t+1\right) = LP\left[x_{real}\left(t-\tau\right), x_{real}\left(t-\tau+1\right),\cdots, x_{real}\left(t\right)\right] + \hat{z}\left(t\right)
\end{align}
There is no guarantee that \(\hat{z}\) is anywhere close to, or in any way related to, the external factor \(z\).
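A toy example makes this point concrete. In the sketch below, everything is an assumption chosen for illustration: the internal rule is \(f(x) = 0.5x\), the hidden external factor is the constant \(z = 1\), and "LP" is replaced by a plain least-squares affine one-step predictor fitted on the observed series. It is not the algorithm of the paper.

```python
# Hypothetical setup: f(x) = 0.5 * x, hidden external factor z(t) = 1 for all t.
def f(x):
    return 0.5 * x


z_true = [1.0] * 20
x_real = [0.0]
for z in z_true:
    x_real.append(f(x_real[-1]) + z)   # x(t+1) = f[x(t)] + z(t)


def fit_affine(series):
    """Least-squares fit of x(t+1) ~ a * x(t) + b on a single series."""
    prev, nxt = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_next = sum(nxt) / n
    var = sum((p - mean_prev) ** 2 for p in prev)
    cov = sum((p - mean_prev) * (q - mean_next) for p, q in zip(prev, nxt))
    a = cov / var
    b = mean_next - a * mean_prev
    return a, b


# Stand-in for LP: it sees only x_real, never z.
a, b = fit_affine(x_real)

# z_hat is the residual between the observation and the stand-in prediction.
z_hat = [q - (a * p + b) for p, q in zip(x_real[:-1], x_real[1:])]
max_abs_z_hat = max(abs(v) for v in z_hat)
```

Here the fitted predictor absorbs the constant external factor into its intercept `b`, so the residual `z_hat` is essentially zero at every step even though the true \(z\) equals 1 throughout. The residual says nothing about the external factor in this case.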

So I conclude that, first, there is a serious technical issue, which can be alleviated to a certain degree by providing data confirming that \(\left|x_{real}\left(t-\tau+1\right) - LP\left[x_{real}\left(t-\tau\right)\right] \right|\) is much bigger than the usual residual of LP. However, even when this happens, there is a chance that LP is simply bad for this specific case, and then the large inaccuracy above is only natural. This is a very tricky problem and in principle impossible to avoid or solve. Second, there is a serious logical issue, since the difference between the real data and the predicted data may not be due to the desired external factors.

Tags: internal factors

Distinguish internal and external factors using machine learning