# Data Mining

The good quant trading models reveal the nature of the market; the bad ones are merely statistical artifacts.

One of the most popular ways to create a spurious trading model is data snooping, also known as data mining. Suppose we want to create a model to trade AAPL daily. We download some data, e.g., 100 days of AAPL prices, from Yahoo. If we work hard enough with the data, we will find a curve (model) that explains the data very well. For example, the following curve fits the data perfectly.

Suppose the prices are $$\{ x_1, x_2, \dots, x_n \}$$

$$\frac{(t-2)(t-3)\dots(t-n)}{(1-2)(1-3)\dots(1-n)}x_1 + \frac{(t-1)(t-3)\dots(t-n)}{(2-1)(2-3)\dots(2-n)}x_2 + \dots + \frac{(t-1)(t-2)\dots(t-n+1)}{(n-1)(n-2)\dots(n-n+1)}x_n$$
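
To see how hollow this perfect fit is, the interpolating polynomial above can be sketched in a few lines of Python. This is only an illustration: the function name `lagrange_fit`, the random seed, and the 10 synthetic prices (not real AAPL data) are all assumptions made for the example.

```python
import numpy as np

def lagrange_fit(prices):
    """Return p(t), the polynomial passing through (1, x_1), ..., (n, x_n)."""
    x = np.asarray(prices, dtype=float)
    n = len(x)
    nodes = np.arange(1, n + 1, dtype=float)

    def p(t):
        total = 0.0
        for i in range(n):
            others = np.delete(nodes, i)
            # i-th basis term: equals 1 at node i and 0 at every other node
            basis = np.prod((t - others) / (nodes[i] - others))
            total += basis * x[i]
        return total

    return p

rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(size=10))  # synthetic random-walk "prices"

p = lagrange_fit(prices)
in_sample_error = max(abs(p(t) - prices[t - 1]) for t in range(1, 11))
print("largest in-sample error:", in_sample_error)
print("'forecasts' for days 11 and 12:", p(11), p(12))
```

The curve reproduces every observed price, yet it was fitted to a random walk, so the fit carries no information about the days that follow.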

Of course, most of us are judicious enough to avoid this obviously over-fitted formula. Unfortunately, some may fall into the same trap in disguise. Let’s say we want to understand what factors contribute to AAPL price movements or returns. (From 100 prices we now have 99 returns.) We come up with a list of 99 possible factors, such as P/E, capitalization, dividends, etc. One very popular method for finding significant factors is linear regression. So, we have

$$r_t = \alpha + \beta_1f_{1t} + \dots + \beta_{99}f_{99t} + \epsilon_t$$

Guess how well this fits? The goodness of fit (R-squared) turns out to be 100%: a perfect fit! Yet this regression is complete nonsense. With an intercept and 99 coefficients, it has 100 free parameters to fit only 99 observations, so it can reproduce any set of 99 returns exactly. Even if we throw in random values for those 99 factors, we will also end up with a perfect fit. Consequently, the coefficients and t-stats mean nothing.
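
The claim about random factors is easy to check numerically. The sketch below regresses 99 synthetic returns on 99 columns of pure noise plus an intercept; the seed, the return volatility, and the variable names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_factors = 99, 99

returns = rng.normal(scale=0.02, size=n_obs)    # synthetic returns
factors = rng.normal(size=(n_obs, n_factors))   # pure-noise "factors"

# Design matrix: intercept column plus 99 factors -> 99 rows, 100 columns
X = np.column_stack([np.ones(n_obs), factors])
beta, *_ = np.linalg.lstsq(X, returns, rcond=None)

residuals = returns - X @ beta
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((returns - returns.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print("R-squared:", r_squared)
```

The R-squared comes out at 1 up to floating-point rounding, even though the factors have nothing to do with the returns.
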

Could we do a “smaller” regression on a small subset of factors, e.g., one factor at a time, and hope to identify the most significant factor? This step-wise regression turns out to be spurious as well. For a large enough pool of factors, there is a high probability of finding significant factors even when the factor values are randomly generated.
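
A small simulation illustrates the point. It regresses synthetic returns on one random factor at a time and counts how many look significant. The pool of 1000 factors, the seed, and the critical value of about 1.985 (two-sided 5% level with 97 degrees of freedom) are assumptions chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(7)
n_obs, n_factors = 99, 1000

returns = rng.normal(scale=0.02, size=n_obs)    # synthetic returns
factors = rng.normal(size=(n_obs, n_factors))   # pure-noise "factors"

# In a one-factor regression, the slope's t-stat follows from the
# sample correlation r: t = r * sqrt((n - 2) / (1 - r^2)).
fc = factors - factors.mean(axis=0)
rc = returns - returns.mean()
corr = (fc.T @ rc) / (np.linalg.norm(fc, axis=0) * np.linalg.norm(rc))
t_stats = corr * np.sqrt((n_obs - 2) / (1.0 - corr ** 2))

critical_t = 1.985  # approximate two-sided 5% critical value, 97 d.o.f.
n_significant = int(np.sum(np.abs(t_stats) > critical_t))
best_factor = int(np.argmax(np.abs(t_stats)))

print("factors that look significant:", n_significant)
print("largest |t|:", abs(t_stats[best_factor]), "from factor", best_factor)
```

By construction, roughly 5% of the noise factors are expected to clear the threshold, and the "best" one can look impressive if it is reported in isolation.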

Suppose we happen to regress returns on capitalization alone and find that this factor is significant. Even so, we may in fact be doing some form of data snooping. This is because there are thousands of other people testing the same or different factors using the same data set, i.e., AAPL prices from Yahoo. This community, taken as a whole, is doing exactly the step-wise regression described in the previous paragraph. In summary, empirical evidence alone is not sufficient to justify a trading model.
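
To put a rough number on the community effect, assume (purely for illustration) that $$m$$ researchers each test one useless factor at significance level $$\alpha$$, and that their tests are independent. Independence will not hold exactly when everyone uses the same data set, so this is only a back-of-the-envelope figure. The probability that at least one of them finds a "significant" factor is

$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m$$

With $$\alpha = 0.05$$ and $$m = 100$$, this gives $$1 - 0.95^{100} \approx 0.994$$, so a spurious discovery somewhere in the community is almost certain.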

To avoid data snooping when designing a trading strategy, Numerical Method Inc. recommends a four-step procedure to our clients.

1. Hypothesis: we start with an insight, a theory, or common sense about how the market works.
2. Modeling: translate the insight in English into mathematics (in Greek).
3. Application: in-sample calibration and out-of-sample backtesting.
4. Analysis: understand and explain the winning vs. losing trades.
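
Step 3 can be sketched as follows. This is a minimal illustration, not the procedure used by Numerical Method Inc.: the toy trend-following rule, the function name `strategy_pnl`, the candidate lookbacks, the 70/30 split, and the synthetic returns are all hypothetical choices made for the example.

```python
import numpy as np

def strategy_pnl(returns, lookback):
    """Daily P&L of a toy rule: hold the sign of the trailing mean return.

    The position for day t uses returns up to day t-1 only.
    """
    r = np.asarray(returns, dtype=float)
    pnl = np.zeros(len(r))
    for t in range(lookback, len(r)):
        signal = np.sign(r[t - lookback:t].mean())
        pnl[t] = signal * r[t]
    return pnl

rng = np.random.default_rng(1)
returns = rng.normal(scale=0.01, size=1000)  # synthetic returns, no real trend

# Chronological split: the later data is never seen during calibration
split = int(0.7 * len(returns))
in_sample, out_sample = returns[:split], returns[split:]

# Calibration: pick the lookback using the in-sample period only
candidates = [5, 10, 20, 50]
best = max(candidates, key=lambda k: strategy_pnl(in_sample, k).sum())

in_pnl = strategy_pnl(in_sample, best).sum()
out_pnl = strategy_pnl(out_sample, best).sum()
print("calibrated lookback:", best)
print("in-sample P&L:", in_pnl, "out-of-sample P&L:", out_pnl)
```

The in-sample P&L is flattered by the parameter search, so only the out-of-sample figure says anything about whether the hypothesis holds up.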

In steps 1 and 2, we explicitly write down the model assumptions, deduce the model properties, and compute the P&L distribution. We prove that, under those assumptions, the strategy makes money on average. Whether these assumptions are true can be verified against data using techniques such as hypothesis testing. Given the model parameters, we know exactly how much money we expect to make. This is all done before we even look at a particular data set. In other words, we avoid data snooping by not using the data set until the calibration step, after we have already created the trading model.

An example of creating a trend following strategy using this procedure can be found in lecture 1 of the course “Introduction to Algorithmic Trading Strategies”.