When testing trading strategies, a common approach is to divide the initial data set into two parts: the **in sample** data, used to calibrate the model, and the **out of sample** data, used to validate the calibration and to check that the performance obtained in sample will carry over to the real world. As a rule of thumb, around 70% of the initial data can be used for calibration (i.e. in sample) and 30% for validation (i.e. out of sample). A comparison of the in and out of sample results then helps to decide whether the model is robust enough. This post goes a step further and proposes a statistical method to decide whether the out of sample performance is in line with what was obtained in sample.
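As a minimal sketch of such a split (the variable names `returns`, `isData` and `osData` are illustrative, not taken from the post's data):

```r
# 70/30 in/out-of-sample split on a vector of daily strategy returns
set.seed(123)
returns <- rnorm(1000, mean = 0.0005, sd = 0.01)  # dummy daily returns

cutoff <- floor(0.7 * length(returns))
isData <- returns[1:cutoff]                      # first 70%: calibration
osData <- returns[(cutoff + 1):length(returns)]  # last 30%: validation
```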

In the chart below the blue area represents the out of sample performance for one of my strategies.

A simple visual inspection reveals a good fit between the in and out of sample performance, but what degree of confidence do I have in this? At this stage not much, and that is the issue. What is truly needed is a measure of similarity between the in and out of sample data sets. In statistical terms, this translates into the likelihood that the in and out of sample performance figures come from the same distribution. There is a non-parametric statistical test that does exactly this: the **Kruskal-Wallis test**. A good definition of this test can be found on R-Tutor: “A collection of data samples are independent if they come from unrelated populations and the samples do not affect each other. Using the Kruskal-Wallis Test, we can decide whether the population distributions are identical without assuming them to follow the normal distribution.” The added benefit of this test is that it does not assume a normal distribution.
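As a quick illustration (on simulated data, not the strategy's returns), R's built-in `kruskal.test` can be applied directly to two return samples:

```r
# Kruskal-Wallis test on two samples drawn from the same distribution;
# a large p-value means the test finds no evidence that they differ.
set.seed(42)
isData <- rnorm(700, mean = 0.0005, sd = 0.01)
osData <- rnorm(300, mean = 0.0005, sd = 0.01)

res <- kruskal.test(list(isData, osData))
res$p.value
```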

There exist other tests of the same nature that could fit into this framework. The **Mann-Whitney-Wilcoxon** test or the **Kolmogorov-Smirnov** test would suit it perfectly, but discussing the pros and cons of each of these tests is beyond the scope of this article. A good description along with R examples can be found here.
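For completeness, both alternatives are available in base R as `wilcox.test` and `ks.test`; a hedged sketch on simulated samples:

```r
# The two alternative tests applied to simulated return samples
set.seed(7)
isData <- rnorm(700, mean = 0.0005, sd = 0.01)
osData <- rnorm(300, mean = 0.0005, sd = 0.01)

wilcox.test(isData, osData)$p.value  # Mann-Whitney-Wilcoxon (rank sum)
ks.test(isData, osData)$p.value      # Kolmogorov-Smirnov (distribution shape)
```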

Here’s the code used to generate the chart above and the analysis:

```r
################################################
## Making the most of the OOS data
## thertrader@gmail.com - Aug. 2016
################################################
library(xts)
library(PerformanceAnalytics)

thePath <- "myPath" # change this
theFile <- "data.csv"
data <- read.csv(paste0(thePath, theFile), header = TRUE, sep = ",")
data <- xts(data[, 2], order.by = as.Date(as.character(data[, 1]), format = "%d/%m/%Y"))

##----- Strategy's chart
par(mex = 0.8, cex = 1)
thePeriod <- c("2012-02/2016-05")
chart.TimeSeries(cumsum(data),
                 main = "System 1",
                 ylab = "",
                 period.areas = thePeriod,
                 grid.color = "lightgray",
                 period.color = "slategray1")

##----- In/out-of-sample split (assumed: the shaded period is out of sample)
osData <- as.numeric(data[thePeriod])
isData <- as.numeric(data[!index(data) %in% index(data[thePeriod])])

##----- Kruskal tests: 1000 resampled comparisons
pValue <- NULL
for (i in 1:1000) {
  isSample <- sample(isData, length(osData))
  pValue <- rbind(pValue, kruskal.test(list(osData, isSample))$p.value)
}

##----- Mean of p-values
mean(pValue)
```

In the example above the in sample period is longer than the out of sample period, so I randomly created 1000 subsets of the in sample data, each of them having the same length as the out of sample data. I then tested each in sample subset against the out of sample data and recorded the p-values. This process yields not a single p-value for the Kruskal-Wallis test but a distribution of p-values, which makes the analysis more robust. In this example the mean of the p-values is well above zero (0.478), so the null hypothesis cannot be rejected: there is strong evidence that the in and out of sample data come from the same distribution.
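The resampling loop can be condensed with `replicate`; this self-contained sketch runs on simulated data, so the resulting mean will differ from the 0.478 reported above:

```r
# Resampling procedure on simulated data (isData/osData are stand-ins
# for the real in- and out-of-sample series)
set.seed(2016)
isData <- rnorm(800, mean = 0.0005, sd = 0.01)  # longer in-sample series
osData <- rnorm(300, mean = 0.0005, sd = 0.01)  # shorter out-of-sample series

pValue <- replicate(1000, {
  isSample <- sample(isData, length(osData))
  kruskal.test(list(osData, isSample))$p.value
})
mean(pValue)  # well above zero when both samples share a distribution
```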

As usual, what is presented in this post is a toy example that only scratches the surface of the problem and should be tailored to individual needs. However, I think it proposes an interesting and rational statistical framework for evaluating out of sample results.

This post is inspired by the following two papers:

Great post! Wouldn’t this test, applied to the underlying assets that the strategy trades, be better for assessing robustness? The fact that the strategy returns aren’t from different processes is good: the strategy was robust to any process change in the underlying during the sample periods. How impressive this is depends on whether the underlying process changed, which is what the test on the underlying would tell you.

It occurs to me this could be used to categorise regimes: if two samples (with resampling) come from significantly different processes, it implies that a regime shift occurred. If many assets showed the same result for the period, this would be compelling, especially if you saw a bifurcation in the performance of different strategy types (e.g. trend following and mean reversion) coinciding with the change. Identifying predictors of these changes would be the ultimate goal.

Question: if you start with adjacent periods a and b, each of say 2000 observations, and resample from them as you did to generate your p-value distribution, then shift a and b forward with a rolling window, would the evolution of the p-value distribution over time be meaningful? It seems like a trough in the distribution could indicate when you’ve isolated the process change point.

Hi,

Thank you for reaching out.

As you mentioned this methodology could be used in many, many ways and regime identification is one of them.

Regarding your question, if I understand correctly you want to shift the in and out of sample periods and then resample them at every iteration. In this situation, if your sample is large enough, I don’t see why the p-value would be unstable. It’s really just a variation of the method described in the post.
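A rough sketch of that idea, on simulated data with a deliberate level shift half-way through (all names and parameters are illustrative, not a tested implementation):

```r
# Rolling-window version: slide two adjacent windows through a series
# and record the mean resampled p-value at each step
set.seed(99)
x <- c(rnorm(1500, mean = 0,     sd = 0.01),
       rnorm(1500, mean = 0.003, sd = 0.01))  # process change at t = 1500

win    <- 500
starts <- seq(1, length(x) - 2 * win + 1, by = 250)

meanP <- sapply(starts, function(s) {
  a <- x[s:(s + win - 1)]              # "in sample" window
  b <- x[(s + win):(s + 2 * win - 1)]  # "out of sample" window
  mean(replicate(200, kruskal.test(list(b, sample(a, length(b))))$p.value))
})
# A trough in meanP is expected where the windows straddle the change point.
```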

As mentioned in the post, there are other tests that could suit this framework. They all have their pros and cons, and if I were you I would have a closer look at them. In some situations the choice of test can significantly affect the result.

Hope this helps.

Arnaud

Great post, and this actually sounds like a great suggestion. Although in the author’s example he finds no difference between the two, I suspect that is less common than the opposite case. Knowing whether the difference is due to substantially different strategy behaviour (raising the question of overfitting) or to the fact that the underlying distribution has itself changed is the key to whether the strategy needs more work or is better left alone.

I have a question about resampling: although not mentioned in the post, am I correct in assuming that the 1000 drawn samples are contiguous?

Hi Karen,

To answer your question (if this is your question) the 1000 in-sample subsets are not contiguous. They are simply randomly selected trading days from the in-sample period. Have a look at the sample function in R for more details.
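A toy call makes the point (the values here are just stand-ins for trading days):

```r
# sample() draws elements at random, so the subset is not contiguous
set.seed(5)
isData <- 1:20     # stand-in for the in-sample observations
sample(isData, 8)  # 8 randomly chosen, generally non-adjacent, elements
```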

HTH

Arnaud