Using CART for Stock Market Forecasting

There is an enormous body of literature, both academic and empirical, about market forecasting. Most of the time it mixes two market features: magnitude and direction. In this article I want to focus on identifying the market direction only. The goal I set myself is to identify market conditions when the odds are significantly biased toward an up or a down market. This post gives an example of how CART (Classification And Regression Trees) can be used in this context. Before I proceed, the usual reminder: what I present in this post is just a toy example and not an invitation to invest. It’s not a finished strategy either but a research idea that needs to be further researched, developed and tailored to individual needs.

1 – What is CART and why use it?

CART is a set of techniques for classification and prediction. The technique is aimed at producing rules that predict the value of an outcome (target) variable from known values of predictor (explanatory) variables. There are many different implementations but they all share a general characteristic, and that’s what I’m interested in. From Wikipedia: “Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Different algorithms use different metrics for measuring “best”. These generally measure the homogeneity of the target variable within the subsets. These metrics are applied to each candidate subset, and the resulting values are combined (e.g., averaged) to provide a measure of the quality of the split”.
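
To make the idea of a split metric concrete, here is a minimal Python sketch of one common choice, Gini impurity, combined across the two subsets as a size-weighted average. It is only an illustration of the quoted description, not the code behind this post; the function names and the toy labels are made up for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_quality(left, right):
    """Size-weighted average impurity of the two subsets produced by a split."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

# Hypothetical toy data: next-week labels on each side of a candidate split.
left = ["UP", "UP", "UP", "DOWN"]
right = ["DOWN", "DOWN", "UP", "DOWN"]
parent = left + right

# A useful split lowers the impurity relative to the unsplit parent node.
print(gini(parent), split_quality(left, right))
```

A tree-growing algorithm of this kind evaluates such a score for every candidate split and keeps the one with the lowest combined impurity.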

CART methodology exhibits some characteristics that are very well suited for market analysis:

  • Non-parametric: CART can handle any type of statistical distribution
  • Non-linear: CART can handle a large spectrum of dependencies between variables (i.e., it is not limited to linear relationships)
  • Robust to outliers

There are various R packages dealing with recursive partitioning. Here I use rpart for tree estimation and rpart.plot for tree drawing.

2 – Data & Experiment Design

The data consists of daily OHLC prices for the most liquid ETFs from January 2000 to December 2013, extracted from Google Finance. The in-sample period goes from January 2000 to December 2010; the rest of the dataset is the out-of-sample period. Before running any type of analysis the dataset has to be prepared for the task.

The target variable is the ETF weekly forward return, defined as a two-state outcome (UP or DOWN). If the weekly forward return > 0 then the market is in the UP state, otherwise it is in the DOWN state.
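
The labelling rule can be sketched in a few lines of Python. This is an illustration only, not the code behind this post: it assumes a "week" is 5 trading days measured close to close, which the post does not specify, and the function name and prices are hypothetical.

```python
def weekly_forward_labels(closes, horizon=5):
    """Label each day UP or DOWN from the close-to-close return `horizon` days ahead.

    Assumption: one week = 5 trading days. The last `horizon` days have no
    forward return yet, so they are labelled None.
    """
    labels = []
    for i, price in enumerate(closes):
        j = i + horizon
        if j >= len(closes):
            labels.append(None)
            continue
        forward_return = closes[j] / price - 1.0
        # Strictly positive return is UP; zero or negative is DOWN, as in the text.
        labels.append("UP" if forward_return > 0 else "DOWN")
    return labels

# Hypothetical closing prices.
closes = [100, 101, 99, 102, 103, 104, 100, 98]
print(weekly_forward_labels(closes))
```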

The explanatory variables are a set of technical indicators derived from the initial daily OHLC dataset. Each indicator represents a well-documented market behavior. In order to reduce the noise in the data and to try to identify robust relationships, each explanatory variable is encoded as a binary variable (1 or -1).

  • Volatility (VAR1): High volatility is usually associated with a down market and low volatility with an up market. Volatility is measured as the spread between the 20-day raw ATR (Average True Range) and its moving average (MA). If raw ATR > MA then VAR1 = 1, else VAR1 = -1.
  • Short term momentum (VAR2): The equity market exhibits short term momentum behavior, captured here by a 5-day simple moving average (SMA). If Price > SMA then VAR2 = 1, else VAR2 = -1.
  • Long term momentum (VAR3): The equity market exhibits long term momentum behavior, captured here by a 50-day simple moving average (LMA). If Price > LMA then VAR3 = 1, else VAR3 = -1.
  • Short term reversal (VAR4): This is captured by the CRTDR, which stands for Close Relative To Daily Range and is calculated as follows: CRTDR = (Close - Low) / (High - Low). If CRTDR > 0.5 then VAR4 = 1, else VAR4 = -1.
  • Autocorrelation regime (VAR5): The equity market tends to go through periods of negative and positive autocorrelation regimes. If the autocorrelation of returns over the last 5 days > 0 then VAR5 = 1, else VAR5 = -1.
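
As an illustration of the binary encoding, here is a minimal Python sketch of the momentum flags (VAR2, VAR3) and the CRTDR flag (VAR4). It is not the code behind this post. The helper names are hypothetical, the treatment of warm-up periods and zero-range days is an assumption, and VAR1 and VAR5 are left out because their exact construction is not fully specified above.

```python
def sma(values, window):
    """Simple moving average; None until `window` observations are available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out

def flag(value, threshold):
    """Binary encoding: 1 if value > threshold, else -1 (None if undefined)."""
    if value is None or threshold is None:
        return None
    return 1 if value > threshold else -1

def momentum_flags(closes, window):
    """VAR2 (window=5) or VAR3 (window=50): is the price above its SMA?"""
    return [flag(c, m) for c, m in zip(closes, sma(closes, window))]

def crtdr_flags(highs, lows, closes):
    """VAR4: is Close Relative To Daily Range above 0.5?"""
    flags = []
    for high, low, close in zip(highs, lows, closes):
        # Assumption: a day with High == Low has no defined CRTDR.
        if high == low:
            flags.append(None)
        else:
            flags.append(flag((close - low) / (high - low), 0.5))
    return flags

# Hypothetical prices, with a short window so the example stays small.
print(momentum_flags([10, 11, 12, 11, 13], window=3))
print(crtdr_flags([12, 12, 5], [10, 10, 5], [11.5, 10.5, 5]))
```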

Below is an example tree with some explanations.


In the tree above, the path to reach node #4 is: VAR3 >= 0 (Long Term Momentum indicator >= 0) and VAR4 >= 0 (CRTDR indicator >= 0, i.e., CRTDR > 0.5). The red rectangle indicates this is a DOWN leaf (i.e., terminal node) with a probability of 58% (1 – 0.42). In market terms this means that if Long Term Momentum is up and CRTDR is > 0.5 then the probability of a positive return next week is 42%, based on the in-sample data. The 18% indicates the proportion of the data set that falls into that terminal node (i.e., leaf).

There are many ways to use the above approach; I chose to estimate and combine all possible trees. From the in-sample data, I collect all leaves from all possible trees and gather them into a matrix. This is the “rules matrix”, giving the probability of next week being UP or DOWN.
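
One way to picture such a rules matrix is sketched below in Python. This is a stand-in under stated assumptions, not the procedure used in this post: since every predictor is binary, any leaf of any tree is a conjunction of conditions of the form variable = 1 or variable = -1, so the sketch simply enumerates those conjunctions up to a chosen depth instead of fitting trees. The names `build_rules`, `max_depth` and `min_share`, and the toy data, are hypothetical.

```python
from itertools import combinations, product

def build_rules(rows, labels, max_depth=2, min_share=0.05):
    """Enumerate conjunctions of (variable, value) conditions and record the
    in-sample probability of UP and the share of observations each one covers."""
    names = sorted(rows[0])
    rules = []
    for depth in range(1, max_depth + 1):
        for chosen in combinations(names, depth):
            for values in product((1, -1), repeat=depth):
                conditions = dict(zip(chosen, values))
                hits = [lab for row, lab in zip(rows, labels)
                        if all(row[k] == v for k, v in conditions.items())]
                share = len(hits) / len(rows)
                if not hits or share < min_share:
                    continue  # skip rules with too few observations
                p_up = hits.count("UP") / len(hits)
                # Assumption: an exact 50/50 rule is arbitrarily labelled DOWN.
                state = "UP" if p_up > 0.5 else "DOWN"
                rules.append({"conditions": conditions, "p_up": p_up,
                              "share": share, "state": state})
    return rules

# Hypothetical in-sample observations (two indicators only) and their labels.
rows = [{"VAR3": 1, "VAR4": 1}, {"VAR3": 1, "VAR4": 1}, {"VAR3": 1, "VAR4": -1},
        {"VAR3": -1, "VAR4": 1}, {"VAR3": -1, "VAR4": -1}, {"VAR3": 1, "VAR4": 1},
        {"VAR3": 1, "VAR4": 1}]
labels = ["DOWN", "DOWN", "UP", "UP", "UP", "UP", "DOWN"]

for rule in build_rules(rows, labels):
    print(rule)
```

Each row of the output plays the role of one leaf: the conditions that define it, the probability of an UP week, and the proportion of the in-sample data it covers.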

3 – Results

I apply the rules in the above matrix to the out-of-sample data (Jan 2011 – Dec 2013) and I compare the results to the real outcome. The problem with this approach is that a single point (week) can fall into several rules and even belong to UP and DOWN rules simultaneously. Therefore I apply a voting scheme. For a given week I sum up all the rules that apply to that week, giving +1 for an UP rule and -1 for a DOWN rule. If the sum is greater than 0 the week is classified as UP; if the sum is negative it’s a DOWN week; and if the sum is equal to 0 no position is taken that week (return = 0).
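
The voting scheme is simple enough to sketch in Python. This is an illustration, not the code behind this post; the rule layout (a dict of conditions plus an UP/DOWN state) and the example rules are hypothetical.

```python
def vote(week, rules):
    """Sum +1 for each matching UP rule and -1 for each matching DOWN rule,
    then return the position: 1 (UP week), -1 (DOWN week) or 0 (no position)."""
    score = 0
    for rule in rules:
        if all(week.get(name) == value for name, value in rule["conditions"].items()):
            score += 1 if rule["state"] == "UP" else -1
    return (score > 0) - (score < 0)

# Hypothetical rules of the kind collected in the rules matrix.
rules = [
    {"conditions": {"VAR3": 1, "VAR4": 1}, "state": "DOWN"},
    {"conditions": {"VAR1": -1}, "state": "UP"},
    {"conditions": {"VAR3": -1}, "state": "UP"},
]

print(vote({"VAR1": -1, "VAR3": -1, "VAR4": 1}, rules))  # two UP rules match
print(vote({"VAR1": -1, "VAR3": 1, "VAR4": 1}, rules))   # one UP and one DOWN rule match
print(vote({"VAR1": 1, "VAR3": 1, "VAR4": 1}, rules))    # only the DOWN rule matches
```

Multiplying the returned position by the realised return of the following week gives that week's strategy return, with 0 for weeks where the votes cancel out.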

The above methodology is applied to a set of very liquid ETFs. I plot below the out of sample equity curves along with the buy and hold strategy over the same period.


4 – Conclusion

Initial results seem encouraging even if the quality of the outcome varies greatly by instrument. However, there is huge room for improvement. Below are some directions for further analysis.

  • Path optimality: The algorithm used here for defining the trees is greedy: it picks the best split at each node but doesn’t guarantee the optimality of the whole path. Adding a metric to measure the optimality of the path might well improve the above results.
  • Other variables: I chose the explanatory variables solely based on experience. It’s very likely that this choice is neither good nor optimal.
  • Backtest methodology: I used a simple in-sample and out-of-sample methodology. In a more formal backtest I would rather use a rolling or expanding window of in-sample and out-of-sample sub-periods (i.e., walk-forward analysis).
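
For the last point, a rolling or expanding window scheme can be sketched in Python as follows. This is only an illustration of the general walk-forward idea; the function name, its parameters and the window sizes are hypothetical and would need to be chosen for the data at hand.

```python
def walk_forward_windows(n_obs, train_size, test_size, expanding=False):
    """Yield (train_range, test_range) index pairs for a walk-forward backtest.

    Rolling mode keeps the training window at `train_size` observations;
    expanding mode always starts the training window at observation 0.
    """
    start = 0
    while start + train_size + test_size <= n_obs:
        train_start = 0 if expanding else start
        train_end = start + train_size
        yield range(train_start, train_end), range(train_end, train_end + test_size)
        start += test_size  # step forward by one out-of-sample block

# Hypothetical sizes: 10 observations, train on 4, test on the next 2.
for train, test in walk_forward_windows(10, train_size=4, test_size=2):
    print(list(train), list(test))
```

The rules would be re-estimated on each training range and applied only to the test range that follows it, so every out-of-sample week is classified with rules built from earlier data.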

As usual, any comments welcome



  1. qusma says:

    Interesting stuff, I’ve looked at DTs in the past without great success. The difference in performance seems to be gigantic, what do you think is the cause? It looks like SPY & EEM are almost always in the market, while the other two are a bit more selective?

    Did you try using data from other instruments (i.e. trade QQQ using data from all 4 ETFs)?

    • The R Trader says:

      Thank you for your suggestions. The good performance of the CART strategy for SPY & EFA is explained by 2 factors:
      1 – The strategy is long most of the time over a period where those 2 ETFs went up strongly.
      2 – It avoids a few nasty weeks by being either short or not in the market.
      The exact opposite applies to QQQ. I tested the method with TLT and LQD, with excellent OOS performance, and with GLD and SLV, with very poor results. What I don’t like at this stage is the performance discrepancy between SPY & QQQ. I can’t think of any sensible reason why this is happening. Any idea?
      I like the idea of using trees generated with one instrument to trade another one. This might make the results a bit more stable across instruments.
      BTW: I really like your blog.

  2. Reza says:

    Would it be possible to publish the codes too? Thank you!

  3. Randy Clayton says:

    Thanks, nice post. I like the description of the logic behind your method. Understanding theory is very useful, I learned a few things reading your Data and Experiment Design section.

  4. Market Map says:

    Good article. We try to use decision logic with our risk profiles:

    2014 is neutral.

  5. Thomas Speidel says:

    You may want to be cautious before you make too much out of the results. CARTs are notoriously unstable and their admittedly attractive simplicity hides many problems. They can be useful as exploratory tools in messy situations or even to deal effectively with missing data. Bagging and random forests were developed in part to deal with the instability of CART.

    • The R Trader says:


      Thank you for your suggestions. I’m well aware of the limitations of CART, and this is why this article is designed to put together some sensible steps to analyse the market, not to present a trading strategy. Building a real trading strategy with CART will require more work, and yes, bagging and random forests are interesting tools in that context.

  6. André Tavares says:

    Thanks for sharing your ideas. Why don’t you apply a weighted voting scheme instead of the equal weight scheme? You could apply cross-validation techniques to “calibrate” the weight factors to the instrument you are modelling. It would be an interesting experiment. Will you share the code?

    • The R Trader says:

      Thanks for reaching out. What you suggest makes sense but my initial target was only to put together things that I had had in mind for a while. As I wrote in the conclusion of the post, there are countless ways to improve what I did.
      To Niklas/Reza/Andre: I’m afraid I’m not willing to share the code as I might use it professionally.

      Hope this helps

  7. Richard Warnung says:

    Hi, very nice post. Fits perfectly to my current interest in decision trees, random forests and these things.

    I am just wondering a bit about the decision tree above. Do I get this right: if long term momentum is off (negative indicator) then the market is classified “up”? This is counterintuitive.

    On the other hand if long term momentum is up and CRTDR > 0.5 (the price is closer to high than to low) then classification is “down” -> so there is reversion. ok … I can imagine reversion … on the other hand we have long term momentum (=trend?).

    The other decision is (more or less): if volatility is low then we classify “up”. This is intuitive.

    I wonder how significant the auto-correlation numbers of your assets are. The power of the corresponding variable will depend on that.

    Lastly: could you provide varimp plots?

    • The R Trader says:

      Hi Richard,

      Thank you for reaching out and sorry for the late reply.
      * Regarding your first question, you’re right: if Long Term Momentum is down then there is a 57% chance (based on sample data) that the following week is UP. This is counterintuitive over the long run but not over the short term (one week ahead). This node just captures the short term reversion effect.
      * The same logic applies to your second point: the short term reversion dominates. Having said that this tree is here for illustration purposes only, I didn’t spend too much time checking the rules.
      * I didn’t check the significance of auto-correlation and I guess you’re right it’ll have an impact on the corresponding variables.
      * Regarding the variable importance, as I didn’t use RandomForest it would be hard to provide the varImpPlot. The reason for that is that I’m not so much interested in the variables themselves but in the rules that the combination of variables creates. However I would be keen to see what a RandomForest implementation produces.

      Hope this helps

      • Richard Warnung says:

        Hi R Trader, thanks for your response (I just checked today for the first time in some days). Concerning the rules: I checked them because a decision tree must not be a black box. It should help to put some order onto intuition but it should not be against intuition. At least we have to question the rules.

        If it were a black box then I wonder whether it would be robust to out-of-sample periods.

        Anyways: interesting post!

  8. If your decision tree outputs a categorical variable, “UP” or “DOWN”, how did you produce the forecasts in the second figure? I tried to reproduce your technique in Python using the GOOG historical data and a DecisionTreeClassifier, but I had difficulties. Surely you have code on GitHub or somewhere, right?

    • The R Trader says:


      Sorry for the late reply. As mentioned to other readers, I’m not sharing the code of this post as I might use some of it professionally. However I can help you to set this up in Python. Please email me privately.

  9. Régis says:

    the tree above, the path to reach node #4 is: VAR3 >=0 (Long Term Momentum >= 0) and VAR4 >= 0 (CRTDR >= 0). The red rectangle indicates this is a DOWN leaf (e.g., terminal node) with a probability of 58% (1 – 0.42). In market terms this means that if Long Term Momentum is Up and CRTDR is > 0.5 then the probability of a positive return next week is 42% based on the in sample sample data

    Is it not the contrary, 58% to be positive and 42% to be negative according to the tree path?

    • The R Trader says:


      Thank you for reaching out. I think the explanation is exactly as I put it in the post. There is a 42% probability of being up next week.
      All leaves are expressed in the same unit: “probability of being up next week”. A 42% probability means you want to go short.
      I hope this helps.

      The R Trader

  11. Tu Doan says:

    It is great. I want to know how to built this model in R.
