The Rise of the Robots (Advisors…)

The Asset Management industry is on the verge of a major change. Over the last couple of years Robo-Advisors (RA) have emerged as new players. The term itself is hard to define as it encompasses a large variety of services. Some are designed to help traditional advisers allocate their clients' money better and some are real "black boxes": the user enters a few criteria (age, income, children, etc.) and the robot proposes a tailor-made allocation. Between those two extremes a full range of offers is available. I find the Wikipedia definition pretty good: "They are a class of financial adviser that provides portfolio management online with minimal human intervention". More precisely, they use algorithm-based portfolio management to offer the full spectrum of services a traditional adviser would offer: dividend reinvesting, compliance reports, portfolio rebalancing, tax loss harvesting, etc. (which is what the quantitative investment community has been doing for decades!). The industry is still in its infancy, with most players still managing a small amount of money, but I only realised how profound the change was when I was in NYC a few days ago. When RA get their names on TV ads or on the roof of NYC cabs, you know something big is happening…


It is getting more and more attention from the media and, above all, it makes a lot of sense from an investor's perspective. There are two main advantages in using RA:

  • Significantly lower fees than traditional advisers
  • Investing is made more transparent and simpler, which is more appealing to people with limited financial knowledge

In this post R is just an excuse to present nicely what is a major trend in the asset management industry. The chart below shows the market shares of the most popular RA as of the end of 2014. The code used to generate the chart can be found at the end of this post and the data is here.


Those figures are a bit dated given how fast this industry evolves, but they are still very informative. Not surprisingly, the market is dominated by US providers like Wealthfront and Betterment, but RA are emerging all over the world: Asia (8Now!), Switzerland (InvestGlass), France (Marie Quantier)… This is starting to significantly affect the way traditional asset managers do business. A prominent example is the partnership between Fidelity and Betterment. Since December 2014, Betterment has passed the $2 billion AUM mark.

Despite all the above, I think the real change is ahead of us. Because they use fewer intermediaries and low-commission products (like ETFs), RA charge much lower fees than traditional advisers. They will certainly gain significant market share, but they will also lower the fees charged by the industry as a whole. Ultimately this will affect the way traditional investment firms do business. Active portfolio management, which has been having a tough time for some years now, will suffer even more: the high fees it charges will be even harder to justify unless it reinvents itself. Another potential impact is the rise of ETFs and low-commission financial products in general. Obviously this started a while ago, but I think the effect will be even more pronounced in the coming years as new generations of ETFs track more complex indices and custom-made strategies. This trend will inevitably get stronger.

As usual any comments welcome

# The Rise of the Robots (Advisors...)
# - August 2015

# STEP 1: Data gathering
library(XML)
library(stringr)
library(ggplot2)

tables <- readHTMLTable("")   # URL omitted in the original
a <- tables[[1]]              # assuming the table of interest is the first one returned
b <- a[,c(1:3)]
colnames(b) <- c("company","aum","type")
c <- b[which(b[,"type"] == "Roboadvisor"),]
d <- b[which(b[,"type"] == "RoboAdvisor"),]
e <- rbind(c,d)
f <- as.numeric(str_replace_all(e[,2], "[[:punct:]]",""))
g <- cbind(e[,c("company","type")],f)
colnames(g) <- c("company","type","aum")

# STEP 2: Chart
globalAum <- 19000000000
h <- cbind(g,g[,"aum"]/globalAum)
colnames(h) <- c("company","type","aum","marketShare")
i <- cbind("Others","Roboadvisor",globalAum-sum(h[,"aum"]),1-sum(h[,"aum"]/globalAum))
colnames(i) <- c("company","type","aum","marketShare")
j <- rbind(h,i)
k <- j[order(j[,"marketShare"],decreasing =TRUE),]
k[,"marketShare"] <- round(100*as.numeric(k[,"marketShare"]),1)

# Bar plot with ggplot2
ggplot(data=k, aes(x=company, y=marketShare, fill=company)) +
 geom_bar(stat="identity") +
 theme(axis.ticks = element_blank(), axis.text.y = element_blank(), text = element_text(size = 20)) +
 coord_flip() +
 labs(title = "Robots Advisors \n Market Shares as of end 2014") +
 ylab("Market Share (%)") +
 xlab(" ")

R financial time series tips everyone should know about

There are many R time series tutorials floating around on the web; this post is not designed to be one of them. Instead I want to introduce a list of the most useful tricks I came across when dealing with financial time series in R. Some of the functions presented here are incredibly powerful but unfortunately buried in the documentation, hence my desire to create a dedicated post. I only address daily or lower frequency time series; dealing with higher frequency data requires specific tools such as the data.table or highfrequency packages.

xts: The xts package is the must-have when it comes to time series in R. The example below loads the package and creates a daily time series of 400 days of normally distributed returns.

library(xts)
myTs <- xts(rnorm(400)/100, seq(Sys.Date()-399, Sys.Date(), "days"))   # 400 consecutive days ending today

merge.xts (package xts): This is incredibly powerful when it comes to binding two or more time series together, whether they have the same length or not. The join argument does the magic: it determines how the binding is done.

myOtherTs <- xts(rnorm(200)/100, seq(Sys.Date()-199, Sys.Date(), "days"))   # 200 consecutive days ending today

# dates intersection
tsInter <- merge.xts(myTs,myOtherTs,join="inner")
# dates union and blanks filled with NAs
tsUnion <- merge.xts(myTs,myOtherTs,join="outer")

apply.yearly/apply.monthly (package xts): Applies a specified function to each distinct period of a given time series object. The example below calculates the monthly and yearly returns of the second series in the tsInter object. Note that I use the sum of returns (no compounding).

monthlyRtn <- apply.monthly(tsInter[,2], sum)
yearlyRtn <- apply.yearly(tsInter[,2], sum)

endpoints (package xts): Extracts the index values of a given xts object corresponding to the last observations of each period specified by on. The example below gives the last-day-of-month returns for each series in the tsInter object, using endpoints to select the dates.

newTs <- tsInter[endpoints(tsInter, on="months"),]

na.locf (package zoo): Generic function for replacing each NA with the most recent non-NA value prior to it. This is extremely useful when dealing with a time series that has a few "holes" and when this series is subsequently used as input for an R function that does not accept arguments with NAs. In the example I create a series of random prices, artificially include a few NAs, and replace them with the most recent value.

library(zoo)

n <- 100
k <- 5
N <- k*n
prices <- cumsum(rnorm(N))
theSample <- sample(c(1:length(prices)),10)
prices[theSample] <- NA
prices <- na.locf(prices)

charts.PerformanceSummary (package PerformanceAnalytics): For a set of returns, creates a wealth index chart, bars for per-period performance, and an underwater chart for drawdowns. This is incredibly useful as it displays in a single window all the relevant information for a quick visual inspection of a trading strategy. The example below turns the prices series into an xts object and then displays a window with the 3 charts described above.

library(PerformanceAnalytics)

prices <- xts(prices, seq(Sys.Date()-(N-1), Sys.Date(), "days"))
charts.PerformanceSummary(Return.calculate(prices, method="discrete"))

The list above is by no means exhaustive, but once you master the functions described in this post, manipulating financial time series becomes a lot easier and the code shorter and more readable.

As usual any comments welcome

Factor Evaluation in Quantitative Portfolio Management

When it comes to managing a portfolio of stocks versus a benchmark, the problem is very different from defining an absolute return strategy. In the former one has to hold more stocks than in the latter, where no stocks at all can be held if there is no good enough opportunity. The reason for that is the tracking error, defined as the standard deviation of the portfolio return minus the benchmark return. The fewer stocks held versus the benchmark, the higher the tracking error (i.e., the higher the risk).
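As a quick illustration, here is a minimal sketch of a tracking error calculation on simulated daily returns (the series and the annualisation factor are assumptions for the example, not data from this post):

# Tracking error as the standard deviation of active returns
set.seed(123)
benchmarkRtn <- rnorm(252, 0, 0.01)                    # simulated daily benchmark returns
portfolioRtn <- benchmarkRtn + rnorm(252, 0, 0.002)    # simulated daily portfolio returns
activeRtn <- portfolioRtn - benchmarkRtn
trackingError <- sd(activeRtn) * sqrt(252)             # annualised, assuming daily data
trackingError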

The analysis that follows is largely inspired by the book  “Active Portfolio Management” by Grinold & Kahn. This is the bible for anyone interested in running a portfolio against a benchmark. I strongly encourage anyone with an interest in the topic to read the book from the beginning to the end.  It’s very well written and lays the foundations of systematic active portfolio management (I have no affiliation to the editor or the authors).

1 – Factor Analysis

Here we're trying to rank as accurately as possible the stocks in the investment universe on a forward return basis. Many tools, and countless variants of those tools, have been developed to achieve this. In this post I focus on two simple and widely used metrics: Information Coefficient (IC) and Quantiles Return (QR).

1.1 – Information Coefficient

The IC gives an overview of the factor's forecasting ability. More precisely, it is a measure of how well the factor ranks the stocks on a forward return basis. The IC is defined as the rank correlation (ρ) between the metric (i.e., the factor) and the forward return. In statistical terms the rank correlation is a nonparametric measure of dependence between two variables. For a sample of size n, the n raw scores X_i, Y_i are converted to ranks x_i, y_i, and ρ is computed as:

ρ = 1 - (6 Σ (x_i - y_i)²) / (n (n² - 1))
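In R this is a one-liner; below is a minimal sketch on simulated data (the factor and forward return vectors are made up for illustration; the same cor call appears in the full code later in this post):

# Spearman rank correlation between a factor and the forward return
set.seed(1)
factorValue   <- rnorm(500)                           # simulated factor scores
forwardReturn <- 0.05 * factorValue + rnorm(500)      # simulated forward returns
ic <- cor(factorValue, forwardReturn, method = "spearman", use = "pairwise.complete.obs")
ic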

The horizon for the forward return has to be defined by the analyst and it’s a function of  the strategy’s turnover and the alpha decay (this has been the subject of extensive research). Obviously ICs must be as high as possible in absolute terms.

For the keen reader, the book by Grinold & Kahn gives a formula linking the Information Ratio (IR) and the IC: IR = IC * sqrt(breadth), with breadth being the number of independent bets (trades). This formula is known as the fundamental law of active management. The problem is that defining breadth accurately is often not as easy as it sounds.
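As a purely illustrative order of magnitude (the numbers are chosen for the example, not taken from the book): an IC of 0.02 applied to 500 independent bets per year gives IR ≈ 0.02 * sqrt(500) ≈ 0.45.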

1.2 – Quantiles Return

In order to get a more accurate estimate of the factor's predictive power, it's necessary to go a step further and group stocks by quantile of factor values, then analyse the average forward return (or any other central tendency metric) of each of those quantiles. The usefulness of this tool is straightforward: a factor can have a good IC but its predictive power might be limited to a small number of stocks. This is not good, as a portfolio manager will have to pick stocks within the entire universe in order to meet the tracking error constraint. Good quantiles returns are characterised by a monotonic relationship between the individual quantiles and forward returns.
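A minimal sketch of the idea on simulated data (the cut/quantile bucketing mirrors the approach used in the full code later in this post; the data itself is made up):

# Median forward return by factor quintile
set.seed(1)
factorValue   <- rnorm(500)
forwardReturn <- 0.05 * factorValue + rnorm(500)
bucket <- cut(factorValue,
              breaks = quantile(factorValue, probs = seq(0, 1, 0.2), na.rm = TRUE),
              labels = 1:5, include.lowest = TRUE)
tapply(forwardReturn, bucket, median, na.rm = TRUE)    # Q1 (lowest factor) to Q5 (highest)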

2 – Data and code

The investment universe is all the stocks in the S&P500 index (at the time of writing). Obviously there is a survivorship bias: the list of stocks in the index has changed significantly between the start and the end of the sample period. However, it's good enough for illustration purposes.

The code below downloads individual stock prices in the S&P500 between January 2005 and today (it takes a while) and turns the raw prices into returns over the last 12 months and over the last month. The former is our factor, the latter will be used as the forward return measure.

# Factor Evaluation in Quantitative Portfolio Management
# - Mar. 2015

library(XML)
library(tseries)
library(zoo)

startDate <- "2005-01-01"
tables <- readHTMLTable("")   # URL omitted in the original
tickers <- as.matrix(tables[[1]]["Ticker symbol"])

instrumentRtn <- function(instrument=instrument,startDate=startDate,lag=lag){
 price <- get.hist.quote(instrument, quote="Adj", start=startDate, retclass="zoo")
 monthlyPrice <- aggregate(price, as.yearmon, tail, 1)
 monthlyReturn <- diff(log(monthlyPrice),lag=lag)
 monthlyReturn <- exp(monthlyReturn)-1
 return(monthlyReturn)
}

dataFactor <- list()
dataRtn <- list()

for (i in 1:length(tickers)) {
 dataFactor[[i]] <- instrumentRtn(tickers[i],startDate,lag=12)
 dataRtn[[i]] <- instrumentRtn(tickers[i],startDate,lag=1)
}

Below is the code to compute the Information Coefficient and the Quantiles Return. Note that I used quintiles in this example, but any other grouping method (terciles, deciles, etc.) can be used; it really depends on the sample size, what you want to capture and whether you want a broad overview or to focus on the distribution tails. For estimating returns within each quintile, the median has been used as the central tendency estimator, as it is much less sensitive to outliers than the arithmetic mean.

theDates <- as.yearmon(seq(as.Date(startDate), to=Sys.Date(), by="month"))

findDateValue <- function(x=x,theDate=theDate){
 pos <- match(as.yearmon(theDate),index(x))
 if (is.na(pos)) return(NA)
 return(as.numeric(x[pos]))
}

factorStats <- NULL

for (i in 1:(length(theDates)-1)){
 factorValue <- unlist(lapply(dataFactor,findDateValue,theDate=as.yearmon(theDates[i])))
 if (length(which(!is.na(factorValue))) > 10){   # keep only dates with enough non-missing factor values
  bucket <- cut(factorValue,breaks=quantile(factorValue,probs=seq(0,1,0.2),na.rm=TRUE),labels=c(1:5),include.lowest = TRUE)
  rtnValue <- unlist(lapply(dataRtn,findDateValue,theDate=as.yearmon(theDates[i+1])))

  ic <- cor(factorValue,rtnValue,method="spearman",use="pairwise.complete.obs")

  quantilesRtn <- NULL

  for (j in sort(unique(bucket))){
   pos <- which(bucket == j)
   quantilesRtn <- cbind(quantilesRtn,median(rtnValue[pos],na.rm=TRUE))
  }
  factorStats <- rbind(factorStats,cbind(quantilesRtn,ic))
 }
}

colnames(factorStats) <- c("Q1","Q2","Q3","Q4","Q5","IC")

qs <- apply(factorStats[,c("Q1","Q2","Q3","Q4","Q5")],2,median,na.rm=TRUE)
ic <- round(median(factorStats[,"IC"],na.rm=TRUE),4)

And finally the code to produce the Quantiles Return chart.

bplot <- barplot(qs,
  col="royal blue",
  main="S&P500 Universe \n 12 Months Momentum Return - IC and QS")

legend("topleft",                                   # legend position assumed
   paste("Information Coefficient = ",ic,sep=""),
   bty = "n")

[Figure: IC and Quantiles Return chart - Mar 2015]

3 – How to exploit the information above?

In the chart above, Q1 contains the stocks with the lowest past 12 months return and Q5 those with the highest. There is an almost monotonic increase in the quantiles return from Q1 to Q5, which clearly indicates that stocks falling into Q5 outperform those falling into Q1 by about 1% per month. This is very significant and powerful for such a simple factor (not really a surprise though…). There are therefore greater chances of beating the index by overweighting the stocks falling into Q5 and underweighting those falling into Q1 relative to the benchmark.

An IC of 0.0206 might not mean a great deal in itself, but it's significantly different from 0 and indicates a good overall predictive power of the past 12 months return. Formal significance tests can be run but this is beyond the scope of this article.

4 – Practical limitations

The above framework is excellent for evaluating the quality of an investment factor; however there are a number of practical limitations that have to be addressed for real-life implementation:

  • Rebalancing: In the description above, it's assumed that at the end of each month the portfolio is fully rebalanced. This means all stocks falling into Q1 are underweighted and all stocks falling into Q5 are overweighted relative to the benchmark. This is not always possible for practical reasons: some stocks might be excluded from the investment universe, and there may be constraints on industry or sector weights, on turnover, etc.
  • Transaction costs: These have not been taken into account in the analysis above and they are a serious impediment to real-life implementation. Turnover considerations are usually implemented in real life in the form of a penalty on factor quality.
  • Transfer coefficient: This is an extension of the fundamental law of active management. It relaxes the assumption of Grinold's model that managers face no constraints which preclude them from translating their investment insights directly into portfolio bets (see the note below).
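For reference (this comes from the literature extending the fundamental law, not from this post): with a transfer coefficient TC measuring how much of the insights survive the constraints, the generalised law is usually written as IR ≈ TC * IC * sqrt(breadth).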

And finally, I’m amazed by what can be achieved in less than 80 lines of code with R…

As usual any comments welcome


Risk as a “Survival Variable”

I come across a lot of strategies on the blogosphere: some are interesting, some are a complete waste of time, but most share a common feature. The people developing those strategies do their homework in terms of analysing returns, but much less attention is paid to the risk side and its random nature. I've seen comments like "a 25% drawdown in 2011 but excellent return overall". Well, my bet is that no one on earth will let you experience a 25% loss with their money (unless special agreements are in place). In the hedge fund world people have very low tolerance for drawdowns. Generally, as a new trader in a hedge fund, assuming that you come with no reputation, you have very little time to prove yourself. You should make money from day 1 and keep on doing so for a few months before you gain a bit of credibility.

First let's say you have a bad start and you lose money at the beginning. With a 10% drawdown you're most certainly out, but even with a 5% drawdown the chances of seeing your allocation reduced are very high. This has significant implications for your strategies. Let's assume that if you lose 5% your allocation is divided by 2, and you come back to your initial allocation only when you pass the high water mark again (i.e., the drawdown comes back to 0). In the chart below I simulated the experiment with one of my strategies.
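For readers who want to replicate the mechanics, here is a minimal sketch of that allocation rule on simulated returns (the return series, the thresholds and the use of the account's own drawdown curve are assumptions for the example, not the strategy shown in the chart):

# Halve the allocation after a 5% drawdown; restore it once a new high water mark is reached
set.seed(42)
dailyRtn <- rnorm(250, 0.0005, 0.01)     # hypothetical daily strategy returns
alloc  <- 1                              # current allocation multiplier
equity <- hwm <- 1                       # account equity and its high water mark
equityPath <- equity
for (r in dailyRtn) {
  equity <- equity * (1 + alloc * r)     # trade with the current allocation
  hwm    <- max(hwm, equity)
  dd     <- equity / hwm - 1             # current drawdown
  if (dd <= -0.05) alloc <- 0.5          # allocation cut by 50%
  if (dd >= 0)     alloc <- 1            # back at the high water mark: full allocation again
  equityPath <- c(equityPath, equity)
}
plot(equityPath, type = "l", main = "Equity with the allocation rule applied")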


You start trading on 1st June 2003 and all goes well until 23rd July 2003, when your drawdown curve hits the -5% threshold (**1**). Your allocation is cut by 50% and you don't cross back above the high water mark until 5th December 2003 (**3**). Had the allocation been kept unchanged, the high water mark would have been crossed on 28th October 2003 (**2**) and by the end of the year you would have made more money.

But let's push the reasoning a bit further. Still on the chart above, assume you get really unlucky and start trading toward mid-June 2003. You hit the 10% drawdown limit by the beginning of August and you're most likely out of the game. Had you started in early August instead, your allocation would not have been cut at all and you would have ended up having a good year in only 4 full months of trading. In those two examples nothing has changed but your starting date…

The trading success of any individual has some form of path dependency and there is not much you can do about it. However, you can control the size of a strategy's drawdown and this should be addressed with great care. A portfolio should be diversified in every possible dimension: asset classes, investment strategies, trading frequencies, etc. From that perspective risk is your "survival variable": if managed properly you have a chance to stay in the game long enough to realise the potential of your strategy; otherwise you won't be there next month to see what happens.

As usual any comments welcome

Installing R/RStudio on Ubuntu 14.04

My last experience with Linux was back in 2002/2003. At that time pretty much everything on Linux was done in the console. I remember struggling for days with a simple Wifi connection because drivers were not readily available. Things have changed dramatically since then. Last week I installed Linux (Ubuntu 14.04) on an old Windows laptop. It took me about 20 minutes to completely erase Windows, install Linux and start playing with R/RStudio: simply amazing… In this post I explain step by step what I did; bear in mind that I'm an absolute Linux beginner.

1 – Install Linux

  • Go to Ubuntu website and download the version that matches your system
  • Create a bootable USB key with the file downloaded above. I used a small utility called Rufus for this. Just follow the instructions on the website it’s very simple.

2 – Install R

Ubuntu 14.04 ships with R but it's not the latest version; the latest version can be obtained from CRAN. An entry like deb http://<my.favorite.cran.mirror>/bin/linux/ubuntu trusty/ has to be added to the /etc/apt/sources.list file, replacing <my.favorite.cran.mirror> by the actual URL of your favorite CRAN mirror (the list of mirrors is available on the CRAN website). This is a bit tricky because you need admin rights to modify the sources.list file. I used a small utility called gksudo to open and modify it. On the command line type the following:

gksudo gedit /etc/apt/sources.list

This will open the sources.list file in gedit. You just need to add the repository entry above, then save and close.

You can then install the complete R system, by typing the following in the console:

sudo apt-get update
sudo apt-get install r-base

There are other ways of doing this, but adding an entry to the sources.list file is apparently the preferred option. Ubuntu uses apt for package management; apt stores a list of repositories (software channels) in the sources.list file, and by editing this file from the command line software repositories can be added or removed.

3 – Install RStudio

  • Go to RStudio website, choose and download the right package for your system
  • Open this file in Ubuntu Software Center
  • Click install and you’re done

If you want to have the RStudio icon on the launcher (the bar of icons on the left-hand side of the screen):

  • Go to Search and type RStudio, the RStudio icon should appear
  • Drag and Drop RStudio icon to the launcher

All this might not be perfect but it worked for me without a glitch. I wanted to share my experience because I'm truly amazed by the improvements brought to Linux over the last few years.

As usual any comments welcome.

A Simple Shiny App for Monitoring Trading Strategies – Part II

This is a follow-up to my previous post "A Simple Shiny App for Monitoring Trading Strategies". I added a few improvements that make the app a bit better (at least for me!). Below is the list of new features:

  • A sample .csv file (the one that contains the raw data)
  • An "EndDate" drop-down box allowing the user to specify the end of the period
  • A "Risk" page containing a VaR analysis and a chart of the worst performance over various horizons
  • A "How To" page explaining how to use and tailor the app to individual needs

I also made the app totally self-contained. It is now available as a stand-alone product and there is no need to have R/RStudio installed on your computer to run it. It can be downloaded from the R Trader Google Drive account. This version of the app runs using portable R and portable Chrome. For the keen reader, this link explains in full detail how to package a Shiny app into a desktop app (Windows only for now).

1 – How to install & run the app on your computer

  • Create a specific folder
  • Unzip the content of the .zip file into that new folder
  • Change the paths in the runShinyApp file to match your settings
  • To run the app, you just have to launch the run.vbs file. I also included an icon (RTraderTradingApp.ico) should you want to create a shortcut on your desktop.

2 – How to use the app as it is?

The app uses several csv files as input (one for each strategy). Each file has two columns: date and daily return. There is an example of such a file in the Github repository. The code is essentially made of 3 files:
  • ui.R: controls the layout and appearance of the app
  • server.R: contains the instructions needed to build the app. You can load as many strategies as you want as long as the corresponding csv file has the right format (see below).
  • shinyStrategyGeneral.R: loads the required packages and launches the app

Put the ui.R and server.R files in a separate directory. In the server.R file, change the inputPath, inputFile and keepColumns parameters to match your settings. The first two are self-explanatory; the third one is a list of column names within the csv file: keep only the date and daily return columns.
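As an illustration, here is a minimal sketch of how one strategy file might be read using those three parameters (the folder, file name and column names below are assumptions, not taken from the repository):

library(xts)

# Write a tiny example file to illustrate the expected format (two columns: date, daily return)
inputPath   <- tempdir()                    # in the real app this would be your data folder
inputFile   <- "strategy1.csv"              # hypothetical file name
keepColumns <- c("date", "dailyReturn")     # hypothetical column names
write.csv(data.frame(date = c("2015-01-02", "2015-01-05"), dailyReturn = c(0.001, -0.0005)),
          file.path(inputPath, inputFile), row.names = FALSE)

raw <- read.csv(file.path(inputPath, inputFile), stringsAsFactors = FALSE)
raw <- raw[, keepColumns]                   # keep only the date and daily return columns
strategyTs <- xts(raw[, 2], order.by = as.Date(raw[, 1]))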

3 – How to add a trading strategy?

  • Create the corresponding .csv file in the right directory
  • Create a new input in the data reactive function (within the server.R file)
  • Add an extra element to the choice parameter in the first selectInput in the sidebarPanel (within the ui.R file). The element’s name should match the name of the new input above.

4 – How to remove a trading strategy?

  • Remove the input in the data reactive function corresponding to the strategy you want to remove (within the server.R file)
  • Remove the element in the choice parameter in the first selectInput in the sidebarPanel corresponding to the strategy you want to remove (within the ui.R file).

Please feel free to get in touch should you have any suggestion.



A Simple Shiny App for Monitoring Trading Strategies

In a previous post I showed how to use R, Knitr and LaTeX to build a template strategy report. This post goes a step further by making the analysis interactive. Besides the interactivity, the Shiny app also solves two problems:

  • I can now access all my trading strategies from a single point regardless of the instrument traded. Coupled with the Shiny interactivity, it allows easier comparison.
  • I can focus on a specific time period.

The code used in this post is available on a Gist/Github repository. There are essentially 3 files.

  • ui.R:  controls the layout and appearance of the app.
  • server.R: contains the instructions needed to build the app. It loads the data and formats it. There is one csv file per strategy, each containing at least two columns: date and return, with the following format: ("2010-12-22", "0.04%"), as parsed in the sketch after this list. You can load as many strategies as you want as long as they have the right format.
  • shinyStrategyGeneral.R: loads the required packages and launches the app.
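A minimal sketch of parsing that date/percentage format (the column names are assumptions for illustration, not taken from the repository):

# Turn ("2010-12-22", "0.04%") style rows into a Date column and a numeric return
raw <- data.frame(date = c("2010-12-22", "2010-12-23"),
                  ret  = c("0.04%", "-0.12%"),
                  stringsAsFactors = FALSE)
raw$date <- as.Date(raw$date, format = "%Y-%m-%d")
raw$ret  <- as.numeric(sub("%", "", raw$ret)) / 100    # "0.04%" -> 0.0004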

This app is probably far from perfect and I will certainly improve it in the future. Feel free to get in touch should you have any suggestion.



A big thank you to the RStudio/Shiny team for such a great tool.


Date formatting in R

As I often manipulate time series from different sources, I rarely come across the same date format twice. Having to reformat the dates every time is a real waste of time because I never remember the syntax of the as.Date function. Below are a few examples that turn strings into the standard R date format.

Besides the usual transformations, two tricks are worth mentioning:

  • When dates are given with a two-digit year, the century has to be adjusted depending on whether it is before or after 1969 (example 4 below).
  • When data is coming from Excel as an integer number (I am on Windows; it might be different for Mac users), the origin has to be specified in the as.Date function (example 9 below).

I usually refer to those examples when I have to create R dates. The code below is self-explanatory.

rawDate1 <- "6aug2005"
date1 <- as.Date(rawDate1, format = "%d%B%Y")

rawDate2 <- "aug061999"
date2 <- as.Date(rawDate2, format = "%B%d%Y")

rawDate3 <- "12-05-2001"
date3 <- as.Date(rawDate3, format = "%m-%d-%Y")

rawDate4 <- "05/27/25"
## if you mean 2025
date4 <- as.Date(rawDate4, format = "%m/%d/%y")
## if you mean 1925
date4 <- as.Date(format(as.Date(rawDate4, format = "%m/%d/%y"), "19%y/%m/%d"),"%Y/%m/%d")

rawDate5 <- "May 27 1984"
date5 <- as.Date(rawDate5, format = "%B %d %Y")

rawDate6 <- "1998-07-22"
date6 <- as.Date(rawDate6, format = "%Y-%m-%d")

rawDate7 <- "20041024"
date7 <- as.Date(rawDate7, format = "%Y%m%d")

rawDate8 <- "22.10.2004"
date8 <- as.Date(rawDate8, format = "%d.%m.%Y")

## Excel on windows date format (origin as of December 30, 1899)
rawDate9 <- 36529
date9 <- as.Date(rawDate9, origin = "1899-12-30")

For those of you who wish to go further, I recommend the following link: Dates and Times in R. It is also worth mentioning the lubridate package and the date package. Both of them provide advanced functions for handling dates and times.

Using Genetic Algorithms in Quantitative Trading

The question one should always ask oneself when using technical indicators is what would be an objective criterion to select the indicators' parameters (e.g., why use a 14-day RSI rather than 15 or 20 days?). Genetic algorithms (GA) are well-suited tools to answer that question. In this post I'll show you how to set up the problem in R. Before I proceed, the usual reminder: what I present in this post is just a toy example and not an invitation to invest. It's not a finished strategy either, but a research idea that needs to be further researched, developed and tailored to individual needs.

What are genetic algorithms?

The best description of GA I came across comes from Cybernetic Trading, a book by Murray A. Ruggiero: "Genetic Algorithms were invented by John Holland in the mid-1970s to solve hard optimisation problems. This method uses natural selection, survival of the fittest". The general process follows the steps below:

  1. Encode the problem into chromosomes
  2. Using the encoding, develop a fitness function for use in evaluating each chromosome’s value in solving a given problem
  3. Initialize a population of chromosomes
  4. Evaluate each chromosome in the population
  5. Create new chromosomes by mating two chromosomes. This is done by mutating and recombining two parents to form two children (parents are selected randomly, but biased by their fitness)
  6. Evaluate the new chromosome
  7. Delete a member of the population that is less fit than the new chromosome and insert the new chromosome in the population.
  8. If the stopping criterion is reached (maximum number of generations, fitness good enough, …) then return the best chromosome; otherwise go to step 4

From a trading perspective GA are very useful because they are good at dealing with highly nonlinear problems. However they exhibit some nasty features that are worth mentioning:

  • Overfitting: This is the main problem and it's down to the analyst to set up the problem in a way that minimises this risk.
  • Computing time: If the problem isn’t properly defined, it can be extremely long to reach a decent solution and the complexity increases exponentially with the number of variables. Hence the necessity to carefully select the parameters.

There are several R packages dealing with GA; I chose to use the most common one: rgenoud.

Data & experiment design

Daily closing prices for the most liquid ETFs from Yahoo Finance, going back to January 2000. The in-sample period goes from January 2000 to December 2010; the out-of-sample period starts in January 2011.

The logic is as follows: the fitness function is optimised over the in-sample period to obtain a set of optimal parameters for the selected technical indicators. The performance of those indicators is then evaluated over the out-of-sample period. But before doing so, the technical indicators have to be selected.

The equity market exhibits two main characteristics that are familiar to anyone with some trading experience: long term momentum and short term reversal. Those features can be translated into technical indicators by a moving average crossover and the RSI. This represents a set of 4 parameters: look-back periods for the long and short term moving averages, look-back period for the RSI and the RSI threshold. The sets of parameters are the chromosomes. The other key element is the fitness function; we might want to use something like maximum return, Sharpe ratio or minimum average drawdown. In what follows, I chose to maximise the Sharpe ratio.

The R implementation is a set of 3 functions:

  1. fitnessFunction: defines the fitness function (e.g., maximum Sharpe ratio) to be used within the GA engine
  2. tradingStatistics: summary of trading statistics for the in and out of sample periods for comparison purposes
  3. genoud: the GA engine from the rgenoud package

The genoud function is rather complex, but I'm not going to explain what each parameter means as I want to keep this post short (and the documentation is really good).
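To make the set-up more concrete, below is a hedged sketch of how the fitness function and the genoud call might fit together. The simulated price series, the exact form of the trading rule, the parameter bounds and the genoud settings are all assumptions for illustration; only the overall structure (a Sharpe-maximising fitness function passed to genoud over 4 integer parameters) follows the post.

# Hedged sketch only: simulated prices, an assumed trading rule and arbitrary
# genoud settings; none of this is the author's exact implementation.
library(TTR)
library(rgenoud)

set.seed(123)
price <- cumprod(1 + rnorm(2500, 0.0002, 0.01))        # hypothetical daily closes

fitnessFunction <- function(params) {
  rsiLookback  <- params[1]; rsiThreshold <- params[2]
  shortMA      <- params[3]; longMA       <- params[4]
  if (shortMA >= longMA) return(-99)                   # enforce short MA < long MA
  rsi  <- RSI(price, n = rsiLookback)
  smaS <- SMA(price, n = shortMA)
  smaL <- SMA(price, n = longMA)
  # one possible way to combine the signals: long when trend is up and RSI below the threshold
  signal <- ifelse(smaS > smaL & rsi < rsiThreshold, 1, 0)
  signal <- c(NA, head(signal, -1))                    # trade on the next bar
  rtn <- ROC(price, type = "discrete") * signal
  rtn <- rtn[!is.na(rtn)]
  if (length(rtn) == 0 || sd(rtn) == 0) return(-99)
  mean(rtn) / sd(rtn) * sqrt(252)                      # annualised Sharpe ratio
}

optimum <- genoud(fitnessFunction, nvars = 4, max = TRUE,
                  pop.size = 100, max.generations = 20,
                  data.type.int = TRUE, boundary.enforcement = 2,
                  Domains = matrix(c(5,  50,    # RSI look-back
                                     40, 70,    # RSI threshold
                                     10, 100,   # short term MA
                                     50, 200),  # long term MA
                                   ncol = 2, byrow = TRUE))
optimum$par                                            # best parameter set found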


In the list below I present, for each instrument, the optimal parameters (RSI look-back period, RSI threshold, short term moving average and long term moving average) along with the in-sample and out-of-sample trading statistics.

  • SPY c(31,62,32,76): in sample, total return = 14.4%, number of trades = 60, hit ratio = 60%; out of sample, total return = 2.3%, number of trades = 8, hit ratio = 50%
  • EFA c(37,60,36,127): in sample, total return = 27.6%, number of trades = 107, hit ratio = 57%; out of sample, total return = 2.5%, number of trades = 11, hit ratio = 64%
  • EEM c(44,55,28,90): in sample, total return = 39.1%, number of trades = 85, hit ratio = 58%; out of sample, total return = 1.0%, number of trades = 17, hit ratio = 53%
  • EWJ c(44,55,28,90): in sample, total return = 15.7%, number of trades = 93, hit ratio = 54%; out of sample, total return = -13.1%, number of trades = 31, hit ratio = 45%

Before commenting on the above results, I want to explain a few important points. To match the logic defined above, I bounded the parameters to make sure the look-back period for the long term moving average is always longer than that of the short term moving average. I also constrained the optimiser to choose only solutions with more than 50 trades in the in-sample period (i.e., statistical significance).

Overall the out-of-sample results are far from impressive: the returns are low and the number of trades is too small for the outcome to be really significant. However, there's a significant loss of efficiency between the in-sample and out-of-sample periods for Japan (EWJ), which very likely means overfitting.


This post is intended to give the reader the tools to properly use GA in a quantitative trading framework. Once again, it's just an example that needs to be further refined. A few potential improvements to explore would be:

  • Fitness function: maximising the Sharpe ratio is very simplistic. A "smarter" function would certainly improve the out-of-sample trading statistics.
  • Pattern: we try to capture a very straightforward pattern. More in-depth pattern research is definitely needed.
  • Optimisation: there are many ways to improve the way the optimisation is conducted. This would improve both the computation speed and the rationality of the results.

The code used in this post is available on a Gist repository.

As usual any comments welcome

Using CART for Stock Market Forecasting

There is an enormous body of literature, both academic and empirical, about market forecasting. Most of the time it mixes two market features: magnitude and direction. In this article I want to focus on identifying the market direction only. The goal I set myself is to identify market conditions when the odds are significantly biased toward an up or a down market. This post gives an example of how CART (Classification And Regression Trees) can be used in this context. Before I proceed, the usual reminder: what I present in this post is just a toy example and not an invitation to invest. It's not a finished strategy either, but a research idea that needs to be further researched, developed and tailored to individual needs.

1 – What is CART and why use it?

CART refers to a set of techniques for classification and prediction. The technique is aimed at producing rules that predict the value of an outcome (target) variable from known values of predictor (explanatory) variables. There are many different implementations but they all share a general characteristic, and that's what I'm interested in. From Wikipedia: "Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Different algorithms use different metrics for measuring "best". These generally measure the homogeneity of the target variable within the subsets. These metrics are applied to each candidate subset, and the resulting values are combined (e.g., averaged) to provide a measure of the quality of the split".

CART methodology exhibits some characteristics that are very well suited for market analysis:

  • Non parametric: CART can handle any type of statistical distributions
  • Non linear: CART can handle a large spectrum of dependency between variables (e.g., not limited to linear relationships)
  • Robust to outliers

There are various R packages dealing with recursive partitioning; I use rpart here for tree estimation and rpart.plot for tree drawing.

2 – Data & Experiment Design

Daily OHLC prices for the most liquid ETFs from January 2000 to December 2013, extracted from Google Finance. The in-sample period goes from January 2000 to December 2010; the rest of the dataset is the out-of-sample period. Before running any type of analysis, the dataset has to be prepared for the task.

The target variable is the ETF weekly forward return, defined as a two-states-of-the-world outcome (UP or DOWN): if the weekly forward return is > 0 then the market is in the UP state, and in the DOWN state otherwise.

The explanatory variables are a set of technical indicators derived from the initial daily OHLC dataset. Each indicator represents a well-documented market behaviour. In order to reduce the noise in the data and to try to identify robust relationships, each independent variable is considered to have a binary outcome (a minimal construction sketch follows the list below).

  • Volatility (VAR1): High volatility is usually associated with a down market and low volatility with an up market. Volatility is defined as the 20-day raw ATR (Average True Range) spread to its moving average (MA). If raw ATR > MA then VAR1 = 1, else VAR1 = -1.
  • Short term momentum (VAR2): The equity market exhibits short term momentum behaviour, captured here by a 5-day simple moving average (SMA). If Price > SMA then VAR2 = 1, else VAR2 = -1.
  • Long term momentum (VAR3): The equity market exhibits long term momentum behaviour, captured here by a 50-day simple moving average (LMA). If Price > LMA then VAR3 = 1, else VAR3 = -1.
  • Short term reversal (VAR4): This is captured by the CRTDR, which stands for Close Relative To Daily Range and is calculated as: CRTDR = (Close - Low) / (High - Low). If CRTDR > 0.5 then VAR4 = 1, else VAR4 = -1.
  • Autocorrelation regime (VAR5): The equity market tends to go through periods of negative and positive autocorrelation regimes. If the autocorrelation of returns over the last 5 days is > 0 then VAR5 = 1, else VAR5 = -1.
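Purely as an illustration of how these binary variables might be built with TTR and fed to rpart, here is a hedged sketch on simulated OHLC data (the simulated prices, the look-back used for the ATR moving average and the 5-day forward return as a proxy for the weekly return are all assumptions, not the author's exact choices):

# Hedged sketch only: simulated OHLC data and assumed look-back choices
library(TTR)
library(zoo)
library(rpart)
library(rpart.plot)

set.seed(1)
n     <- 1500
close <- cumprod(1 + rnorm(n, 0.0002, 0.01))           # hypothetical daily closes
high  <- close * (1 + runif(n, 0, 0.01))
low   <- close * (1 - runif(n, 0, 0.01))

atr  <- ATR(cbind(high, low, close), n = 20)[, "atr"]
var1 <- ifelse(atr > SMA(atr, 20), 1, -1)              # volatility regime (MA period assumed)
var2 <- ifelse(close > SMA(close, 5), 1, -1)           # short term momentum
var3 <- ifelse(close > SMA(close, 50), 1, -1)          # long term momentum
var4 <- ifelse((close - low) / (high - low) > 0.5, 1, -1)   # CRTDR above/below 0.5
dailyRtn <- ROC(close, type = "discrete")
autoCor  <- rollapply(dailyRtn, 6, function(x) cor(x[-6], x[-1]),
                      fill = NA, align = "right")      # lag-1 autocorrelation over the last 5 days
var5 <- ifelse(autoCor > 0, 1, -1)

fwdRtn <- c(tail(close, -5) / head(close, -5) - 1, rep(NA, 5))   # ~ weekly forward return
target <- factor(ifelse(fwdRtn > 0, "UP", "DOWN"))

df  <- na.omit(data.frame(target, var1, var2, var3, var4, var5))
fit <- rpart(target ~ ., data = df, method = "class")
rpart.plot(fit)                                        # draw the fitted tree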

I put below a tree example with some explanations


In the tree above, the path to reach node #4 is: VAR3 >= 0 (Long Term Momentum >= 0) and VAR4 >= 0 (CRTDR > 0.5). The red rectangle indicates this is a DOWN leaf (i.e., a terminal node) with a probability of 58% (1 – 0.42). In market terms this means that if Long Term Momentum is up and CRTDR is > 0.5, then the probability of a positive return next week is 42% based on the in-sample data. 18% indicates the proportion of the dataset that falls into that terminal node (i.e., leaf).

There are many ways to use the above approach; I chose to estimate and combine all possible trees. From the in-sample data, I collect all leaves from all possible trees and gather them into a matrix. This is the "rules matrix", giving the probability of next week being UP or DOWN.

3 – Results

I apply the rules in the above matrix to the out-of-sample data (Jan 2011 – Dec 2013) and compare the results to the real outcome. The problem with this approach is that a single point (week) can fall into several rules and even belong to UP and DOWN rules simultaneously. Therefore I apply a voting scheme: for a given week I sum up all the rules that apply to that week, counting +1 for an UP rule and -1 for a DOWN rule. If the sum is greater than 0 the week is classified as UP, if the sum is negative it's a DOWN week, and if the sum is equal to 0 no position is taken that week (return = 0).
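A minimal sketch of that voting scheme on made-up numbers:

# Each row is a week, each column a rule that applies that week
# (+1 for an UP rule, -1 for a DOWN rule, 0 if the rule does not apply)
ruleVotes <- matrix(c( 1, -1,  1,  0,
                      -1, -1,  0,  0,
                       1,  1,  1, -1), nrow = 3, byrow = TRUE)
voteSum  <- rowSums(ruleVotes)
position <- sign(voteSum)       # +1 = UP week, -1 = DOWN week, 0 = no position
position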

The above methodology is applied to a set of very liquid ETFs. I plot below the out of sample equity curves along with the buy and hold strategy over the same period.


4 – Conclusion

Initial results seem encouraging even if the quality of the outcome varies greatly by instrument. However there is huge room for improvement. Below are some directions for further analysis:

  • Path optimality: The algorithm used here for defining the trees is optimal at each split but it doesn’t guarantee the optimality of the path. Adding a metric to measure the optimality of the path would certainly improve the above results.
  • Other variables: I chose the explanatory variables solely based on experience. It’s very likely that this choice is neither good nor optimal.
  • Backtest methodology: I used a simple In and Out of sample methodology. In a more formal backtest I would rather use a rolling or expanding window of in and out sample sub-periods (e.g., walk forward analysis)

As usual, any comments welcome