Archive for the ‘Data Science’ Category.

Visualizing Time Series Data in R

I’m very pleased to announce my DataCamp course on Visualizing Time Series Data in R. This course is also part of the Time Series with R skills track. Feel free to have a look: the first chapter is free!

Course Description

As the saying goes, “a chart is worth a thousand words”, which is why visualization is such a powerful way to build a better understanding of your data. After this course you will have a good overview of R’s time series visualization capabilities, you will be better equipped to choose a model for subsequent analysis, and you will be able to convey your message in an efficient and visually appealing way.

Course Outline

Chapter 1: R Time Series Visualization Tools
This chapter will introduce you to basic R time series visualization tools.

Chapter 2: Univariate Time Series
Univariate plots are designed to reveal as much as possible about the distribution, central tendency and spread of the data at hand. In this chapter you will be presented with some visual tools used to diagnose univariate time series.

Chapter 3: Multivariate Time Series
What to do if you have to deal with multivariate time series? In this chapter, you will learn how to identify patterns in the distribution, central tendency and spread over pairs or groups of data.

Chapter 4: Case study: Visually selecting a stock that improves your existing portfolio
Let’s put everything you learned so far into practice! Imagine you already own a portfolio of stocks and you have some spare cash to invest: how can you wisely select a new stock to invest your additional cash in? Analyzing the statistical properties of individual stocks vs. an existing portfolio is a good way to approach the problem.

Linking R to IQFeed with the QuantTools package

IQFeed provides streaming data services and trading solutions that cover the Agricultural, Energy and Financial marketplace. It is a well known and recognized data feed provider geared toward retail users and small institutions. The subscription price starts at around $80/month.

Stanislav Kovalevsky has developed a package called QuantTools. It is an all-in-one package designed to enhance quantitative trading modelling. It allows you to download and organize historical market data from multiple sources like Yahoo, Google, Finam, MOEX and IQFeed. The feature that interests me most is the ability to link IQFeed to R. I’ve been using IQFeed for a few years and I’m happy with it (I’m not affiliated with the company in any way). More information can be found here. I’ve been looking for an integration within R for a while and here it is. As a result, after running a few tests, I moved the code I still had in Python into R. Just for completeness, here’s a link that explains how to download historical data from IQFeed using Python.

QuantTools offers four main functionalities: get market data, store/retrieve market data, plot time series data and back testing.

  • Get Market Data

First make sure that IQFeed is open. You can download either daily or intraday data. The code below downloads daily prices (Open, High, Low, Close) for SPY from 1st Jan 2016 to 1st June 2016.

## Generic parameters 
from = '2016-01-01' 
to = '2016-06-01' 
symbol = 'SPY' 

## Request data 
get_iqfeed_data(symbol, from, to) 

The code below downloads intraday data from 1st May 2016 to 3rd May 2016.

## Generic parameters
from = '2016-05-01' 
to = '2016-05-03'
symbol = 'SPY'

## Request data 
get_iqfeed_data(symbol, from, to, period = 'tick') 

Note the period parameter. Depending on the frequency you need, it can take any of the following values: tick, 1min, 5min, 10min, 15min, 30min, hour, day, week or month.

  • Store/Retrieve Market Data

QuantTools makes the process of managing and storing tick market data easy. You just set up the storage parameters and you are ready to go. The parameters are where the data lives, since what date, and which symbols you would like stored. You can add more symbols at any time, and if they are not present in storage, QuantTools tries to get the data from the specified start date. The code below will save the data in the following directory: "C:/Users/Arnaud/Documents/Market Data/iqfeed". There is one sub-folder per instrument and the data is saved in .rds files.

settings = list(
 iqfeed_storage = paste( path.expand('~') , 'Market Data', 'iqfeed', sep = '/'),
 iqfeed_symbols = c('SPY', 'QQQ'),
 iqfeed_storage_from = format(Sys.Date() - 3)
)
QuantTools_settings(settings)

# Update storage with data from last date available until today
store_iqfeed_data()

You can also store data between specific dates. Replace the last line of code above with one of the lines below:

# Update storage with data from last date available until specified date
store_iqfeed_data(to = format(Sys.Date()))

# Update storage with data between from and to dates,
store_iqfeed_data(from = format(Sys.Date() - 3), to = format(Sys.Date()))

Now should you want to get back some of the data you stored, just run something like:

get_iqfeed_data(symbol = 'SPY', from = '2017-06-01', to = '2017-06-02', period = 'tick', local = TRUE)

Note that only ticks are supported in local storage, so period must be ‘tick’.

  • Plot time series data

QuantTools provides the plot_ts function to plot time series data without weekend, holiday and overnight gaps. In the example below, I first retrieve the data stored above, then select the first 100 price observations and finally draw the chart.

## Retrieve previously stored data
spy = get_iqfeed_data(symbol = 'SPY', 
 from = '2017-06-01', 
 to = '2017-06-02', 
 period = 'tick', 
 local = TRUE)

## Select the first 100 rows
spy_price = spy[,.(time,price)][1:100]

## Plot

Two things to notice: first, spy is a data.table object, hence the syntax above. To get a quick overview of data.table capabilities, have a look at this excellent cheat sheet from DataCamp. Second, the local parameter is TRUE, as the data is retrieved from internal storage.
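To illustrate the data.table syntax used above, here is a minimal, self-contained sketch with toy tick data (the column names mimic what get_iqfeed_data returns; the values are made up):

```r
library(data.table)

# Toy tick data with the same shape as the IQFeed output above
spy <- data.table(
  time   = as.POSIXct("2017-06-01 09:30:00", tz = "UTC") + 0:4,
  price  = c(241.10, 241.20, 241.15, 241.30, 241.25),
  volume = c(100, 200, 150, 300, 250)
)

# .( ) is data.table shorthand for list(): keep only time and price,
# then take the first 3 rows by chaining a second [ ]
spy_price <- spy[, .(time, price)][1:3]
print(spy_price)
```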

  • Back testing

QuantTools allows you to write your own trading strategy using its C++ API. I’m not going to elaborate on this as it is essentially C++ code. You can refer to the Examples section on the QuantTools website.

Overall I find the package extremely useful and well documented. The only missing bit is a live feed between R and IQFeed, which would make the package a real end-to-end solution.

As usual, any comments are welcome.

BERT: a newcomer in the R Excel connection

A few months ago a reader pointed me to this new way of connecting R and Excel. I don’t know how long this has been around, but I had never come across it and I’ve never seen any blog post or article about it. So I decided to write a post, as the tool is really worth it, and before anyone asks, I’m not related to the company in any way.

BERT stands for Basic Excel R Toolkit. It’s free (licensed under the GPL v2) and has been developed by Structured Data LLC. At the time of writing, the current version of BERT is 1.07. More information can be found here. From a more technical perspective, BERT is designed to support running R functions from Excel spreadsheet cells. In Excel terms, it’s for writing User-Defined Functions (UDFs) in R.

In this post I’m not going to show you how R and Excel interact via BERT. There are very good tutorials  here, here and here. Instead I want to show you how I used BERT to build a “control tower” for my trading.

How do I use BERT?

My trading signals are generated using a long list of R files, but I need the flexibility of Excel to display results quickly and efficiently. As shown above, BERT can do this for me, but I also want to tailor the application to my needs. By combining the power of XML, VBA, R and BERT I can create a good looking yet powerful application in the form of an Excel file with minimum VBA code. Ultimately I have a single Excel file gathering all the tasks necessary to manage my portfolio: database update, signal generation, order submission etc… My approach can be broken down into the 3 steps below:

  1. Use XML to build user defined menus and buttons  in an Excel file.
  2. The above menus and buttons are essentially calls to VBA functions.
  3. Those VBA functions are wrappers around R functions defined using BERT.

With this approach I can keep a clear distinction between the core of my code, kept in R, SQL and Python, and everything used to display and format results, kept in Excel, VBA & XML. In the next sections I present the prerequisites to develop such an approach and a step-by-step guide that explains how BERT can be used to simply pass data from R to Excel with minimal VBA code.


1 – Download and install BERT from this link. Once the installation has completed you should have a new Add-Ins menu in Excel with the buttons shown below. This is how BERT materializes in Excel.


2 – Download and install the Custom UI Editor: The Custom UI Editor allows you to create user-defined menus and buttons in the Excel ribbon. A step-by-step procedure is available here.

Step by step guide

1 – R code: The R function below is a very simple piece of code for illustration purposes only. It calculates and returns the residuals from a linear regression. This is what we want to retrieve in Excel. Save it in a file called myRCode.R (any other name is fine) in a directory of your choice.

myFunction <- function(){
 aa <- rnorm(200)
 bb <- rnorm(200)
 res <- lm(aa~bb)$res
 return(res)
}

2 – functions.R in BERT: From Excel select Add-Ins -> Home Directory and open the file called functions.R. In this file paste the following code. Make sure you insert the correct path.
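The code block appears to have been lost here; based on the description that follows, it is presumably just a source() call pointing at the file from step 1, along the lines of (the path below is a placeholder for your own):

```r
source("C:/path/to/myRCode.R")
```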


This simply sources into BERT the R file you created above. Then save and close the file functions.R. Should you want to make any change to the R file created in step 1, you will have to reload it using the BERT button “Reload Startup File” from the Add-Ins menu in Excel.

3 – In Excel: Create and save a file called myFile.xlsm (any other name is fine). This is a macro-enabled file that you save in a directory of your choice. Once the file is saved, close it.

4 – Open the file created above in Custom UI editor: Once the file is open, paste the below code.

<customUI xmlns="">
 <ribbon startFromScratch="false">
  <tab id="RTrader" label="RTrader">
   <group id="myGroup" label="My Group">
    <button id="button1" label="New Button" size="large" onAction="myRCode" imageMso="Chart3DColumnChart" />
   </group>
  </tab>
 </ribbon>
</customUI>

You should have something like this in the XML editor:


Essentially this piece of XML code creates an additional menu (RTrader), a new group (My Group) and a user-defined button (New Button) in the Excel ribbon. Once you’re done, open myFile.xlsm in Excel and close the Custom UI Editor. You should see something like this.


5 – Open VBA editor: In myFile.xlsm insert a new module. Paste the code below in the newly created module.

Sub myRCode(control As IRibbonControl)
   Dim a As Variant
   Dim theLength As Integer
   ' Clear previous results before writing the new ones
   ThisWorkbook.Sheets("Sheet1").Range("B:B").ClearContents
   a = Application.Run("BERT.Call", "myFunction")
   theLength = UBound(a, 1) + 1
   ThisWorkbook.Sheets("Sheet1").Range("B1:B" & theLength).Value = a
End Sub

This erases previous results in the worksheet prior to copying the new ones.

6 – Click New Button: Now go back to the spreadsheet and in the RTrader menu click the “New Button” button. You should see something like the below appearing.


You’re done!

The guide above is a very basic version of what can be achieved using BERT, but it shows you how to combine the power of several specific tools to build your own custom application. From my perspective, the interest of such an approach lies in the ability to glue together R and Excel, obviously, but also to include, via XML (and batch), pieces of code from Python, SQL and more. This is exactly what I needed. Finally, I would be curious to know whether anyone has experience with BERT.

Trading strategy: Making the most of the out of sample data

When testing trading strategies, a common approach is to divide the initial data set into in sample data (the portion used to calibrate the model) and out of sample data (the portion used to validate the calibration and ensure that the performance created in sample is reflected in the real world). As a rule of thumb, around 70% of the initial data can be used for calibration (i.e. in sample) and 30% for validation (i.e. out of sample). A comparison of the in and out of sample data then helps to decide whether the model is robust enough. This post aims to go a step further and provides a statistical method to decide whether the out of sample data is in line with what was created in sample.
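As a minimal sketch of the split itself (simulated returns standing in for real performance data; the 70/30 proportions follow the rule of thumb above):

```r
set.seed(1)

# Simulated daily strategy returns standing in for real performance data
returns <- rnorm(1000, mean = 0.0002, sd = 0.01)

# 70% of the history for calibration, the remaining 30% for validation
cutPoint  <- floor(0.7 * length(returns))
inSample  <- returns[1:cutPoint]
outSample <- returns[(cutPoint + 1):length(returns)]
```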

In the chart below the blue area represents the out of sample performance for one of my strategies.


A simple visual inspection reveals a good fit between the in and out of sample performance, but what degree of confidence do I have in this? At this stage not much, and this is the issue. What is truly needed is a measure of similarity between the in and out of sample data sets. In statistical terms, this can be translated as the likelihood that the in and out of sample performance figures come from the same distribution. There is a non-parametric statistical test that does exactly this: the Kruskal-Wallis test. A good definition of this test can be found on R-Tutor: “A collection of data samples are independent if they come from unrelated populations and the samples do not affect each other. Using the Kruskal-Wallis Test, we can decide whether the population distributions are identical without assuming them to follow the normal distribution.” The added benefit of this test is that it does not assume a normal distribution.

Other tests of the same nature could fit into that framework. The Mann-Whitney-Wilcoxon test or the Kolmogorov-Smirnov test would perfectly suit the framework described here, but discussing the pros and cons of each of these tests is beyond the scope of this article. A good description along with R examples can be found here.
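Before applying it to real strategy returns, here is what a single Kruskal-Wallis test looks like in base R on simulated data (two samples drawn from the same distribution, so the p-value should typically be large):

```r
set.seed(42)
inSample  <- rnorm(500)   # stand-in for in sample returns
outSample <- rnorm(150)   # stand-in for out of sample returns

# kruskal.test accepts a list of samples and tests whether they
# come from the same distribution, with no normality assumption
kt <- kruskal.test(list(inSample, outSample))
kt$p.value
```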

Here’s the code used to generate the chart above and the analysis:

## Making the most of the OOS data
## - Aug. 2016

library(xts)
library(PerformanceAnalytics)

thePath <- "myPath" # change this
theFile <- "data.csv"
data <- read.csv(paste0(thePath,theFile),header=TRUE,sep=",")
data <- xts(data[,2],order.by=as.Date(as.character(data[,1]),format = "%d/%m/%Y"))

##----- Strategy's Chart
thePeriod <- c("2012-02/2016-05")
chart.TimeSeries(data,
 main = "System 1",
 period.areas = thePeriod,
 grid.color = "lightgray",
 period.color = "slategray1")

##----- Kruskal tests
isData <- as.numeric(data["/2012-01"])  # in sample returns (adjust the dates to your setup)
osData <- as.numeric(data[thePeriod])   # out of sample returns

pValue <- NULL
for (i in 1:1000){
 isSample <- sample(isData,length(osData))
 pValue <- rbind(pValue,kruskal.test(list(osData, isSample))$p.value)
}

##----- Mean of p-values
mean(pValue)

In the example above the in sample period is longer than the out of sample period, therefore I randomly created 1000 subsets of the in sample data, each of them having the same length as the out of sample data. I then tested each in sample subset against the out of sample data and recorded the p-values. This process creates not a single p-value for the Kruskal-Wallis test but a distribution, making the analysis more robust. In this example the mean of the p-values is well above zero (0.478), indicating that the null hypothesis cannot be rejected: the in and out of sample data are consistent with coming from the same distribution.

As usual, what is presented in this post is a toy example that only scratches the surface of the problem and should be tailored to individual needs. However, I think it proposes an interesting and rational statistical framework to evaluate out of sample results.

This post is inspired by the following two papers:

Vigier Alexandre, Chmil Swann (2007), “Effects of Various Optimization Functions on the Out of Sample Performance of Genetically Evolved Trading Strategies”, Forecasting Financial Markets Conference

Vigier Alexandre, Chmil Swann (2010), “An optimization process to improve in/out of sample consistency, a Stock Market case”, JP Morgan Cazenove Equity Quantitative Conference, London, October 2010


Introducing fidlr: FInancial Data LoadeR

fidlr is an RStudio addin designed to simplify the process of downloading financial data from various providers. This initial version is a wrapper around the getSymbols function in the quantmod package; only Yahoo, Google, FRED and Oanda are supported. I will probably add functionalities over time. As usual with these things, just a kind reminder: “THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND…”

How to install and use fidlr?

  1. You can get the addin/package from its GitHub repository here (I will register it on CRAN later on).
  2. Install the addin. There is an excellent tutorial on installing RStudio addins here.
  3. Once the addin is installed, it should appear in the Addins menu. Just choose fidlr in the menu and a window as pictured below should appear.
  4. Choose a data provider from the Source dropdown menu.
  5. Select a date range from the Date menu.
  6. Enter the symbol you wish to download in the Instrument text box. To download several symbols, just enter the symbols separated by commas.
  7. Use the radio buttons to choose whether you want to download the instrument into a csv file or into the global environment. The csv file will be saved in the working directory and there will be one csv file per instrument.
  8. Press Run to get the data or Close to close down the addin.

fidlr screenshot

Error messages and warnings are handled by the underlying packages (quantmod and Shiny) and can be read from the console.

This is a very first version of the project, so do not expect perfection, but hopefully it will get better over time. Please report any comment, suggestion, bug etc… to:


Maintaining a database of price files in R

Doing quantitative research implies a lot of data crunching, and one needs clean and reliable data to achieve this. What is really needed is clean data that is easily accessible (even without an internet connection). The most efficient way for me has been to maintain a set of csv files. Obviously this process can be handled in many ways, but over time I have found it very efficient and simple to maintain a directory where I store and update csv files. I have one csv file per instrument and each file is named after the instrument it contains. The reason I do so is twofold: first, I don’t want to download (price) data from Yahoo, Google etc… every time I want to test a new idea; but more importantly, once I have identified and fixed a problem, I don’t want to have to do it again the next time I need the same instrument. Simple yet very efficient so far. The process is summarized in the chart below.


In everything that follows, I assume that data comes from Yahoo. The code will have to be amended for data from Google, Quandl etc… In addition, I present the process of updating daily price data. The setup will be different for higher frequency data and other types of dataset (i.e. different from prices).

1 – Initial data downloading (listOfInstruments.R & historicalData.R)

The file listOfInstruments.R is a file containing only the list of all instruments.

## List of securities (Yahoo tickers) 
## - Nov. 2015
theInstruments = c("^GSPC"
                   # ... add the other tickers here
                   )

If an instrument isn’t part of my list (i.e. there is no csv file in my data folder), or if you are doing this for the very first time, you have to download the initial historical data set. The example below downloads a set of ETF daily prices from Yahoo Finance back to January 2000 and stores the data in csv files.

## Daily prices from Yahoo 
## - Nov. 2015

startDate = "2000-01-01"
thePath = "D:\\daily\\data\\"

library(quantmod)

for (ii in theInstruments){
 data = getSymbols(Symbols = ii, 
                   src = "yahoo", 
                   from = startDate, 
                   auto.assign = FALSE)
 colnames(data) = c("open","high","low","close","volume","adj.")
 write.zoo(data, paste(thePath,ii,".csv",sep=""), sep=",")
}

2 – Update existing data (updateData.R)

The code below starts from the existing files in the dedicated folder and updates them one after the other. I usually run this process every day except when I’m on holiday. To add a new instrument, simply run step 1 above for this instrument alone.

## Update data files 
## - Nov. 2015

lookback = 60
startDate = Sys.Date() - lookback
thePath = "D:\\daily\\data\\"
theFiles = list.files(path=thePath,pattern=".csv")

library(quantmod)

for (ii in theFiles){
 data = read.csv(paste(thePath,ii,sep=""))
 data = xts(data[,c("open","high","low","close","volume","adj.")],
            order.by = as.Date(data[,"Index"],format="%Y-%m-%d"))
 lastHistoricalDate = index(data[nrow(data),])
 recent = getSymbols(Symbols = substr(ii,1,nchar(ii)-4), 
                     src = "yahoo", 
                     from = startDate, 
                     auto.assign = FALSE)
 colnames(recent) = c("open","high","low","close","volume","adj.")

 pos = match(as.Date(lastHistoricalDate,format="%Y-%m-%d"),index(recent))
 if (!is.na(pos)){ 
  if (pos == nrow(recent))
   print("File already up-to-date")
  if (pos < nrow(recent)){
   dt = rbind(data,recent[(pos+1):nrow(recent),])
   write.zoo(dt,paste(thePath,ii,sep=""),sep=",")
  }
 }
 if (is.na(pos))
  print("Error: dates do not match")
}

3 – Create a batch file (updateDailyPrices.bat)

Another important part of the job is creating a batch file that automates the updating process above (I’m a Windows user). This avoids opening R/RStudio and running the code from there. The code below is placed in a .bat file (the path has to be amended for the reader’s setup). Note that I added an output file (updateLog.txt) to track the execution.

cd ../..
C:\progra~1\R\R-3.1.2\bin\R.exe CMD BATCH --vanilla --slave "D:\daily\data\code\updateHistoricalData.R" "D:\daily\data\code\updateLog.txt"

The process above is extremely simple because it only describes how to update daily price data.  I’ve been using this for a while and it has been working very smoothly for me so far. For more advanced data and/or higher frequencies, things can get much trickier.

As usual, any comments are welcome.

Factor Evaluation in Quantitative Portfolio Management

When it comes to managing a portfolio of stocks versus a benchmark, the problem is very different from defining an absolute return strategy. In the former one has to hold more stocks than in the latter, where no stocks at all can be held if there is no good enough opportunity. The reason for that is the tracking error, defined as the standard deviation of the portfolio return minus the benchmark return. The fewer stocks held vs. the benchmark, the higher the tracking error (i.e. the higher the risk).

The analysis that follows is largely inspired by the book  “Active Portfolio Management” by Grinold & Kahn. This is the bible for anyone interested in running a portfolio against a benchmark. I strongly encourage anyone with an interest in the topic to read the book from the beginning to the end.  It’s very well written and lays the foundations of systematic active portfolio management (I have no affiliation to the editor or the authors).

1 – Factor Analysis

Here we’re trying to rank the stocks in the investment universe as accurately as possible on a forward return basis. Many tools, and countless variants of those tools, have been developed to achieve this. In this post I focus on two simple and widely used metrics: the Information Coefficient (IC) and Quantiles Return (QR).

1.1 – Information Coefficient

The IC gives an overview of the factor’s forecasting ability. More precisely, it is a measure of how well the factor ranks the stocks on a forward return basis. The IC is defined as the rank correlation (ρ) between the metric (e.g. factor) and the forward return. In statistical terms the rank correlation is a nonparametric measure of dependence between two variables. For a sample of size n, the n raw scores X_i, Y_i are converted to ranks x_i, y_i, and ρ is computed as: ρ = 1 − (6 Σ (x_i − y_i)²) / (n(n² − 1))
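As a sanity check, the formula above can be verified against R’s built-in Spearman correlation (the two match exactly only when there are no ties in the ranks, which is the case for continuous simulated data):

```r
set.seed(1)
factorScore <- rnorm(50)               # the factor values
fwdReturn   <- factorScore + rnorm(50) # noisy forward returns

# Rank correlation computed directly from the formula
x <- rank(factorScore)
y <- rank(fwdReturn)
n <- length(x)
rhoManual <- 1 - 6 * sum((x - y)^2) / (n * (n^2 - 1))

# Built-in Spearman correlation for comparison
rhoBuiltin <- cor(factorScore, fwdReturn, method = "spearman")
```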

The horizon for the forward return has to be defined by the analyst and it’s a function of  the strategy’s turnover and the alpha decay (this has been the subject of extensive research). Obviously ICs must be as high as possible in absolute terms.

For the keen reader, the book by Grinold & Kahn gives a formula linking the Information Ratio (IR) and the IC: IR = IC * sqrt(breadth), with breadth being the number of independent bets (trades). This formula is known as the fundamental law of active management. The problem is that defining breadth accurately is often not as easy as it sounds.
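A quick numeric illustration of the fundamental law (the numbers are made up): a weak factor applied across many independent bets can still produce a respectable Information Ratio.

```r
ic      <- 0.02   # modest forecasting skill
breadth <- 1000   # hypothetical number of independent bets per year

# Fundamental law of active management: IR = IC * sqrt(breadth)
ir <- ic * sqrt(breadth)
round(ir, 2)
```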

1.2 – Quantiles Return

In order to have a more accurate estimate of the factor’s predictive power, it’s necessary to go a step further and group stocks by quantile of factor values, then analyse the average forward return (or any other central tendency metric) of each of those quantiles. The usefulness of this tool is straightforward. A factor can have a good IC but its predictive power might be limited to a small number of stocks. This is not good, as a portfolio manager will have to pick stocks within the entire universe in order to meet the tracking error constraint. Good quantiles returns are characterised by a monotonic relationship between the individual quantiles and forward returns.
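The grouping logic can be sketched on simulated data (a hypothetical factor with a small amount of real signal); with a genuinely predictive factor the quintile medians should tend to increase from Q1 to Q5:

```r
set.seed(7)
factorValue <- rnorm(500)
fwdReturn   <- 0.01 * factorValue + rnorm(500, sd = 0.05)

# Assign each stock to a quintile of the factor value
bucket <- cut(factorValue,
              breaks = quantile(factorValue, probs = seq(0, 1, 0.2)),
              labels = 1:5, include.lowest = TRUE)

# Median forward return per quintile
quintileMedians <- tapply(fwdReturn, bucket, median)
print(quintileMedians)
```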

2 – Data and code

The universe is all the stocks in the S&P500 index (at the time of writing). Obviously there is a survivorship bias: the list of stocks in the index has changed significantly between the start and the end of the sample period. However, it’s good enough for illustration purposes.

The code below downloads individual stock prices in the S&P500 between Jan 2005 and today (it takes a while) and turns the raw prices into returns over the last 12 months and the last month. The former is our factor, the latter will be used as the forward return measure.

# Factor Evaluation in Quantitative Portfolio Management
# - Mar. 2015

library(XML)
library(tseries)
library(zoo)

startDate <- "2005-01-01"
tables <- readHTMLTable("")
tickers <- as.matrix(tables[[1]]["Ticker symbol"])

instrumentRtn <- function(instrument=instrument,startDate=startDate,lag=lag){
 price <- get.hist.quote(instrument, quote="Adj", start=startDate, retclass="zoo")
 monthlyPrice <- aggregate(price, as.yearmon, tail, 1)
 monthlyReturn <- diff(log(monthlyPrice),lag=lag)
 monthlyReturn <- exp(monthlyReturn)-1
 return(monthlyReturn)
}

dataFactor <- list()
dataRtn <- list()

for (i in 1:length(tickers)) {
 dataFactor[[i]] <- instrumentRtn(tickers[i],startDate,lag=12)
 dataRtn[[i]] <- instrumentRtn(tickers[i],startDate,lag=1)
}

Below is the code to compute the Information Coefficient and Quantiles Return. Note that I used quintiles in this example but any other grouping method (terciles, deciles etc…) can be used. It really depends on the sample size, what you want to capture and whether you want a broad overview or a focus on distribution tails. For estimating returns within each quintile, the median has been used as the central tendency estimator. This measure is much less sensitive to outliers than the arithmetic mean.

theDates <- as.yearmon(seq(as.Date(startDate), to=Sys.Date(), by="month"))

findDateValue <- function(x=x,theDate=theDate){
 pos <- match(as.yearmon(theDate),index(x))
 return(as.numeric(x[pos]))  # value of the series at theDate
}

factorStats <- NULL

for (i in 1:(length(theDates)-1)){
 factorValue <- unlist(lapply(dataFactor,findDateValue,theDate=as.yearmon(theDates[i])))
 if (length(which(!is.na(factorValue))) > 10){
  bucket <- cut(factorValue,breaks=quantile(factorValue,probs=seq(0,1,0.2),na.rm=TRUE),labels=c(1:5),include.lowest = TRUE)
  rtnValue <- unlist(lapply(dataRtn,findDateValue,theDate=as.yearmon(theDates[i+1])))

  ic <- cor(factorValue,rtnValue,method="spearman",use="pairwise.complete.obs")

  quantilesRtn <- NULL

  for (j in sort(unique(bucket))){
   pos <- which(bucket == j)
   quantilesRtn <- cbind(quantilesRtn,median(rtnValue[pos],na.rm=TRUE))
  }
  factorStats <- rbind(factorStats,cbind(quantilesRtn,ic))
 }
}

colnames(factorStats) <- c("Q1","Q2","Q3","Q4","Q5","IC")

qs <- apply(factorStats[,c("Q1","Q2","Q3","Q4","Q5")],2,median,na.rm=TRUE)
ic <- round(median(factorStats[,"IC"],na.rm=TRUE),4)

And finally the code to produce the Quantiles Return chart.

bplot <- barplot(qs,
  col="royal blue",
  main="S&P500 Universe \n 12 Months Momentum Return - IC and QS")

legend("topleft",
   paste("Information Coefficient = ",ic,sep=""),
   bty = "n")

 ICandQS - Mar2015

3 – How to exploit the information above?

In the chart above, Q1 is the lowest past 12-month return and Q5 the highest. There is an almost monotonic increase in the quantiles return between Q1 and Q5, which clearly indicates that stocks falling into Q5 outperform those falling into Q1 by about 1% per month. This is very significant and powerful for such a simple factor (not really a surprise though…). Therefore there are greater chances of beating the index by overweighting the stocks falling into Q5 and underweighting those falling into Q1 relative to the benchmark.

An IC of 0.0206 might not mean a great deal in itself, but it’s significantly different from 0 and indicates a good predictive power of the past 12-month return overall. Formal significance tests can be run but this is beyond the scope of this article.

4 – Practical limitations

The above framework is excellent for evaluating a factor’s quality, however there are a number of practical limitations that have to be addressed for real life implementation:

  • Rebalancing: In the description above, it’s assumed that at the end of each month the portfolio is fully rebalanced. This means all stocks falling into Q1 are underweighted and all stocks falling into Q5 are overweighted relative to the benchmark. This is not always possible for practical reasons: some stocks might be excluded from the investment universe, there are constraints on industry or sector weights, there are constraints on turnover etc…
  • Transaction Costs: These have not been taken into account in the analysis above and they are a serious brake on real life implementation. Turnover considerations are usually implemented in real life in the form of a penalty on factor quality.
  • Transfer coefficient: This is an extension of the fundamental law of active management. It relaxes the assumption of Grinold’s model that managers face no constraints which would preclude them from translating their investment insights directly into portfolio bets.

And finally, I’m amazed by what can be achieved in less than 80 lines of code with R…

As usual, any comments are welcome.


A Simple Shiny App for Monitoring Trading Strategies

In a previous post I showed how to use R, knitr and LaTeX to build a template strategy report. This post goes a step further by making the analysis interactive. Besides the interactivity, the Shiny app also solves two problems:

  • I can now access all my trading strategies from a single point regardless of the instrument traded. Coupled with the Shiny interactivity, it allows easier comparison.
  • I can focus on a specific time period.

The code used in this post is available on a Gist/Github repository. There are essentially 3 files.

  • ui.R:  controls the layout and appearance of the app.
  • server.R: contains the instructions needed to build the app. It loads the data and formats it. There is one csv file per strategy, each containing at least two columns: date and return, with the following format: ("2010-12-22", "0.04%"). You can load as many strategies as you want as long as they have the right format.
  • shinyStrategyGeneral.R: loads the required packages and launches the app.
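As an aside, parsing the date/return format described above into numeric series can be sketched as follows (toy data; this is not the app’s actual code):

```r
# Toy strategy file content in the format described above
raw <- data.frame(date   = c("2010-12-22", "2010-12-23"),
                  return = c("0.04%", "-0.12%"),
                  stringsAsFactors = FALSE)

# Strip the % sign and convert to a numeric daily return
rtn   <- as.numeric(sub("%", "", raw$return)) / 100
dates <- as.Date(raw$date)
```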

This app is probably far from perfect and I will certainly improve it in the future. Feel free to get in touch should you have any suggestion.



A big thank you to the RStudio/Shiny team for such a great tool.


Using Genetic Algorithms in Quantitative Trading

The question one should always ask oneself when using technical indicators is what would be an objective criterion to select indicator parameters (e.g., why use a 14-day RSI rather than 15 or 20 days?). Genetic algorithms (GA) are well suited tools to answer that question. In this post I’ll show you how to set up the problem in R. Before I proceed, the usual reminder: what I present in this post is just a toy example and not an invitation to invest. It’s not a finished strategy either, but a research idea that needs to be further researched, developed and tailored to individual needs.

What are genetic algorithms?

The best description of GA I came across comes from Cybernetic Trading Strategies, a book by Murray A. Ruggiero: "Genetic algorithms were invented by John Holland in the mid-1970s to solve hard optimisation problems. This method uses natural selection, survival of the fittest." The general process follows the steps below:

  1. Encode the problem into chromosomes
  2. Using the encoding, develop a fitness function for use in evaluating each chromosome’s value in solving a given problem
  3. Initialize a population of chromosomes
  4. Evaluate each chromosome in the population
  5. Create new chromosomes by mating two chromosomes. This is done by mutating and recombining two parents to form two children (parents are selected randomly, but biased by their fitness)
  6. Evaluate the new chromosome
  7. Delete a member of the population that is less fit than the new chromosome and insert the new chromosome in the population.
  8. If the stop criterion is reached (maximum number of generations, fitness criterion good enough…) then return the best chromosome; otherwise go to step 4
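The steps above can be sketched in a few lines of base R. This toy version optimises a single-parameter function whose maximum sits at x = 3; it is only meant to illustrate the mechanics, not the rgenoud implementation used later in the post.

```r
set.seed(42)
fitness <- function(x) -(x - 3)^2            # step 2: fitness function

pop <- runif(20, min = -10, max = 10)        # step 3: initial population
for (gen in 1:100) {                         # step 8: generation loop
  fit <- fitness(pop)                        # step 4: evaluate everyone
  # step 5: pick two parents (biased by fitness), recombine and mutate
  parents <- sample(seq_along(pop), 2, prob = exp(fit - max(fit)))
  child   <- mean(pop[parents]) + rnorm(1, sd = 0.5)
  # steps 6-7: evaluate the child and replace a less fit population member
  worst <- which.min(fit)
  if (fitness(child) > fit[worst]) pop[worst] <- child
}
best <- pop[which.max(fitness(pop))]         # best chromosome found
```

After 100 generations the surviving population clusters around the optimum at 3.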

From a trading perspective GAs are very useful because they are good at dealing with highly non-linear problems. However, they exhibit some nasty features that are worth mentioning:

  • Overfitting: This is the main problem, and it's down to the analyst to set up the problem in a way that minimises this risk.
  • Computing time: If the problem isn't properly defined, it can take extremely long to reach a decent solution, and the complexity increases exponentially with the number of variables. Hence the necessity to carefully select the parameters.

There are several R packages dealing with GA; I chose to use the most common one: rgenoud.

Data & experiment design

Daily closing prices for the most liquid ETFs from Yahoo Finance, going back to January 2000. The in sample period goes from January 2000 to December 2010. The out of sample period starts in January 2011.
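As an illustration of the data step: the in/out of sample split is plain base R, while the download itself is sketched with quantmod (an assumption, since the post doesn't name the package used to query Yahoo Finance) and guarded so it only runs interactively.

```r
# Split a date vector into the in sample (2000-2010) and
# out of sample (2011 onwards) periods defined above.
inSampleEnd <- as.Date("2010-12-31")
splitSample <- function(dates) {
  list(inSample  = dates[dates <= inSampleEnd],
       outSample = dates[dates >  inSampleEnd])
}

# Downloading the daily closes would typically use quantmod, whose
# getSymbols defaults to Yahoo Finance; needs a network connection.
if (requireNamespace("quantmod", quietly = TRUE) && interactive()) {
  quantmod::getSymbols("SPY", from = "2000-01-01")
}
```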

The logic is as follows: the fitness function is optimised over the in sample period to obtain a set of optimal parameters for the selected technical indicators. The performance of those indicators is then evaluated over the out of sample period. But before doing so, the technical indicators have to be selected.

The equity market exhibits two main characteristics that are familiar to anyone with some trading experience: long term momentum and short term reversal. Those features can be translated in terms of technical indicators: a moving average crossover and the RSI. This represents a set of 4 parameters: look-back periods for the long and short term moving averages, the look-back period for the RSI, and the RSI threshold. The sets of parameters are the chromosomes. The other key element is the fitness function; we might want to use something like maximum return, Sharpe ratio or minimum average drawdown. In what follows, I chose to maximise the Sharpe ratio.
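A rough base R sketch of what such a fitness function could look like: the chromosome is c(rsiLookback, rsiThreshold, shortMA, longMA). Since the post doesn't spell out how the two indicators are combined, the signal rule below (long on a bullish moving average crossover while the RSI stays below its threshold) is an illustrative assumption, as are the function names.

```r
# Simple moving average via a one-sided filter (NA for the warm-up period)
sma <- function(x, n) as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))

# Crude RSI approximation built on simple (not Wilder) averages
rsi <- function(x, n) {
  d  <- c(NA, diff(x))
  up <- sma(pmax(d, 0), n)
  dn <- sma(pmax(-d, 0), n)
  100 * up / (up + dn)
}

# Annualised Sharpe ratio of the strategy encoded by one chromosome
sharpeFitness <- function(chrom, close) {
  nr <- chrom[1]; thr <- chrom[2]; ns <- chrom[3]; nl <- chrom[4]
  sig <- as.numeric(sma(close, ns) > sma(close, nl) & rsi(close, nr) < thr)
  ret <- c(NA, diff(log(close))) * c(NA, head(sig, -1))  # trade next day
  ret <- ret[!is.na(ret)]
  if (sd(ret) == 0) return(-Inf)
  mean(ret) / sd(ret) * sqrt(252)
}
```

The GA engine would then search the four-dimensional parameter space for the chromosome maximising this value over the in sample period.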

The R implementation is a set of 3 functions:

  1. fitnessFunction: defines the fitness function (e.g., maximum Sharpe ratio) to be used within the GA engine
  2. tradingStatistics: summary of trading statistics for the in and out of sample periods for comparison purposes
  3. genoud: the GA engine from the rgenoud package

The genoud function is rather complex but I’m not going to explain what each parameter means as I want to keep this post short (and the documentation is really good).
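A hedged sketch of how the genoud call could be wired up: the Domains matrix bounds each of the four parameters (the actual bounds and settings used in the post may differ), and the fitness function here is a trivial placeholder. The call is guarded so it only runs if rgenoud is installed.

```r
# Bounds for c(rsiLookback, rsiThreshold, shortMA, longMA). Choosing the
# long MA's lower bound above the short MA's upper bound is one way to
# keep the long look-back period longer than the short one.
Domains <- matrix(c( 5,  50,    # RSI look-back period
                    50,  90,    # RSI threshold
                     5,  50,    # short term moving average
                    60, 200),   # long term moving average
                  ncol = 2, byrow = TRUE)

toyFitness <- function(x) -sum((x - Domains[, 1])^2)  # placeholder fitness

if (requireNamespace("rgenoud", quietly = TRUE)) {
  res <- rgenoud::genoud(toyFitness, nvars = 4, max = TRUE,
                         pop.size = 50, max.generations = 10,
                         Domains = Domains, data.type.int = TRUE,
                         boundary.enforcement = 2, print.level = 0)
}
```

In practice toyFitness would be replaced by the Sharpe-ratio fitness function computed over the in sample data.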


In the table below I present for each instrument the optimal parameters (RSI look-back period, RSI threshold, Short Term Moving Average, and Long Term Moving Average) along with the in and out of sample trading statistics.

Instrument | Parameters | In Sample | Out Of Sample
SPY | c(31,62,32,76) | Total return = 14.4%, 60 trades, hit ratio = 60% | Total return = 2.3%, 8 trades, hit ratio = 50%
EFA | c(37,60,36,127) | Total return = 27.6%, 107 trades, hit ratio = 57% | Total return = 2.5%, 11 trades, hit ratio = 64%
EEM | c(44,55,28,90) | Total return = 39.1%, 85 trades, hit ratio = 58% | Total return = 1.0%, 17 trades, hit ratio = 53%
EWJ | c(44,55,28,90) | Total return = 15.7%, 93 trades, hit ratio = 54% | Total return = -13.1%, 31 trades, hit ratio = 45%
Before commenting on the above results, I want to explain a few important points. To match the logic defined above, I bounded the parameters to make sure the look-back period for the long term moving average is always longer than that of the short term moving average. I also constrained the optimiser to choose only solutions with more than 50 trades in the in sample period (i.e., statistical significance).

Overall the out of sample results are far from impressive: the returns are low, and the number of trades is too small for the outcome to be really significant. However, there's a significant loss of efficiency between the in and out of sample periods for Japan (EWJ), which very likely means overfitting.


This post is intended to give the reader the tools to properly use GA in a quantitative trading framework. Once again, it's just an example that needs to be further refined. A few potential improvements to explore would be:

  • fitness function: maximising the Sharpe ratio is very simplistic. A “smarter” function would certainly improve the out of sample trading statistics
  • pattern: we try to capture a very straightforward pattern. A more in depth pattern research is definitely needed.
  • optimisation: there are many ways to improve the way the optimisation is conducted. This would improve both the computation speed and the rationality of the results.

The code used in this post is available on a Gist repository.

As usual, any comments are welcome.

Using CART for Stock Market Forecasting

There is an enormous body of literature, both academic and empirical, about market forecasting. Most of the time it mixes two market features: magnitude and direction. In this article I want to focus on identifying the market direction only. The goal I set myself is to identify market conditions when the odds are significantly biased toward an up or a down market. This post gives an example of how CART (Classification And Regression Trees) can be used in this context. Before I proceed, the usual reminder: what I present in this post is just a toy example and not an invitation to invest. It's not a finished strategy either, but a research idea that needs to be further researched, developed and tailored to individual needs.

1 – What is CART and why use it?

CART is a set of techniques for classification and prediction. The technique is aimed at producing rules that predict the value of an outcome (target) variable from known values of predictor (explanatory) variables. There are many different implementations, but they all share a general characteristic, and that's what I'm interested in. From Wikipedia: "Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Different algorithms use different metrics for measuring 'best'. These generally measure the homogeneity of the target variable within the subsets. These metrics are applied to each candidate subset, and the resulting values are combined (e.g., averaged) to provide a measure of the quality of the split."

CART methodology exhibits some characteristics that are very well suited for market analysis:

  • Non-parametric: CART can handle any type of statistical distribution
  • Non-linear: CART can handle a large spectrum of dependencies between variables (e.g., not limited to linear relationships)
  • Robust to outliers

There are various R packages dealing with recursive partitioning; I use rpart here for tree estimation and rpart.plot for tree drawing.

2 – Data & Experiment Design

Daily OHLC prices for the most liquid ETFs from January 2000 to December 2013, extracted from Google Finance. The in sample period goes from January 2000 to December 2010; the rest of the dataset is the out of sample period. Before running any type of analysis, the dataset has to be prepared for the task.

The target variable is the ETF weekly forward return, defined as a two-state outcome (UP or DOWN): if the weekly forward return is > 0 then the market is in the UP state, and in the DOWN state otherwise.

The explanatory variables are a set of technical indicators derived from the initial daily OHLC dataset. Each indicator represents a well-documented market behavior.  In order to reduce the noise in the data and to try to identify robust relationships, each independent variable is considered to have a binary outcome.

  • Volatility (VAR1): High volatility is usually associated with a down market and low volatility with an up market. Volatility is defined as the 20-day raw ATR (Average True Range) spread to its moving average (MA). If raw ATR > MA then VAR1 = 1, else VAR1 = -1.
  • Short term momentum (VAR2): The equity market exhibits short term momentum behavior, captured here by a 5-day simple moving average (SMA). If Price > SMA then VAR2 = 1, else VAR2 = -1.
  • Long term momentum (VAR3): The equity market exhibits long term momentum behavior, captured here by a 50-day simple moving average (LMA). If Price > LMA then VAR3 = 1, else VAR3 = -1.
  • Short term reversal (VAR4): This is captured by the CRTDR, which stands for Close Relative To Daily Range and is calculated as follows: CRTDR = (Close - Low) / (High - Low). If CRTDR > 0.5 then VAR4 = 1, else VAR4 = -1.
  • Autocorrelation regime (VAR5): The equity market tends to go through periods of negative and positive autocorrelation regimes. If the returns autocorrelation over the last 5 days is > 0 then VAR5 = 1, else VAR5 = -1.
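A base R sketch of this variable construction on a generic OHLC data frame. In practice one would likely use the TTR package for the ATR and SMA; the helper below uses a simple moving average as a stand-in for Wilder's ATR smoothing, and the function and column names are illustrative.

```r
sma <- function(x, n) as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))

buildVariables <- function(ohlc, atrN = 20, smaN = 5, lmaN = 50, acN = 5) {
  cl   <- ohlc$close
  lagC <- c(NA, head(cl, -1))
  ret  <- c(NA, diff(log(cl)))

  # VAR1: 20-day "ATR" (SMA of the true range) vs its own moving average
  tr   <- pmax(ohlc$high - ohlc$low,
               abs(ohlc$high - lagC), abs(ohlc$low - lagC))
  atr  <- sma(tr, atrN)
  VAR1 <- ifelse(atr > sma(atr, atrN), 1, -1)

  # VAR2 / VAR3: price vs short and long term simple moving averages
  VAR2 <- ifelse(cl > sma(cl, smaN), 1, -1)
  VAR3 <- ifelse(cl > sma(cl, lmaN), 1, -1)

  # VAR4: close relative to daily range (CRTDR)
  crtdr <- (cl - ohlc$low) / (ohlc$high - ohlc$low)
  VAR4  <- ifelse(crtdr > 0.5, 1, -1)

  # VAR5: sign of the lag-1 autocorrelation of the last acN daily returns
  ac <- sapply(seq_along(ret), function(i) {
    if (i <= acN) return(NA)
    w <- ret[(i - acN + 1):i]
    cor(head(w, -1), tail(w, -1))
  })
  VAR5 <- ifelse(ac > 0, 1, -1)

  # Target: UP if the 5-day (weekly) forward return is positive
  fwd <- c(tail(cl, -5), rep(NA, 5)) / cl - 1
  data.frame(VAR1, VAR2, VAR3, VAR4, VAR5,
             target = ifelse(fwd > 0, "UP", "DOWN"))
}
```

Rows with NA values (indicator warm-up periods, the last five target-less days) would be dropped before fitting any tree.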

I put below a tree example with some explanations


In the tree above, the path to reach node #4 is: VAR3 >= 0 (long term momentum >= 0) and VAR4 >= 0 (CRTDR >= 0.5). The red rectangle indicates this is a DOWN leaf (e.g., terminal node) with a probability of 58% (1 – 0.42). In market terms this means that if long term momentum is up and CRTDR is > 0.5, then the probability of a positive return next week is 42%, based on the in sample data. 18% indicates the proportion of the dataset that falls into that terminal node (e.g., leaf).
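Fitting such a tree with rpart would look like the sketch below. The data here is synthetic (with a weak dependence on VAR3 and VAR4 built in) since it is only meant to show the mechanics; rpart.plot would then draw a tree like the one discussed above.

```r
set.seed(123)
# Synthetic stand-in for the in sample data: five binary features
# and a two-state target, as defined above.
n <- 500
d <- data.frame(
  VAR1 = sample(c(-1, 1), n, replace = TRUE),
  VAR2 = sample(c(-1, 1), n, replace = TRUE),
  VAR3 = sample(c(-1, 1), n, replace = TRUE),
  VAR4 = sample(c(-1, 1), n, replace = TRUE),
  VAR5 = sample(c(-1, 1), n, replace = TRUE)
)
# Weak dependence on VAR3 and VAR4 so the tree has something to find
p <- plogis(0.5 * d$VAR3 - 0.5 * d$VAR4)
d$target <- factor(ifelse(runif(n) < p, "UP", "DOWN"))

if (requireNamespace("rpart", quietly = TRUE)) {
  fit <- rpart::rpart(target ~ ., data = d, method = "class")
  # rpart.plot::prp(fit) would draw the fitted tree
}
```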

There are many ways to use the above approach; I chose to estimate and combine all possible trees. From the in sample data, I collect all leaves from all possible trees and gather them into a matrix. This is the "rules matrix", giving the probability of next week being UP or DOWN.

3 – Results

I apply the rules in the above matrix to the out of sample data (Jan 2011 – Dec 2013) and compare the results to the real outcome. The problem with this approach is that a single point (week) can fall into several rules, and even belong to UP and DOWN rules simultaneously. Therefore I apply a voting scheme: for a given week I sum up all the rules that apply to that week, giving +1 for an UP rule and -1 for a DOWN rule. If the sum is greater than 0 the week is classified as UP; if the sum is negative it's a DOWN week; and if the sum equals 0, no position is taken that week (return = 0).
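The voting scheme can be sketched directly in base R; the matrix layout (one row per week, one column per rule, +1 for a firing UP rule, -1 for a firing DOWN rule, 0 if the rule doesn't apply) is an illustrative assumption.

```r
# Classify one week from the votes of all rules applying to it
voteWeek <- function(ruleVotes) {
  s <- sum(ruleVotes)
  if (s > 0) "UP" else if (s < 0) "DOWN" else "FLAT"  # FLAT = no position
}

# Apply the vote row by row (one row per out of sample week)
classifyWeeks <- function(votes) apply(votes, 1, voteWeek)
```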

The above methodology is applied to a set of very liquid ETFs. I plot below the out of sample equity curves along with the buy and hold strategy over the same period.


4 – Conclusion

Initial results seem encouraging, even if the quality of the outcome varies greatly by instrument. However, there is huge room for improvement. I put below some directions for further analysis:

  • Path optimality: The algorithm used here for defining the trees is optimal at each split but it doesn’t guarantee the optimality of the path. Adding a metric to measure the optimality of the path would certainly improve the above results.
  • Other variables: I chose the explanatory variables solely based on experience. It’s very likely that this choice is neither good nor optimal.
  • Backtest methodology: I used a simple In and Out of sample methodology. In a more formal backtest I would rather use a rolling or expanding window of in and out sample sub-periods (e.g., walk forward analysis)

As usual, any comments are welcome.