A Simple Shiny App for Monitoring Trading Strategies

In a previous post I showed how to use  R, Knitr and LaTeX to build a template strategy report. This post goes a step further by making  the analysis  interactive. Besides the interactivity, the Shiny App also solves two problems :

  • I can now access all my trading strategies from a single point regardless of the instrument traded. Coupled with the Shiny interactivity, it allows easier comparison.
  • I can focus on a specific time period.

The code used in this post is available on a Gist/Github repository. There are essentially 3 files.

  • ui.R:  controls the layout and appearance of the app.
  • server.R: contains the instructions needed to build the app.  It loads the data and format it. There is one csv file per strategy each containing at least two columns: date and return with the following format: (“2010-12-22″,”0.04%”  ). You can load as much strategies as you want as long as they have the right format.
  • shinyStrategyGeneral.R: loads the required packages and launches the app.

This app is probably far from perfect and I will certainly improve it in the future. Feel free to get in touch should you have any suggestion.



A big thank you to the RStudio/Shiny team for such a great tool.


Date formating in R

As I often manipulate time series from different sources, I rarely come across the same date format twice. Having to reformat the dates every time is a real waste of time because I never remember the syntax of the as.Date function. I put below a few examples that turn strings into standard R date format.

Besides the usual transformations, two tricks are worth mentioning:

  • When dates are given in two digits format, R century has to be adjusted depending on whether it is before or after 1969 (example 4 below).
  • When data is coming from Excel as an integer number (I am on Windows, it might be different for Mac users) the origin has to be specified in the as.Date function (example 9 below).

I usually refer to those examples when I have to create R dates. The code below is self explanatory.

rawDate1 <- "6aug2005"
date1 <- as.Date(rawDate1, format = "%d%B%Y")

rawDate2 <- "aug061999"
date2 <- as.Date(rawDate2, format = "%B%d%Y")

rawDate3 <- "12-05-2001"
date3 <- as.Date(rawDate3, format = "%m-%d-%Y")

rawDate4 <- "05/27/25"
## if you mean 2025
date4 <- as.Date(rawDate4, format = "%m/%d/%y")
## if you mean 1925
date4 <- as.Date(format(as.Date(rawDate4, format = "%m/%d/%y"), "19%y/%m/%d"),"%Y/%m/%d")

rawDate5 <- "May 27 1984"
date5 <- as.Date(rawDate5, format = "%B %d %Y")

rawDate6 <- "1998-07-22"
date6 <- as.Date(rawDate6, format = "%Y-%m-%d")

rawDate7 <- "20041024"
date7 <- as.Date(rawDate7, format = "%Y%m%d")

rawDate8 <- "22.10.2004"
date8 <- as.Date(rawDate8, format = "%d.%m.%Y")

## Excel on windows date format (origin as of December 30, 1899)
rawDate9 <- 36529
date9 <- as.Date(rawDate9, origin = "1899-12-30")

For those of you who wish to go further, I recommend the following link: Dates and Times in R. It is also worth mentioning the lubridate package and the date package. Both of them provide advanced functions for handling dates and times.

Using Genetic Algorithms in Quantitative Trading

The question one should always asked him/herself when using technical indicators is what would be an objective criteria to select indicators parameters (e.g., why using a 14 days RSI rather than 15 or 20 days?). Genetic algorithms (GA) are well suited tools to answer that question. In this post I’ll show you how to set up the problem in R. Before I proceed the usual reminder: What I present in this post is just a toy example and not an invitation to invest. It’s not a finished strategy either but a research idea that needs to be further researched, developed and tailored to individual needs.

What are genetic algorithms?

The best description of GA I came across comes from Cybernatic Trading a book by Murray A. Ruggiero. “Genetic Algorithms were invented by John Holland in the mid-1970 to solve hard optimisation problems. This method uses natural selection, survival of the fittest”. The general process follows the steps below:

  1. Encode the problem into chromosomes
  2. Using the encoding, develop a fitness function for use in evaluating each chromosome’s value in solving a given problem
  3. Initialize a population of chromosomes
  4. Evaluate each chromosome in the population
  5. Create new chromosomes by mating two chromosomes. This is done by  muting and recombining two parents to form two children (parents are selected randomly but biased by their fitness)
  6. Evaluate the new chromosome
  7. Delete a member of the population that is less fit than the new chromosome and insert the new chromosome in the population.
  8. If the stop criteria is reached (maximum number of generations, fitness criteria is good enough…) then return the best chromosome alternatively go to step 4

From a trading perspective GA are very useful because they are good at dealing with highly nonlinear problems. However they exhibit some nasty features that are worth mentioning:

  • Over fitting: This is the main problem and it’s down to the analyst to set up the problem in a way that minimises this risk.
  • Computing time: If the problem isn’t properly defined, it can be extremely long to reach a decent solution and the complexity increases exponentially with the number of variables. Hence the necessity to carefully select the parameters.

There are several R packages dealing with GA, I chose to use the most common one: rgenoud

Data & experiment design

Daily closing prices for most liquid ETFs from Yahoo finance going back to January 2000. The in sample period goes from January 2000 to December 2010. The Out of sample period starts on January 2011.

The logic is as following: the fitness function is optimised  over the in sample period to obtain a set of optimal parameters for the selected technical indicators. The performance of those indicators is then evaluated  in the out of sample period. But before doing so the technical indicators have to be selected.

The equity market exhibits two main characteristics that are familiar to anyone with some trading experience. Long term momentum and short term reversal. Those features can be translated in term of technical indicators by: moving averages cross over and RSI. This represents a set of 4 parameters: Look-back periods for long and short term moving averages, look-back period for RSI and RSI threshold. The sets of parameters are the chromosomes. The other key element is the fitness function. We might want to use something like: maximum return or Sharpe ratio or minimum average Drawdown. In what follows, I chose to maximise the Sharpe ratio.

The R implementation is a set of 3 functions:

  1. fitnessFunction: defines the fitness function (e.g., maximum Sharpe ratio) to be used within the GA engine
  2. tradingStatistics: summary of trading statistics for the in and out of sample periods for comparison purposes
  3. genoud: the GA engine from the rgenoud package

The genoud function is rather complex but I’m not going to explain what each parameter means as I want to keep this post short (and the documentation is really good).


In the table below I present for each instrument the optimal parameters (RSI look-back period, RSI threshold, Short Term Moving Average, and Long Term Moving Average) along with the in and out of sample trading statistics.

Instrument/Parameters In Sample Out Of Sample
SPY c(31,62,32,76) total Return = 14.4%
Number of trades = 60
Hit ratio = 60%
total Return = 2.3%
Number of trades = 8
Hit ratio = 50%
EFA c(37,60,36,127) total Return = 27.6%
Number of trades = 107
Hit ratio = 57%
total Return = 2.5%
Number of trades = 11
Hit ratio = 64%
EEM c(44,55,28,90) total Return = 39.1%
Number of trades = 85
Hit ratio = 58%
total Return = 1.0%
Number of trades = 17
Hit ratio = 53%
EWJ c(44,55,28,90) total Return = 15.7%
Number of trades = 93
Hit ratio = 54%
total Return = -13.1%
Number of trades = 31
Hit ratio = 45%

Before commenting the above results, I want to explain a few important points. To match the logic defined above, I bounded the parameters to make sure the look-back period for the long term moving average is always longer that the shorter moving average. I also constrained the optimiser to choose only the solutions with more than 50 trades in the in sample period (e.g;, statistical significance).

Overall the out of sample results are far from impressive. The returns are low even if the number of trades is small to make the outcome really significant. However there’s a significant loss of efficiency between in and out of sample period for Japan (EWJ) which very likely means over fitting.


This post is intended to give the reader the tools to properly use GA in a quantitative trading framework. Once again, It’s just an example that needs to be further refined. A few potential improvement to explore would be:

  • fitness function: maximising the Sharpe ratio is very simplistic. A “smarter” function would certainly improve the out of sample trading statistics
  • pattern: we try to capture a very straightforward pattern. A more in depth pattern research is definitely needed.
  • optimisation: there are many ways to improve the way the optimisation is conducted. This would improve both the computation speed and the rationality of the results.

The code used in this post is available on a Gist repository.

As usual any comments welcome

Using CART for Stock Market Forecasting

There is an enormous body of literature both academic and empirical about market forecasting. Most of the time it mixes two market features: Magnitude and Direction. In this article I want to focus on identifying the market direction only. The goal I set myself, is to identify market conditions when the odds are significantly biased toward an up or a down market. This post gives an example of how CART (Classification And Regression Trees) can be used in this context. Before I proceed the usual reminder: What I present in this post is just a toy example and not an invitation to invest. It’s not a finished strategy either but a research idea that needs to be further researched, developed and tailored to individual needs.

1 – What is CART and why using it?

From statistics.com, CART are a set of techniques for classification and prediction. The technique is aimed at producing rules that predict the value of an outcome (target) variable from known values of predictor (explanatory) variables. There are many different implementations but they are all sharing a general characteristic and that’s what I’m interested in. From Wikipedia, “Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Different algorithms use different metrics for measuring “best”. These generally measure the homogeneity of the target variable within the subsets. These metrics are applied to each candidate subset, and the resulting values are combined (e.g., averaged) to provide a measure of the quality of the split”.

CART methodology exhibits some characteristics that are very well suited for market analysis:

  • Non parametric: CART can handle any type of statistical distributions
  • Non linear: CART can handle a large spectrum of dependency between variables (e.g., not limited to linear relationships)
  • Robust to outliers

There are various R packages dealing with Recursive Partitioning, I use here rpart for trees estimation and rpart.plot for trees drawing.

2 – Data & Experiment Design

Daily OHLC prices for most liquid ETFs from January 2000 to December 2013 extracted from Google finance. The in sample period goes from January 2000 to December 2010;  the rest of the dataset is the out of sample period. Before running any type of analysis the dataset has to be prepared for the task.

The target variable is the ETF weekly forward return defined as a two states of the world  outcome (UP or DOWN). If weekly forward return > 0 then the market in the UP state, DOWN state otherwise

The explanatory variables are a set of technical indicators derived from the initial daily OHLC dataset. Each indicator represents a well-documented market behavior.  In order to reduce the noise in the data and to try to identify robust relationships, each independent variable is considered to have a binary outcome.

  • Volatility (VAR1): High volatility is usually associated with a down market and low volatility with an up market. Volatility is defined as the 20 days raw ATR (Average True Range) spread to its moving average (MA).  If raw ATR > MA then VAR1 = 1, else VAR1 = -1.
  • Short term momentum (VAR2): The equity market exhibits short term momentum behavior  captured here by a 5 days simple moving averages (SMA). If  Price > SMA  then VAR2 = 1 else VAR2 = -1
  • Long term momentum (VAR3): The equity market exhibits long term momentum behavior  captured here by a 50 days simple moving averages (LMA). If Price > LMA then VAR3 = 1 else VAR3  = -1
  • Short term reversal (VAR4): This is captured by the CRTDR which stands for Close Relative To Daily Range and calculated as following:  CRTDR = {Close - Low }/ {High - Low}. If CRTDR > 0.5, then VAR4 = 1 else VAR4 = -1
  • Autocorrelation regime (VAR5):  The equity market tends to go through periods of negative and positive autocorrelation regimes. If returns autocorrelation over the last 5 days  > 0 then VAR5 = 1 else VAR5 = -1

I put below a tree example with some explanations


In the tree above, the path to reach node #4 is: VAR3 >=0 (Long Term Momentum >= 0)  and  VAR4 >= 0 (CRTDR >= 0).  The red rectangle indicates this is a DOWN leaf (e.g., terminal node) with a probability of 58% (1 – 0.42). In market terms this means that if Long Term Momentum is Up and CRTDR is > 0.5 then the probability of a positive return next week is 42% based on the in sample sample data. 18% indicates the proportion of the data set that falls into that terminal node (e.g., leaf).

There are many ways to use the above approach, I chose to estimate and combine all possible trees. From the in sample data, I collect all leaves from all possible trees and I gather them into a matrix. This is the “rules matrix”  giving the probability of next week beeing UP or DOWN.

3 – Results

I apply the rules in the above matrix to the out of sample data  (Jan 2011 – Dec 2013) and I compare the results to the real outcome. The problem with this approach is that a single point (week) can fall into several rules and even belong to UP and DOWN rules simultaneously. Therefore I apply a voting scheme. For a given week I sum up all the rules that apply to that week giving a +1 for an UP rule and -1 for a DOWN rule. If the sum is greater than 0 the week is classified as UP, if the sum is negative it’s a DOWN week and if the sum is equal to 0 there will be no position taken that week (return = 0)

The above methodology is applied to a set of very liquid ETFs. I plot below the out of sample equity curves along with the buy and hold strategy over the same period.


4 – Conclusion

Initial results seem encouraging even if the quality of the outcome varies greatly by instrument. However there is a huge room for improvement. I put below some directions for further analysis

  • Path optimality: The algorithm used here for defining the trees is optimal at each split but it doesn’t guarantee the optimality of the path. Adding a metric to measure the optimality of the path would certainly improve the above results.
  • Other variables: I chose the explanatory variables solely based on experience. It’s very likely that this choice is neither good nor optimal.
  • Backtest methodology: I used a simple In and Out of sample methodology. In a more formal backtest I would rather use a rolling or expanding window of in and out sample sub-periods (e.g., walk forward analysis)

As usual, any comments welcome


A million ways to connect R and Excel

In quantitative finance both R and Excel are the basis tools for any type of analysis. Whenever one has to use Excel in conjunction with R, there are many ways to approach the problem and many solutions. It depends on what you really want to do and the size of the dataset you’re dealing with. I list some possible connections in the table below.

I want to… R function/package
Read Excel spreadsheet in R gdata
Read R output in Excel write.table
Execute R code in VBA Custom function
Execute R code from Excel spreadsheet RExcel
Execute VBA code in R Custom function
Fully integrate R and Excel RExcel


1 – Read Excel spreadsheet in R

  • gdata: it requires you to install additional Perl libraries on Windows platforms but it’s very powerful.
myDf <- read.xls ("myfile.xlsx"), sheet = 1, header = TRUE)
  • RODBC: This is reported for completeness only. It’s rather dated; there are better ways to interact with Excel nowadays.
  • XLConnect:  It might be slow for large dataset but very powerful otherwise.
wb <- loadWorkbook("myfile.xlsx")
myDf <- readWorksheet(wb, sheet = "Sheet1", header = TRUE)
  • xlsx:  Prefer the read.xlsx2() over read.xlsx(), it’s significantly faster for large dataset.
read.xlsx2("myfile.xlsx", sheetName = "Sheet1")
  • xlsReadWrite: Available for Windows only. It’s rather fast but doesn’t support .xlsx files which is a serious drawback. It has been removed from CRAN lately.
  • read.table(“clipboard”):  It allows to copy data from Excel and read it directly in R. This is the quick and dirty R/Excel interaction but it’s very useful in some cases.
myDf <- read.table("clipboard")

2 – Read R output in Excel
First create a csv output from an R data.frame then read this file in Excel. There is one function that you need to know it’s write.table. You might also want to consider: write.csv which uses “.” for the decimal point and a comma for the separator and write.csv2 which uses a comma for the decimal point and a semicolon for the separator.

x <- cbind(rnorm(20),runif(20))
colnames(x) <- c("A","B")

3 – Execute R code in VBA
RExcel is from my perspective the best suited tool but there is at least one alternative. You can run a batch file within the VBA code.  If R.exe is in your PATH, the general syntax for the batch file (.bat) is:

R CMD BATCH [options] myRScript.R

Here’s an example of how to integrate the batch file above within your VBA code.

4 - Execute R code from an Excel spreadsheet
Rexcel is the only tool I know for the task. Generally speaking once you installed RExcel you insert the excel code within a cell and execute from RExcel spreadsheet menu. See the RExcel references below for an example.

5 – Execute VBA code in R 
This is something I came across but I never tested it myself. This is a two steps process. First write a VBscript wrapper that calls the VBA code. Second run the VBscript in R with the system or shell functions. The method is described in full details here.

6 – Fully integrate R and Excel
RExcel is a project developped by Thomas Baier and Erich Neuwirth, “making R accessible from Excel and allowing to use Excel as a frontend to R”. It allows communication in both directions: Excel to R and R to Excel and covers most of what is described above and more. I’m not going to put any example of RExcel use here as the topic is largely covered elsewhere but I will show you where to find the relevant information. There is a wiki for installing RExcel and an excellent tutorial available here.  I also recommand the following two documents: RExcel – Using R from within Excel and High-Level Interface Between R and Excel. They both give an in-depth view of RExcel capabilities.

The list above is probably not exhaustive. Feel free to come back to me for any addition or modification you might find useful. All code snipets have been created by Pretty R at inside-R.org

Overnight vs. Intraday ETF Returns

I haven’t done much “googling” before posting, so this topic might have been covered elsewhere but I think it’s  really worth sharing or repeating anyway.

A lot has been written about the source of  ETF returns (some insights might be found here). In a nutshell some analysis found that the bulk of the return is made overnight (return between close price at t and open price at t+1). This is only partially true as it hides some major differences across asset classes and regions. The table below displays the sum of daily returns (close to close) , intraday returns (open to close) and overnight returns (close to open) for most liquid ETF over a period going from today back to January 1st 2000 when data is available. The inception date of the ETF is used when no data is available prior to January 1st 2000.

ETF Daily Rtn Intraday Rtn Overnight Rtn
SPY 53.7% -8.1% 59.2%
QQQ 10.7% -84.3% 93.3%
IWN 81.8% 30.4% 52.1%
EEM 51.5% -42.5% 83.8%
EFA 13.2% 73.3% -61.5%
EWG 77.7% 143.1% -62.6%
EWU 41.2% 132.3% -84.5%
EWL 109.4% 229.9% -110.3%
EWJ 10.4% 115% -107.9%
FXI 72.8% 13.8% 45.3%
EWS 89.7% -83.9% 175.9%
GLD 120.9% 18.7% 101.1%
GDX 29% -270.2% 293.5%
SLV -2.8% -36.6% 39.1%
USO -21.6% 56.7% -79.5%
SHY 4% 10.7% -6.5%
IEF 23.5% 37.4% -13.4%
TLT 37.1% 50.6% -13.5%
LQD 16.7% -36.3% 54.3%

A few obvious features clearly appear

  • For US Equity markets (SPY, QQQ, IWM), Emerging Equity Markets (EEM), Metals (GLD,GDX,SLV) and Investment Grades (LQD) the bulk of the return is definitely made overnight. Intraday returns tend to deteriorate the overall performance (intraday return < 0)
  • The exact opposite is true for European Equity Markets (EFA,EWG,EWU,EWL), US Bonds (SHY,IEF,TLT) and Oil (USO). Overnight returns are detracting significantly from the overall performance.

I didn’t manage to come up with a decent explanation about why this is happening but I’m keen on learning if someone is willing to share! I’m not too sure at this stage how this information can be used but it has to be taken into account somehow.

Below is the code for generating the analysis above.

## thertrader@gmail.com - Jan 2014

symbolList <- c("SPY","QQQ","IWN","EEM","EFA","EWG","EWU","EWL","EWJ","FXI","EWS","GLD","GDX","SLV","USO","SHY","IEF","TLT","LQD")

results <- NULL

for (ii in symbolList){
  data <- getSymbols(Symbols = ii, 
                     src = "yahoo", 
                     from = "2000-01-01", 
                     auto.assign = FALSE)

  colnames(data) <- c("open","high","low","close","volume","adj.")

  dailyRtn <- (as.numeric(data[2:nrow(data),"close"])/as.numeric(data[1:(nrow(data)-1),"close"])) - 1
  intradayRtn <- (as.numeric(data[,"close"])/as.numeric(data[,"open"]))-1
  overnightRtn <- (as.numeric(data[2:nrow(data),"open"])/as.numeric(data[1:(nrow(data)-1),"close"])) - 1

  results <- rbind(results,cbind(
    paste(round(100 * sum(dailyRtn,na.rm=TRUE),1),"%",sep=""),
    paste(round(100 * sum(intradayRtn,na.rm=TRUE),1),"%",sep=""),
    paste(round(100 * sum(overnightRtn,na.rm=TRUE),1),"%",sep="")))
colnames(results) <- c("dailyRtn","intradayRtn","overnightRtn")
rownames(results) <- symbolList

As usual any comments welcome

Introduction to R for Quantitative Finance – Book Review

I used some spare time I had over the christmas break to review a book I came across: Introduction to R for Quantitative Finance. An introduction to the book by the authors can be found here.

Introduction to R for Quantitative Finance - cover picture

The book targets folks with some finance knowledge but no or little experience with R. Each chapter is organised around a quant finance topic. Step by step, financial models are built with the associated R code allowing the reader to fully understand the transition from theory to implementation. It also includes some real life examples. The following concepts are covered:

Chap 1: Time Series Analysis

Chap 2: Portfolio Optimisation

Chap 3: Asset Pricing Model

Chap 4: Fixed Income Securities

Chap 5: Estimating the Term Structure of Interest Rates

Chap 6: Derivatives pricing

Chap 7: Credit Risk Management

Chap 8: Extreme Value Theory

Chap 9: Financial Networks

As an experimented R user, I didn’t expect to learn much but I was wrong. I didn’t know about the GUIDE package: a GUI for derivatives pricing, the evir package which gathers functions for extreme value theory and I also learned a few programming tricks.

All in all, this is an excellent book for anyone keen on learning R in a quantitative finance framework. I think it would have benefited from a formal introduction to R and a data Export/Import capabilities review but both topics are extensively covered in many other R resources.

As usual, any comments welcome

Financial Data Accessible from R – part IV

DataMarket is the latest data source of financial data accessible from R I came across. A good tutorial can be found here. I updated the table and the descriptions below.

Source R Package Free Access Available on CRAN Provider url
Yahoo, FRED, Oanda, Google Quantmod Yes Yes Quantmod
Quandl Quandl Yes Yes Quandl
TrueFX TFX Yes Yes TrueFX
Bloomberg Rbbg No No findata
Interactive Broker IBrokers No Yes InteractiveBrokers
Datastream rdatastream No No Datastream
Penn World Table pwt Yes Yes Penn World Table
Yahoo, FRED, Oanda fImport Yes Yes Rmetrics
ThinkNum Thinknum Yes Yes ThinkNum
DataMarket rdatamarket Yes Yes DataMarket

Data Description

  • Yahoo: Free stock quotes, up to date news, portfolio management resources, international market data, message boards, and mortgage rates that help you manage your financial life
  • FRED: Download, graph, and track 149,000 economic time series from 59 sources
  • Oanda: Currency information, tools, and resources for investors, businesses, and travelers
  • Google: Stock market quotes, news, currency conversions & more
  • Quandl: Futures prices, daily. Quandl is a search engine for numerical data. The site offers access to several million financial, economic and social datasets
  • TrueFX: Tick-By-Tick Real-Time And Historical Market Rates, Clean, Aggregated, Dealer Prices
  • Bloomberg: Financial news, business news, economic news, stock quotes, markets quotes, finance stocks, financial markets, stock futures, personal finance, personal finance advice, mutual funds, financial calculators, world business, small business, financial trends, forex trading, technology news, bloomberg financial news
  • Interactive Broker: Interactive Brokers Group, Inc. is an online discount brokerage firm in the United States
  • Datastream: Datastream Professional is a powerful tool that integrates economic research and strategy with cross asset analysis to seamlessly bring together top down and bottom up in one single, integrated application
  • pwt: The Penn World Table provides purchasing power parity and national income accounts converted to international prices for 189 countries/territories for some or all of the years 1950-2010
  • Thinknum: Thinknum brings financial data from a variety of useful sources together on one platform. We use this data to develop applications
  • DataMarket: DataMarket brings complex and diverse data together so you can search, visualize and share data in one place and one format

Package Detail

  • Quantmod: Specify, build, trade, and analyse quantitative financial trading strategies
  • Quandl: This package interacts directly with the Quandl API to offer data in a number of formats usable in R, as well as the ability to upload and search
  • TFX: Connects R to TrueFX(tm) for free streaming real-time and historical tick-by-tick market data for dealable interbank foreign exchange rates with millisecond detail
  • Rbbg: Handles fetching data from the Bloomberg financial data application
  • IBrokers: Provides native R access to Interactive Brokers Trader Workstation API
  • rdatastream: RDatastream is a R interface to the Thomson Dataworks Entreprise SOAP API (non free), with some convenience functions for retrieving Datastream data specifically. This package requires valid credentials for this API
  • pwt: The Penn World Table provides purchasing power parity and national income accounts converted to international prices for 189 countries/territories for some or all of the years 1950-2010
  • fImport: Rmetrics is the premier open source software solution for teaching and training quantitative finance. fImport is the package for Economic and Financial Data Import
  • Thinknum: This package interacts directly with the Thinknum API to offer data in a number of formats usable in R
  • rdatamarket: Fetches data from DataMarket.com, either as timeseries in zoo form (dmseries) or as long-form data frames (dmlist). Metadata including dimension structure is fetched with dminfo, or just the dimensions with dmdims.


Evaluating Quandl Data Quality – part II

This post is a more in depth analysis of Quandl futures data vs. Bloomberg data. Since my last post Quandl has updated its futures database to 200+ contracts from 68 contracts originally. For practical reasons, I limit myself here to the initial list of 60+ contracts. I’m still comparing the “Front Month” contract between the two sources. When evaluating the differences, I want the following:

  • Evaluate the scale of the differences
  • Evaluate the time localization of the differences (if any)
  • A single number that captures both features above
  • A measure that is comparable across instruments

After a bit of thinking, I came up with the below metric:

 D_t  =  {P(Quandl)_t  -  P(Bloomberg)_t }/ {Tick Size}

As an example, below is the chart of the above formula over time for the E-mini S&P 500 contract.


I plotted the same chart for each of the 60 contracts in the list of my previous post. Interested readers can find all the charts here.

From my perspective there are essentially two main sources of differences. First, plain wrong data points largely off compared to the reality and second a difference in the data building process (i.e. construction methodology for the front month contract). A mix of both is very likely to happen here.  In order to quantify this, I defined one additional metric: Mean Absolute Differences (MAD).

 MAD=sum{t=1}{n}{Abs(D_t)}/n for D_t <> 0″ title=”MAD=sum{t=1}{n}{Abs(D_t)}/n for D_t <> 0″/></p>
<p style=

Instrument Quandl Symbol Bloomberg Ticker MAD
Soybean Oil OFDP/FUTURE_BO1 BO1 Comdty 12254897
Russian Ruble OFDP/FUTURE_RU1 RU1 Curncy 29653
DJ-UBS Commodity Index OFDP/FUTURE_AW1 DNA Index 3041
S&P500 Volatility Index OFDP/FUTURE_VX1 UX1 Index 2453
Cocoa OFDP/FUTURE_CC1 CC1 Comdty 1552
Lean Hogs OFDP/FUTURE_LN1 LH1 Comdty 391

Ranking the 60+ contracts on MAD allows to identify immediately large differences which are: Soybean Oil, Russian Ruble, DJ-UBS Commodity Index, S&P500 Volatility Index, Cocoa, and Lean Hogs. Those are the obvious candidates for immediate checking.

I put together what I think is the basis for a systematic data checking approach. It can obviously be refined in many ways but those refinements are largely dependent upon what one want to do with the data and which contracts are relevant to the analyst. As an example I assume that it is more relevant for most people to have accurate data for the E-mini S&P 500 contract than for the Milk contract.

As usual any comments welcome

Evaluating Quandl Data Quality

Quandl has indexed millions of time-series datasets from over 400 sources. All of Quandl’s datasets are open and free. This is great news but before performing any backtest using Quandl data, I want to compare it with a trusted source: Bloomberg for the purpose of this post. I will focus only on daily Futures data here (front month contract).

In the table below, the first few columns are self explanatory, the last two columns are:

  • Intersec. the number of common dates between Quandl and Bloomberg as of November 14, 2013
  • Mismatch the number of different closing prices among those common dates (Intersec. ≥ Mismatch).

Instrument Quandl Symbol Bloomberg Ticker Quandl StartDate Bloomberg StartDate Intersec. Mismatch
Australian Dollar OFDP/FUTURE_AD1 AD1 Curncy 13/01/1987 19/12/1989 5933 2640
DJ-UBS Commodity Index OFDP/FUTURE_AW1 DNA Index 03/10/2011 19/12/2008 532 42
Brent Crude Oil OFDP/FUTURE_B1 CO1 Comdty 17/08/1990 23/06/1988 5906 53
Soybean Oil OFDP/FUTURE_BO1 BO1 Comdty 01/07/1959 02/01/1975 8965 816
British Pound OFDP/FUTURE_BP1 BP1 Curncy 13/02/1975 20/03/1990 5879 2959
Brazilian Real OFDP/FUTURE_BR1 BR1 Curncy 08/11/1995 29/11/1995 2023 1526
Corn OFDP/FUTURE_C1 C 1 Comdty 01/07/1959 01/07/1959 13311 4407
Cocoa OFDP/FUTURE_CC1 CC1 Comdty 05/01/1970 01/07/1959 10826 903
Canadian Dollar OFDP/FUTURE_CD1 CD1 Curncy 17/01/1977 04/04/1986 6666 3055
CER Emissions OFDP/FUTURE_CER1 ICEDCER Index 18/03/2008 13/03/2009 1173 965
WTI Crude Oil OFDP/FUTURE_CL1 CL1 Comdty 30/03/1983 30/03/1983 7620 2549
Cotton OFDP/FUTURE_CT1 CT1 Comdty 01/07/1959 01/07/1959 13559 1361
Milk OFDP/FUTURE_DA1 DA1 Comdty 03/10/2011 11/01/1996 531 24
Full-size Dow Jones OFDP/FUTURE_DJ1 DJ1 Index 02/01/1998 03/04/1998 3918 683
US Dollar Index OFDP/FUTURE_DX1 DX1 Curncy 20/11/1985 07/04/1986 6872 285
Euro FX OFDP/FUTURE_EC1 EC1 Curncy 04/01/1999 16/12/1998 3735 1739
3-Month Eurodollar OFDP/FUTURE_ED1 ED1 Comdty 01/02/1982 01/04/1986 6800 2347
Ethanol OFDP/FUTURE_EH1 DL1 Comdty 03/10/2011 16/05/2005 532 0
1-Month Eurodollar OFDP/FUTURE_EM1 EM1 Comdty 01/10/2007 05/04/1990 1020 19
E-mini S&P 500 Index OFDP/FUTURE_ES1 ES1 Index 09/09/1997 09/09/1997 4112 1112
Feeder Cattle OFDP/FUTURE_FC1 FC1 Comdty 06/09/1973 30/11/1971 9995 4188
30-day Fed Funds OFDP/FUTURE_FF1 FF1 Comdty 03/10/1988 06/12/1988 5251 958
5-year Treasury Note OFDP/FUTURE_FV1 FV1 Comdty 20/05/1988 20/05/1988 6206 2565
Gold OFDP/FUTURE_GC1 GC1 Comdty 31/12/1974 02/01/1975 9214 3006
Copper OFDP/FUTURE_HG1 HG1 Comdty 01/07/1959 06/12/1988 5486 3785
Heating Oil OFDP/FUTURE_HO1 HO1 Comdty 06/03/1979 01/07/1986 6835 2806
Japanese Yen OFDP/FUTURE_JY1 JY1 Curncy 02/08/1976 19/12/1989 6012 2548
Coffee OFDP/FUTURE_KC1 KC1 Comdty 17/08/1973 16/08/1972 9891 324
Lumber OFDP/FUTURE_LB1 LB1 Comdty 16/11/1972 07/04/1986 6922 245
Live Cattle OFDP/FUTURE_LC1 LC1 Comdty 30/11/1964 30/11/1964 12314 5364
Lean Hogs OFDP/FUTURE_LN1 LH1 Comdty 25/06/1969 01/04/1986 6937 3951
UK Natural Gas OFDP/FUTURE_M1 FN1 Comdty 02/01/2007 30/01/1997 1766 3
S&P 400 Midcap Index OFDP/FUTURE_MD1 MD1 Index 13/02/1992 13/02/1992 3980 614
Mexican Peso OFDP/FUTURE_MP1 PEA Curncy 25/04/1995 11/05/2012 382 340
New Zealand Dollar OFDP/FUTURE_NE1 NV1 Curncy 02/09/2004 08/05/1997 1937 189
Natural Gas OFDP/FUTURE_NG1 NG1 Comdty 03/04/1990 03/04/1990 5682 2460
Nikkei 225 Index OFDP/FUTURE_NK1 NX1 Index 25/09/1990 02/10/1990 5643 331
E-mini Nasdaq 100 OFDP/FUTURE_NQ1 NQ1 Index 22/06/1999 21/06/1999 3635 652
Oats OFDP/FUTURE_O1 O 1 Comdty 01/07/1959 01/07/1959 12770 3600
Orange Juice OFDP/FUTURE_OJ1 JO1 Comdty 01/02/1967 01/02/1967 11684 560
Palladium OFDP/FUTURE_PA1 PA1 Comdty 05/01/1977 01/04/1986 5596 1840
Russell 1000 OFDP/FUTURE_RF1 RM1 Index 11/05/2007 20/09/2002 1657 18
Russell Growth OFDP/FUTURE_RG1 RGAA Index 03/05/2010 24/12/2012 230 192
Rough Rice OFDP/FUTURE_RR1 RR1 Comdty 20/08/1986 06/12/1988 5112 363
Russian Ruble OFDP/FUTURE_RU1 RU1 Curncy 03/10/2011 23/04/1998 533 2
Russell Value OFDP/FUTURE_RV1 RVBA Index 03/05/2010 24/12/2012 230 192
Soybeans OFDP/FUTURE_S1 S 1 Comdty 01/07/1959 01/07/1959 13401 4299
Sugar OFDP/FUTURE_SB1 SB1 Comdty 04/01/1961 03/01/1961 13105 814
Swiss Franc OFDP/FUTURE_SF1 SF1 Curncy 13/02/1975 19/12/1989 6010 2741
Silver OFDP/FUTURE_SI1 SI1 Comdty 13/06/1963 02/01/1975 8848 3058
Soybean Meal OFDP/FUTURE_SM1 SM1 Comdty 02/07/1959 02/01/1975 9234 913
Full-size S&P500 Index OFDP/FUTURE_SP1 SP1 Index 21/04/1982 21/04/1982 7580 2688
Russell Small-Cap OFDP/FUTURE_TF1 RTA1 Index 20/08/2007 17/08/2007 1613 8
2-year Treasury Note OFDP/FUTURE_TU1 TU1 Comdty 01/10/2007 25/06/1990 1521 279
10-year Treasury Note OFDP/FUTURE_TY1 TY1 Comdty 03/05/1982 03/05/1982 7817 2044
Ultra Treasury Bond OFDP/FUTURE_UL1 WN1 Comdty 30/10/2012 11/01/2010 262 0
30-year Treasury Bond OFDP/FUTURE_US1 US1 Comdty 22/08/1977 22/08/1977 9067 5198
S&P500 Vol. Index OFDP/FUTURE_VX1 UX1 Index 03/05/2004 26/03/2004 2270 610
Wheat OFDP/FUTURE_W1 W 1 Comdty 01/07/1959 01/07/1959 13407 4033
E-mini Dow Jones OFDP/FUTURE_YM1 DM1 Index 01/10/2007 20/09/2002 1526 15

Main findings

  • I managed to find Bloomberg tickers for 60 contracts out of 68 available on Quandl
  • At first glance differences seem scary but digging a bit deeper reveals a different picture.
  • For some contracts Bloomberg and Quandl don’t use the same number of decimals which inflate artificially the mismatch numbers
  • Some contracts are not expressed in the same unit between Bloomberg and Quandl. I had to apply a multiplier to make an “apple to apple” comparison.
  • Differences seem to cluster in time. It seems like there are different rules for building front month contract (or data source changes?)
  • Differences are often more pronounced at the beginning of the sample.
  • I used PX_LAST field in Bloomberg which might not be the settlement price for all contracts (I have to check this). Quandl uses a settlement price. This might explain some differences as well.


This post isn’t a fully-fledged analysis but it highlights potential issues and it’s up to the interested reader to solve them. I don’t think there are major issues with Quandl data but I wouldn’t use it for serious analysis until I understand better the differences with Bloomberg. This means going through each contract individually and trying to understand the source of the differences. Finally I put below my Quandl wish list:

  • Add  European Equity Indices Futures coverage (I’m not sure a free source even exists)
  • Add an ETF category to Quandl list of instruments
  • Provide more accurate availability schedule for all instruments

For anyone keen on reproducing this analysis, I saved the code used in this post on a Gist/Github repository.

Any comments welcome