Maintaining a database of price files in R

Doing quantitative research implies a lot of data crunching, and clean, reliable data is needed to achieve this. What is really needed is clean data that is easily accessible (even without an internet connection). The most efficient way to do this for me has been to maintain a set of csv files. Obviously this process can be handled in many ways, but over time I have found it very efficient and simple to maintain a directory where I store and update csv files: one csv file per instrument, each file named after the instrument it contains. The reason I do so is twofold: first, I don't want to download (price) data from Yahoo, Google etc. every time I want to test a new idea; more importantly, once I have identified and fixed a problem, I don't want to have to do it again the next time I need the same instrument. Simple, yet very efficient so far. The process is summarized in the chart below.

[Figure: process flow chart]

In everything that follows, I assume that data comes from Yahoo. The code will have to be amended for data from Google, Quandl etc. In addition, I present the process of updating daily price data; the setup will be different for higher frequency data and other types of datasets (i.e. different from prices).
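
For instance, a hypothetical sketch of what the download call could look like with Quandl instead of Yahoo (the Quandl package and the dataset code below are assumptions, not part of my setup); the column renaming and write.zoo steps further down would also need to be adapted to whatever columns the alternative source returns.

library(quantmod)
library(Quandl)   # assumed alternative data source

# Yahoo version, used throughout this post
spyYahoo = getSymbols("SPY", src = "yahoo", from = "2000-01-01", auto.assign = FALSE)

# hypothetical Quandl equivalent, returned directly as an xts object
spyQuandl = Quandl("WIKI/SPY", type = "xts", start_date = "2000-01-01")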

1 – Initial data downloading (listOfInstruments.R & historicalData.R)

The file listOfInstruments.R contains nothing but the list of all instruments.

##########################################
## List of securities (Yahoo tickers) 
## thertrader@gmail.com - Nov. 2015
##########################################
theInstruments = c("^GSPC",
                   "SPY",
                   "QQQ",
                   "DDM",
                   "EFA",
                   "EEM",
                   "EWJ")

If an instrument isn't part of my list yet (i.e. there is no csv file for it in my data folder), or if you are doing this for the very first time, you have to download the initial historical dataset. The example below downloads a set of ETFs' daily prices from Yahoo Finance back to January 2000 and stores the data in one csv file per instrument.

##########################################
## Daily prices from Yahoo 
## thertrader@gmail.com - Nov. 2015
##########################################
library(quantmod)

startDate = "2000-01-01"
thePath = "D:\\daily\\data\\"
source(paste(thePath,"code\\listOfInstruments.r",sep=""))

for (ii in theInstruments){
 print(ii)
 # download the full daily history from Yahoo for this instrument
 data = getSymbols(Symbols = ii, 
                   src = "yahoo", 
                   from = startDate, 
                   auto.assign = FALSE)
 colnames(data) = c("open","high","low","close","volume","adj.")
 # one csv file per instrument, named after its ticker
 write.zoo(data,paste(thePath,ii,".csv",sep=""),sep=",",row.names=FALSE)
}

2 – Update existing data (updateData.R)

The code below starts from the existing files in the dedicated folder and updates all of them one after the other. I usually run this process every day except when I'm on holiday. To add a new instrument, simply run step 1 above for that instrument alone.

##########################################
## Update data files 
## thertrader@gmail.com - Nov. 2015
##########################################
library(quantmod)

lookback = 60
startDate = Sys.Date() - lookback
thePath = "D:\\daily\\data\\"
theFiles = list.files(path=thePath,pattern=".csv")

for (ii in theFiles){
 # read the existing csv file and rebuild the xts object
 data = read.csv(paste(thePath,ii,sep=""))
 data = xts(data[,c("open","high","low","close","volume","adj.")],
            order.by = as.Date(data[,"Index"],format="%Y-%m-%d"))
 lastHistoricalDate = index(data[nrow(data),])
 
 # download the last few weeks of data for the same instrument
 recent = getSymbols(Symbols = substr(ii,1,nchar(ii)-4), 
                     src = "yahoo", 
                     from = startDate, 
                     auto.assign = FALSE)
 colnames(recent) = c("open","high","low","close","volume","adj.")

 # locate the last stored date within the freshly downloaded data
 pos = match(as.Date(lastHistoricalDate,format="%Y-%m-%d"),index(recent))
 
 if (!is.na(pos)){
  if (pos == nrow(recent))
   print("File already up-to-date")
 
  if (pos < nrow(recent)){
   # append only the new rows and overwrite the csv file
   dt = NULL
   dt = rbind(data,recent[(pos+1):nrow(recent),])
   write.zoo(dt,paste(thePath,ii,sep=""),sep=",",row.names=FALSE)
  }
 }
 
 if (is.na(pos))
  print("Error: dates do not match")
}

3 – Create a batch file (updateDailyPrices.bat)

Another important part of the job is creating a batch file that automates the updating process above (I'm a Windows user). This avoids having to open R/RStudio and run the code from there. The code below goes into a .bat file (the paths have to be amended for the reader's setup). Note that I added an output file (updateLog.txt) to track the execution.

cd ../..
C:\progra~1\R\R-3.1.2\bin\R.exe CMD BATCH --vanilla --slave "D:\daily\data\code\updateHistoricalData.R" "D:\daily\data\code\updateLog.txt"

The process above is extremely simple because it only deals with updating daily price data. I've been using it for a while and it has been working very smoothly for me so far. For more advanced datasets and/or higher frequencies, things can get much trickier.

As usual, any comments are welcome.

17 Responses to “Maintaining a database of price files in R”


  • Living dangerously, then?

    If for some reason your dt = rbind(data, recent) step fails (several reasons this may happen, e.g. getSymbols() fails) then dt = NULL will overwrite your entire data file.

    I’d recommend at least adding an archiving step prior to loading in the data.

    If you have the time/inclination, an actual database (e.g. MySQL/SQLite/Access) could be of benefit and is easy enough to integrate into an R workflow. Running this sort of thing as a daily cron job on Linux makes timing more consistent too. Lastly, dplyr chaining would make a lot of that code simpler (feel free to get in touch to discuss).

    • Jonathan,

      You're right about the dt = NULL. It has never happened to me, but it is a real risk. I should probably have mentioned that I run a separate process that duplicates all the data files daily, and yes, it could be added to the code I posted.
      Regarding MySQL/SQLite/Access, I think it's overkill for such simple data (low volume, very simple structure). I used SQL-like databases extensively in the past, and there is a real benefit in using those tools only when the data is complex or pre-processing is needed/useful, which is not the case here.
      I'll be happy to have a look at a dplyr implementation if you're willing to share it. I never took the time to have a real look at this package.
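
      A rough sketch of such a duplication step could look like this (the archive path below is just an example, not my actual process):

      thePath = "D:\\daily\\data\\"
      archivePath = "D:\\daily\\archive\\"          # example location, adjust as needed
      theFiles = list.files(path = thePath, pattern = ".csv")

      # copy every csv file into a dated sub-folder before running the update
      backupFolder = file.path(archivePath, format(Sys.Date(), "%Y%m%d"))
      dir.create(backupFolder, recursive = TRUE, showWarnings = FALSE)
      file.copy(from = file.path(thePath, theFiles), to = backupFolder, overwrite = TRUE)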

  • Thank you so much, this is super useful!!!

  • Interesting post!

    Two things you might consider:

    1. paste0 comes in handy and avoids sep=""

    2. In order to avoid clicking on the batch file every day, why not set up an automated task in the Windows Task scheduler? In that case you should redirect output via sink: http://www.statmethods.net/interface/io.html
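
    Something along these lines would do it (paths and file names below are just placeholders):

    # redirect console output to a log file when the script runs unattended
    sink("D:\\daily\\data\\code\\updateLog.txt", append = TRUE)
    cat("Run started:", format(Sys.time()), "\n")
    source("D:\\daily\\data\\code\\updateData.R")
    cat("Run finished:", format(Sys.time()), "\n")
    sink()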

    • Simon,

      Thanks for the tip. I didn’t know about paste0.
      Yes, the scheduler is a possibility. It's just that running the .bat file is part of my daily routine.

  • Hi,

    How do you handle dividends? I tried something similar but had problems when a stock issued a dividend.

    For example, for the past 5 days the adjusted price is the same as the close price. But if today a dividend is distributed, Yahoo will adjust the last 5 days of prices. So what I was collecting will now be different from what I would get if I downloaded the whole series today. Just wondering if you deal with that.

    Thanks.

    • Hi,

      I actually download both closing prices and adjusted prices. I can use any of those series depending on what I’m doing.

      HTH

  • It seems like in the Update Data File script, your startDate is spelled statDate.

  • Hi,

    thanks for your example code!

    I got it working on Linux with some minor modifications. I advise you to use file.path to assemble file names, as in:
    file.path(thePath, ii)
    This makes your code more portable.

    I also had to use the sep argument in read.csv to correctly read the zoo output file, as in:
    data = read.csv(file.path(thePath, ii), sep=" ")

    Greetings,
    –Alberto

  • Thanks for the informative post. I've been working on this as a side project and your overview was very helpful. One thing I've been trying to sort out is getting a "valid" list of stock symbols. Pulling them with the stockSymbols function gets a lengthy list, but many of them return errors with getSymbols. I'm not too concerned about the errors at this point, but I want to make building the list more automated. For example, I wrote a script to generate the list of ticker symbols and then remove all the hyphenated symbols (e.g. preferred stocks, etc.), but some of the five-character symbols continue to generate errors when I run the script that creates the csv files storing the price data. It would be very helpful to have the loop that pulls the data check whether the query to Yahoo returns valid data, and if not, save the symbol elsewhere (e.g. badTickers.csv). Is there any way to do this easily? I'm still learning R but haven't found much on this kind of error handling. Thanks again!
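
    • tryCatch can handle this. A rough sketch of the download loop wrapped in it (the path, the symbol list and badTickers.csv below are just examples):

      library(quantmod)

      thePath = "D:\\daily\\data\\"                 # example path
      symbols = c("SPY","QQQ","NOSUCHTICKER")       # example list
      badTickers = NULL

      for (ii in symbols){
       # return NULL instead of stopping when the Yahoo query fails
       data = tryCatch(getSymbols(Symbols = ii,
                                  src = "yahoo",
                                  from = "2000-01-01",
                                  auto.assign = FALSE),
                       error = function(e) NULL)
       if (is.null(data)){
        badTickers = c(badTickers, ii)
       } else {
        colnames(data) = c("open","high","low","close","volume","adj.")
        write.zoo(data, paste(thePath, ii, ".csv", sep=""), sep=",", row.names=FALSE)
       }
      }
      # keep the failing symbols for later inspection
      write.csv(data.frame(symbol = badTickers), paste(thePath, "badTickers.csv", sep=""), row.names=FALSE)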

  • Nice compact solution!

    Adjusted prices probably need some refining: if a stock goes ex-dividend, in general all past adjusted prices change. This might create gaps in your adjusted series if only the last couple of days are concatenated at the end and no further adjustment calculations are made.
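
    One possible refinement (just a rough sketch that would slot into the update loop of section 2; the tolerance is arbitrary) is to compare the overlapping adjusted prices and refresh the whole file whenever they differ:

    # inside the update loop, once 'data' and 'recent' are available
    overlap = index(recent)[index(recent) <= lastHistoricalDate & index(recent) %in% index(data)]
    adjChanged = any(abs(as.numeric(data[overlap, "adj."]) - as.numeric(recent[overlap, "adj."])) > 1e-6)

    if (adjChanged){
     # a dividend/split has shifted past adjusted prices: re-download the full history
     full = getSymbols(Symbols = substr(ii, 1, nchar(ii) - 4),
                       src = "yahoo",
                       from = "2000-01-01",
                       auto.assign = FALSE)
     colnames(full) = c("open","high","low","close","volume","adj.")
     write.zoo(full, paste(thePath, ii, sep=""), sep=",", row.names=FALSE)
    }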

  • It's a good script, but updateData.R needs a small correction: it seems we need one more closing curly bracket at the end.

  • This was a helpful piece of code. Also, in updateData.R, line no. 7 (lookback = 60) and line no. 8 (startDate = Sys.Date() - lookback) are not required if you change line no. 20 to (from = lastHistoricalDate,).

    Just to make the code more succinct.
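
    In code, the suggestion amounts to something like this (only the relevant lines, as a sketch):

    # inside the loop, once lastHistoricalDate is known
    recent = getSymbols(Symbols = substr(ii, 1, nchar(ii) - 4),
                        src = "yahoo",
                        from = lastHistoricalDate,
                        auto.assign = FALSE)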
