Quandl blog
Cory Lesmeister

Cory is a hard-bitten and well-weathered veteran of 15 years in the pharmaceutical industry, having been in sales, market research and currently forecasting. During the day, he uses data, advanced modeling techniques and simulation to create accurate and effective forecasts to support critical business decisions, some worth hundreds of millions of dollars. His spare time is spent in the relentless pursuit of all things Data. A major part of this pursuit is the learning of R software and its various packages. Some of his exploration on the subject is captured in his blog, “Fear and Loathing in Data Science, A Savage Journey to the Heart of Big Data”.

Cory's blog can be found at: http://r-datameister.blogspot.com/


Mythbusting – Dr. Copper

Image: Fairy Circles by Justin Reznick


“An economist is an expert who will know tomorrow why the things he predicted yesterday didn't happen today.” Laurence J. Peter (author and creator of the Peter Principle)

If you were paying attention to financial sites last month, you probably noticed a number of articles on “Dr. Copper”. Here is just a small sample of some of the headlines:

So what is all the fuss about this so-called Doctor Copper? Well, it turns out that at some point in time Copper price action was believed to be a leading indicator for the overall equities market. Thus, it was said that Copper had a PhD in Economics (one can forgive the puns linking Copper to MD).

This raises a number of questions. Was this actually the case? If so, when was it a valid indicator? When did the signal fall apart? As always, R gives us the ability to see if this mythology is confirmed, plausible or busted.

The process to explore the question will be as follows:

  • Examination of Copper's price action over time
  • Econometric analysis of Copper and US equity index prices, subsetting historical periods
  • Granger Causality

Using quandl.com, we can download the time series data we need to get started.

require(Quandl)
## Loading required package: Quandl
## Warning: package 'Quandl' was built under R version 3.0.3
# download the monthly copper price per metric ton; the subset from 1980 forward is taken below
copper = Quandl("WORLDBANK/WLD_COPPER", type = "ts")
plot(copper)

[Plot: monthly copper price, full series]

cop = window(copper, start = c(1980, 1), end = c(2014, 3))
plot(cop)

[Plot: monthly copper price, January 1980 to March 2014]

We can see that something dramatic happened around 2004/5 to drive prices significantly higher, prior to a precipitous collapse and eventual recovery. I will cut this up into a number of subsets, but for now I will play around with changepoints.

require(changepoint)
## Loading required package: changepoint
## Loading required package: zoo
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##     as.Date, as.Date.numeric
## Successfully loaded changepoint package version 1.1
## Created on 2013-06-19
##  Substantial changes to the structure of the package have occured between version 0.8 and 1.0.2.  Please see the package NEWS for details.
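# binary segmentation search for changes in the mean of the copper series
mean.cop = cpt.mean(cop, method = "BinSeg")
## Warning: The number of changepoints identified is Q, it is advised to
## increase Q to make sure changepoints have not been missed.
cpts(mean.cop)
## [1] 289 312 345 355 367
plot(mean.cop)

[Plot: copper price with the detected changepoints in mean]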

Here are a couple of points I think are noteworthy. The first changepoint, at observation 289, equates to January 2004. The collapse in price, exemplified by observation 345, is September 2008. Perhaps one reason for the elevated copper prices of the last decade or so is China's economic growth and that country's use of copper as collateral. Take this quote, for example: “A Reuters report this week that Chinese companies have used something between 60 and 80 percent of the country's copper imports as collateral to finance projects has unsettled markets, triggered a fresh wave of selling.” “This is an amazing estimate; it would change the perception of 'Dr. Copper' as a gauge of the Chinese economy, as it's not being used for industrial production, but rather as a financing tool for whatever reason,” said Chris Weston, chief market strategist at IG. (http://www.cnbc.com/id/101486303)
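The mapping from observation number to calendar date can be checked with a few lines of base R. This is only a sketch: it uses a placeholder series (called dummy here, an assumed name) with the same start date and monthly frequency as cop, since only the time index matters for the conversion.

```r
# Placeholder monthly series spanning Jan 1980 to Mar 2014, like `cop`.
# The values are zeros; only the time index is used.
n.obs = (2014 - 1980) * 12 + 3
dummy = ts(rep(0, n.obs), start = c(1980, 1), frequency = 12)

# Changepoint locations reported above
cp = c(289, 312, 345, 355, 367)

# time() returns decimal years and cycle() the month number (1 = January)
yr = floor(time(dummy)[cp] + 1e-8)   # small offset guards against rounding
mo = cycle(dummy)[cp]
cp.dates = paste(month.abb[mo], yr)
print(cp.dates)
```

With the real data, the same two calls on cop itself (time(cop)[cp] and cycle(cop)[cp]) should give the dates directly.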

Financing tool!? If that is the case then Dr. Copper is currently no better than a strip mall psychic at market prognostication. But what about the past? In a prior financial epoch, was the 'Red Metal' the 'Jimmy the Greek' of US equities? A side-by-side comparison with US stocks is in order. Again, we can quickly get our data from quandl.com…

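# download the S&P 500 series (monthly, column 6) and take logs from 1995 onward
sandp500 = Quandl("YAHOO/INDEX_SPY", type = "ts", collapse = "monthly", column = 6)
SandP = log(window(sandp500, start = c(1995, 1), end = c(2014, 3)))
plot(SandP)

[Plot: log S&P 500, January 1995 to March 2014]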

# match copper time series to S&P 500
cop2 = log(window(cop, start = c(1995, 1), end = c(2014, 3)))

Now, let's compare the charts together.

compare.ts = cbind(SandP, cop2)
plot(compare.ts)

[Plot: log S&P 500 and log copper price, 1995 to 2014]

Taking a look at this from '95 to '05, can anyone see anything here that would lead us to conclude that copper is a leading indicator? Maybe from 2003 until 2011 there is something. So, let's cut this down into that subset.

sp.sub = window(SandP, start = c(2003, 1), end = c(2011, 12))
cop.sub = window(cop2, start = c(2003, 1), end = c(2011, 12))
compare2.ts = cbind(sp.sub, cop.sub)
plot(compare2.ts)

[Plot: log S&P 500 and log copper price, 2003 to 2011]

Perhaps we are on to something here, a period of time where copper truly was a leading indicator.

In previous posts, I looked at vector autoregression and Granger causality with differenced data to achieve stationarity. However, here I will apply VAR using “levels”, with the techniques spelled out in these two blog posts:

We first determine the maximum order of integration of the two series.

require(forecast)
## Loading required package: forecast
## This is forecast 5.0
# number of differences needed for stationarity according to the KPSS test
ndiffs(sp.sub, alpha = 0.05, test = "kpss")
## [1] 1
ndiffs(cop.sub, alpha = 0.05, test = "kpss")
## [1] 1

Second, run a VAR lag selection to determine the appropriate number of lags. This is done on the levels, not the differences.

require(vars)
## Loading required package: vars
## Loading required package: MASS
## Loading required package: strucchange
## Loading required package: sandwich
## Loading required package: urca
## Loading required package: lmtest
# compare lag orders 1 to 24, with both a constant and a trend in the model
VARselect(compare2.ts, lag.max = 24, type = "both")
## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      2      2      2      2 
## $criteria
##                 1          2          3          4          5          6
## AIC(n) -1.096e+01 -1.126e+01 -1.122e+01 -1.123e+01 -1.119e+01 -1.111e+01
## HQ(n)  -1.086e+01 -1.112e+01 -1.104e+01 -1.100e+01 -1.091e+01 -1.079e+01
## SC(n)  -1.073e+01 -1.092e+01 -1.076e+01 -1.066e+01 -1.049e+01 -1.030e+01
## FPE(n)  1.743e-05  1.285e-05  1.338e-05  1.324e-05  1.390e-05  1.502e-05
##                 7          8          9         10         11         12
## AIC(n) -1.106e+01 -1.101e+01 -1.100e+01 -1.096e+01 -1.089e+01 -1.087e+01
## HQ(n)  -1.069e+01 -1.060e+01 -1.053e+01 -1.045e+01 -1.033e+01 -1.027e+01
## SC(n)  -1.014e+01 -9.972e+00 -9.842e+00 -9.692e+00 -9.500e+00 -9.369e+00
## FPE(n)  1.582e-05  1.669e-05  1.703e-05  1.774e-05  1.928e-05  1.977e-05
##                13         14         15         16         17         18
## AIC(n) -1.084e+01 -1.080e+01 -1.071e+01 -1.075e+01 -1.071e+01 -1.071e+01
## HQ(n)  -1.019e+01 -1.010e+01 -9.968e+00 -9.956e+00 -9.871e+00 -9.821e+00
## SC(n)  -9.218e+00 -9.061e+00 -8.861e+00 -8.779e+00 -8.625e+00 -8.506e+00
## FPE(n)  2.070e-05  2.184e-05  2.414e-05  2.373e-05  2.516e-05  2.582e-05
##                19         20         21         22         23         24
## AIC(n) -1.068e+01 -1.073e+01 -1.071e+01 -1.070e+01 -1.063e+01 -1.058e+01
## HQ(n)  -9.746e+00 -9.756e+00 -9.685e+00 -9.631e+00 -9.516e+00 -9.418e+00
## SC(n)  -8.362e+00 -8.303e+00 -8.162e+00 -8.039e+00 -7.854e+00 -7.687e+00
## FPE(n)  2.728e-05  2.656e-05  2.817e-05  2.947e-05  3.298e-05  3.647e-05
# a lag of 2 is optimal for the VAR (selected by all four criteria)

This is where it gets tricky. To build the linear models I need to create the lagged variables. The maximum number of lags equals the lag order chosen by the VAR selection plus the maximum order of integration, so the magic number here is 2 + 1 = 3.

# lags 0 through 3 of each series; column 1 is the unlagged series
sp = zoo(sp.sub)
lag.sp = lag(sp, -(0:3), na.pad = TRUE)
copper = zoo(cop.sub)
lag.copper = lag(copper, -(0:3), na.pad = TRUE)
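To see what such a lagged object looks like, here is a minimal sketch on a toy series. It uses base R's embed() rather than zoo's lag(), so it only illustrates the column layout and is not the code used in this post. One difference to keep in mind: embed() drops the incomplete rows at the start, whereas na.pad = TRUE keeps them filled with NA.

```r
# Toy series standing in for a price series (values are arbitrary)
x = 1:6

# embed(x, 4) puts x_t in column 1 and x_{t-1}, x_{t-2}, x_{t-3} in columns 2 to 4
lags = embed(x, 4)
colnames(lags) = paste0("lag", 0:3)
print(lags)
```

Columns 2 to 4 here play the same role as lag.sp[, 2:4] and lag.copper[, 2:4] in the regressions.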

Here, the two linear models are built. Note that the index is used to account for the trend. You also need to sequence the variables the same in each lm, so that the Wald test makes sense.

lm1 = lm(sp ~ lag.sp[, 2:4] + lag.copper[, 2:4] + index(sp))
lm2 = lm(copper ~ lag.sp[, 2:4] + lag.copper[, 2:4] + index(sp))

With the linear models built, it is time to concoct the Wald tests and examine the chi-squared p-values.
Again, this is tricky, but follow along… We are testing the significance of the lagged coefficients of copper on the S&P and vice versa. However, the test does not include the one extra lag beyond the optimal VAR lag (in this case 2): only the first two lags are tested, even though the model was built with 3 lags!

require(aod)
## Loading required package: aod
## Warning: package 'aod' was built under R version 3.0.3
# do lags 1 and 2 of copper (coefficients 5 and 6) help explain the S&P?
wald.test(b = coef(lm1), Sigma = vcov(lm1), Terms = c(5:6), df = 2)
## Wald test:
## ----------
## Chi-squared test:
## X2 = 9.7, df = 2, P(> X2) = 0.0079
## F test:
## W = 4.8, df1 = 2, df2 = 2, P(> W) = 0.17
# do lags 1 and 2 of the S&P (coefficients 2 and 3) help explain copper?
wald.test(b = coef(lm2), Sigma = vcov(lm2), Terms = c(2:3), df = 2)
## Wald test:
## ----------
## Chi-squared test:
## X2 = 13.2, df = 2, P(> X2) = 0.0014
## F test:
## W = 6.6, df1 = 2, df2 = 2, P(> W) = 0.13
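For readers who want to see what the chi-squared line is doing, here is a minimal base R sketch of the same kind of test on simulated data. Everything in it is an assumption for illustration: the series x and y are made up (y is built to depend on lagged x), and the statistic is computed by hand as b' V^-1 b on the selected coefficients, which is my understanding of what wald.test reports for this case.

```r
set.seed(42)
n = 200
x = cumsum(rnorm(n))          # simulated random walk (stand-in for one series)
y = numeric(n)
for (i in 2:n) {
  # y depends on its own past and on lagged x, so x should "Granger-cause" y
  y[i] = 0.8 * y[i - 1] + 0.5 * x[i - 1] + rnorm(1)
}

p = 2                         # lag order from the VAR selection
d.max = 1                     # maximum order of integration
k = p + d.max                 # 3 lags go into the regression

Y = embed(y, k + 1)           # columns: y_t, y_{t-1}, y_{t-2}, y_{t-3}
X = embed(x, k + 1)
trend = seq_len(nrow(Y))

fit = lm(Y[, 1] ~ Y[, 2:4] + X[, 2:4] + trend)

# Coefficient order: intercept, 3 lags of y, 3 lags of x, trend.
# Only the first p lags of x (positions 5 and 6) are tested.
idx = 5:6
b = coef(fit)[idx]
V = vcov(fit)[idx, idx]
W = as.numeric(t(b) %*% solve(V) %*% b)
p.value = pchisq(W, df = length(idx), lower.tail = FALSE)
print(c(W = W, p.value = p.value))
```

A small p-value here says that the first two lags of x jointly help predict y. The discussion in this post relies only on the chi-squared p-values from the output above, not on the F test lines.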

Given the significant p-values for both tests, what can we conclude? With Granger causality, when the tests are significant 'both ways', it is believed that some important variable is missing and causality is rejected. If we accept that as the case here, then the prognostic value of copper, even during this limited time frame, is 'Busted'. However, before we throw the proverbial baby out with the bath water, it is important to discuss a limitation of this work. This analysis was done with monthly data and, as a result, may not be sensitive enough to capture a signal of causality. All things considered, I have to say that the Dr. Copper myth is not totally Busted for the time frame we examined. It is certainly not Confirmed, so at best it is Plausible.


Francis Smart

Rambling rogue researcher Francis Smart is a PhD student of econometrics and psychometrics with a focus on simulation methods. He has degrees in Economics (BS) and Applied Economics (MA) from Montana State University and is studying Measurement and Quantitative Methods at Michigan State University. Currently living in Mozambique, he has happily traded reliable internet for year round sun.

Francis' blog can be found at: http://www.EconometricsbySimulation.com/


Investigating the relationship between gold and bitcoin prices with R.

Image: Reine by Ennio Pozzetti

In this post I will explore some of the movements in markets in recent years. These movements have caught many by surprise, resulting in some people unexpectedly striking it rich while others have lost a great deal. I am no financial advisor, nor do I have a background in financial analysis, so please take everything with a grain of salt. If anything is true about financial markets, it is that they are inherently unpredictable.

I will investigate the relationship between the price of gold and the price of bitcoin, with two competing hypotheses. One hypothesis is that both goods represent investments that people seek because they are “safe” and “risk minimal”. The mantras “gold always has value” and “bit money does not rely upon government support” both seem to imply this. If this is true then both markets will move together. When the total economy seems uncertain, both will gain in price. When the economy does well, both will lose value as investors shift from safe investments to investments that provide a higher expected return. An alternative hypothesis investigated here is that they are seen as competing investments. Thus when the price of gold goes down, investors will move into bitcoin, which will drive the price of bitcoin up. Likewise, if the price of bitcoin goes down, investors will shift to gold, which will drive the price of gold up.
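One way to make the two hypotheses concrete: if both assets are treated as safe havens, their returns should be positively correlated; if they compete, the correlation should be negative. Below is a minimal sketch in R. It uses simulated prices, not real gold or bitcoin data, and every name and number in it is a placeholder. It works with log returns rather than price levels so that periods with very different price levels carry comparable weight.

```r
set.seed(1)
n = 250

# Simulated price paths (placeholders, generated independently of each other)
gold = 1300 * exp(cumsum(rnorm(n, sd = 0.01)))
btc  = 500  * exp(cumsum(rnorm(n, sd = 0.04)))

# Log returns: period-to-period relative changes
r.gold = diff(log(gold))
r.btc  = diff(log(btc))

# Pearson correlation with a test of the null hypothesis of zero correlation
ct = cor.test(r.gold, r.btc)
print(ct$estimate)
print(ct$p.value)
```

A clearly positive estimate would favor the first hypothesis and a clearly negative one the second; a large p-value means the data cannot distinguish either from no relationship at all.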
Continue reading…

  • Christian

    Sorry, I didn’t expect my reply to lose the formatting. In short, here are the plots: http://tinypic.com/r/2jbwr2w/8 and http://tinypic.com/r/2h38m07/8. Here is the code with output: http://pastebin.com/8mPLHvv4

  • JorgeStolfi

    Dear Francis,
    (1) It is not clear whether the “price variations” are in percent or in USD. In the second case you will have the same problem as in the first plot: data before Apr/2013 will be swamped by the data after that.

    (2) Your interpretation of p seems reversed; I think there is a 10 in 11 chance of finding a correlation of 0.3% or greater between two variables that are in fact uncorrelated.

    (3) The BTC price history after Apr/2013 was defined entirely by the opening of the Chinese market (which is almost entirely day trading, not long-term) and the Chinese government restrictions on bitcoin trade. Those events are not related to the perceived value of BTC as a safe store of value, so any correlation with the gold prices is bound to be accidental.

Eran Raviv

Eran holds a BA in Economics from Ben-Gurion University, two MSc degrees: in Applied Statistics from Tel-Aviv University and in Quantitative Finance from Erasmus University, and a Ph.D in Econometrics from Erasmus University. His research interests focus on applied forecasting, dimension reduction, shrinkage techniques and data mining. Eran currently holds a Quantitative Analyst position with the Economics and Financial Markets team at the pension fund APG Asset-Management in Amsterdam, Netherlands.

Eran's blog can be found at: http://eranraviv.com/category/blog


Using R to model the classic 60/40 investing rule

Image: Treelife by Timothy Poulton

A long-standing paradigm among savers and investors is to favor a mixture of 40% bonds and 60% equities. The simple rationale is that stocks will provide greater returns while bonds will serve as a diversifier when equities fall. If you are saving for your pension, you have probably heard this story before, but do you believe it?

At least in part, this makes sense. Stocks are more volatile and thus should yield more as compensation. Regarding diversification, we can take a stab at it and try to model the correlation between stocks and bonds, but for now let’s assume it holds that bonds will ‘defend’ us during a crisis. Today we zoom in on the pain this 60/40 mixture can cause you over the years, and compare it to other alternatives. We use numbers from the last two decades to show that you may want to reconsider this common paradigm.
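As a rough illustration of how the “pain” of a 60/40 mix can be measured, here is a minimal R sketch that blends stock and bond returns and computes the maximum drawdown. The yearly returns are made-up numbers, not market data, and the fixed weights assume the portfolio is rebalanced back to 60/40 every period; this is not necessarily the calculation used in the full post.

```r
# Hypothetical yearly returns (placeholders, not real data)
stock = c(0.10, -0.30, 0.20, 0.05)
bond  = c(0.03,  0.08, 0.02, 0.04)

# 60/40 weights, rebalanced every period
w = c(stocks = 0.6, bonds = 0.4)
port = w[["stocks"]] * stock + w[["bonds"]] * bond

# Growth of 1 unit invested, and the drop from the running peak
wealth = cumprod(1 + port)
drawdown = 1 - wealth / cummax(wealth)

print(round(port, 3))
print(round(max(drawdown), 3))
```

Swapping in other weight vectors for w makes it easy to compare the worst drawdown of alternative mixes on the same return series.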

Continue reading…

  • Prestone Adie

    Please insert the following lines after the library("quandl") line:
    library("ggplot2") #ggplot
    library("xts") #xts
    library("quantmod") #yearlyReturn

    The functions shown after the library calls require these libraries to be loaded.

  • Prestone Adie

    Register on the Quandl website and, under your account tab, look for the authentication token and insert it as indicated in Eran’s code.
    Furthermore, you need to install and load the following libraries: ggplot2, xts, quantmod.
    I hope this helps.

Tammer Kamel

Foot soldier for the open data movement; founder of Quandl.


Quandl Open Data

Synopsis: This is, we think, the best source of historical stock price data on the internet because it is accurate, complete and 100% open.

We added a new source to the site today called Quandl Open Data. We launched it with historical daily stock price data for 500 of the largest US stocks, but we hope to get that number to 4000 this month. This new “source” on Quandl is significant for four reasons.

1 – The Data is Better

This price data is better than anything we have had before (and anything we know of elsewhere on the internet) because it includes dividends, splits and adjustments in one dataset. We calculate adjustments using the CRSP methodology, but the raw dividend and split information empowers any other adjustment methodology you may wish to employ. We update the data as quickly as we can each day.

2 – The Data is Original

Most data on Quandl is sourced from elsewhere on the internet (which we do with zealous transparency). This new data source is different because it is “original”; the data is manufactured by us and Quandl users. The definitive version of the data actually lives on Quandl and not elsewhere. (This is a first for us.)

3 – The Data is Open

Note our terms of use for this data:

You may copy, distribute, disseminate or include the data in other products for commercial and/or noncommercial purposes. There are no restrictions whatsoever on the use of this data.

Thus we now provide the internet’s first and only totally unencumbered source of historical stock price data.

4 – It’s a Wiki

Quandl Open Data has been assigned the source code “WIKI” for good reason: this data is and will be maintained by our community. We are very excited about this project which is currently being spearheaded by us and a small set of Quandl users. We are inspired of course by Wikipedia: We want nothing less than to permanently place as much financial information as possible into the public domain with absolutely no restrictions on its use.

This project is just getting started. Our thanks to everyone who helped by contributing backfill and helping to clean the data. There is much more work to be done on this front. The next step is to expand coverage to more stocks. We will eventually open this process up to the entire Quandl community. In the interim, it only takes an email to me to get involved.

Sean Crawford

Empowering people by improving the accessibility of data.


Quandl R tutorial now on DataCamp


There is now an excellent tutorial on using Quandl via R at DataCamp.com. DataCamp offers really well designed (and free) in-browser tutorials for learning more effective data analysis. Their focus thus far has been on R.

The free interactive Quandl course introduces you to the main functionality in the Quandl R package. In two short chapters you’ll learn how to search through Quandl’s data sets, how to access them, and how you can easily manipulate them for your own purposes. All exercises are based on real-life examples (e.g. Bitcoin exchange rates).

DataCamp is quite impressive; definitely worth a visit!

Quandl Team

Striving to make numerical data easy to find and easy to use.


Superset URLs Have Changed

If you use supersets on Quandl, please note that the URLs for your supersets have changed. Instead of existing under www.quandl.com/USER_XX, they now exist under a new source, which is your exact username on Quandl.

Taking myself as an example, my superset used to exist at:


but it now exists at


If you are accessing your supersets by API, you need to be aware of this change.

If you have any questions or problems, just drop us a line at connect@quandl.com

  • Adam Sussman

    Can you create a superset using the API?
