Here at Quandl we get a steady stream of inbound inquiries from companies keen to monetize their data assets by selling to our audience of investment professionals. These data suppliers are not just peddling stock prices or futures or FX data. They are startups and companies born in an era of data ubiquity. They may have satellite data, smartphone or sensor data, logistics data, or business operations data. This is now known as “alternative data” in industry parlance. One belief unites them all: that their data holds value for capital markets. Unfortunately, this is not always the case: nothing in life is ever easy, and extracting actionable intelligence from alternative data is a tricky problem.
Broadly speaking, there are ten ways the path to monetizing data can go wrong. Potential data suppliers would be wise to keep these in mind as they look to sell their data to capital markets:
- Mistaking contemporaneous correlation for actionable prediction
- Lookahead bias
- Poor choice of predicted variable
- The fallacy of data mining
- Statistical accident
- Information already known to the market
- Low upper bound to capital deployment
- Widely diffused information content
- Limited history for robust out-of-sample testing
- Audience mismatch
In addition to these traps, suppliers should note that it takes effort and expertise to transform raw data into an asset of interest to the investment community. Data in its raw form is unlikely to hold value for investors, unless the buyer is one of the few shops that has staffed its own data science team.
This article surfaces the top ten pitfalls of data monetization within the context of capturing, assessing, and productizing a dataset for the finance industry.
Capturing the Data
Let’s begin with the basics.
For data to be useful, it needs to be clean, accurate, consistent, timely, well-documented and methodologically sound. These are table stakes; without these attributes, no dataset can live up to its full potential for value. Many of these attributes can be subsumed under the label data quality, which we have talked about before.
Another important basic feature is usability. The data needs to be primed for financial analysis by Wall Street quants. Massive blobs of primary data will not suffice; they don’t fit into analyst workflows and do not support analyst decisions. Typically, such blobs need to be reduced into smaller sets of clean time series indexes or indicators – without compromising their integrity or signalling power – before analysts can extract any value from them.
A final basic requirement is comprehensiveness. Data needs to be complete, or there is a risk of selection bias. Small samples also increase the likelihood of mistaking random variations for genuine signal.
Much of the data offered for sale fails to meet these criteria. But with the proper domain expertise one can create high-quality, well-documented, accessible and complete data, empowering the analyst to begin the next stage of their process: hunting for signal.
Assessing the Data
Alternative data is valuable to Wall Street only insofar as it predicts the behavior of some market variable better or faster than anything else.
The key word here is “predict”. The alternative dataset must offer a signal that is a genuine leading indicator. Contemporaneous movement is not enough. The signal must be available sufficiently ahead of market action so that one can trade, and profit, from it. Mistaking contemporaneous correlation for an actionable prediction is a common pitfall in alternative data analysis. A related pitfall is lookahead bias: building a historical indicator that incorporates information not actually available at the time of use.
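To make the distinction concrete, here is a minimal sketch using synthetic data and a hypothetical signal: a series can correlate strongly with returns in the same period yet carry no information about the next one. Shifting the signal back one period before measuring correlation is the honest, lookahead-free test.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, hypothetical example: a "signal" that moves WITH returns
# but does not lead them.
returns = rng.normal(0.0, 1.0, 500)
signal = returns + rng.normal(0.0, 0.5, 500)  # contemporaneous only

# Contemporaneous correlation looks impressive...
contemporaneous = np.corrcoef(signal, returns)[0, 1]

# ...but the actionable question is whether yesterday's signal
# predicts today's return. Shifting removes any lookahead bias.
predictive = np.corrcoef(signal[:-1], returns[1:])[0, 1]

print(f"contemporaneous r = {contemporaneous:.2f}")
print(f"predictive r      = {predictive:.2f}")
```

The contemporaneous correlation comes out high while the lagged one hovers near zero: exactly the gap between a pretty chart and a tradeable signal.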
Another, subtler, pitfall is the choice of predicted variable. The question is not whether one can predict the value of an indicator in absolute terms; it is whether one can predict that value better than the market consensus. Traders only make money when they are both contrarian and right; hence calibrating against consensus estimates is vital.
A fourth trap is perhaps the easiest to fall into: data-mining fallacy. Given enough candidate datasets to test, sooner or later one of them will exhibit the desired predictive correlation. Careful and thoughtful statistical analysis is required to rule out spurious results of this nature.
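The scale of the problem is easy to demonstrate with a toy simulation (all data here is random noise, not a real dataset): screen enough pure-noise candidates against the same return series and a handful will clear a conventional significance bar by chance alone.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_candidates = 250, 200   # ~1 year of daily data, 200 candidate datasets

returns = rng.normal(0.0, 1.0, n_obs)

# Pure-noise "alternative datasets": none has any real link to returns.
candidates = rng.normal(0.0, 1.0, (n_candidates, n_obs))
corrs = np.array([np.corrcoef(c, returns)[0, 1] for c in candidates])

# Rough two-sided 5% significance threshold for a correlation coefficient.
threshold = 2.0 / np.sqrt(n_obs)
false_hits = int(np.sum(np.abs(corrs) > threshold))

print(f"'significant' correlations found by pure chance: {false_hits}/{n_candidates}")
```

Roughly 5% of the noise series "pass", which is why any predictive claim must be discounted by the number of datasets examined before it was found.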
Spurious signals can arise in many ways, not just from data-mining. Apart from statistical robustness methods, a good way to inoculate against false signals is to always ask ‘meaningful’ questions of the data. These are questions that are economically sound and intuitively reasonable. Correlations with no economic rationale are more likely to be statistical accidents, while correlations that make economic sense are more likely to be genuine. In other words, if it seems too good to be true, it probably is.
Statistical robustness can of course be quantified. High R-squareds, large t-statistics, small p-values: these are old-fashioned measures but no less valuable for that; they are what gets data sold.
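These measures fall out of a simple regression. The sketch below fits a one-variable OLS of returns on a hypothetical indicator (synthetic data, effect size invented for illustration) and reports the R-squared and slope t-statistic.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 250

# Hypothetical indicator with a modest genuine relationship to returns.
indicator = rng.normal(0.0, 1.0, n)
returns = 0.3 * indicator + rng.normal(0.0, 1.0, n)

# One-variable OLS: returns ~ alpha + beta * indicator
x = np.column_stack([np.ones(n), indicator])
beta, *_ = np.linalg.lstsq(x, returns, rcond=None)
resid = returns - x @ beta

# R-squared: share of return variance explained by the indicator.
r_squared = 1.0 - resid.var() / returns.var()

# t-statistic of the slope: beta divided by its standard error.
s_xx = ((indicator - indicator.mean()) ** 2).sum()
se_beta = np.sqrt((resid @ resid) / (n - 2) / s_xx)
t_stat = beta[1] / se_beta

print(f"R^2 = {r_squared:.3f}, t-stat = {t_stat:.2f}")
```

Note how modest the R-squared is even for a genuinely predictive indicator; in financial data, single-digit-percentage R-squareds with solid t-statistics are often what a real signal looks like.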
Most alternative datasets fail to clear one or more of the above hurdles; they do not hold statistically significant predictive information. This should not come as a surprise. Markets are, by and large, very efficient at incorporating and adjusting to relevant information; hence datasets that can beat the market through “better information” ought to be very rare – and they are.
The few datasets that do hold information are still not guaranteed to be saleable, for reasons the next section will make clear.
The Proof is in the Pudding
Predictive ability is a necessary condition for an alternative dataset to be valuable, but it is not sufficient. For Wall Street to buy alternative data, traders need to be able to make money from it that they could not make otherwise. Practically speaking, this means building an uncorrelated, systematic investment strategy around the alternative data.
A flaw that often manifests itself at this stage is complexity. It’s all too easy to take a mediocre trading strategy and ‘improve’ it, by adding trading rules, discretionary parameters, or degrees of freedom. But this is a trap; it leads to curve-fitting. An overly complex trading strategy may generate spurious profits not actually driven by the data. In any case, hedge funds do not need input on trade design or strategy construction; they are experts at those facets of the investment cycle. Hedge funds want hard data that they can interpret for themselves; the role of the data provider should be confined to providing this data.
The ideal backtest strategy should implement the simplest trading rule possible. If, even after doing this, there is a positive return, then that strongly suggests that the alternative dataset holds informational value for the market.
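Such a rule can be a one-liner. The sketch below (synthetic data; the indicator's lead on returns is hypothetical and exaggerated for illustration) goes long when yesterday's indicator was positive and short when it was negative, with no tunable parameters at all.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500

# Hypothetical indicator that genuinely leads returns by one period
# (effect size exaggerated so the example is legible).
indicator = rng.normal(0.0, 1.0, n)
noise = rng.normal(0.0, 1.0, n)
returns = np.empty(n)
returns[0] = noise[0]
returns[1:] = 0.3 * indicator[:-1] + noise[1:]

# Simplest possible rule: long when yesterday's indicator is positive,
# short when negative. No parameters to tune, nothing to curve-fit.
positions = np.sign(indicator[:-1])
strategy_returns = positions * returns[1:]

mean_ret = strategy_returns.mean()
t_stat = mean_ret / (strategy_returns.std(ddof=1) / np.sqrt(len(strategy_returns)))
print(f"mean return per period = {mean_ret:.3f}, t-stat = {t_stat:.2f}")
```

Because the rule has zero degrees of freedom, a significantly positive mean return here can only come from the data itself, not from fitting.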
Even if profitable, a dataset’s value may be limited if its information content is already known to the market. For example, an alternative dataset may indicate a strategy of systematically buying small cap stocks and selling large caps. This is not a new result; the market’s small-cap bias has been known for decades. In this case, the alternative dataset is not adding any new ammunition to the trader’s arsenal, and as such, will not be considered particularly valuable.
Unique, uncorrelated, predictive and “new” information on the market may still have a low value if there is an upper bound to how much capital can be deployed against this information. For example, a dataset that predicts, with perfect accuracy, the behavior of a single penny stock is of limited utility; it won’t move the needle for a multi-billion dollar hedge fund. The best alternative datasets hold information about large and liquid securities, which can absorb investments from major institutional players. (Of course, large and liquid securities also tend to be the most efficient part of the market, which makes the task of finding new predictive information that much harder.)
Closely linked to capital capacity is width of diffusion. Alternative data that is easily available or widely diffused is unlikely to be valuable: any alpha that it may have once held is almost certainly fully dissipated.
By the same token, the most valuable alternative datasets are not easily replicable; they rely on new or proprietary information sources, enhanced by algorithmic IP and deep domain expertise. This creates barriers to entry and prevents the alpha from being dissipated too soon. Datasets that can be easily copied are datasets with limited value.
Good strategies perform consistently both in-sample and out-of-sample. However, many alternative datasets are new and have limited history for robust, out-of-sample testing. Proxy strategies can sometimes mitigate this flaw, but at the cost of adding complexity. Fortunately, most hedge funds are sophisticated enough to recognize that one cannot simultaneously have the newest, most innovative alternative data AND a long history to back test. Hence this flaw is not necessarily a show-stopper.
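The in-sample/out-of-sample gap is easy to illustrate with noise (everything below is simulated; the 50 "rules" are arbitrary): select the best-performing rule on the first half of a random return series, and its edge evaporates on the second half.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
returns = rng.normal(0.0, 1.0, n)            # a pure-noise market
rules = rng.choice([-1, 1], size=(50, n))    # 50 arbitrary long/short rules

in_sample = slice(0, n // 2)
out_sample = slice(n // 2, n)

# Pick the rule that performed best in-sample...
in_perf = (rules[:, in_sample] * returns[in_sample]).mean(axis=1)
best = int(np.argmax(in_perf))

# ...then check it honestly on data it has never seen.
out_perf = (rules[best, out_sample] * returns[out_sample]).mean()

print(f"best rule, in-sample mean return:  {in_perf[best]:.3f}")
print(f"same rule, out-of-sample:          {out_perf:.3f}")
```

Selection alone manufactures an apparent in-sample edge; only the out-of-sample period reveals there was never anything there.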
Finally, even the very best alternative dataset can fail to command value if it suffers from audience mismatch. Predictive information on specific stocks is of no use to global macro funds; conversely, predictive information on macroeconomic variables is of no use to equity long/short funds. Data needs to be matched to the data user, or both vendor and customer will end up disappointed.
Productizing and Monetizing the Data
The potential flaws that can bedevil an alternative dataset are intimidating. Fortunately, there are ways to alleviate these flaws. Experienced data curators can improve quality, usability and organization. Expert quants can examine any dataset, evaluate its predictive ability, backtest trading strategies using the data, measure information content, novelty, capital capacity, replicability, robustness and much more. And well-connected networks can ensure that the right data reaches the right audience.
Most potential alternative data vendors lack the expertise, technology and distribution networks to carry out these actions by themselves. That’s where Quandl comes in.
Quandl is the leading marketplace for alternative data in the world. We work closely with vendors – established, startup, entrepreneurial or latent – to productize their data.
- Our data curation team transforms raw data into usable information. They acquire, organize, structure and document raw data. They also manage data quality, integrity and comprehensiveness.
- Our data science team searches for predictive signals in data. They reduce complex datasets into simpler, cleaner indicators. They evaluate these indicators against both micro and macro factors, for predictive power.
- Our quant finance team builds trading strategies around predictive data. They search for scalable, uncorrelated excess returns while being rigorous about eliminating biases in all their forms.
- Our distribution team identifies suitable hedge funds and asset managers who might be interested in consuming the data.
- Our delivery team builds and maintains the infrastructure that makes the data available to customers quickly, reliably and conveniently.
The presence of different types of expertise at different stages in the pipeline ensures that our alternative datasets tick all the necessary boxes: quality, predictive ability and profit-generation ability. This enables them to command premium prices from our Wall Street audience.
A similar version of this article was published on LinkedIn, by Quandl co-founder and chief data officer Abraham Thomas.