One thing we are often asked here at Quandl is how to go from raw data to a salable data product. We’re the first to admit that the process of data monetization is complex and tedious, but it is ultimately worthwhile. Whether you’re a start-up or a publicly traded corporation, the time and upfront investment required to transform your existing data into a market-ready product can pay handsome dividends in the long run.
Our specialty lies in developing data products and marketing them to a Wall Street audience. By this, we mean any institutional investor who is interested in consuming data to make trading decisions. These buyers are a lucrative but sophisticated segment of the data-buying market, and they require particularly high-quality data products to augment their trading strategies. This is the first post in a five-part series that will guide you through the process of data monetization and getting your product into the hands of these financial professionals.
In this post we will cover:
- The importance of data hygiene
- Identifying whether your data has predictive power
For professional investors, data is always a means, never an end. In the specific case of institutional investors, the end is making better trading and investment decisions. Your data is valuable insofar as it can help Wall Street clients (traders, investors, portfolio managers) make better decisions about stocks, bonds, currencies, commodities, economic variables and more.
But most raw data assets, no matter how meticulously the data was collected, are far from offering direct insight into market decisions. To offer that value, they must be “productized”: converted from their original raw form into a new form that has clear and current value for the investment community.
In the absence of high-quality data, even the best analysis fails. Hence a basic requirement for salable data is that it fulfills certain quality criteria. Consider the following criteria:
Accuracy: The data must be as accurate as possible. Gaps, spikes, errors and outliers in the data all make it less trustworthy, and hence less valuable.
Consistency: The data must be methodologically consistent and uniform both across space (i.e., across companies, geographies, products or whatever subject is covered) and across time (i.e., history). A dataset whose underlying calculation methodology changes from month to month, or from company to company, is hard to work with.
Documentation: All fields and indicators need to be precisely defined. Clients do not know your data as well as you do; educate them. Precise definitions help testers correlate your data against the right “target” variables.
Ticker Mappings: Many datasets contain information about public companies. Ideally, this information should be labeled with the tickers of these companies, to allow easy analysis across multiple datasets. The ticker is a unique identifier (e.g., AAPL) and is much easier to map across datasets than a company name (e.g., Apple, Apple Corp., Apple Inc., Apple Computer). Note that CUSIP and ISIN are acceptable in place of tickers.
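As a rough sketch, name-to-ticker normalization can start with a simple alias table. The aliases and tickers below are purely illustrative; real datasets need far larger mappings plus handling for symbol changes over time:

```python
# Minimal sketch of name-to-ticker normalization via a hand-built alias
# table. The aliases here are illustrative only.
ALIASES = {
    "apple": "AAPL",
    "apple inc": "AAPL",
    "apple inc.": "AAPL",
    "apple corp.": "AAPL",
    "apple computer": "AAPL",
}

def to_ticker(company_name):
    """Return the ticker for a known company name, or None if unmapped."""
    return ALIASES.get(company_name.strip().lower())

print(to_ticker("Apple Inc."))   # AAPL
print(to_ticker("Banana Corp"))  # None
```

Because tickers themselves can change or be reused over time, stable identifiers such as CUSIPs or ISINs are sometimes preferred for the underlying mapping.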
Historical Publication Timestamps: When testing the historical performance of a dataset, investors need to know exactly when each data point became available, to avoid look-ahead bias. For an investor to use weekly sales data, for example, she needs to know when each week’s number becomes available — which may be before, during or after the week in question. This is one of the most common reasons that prospective data publishers fall short.
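To make the timestamp requirement concrete, here is a minimal sketch of point-in-time filtering. The row layout and figures are invented for illustration:

```python
# Sketch: each observation carries both the period it describes and the
# timestamp at which it was actually published. A backtest "as of" a given
# date may only use rows whose publication timestamp has already passed.
from datetime import date

# (period_end, published_on, value): illustrative weekly sales figures,
# each published three days after the week ends.
rows = [
    (date(2020, 1, 5),  date(2020, 1, 8),  100.0),
    (date(2020, 1, 12), date(2020, 1, 15), 110.0),
    (date(2020, 1, 19), date(2020, 1, 22), 105.0),
]

def available_as_of(rows, as_of):
    """Rows an investor could actually have seen on `as_of`."""
    return [r for r in rows if r[1] <= as_of]

# On Jan 16, only the first two weeks have been published.
print(len(available_as_of(rows, date(2020, 1, 16))))  # 2
```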
Assuming an acceptable level of data quality, the next step is to determine whether the data has “predictive power”. Intuitively, we want to know if the data holds some pattern or signal that can be used to predict the subsequent movements of an asset price or similar financial indicator. Any such prediction must then hold up to a variety of statistical tests.
Some of the factors that determine a dataset’s predictive power are described below:
Correlation: The basic indicator of predictive power is correlation. If a dataset exhibits a relationship with a stock price, economic indicator or some other market variable, then it has the potential to be valuable. The correlation must be statistically significant, as established through rigorous analysis and testing.
But beware of spurious correlations. For example, new car insurance policies are correlated with new car sales; this is an obvious, intuitive relationship, which justifies using the former to predict the latter. Rainfall in Mongolia might be correlated with IBM sales; without an intuitive explanation, however, such a correlation will likely be discarded as a statistical coincidence.
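As a first screening step, a correlation coefficient and a rough significance measure can be computed in a few lines. The two series below are fabricated purely for illustration:

```python
# Sketch: plain-Python Pearson correlation between a candidate signal and
# a target series, plus the t-statistic used to judge significance.
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

signal = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up signal values
target = [1.1, 1.9, 3.2, 3.8, 5.1]   # made-up target values

r = pearson_r(signal, target)
n = len(signal)
# t-statistic for H0: true correlation is zero; compare against a
# t-distribution with n - 2 degrees of freedom.
t = r * math.sqrt((n - 2) / (1 - r * r))
```

A real evaluation would use far longer series and a proper statistical package, but the shape of the test is the same.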
Leading indicator: Alternative data is valuable to Wall Street only insofar as it predicts the behavior of some market variable. Contemporaneous correlations are insufficient; the signal must be available sufficiently ahead of market action so that investors can trade and profit from it. Unfortunately, many of the correlations we discover are either coincident or lagging and thus useless for investors.
Length of History: All else being equal, a dataset with a longer history is preferable to one with a shorter history. A long history allows users to statistically test the predictive power of the data. Less than two years of historical data is usually insufficient unless your signal is extremely powerful. Four years is generally considered acceptable; eight years is the gold standard.
Completeness of Coverage: The best datasets cover as much as possible of the relevant “universe”, whether that is an asset class, a product category or some other unit. A dataset with information about stocks should cover all stocks in a defined market. A dataset with information about products should cover as many products as possible. If your dataset predicts the performance of just one retailer, you will only have a handful of customers for it.
No Revisions or Backfilling: Data that is revised after its original publication date is worthless for backtesting. When simulating behavior at a given time, only the data available at that time can be used; revisions and backfilling defeat this purpose. Take consumer transaction datasets, for example. When new consumers are added to the dataset’s “survey base”, their past transactions are included. This changes the historical data and makes rigorous analysis more difficult. If such data must be added, it should be properly version controlled.
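One simple version-control scheme is to store every revision as its own row with its own publication date, then resolve a point-in-time view on demand. The schema and values below are illustrative:

```python
# Sketch: revisions never overwrite earlier values. Each version keeps its
# own publication date, and an as-of query returns the latest version per
# period that had been published by the requested date.
from datetime import date

# (period, published_on, value): the January figure is revised in March.
versions = [
    ("2020-01", date(2020, 2, 5), 100.0),   # initial print
    ("2020-01", date(2020, 3, 5), 97.0),    # revision
    ("2020-02", date(2020, 3, 5), 110.0),
]

def as_of(versions, when):
    """Latest value per period, using only versions published by `when`."""
    view = {}
    for period, published, value in sorted(versions, key=lambda v: v[1]):
        if published <= when:
            view[period] = value
    return view

print(as_of(versions, date(2020, 2, 10)))  # {'2020-01': 100.0}
```

With this layout, a backtester in mid-February sees the initial print of 100.0, while a query today returns the revised 97.0, and both answers are reproducible.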
No Biases: Different types of biases can exist in a dataset. Here are a few examples to avoid:
- Survivorship bias: Historical data that includes only those companies that still exist today, omitting shuttered businesses
- Look-ahead bias: Running a test or simulation using data that was not yet available at the point in time being simulated
- Selection bias: Choosing a subset of your space (list of assets, list of indicators, time period) not at random; the subset may be non-representative of the whole.
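To illustrate guarding against the first of these, the test universe can be built from listing records (delisted names included) rather than from today’s surviving constituents. The tickers and dates below are invented:

```python
# Sketch: avoiding survivorship bias by reconstructing which tickers were
# actually tradable on a given day, including companies delisted since.
from datetime import date

# (ticker, listed_on, delisted_on): None means still listed today.
listings = [
    ("AAA", date(2000, 1, 1), None),               # still listed
    ("BBB", date(2000, 1, 1), date(2010, 6, 30)),  # delisted in 2010
    ("CCC", date(2012, 1, 1), None),               # listed later
]

def universe_on(listings, day):
    """Tickers tradable on `day`, delisted names included."""
    return {t for t, start, end in listings
            if start <= day and (end is None or day <= end)}

print(sorted(universe_on(listings, date(2008, 1, 1))))  # ['AAA', 'BBB']
```

A universe built only from today’s listings would silently drop BBB and overstate historical performance.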
Statistical Significance: Length of history, strength of correlation, width of coverage and intuitive economic priors are all tools to evaluate statistical significance. The predictions made by alternative data are inevitably noisy, but the more we can prove significance for a given dataset, the more trustworthy and valuable that dataset becomes.
If you think your data is clean and predictive, you are well on your way to possessing a very valuable dataset — one coveted by Wall Street. The next step is to deliver the data in a package that is both familiar and convenient for your audience, a topic that we will discuss in Part 2 of the Data Monetization series.