This post is the first in a series called “Alternative Data In Action” (ADIA). Each article in this series will examine a different alternative dataset. We will discuss how the data is sourced and structured and, most importantly, some simple methods for using it to generate investment insight. This ADIA looks at electronic receipts. Data scientists Ray McTaggart and Lilian Lau contributed to this post.
The Dataset: Electronic Receipts in Email
Where and when people spend their money is invaluable insight for everything from measuring the state of the economy as a whole to studying the success of a single product produced by a single company. One conceptually simple way to track spending is to monitor the receipts people get by email.
To that end, we have been partnering with companies that have “lookthrough” to consumer emails. Email providers, from Mail.com at one end of the size spectrum to Google (Alphabet) at the other, have natural visibility into email. Software companies offering email clients or add-ons of some sort often get lookthrough as well, as do the creators of productivity, accounting and note management apps, which often have visibility into email receipts.
Partnerships with firms like these give us access to a continually updated corpus of millions of anonymized emails. Meticulous parsing of this corpus leads to a beautifully granular database of single transactions. Intelligent querying of this database can then offer amazing, near real-time insights on all kinds of things.
Case Study #1: Amazon
One obvious question to ask of this data is: how are sales going (right now!) for a certain company? This dataset can yield answers to such a question, provided the company in question does a meaningful amount of electronic business. So, if it is insights on ExxonMobil you seek, this is not the alternative dataset for you. At the other extreme, a company like Amazon derives a very large fraction of its total revenue from e-commerce. This dataset would very much be the right place to seek insights on such a company.
One approach to tracking Amazon sales with this data is a simple regression model where the independent variable is quarter-over-quarter revenue changes as implied by the data, and the dependent variable is actual quarter-over-quarter revenue changes as reported by Amazon. (Given a large enough history, quarterly year-over-year changes would be a better choice for this model, but, like so much alternative data, history is minimal because the data simply did not exist in the distant past.)
Actual q/q changes are of course available from 10-Q filings. But how do we infer q/q changes from this alternative dataset? The secret is cohort analysis:
1) find the set of consumers who existed in this data during the prior quarter
2) measure their total Amazon spend during that quarter
3) measure the total spend of this same cohort 1 quarter later
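As a rough illustration, the three steps above might be sketched in Python as follows. The tuple layout, the sample receipts and the function name are all hypothetical, and the final step (dividing the later total by the earlier one) is our assumption about how the two totals are combined into an implied change.

```python
from collections import defaultdict

# Hypothetical parsed receipts: (user_id, quarter, merchant, amount_usd).
# These rows are made-up illustration data, not values from the dataset.
receipts = [
    ("u1", "2015Q1", "amazon", 120.0),
    ("u1", "2015Q2", "amazon", 150.0),
    ("u2", "2015Q1", "amazon", 80.0),
    ("u2", "2015Q2", "amazon", 90.0),
    ("u3", "2015Q2", "amazon", 200.0),  # first seen in Q2, so not in the cohort
    ("u4", "2015Q1", "other", 40.0),    # in the Q1 cohort, but no Amazon spend
]


def implied_qoq_change(receipts, merchant, prior_q, next_q):
    """Implied q/q spend change for `merchant`, measured on a fixed cohort."""
    # Step 1: the cohort is every user present in the data during prior_q.
    cohort = {user for user, quarter, _, _ in receipts if quarter == prior_q}

    # Steps 2 and 3: total merchant spend by that same cohort in each quarter.
    spend = defaultdict(float)
    for user, quarter, m, amount in receipts:
        if user in cohort and m == merchant and quarter in (prior_q, next_q):
            spend[quarter] += amount

    if spend[prior_q] == 0:
        raise ValueError("cohort has no spend in the prior quarter")
    # Assumed final step: ratio of the two totals, expressed as a change.
    return spend[next_q] / spend[prior_q] - 1.0


change = implied_qoq_change(receipts, "amazon", "2015Q1", "2015Q2")
print(f"implied q/q change: {change:.1%}")
```

Holding the cohort fixed means that growth in the size of the email panel itself (user u3 here) does not get mistaken for growth in spending.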
Dividing the later total by the earlier one gives the implied q/q change. The result is, in effect, a large-sample, accurate and data-driven consumer spending survey:
|Quarter|Implied Revenue Change (q/q)|Actual Revenue Change (q/q)|
Look what happens when you fit a simple line to this data:
The fit is excellent. But what of Amazon Web Services, the massive revenue stream that this data is ignorant of? Well, it is implicitly (and imperfectly) captured by the (non-zero) intercept and the (non-unity) slope of the regression line.
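For readers who want to try this themselves, here is a minimal sketch of such a line fit using ordinary least squares. The function name and the numbers are made up for illustration (they are not the values behind our chart), and the toy series is constructed to lie exactly on a line so the result is easy to check.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x values are all identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical q/q changes, NOT real Amazon figures.
implied = [-0.10, 0.00, 0.10, 0.20, 0.30]  # from receipt cohorts (independent)
actual = [-0.06, 0.02, 0.10, 0.18, 0.26]   # as reported (dependent)

intercept, slope = fit_line(implied, actual)

# Once the line is fitted, a new implied change gives an estimate of the
# reported change before it is announced.
estimate = intercept + slope * 0.15
print(f"intercept={intercept:.3f} slope={slope:.3f} estimate={estimate:.3f}")
```

In this toy example the slope is below one and the intercept is above zero, which is the kind of adjustment the regression makes for revenue the receipts cannot see.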
Obviously, this is a very simple approach to the problem of inferring Amazon sales before they are announced. We’ve added some improvements in our continuous Amazon update, accounting for seasonal effects, calendar days and more. And analysts following the company will do much better than us here: they can incorporate all kinds of additional information into this model, including guidance and exogenous insights on the AWS line.
But the takeaway here, and this is true for all the alternative data we get excited about, is just how accurate the insights from this data are, even when it is used in the simplest possible way.
Our database allows us to answer questions not just about the magnitude of Amazon’s revenue, but also about the quality of that revenue, and hence the prospects for future growth.
For example, we can look at shopping basket composition. How much of Amazon’s growth comes from existing customers paying more, versus new customer acquisition? If existing customers do pay more, is it because they buy more items, or is it because they pay higher prices for the same number of items? This graph sheds some light on these questions:
(The Y-axis is normalized to 100 for Jan 2014).
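One simple way to frame these questions, sketched below, is to split revenue into three factors (number of customers, items per customer, and average price per item) and index each to 100 in a base month. This decomposition, the field names and the monthly figures are our own illustrative assumptions, not necessarily the series plotted above.

```python
# Hypothetical monthly aggregates: (month, customers, items, revenue_usd).
# Made-up numbers for illustration only.
monthly = [
    ("2014-01", 1000, 2000, 40000.0),
    ("2014-02", 1100, 2310, 48510.0),
    ("2014-03", 1200, 2640, 58080.0),
]


def decompose(rows):
    """Split revenue into customers x items per customer x price per item."""
    factors = {"customers": [], "items_per_customer": [], "price_per_item": []}
    for _, customers, items, revenue in rows:
        factors["customers"].append(float(customers))
        factors["items_per_customer"].append(items / customers)
        factors["price_per_item"].append(revenue / items)
    return factors


def index_to_100(series):
    """Rescale a series so that its first value equals 100."""
    base = series[0]
    return [100.0 * value / base for value in series]


indexed = {name: index_to_100(vals) for name, vals in decompose(monthly).items()}
for name, vals in indexed.items():
    print(name, [round(v, 1) for v in vals])
```

Because the three factors multiply back to total revenue, comparing the indexed lines shows which one is doing most of the work.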
We can even use the dataset to compare the performance of different brands within the Amazon marketplace. Amazon’s transaction volume is large enough to deliver meaningful results for dozens of third-party retailers. For example, here’s a chart comparing the monthly sales growth of speciality headphone manufacturers Sennheiser and Shure:
Case Study #2: Uber
Building useful models for private companies like Uber is tricky, because there are no standardized, audited numbers to regress against. Nonetheless, the wealth of data available in this dataset allows us to make some interesting inferences.
Uber has a well-known city-by-city expansion strategy. Hence we sliced our Uber receipt data into a separate time series for each city. We then used the same cohort analysis technique as above, to examine whether recently added cities are doing as well as older ones.
Uber continues to grow steadily if not spectacularly in its oldest markets. Over the last 2 years, Uber grew revenue by an average of 5% per month in those cities.
This is encouraging for Uber; it suggests that even its very oldest markets are not yet saturated.
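To put that monthly figure in perspective, here is a short worked calculation. It assumes the 5% is a compound (geometric) monthly average, which is our simplifying assumption, and the function name is ours.

```python
def cumulative_multiple(monthly_rate, months):
    """Total growth multiple from compounding a constant monthly rate."""
    return (1.0 + monthly_rate) ** months


# 5% per month, compounded over 24 months.
multiple = cumulative_multiple(0.05, 24)
print(f"revenue multiple over 24 months: {multiple:.2f}x")
```

Under that assumption, 5% a month compounds to revenue a little more than tripling over two years.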
Percentage growth is stronger in newer cities – as expected, given the lower denominator for the return calculation. Uber is averaging 12% per month growth in US cities entered in the last 2 years, as graphed below:
While new cities are growing fast, the absolute numbers are concerning. Old cities like New York and Los Angeles contribute 10x the revenue of even the best-performing new cities like Miami and Pittsburgh.
And that’s unlikely to change, because of inherent size differences: all the low-hanging fruit (large cities with dense network effects) may have already been gathered. This explains why Uber was so keen to capture the Chinese market; China has 25 cities bigger than Los Angeles.
Uber is well-known for an “act first, seek permission later” approach to regulations. What happens when this backfires?
Uber ceased operations in San Antonio in April 2015, after failing to come to terms with the city government. Six months later, a revised agreement between the government and Uber saw the ride-hailing company re-enter that market.
How did Uber’s revenue behave during this exit and re-entry?
What’s fascinating about the re-entry is that Uber’s revenue trajectory made a full recovery all the way to the original path: the temporary suspension seemed to make no difference. There was no evidence of any “go back to square one” effect – it seemed that both drivers and riders were happy to jump back on the Uber bandwagon.
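One way to quantify “back on the original path” is to fit a growth trend to the months before the exit, extend it forward, and compare post-re-entry revenue with that extrapolation. The sketch below does this with a log-linear trend. The method, the function name and the numbers are illustrative assumptions rather than the exact analysis behind our chart, and the toy data is deliberately constructed to sit on trend.

```python
import math


def fit_log_trend(months, revenue):
    """Least-squares fit of log(revenue) = a + b * month."""
    logs = [math.log(r) for r in revenue]
    n = len(months)
    mean_t = sum(months) / n
    mean_l = sum(logs) / n
    stt = sum((t - mean_t) ** 2 for t in months)
    stl = sum((t - mean_t) * (l - mean_l) for t, l in zip(months, logs))
    b = stl / stt
    a = mean_l - b * mean_t
    return a, b


# Hypothetical city revenue by month index (made-up, not San Antonio data).
pre_months = [0, 1, 2, 3, 4, 5]
pre_revenue = [100.0 * 1.10 ** t for t in pre_months]
# Service is suspended in between, so there are no receipts to observe.
post_months = [14, 15, 16]
post_revenue = [100.0 * 1.10 ** t for t in post_months]

a, b = fit_log_trend(pre_months, pre_revenue)

# Ratio of observed revenue to the extrapolated pre-exit trend.
# Near 1.0 means back on the original path; well below 1.0 means lost ground.
ratios = [rev / math.exp(a + b * t) for t, rev in zip(post_months, post_revenue)]
print([round(r, 3) for r in ratios])
```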
Another question often raised by Uber analysts is profitability. Uber is well-known for offering generous rider and driver subsidies and discounts, operating at a loss in order to capture market share, especially in new cities. What happens when the discounts run out? Will riders or drivers leave the platform? What does this imply for the sustainability of the business model?
San Francisco offers a natural experiment to test this. In Sep 2014, Uber raised its commission on each ride to 25%. Eight months later, in May 2015, Uber changed its commission to a tiered structure, with top drivers paying only 20%. Here’s what happened to revenue over that period:
It looks like drivers left the platform after the initial commission hike, only to return after the subsequent reversal. Riders followed drivers, and revenue followed riders.
This graph, and similar analyses carried out over dozens of price, fare and commission changes across dozens of cities, paint a fascinating picture of the strengths and weaknesses of Uber’s business: rider and driver price sensitivity, the efficacy of subsidies, the rewards of winner-takes-all, and much more.
A Wealth of Data
The above two case studies are just the tip of the iceberg. For Amazon, we could investigate Prime, Kindles, and Amazon Marketplace. For Uber, we could investigate surge pricing, UberPool, UberEats, and other services. And there are hundreds of other companies in the email corpus, each with its own distinct business model and nuances, waiting to be analyzed.
The sheer wealth of data in this dataset is quite amazing. We’re excited to see what other conclusions a skilled analyst can reach with this data!
This entire email dataset is available for sale via Quandl right now. Derived reports and analyses (like the case studies above) are also available, priced on a per-company basis. Please contact us if you’d like to learn more.