Having wound our way through all of the sub-steps of the previous stages of the data factory (acquire, transform, apply), it becomes clearer how much specialization is required when building a modern alternative data practice.
Of course, if you really wanted to, you could have one person do the entire process. But it’s highly unlikely that doing so would be the best use of their time—or the most scalable option for the business.
Given these challenges, the data factory model is likely to become increasingly popular. In this final installment of our series, we'll review the last stage of the data factory, deploying your data, and then look at what enables every stage of the process: technology.
Deploying your data
Hunting down obscure and valuable datasets is interesting work, and good data science has its glamour. There are unexpected insights to uncover and clever ways to extract value from data.
By contrast, deploying the data is not overly glamorous work. It is an exercise in being careful and meticulous, covering data operations, platform engineering and customer support.
And yet, it is critical that organizations deploy their data well.
The data operations stage entails taking the code written by data scientists—often code that has been iterated on quickly and is therefore neither robust nor scalable—and converting it into production-quality code. Data operations also covers production-level quality assurance: the checks you ran on the data earlier now need to run continuously.
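As an illustration of what running those checks constantly can look like, here is a minimal Python sketch; the check names, fields and thresholds are hypothetical, not Quandl's actual rules. The idea is that ad-hoc analysis checks get promoted into named, automated assertions:

```python
# Hypothetical production QA harness: notebook-era sanity checks become
# named, automated assertions that run on every data load.

def check_no_nulls(rows, field):
    """Pass only if every row has the required field populated."""
    return all(row.get(field) is not None for row in rows)

def check_in_range(rows, field, lo, hi):
    """Pass only if a numeric field stays within its expected bounds."""
    return all(lo <= row[field] <= hi for row in rows)

def run_quality_checks(rows):
    """Run every registered check; return the names of any that failed."""
    checks = {
        "price_present": lambda r: check_no_nulls(r, "price"),
        "price_positive": lambda r: check_in_range(r, "price", 0.0, 1e6),
    }
    return [name for name, check in checks.items() if not check(rows)]

sample = [{"ticker": "AAPL", "price": 150.0}, {"ticker": "MSFT", "price": 300.0}]
print(run_quality_checks(sample))  # -> [] (all checks pass)
```

In production, a harness like this would run on every update, with any failed check names feeding straight into monitoring and alerting.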
There is also the issue of timing and reliability. For the sake of business continuity, it is important to establish service-level agreements (SLAs) on latency and reliability. Hand in hand with established SLAs goes the actual monitoring of data operations: as with any system, things will inevitably go wrong, and you should know how disruptions will be captured and addressed before anything ever goes awry.
Think of your data infrastructure as a finely tuned machine with many moving parts. The data operations function is the oil that keeps every part running smoothly. It is rarely the most visible function, but it is critical to the overall success of any data infrastructure.
Take, for example, how we at Nasdaq’s Quandl manage our FX Settlements Volume & Flow products. With these datasets, as with many, delayed data can have as negative an impact on your operations as any other data hiccup.
That is why we employ a system that monitors for hourly data updates and catches any delays. The system has specific rules for escalation processes should a delay occur: what to do, when to do it and who to alert. On-call operations engineers receive alerts by email, text and/or phone and, if necessary, the system also notifies the data vendor and users subscribed to the data feed.
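Escalation logic of this kind can be sketched roughly as follows. This is an illustrative reconstruction, not Quandl's actual system; the thresholds and recipient names are invented:

```python
# Illustrative sketch: escalation rules keyed on how long an hourly
# feed update has been delayed. Thresholds and parties are hypothetical.

from datetime import datetime, timedelta, timezone

# Escalation ladder: the longer the delay, the wider the alert audience.
ESCALATION_RULES = [
    (timedelta(minutes=15), ["oncall_engineer"]),
    (timedelta(minutes=60), ["oncall_engineer", "data_vendor"]),
    (timedelta(minutes=120), ["oncall_engineer", "data_vendor", "feed_subscribers"]),
]

def who_to_alert(expected_at, now):
    """Return everyone who should be alerted given the current delay."""
    delay = now - expected_at
    recipients = []
    for threshold, parties in ESCALATION_RULES:
        if delay >= threshold:
            recipients = parties  # later thresholds supersede earlier ones
    return recipients

expected = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc)  # 90 minutes late
print(who_to_alert(expected, now))  # -> ['oncall_engineer', 'data_vendor']
```

A real system would wire the returned recipients into email, text and phone channels, but the core of the escalation process is this mapping from delay to audience.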
The above example covers only one thread of data operations: timing. As a discipline, data operations requires extensive logic of this kind to ensure that all systems are running smoothly.
A critical flaw of many ambitious data science initiatives is that they exist in a vacuum.
A data science team will take data and build something impressive, but only the data science team itself ever knows about it. A siloed approach like this wastes resources, creates a single point of failure and is simply inefficient. Given the significant price tag attached to quality data products, it is a shame to keep the fruit of data scientists' labor inaccessible.
That’s why we’ve dedicated time to building out a platform that makes the result of all of this work accessible to everyone.
The first step is hosting the data in a universally accessible location; in our case, a cloud platform. Then it is a matter of making the platform itself universally accessible, so we built RESTful APIs and connectors and made the data available via Excel, Python and R. Once the data is available, it should also be easy to find; in our case, we deployed a catalogue that makes searching through data seamless. Finally, it's important to track who is using the data and to manage permissions and entitlements in line with compliance and commercial restrictions—in short, everything should be auditable.
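To make these platform concerns concrete, here is a toy sketch, with an entirely hypothetical catalogue and invented user names, of how a single data lookup can enforce entitlements and leave an audit trail at the same time:

```python
# Hypothetical sketch: a searchable catalogue whose lookups also enforce
# entitlements and record an audit trail for compliance review.

CATALOGUE = {
    "FXSV": {"name": "FX Settlements Volume", "entitled": {"alice"}},
    "EQR":  {"name": "Earnings Quality Rankings", "entitled": {"alice", "bob"}},
}

AUDIT_LOG = []  # every access attempt is recorded, allowed or not

def fetch_dataset(code, user):
    """Return dataset metadata if the user is entitled; log the attempt."""
    entry = CATALOGUE.get(code)
    allowed = entry is not None and user in entry["entitled"]
    AUDIT_LOG.append({"user": user, "dataset": code, "allowed": allowed})
    if not allowed:
        raise PermissionError(f"{user} is not entitled to {code}")
    return entry["name"]
```

In a real platform the catalogue, entitlements and audit log would live in managed services rather than in-memory dictionaries, but the shape of the check is the same: no data leaves the platform without an entitlement decision being recorded.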
In addition to all of the above, any data platform needs to be secure and scalable in order to grow with your firm’s needs. These are all platform features that exist independently of any particular dataset—they are universally useful and critical, so it’s worth investing in them.
Our Earnings Quality Rankings dataset, for example, is possibly one of our simplest. It provides a weekly earnings quality ranking for over 3000 US public equities. It is based solely on accounting-related indicators; there are no fundamental overlays or price/valuation-based metrics.
Even with a simple dataset like the above, you need a platform that makes the data available and reliable. Documentation, sample data, user permissions, usage guides, field definitions and schemas are all important, because even a relatively simple dataset benefits immensely from operational support.
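One way to keep field definitions, schemas and documentation consistent is to treat them as a single machine-readable artifact. Here is a minimal sketch, with hypothetical field names and descriptions loosely modeled on an earnings-quality dataset, where the same schema that validates the feed also generates the user-facing field reference:

```python
# Hypothetical sketch: one machine-readable schema drives both validation
# and the user-facing documentation, so the two can never drift apart.

SCHEMA = [
    {"field": "ticker",  "type": "str",  "description": "US equity ticker symbol"},
    {"field": "date",    "type": "date", "description": "Ranking date (weekly)"},
    {"field": "eq_rank", "type": "int",  "description": "Earnings quality rank"},
]

def render_field_docs(schema):
    """Render the schema as a plain-text field reference for the docs."""
    lines = [f"{f['field']} ({f['type']}): {f['description']}" for f in schema]
    return "\n".join(lines)

print(render_field_docs(SCHEMA))
```

Generating documentation from the schema, rather than writing it by hand, is a small investment that pays off on every new dataset.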
The final consideration of note when it comes to deploying your data is a robust customer support system. In addition to providing extensive documentation outlining how to access data through our APIs and in the form that users want, we find that having a dedicated customer support team to assist users helps us deploy data more efficiently.
Technology: making it all sing
Throughout the data factory process, you need a layer of technology to make it all sing. You can do all of the steps we’ve discussed without it, but the technology allows you to increase the productivity of associated teams and processes.
Just as the overall data factory approach is repeatable with most new datasets, there are also certain tools that you can use again and again. The reusability of tools and processes within the data factory approach saves time and money while driving quality and efficiency.
When it comes to data flow management, data quality assurance, symbology and other maintenance-related tasks, there are universal rules and approaches that will apply regardless of the dataset. With tasks like these, it isn’t necessary to reinvent the wheel every time you onboard a new dataset. Establishing principles around how you accomplish these tasks will enable your firm to repeat successes and avoid failures arising from inconsistency.
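The reuse argument can be made concrete with a small sketch: generic, dataset-agnostic rules are written once, and onboarding a new dataset becomes a matter of configuration rather than new code. The dataset and field names below are hypothetical:

```python
# Illustrative sketch of reusable tooling: generic validation rules are
# defined once, then configured per dataset as each one is onboarded.

def rule_required(field):
    """Rule: the field must be present and non-null."""
    return lambda row: row.get(field) is not None

def rule_positive(field):
    """Rule: the field must be a number greater than zero."""
    return lambda row: isinstance(row.get(field), (int, float)) and row[field] > 0

# Onboarding a new dataset is configuration, not new code.
DATASET_RULES = {
    "fx_volume":   [rule_required("currency_pair"), rule_positive("volume")],
    "eq_rankings": [rule_required("ticker"), rule_positive("eq_rank")],
}

def validate(dataset, row):
    """Apply the configured rules for a dataset to one row."""
    return all(rule(row) for rule in DATASET_RULES[dataset])
```

The same pattern applies to symbology mapping and data flow management: encode the universal logic once, and let each dataset supply only its configuration.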
Tasks like vendor, team and user management and the actual data delivery platform itself should ideally be nearly data-agnostic. Customizations may occasionally have to be made, but data is data—accessing and sharing it should be largely a standard process.
The Data Factory in sum
We’ve now completed an in-depth exploration of every stage of the data management approach that I call “The Data Factory.”
We’ve covered how to acquire the right data, how to transform it into a usable state, how to apply it to solve your specific challenges and how to deploy it so that it is accessible and shareable.
It must be stressed that throughout this process, technology and reusable tools and approaches are essential to the success of the data factory model. Further, specialized expertise can make a real difference in efficiency.
Data management, when you invest in doing it well, rewards you: you get more out of your data and can focus on your primary objective: uncovering alpha.
Revisit the other installments of the Data Factory series: