Garbage in. Garbage out. This old adage holds for all areas of decision sciences, including backtesting your investment strategies.

Years of working with financial data, and working directly in the data industry, have revealed deep issues with many data providers, especially the freely available sources. Errors are everywhere, and they can make a great strategy look terrible or, worse, make a losing strategy look highly profitable.

The Financial Data Pipeline

If you open up your favorite trading platform or brokerage account, you’re confronted with a series of quotes: red or green numbers updating by the second. For individual stocks and securities, they represent the price at which the most recent transaction settled. For indices (such as those shown below), they are the aggregation of the most recent transactions of all the securities that make up the index; a simple version of that aggregation is sketched after the image.

[Image: trading-indices.png]
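To make the aggregation concrete, here is a minimal sketch of an index level computed from constituents’ last-traded prices. The tickers, prices, and weights are made up for illustration, and real index methodologies add float adjustments and divisors that are left out here.

```python
# Minimal sketch: an index quote as a weighted aggregation of the
# last-traded prices of its constituents. Tickers, prices, and weights
# are hypothetical; real indices also apply float adjustments and a divisor.
last_trade = {"AAA": 102.35, "BBB": 48.10, "CCC": 233.77}  # most recent prices
weights = {"AAA": 0.5, "BBB": 0.3, "CCC": 0.2}             # index weights

index_level = sum(price * weights[ticker] for ticker, price in last_trade.items())
print(f"Index level: {index_level:.2f}")
```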

These transactions are recorded by the exchanges and sold to data providers, who in turn offer the data, along with their APIs or software packages, to traders, institutions, and others. There are plenty of free data providers out there as well, which is where most algorithmic traders start.

Pitfalls of Free Data Sources

Free data is a great place to start (we use free data sources in our tutorials because they’re easy and accessible), but we would never trust our hard-earned cash to a strategy that relies on free data. There are a few reasons for this.

Many free data sources have limited histories. For a good algorithmic approach, we want as much data as possible, which means going back as far in time as we can so that we’re able to test our approaches against a wide range of markets. Five or ten years of data just doesn’t cut it.

Free data sources may become obsolete or move to a premium model. If this happens, your algorithm is suddenly cut off, which could lead to missed trades. Most professional sources are loath to change their systems because their customers depend on consistent and reliable data feeds to build their businesses (you can see this when sampling professional data systems and finding plenty of UIs that were clearly built for Windows 95…but they still work!).

Data inaccuracies are frequent. It’s hard to keep up with thousands of companies and their corporate changes, so stock splits, dividend payments, and other corporate actions that need to be propagated through historical data frequently get missed. Additionally, rounding errors can compound the farther back in time you look.
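To see why propagation matters, here is a minimal sketch (using pandas) of back-adjusting a price series for a hypothetical 2-for-1 split; the dates and prices are made up. Skip the adjustment and the series shows a phantom 50% drop on the split date, which any backtest would happily trade on.

```python
import pandas as pd

# Made-up daily closes around a hypothetical 2-for-1 split on 2023-06-15.
prices = pd.Series(
    [100.0, 102.0, 104.0, 52.0, 53.0],
    index=pd.to_datetime(
        ["2023-06-12", "2023-06-13", "2023-06-14", "2023-06-15", "2023-06-16"]
    ),
    name="close",
)

split_date = pd.Timestamp("2023-06-15")
split_ratio = 2.0  # 2-for-1 split

# Back-adjust: divide every price *before* the split by the ratio so the
# series is continuous across the split date.
adjusted = prices.copy()
adjusted.loc[adjusted.index < split_date] /= split_ratio
print(adjusted)
```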

Stock tickers often get re-purposed after de-listing, and many free sources either don’t keep records of these de-listed stocks or only allow look-ups by ticker. If this isn’t properly accounted for, you can introduce survivorship bias into your backtests by only testing strategies against companies that have survived over the years, which inflates your results and hides risk.
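A quick sketch of the mechanics, with fabricated returns used purely for illustration: the same equal-weight average looks very different when delisted names are silently dropped from the universe.

```python
# Hypothetical full-period returns for a tiny universe. "DEAD1" and "DEAD2"
# stand in for tickers that were delisted; free sources often drop them.
full_universe = {
    "AAA": 0.12, "BBB": 0.08, "CCC": -0.05,
    "DEAD1": -0.60, "DEAD2": -0.85,  # delisted after large losses
}
survivors_only = {t: r for t, r in full_universe.items() if not t.startswith("DEAD")}

def equal_weight_return(returns):
    """Average return of an equal-weight portfolio over the given universe."""
    return sum(returns.values()) / len(returns)

# The survivors-only universe overstates performance because the losers vanished.
print("Full universe:  ", round(equal_weight_return(full_universe), 4))
print("Survivors only: ", round(equal_weight_return(survivors_only), 4))
```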

Free-data stalwarts like Yahoo! Finance have run into all of these issues: restricting data by changing business models; having APIs suddenly break with new updates; miscalculating dividends, splits, and the like; rounding payouts, which leads to errors as the data gets propagated into the past; and dropping de-listed stocks, causing survivorship bias in backtests.

Professional Data Sources

This isn’t an ad for buying data from a vendor. Triangulating multiple free data sources and making regular updates can help fix these issues, but that’s a lot of work, and the time may be better spent doing research and running tests. It’s better to start with good, high-quality data and build from there than to spend heaps of time chasing down discrepancies, building scrapers, and patching APIs.
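If you do go the triangulation route, the core of it is just aligning the same series from two providers and flagging disagreements. The sketch below assumes you’ve already pulled daily closes from two sources into pandas Series; the fetching itself is left out, and the example data is made up.

```python
import pandas as pd

def flag_discrepancies(source_a, source_b, tol=0.001):
    """Align two daily close series on date and return the rows where the
    relative difference exceeds `tol` (0.1% by default)."""
    df = pd.concat({"a": source_a, "b": source_b}, axis=1).dropna()
    df["rel_diff"] = (df["a"] - df["b"]).abs() / df["b"]
    return df[df["rel_diff"] > tol]

# Made-up closes standing in for two providers' feeds for the same ticker.
dates = pd.to_datetime(["2023-06-12", "2023-06-13", "2023-06-14"])
provider_a = pd.Series([100.0, 101.5, 103.0], index=dates)
provider_b = pd.Series([100.0, 101.5, 51.5], index=dates)  # e.g. an unadjusted split

print(flag_discrepancies(provider_a, provider_b))
```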

Let us handle that for you at Raposa Technologies, where we’re building a platform to make quantitative investing easily accessible. We’ve done the hard work of vetting our data and vendors, giving you access to professional backtesting capabilities to build your own strategies that you can be confident in.