Who are Horsecode?
Horsecode are a company specialising in the analysis of horse racing. Previously they worked with traditional forms of analysis (knowledge of previous form and so on) and provided tips and betting advice to their customers and investors. With computing power and data now far more accessible, they started exploring a statistical analysis and machine learning approach.
Horsecode were populating their database of racing results by importing data manually entered into spreadsheets every day. This manual process led to a high number of errors in the data, which was negatively affecting their analysis. It was also time-consuming and expensive, so their data set was not as comprehensive as they would have liked. Improving the accuracy, and therefore the yield, of their algorithms was key to their future business strategy, and doing so relied on increasing both the accuracy and the size of their data set.
They found a racing results database product, but it was built on an inappropriate technology that was neither efficient nor quick, and its data format would have required a large amount of translation work to fit the format they were working with. They approached other companies that held the data in a more suitable format, but the costs were prohibitive.
What they needed was a way to increase the accuracy and size of their data set that they could control.
Simon (our CEO) had worked closely with one of the Horsecode team a few years previously, so they approached him to see if he could help. With 15 years' experience as a systems designer across a variety of industries, and a love of a challenge, Simon was only too keen to get involved.
Changing the existing analysis systems would have been costly in both money and time, so the interface of the database could not change. After analysing the existing database structure and identifying the full data set available on various racing results sites, we designed a new database: the existing one would not cope with the volume of data we'd be storing in the future and would become a serious bottleneck. The new database was designed for high-speed queries, and because we maintained the same interface as the old database, no changes were required to the existing analysis systems.
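The "new engine behind the old interface" idea can be sketched as a simple adapter. The names and schema here are hypothetical, not Horsecode's actual API: the point is that the analysis code keeps calling the same function with the same return shape, while the backend behind it is swapped for the optimised store.

```python
class LegacyResultsInterface:
    """The query surface the existing analysis systems depend on.
    The signature and return shape are kept identical, so no
    analysis code needs to change when the backend is replaced."""

    def __init__(self, backend):
        self._backend = backend  # the new, query-optimised store, injected

    def results_for_horse(self, horse_name):
        # Return results in the same (date, course, position) shape
        # the old database layer produced.
        rows = self._backend.query(horse=horse_name)
        return [(r["date"], r["course"], r["position"]) for r in rows]


class NewFastStore:
    """Stand-in for the redesigned, high-speed database."""

    def __init__(self, rows):
        self._rows = rows

    def query(self, horse):
        return [r for r in self._rows if r["horse"] == horse]
```

The design choice is standard dependency inversion: the interface class owns the contract, so the storage engine can be rebuilt (or cloud-hosted, or sharded) without touching downstream consumers.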
The first goal was to correct the existing data. We created a process that used their historic spreadsheets and verified every record with data found online. Records were corrected, cleaned and normalised before being imported into the new database. Every spreadsheet generated in this interim period was imported via the new process.
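The correction step boils down to normalising each spreadsheet record and comparing it against the independently sourced online result before import. A minimal sketch of that idea, with illustrative field names rather than Horsecode's actual schema:

```python
def normalise(record):
    """Standardise a raw record before comparison or import:
    trim whitespace, fix casing, coerce types. Field names are
    illustrative, not the real schema."""
    return {
        "horse": record["horse"].strip().title(),
        "course": record["course"].strip().title(),
        "position": int(record["position"]),
    }


def verify(spreadsheet_record, online_record):
    """A record is only accepted into the new database if, after
    normalisation, it matches the result found online."""
    return normalise(spreadsheet_record) == normalise(online_record)
```

Records that fail verification would be flagged for correction rather than imported, which is what turns a dirty historic spreadsheet into a clean, trustworthy data set.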
The main goal was to provide a database of accurate data containing as many records as were available. We identified three sites with good data and decided to target them, creating custom scrapers for each, integrating them with Chattering Monkey's platform and setting them to run every day. To build up the database quickly, we created a spidering scraping strategy that used the previous day's results as a starting point and pulled up to 90,000 race results per day. This let us build a solid historic data set going back around 15 years relatively quickly. Using three sites also allowed us to cross-check each result against the other sources to protect against errors.
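The spidering strategy described above is essentially a breadth-first crawl with a daily cap: start from the previous day's results, then follow links (to a horse's earlier races, other races at the same meeting, and so on) outward until the limit is hit. A minimal sketch, where `fetch_race` and `linked_races` stand in for real scraper calls:

```python
from collections import deque


def spider(seed_race_ids, fetch_race, linked_races, limit=90_000):
    """Breadth-first spider over race pages.

    seed_race_ids -- yesterday's results, the starting point
    fetch_race    -- scrapes and returns one race result (placeholder)
    linked_races  -- race ids linked from a page (placeholder)
    limit         -- daily cap on results pulled
    """
    seen, queue, results = set(), deque(seed_race_ids), []
    while queue and len(results) < limit:
        race_id = queue.popleft()
        if race_id in seen:
            continue  # never scrape the same race twice
        seen.add(race_id)
        results.append(fetch_race(race_id))
        queue.extend(linked_races(race_id))
    return results
```

Because every scraped page links to older races, repeating this daily walks the graph steadily backwards through the historic record, which is how a 15-year data set gets built quickly.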
While sanitising and correcting the spreadsheet data, we discovered a 1.5% error rate in the existing manual process, against a benchmark of 1% for manual data entry. Once the manual element of the process was removed, the error rate dropped to close to 0%.
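The error rate here is just the fraction of manually entered records that disagree with the verified reference data. For illustration (with made-up numbers, not Horsecode's actual records): 3 mismatches in 200 entries gives exactly the 1.5% figure.

```python
def error_rate(entered, reference):
    """Fraction of manually entered records that disagree with the
    verified reference data at the same position."""
    mismatches = sum(1 for a, b in zip(entered, reference) if a != b)
    return mismatches / len(entered)
```

At scale the difference matters: on a million-record data set, 1.5% is fifteen thousand bad records feeding the analysis.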
As the database grew to over 1,000,000 race results, Horsecode saw a huge increase in the accuracy of their predictions, leading to greater external interest in their algorithms. The optimised, cloud-hosted database has increased the speed of their analysis, lowering its cost. They have also launched a paid service giving interested parties access to the data: a new revenue stream not envisaged at the start of the project.
Chattering Monkey have continued to provide Horsecode with results data every day for the past two years. Following the success of this initial project, we have been working closely together on scraping further related data and on some very exciting machine learning analysis, which we'll write about in the future.