GI => GO (III) Data Validation

I download CSI end-of-day data (EODD) from the TradingBlox website. Then I convert the data to ratio-adjusted contracts (RadContracts). All is well ... or is it? What if there are "issues" in the EODD?

For example, in the latest batch of data, I found the following issues:
  • Euro Index (CU_0_I0B) - there is a jump at 1998-01-04 from 0.6 to 1.2 in the unadjusted close (RawCl)
  • US Dollar Index (DX_0_I0B) - RawCl is 1/10 of the correct value for half the series
  • Nasdaq Composite (ND_0_I0B) - RawCl is a factor of 10 too large for the entire series
  • Lean Hogs (LH_0_I0B) - 5 roll-overs show price changes exceeding 20%
  • Natural Gas (NG20_I0B) - 5 roll-overs show price changes exceeding 20%
  • US Bond (US_0_I0B) - Panama adjustment changes within the same contract month

This list is by no means exhaustive. Clearly, a data cleaning process is required before any analysis can be completed.

So far, issues seem to fall into five categories:

  • Explainable and fixable issues such as that in the Euro series caused by splicing D-Mark to Euro series
  • Fixable errors such as those found in the DX and ND data series - there is an easily identifiable pattern that can be reversed
  • Errors that amount to noise such as those in the US Bond series - easy to find, of very small magnitude, but annoying to have.
  • Genuinely unusual data points that are internally consistent such as LH and NG (OHLC all make sense, Panama adjustment is stable, etc)
  • Errors that I cannot find but that are surely there - these are the ones I fear most, and the ones that I hope to reduce by gradually improving my Validation Script.

The fourth category highlights why a second data source would be nice to have: it could confirm, for example, that the huge roll-overs in Lean Hogs and Nat Gas are genuine.
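
As an illustration of the "fixable" category, here is a minimal sketch of how a factor-of-10 break like the one in the DX series might be found and reversed. The function names, the 20% tolerance, and the assumption that the most recent segment is on the correct scale are all mine, not part of the actual script. Note that it cannot catch a series that is wrong by a constant factor throughout (the ND case), since that needs an outside reference.

```python
def find_scale_breaks(closes, factor=10.0, tol=0.2):
    """Return (index, jump) pairs where closes[i] / closes[i-1] is
    within `tol` (relative) of `factor` or of 1/`factor`."""
    breaks = []
    for i in range(1, len(closes)):
        prev, cur = closes[i - 1], closes[i]
        if prev <= 0 or cur <= 0:
            continue  # ratio is meaningless for non-positive prices
        ratio = cur / prev
        if abs(ratio / factor - 1.0) <= tol:
            breaks.append((i, factor))
        elif abs(ratio * factor - 1.0) <= tol:
            breaks.append((i, 1.0 / factor))
    return breaks


def rescale_to_latest(closes, factor=10.0, tol=0.2):
    """Rescale earlier segments to the scale of the last segment.
    Assumption: the most recent data is on the correct scale."""
    fixed = list(closes)
    # Work backwards so each break rescales everything before it.
    for i, jump in reversed(find_scale_breaks(closes, factor, tol)):
        for j in range(i):
            fixed[j] *= jump
    return fixed


# Hypothetical series: first three closes are 1/10 of the true level.
raw = [9.5, 9.6, 9.7, 97.5, 98.0]
print(find_scale_breaks(raw))
print(rescale_to_latest(raw))
```

Any break found this way should still be eyeballed before the fix is applied, because a genuine tenfold move would look the same to this test.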

So far, my checks are very limited:

  1. Check that the Panama adjustment only changes on the day of a contract roll.
  2. Check that, on a roll, the daily change does not exceed a threshold.
  3. Check that, outside of a roll, the daily change does not exceed a threshold.
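
The three checks above can be sketched roughly as follows. This is not my actual Validation Script: the `Bar` record, its field names, the use of a change in the contract label to detect a roll, and the 20% and 10% thresholds are illustrative assumptions. Prices are assumed to be positive.

```python
from collections import namedtuple

# Hypothetical record layout for one end-of-day bar.
Bar = namedtuple("Bar", "date raw_close panama_adj contract")


def validate(bars, roll_limit=0.20, daily_limit=0.10):
    """Return a list of (date, message) for bars that fail a check."""
    issues = []
    for prev, cur in zip(bars, bars[1:]):
        # Assumption: a roll is any day the contract label changes.
        is_roll = cur.contract != prev.contract
        change = abs(cur.raw_close / prev.raw_close - 1.0)
        # Check 1: the adjustment may only change on a roll day.
        if cur.panama_adj != prev.panama_adj and not is_roll:
            issues.append((cur.date, "adjustment changed without a roll"))
        # Check 2: on a roll, the change must stay under roll_limit.
        if is_roll and change > roll_limit:
            issues.append((cur.date, "roll change exceeds limit"))
        # Check 3: off a roll, the change must stay under daily_limit.
        if not is_roll and change > daily_limit:
            issues.append((cur.date, "daily change exceeds limit"))
    return issues


# Made-up data that trips each check once.
bars = [
    Bar("2010-01-04", 100.0, 0.0, "2010H"),
    Bar("2010-01-05", 101.0, 0.0, "2010H"),
    Bar("2010-01-06", 102.0, 0.5, "2010H"),
    Bar("2010-01-07", 130.0, 1.0, "2010M"),
    Bar("2010-01-08", 150.0, 1.0, "2010M"),
]
for date, message in validate(bars):
    print(date, message)
```

The thresholds would need tuning per market: a limit that is sensible for US Bonds would flag far too many days in Natural Gas.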

I will do some searching to find additional data validation steps to improve the quality of the raw data I am using for analysis.
