For example, in the latest batch of data, I found the following issues:
- Euro Index (CU_0_I0B) - there is a jump at 1998-01-04 from 0.6 to 1.2 in the unadjusted close (RawCl)
- US Dollar Index (DX_0_I0B) - RawCl is 1/10 of the correct value for half the series
- Nasdaq Composite (ND_0_I0B) - RawCl is a factor of 10 too large for the entire series
- Lean Hogs (LH_0_I0B) - 5 roll-overs show price changes exceeding 20%
- Natural Gas (NG20_I0B) - 5 roll-overs show price changes exceeding 20%
- US Bond (US_0_I0B) - Panama adjustment changes within the same contract month
This list is by no means exhaustive; clearly a data cleaning process is required before any analysis can begin.
So far, issues seem to fall into five categories:
- Explainable and fixable issues, such as that in the Euro series, caused by splicing the D-Mark series onto the Euro series
- Fixable errors, such as those found in the DX and ND series - there is an easily identifiable pattern that can be reversed
- Errors that amount to noise, such as those in the US Bond series - easy to find and very small in magnitude, but annoying to have
- Genuinely unusual data points that are internally consistent, such as those in LH and NG (the OHLC values all make sense, the Panama adjustment is stable, etc.)
- Errors I cannot find but that are surely there - these are the ones I fear most, and the ones that will hopefully be reduced by gradually improving my validation script
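The "easily identifiable pattern" in the DX and ND series is a factor-of-10 scaling error. Where the mis-scaled regime starts or ends partway through a series (as in DX), the boundary shows up as a day-over-day ratio near 10 or 1/10, which can be detected and reversed automatically. Below is a minimal sketch of that idea, assuming the unadjusted close is held in a pandas Series; the function name and tolerance are my own, not from the original. Note it cannot catch the ND case, where the *entire* series is mis-scaled and there is no internal jump - that needs an external reference price.

```python
import numpy as np
import pandas as pd

def fix_decade_errors(raw_cl: pd.Series, tol: float = 0.5) -> pd.Series:
    """Detect and reverse factor-of-10 scaling errors (illustrative helper).

    A day-over-day log10 ratio near +1 or -1 marks the boundary of a
    mis-scaled segment; dividing everything from that point on by the
    implied factor reverses the error. A later jump back to the correct
    scale is handled the same way and cancels the first rescaling.
    """
    s = raw_cl.copy()
    # Ratios are computed on the original values, before any rescaling
    log_ratio = np.log10(raw_cl / raw_cl.shift(1))
    # Keep points where |log10 ratio| is within tol of 1 (a ~10x jump)
    jumps = log_ratio[(log_ratio.abs() - 1).abs() < tol]
    for ts, lr in jumps.items():
        factor = 10 ** round(lr)   # +1 -> jumped up 10x, -1 -> down 10x
        s.loc[ts:] = s.loc[ts:] / factor
    return s
```

Applied to a series like `[10, 11, 1.2, 1.3, 14]`, where the middle two points are a tenth of their true value, this restores a smooth `[10, 11, 12, 13, 14]`. The 0.5 tolerance is deliberately loose; genuine >20% daily moves could in principle be misread as scale jumps, so flagged points should still be eyeballed.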
So far, my checks are very limited:
- Check that the Panama adjustment only changes on the day of a contract roll.
- Check that, on a roll, the daily change does not exceed a threshold.
- Check that, outside of a roll, the daily change does not exceed a threshold.
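The three checks above can be sketched as a single pass over the series. This is a minimal illustration, assuming a DataFrame with hypothetical columns `RawCl` (unadjusted close), `PanamaAdj` (the cumulative Panama adjustment), and `Roll` (True on roll days); the thresholds are placeholders, not calibrated values.

```python
import pandas as pd

def validate_series(df: pd.DataFrame,
                    roll_threshold: float = 0.20,
                    daily_threshold: float = 0.10) -> pd.DataFrame:
    """Return a report of rows violating the three basic checks."""
    adj_changed = df["PanamaAdj"].diff().fillna(0) != 0
    pct_change = df["RawCl"].pct_change().abs()

    issues = []
    # 1. The Panama adjustment may only change on the day of a roll
    for ts in df.index[adj_changed & ~df["Roll"]]:
        issues.append((ts, "adjustment changed off-roll"))
    # 2. On a roll, the daily change must stay under roll_threshold
    for ts in df.index[df["Roll"] & (pct_change > roll_threshold)]:
        issues.append((ts, "roll-day change exceeds threshold"))
    # 3. Outside a roll, the daily change must stay under daily_threshold
    for ts in df.index[~df["Roll"] & (pct_change > daily_threshold)]:
        issues.append((ts, "daily change exceeds threshold"))
    return pd.DataFrame(issues, columns=["date", "issue"])
```

Run against a series such as LH or NG, check 2 would flag the five oversized roll-overs; run against the US Bond series, check 1 would flag the mid-month adjustment changes.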
I will search for additional data validation steps to improve the quality of the raw data before using it for analysis.