## Taking Out the Outliers II

This is a follow-up to my previous post, in which I described Rousseeuw's FastMCD algorithm, which uses the minimum covariance determinant (MCD) approach to calculate a robust covariance matrix. I want to use the example from that post (the relationship between the returns of QIM and Quantica over the last 5 years) to explore the behavior of the FastMCD algorithm. Finally, I want to suggest a way of combining the advantages of the robust and conventional approaches to covariance analysis.

### Robust Covariance Analysis

Here is a quick recap. The complete dataset has n samples. Find the subset of h samples (out of n) whose covariance matrix has the smallest determinant. This is equivalent to finding the subset with the smallest tolerance ellipse. The covariance matrix so derived is robust with respect to changes in the (n - h) samples not in the subset. Since those (n - h) "excluded" samples are also the furthest from the center of the h-sample subset, they are the outliers. In other words, this approach results in a covariance matrix that is robust to outliers.
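To make the definition concrete, here is a minimal brute-force sketch in Python. It is not FastMCD: it simply enumerates every h-subset of a tiny bivariate dataset and keeps the one with the smallest covariance determinant, which is only feasible for very small n. The data are made up, and the helper names (`cov2`, `det2`, `corr2`, `brute_force_mcd`) are illustrative assumptions.

```python
from itertools import combinations


def cov2(points):
    """Sample covariance (sxx, sxy, syy) of a list of (x, y) pairs."""
    m = len(points)
    mx = sum(x for x, _ in points) / m
    my = sum(y for _, y in points) / m
    sxx = sum((x - mx) ** 2 for x, _ in points) / (m - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (m - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (m - 1)
    return sxx, sxy, syy


def det2(cov):
    """Determinant of the 2x2 covariance matrix."""
    sxx, sxy, syy = cov
    return sxx * syy - sxy ** 2


def corr2(cov):
    """Correlation coefficient implied by the covariance matrix."""
    sxx, sxy, syy = cov
    return sxy / (sxx * syy) ** 0.5


def brute_force_mcd(points, h):
    """Indices of the h-subset whose covariance determinant is smallest."""
    return min(
        combinations(range(len(points)), h),
        key=lambda idx: det2(cov2([points[i] for i in idx])),
    )


# Made-up data: six points close to the line y = x, plus two outliers.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8),
        (5.0, 5.1), (6.0, 6.0), (1.5, 6.0), (5.5, 0.5)]

subset = brute_force_mcd(data, h=6)
robust_corr = corr2(cov2([data[i] for i in subset]))
conventional_corr = corr2(cov2(data))
```

On this toy data the two outliers are the excluded samples, so the correlation of the retained subset is far higher than the conventional correlation of all eight points.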

### Our Example: QIM and Quantica

The chart to the right shows a scatterplot of 60 monthly returns of the QIM and Quantica funds (data available at iasg). In red is the 97.5% chi-squared (2 degrees of freedom) tolerance ellipse derived from the conventional Pearson covariance. The apparent correlation is close to zero. In blue is the tolerance ellipse derived from the minimum covariance determinant approach with h = 75% x n = 45. Clearly, the robust correlation is much higher (~0.55!).
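For readers who want to reproduce such an ellipse, the sketch below shows one way to test whether a point lies inside a 97.5% tolerance ellipse. With 2 degrees of freedom the chi-squared CDF is 1 - exp(-q / 2), so the cutoff has a closed form and needs no statistics library. The function names and the example center and covariance are illustrative assumptions, not the fitted values from the chart.

```python
import math


def chi2_quantile_2df(prob):
    """Chi-squared quantile for 2 degrees of freedom (closed form)."""
    return -2.0 * math.log(1.0 - prob)


def mahalanobis_sq(point, center, cov):
    """Squared Mahalanobis distance in 2-D; cov is (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    dx, dy = point[0] - center[0], point[1] - center[1]
    det = sxx * syy - sxy ** 2
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def inside_ellipse(point, center, cov, prob=0.975):
    """True if the point lies within the tolerance ellipse."""
    return mahalanobis_sq(point, center, cov) <= chi2_quantile_2df(prob)


cutoff = chi2_quantile_2df(0.975)  # about 7.38

# Hypothetical center and covariance (unit variances, no correlation).
center, cov = (0.0, 0.0), (1.0, 0.0, 1.0)
near = inside_ellipse((1.0, 1.0), center, cov)   # squared distance 2
far = inside_ellipse((2.0, 2.0), center, cov)    # squared distance 8
```

The same test works for either ellipse: only the center and covariance passed in change between the conventional and the robust fit.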

### The Value of "h"

One's choice of "h" depends upon one's beliefs about the level of contamination in the data. The smallest h the MCD approach can theoretically work with is floor((n + p + 1) / 2), where n is the sample size, p is the number of dimensions and floor means "round down to the nearest integer". That choice gives the maximum breakdown value, since it leaves the largest possible number of samples, n - h, free to be treated as contamination. So the possible values of h to consider range from just above 50% to 100% of n. By default (as recommended by Rousseeuw in his 1998 FAST-MCD paper) I use a value of h = 75% x n, right in the middle of the range.

The chart to the left shows how the correlation coefficient calculated using FastMCD varies with h when the dataset is held constant at 60 samples. The smallest h is 31 (the maximum-breakdown choice, floor((60 + 2 + 1) / 2)); the largest is 60, in which case the conventional correlation coefficient is calculated.
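The range of h for this example can be worked out in a couple of lines. This is a small sketch; the function names are hypothetical, and rounding the 75% default down to an integer is an assumption.

```python
def h_min(n, p):
    """Smallest usable subset size: floor((n + p + 1) / 2)."""
    return (n + p + 1) // 2


def h_from_fraction(n, fraction=0.75):
    """Subset size as a fraction of n (rounded down, by assumption)."""
    return int(fraction * n)


n, p = 60, 2
candidates = list(range(h_min(n, p), n + 1))  # every h worth trying
default_h = h_from_fraction(n)
```

For n = 60 and p = 2 this gives candidate values of h from 31 through 60 and a default of 45.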

The correlation coefficient seems to exhibit three distinct phases: around 0.4 for 31 <= h <= 36, around 0.55 for 37 <= h <= 50, and decreasing steadily to 0 from then on. The final phase is easy to understand: the robust estimate converges to the conventional one as the sub-sample converges to the full sample and the outliers exert ever-increasing influence. The stepped behavior suggests there might be some change going on in the data: there are two types of behavior, and one dominates the other depending upon sub-sample size. My conclusion: you have to try out different values for h and see what happens. I recommend reducing h, starting from n, until you arrive at plateau-type behavior.
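A sweep over h like the one described above can be sketched as follows. This is a simplified stand-in for FastMCD, not the real algorithm: it runs concentration steps (refit on the current subset, then keep the h points with the smallest Mahalanobis distance) from a handful of random h-subsets and keeps the lowest-determinant result. The data are synthetic (45 correlated points plus 15 planted outliers), and every name and parameter here is an illustrative assumption.

```python
import random


def mean_cov(points):
    """Mean and sample covariance (sxx, sxy, syy) of (x, y) pairs."""
    m = len(points)
    mx = sum(x for x, _ in points) / m
    my = sum(y for _, y in points) / m
    sxx = sum((x - mx) ** 2 for x, _ in points) / (m - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (m - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (m - 1)
    return (mx, my), (sxx, sxy, syy)


def correlation(cov):
    sxx, sxy, syy = cov
    return sxy / (sxx * syy) ** 0.5


def c_step(points, subset, h):
    """One concentration step: refit on subset, keep the h closest points."""
    (mx, my), (sxx, sxy, syy) = mean_cov([points[i] for i in subset])
    det = sxx * syy - sxy ** 2

    def dist_sq(i):
        dx, dy = points[i][0] - mx, points[i][1] - my
        return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

    return sorted(sorted(range(len(points)), key=dist_sq)[:h])


def approx_mcd_corr(points, h, n_starts=20, seed=0):
    """Correlation of the lowest-determinant subset found by C-steps."""
    rng = random.Random(seed)
    best_det, best_corr = None, None
    for _start in range(n_starts):
        subset = sorted(rng.sample(range(len(points)), h))
        for _iteration in range(100):  # cap iterations as a safeguard
            new_subset = c_step(points, subset, h)
            if new_subset == subset:
                break
            subset = new_subset
        _, cov = mean_cov([points[i] for i in subset])
        det = cov[0] * cov[2] - cov[1] ** 2
        if best_det is None or det < best_det:
            best_det, best_corr = det, correlation(cov)
    return best_corr


# Synthetic data: 45 points with population correlation 0.6, 15 outliers.
rng = random.Random(42)
clean = []
for _ in range(45):
    x = rng.gauss(0, 1)
    clean.append((x, 0.6 * x + 0.8 * rng.gauss(0, 1)))
outliers = [(rng.gauss(3, 0.5), rng.gauss(-3, 0.5)) for _ in range(15)]
data = clean + outliers

n, p = len(data), 2
sweep = {h: approx_mcd_corr(data, h) for h in range((n + p + 1) // 2, n + 1)}
```

Plotting `sweep` against h is one way to look for the plateau: at h = n the value is, by construction, the conventional correlation of the whole sample.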

### Varying The Sample Size

I change the starting point for the dataset in the first chart below and the ending point in the second. In both cases the sub-sample size, h, is set to 75% of the sample size, n. In the first chart, n decreases left to right; in the second, n increases. In the first chart the robust correlation coefficient is remarkably stable, gradually rising from 0.5 to 0.6 as older samples are dropped. In the second chart we see very different behavior: there is a step at month 52 as we add that sample. Moreover, the step is from about -0.2 up to 0.5! The robust MCD is putting out values we have not seen so far. This strongly suggests some change in the relationship occurred some time before month 52. There is a step because, at this value of h (= 0.75 x 52 = 39), the data suggesting a correlation of 0.55 overwhelms the data suggesting a negative correlation.

Note that this behavior occurs because the samples that indicate a correlation of 0.55 or so are a) greater in number than those suggesting -0.2, and b) more recent. If the data were different in timing or quantity, the first chart might have shown a step and the second been smooth, or both might have been stepped. We cannot say anything about the superiority of one approach over the other.
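The two ways of varying the sample can be written down as index ranges. This is only a sketch of the bookkeeping: the function names, the `min_size` parameter (the smallest sample considered) and the rounding of h are assumptions.

```python
def shrinking_samples(n, min_size):
    """Drop the oldest observation each time: [0:n], [1:n], ..."""
    return [(start, n) for start in range(0, n - min_size + 1)]


def expanding_samples(n, min_size):
    """Add the next observation each time: [0:min_size], ..., [0:n]."""
    return [(0, end) for end in range(min_size, n + 1)]


def subsample_size(sample_size, fraction=0.75):
    """h as a fraction of the current sample size (rounded down)."""
    return int(fraction * sample_size)


# Each (start, end) pair is a slice of the return series; h follows n.
first_chart = [(s, e, subsample_size(e - s)) for s, e in shrinking_samples(60, 30)]
second_chart = [(s, e, subsample_size(e - s)) for s, e in expanding_samples(60, 30)]
```

Each tuple gives the slice of the series to estimate on and the h to use for it, so the same estimator can be run over both lists.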

In both cases, the conventional correlation coefficient increases in a more gradual way. In the first chart the leverage points early in the time series lead the conventional approach to understate correlation. It is not until those points have dropped off the beginning of the series that conventional correlation begins to approach the robust correlation. In the second chart, those same leverage points seem to cause a more negative estimate of correlation than is warranted, and continue to hold down the estimate once the entire dataset is included.

Overall, I would note that the conventional approach actually starts responding to changes in the relationship between these two managers before the robust approach, but the robust approach reaches the final destination quicker than the conventional approach once it "decides" the change is real!

### Windowing

Here I use a fixed window size of 36 periods (h = 0.75 x 36 = 27) and move it forward through time. This chart nicely contrasts the behavior of the robust and conventional estimates of correlation. The robust estimate steps up from -0.2 / -0.3 to +0.5, while the conventional estimate climbs gradually. It is as though the robust method waits until the evidence is undeniable and then switches modes, whereas the conventional method takes an average path! I feel the added color given by the robust approach is well worth it: imagine the usefulness of this in a manager monitoring context.
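The windowing itself is simple to sketch. The helper below slides a fixed-length window over two return series and applies whatever estimator it is given; a plain Pearson correlation is shown, and a robust estimator with the same signature could be passed in instead. The function names and the toy series are illustrative assumptions.

```python
def pearson(xs, ys):
    """Conventional (Pearson) correlation of two equal-length sequences."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def rolling(xs, ys, window, estimator=pearson):
    """Apply estimator to every window of the given length, oldest first."""
    return [
        estimator(xs[start:start + window], ys[start:start + window])
        for start in range(len(xs) - window + 1)
    ]


# Toy series: 60 periods of perfectly linearly related "returns".
xs = [float(i) for i in range(60)]
ys = [2.0 * x + 1.0 for x in xs]
series = rolling(xs, ys, window=36)  # one estimate per window position
```

With 60 periods and a 36-period window there are 25 window positions, so each line on such a chart has 25 points.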

### A Conservative Approach

It's always important to remember what your goals are, and what the consequences of error might be. If I am considering issues like portfolio optimization I have trade-offs such as frequency of re-optimizing and lost return vs unexpected volatility. I certainly don't want to make the claim that robust covariance is the only way to go: it depends.

So I want to leave you with this chart. I call it my "Conservative Approach". I do not want to rely on correlation relationships that are not there, nor do I want to ignore changes that could hurt me. In this chart I plot the greater of the conventional and robust correlation coefficients at each point in time, for a variety of window lengths. This gives me the benefit of the earlier lift-off of the conventional estimate (warning me to reduce my reliance on the advantageous negative correlation) combined with the quicker arrival of the robust estimate (letting me know just how in-step these managers have become).
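In code, the combination is just a pointwise maximum of the two series. Here is a minimal sketch; the function name is hypothetical and the numbers are made up purely to show the mechanics.

```python
def conservative(conventional, robust):
    """Pointwise greater of the conventional and robust correlations."""
    if len(conventional) != len(robust):
        raise ValueError("series must have the same length")
    return [max(c, r) for c, r in zip(conventional, robust)]


# Made-up values: the conventional estimate lifts off first,
# and the robust estimate arrives at the new level sooner.
conventional = [-0.20, -0.05, 0.10, 0.20, 0.30]
robust = [-0.25, -0.25, -0.20, 0.50, 0.55]
combined = conservative(conventional, robust)
```

Taking the maximum is conservative only when a higher correlation is the unfavorable outcome, as it is when relying on diversification between two managers.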

If your strategies depend on solid covariance analysis, you must include a robust approach in your toolkit.