R has a couple of nice functions, "quantile" and "ecdf" (empirical cumulative distribution function) that are invaluable if you take this approach. Following is a quick run-down of quantile and ecdf and a couple of ways of employing them.
Quantile Function
quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE, type = 7, ...)
- "x" is the data you want to analyse - a vector of values
- "probs" is the how you want to analyze x - the default gives max, min and quartiles
- "names" tells the function to give names to the elements in the returned vector - set it to FALSE for speed
- "type" takes a value from 1-9 and specifies how to calculate the percentile value for each item in x and how to calculate the value of x for each value in "probs"
Following are some notes on type:
If x takes discrete values, use one of types 1-3, otherwise use one of types 4-9. Picture the process as simply sorting the list of values in ascending order, and counting along the list to get the percentile required.
- Type 1 is simple: take the first value such that the count equals or exceeds the proportion required. e.g. the 25th percentile of 20 items is the 5th in a sorted list: 0.25 x 20 = 5. It would be the 6th if there were 21 items: ceiling(0.25 x 21) = 6.
- Type 2 is the same except you average an item with the next in the list if you "land" on a value: 0.25 x 20 = 5 => average the 5th and 6th.
- Type 3 is trickier: if you land exactly mid way between two values, you use the nearest even-ranked value. e.g. 0.25 x 21 = 5.25, use 5; 0.25 x 22 = 5.5, use 6; .... 0.25 x 26 = 6.5, use 6; 0.25 x 27 = 6.75, use 7.
Next, if you are using a large number of data points it does not really matter which type you use. The percentile values assigned to a data-point's ranking, k, in a sample of size n is as follows:
- Type 4: k/n
- Type 6: k / (n+1)
- Type 7: (k-1) / (n-1)
- Type 8: (k - 1/3) / (n + 1/3)
- Type 9: (k - 3/8) / (n + 1/4)
If k and n are large, the values will be close. Even for large n, when k is small there will be large discrepencies - but how often are you in the 0.1th percentile? 1/1000th of the time! Occurrances in the 10th or 90th percentiles should probably be handled on an "exception" basis anyway. Type 4 always assigns the highest percentrank (7 is the lowest at low k, 6 at high k) - absolute difference between them will never exceed 1%.
Why are they different? Generally each is unbiased wrt a particular statistic. 6 is unbiased wrt the mean of the estimate of the quantiles, 7 for the mode, 8 for the median and 9 produces unbiased estimators of the ranking statistics if the underlying distribution is normal.
Since I am using this in conjunction with the empirical distribution function, I use type 4 which is a linear interpolation of the empirical distribution.
ECDF Function
It took me a while to understand that the result of passing a data set to ecdf is a function.ecdf(x)
"x" is a vector of data for which the empirical cumulative distribution function is required. Essentially, it creates a step function to which you can pass variables. The function returns the probablity of that value's occurring given the set of data "x".
For example, passing the data x <- 1:10 to ecdf and passing "3" to the resulting function: ecdf(x)(3) will return 30% as the result. There is a 30% chance of a value of 3 or less occurring based on a sample population {1,2,3,4,5,6,7,8,9,10}.
Another way of showing this that may be clearer is as follows:
x <- 1:10
fn <- ecdf(x)
fn(3)
0.30
Chisq Testing
We can combine the quantiles function and the ecdf fuction to allow us to compare two empirical distributions via a chisq test. In my case I want to determine if a sub-set of a time series differs significantly from the entire series - I would rather analyze the whole series than introduce a look-back parameter.Run a quantile analysis of the entire time series, bearing in mind the size of the sub-set and the requirement for a count of at least 5, and preferably 10, in each bucket of the distribution. For instance, a sub-set of 200 should work with percentiles divided into buckets 5% wide (5, 10, 15, etc):
percentiles <- seq(0.05, 1, by=0.05)
expected.distribution <- quantile(population, probs=percentiles, names=F, type=4)
population contains the data for the entire population
expected.distribution is a vector of the values (or breaks) defining each of the percentiles. e.g. 5% of values selected at random from this population should be less than or equal to the first value, 10% <= the second, etc.
subset.size <- length(observations)
observed.distribution<- ecdf(observations)(expected.distribution)
observed.counts<- subset.size * c(observed.distribution[1],
observed.distribution[-1] - observed.distribution[-length(observed.distribution)])
By passing the breaks from our quantile analysis of the population to the ecdf function generated from the observed data, ecdf(observations), we get back a vector of the cumulative distribution of the breaks. In a perfect world they would match the original percentiles. i.e. the expectation that a random value in the sample is less than or equal to the 5th percentile break in the population distribution should be 5% if the distributions are identical.
We figure the observed counts in each bucket by converting the cumulative density function to a probability density function by subtraction - we have to add back in the first "bucket" by concatenation, then multiply by the total sample size.
expected.counts <- rep(subset.size / (length(percentiles) - 1), length(percentiles) - 1)
chisq.pstat <- chisq.test(observed.counts, p=expected.counts, rescale.p=T)[[3]]
Because our original distribution is in the form of percentiles each bucket has the same expected count, thus we can easily calculate the p-statistic for the chisq test. And there you have it.
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.