## Converging on Monte Carlo

Inspired by reader Sluggo I am going to take a look at the issue of convergence. I will use the simple Monte Carlo method to estimate pi introduced in my previous post. We are going to briefly explore how the estimate of pi converges on the true value as we increase the size of our trial (the number of random samples we use).

If we know how accuracy varies with the number of trials, we can better design our method to achieve the accuracy we desire without having to iterate; we can better manage the trade-off between accuracy and absorbing computing resources.

I plan to explore the relationship between the accuracy of the estimate of the integral and the fraction of the search space occupied by the integral in a future post.

We are trying to estimate the value of pi by sprinkling, at random, points over a unit square. The unit square contains a quarter-circle of unit radius centered at (0,0). This sets us up so the circle occupies 50% of the square. We sum the points inside the circle (under the curve) and divide by the total number of points to arrive at a fraction, p. This fraction should equal pi/4, so pi=4.p.

Another way to look at this is we are completing the following integration:

\frac{\pi }{4} = \int_{y = 0}^{1} \int_{x = 0}^{1}f\left ( x, y \right ) dx dy \textit{ where }
f(x, y) = \left\{\begin{matrix}
\ \textup{0, } x^{2} + y^{2} > r^{2}\\ \textup{1, } x^{2} + y^{2} \leq r^{2}
\end{matrix}\right.

only we are using an approximation:

\frac{\pi }{4} \approx \frac{1}{N}\sum_{i=1}^{N}f\left (x_{i}, y_{i} \right ) \textup{ where } x_{i},y_{i}} \sim \textit{U}\left [ 0, 1 \right ]

Pretty straightforward!
Each time we drop a point onto the unit square we are completing a Bernoulli trial: the point is either inside the circle or it is not with a probability equal to the ratio of the area of the quarter-circle to the area of the square (p = 1/2 pi/4). It is this probability we are trying to estimate.

If we have sufficient trials and p is sufficiently far away from 0 and 1, then we can use the normal approximation to the binomial distribution in which the standard deviation (the root mean squared error) of the estimate of p is \sqrt{p.\left ( 1 - p \right ) / {N}} where N is the number of trials. So we should expect the error in our estimate of p to decline with the square root of N: to reduce our error by 50% we need to quadruple the size of our sample.

ASIDE: There are various rules of thumb that purport to tell you what "sufficient trials" and "far away from 0 and 1" mean. For example n.p > 5 AND n.(1 - p) > 5. In the end it depends on how accurate you want your final estimate to be. Consider a simulation of 100 trials estimating a fraction of 0.05. If you repeat this simulation over and over the distribution of outcomes will NOT be approximately normal since zero is the lowest number of points possible in the smaller partition, yet it complies with the rule of thumb.

I wrote a script in R to demonstrate these effects. I try out MC simulations with various numbers of trials from 2 to 100,000. For each simulation I estimate the error in the estimate of pi using the normal approximation. I make an alternate estimate of the error using the binomial distribution itself and calculating the span between the 15.87 percentile and the 84.13 percentile (1 sd either side of the mean). I run each MC simulation 1000 times and average out the estimates of the errors. Finally, I calculate the theoretical values for the errors in the estimates of pi for the various numbers of trials.

All is then plotted on log-log charts and we see that, when N is large enough, we get what we expect - a straight line with a -1/2 slope:

The Normal version shows the problem with using the normal approximation for very small trials - the actual error differs widely from the expected error. The Bernoulli version shows a lot of jitter for low number of trials, but theory pretty much agrees with practice.

The jitter is to do with percentiles and low population size. For example, if you only do 2 trials, there are only 3 possible outcomes (both in, both out, one in and the other out). 25% of the time the estimate for pi is zero, 50% it is 2 and 25% it is 4.  So the 15th percentile has a value 2 and the 85th has a value 4 so the estimated error is (4 - 2)/2 = 1!  Since one would never do a MC simulation of this size, don't invest time understanding this part of either chart, it was included for the sake of completeness.

To make use of the normal approximation in order to figure out an appropriate number of trials, we need an estimate of p to start with. Start with a simulation using, say, 1000 trials. This allows us to make an initial estimate of p and determine a confidence interval. Let's say we estimate p to be 0.785, and thus estimate the error in p to be sqrt(0.785 x 0.215 / 1000) = 0.013. This gives us a 95% confidence interval of 0.760 to 0.810. Our error will be worst if the "true" value is 0.76. Further, let's say we want the standard error in the estimate of p to be 0.0025, thus:

\sqrt{\frac{0.76.\left ( 1 - 0.76 \right )}{N}} \leq 0.0025 => N \geq 29,184
This is equivalent to an error of 0.01 in the estimate of pi (pi = 4.p => error in pi = 4 x error in p; 4 x 0.0025 = 0.01). Looking at the charts above we see that to get errors of the order of 10^-2 we need trials in the order of 10^4. We would run our MC simulation using, say, 30,000 trials. Finally we would check that our expected error is indeed within tolerance requirements.

ps: Can someone tell me how best to post an R script for easy download? This would be much easier than posting the script in a text box which requires tons of editing for appearance.

EDIT: strike through above - initially I set the radius sqrt(2/pi) then thought better of it, forgot to change the text.

EDIT: typos, expanded a couple of calculations to make them easier to follow.

EDIT: clarified (I hope) the derivation of the charts.

Jez Liberty said...

Good to see you back on the blog - nice project to look into the LSPM package (you might want to collaborate with Josh, if not already, I met him and he's a very nice chap).
Are you planning to make that project public by the way?

As you probably remember from my blog posts a while ago, I started looking at this package too..

But have put this on hold for now, as the optimization never seemed to converge to any values even when letting it run for thousands of cycles and hours when adding drawdown constraints (it was taking 10mins without the constraints).
Anyway, it is in the "back-burner" now and I need to get back to it, maybe get in touch with Josh about this.

For your R code embed issue, I was going to suggest a plugin that I use (WP-Syntax) but it is only for WordPress. I did google and found the following though:
http://gettinggeneticsdone.blogspot.com/2010/11/syntax-highlighting-rstats-code.html
which hopefully would help you (and as a personal bonus, it led me to find a better plugin for code embed in WordPress!).

Cheers,
Jez

Dekalog said...

I was recently having problems with embedding code on my blog but an answer to my query on stackoverflow at
http://stackoverflow.com/questions/6871000/cannot-enter-long-lines-of-text-for-horizontal-html-scroll-box

BlueEventHorizon said...

Jez,

I didn't plan to leave such a gap in my blog - I really don't know how you manage to do so much. I have a 3 more on MC to go and I know I promised one on outliers in multi-dimensional data sets (e.g. if you have 100 months of returns for 5 systems, which 5 months are the outliers?) and one on helping LSPM converge.

I really would like to get the LSPM one done - I think I can help you increase the probability of convergence! In addition, I would like to write my own version of LSPM as I have some ideas one making it more flexible and more efficient. This is in part why I am writing on MC is to ensure I understand it before moving on to LSPM.

Thanks to both you and Dekalog for your suggestions on presenting R-scripts - I mean it is such a chore to get the line feeds, non-breaking spaces ">", "[", etc presenting correctly!!

Jez Liberty said...

I do spend quite a few hours on my blog indeed! (my wife is quite understanding and this helps... a lot!) ;-)

Looking forward to your other posts in the pipeline...

Jez Liberty said...

ps: another "blog" design suggestion: it would be great if you could have a "subscribe to comments" option so that readers know when there are responses in the discussion they have taken part. Not sure how easy it is wth blogger though..

BlueEventHorizon said...

Jez,

If you are signed in when you make a comment you should get an option to subscribe by email to get a notification of any additional comments. You may only be able to do this with a Google account.

I will experiment a bit using my wife's account to see how the subscription buttons in the right margin work - I am not sure if they only notify you about a new post or if they include new comments.

Jez Liberty said...

Yes - it's probably a google account thing..

The "problem" is that they do it differently from most websites: instead of asking for name, email, url... they only asked for name and url (since you dont give your email, you cannot subscribe...) The only commenting system I am signed up on is disqus but this does not seem supported here (which is why I use name/url).

Pumpernickel said...

I wonder whether your axis labels in the two scatterplots might be misleading? If "Log Number of Trials" equals 1.0e+5 then the number of trials is e (or 10) raised to the power 100,000 . This is an incredibly huge number and I doubt you ran that many trials (!?!)

BlueEventHorizon said...

Hi Pumpernickel,

The format of the axes is standard R presentation for log scale - I agree it is not immediately obvious.

I did indeed run all the way up to 100,000 trials - I ran 10 tests at each order of magnitude spaced by one unit (e.g. order of mag = 3 => tested # of trials: 1000, 2000, ..., 10,000)

The beauty of R is that you vectorize stuff like this and it runs very fast - even on my crappy laptop 100,000 trials for this simple of a simulation is virtually instantaneous. This is the code to generate and test 100000 random x-y coordinates between 0 and 1 to see if the sum of tehir squares is less than 1:

pi.est <- sum((runif(100000, 0, 1)^2 + runif(100000, 0, 1)^2) <= 1) / 100000

Thanks for raising the issue.

Joshua Ulrich said...

Hi Ian,

Please do contact me if you have any questions about the LSPM R package or if you would like to collaborate. I'm very interested in any flexibility and efficiency improvements!

Jez, your issue sounds like the algorithm couldn't find a starting population. If your constraints are strict and the search space is large, the algorithm can have trouble finding a solution inside of your constraints. You could set your starting population to _really_ small f values, or you could use random portfolios that are inside your constraints.

Best,
Josh