How Different Are These Things From One Another (Category & Mixed Data)?

In an earlier post I was looking at distance measures for clustering. In a still earlier post I had referred to analyzing hedge fund regulatory data using clustering to try to put the funds into groups by inferred strategy. I had to solve a problem with clustering that has been bothering me for a while: how do you measure distances between observations when the data is sparse? In my case the problem is further compounded by order-of-magnitude differences in the values for one observation vs. another (a Pareto distribution).


I have downloaded from the SEC's IAPD website and from the NFA's BASIC website a lot of information about the funds operated by 70 of the largest hedge fund managers according to the 2015 version of Institutional Investor's Alpha Hedge Fund 100. My hypothesis is simple: managers adopting similar market strategies (as distinct from trading strategies) will tend to offer similar funds in the marketplace and use similar names for them.

I dismantle all the names of the funds to create a dictionary of "fund words". This is harder than it sounds - there is a ton of clean-up to do including filtering out meaningless words like brand names, numbers, forms of organization, etc., not to mention the outrageous number of spelling mistakes! For each manager I count up the number of times a word appears and also total up the $AUM associated with each word based on the AUM in the fund that uses the word.

For example, suppose Anchorage reports 6 funds with $2.8bn AUM with the word "CLO" in their names, 13 funds with $1bn AUM with the word "Credit", and so on. After all the filtering, Anchorage only ends up with about 13 meaningful words in its vocabulary.
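The word-counting step can be sketched roughly as follows. This is a minimal Python illustration, not the code used for the analysis: the fund names, AUM figures, the stop list, and the function name `word_profile` are all made up for the example, and the real clean-up (spelling mistakes, brand names) is far more involved.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop list: brand names, forms of organization, and so on.
STOP_WORDS = {"anchorage", "fund", "lp", "ltd", "llc", "master", "offshore"}

def word_profile(funds):
    """Return (fund count per word, AUM per word) for one manager.

    `funds` is a list of (fund_name, aum_in_usd_bn) pairs.
    """
    counts = Counter()
    aum_by_word = defaultdict(float)
    for name, aum in funds:
        # Lower-case and keep alphabetic tokens only (drops numbers and punctuation).
        # A set means a word repeated within one name is counted once per fund.
        words = set(re.findall(r"[a-z]+", name.lower())) - STOP_WORDS
        for word in words:
            counts[word] += 1
            aum_by_word[word] += aum
    return counts, dict(aum_by_word)

# Made-up fund names and AUM, for illustration only.
funds = [
    ("Anchorage CLO 2013-1 Ltd", 0.5),
    ("Anchorage CLO 4 Ltd", 0.4),
    ("Anchorage Credit Opportunities Fund LP", 1.0),
]
counts, aum = word_profile(funds)
print(counts["clo"], round(aum["clo"], 2))
```

Each manager then becomes a pair of sparse vectors over the overall dictionary: one of fund counts and one of $AUM per word.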

My overall dictionary across all managers in my base case includes 512 words (my cases range from 50 to 1800 words). So you can see that it is likely that two managers might share only a few words in common. An alternative problem is that the managers have very similar vocabularies, but one may have an order of magnitude more $AUM associated with the same words. Traditional distance measures like Manhattan or Euclidean will be dominated by the lack of overlap or by the sheer overall AUM differences between the managers. This is the problem I have sought to solve.
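A toy illustration of the problem, in Python with invented numbers: three hypothetical managers described by $AUM per word over a three-word dictionary. Two of them use exactly the same words but differ tenfold in size; the third shares no words with either.

```python
import math

# Hypothetical $AUM (bn) per dictionary word: [clo, credit, macro]
small_credit = [2.8, 1.0, 0.0]
large_credit = [28.0, 10.0, 0.0]   # same vocabulary, ten times the AUM
small_macro = [0.0, 0.0, 3.0]      # no words in common with the credit managers

def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Euclidean distances via math.dist (Python 3.8+).
same_words = math.dist(small_credit, large_credit)
no_overlap = math.dist(small_credit, small_macro)
print(round(same_words, 2), round(no_overlap, 2))
print(round(manhattan(small_credit, large_credit), 2),
      round(manhattan(small_credit, small_macro), 2))
```

Under both measures the small credit manager looks closer to the macro manager, with whom it shares nothing, than to the large manager with an identical vocabulary. Size swamps similarity.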

I have come up with an approach that appeals to me, and I want to share it. First, I want to look at how we measure distance between observations when the data are categorical. Then I want to show how I think the categorical approach can be combined with numerical data that is sparse.

Inverse Totient Procedure

When I have time, I enjoy solving the problems at Project Euler. I have solved 177 problems as of today using R as my primary tool. In fact, I found Project Euler when I was looking for problems I could use to learn R. At one point I was third ranked by problems solved in the R listings, but I have since slipped to sixth - the takeaway is that the R-crew are not the cream of the crop on Project Euler!

Leonhard Euler came up with the Totient function (the Totient of n is the number of positive integers up to n that are coprime to n). Not surprisingly, totients feature in a number of the Project Euler problems. One I have been struggling with involves inverting the totient function. This is not straightforward because the function is not one-to-one: several numbers can give rise to the same totient. For example, 3, 4, and 6 all have a totient of 2 (1, 2 are coprime to 3; 1, 3 are coprime to 4; 1, 5 are coprime to 6).
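To make the example concrete, here is a brute-force Python sketch. It is not the recipe this post goes on to describe: it simply searches every candidate, relying on the known bound totient(n) >= sqrt(n/2), which means any n with totient m satisfies n <= 2m^2. The function names are my own, and it is only practical for small m.

```python
from math import gcd

def totient(n):
    """Count the integers k in 1..n with gcd(k, n) == 1 (brute force)."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

def inverse_totient(m):
    """Return every n with totient(n) == m, by exhaustive search.

    Uses the bound totient(n) >= sqrt(n / 2), so any solution has n <= 2 * m * m.
    """
    return [n for n in range(1, 2 * m * m + 1) if totient(n) == m]

print(inverse_totient(2))   # [3, 4, 6]
print(inverse_totient(3))   # [] (no number has a totient of 3)
```

The second call shows the other half of the difficulty: many values, including every odd number above 1, are not the totient of anything.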

I have searched all the usual places for a procedure to invert the totient function. I found answers to specific questions (e.g. if the totient of n is 1000, what is n?). I found academic papers that provide procedures, but I couldn't find a nice simple recipe. So based on what I found, here's one ...

Taleb: Silent Risk, Section 1.3 "Statistics and Risk: Two Different Businesses"

Towards the end of this section, Taleb inserts a sidebar as follows:

Consider the right tail K^{+}\in \mathbb{R}^{+} and the left tail K^{-}\in \mathbb{R}^{-}. Without specifying the support of the distribution:

Definition 1.3 (Probability swamps payoff (thin tails)).

\lim_{K^{+}\rightarrow\infty} E\left[ X\mid_{X> K^{+}} \right] = K^{+} \textup{ and } \lim_{K^{-}\rightarrow-\infty} E\left[ X\mid_{X< K^{-}} \right] = K^{-}

Definition 1.4 (Payoff swamps probability (fat tails)).
\exists \textup{ }\lambda^{+}> 1 \textup{ or }\lambda^{-}> 1\textup{ s.t.}

\lim_{K^{+}\rightarrow\infty} E\left[ X\mid_{X> K^{+}} \right] = \lambda^{+} K^{+} \textup{ and } \lim_{K^{-}\rightarrow-\infty} E\left[ X\mid_{X< K^{-}} \right] = \lambda^{-} K^{-}

N.B: I added the "+" and "-" superscripts to K in the last line where they were missing.

I was curious to look at this behavior for a set of distributions: normal, exponential, gamma, Weibull, Pareto, and Cauchy. I created some functions, provided below, that allow exploration of the effect of each distribution's parameters. I also provide a script that can be used to run a set of examples, one for each distribution. The interesting ones in terms of the sidebar above are Pareto and Cauchy, but the differences between the others are of interest too.
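For three of these distributions the ratio E[X | X > K] / K has a simple closed form, so the behavior in the sidebar can be checked without simulation. The following is a small Python sketch of my own (not the functions referred to above), assuming a standard normal, an exponential with a given rate, and a Pareto with tail index alpha > 1; the parameter defaults are arbitrary.

```python
import math

def normal_ratio(k):
    """E[X | X > k] / k for a standard normal (inverse Mills ratio over k)."""
    pdf = math.exp(-0.5 * k * k) / math.sqrt(2.0 * math.pi)
    survival = 0.5 * math.erfc(k / math.sqrt(2.0))
    return pdf / survival / k

def exponential_ratio(k, rate=1.0):
    """E[X | X > k] / k for an exponential: memorylessness gives k + 1/rate."""
    return (k + 1.0 / rate) / k

def pareto_ratio(k, alpha=1.5, x_min=1.0):
    """E[X | X > k] / k for a Pareto with tail index alpha > 1 and k >= x_min."""
    if alpha <= 1.0 or k < x_min:
        raise ValueError("needs alpha > 1 and k >= x_min")
    # Conditional on X > k the variable is again Pareto with minimum k,
    # so its mean is alpha * k / (alpha - 1) and the ratio does not depend on k.
    return alpha / (alpha - 1.0)

for k in (2.0, 5.0, 10.0):
    print(k, round(normal_ratio(k), 4), round(exponential_ratio(k), 4),
          round(pareto_ratio(k), 4))
```

The normal and exponential ratios drift down toward 1 as K grows, while the Pareto ratio stays fixed at alpha / (alpha - 1), which is always above 1. The Cauchy is left out because its conditional expectation beyond K is not finite.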

I am not sure I understand what is special about the cases where the constant of proportionality, lambda, is greater than 1. Do we say payoff swamps probability if lambda = 1.1? Or 5? It's the transition from "Probability swamps Payoff" to "Payoff swamps Probability" that occurs when lambda > 1 that I do not understand. I am guessing that if a risk manager were to casually assume that his expected loss would approximate K, then he might be in for a shock if it is actually 2.K!

There's a reference to Chapter 4 which I have yet to read - perhaps I will "get it" at that time.

Taleb: "Problems and Inverse Problems" Follow-Up

In my previous post I published a bunch of R scripts that will enable a reader of Taleb's "Silent Risk", Chapter 3, Section 3.2 "Problems and Inverse Problems" to play with the ideas he presents. I thought I should discuss one of the results those scripts produce that does not jibe with Taleb's.

I know from writing blog posts that it is incredibly difficult to be consistent and accurate when giving examples. Ideas evolve as you try to write them up, you try different things and inconsistencies emerge between the text, the code, and the charts.

So, the result that is not consistent has to do with the expected loss. Taleb's book suggests "that close to 67% of the observations underestimate the tail risk below 1%, and 99% for more severe risks". My MC simulations indicate that even at a tail risk below 5%, 99%+ of the observations underestimate the loss.

Taleb: "Silent Risk", Section 3.2 "Problems and Inverse Problems"

Section 3.2 in Chapter 3 of "Silent Risk", a draft of a book by Nassim Nicholas Taleb, defines the "inverse problem" as follows:

Definition 3.4 (The inverse problem).
There are many more degrees of freedom (hence probability of making a mistake) when one goes from a model to the real world than when one goes from the real world to the model.

He then brings the problem into sharper focus:

Principle 3.2 (Visibility of the Generator).
In the real world one sees time series of events, not the generator of events, unless one is himself fabricating the data.

He then goes on to illustrate the problem by posing the following question: if one is provided with a set of data generated by a particular distribution (Pareto, say) and analyzes it assuming a different distribution (Gaussian, say), how severe are the errors likely to be? Below, I provide the R scripts to allow this experiment to be simulated.
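As a rough idea of what such an experiment looks like, here is a minimal Python sketch. It is not a port of the R scripts and not Taleb's exact set-up: the tail index, sample size, number of trials, tail probability, and the choice to compare quantiles are all my assumptions for illustration. Each trial draws Pareto data, fits a Gaussian by sample mean and standard deviation, and asks whether the fitted Gaussian's tail quantile falls short of the true Pareto quantile.

```python
import random
from statistics import NormalDist, fmean, stdev

def underestimate_fraction(alpha=1.5, n_obs=1000, n_trials=200,
                           tail_prob=0.001, seed=1):
    """Fraction of trials where a fitted Gaussian understates the Pareto quantile."""
    rng = random.Random(seed)
    # True quantile of a Pareto with minimum 1: P(X > q) = q ** -alpha = tail_prob.
    true_quantile = tail_prob ** (-1.0 / alpha)
    under = 0
    for _ in range(n_trials):
        sample = [rng.paretovariate(alpha) for _ in range(n_obs)]
        # "Analyze as Gaussian": fit by sample mean and standard deviation.
        fitted = NormalDist(fmean(sample), stdev(sample))
        if fitted.inv_cdf(1.0 - tail_prob) < true_quantile:
            under += 1
    return under / n_trials

print(underestimate_fraction())
```

With a tail index this low the sample standard deviation is driven by the occasional extreme draw, so most samples look far tamer than the generator really is.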