Background

I have downloaded a lot of information from the SEC's IAPD website and the NFA's BASIC website about the funds operated by 70 of the largest hedge fund managers, according to the 2015 edition of Institutional Investor's Alpha Hedge Fund 100. My hypothesis is simple: managers adopting similar market strategies (as distinct from trading strategies) will tend to offer similar funds in the marketplace and use similar names for them.
I break the fund names apart into individual words to create a dictionary of "fund words". This is harder than it sounds: there is a ton of clean-up to do, including filtering out meaningless words like brand names, numbers, forms of organization, etc., not to mention the outrageous number of spelling mistakes! For each manager I count the number of funds whose names use each word, and I also total the $AUM associated with each word, based on the AUM of the funds that use it.
For example, Anchorage reports 6 funds with $2.8bn AUM that have the word "CLO" in their names, 13 funds with $1bn AUM that have the word "Credit", and so on. After all the filtering, Anchorage ends up with only about 13 meaningful words in its vocabulary.
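To make the counting step concrete, here is a minimal sketch in Python. It is not my actual clean-up code: the stop-word list, the function names `fund_words` and `build_vocabulary`, and the three fund names with their AUM figures are all made up for illustration, and it assumes a word repeated within one fund name is counted once for that fund.

```python
import re
from collections import defaultdict

# Hypothetical stop list: brand names, forms of organization, and similar filler.
STOP_WORDS = {"anchorage", "fund", "lp", "ltd", "llc", "master", "offshore"}

def fund_words(name):
    """Split a fund name into lower-case words, dropping numbers and stop words."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocabulary(funds):
    """funds: iterable of (fund_name, aum) pairs for one manager.

    Returns two dicts keyed by word: the number of funds using the word,
    and the total AUM of the funds using the word.
    """
    counts = defaultdict(int)
    aum_by_word = defaultdict(float)
    for name, aum in funds:
        # Assumption: a word repeated inside one name counts once for that fund.
        for word in set(fund_words(name)):
            counts[word] += 1
            aum_by_word[word] += aum
    return dict(counts), dict(aum_by_word)

# Made-up funds for illustration only, AUM in $bn.
example_funds = [
    ("Anchorage Credit Opportunities Fund LP", 0.5),
    ("Anchorage CLO 2013-1 Ltd", 0.4),
    ("Anchorage Credit Funding 2 Ltd", 0.25),
]
counts, aum_by_word = build_vocabulary(example_funds)
```

The real work is in the stop list and the spelling fixes, which this sketch does not attempt. The output per manager is just two small dictionaries: word to fund count, and word to $AUM.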
My overall dictionary across all managers in my base case includes 512 words (my cases range from 50 to 1,800 words). With each manager using only a handful of those words, it is likely that any two managers share only a few words in common. The opposite problem also arises: two managers may have very similar vocabularies, but one may have an order of magnitude more $AUM associated with the same words. Traditional distance measures like Manhattan or Euclidean will be dominated by the lack of overlap or by the sheer difference in overall AUM between the managers. This is the problem I have sought to solve.
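A toy calculation shows how the AUM scale problem plays out. The three managers, the six-word dictionary, and the AUM figures below are invented for illustration; the point is only that a manager with the same vocabulary but ten times the AUM lands farther away than a manager with no words in common at all.

```python
import math

def manhattan(a, b):
    """Sum of absolute differences across the word dimensions."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical AUM ($bn) per word over a shared six-word dictionary.
# Word order: credit, clo, opportunities, macro, equity, energy
small_credit = [1.0, 2.0, 0.5, 0.0, 0.0, 0.0]
large_credit = [10.0, 20.0, 5.0, 0.0, 0.0, 0.0]  # same words, 10x the AUM
small_macro = [0.0, 0.0, 0.0, 1.0, 2.0, 0.5]     # no words in common

same_words_l1 = manhattan(small_credit, large_credit)
no_overlap_l1 = manhattan(small_credit, small_macro)
same_words_l2 = euclidean(small_credit, large_credit)
no_overlap_l2 = euclidean(small_credit, small_macro)

print("Manhattan:", same_words_l1, "vs", no_overlap_l1)
print("Euclidean:", round(same_words_l2, 2), "vs", round(no_overlap_l2, 2))
```

Under both measures the two credit managers look less alike than the small credit manager and the small macro manager, which is the opposite of what the fund names suggest.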
I have come up with an approach that appeals to me, and I want to share it. First, I want to look at how we measure distance between observations when the data are categorical. Then I want to show how I think the categorical approach can be combined with numerical data that is sparse.