How Different Are These Things From One Another?

In order to "cluster" observations we need some measure of how alike or different they are. This leads naturally to the concept of a distance between observations. In my last post about clustering I stated the following:
An observation is a vector of values, not necessarily of the same type, associated with the object which is to be clustered. They might be of the following types:
  • Numerical: e.g. 4'8", 6'4", 5'10" - if we placed values on a scale we could visualize distance.
  • Ordinal: e.g. 1st, 2nd, 3rd - the ordering matters (e.g. 1st is closer to 2nd than to 3rd, but we don't know anything about how much closer).
  • Binary: True / False - the feature is either there or it is not.
  • Categorical: e.g. red, blue, green - there is no ordering to the categories.
Note that datasets may include different data types, complicating the distance calculation.
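
To make the mixed-type complication concrete, here is a minimal sketch of one common workaround: score each feature on a 0-to-1 scale and average the scores (the idea behind Gower-style dissimilarity). Everything in it is an illustrative assumption rather than a method from the original post: the function name `mixed_distance`, the example observations, and in particular the treatment of ordinal values as equally spaced ranks, which glosses over the point made above that we do not know how much closer 1st is to 2nd than to 3rd.

```python
from typing import Optional, Sequence


def mixed_distance(
    a: Sequence,
    b: Sequence,
    kinds: Sequence[str],
    ranges: Sequence[Optional[float]],
) -> float:
    """Average per-feature dissimilarity, each feature scaled to [0, 1]."""
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind in ("numerical", "ordinal"):
            # Absolute difference divided by the feature's range.
            # Assumption: ordinal ranks are treated as equally spaced.
            total += abs(x - y) / rng
        else:
            # Binary or categorical: 0 if the values match, 1 otherwise.
            total += 0.0 if x == y else 1.0
    return total / len(kinds)


# Hypothetical observations: (height in inches, finishing place, flag, colour)
kinds = ["numerical", "ordinal", "binary", "categorical"]
ranges = [20.0, 2.0, None, None]  # 76 - 56 inches; 3rd place - 1st place

obs_a = (56, 1, True, "red")    # 4'8"
obs_b = (76, 3, True, "blue")   # 6'4"
obs_c = (70, 2, False, "red")   # 5'10"

print(mixed_distance(obs_a, obs_b, kinds, ranges))
print(mixed_distance(obs_a, obs_c, kinds, ranges))
```

Averaging gives every feature equal weight, which is itself a modelling choice; weighting the features differently would change which observations look "close".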

Let's dig into the concept of "distance" a little more ...

Which of these things is like the others?


The problem of categorization presents itself in many aspects of life. When marketers talk of "market segmentation" they essentially mean: "what categories of customer are there?" When investors talk of "asset classes" they essentially mean: "what categories of investment are there?"

In the past, when facing the problem of categorization, I would typically look at the universe of things I needed to categorize and make fairly arbitrary judgements to put them into categories. Cluster analysis seeks to rationalize the process by processing the data with an open mind: it is an unsupervised learning process.

So what is a cluster? It is a group of observations that are similar to each other and dissimilar to observations in other clusters.

Let's start with some basics ...

RSelenium on a Mac

I have been doing research on the largest hedge funds by ploughing through their regulatory filings. The research itself will be the basis of some future posts. The manual process was so time-consuming that I decided to look into automating it using tools available in R. Basically, I wanted to set up a file with a list of firms and have a script run through each firm and download the webpage containing the filing for that firm and parse it for the information I want.
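
The loop described above (take a list of firms, download each firm's page, parse out the fields of interest) can be sketched roughly as follows. This is only an illustration in Python using the standard library, not the R and RSelenium code the post goes on to discuss. The names `scrape_firms`, `url_for` and `TableCellParser` are hypothetical, and the assumption that the wanted values sit in `<td>` table cells is mine; real filing pages may be laid out quite differently.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List
from urllib.request import urlopen


class TableCellParser(HTMLParser):
    """Collects the text of every <td> cell on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.in_cell = False
        self.cells: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self.in_cell:
            self.cells[-1] += data


def fetch_page(url: str) -> str:
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_firms(
    firms: Iterable[str],
    url_for: Callable[[str], str],        # hypothetical: maps a firm to its page URL
    fetch: Callable[[str], str] = fetch_page,
) -> Dict[str, List[str]]:
    """Fetch each firm's page and return its table-cell text, keyed by firm."""
    results: Dict[str, List[str]] = {}
    for firm in firms:
        parser = TableCellParser()
        parser.feed(fetch(url_for(firm)))
        results[firm] = [cell.strip() for cell in parser.cells]
    return results
```

Passing `fetch` in as a parameter means the parsing step can be tried against saved HTML before pointing the script at a live site. Pages that build their content with JavaScript will not yield much to a plain download like this, which is the sort of case a browser-driving tool such as Selenium handles.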

This led me into the fascinating world of web scraping.

Heeeeere's Johnny!

I am moving on from The Bornhoft Group so I am now able to recommence my blog.

My current thinking is to continue with various systematic trading related topics that are of interest to me. I am going to try to build up a bit more of a following than I had in the past - perhaps I will get some suggestions on topics to review.

I am also considering my options career-wise, and I hope that some of my scribblings may lead to productive and interesting assignments.


A good friend of mine, Anthony Garner, and an associate of his, Andreas Clenow, have set up a great traders' community at tradersplace. There are interesting papers, respectful discussion and a wealth of other useful resources. You may recognize some folks from the TradingBlox community among the many professional traders; you may even spot a turtle.

I highly recommend it!