When we’ve got real values - and this is sort of a primer for the Bootcamp, a reminder for those of you who’ve been out of math classes for a while - when we’ve got purely continuous data, we will often use Euclidean distance as a way of measuring similarity, or really, as a way of measuring dissimilarity, because it’s higher the more unlike the objects are.

This formula might be a little intimidating to some people, but I promise you that you are familiar with Euclidean distance. You just maybe don’t know the term. Euclidean distance is what you’d hear called the distance formula in your high school algebra classes. Most people have seen it in two dimensions, and sometimes three. But one of the very nice things about Euclidean distance is that it generalizes very naturally to as many dimensions as you want. So, in order to calculate the Euclidean distance between two data objects, we take the difference in each attribute value, square it, sum those squares, and then take the square root.

For instance, we have four points plotted here at (0,2), (2,0), (3,1), and (5,1). And we can construct a distance matrix describing how dissimilar all of our points are. So, point 1 (0,2) and point 4 (5,1) are the most dissimilar. They’re the farthest apart, whereas point 2 (2,0) and point 3 (3,1) are the most similar. They’re the closest together. Point 3 is also fairly similar to point 4 (a distance of 2), whereas point 2 is somewhat less similar to point 4 (a distance of about 3.16).

Another measure that we see, particularly in the context of documents, is called cosine similarity. So, we have documents. We have turned them into term vectors. Cosine similarity is a measure of similarity, not of dissimilarity. We can find how similar two documents are by thinking of each of them as a vector and taking their dot product.
For those of you who never had it or don’t remember your college vector calculus classes, you go attribute by attribute and multiply the values together across your two different objects. So, 3 times 1, 2 times 0, and 0 times 0. Maybe this is play and this is coach and this is tournament. And so, we’ll do our counts, multiply them together term by term, document to document, and sum that all up. And then we end up dividing by the product of the magnitudes. The magnitude of a vector is just: you square each attribute, add them all up, and take the square root. In this case, we have a dot product of 5. We have magnitudes for D1 and D2 of 6.481 and 2.449. We multiply these two together and divide 5 by that, and we end up with a cosine similarity of .315.

Cosine similarity is a really nice metric for documents because, with non-negative term counts, it gives us this very clean 0 to 1 measurement that suffers less from the curse of dimensionality than something like Euclidean distance does. Document vectors tend to get very, very long, because there are a lot of different words in a given language and a given document might have lots of different words in it, so cosine similarity is a way to avoid some of the curse of dimensionality.
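
If you want to check both calculations yourself, here is a minimal Python sketch. The four points are the ones from the Euclidean example. The two term-count vectors `d1` and `d2` are an assumption: the actual vectors aren't given in what was said here, so these were chosen only because they start with the 3, 2, 0 and 1, 0, 0 counts mentioned and reproduce the dot product of 5 and the cosine similarity of .315. The function names are just illustrative.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared differences, attribute by attribute."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Dot product divided by the product of the two magnitudes.

    Note: this divides by zero if either vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(x, y))
    mag_x = math.sqrt(sum(a * a for a in x))
    mag_y = math.sqrt(sum(b * b for b in y))
    return dot / (mag_x * mag_y)


# The four points from the Euclidean example.
points = [(0, 2), (2, 0), (3, 1), (5, 1)]

# Pairwise distance matrix: row i, column j is the distance from point i+1 to point j+1.
distance_matrix = [[euclidean_distance(p, q) for q in points] for p in points]
for row in distance_matrix:
    print("  ".join(f"{d:.3f}" for d in row))

# ASSUMPTION: illustrative term-count vectors, not taken from the talk.
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(f"cosine similarity: {cosine_similarity(d1, d2):.3f}")  # 0.315
```

In the printed matrix, the largest entry is between point 1 and point 4 and the smallest off-diagonal entry is between point 2 and point 3, which matches the reading of the plot above.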