We'll discuss the curse of dimensionality and techniques for reducing the dimensionality of your data. The concept is explained with appropriate examples and useful techniques to employ in the field of data science.
What You'll Learn
> The curse of dimensionality
> Impact of dimensionality on data quality
> Importance of dimensionality reduction
The next thing we're going to talk about is what's called the curse of dimensionality. This is sort of a data quality issue, but it's something we have to be careful about when we're doing data processing.
The curse of dimensionality is that as the number of dimensions increases, that is, as the number of columns or attributes in your data set grows, the data inherently becomes increasingly sparse in that space. This matters because, for a lot of different algorithms, definitions of density and of distances between points, our measures of similarity and dissimilarity, are really important, from clustering methods to outlier and anomaly detection, and in a sparse space they all become less meaningful. If you add enough dimensions, every point looks like an outlier.
A great illustration of this: randomly generate 500 points in an n-dimensional space, then compute the difference between the maximum distance between any pair of points and the minimum distance between any pair of points, normalized and with a log taken to make the plot readable. In two dimensions the maximum distance comes out about 10^3.25 times larger than the minimum distance, since the plot uses a log base 10 scale. As we increase the number of dimensions, though, that spread falls off really sharply. By the time we reach 30, 40, or 50 dimensions, the points are so sparse that the minimum and maximum pairwise distances are almost the same. At 50 dimensions the value is only around 10^0.25, roughly the fourth root of 10, a very small spread between the maximum and minimum distances.
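The experiment above can be sketched in a few lines of numpy. This is a minimal reconstruction, not the original plot's exact code; in particular, the normalization `log10((max − min) / min)` is an assumption about how the figure was produced, and the uniform sampling in the unit hypercube is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(d, n_points=500):
    """log10 of the normalized gap between the max and min
    pairwise Euclidean distances among random points in d dims."""
    pts = rng.random((n_points, d))
    # Pairwise squared distances via the identity |a-b|^2 = |a|^2 + |b|^2 - 2ab
    sq = (pts ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (pts @ pts.T)
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative rounding errors
    dists = np.sqrt(d2[np.triu_indices(n_points, k=1)])  # unique pairs only
    return np.log10((dists.max() - dists.min()) / dists.min())

for d in (2, 10, 50):
    print(f"{d} dimensions: spread = {distance_spread(d):.2f}")
```

Running this, the spread at 2 dimensions is orders of magnitude larger than at 50, mirroring the sharp fall-off described above.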
It's really hard to define outliers in such high-dimensional data because, with the space so sparse, every point is an outlier in some way. The solution to this data quality problem is something called dimensionality reduction.
We can do dimensionality reduction via aggregation or other sorts of column combinations. But there are also a number of mathematical techniques. Two of the big popular ones are Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). These are mathematical techniques that run automatically and reduce the dimensionality of your data. PCA typically takes you from n dimensions, however many your data set has, down to a much smaller number, often all the way down to two for visualization.
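As a rough sketch of how the two connect, here is PCA implemented via SVD in plain numpy. The random data, the choice of k=2 components, and the function name are all illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))  # toy data: 100 samples, 10 dimensions

def pca(X, k=2):
    """Project X onto its top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)  # center each column first
    # SVD of the centered data; rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T     # coordinates in the top-k subspace

X2 = pca(X, k=2)
print(X2.shape)
```

The projected data has shape (100, 2), and by construction the first component captures at least as much variance as the second.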
PCA and SVD are closely related, but they aren't exactly the same thing. I'm not going to go into great detail, because we don't spend a lot of time on dimensionality reduction over the course of the Bootcamp, but they are distinct techniques that share the same goal, achieved via different mathematical methods.
Data Science Dojo Instructor - Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.
© Copyright – Data Science Dojo