Feature Selection


    Course Description

    Feature selection is another useful technique to reduce dimensionality in data. This video introduces the concepts of feature selection and feature creation, and their impact on the overall quality and performance of data.

    What You'll Learn

     Feature selection as a dimensionality reduction technique

     Approaches to feature selection

     How to create a new feature


    All right, so another way to reduce the dimensionality of data, other than just PCA, is to deal with redundant or irrelevant features, which we have a lot of the time. This goes back to the questions about dimensions being independent.

    If we have redundant or irrelevant features, they increase our dimensionality artificially. They contain little to no information, but they still add dimensions, so we want to be very careful about detecting them. A redundant feature, for instance: the purchase price of a product and the amount of sales tax paid on that product. Within a given state, those two are completely connected. You can calculate one from the other. They’re perfectly correlated. As a result, you want to get rid of one of them, because it increases your dimensionality without adding new information. Same thing with irrelevant features. A student’s ID number, the vast majority of the time, is irrelevant to the task of predicting the student’s GPA. And these types of redundant and irrelevant features don’t just harm us via increased dimensionality.
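
    As a rough illustration of the price and sales tax example, here is a minimal sketch that flags pairs of columns that are almost perfectly correlated. The column names, the toy values, the flat 8% tax rate, and the 0.95 threshold are all assumptions made up for this example, and a plain Pearson correlation only catches linear redundancy.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no variation to correlate
    return cov / sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.95):
    """Return pairs of column names whose |correlation| meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]


# Hypothetical toy data: sales tax is a fixed 8% of the purchase price.
prices = [10.0, 25.0, 40.0, 55.0, 80.0]
data = {
    "purchase_price": prices,
    "sales_tax": [0.08 * p for p in prices],
    "units_in_stock": [7, 3, 9, 2, 6],
}

flagged = redundant_pairs(data)
print(flagged)
```

    Only one column from each flagged pair needs to stay; which one to keep is a judgment call about the task.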

    Redundant features effectively weight the same information multiple times. If we have the same information contained in two separate columns and the model thinks both are important, we have double-weighted that information. Similarly, irrelevant features can confuse our model. The model will try to do some fitting based on those features, and that just diffuses the effectiveness of the model. One of the big steps of data pre-processing is figuring out which attributes are redundant or irrelevant and aggressively cutting them out of our data set. And there are a number of different techniques you can use to do this kind of subset selection. You can brute force it: just try all your different feature subsets.
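
    The brute-force idea can be sketched as follows, under some loud assumptions: the data is a made-up toy set, and each subset is scored by the leave-one-out accuracy of a 1-nearest-neighbour classifier, which is a stand-in picked for this example and not a model named in the talk. With n features there are 2^n - 1 non-empty subsets, so this only works for a handful of features.

```python
from itertools import combinations


def loo_1nn_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    that only looks at the given feature names."""
    correct = 0
    for i, row in enumerate(rows):
        best_j, best_dist = None, float("inf")
        for j, other in enumerate(rows):
            if i == j:
                continue  # never compare a row with itself
            # Squared Euclidean distance over the chosen features only.
            dist = sum((row[f] - other[f]) ** 2 for f in features)
            if dist < best_dist:
                best_j, best_dist = j, dist
        correct += labels[best_j] == labels[i]
    return correct / len(rows)


def best_subset(rows, labels, feature_names):
    """Score every non-empty feature subset and return the best one.
    Smaller subsets are tried first, so ties favour fewer features."""
    best, best_score = None, -1.0
    for size in range(1, len(feature_names) + 1):
        for subset in combinations(feature_names, size):
            score = loo_1nn_accuracy(rows, labels, subset)
            if score > best_score:
                best, best_score = subset, score
    return best, best_score


# Hypothetical toy data: label 1 means "high GPA"; student_id is irrelevant.
rows = [
    {"hours_studied": 1, "student_id": 104},
    {"hours_studied": 2, "student_id": 871},
    {"hours_studied": 3, "student_id": 352},
    {"hours_studied": 8, "student_id": 349},
    {"hours_studied": 9, "student_id": 110},
    {"hours_studied": 10, "student_id": 868},
]
labels = [0, 0, 0, 1, 1, 1]

subset, score = best_subset(rows, labels, ["hours_studied", "student_id"])
print(subset, score)
```

    In this toy set, adding the unscaled ID column to the useful one makes the classifier worse, because the ID values dominate the distance. That is the "irrelevant features can confuse our model" effect in miniature.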

    Some of the most popular algorithms actually do feature selection naturally, so that’s always good. Sometimes, you have a filter approach where you use your exploration and what you know about the data set in order to filter out the bad features. And sometimes you can get the data science inception going on where you use a data mining algorithm on your data mining algorithm in order to find the best subset of attributes. But that’s feature subset selection. It doesn’t share a lot. I’m going to move on a little quickly. Please ask questions as they arise. But we’re running a little bit behind, which is great. I love the discussions we’ve had and it’s important. The front half of this presentation is more critical than the back half. But I am going to start increasing the pace a little bit, just as a heads up. So please ask your questions as they come up.
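
    One simple form a filter approach could take, offered only as a sketch: score each feature on its own against the target before any model is trained, and drop the ones that score poorly. The criterion used here (absolute Pearson correlation with the target, cut off at 0.5), the column names, and the toy values are all assumptions for illustration. In practice the filter is often driven by domain knowledge instead.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def filter_features(columns, target, min_abs_corr=0.5):
    """Keep the feature names whose |correlation with target| is high enough."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= min_abs_corr
    ]


# Hypothetical toy data for predicting GPA.
gpa = [2.0, 2.5, 3.0, 3.5, 4.0]
features = {
    "hours_studied": [2, 4, 5, 8, 10],        # clearly related to GPA
    "student_id": [352, 104, 871, 110, 640],  # arbitrary identifier
    "campus_code": [1, 1, 1, 1, 1],           # constant, carries no information
}

kept = filter_features(features, gpa)
print(kept)
```

    A filter like this is cheap because it never trains the downstream model, but it looks at one feature at a time and only sees linear relationships.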

    Another common technique, and this goes with aggregation to a certain extent, is feature creation. We have the curse of dimensionality on the one hand, but other times we don’t have enough features. We don’t have enough information, and there is more information that we could have. So we can extract things, say by combining two columns in order to get new information. For instance, in sales we could determine the tag price from the total amount paid by filtering out the sales tax, which might be important. Other times we have aggregation and things like that with feature construction. And last, and really least, because we don’t do this that much, is mapping data to a new space. Those of you from a scientific background are probably familiar with the Fourier transform, which takes data that is in the time domain and converts it to the frequency domain, which allows you to pick out different pieces of information. We don’t do this kind of transformation that much in data science because it tends to require transforming the entire data object, but it is something to be aware of, to have in the back of your head, because there are times when you really do want to do some sort of massive transformation like this. Particularly in an anomaly detection or time series context, you might want to do things like taking a Fourier transform of your data.
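
    Two of these ideas can be sketched in a few lines, with the caveat that everything concrete here is an assumption: the flat tax rate, the function names, and the synthetic sine-wave signal are invented for the example. The first function recovers a tag price from the total paid; the second computes a plain discrete Fourier transform, written out directly so that it needs no libraries beyond the standard one (a real project would more likely use an FFT routine).

```python
import cmath
import math


def tag_price(total_paid, tax_rate):
    """Feature extraction: recover the pre-tax price from the total paid,
    assuming total_paid = price * (1 + tax_rate)."""
    return total_paid / (1.0 + tax_rate)


def dft_magnitudes(signal):
    """Map a real-valued time series to the frequency domain.
    Returns the magnitude of each frequency bin from 0 up to n // 2."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


# Hypothetical purchase: 108.00 paid in total with an 8% sales tax.
price = tag_price(108.0, 0.08)

# Hypothetical time series: a sine wave with 3 full cycles over 32 samples.
n = 32
signal = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
magnitudes = dft_magnitudes(signal)

# Skip bin 0 (the mean) and find the strongest frequency.
dominant = max(range(1, len(magnitudes)), key=magnitudes.__getitem__)
print(round(price, 2), dominant)
```

    The dominant frequency bin is a new feature that is hard to see in the raw time-domain values, which is the point of mapping to a new space.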


    Data Science Dojo Instructor - Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.