Summary Statistics

r_subheading-Course Description-r_end Summary statistics give a good idea of the overall nature of a data set. These statistics are easy both to calculate and to understand, so they give a quick picture of the overall behavior of a data set. r_break r_break r_subheading-What You'll Learn-r_end • The importance of summary statistics. r_break • What a mode is. r_break • How to calculate different percentiles.


So, I’m going to go through and talk a bit about the kinds of summary statistics we like to use - frequency, counts, mean, and standard deviation. r_break r_break Summary statistics are exactly what they sound like: numbers that summarize properties of the data. Most can be calculated pretty quickly in a single pass through the data, which is very nice. Most of them can be calculated in just about any language you care to use, whether that’s SQL, or R, or Python, or anything else. r_break r_break For categorical data, our most common summary statistics are frequency and mode. The frequency of an attribute value is the percentage of the time that value occurs in the data set. For example, if the attribute is gender, then the value female will occur a bit less than 50% of the time. The value male will occur a bit less than 50% of the time. And something else will occur some small percentage of the time. So we can think of those numbers as being percentages. r_break r_break On the other hand, the mode of an attribute is the most frequent attribute value. In this case, we might have something like marital status - single, married, divorced. Depending on our data set, we may want to know what the most common value is. Do we have mostly single people, mostly married people, or mostly divorced people in our data set? That will change the way we look at the data. r_break r_break Frequency and mode are typically used with categorical data, though they are sometimes useful with continuous data too. More often, when we’ve got continuous attributes, we think in terms of percentiles. For the most part, this is more useful than direct frequency or the concept of mode. r_break r_break Percentiles are pretty simply defined. I have a formal definition here, but the easier way to understand it is by looking at an example.
So, to find your percentile, you count the number of people who have a smaller value than you, and you work out what percentage of the total group that number is. That percentage is your percentile. So if you are the fourth tallest person in a group of 20, that means 16 of the 20 people, or 80%, are shorter than you. And it means that you are at the 80th percentile. And so if your height is 1.85 meters, then 1.85 meters is the 80th percentile height in this group that we care about.
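
The frequency, mode, and percentile ideas described above can be sketched in a few lines of Python. This is only a minimal illustration, not the course's own code: the function names (`frequencies`, `mode`, `percentile_rank`) and the sample data are made up for the example, and the percentile follows the simple "percentage of values strictly smaller" definition used in the talk, which can differ from the interpolated percentiles that statistics libraries report.

```python
from collections import Counter


def frequencies(values):
    """Return each distinct value's share of the data set, as a percentage."""
    counts = Counter(values)
    total = len(values)
    return {value: 100.0 * n / total for value, n in counts.items()}


def mode(values):
    """Return the most frequent value (ties go to the value seen first)."""
    return Counter(values).most_common(1)[0][0]


def percentile_rank(values, x):
    """Return the percentage of values that are strictly smaller than x."""
    smaller = sum(1 for v in values if v < x)
    return 100.0 * smaller / len(values)


# Hypothetical categorical data: marital status for six people.
marital_status = ["single", "married", "married", "divorced", "married", "single"]

# Hypothetical continuous data: 20 heights in meters, arranged so that
# 1.85 m is the fourth tallest (16 shorter heights, then 3 taller ones).
heights = [1.55 + 0.01 * i for i in range(16)] + [1.85, 1.88, 1.90, 1.95]

print(frequencies(marital_status))
print(mode(marital_status))            # married
print(percentile_rank(heights, 1.85))  # 80.0
```

With this sample, "married" makes up half the records and is therefore the mode, and 16 of the 20 heights fall below 1.85 meters, which puts it at the 80th percentile, matching the example in the talk.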

Data Science Dojo Instructor - Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.