Histograms & Box Plots

Course Description

Histograms and box plots are among the most common and widely used data visualization techniques. This video introduces the basic concepts behind creating and interpreting histograms and box plots, and highlights the usefulness of both along the way.

What You'll Learn

• Introduction to histograms and box plots.
• Importance of histograms and box plots in data visualization.


I’m going to take a quick run through a couple of different visualization techniques right now, different types of graphs.

One of the most common and popular types of visualization is a histogram. Histograms show the distribution of values of a single variable. We divide the values into bins and then count the number of objects in each bin, and the height of a bar on our graph indicates the number of objects in a given bin. One important thing about a histogram is that its shape depends on the number of bins you use. You usually have to experiment with different numbers of bins to extract the most interesting information.

So, here we see two histograms of the petal width of some data set of flowers, drawn with different bin widths. It’s actually from that iris data set we touched on briefly earlier. We can see more clearly in the second than in the first that we have two very clear spikes, maybe a third little spike here, and then sort of a long messy tail over on this side.

You can also construct two-dimensional histograms that show the joint distribution of two different attributes. So, here we divide both petal width and petal length into bins, and we count the number of objects that fall into each combination of a petal width bin and a petal length bin. That count gives us the height of the bar. Two-dimensional histograms are really nice for exploring correlations between different attributes.

Another very common visualization technique is the box plot. The box plot displays the distribution of data. We’ve got a little box here where the edges of the box are the 25th and 75th percentiles. The median, or the 50th percentile, is shown as a bar in the middle. Then we show the 10th and 90th percentiles as the tails extending below and above the box. And if there are any outliers, meaning points a certain distance past the 10th and 90th percentiles, we’ll mark them explicitly.
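To make the pieces of a box plot concrete, here is a minimal sketch in plain Python of the numbers the plot is drawn from. It is an illustration, not the method used to produce the figures in the video. The function names, the linear-interpolation rule for percentiles, the `margin` parameter, and the sample values are all assumptions made for this example. In particular, the talk only says outliers lie "a certain distance" past the 10th and 90th percentiles, so that distance is left as a parameter here.

```python
def percentile(sorted_values, p):
    """Return the p-th percentile (0-100) of already-sorted values.

    Uses linear interpolation between neighbouring values, which is
    one common convention (an assumption; other conventions exist).
    """
    if not sorted_values:
        raise ValueError("need at least one value")
    rank = p * (len(sorted_values) - 1) / 100
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + fraction * (
        sorted_values[upper] - sorted_values[lower]
    )


def box_plot_summary(values):
    """Return the five percentiles described above, keyed by percentile."""
    ordered = sorted(values)
    return {p: percentile(ordered, p) for p in (10, 25, 50, 75, 90)}


def beyond_tails(values, summary, margin=0.0):
    """Return values lying more than `margin` past the 10th/90th percentiles.

    `margin` is a hypothetical parameter standing in for the unspecified
    "certain distance" that qualifies a point as an outlier.
    """
    low = summary[10] - margin
    high = summary[90] + margin
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    # Made-up measurements, not taken from the iris data set.
    sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
    summary = box_plot_summary(sample)
    print("box edges:", summary[25], summary[75])
    print("median:", summary[50])
    print("tails:", summary[10], summary[90])
    print("candidate outliers:", beyond_tails(sample, summary))
```

The box itself only needs the 25th, 50th, and 75th percentiles. The tails and the outlier rule are where plotting conventions tend to differ, so it is worth checking which convention a given tool uses.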
For instance, here’s an example using that iris data again: sepal length, sepal width, petal length, and petal width, each shown as its own box plot. We’ve got the values, in centimeters, on the left side, and then each attribute has its own distribution. We can see that the sepal measurements are pretty well clustered together. Petal length is all over the place, and petal width is a little less all over the place. So box plots are a very easy, very good way of visualizing that kind of distribution.
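Going back to the histogram discussion, the binning step can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the code behind the figures in the video. It uses equal-width bins spanning the minimum to the maximum value, the function names are hypothetical, and the sample values are made up rather than drawn from the iris data.

```python
def bin_index(value, low, high, num_bins):
    """Map a value to one of num_bins equal-width bins spanning [low, high]."""
    if high == low:
        return 0  # all values identical: everything lands in the first bin
    index = int((value - low) / (high - low) * num_bins)
    return min(index, num_bins - 1)  # the maximum value goes in the last bin


def histogram_counts(values, num_bins):
    """Count the number of values in each bin (the bar heights)."""
    low, high = min(values), max(values)
    counts = [0] * num_bins
    for value in values:
        counts[bin_index(value, low, high, num_bins)] += 1
    return counts


def histogram_2d_counts(xs, ys, x_bins, y_bins):
    """Count the objects falling into each (x bin, y bin) combination."""
    x_low, x_high = min(xs), max(xs)
    y_low, y_high = min(ys), max(ys)
    grid = [[0] * y_bins for _ in range(x_bins)]
    for x, y in zip(xs, ys):
        row = bin_index(x, x_low, x_high, x_bins)
        col = bin_index(y, y_low, y_high, y_bins)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    # Made-up values with two clusters, not real petal measurements.
    values = [1, 1, 2, 2, 2, 3, 7, 8, 8, 9]
    for num_bins in (2, 4, 8):
        print(num_bins, "bins:", histogram_counts(values, num_bins))
```

Running the same data through 2, 4, and 8 bins shows why the bin count matters: with few bins the two clusters blur into two broad bars, while more bins expose the empty gap between them.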

Data Science Dojo Instructor - Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.