We can move on to data set classification.

There are a lot of different types of data sets, and they require different approaches to analysis. The pre-processing steps, the modeling steps, pretty much everything that you do with these different types of data sets is going to be different: the kinds of models you use, the kinds of visualizations you construct, the kind of cleaning that is proper for that kind of data.

Understanding the structure of your data at the beginning is very important if you want to avoid wasting time and producing incorrect results. And it’s in this step, understanding the structure of your data, that things like domain knowledge tend to be very important, but there are still, certainly, categories that tend to be similar no matter what domain they’re in. We’ll talk about three different kinds of data sets (records, graphs, and ordered data sets) in a little more detail coming up here.

Record data is data that consists of a collection of records, each of which consists of a fixed set of attributes. This particular data set, which I use in several places, is record data. Every data object has one Tax ID, a value for whether they asked for a refund, a marital status (whether they’re single, married, or divorced), a taxable income field, and whether they cheated on their taxes or not. So, that’s the structure of this data set. Any data that consists of this kind of collection of records, each with a fixed set of attributes, is almost always represented as a table, whether a database table, a spreadsheet, or something like that, and it’s the most common kind of data. When you talk about data or data sets, this is what a lot of people visualize: record data. It’s your most common and fundamental kind of data set.

Within record data, there are a few useful subsets.
This record data, the tax data, has some categorical values and then one ordinal variable. Tax ID is ordinal, right? Or is it? It’s really more of a nominal variable when you think about it, because ordering doesn’t matter here. Sure, it takes numbers, but a Tax ID of 10 is not meaningfully greater than a Tax ID of 5. There’s no ordering implied. So, Tax ID is a nominal categorical field, tax refund is a categorical field, marital status is too, and taxable income is a continuous field. Most data that you encounter has mixed data types like this. You have some categorical, some numeric, and that’s your traditional type of record data.

If, on the other hand, your record data consists entirely of numeric attributes, so it is entirely continuous, all interval or ratio variables, then we can think of it as a mathematical matrix rather than just a table. So, we would have an “m x n” matrix: there are “m” rows, one for each data object, and “n” columns, one for each attribute. And this is nice because we can think of these data objects as points in a multi-dimensional space where each attribute is represented along one dimension. That allows us to use a number of numeric techniques, specifically ones involving distance, that not only make some algorithms easier but that some algorithms require. There are a number of algorithms that require you to have a data matrix, that is, all-numeric data.
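
To make the difference between mixed record data and a data matrix concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the record values are invented rather than taken from the actual tax data set, and the helper names (`numeric_fields`, `shape`, `euclidean`) are hypothetical. It first picks out which attributes of a mixed-type record set are numeric, then treats a small all-numeric table as an “m x n” matrix and measures the distance between two data objects.

```python
import math

# Mixed-type record data: hypothetical rows, values invented for illustration.
records = [
    {"tax_id": 1, "refund": "yes", "marital_status": "single",
     "taxable_income": 125.0, "cheat": "no"},
    {"tax_id": 2, "refund": "no", "marital_status": "married",
     "taxable_income": 100.0, "cheat": "no"},
]


def numeric_fields(rows, nominal=("tax_id",)):
    """Return attributes whose values are all numeric.

    ID-like fields are skipped: they hold numbers but are nominal.
    """
    return [
        name
        for name in rows[0]
        if name not in nominal
        and all(
            isinstance(row[name], (int, float)) and not isinstance(row[name], bool)
            for row in rows
        )
    ]


# All-numeric record data viewed as an m x n data matrix (hypothetical values):
# one row per data object, one column per attribute.
data_matrix = [
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
]


def shape(matrix):
    """Return (m, n): number of data objects and number of attributes."""
    return len(matrix), len(matrix[0])


def euclidean(a, b):
    """Euclidean distance between two data objects seen as points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(numeric_fields(records))                     # ['taxable_income']
print(shape(data_matrix))                          # (3, 2)
print(euclidean(data_matrix[0], data_matrix[1]))   # 5.0
```

In the mixed record set, only taxable income comes back as numeric, so a distance between two whole records is not defined without first encoding the categorical fields. In the all-numeric matrix, every row is a point and the distance falls out directly.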