Data Attributes

    Course Description

    We'll discuss data attributes and the ways they are classified. There are various types of data attributes based on their usage and usefulness. This video serves as a brief introduction to all such types. 

    What You'll Learn

     > Different properties of data attributes

     > Discrete and continuous classes of data attributes

     > Examples and descriptions of discrete and continuous data attributes

    So, we have objects and we have attributes.

    Each attribute has a set of values that the objects can draw from. So, each object is defined by a set of attribute values, and each attribute can be thought of as being defined by the set of values it can hold. The same attribute can be mapped to different sets of values: height can be measured in meters or feet, and temperature can be measured in Celsius, Kelvin, or Fahrenheit. And different attributes will often be mapped to the same set of values: ID numbers and age are both usually given as integer values, while temperature and height are both often given as floating-point (decimal) values.
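
    As a rough sketch of that idea, here is one object whose temperature and height are re-expressed in other units. The record, its field names, and the helper functions are made up for illustration; only the conversion formulas are standard.

```python
# Hypothetical object: the field names and values are illustrative only.
person = {"id": 1042, "age": 37, "height_m": 1.80, "temp_c": 37.0}


def celsius_to_kelvin(c: float) -> float:
    """Same attribute (temperature), different value set (Kelvin)."""
    return c + 273.15


def celsius_to_fahrenheit(c: float) -> float:
    """Same attribute (temperature), different value set (Fahrenheit)."""
    return c * 9 / 5 + 32


temp_k = celsius_to_kelvin(person["temp_c"])
temp_f = celsius_to_fahrenheit(person["temp_c"])
height_ft = person["height_m"] / 0.3048  # one foot is 0.3048 meters

# Different attributes, same kind of value set:
# "id" and "age" are both integers, "height_m" and "temp_c" are both floats.
print(type(person["id"]).__name__, type(person["age"]).__name__)
print(type(person["height_m"]).__name__, type(person["temp_c"]).__name__)
```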

    The properties of our attributes can also be different. Height, for instance, has a pretty practical maximum and minimum value, as does something like age, whereas an ID number has no real limit; it’s whatever the people who created the dataset define it to be. That gets into an interesting question: “Who defines what value set a given attribute uses?” And the answer to that is, essentially, “we do.” The people who create the dataset, the people who hand us the data, the data engineers, or the Twitter API (or other APIs) that we’re calling in order to get the data will have some definition of it, but we can set that ourselves too. We can change our attributes to be mapped to different sets of values, and we’ll use that in a variety of places. So, we know that we have these attribute values.
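
    One way to picture remapping an attribute to a different value set is to bin age into a few labeled groups. This is only a sketch: the function name, the cut-off ages, and the 0 to 130 range are assumptions chosen for the example, not something the dataset or this course prescribes.

```python
def age_group(age: int) -> str:
    """Map an integer age onto a smaller, hand-chosen value set.

    The bounds and cut-offs below are illustrative assumptions.
    """
    if age < 0 or age > 130:  # assumed practical minimum and maximum
        raise ValueError(f"age out of expected range: {age}")
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"


ages = [4, 17, 18, 42, 65, 90]
groups = [age_group(a) for a in ages]
print(groups)
```

    The original integer values are still available; we have simply chosen a second value set for the same attribute.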

    It’s useful to talk about attributes as being part of different classes, different types of attributes that we’re going to end up having to handle differently, as we get into the actual data mining and modeling processes. There are two fundamental types of attributes: discrete attributes and continuous attributes.

    Discrete attributes have either a finite or countably infinite set of values. For those of you who don’t know, the term “countably infinite” basically means the values can be matched up one-to-one with the integers. If you can turn your attribute into integers, then it’s countably infinite, or finite if you’ve got only a limited set of integers. Good examples of these are zip codes, click counts, and word counts in a collection of documents. We could, in theory, have as many clicks as we want, so the set of values is countably infinite, but the values are always going to be integers. Usually, we represent these as integer variables. Binary attributes are a special case of discrete attributes that we end up having to handle differently in some cases. Binary attributes have only two values. We might call those yes or no, dead or alive, 1 or 0. In some contexts, we really like them because they make things easier. In other contexts, they can be problematic, as pretty much everything can.
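
    To make the discrete case concrete, here is a small sketch that computes word counts over a toy collection of documents and encodes a two-valued attribute as 1 or 0. The documents, the status labels, and the variable names are invented for the example.

```python
from collections import Counter

# Toy document collection (illustrative data only).
documents = ["the cat sat on the mat", "the dog sat"]

# Word counts are a discrete attribute: each value is a non-negative integer.
word_counts = Counter(word for doc in documents for word in doc.split())
print(word_counts["the"], word_counts["sat"], word_counts["dog"])

# A binary attribute has only two values; here we map them to 1 and 0.
status = ["alive", "dead", "alive"]
is_alive = [1 if s == "alive" else 0 for s in status]
print(is_alive)
```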

    The other big class of attributes that we see is continuous attributes. In this case, we have real numbers as our attribute values; there’s no limitation to just integers. So, temperature, height, weight, oxygen level, and taxable income all have real numbers as their attribute values. They can, theoretically, take any real value. Now, in practice, of course, we have to put these things into a computer, and computers can only measure and represent a finite number of digits. So, generally speaking, these attributes are represented as floating-point variables. Floating-point variables, for those of you who are farther out from your learning of programming, are essentially variables that hold an approximation of a real number with a decimal part, the floating point being the decimal point in the number.
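
    The finite-digits point shows up quickly in practice. As a minimal illustration using Python's built-in float, sums of decimal fractions are stored only approximately, so continuous values are usually compared with a tolerance rather than with exact equality. The variable names are just for the example.

```python
import math

total = 0.1 + 0.2

# Exact comparison fails because 0.1 and 0.2 have no exact binary representation.
exactly_equal = total == 0.3
print(exactly_equal)  # False

# Comparing within a tolerance is the usual workaround.
close_enough = math.isclose(total, 0.3)
print(close_enough)  # True
```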

     Data Science Dojo Instructor - Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.