Now that we’ve got that basic definition and we understand what the attributes of data objects are and the different types of them, we can move on to talking about data quality.

Data quality is one of the most commonly overlooked or rushed steps, particularly among new data scientists. Pieces of it get ignored or skipped because they just don’t seem that necessary. But understanding your data quality problems, and understanding where they could come from, is very important to creating robust models that will actually work in production. You have to know what to expect in order to handle it appropriately.

There are three fundamental questions around data quality, and we have to ask them of every dataset we get: What problems do we have to worry about? How do we detect those problems? What can we do about those problems? Some of your earliest explorations should really be focused on answering these questions. I am going to give you some examples of how we answer each of these three questions and some of the categories of things coming up.

There are three very common kinds of data quality problems: noise and outliers, missing values, and duplicate data. These show up in production all the time. Let’s go through and think about these in this context.
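To make the "how do we detect those problems?" question a little more concrete, here is a minimal sketch of a first-pass check for all three kinds of problems. It is only an illustration under some assumptions of mine: records are a list of dictionaries, a missing value is encoded as `None`, duplicates are exact repeats of a whole row, and outliers are flagged with the 1.5 × IQR rule, which is one common heuristic and not the only option. The function name `quality_report` and the sample rows are hypothetical.

```python
from collections import Counter
from statistics import quantiles


def quality_report(rows, numeric_field):
    """Count missing values, exact duplicate rows, and IQR outliers.

    rows: list of dicts sharing the same keys (assumed record format).
    numeric_field: name of the numeric column to screen for outliers.
    """
    # Missing values: assumed here to be encoded as None.
    missing = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None:
                missing[key] += 1

    # Duplicate data: rows whose full contents repeat an earlier row.
    seen = set()
    duplicates = 0
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            duplicates += 1
        else:
            seen.add(fingerprint)

    # Noise and outliers: flag values more than 1.5 * IQR beyond the quartiles.
    values = [row[numeric_field] for row in rows if row[numeric_field] is not None]
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    outliers = [v for v in values if v < low or v > high]

    return {"missing": dict(missing), "duplicates": duplicates, "outliers": outliers}


# Made-up sample records, purely for illustration.
rows = [
    {"id": 1, "age": 34, "city": "Austin"},
    {"id": 2, "age": 29, "city": None},
    {"id": 3, "age": 31, "city": "Denver"},
    {"id": 3, "age": 31, "city": "Denver"},
    {"id": 4, "age": None, "city": "Boston"},
    {"id": 5, "age": 36, "city": "Austin"},
    {"id": 6, "age": 33, "city": "Miami"},
    {"id": 7, "age": 430, "city": "Denver"},
]

report = quality_report(rows, "age")
print(report)
```

A report like this only answers the first two questions, what problems exist and how to detect them. What to do about each one still depends on the dataset and on where the problem came from.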