Understand what data quality is and why it matters in data science: good-quality data is essential for building robust models and solving real-world problems.
What You'll Learn
> Data quality and its significance in data science
> Fundamental questions around data quality
> Introduction to major problems in data quality
Now that we’ve got the basic definitions, and we understand what the attributes of data objects are and the different types of them, we can move on to talking about data quality.
Data quality is one of the most commonly overlooked or rushed steps, particularly among new data scientists. Pieces of it get ignored or skipped because they just don’t seem that necessary. But understanding your data quality problems, and understanding where they could come from, is very important to creating robust models that will actually work in production. You have to know what to expect in order to handle it appropriately.
There are three fundamental questions around data quality that we have to ask of every dataset we get: What problems do we have to worry about? How do we detect those problems? What can we do about those problems? You should ask yourself these three questions every time you approach a new dataset, and some of your earliest explorations should really be focused on answering them. I am going to give you some examples of how we answer each of these three questions, and some of the categories of problems coming up.
There are three very common kinds of data quality problems: noise and outliers, missing values, and duplicate data. These show up in production all the time. Let’s go through and think about these in this context.
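To make the second question ("How do we detect those problems?") concrete for these three categories, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy `records` data, the helper names `find_missing`, `find_duplicates`, and `find_outliers`, and the choice of Tukey's 1.5 × IQR rule as the outlier test. It is one reasonable way to run a first check, not the only one, and in practice you would likely reach for a data-frame library instead.

```python
from statistics import quantiles

# Hypothetical toy dataset: each record is (customer_id, age).
records = [
    (1, 34),
    (2, 29),
    (3, None),   # missing value
    (4, 41),
    (2, 29),     # exact duplicate of an earlier record
    (5, 38),
    (6, 420),    # implausible age, likely noise or an outlier
    (7, 31),
]


def find_missing(rows):
    """Return indices of rows that contain at least one None field."""
    return [i for i, row in enumerate(rows) if any(v is None for v in row)]


def find_duplicates(rows):
    """Return indices of rows that exactly repeat an earlier row."""
    seen = set()
    dupes = []
    for i, row in enumerate(rows):
        if row in seen:
            dupes.append(i)
        else:
            seen.add(row)
    return dupes


def find_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule).

    Missing values (None) are ignored. The multiplier k=1.5 is a
    conventional default, not a universal threshold.
    """
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]


missing_rows = find_missing(records)
duplicate_rows = find_duplicates(records)
age_outliers = find_outliers([age for _, age in records])

print("Rows with missing values:", missing_rows)
print("Duplicate rows:", duplicate_rows)
print("Outlying ages:", age_outliers)
```

On this toy data the checks flag the row with the missing age, the repeated `(2, 29)` record, and the age of 420. What to do about each flagged row (drop it, fill it in, or keep it) is the third question, and the answer depends on the dataset and the model you are building.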
Data Science Dojo Instructor - Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.
© Copyright – Data Science Dojo