Data Ethics: Advice of a data scientist. Here’s how to master it

Data Science Dojo
Rebecca Merrett

November 22

As data scientists, we are all in this to pursue the objective truth, or close to it. Check your ethics with these bad data ethics examples.

You’ve probably come across this before:

  • A vendor skews a graph that compares their product with a competitor’s in the market.
  • A survey conveniently shows that respondents almost unanimously agree on an issue.
  • A cosmetic company claims their new “miracle cream” has been “scientifically tested.”

While these examples may seem silly to some, misleading analysis is a genuine issue that often has profound consequences. Ethical concerns arise when data scientists don’t follow good practices when collecting, handling, presenting, and modeling data.

As an aspiring professional in data science, your personal viewpoint should not matter.

Repeat: Your. Personal. Viewpoint. Should. Not. Matter.

As data scientists, we are all in this to pursue the objective truth, or close to it. This is where data ethics comes in. We want to find out and discover things that improve our understanding of the world and the people around us, and to better predict our future.

This is not only a mantra: it’s a way of thinking that every data scientist should adopt to be successful in the role. Your personal subjective viewpoint can get in the way of being a good data scientist.

There’s a saying that your model is only as good as your data. This also means that any conclusions you make about certain groups of people or how the world works depend on whether good, ethical data collection practices were used.

For example, you might come across a model that uses “race” as a heavily weighted predictor variable.

There are two issues with this:

First, the model just so happens to classify everyone of a certain race as a high credit risk applicant for a home loan at a bank. However, a closer look at the actual data shows that most cases come from one racial group, and that all of these cases live in the same part of the city.

How different would the results be if there were a more diverse random sample of cases, across all locations? What if there were many cases of this racial group living in other locations with a good credit history that just didn’t make it into the dataset?

Also, when it comes to classification tasks, if there is extreme imbalance of classes in the dataset, the model will tend to correctly predict the majority class most of the time but will struggle to predict the under-represented classes.
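The effect of class imbalance can be sketched with a few lines of Python. This is a minimal illustration, not a real credit model: the labels, the 95/5 split, and the "always predict the majority class" model are all made-up assumptions. It shows how overall accuracy can look strong while the under-represented class is never predicted correctly.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Share of true `positive` cases that the model actually caught."""
    preds_for_positive = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positive:
        return 0.0
    return sum(p == positive for p in preds_for_positive) / len(preds_for_positive)


# Hypothetical, heavily imbalanced dataset: 95 "low" risk cases, 5 "high" risk cases.
y_true = ["low"] * 95 + ["high"] * 5

# A naive model that always predicts the majority class.
y_pred = ["low"] * len(y_true)

acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive="high")

print(f"accuracy: {acc:.2f}")
print(f"recall on the under-represented class: {minority_recall:.2f}")
```

Reporting a per-class measure such as recall alongside accuracy is one simple way to make this kind of weakness visible before any conclusions are drawn.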

Second, why did the bank decide to place such a heavy weight on the predictor variable “race”? Are the results different when race is not heavily weighted? Was this decision driven by a personal viewpoint, or was there a non-subjective reason behind placing a heavy weight on race?

It could be that the reason behind this decision is purely subjective and skews the results, making any conclusions drawn from the model unreliable.

Bad data ethics examples

Studies that draw conclusions about crime rates among certain ethnic or socioeconomic groups are another example where data ethics is a concern. Why is it that some studies only use data on certain cities and not others? Could it be that crime data from these carefully selected cities is likely to falsely support a subjective viewpoint and lead to wrong conclusions about a group overall?

What ever happened to the good old practice of getting a random sample across the entire population before you even thought about using the data to make conclusions about the entire group?
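As a rough sketch of why this matters, the Python snippet below compares an estimate taken from one hand-picked city with an estimate from a simple random sample of the whole population. Everything here is an assumption made for illustration: the city names, the per-city rates, the record counts, and the fixed random seed are invented and do not come from any real study.

```python
import random

# Hypothetical rates of some recorded outcome per city (made-up numbers).
RATES = {"A": 0.30, "B": 0.05, "C": 0.05, "D": 0.04}
RECORDS_PER_CITY = 1000

# Build the full population as (city, flagged) records.
population = []
for city, rate in RATES.items():
    flagged_count = round(rate * RECORDS_PER_CITY)
    population += [(city, True)] * flagged_count
    population += [(city, False)] * (RECORDS_PER_CITY - flagged_count)


def flagged_rate(records):
    """Share of records where the outcome was flagged."""
    return sum(flagged for _, flagged in records) / len(records)


true_rate = flagged_rate(population)

# Cherry-picked "sample": only the city with the highest rate.
cherry_picked = [record for record in population if record[0] == "A"]
cherry_rate = flagged_rate(cherry_picked)

# Simple random sample drawn across the entire population.
rng = random.Random(42)  # fixed seed so the sketch is repeatable
random_sample = rng.sample(population, 1000)
random_rate = flagged_rate(random_sample)

print(f"whole population:   {true_rate:.3f}")
print(f"hand-picked city A: {cherry_rate:.3f}")
print(f"random sample:      {random_rate:.3f}")
```

In this made-up population, the hand-picked city overstates the overall rate by a wide margin, while the random sample should land close to it.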

Deliberately excluding certain cases from the analysis, without any reason to believe the data is incorrect or inaccurate, is a problem. Wrangling the data to try to prove a viewpoint is another ethical deal breaker.

For example, let’s say you came across a statistical significance test that shows men and women math students are significantly different when it comes to learning mathematics. However, the test is based on all men included in the dataset, and all women excluding a few outliers, with some cases of women merged into one case with their computed average.

This matters because it could result in incorrectly rejecting the null hypothesis of there being no real difference, in favor of the bogus claim that one gender is better at math than the other.
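To see how this kind of wrangling can move a test result, here is a small Python sketch. It assumes made-up scores and uses Welch's t-statistic, computed by hand with the standard library, as a stand-in for whichever significance test such a study might use. Both groups start with identical scores, so the statistic is zero. Dropping a few "outliers" from one group and merging two cases into their average is enough to push it away from zero.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for the difference in means of two samples."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical test scores: both groups are identical by construction.
men = [55, 60, 65, 70, 72, 75, 80, 85, 90, 95]
women = list(men)

t_full = welch_t(men, women)

# Manipulation 1: drop the top scores from one group as "outliers".
kept = [score for score in women if score < 90]

# Manipulation 2: merge two cases (80 and 85) into a single averaged case.
merged = [score for score in kept if score not in (80, 85)] + [mean([80, 85])]

t_manipulated = welch_t(men, merged)

print(f"t-statistic, all cases kept:       {t_full:.3f}")
print(f"t-statistic, after the wrangling:  {t_manipulated:.3f}")
```

With larger samples, a shift like this could be the difference between failing to reject and rejecting the null hypothesis, even though nothing about the underlying students changed.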

The takeaway

In conclusion, these examples of bad data ethics should be front of mind when collecting, cleaning, wrangling, and modeling data, so that our conclusions are not based on false “truth.”

Finally, think about it this way: how would you feel if someone painted a misleading picture of you based on a subjective viewpoint and tried to label it as “fact”?
