Required reading


Gay-face case study

All state case study

Recommended reading

Key concepts/skills/etc


“We shouldn’t have to think about the societal impact of our work because it’s hard and other people can do it for us” is a really bad argument.

I stopped doing CV research because I saw the impact my work was having. I loved the work but the military applications and privacy concerns eventually became impossible to ignore.

But basically all facial recognition work would not get published if we took Broader Impacts sections seriously. There is almost no upside and enormous downside risk.

To be fair though i should have a lot of humility here. For most of grad school I bought in to the myth that science is apolitical and research is objectively moral and good no matter what the subject is.

Joe Redmon, 20 February 2020.

Although the term ‘data science’ is ubiquitous in academia, industry, and even more generally, it is difficult to define. One deliberately antagonistic definition of data science is ‘[t]he inhumane reduction of humanity down to what can be counted’ (Keyes, 2019). While purposefully controversial, this definition highlights one reason for the increased demand for data science and quantitative methods over the past decade: individuals and their behaviour are now at the heart of it. Many of the techniques have existed for decades, but what makes them popular now is this human focus.

Unfortunately, even though much of the work focuses on individuals, issues of privacy and consent, and ethical concerns more broadly, rarely seem front of mind. With some exceptions, even as its proponents claim that AI, machine learning, and data science are going to revolutionise society, these issues appear to have been treated as something that would be nice to have, rather than something to think through before we embrace the revolution.

For the most part, these are not new issues. In the sciences, there has been considerable recent ethical consideration around CRISPR technology and gene editing, but similar conversations were had in an earlier time, for instance, about Wernher von Braun being allowed to build rockets for the US. In medicine, of course, these concerns have been front of mind for some time. Data science seems determined to have its own Tuskegee syphilis experiment moment rather than learn from the experiences of other fields and deal appropriately with these issues before they occur.

That said, there is some evidence that data scientists are beginning to be more concerned about the ethics surrounding the practice. For instance, NeurIPS, the most prestigious machine learning conference, now requires a statement on ethics to accompany all submissions.

In order to provide a balanced perspective, authors are required to include a statement of the potential broader impact of their work, including its ethical aspects and future societal consequences. Authors should take care to discuss both positive and negative outcomes.

NeurIPS call for papers, as accessed 26 February 2020.

Ethical considerations will be mentioned throughout this book. The purpose of this chapter is not to prescriptively rule things in or out, but to provide an opportunity to raise some issues that should be front of mind. The variety of data science applications, the relative youth of the field, and the speed of change mean that ethical considerations can sometimes be set aside when it comes to data science. This is in contrast to fields such as science, medicine, engineering, and accounting, where there is a long history. Nonetheless, it can be helpful to think through some ethical considerations that you may encounter in the context of a usual data science project.


Can a question be evil?


Think about how, even with the best of intentions, the question or task that you are working on could be implemented in a way that is unethical. Here I think about the raw incentives that are being faced. Sometimes the problem is obvious. For instance, in the case of the Wells Fargo bank accounts situation, it should have been clear to management that having unachievable goals would result in bad decisions. But in the case of a data science project it can be more difficult to see.

Think about a lending decision and the variables that go into that. If race (and what does that even mean?) is an important variable then we need to think about whether that is a model that we want to be using. (In the case of race, it’s not just an ethical question, but could be a legal one.) Similarly, what about age or sex? Is asking these questions likely to actively harm people? To what extent are we willing to trade off that harm? And are the benefits and costs accruing to different groups? It’s often some particular group that pays, while some other group benefits.
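One way to start asking whether benefits and costs accrue to different groups is simply to compare outcomes by group. The following is a minimal sketch in Python, not a fairness audit: the records are made up, and the group labels `A` and `B` and the function name `approval_rates` are hypothetical. A gap in rates does not by itself show that a model is unethical, but it does show who is, and who is not, being approved.

```python
from collections import defaultdict

# Hypothetical, made-up lending decisions: (group label, approved?)
decisions = [
    ("A", True), ("A", True), ("A", True), ("A", False),
    ("B", True), ("B", False), ("B", False), ("B", False),
]

def approval_rates(records):
    """Return the share of approved applications for each group."""
    approved = defaultdict(int)
    total = defaultdict(int)
    for group, was_approved in records:
        total[group] += 1
        approved[group] += int(was_approved)
    return {group: approved[group] / total[group] for group in total}

rates = approval_rates(decisions)

# Ratio of the lowest to the highest approval rate; 1.0 means equal rates.
rate_ratio = min(rates.values()) / max(rates.values())

print(rates)
print(round(rate_ratio, 2))
```

Whether a given gap is acceptable is not something the code can answer; that remains the ethical, and possibly legal, question raised above.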


There is a naive assumption that if you see numbers in a spreadsheet, they are real somehow. But data is never this raw, truthful input, and it is never neutral. It is information that has been collected in certain ways by certain actors and institutions for certain reasons. For example, there is a comprehensive database at the US federal level of sexual assaults on college campuses – colleges are required to report it. But whether students come forward to make those reports will depend on whether the college has a climate that will support survivors. Most colleges are not doing enough, and so we have vast underreporting of those crimes. It is not that data is evil or never useful, but the numbers should never be allowed to “speak for themselves” because they don’t tell the whole story when there are power imbalances in the collection environment.

Catherine D’Ignazio, as quoted in Corbyn (2020).


There is a general framework for ethical behaviour when it comes to data in the context of medicine and the sciences. This includes:

It’s not PII
There’s no way it’s PII
It was PII

‘A GDPR data privacy haiku’, Lisa Phillips, 13 November 2018.

It’s also the case that when using machine learning, we train it on the past. To use a trivial example, if we wanted to forecast the race and sex of the next US president, it would be a brave model that didn’t suggest it would be a white person, because that is what all of them, apart from one, have been; and it would be a braver model that didn’t suggest it would be a man. Does this mean the model is biased, or that history is? And if we are wanting to make forecasts about the future, then what do we want? Similarly, what would be most accurate?
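The point about training on the past can be made concrete with the simplest possible ‘model’: one that memorises the most common past outcome and always forecasts it. This is a sketch under stated assumptions. The history below is made up (nine outcomes of one kind and one of another, not real counts), and `fit_majority`, `group_x`, and `group_y` are hypothetical names.

```python
from collections import Counter

def fit_majority(history):
    """'Train' by memorising the most common outcome in the past."""
    return Counter(history).most_common(1)[0][0]

# Made-up history: nine past outcomes of one kind, one of another.
history = ["group_x"] * 9 + ["group_y"]

forecast = fit_majority(history)

# How often the forecast would have been right on the historical record.
accuracy_on_past = history.count(forecast) / len(history)

print(forecast, accuracy_on_past)
```

Measured against the past, this model looks accurate, and it will never forecast the minority outcome. Whether that is the behaviour we want from a forecast is the question posed above.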

Finally, consider the source of the dataset. There are many cases where academic work has been done using hacked data, for instance, data from the Ashley Madison breach.



There are basic aspects of constructing the model to consider, including ensuring that it is transparent and reproducible. There may be other concerns, such as ensuring that the model’s results are not the product of p-hacking.
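To see why p-hacking is a concern, it can help to simulate it. The sketch below, in Python, runs many comparisons where there is no true effect and counts how many nonetheless look ‘significant’. Everything here is an assumption made for illustration: the function name `looks_significant`, the sample size, the number of comparisons, and the use of a simple two-sided z-test with a known standard deviation of 1.

```python
import math
import random

def looks_significant(rng, n=50):
    """Compare two groups drawn from the SAME distribution (no true effect)."""
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    diff = sum(a) / n - sum(b) / n
    z = diff / math.sqrt(2 / n)  # standard error when both sds are known to be 1
    return abs(z) > 1.96         # two-sided test at roughly the 5 per cent level

rng = random.Random(0)  # fixed seed so that the simulation is reproducible
n_tests = 1000
false_positives = sum(looks_significant(rng) for _ in range(n_tests))

print(f"{false_positives} of {n_tests} null comparisons looked significant")
```

Roughly one in twenty of these null comparisons should cross the threshold by chance alone. An analyst who tries many specifications and reports only those that ‘work’ is drawing from this same pool of false positives.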

The scale of data science projects means that the potential for abuse, and for different outcomes at large scale, should be considered. While a data science project may be ethical at a small scale, it may be unethical at a large scale. There may also be different ethical concerns when a project is under your control compared with when it is available to others.

Finally, while it may be the case that the person who made the model understands the trade-offs and value-judgements that were part of constructing the model, it is possible that others may not. This may lead to a false confidence in the results of the model because of a false sense of the impartiality of statistics and quantitative methods more generally, and a failure to properly appreciate the assumptions.



People tend to hire people who are similar to them. If this happens in the context of a data science team then what are the sorts of effects that are possible on even the best-intentioned workflow? Should a team that consists only of Australians really be hired by the Maple Leafs to analyse hockey?

Do the decision-makers understand the trade-offs and realities involved in data science to an extent that they can make informed decisions?

Finally, consider the existence of conflicts of interest.

Case-study: Gay face


Case-study: All state


Corbyn, Zoë. 2020. “Catherine D’Ignazio: ‘Data Is Never a Raw, Truthful Input – and It Is Never Neutral’.” The Guardian.