How one scientist is cleaning up the world’s dirty data

Felicity Carter

3 years ago

When Amazon developed an automated recruitment tool, the hope was that an unbiased, logical algorithm could read a CV and identify the best candidates.

The algorithm turned out to be an engine of sexism: it was not only biased towards men’s CVs, but also actively downgraded candidates who had attended one of two women’s colleges in the USA.

The problem was that the tool sought applicants whose CVs resembled those of previously successful job seekers; as most of these were men, the algorithm learned to reject women.

It was stunning proof that algorithms are not neutral. They inherit the biases of the people who program them and of the data they learn from – a problem that computer scientist Professor Shazia Sadiq is acutely conscious of.

Dirty data reinforces biases

Sadiq’s research is focussed on the question of how data underlying the algorithms are collected.

A professor in the School of Information Technology and Electrical Engineering at the University of Queensland, she runs a world-class team within the university’s Data Science research group that is working to improve data quality standards and come up with ways that they can be integrated across areas such as transportation, social media and learning analytics.

Sadiq says her career has been influenced by American cognitive psychologist Herbert Simon, who coined the term “bounded rationality”, a concept which posits that decisions made by humans are never truly rational, because humans always seek ‘good-enough’ answers, rather than the optimal solution. For example, faced with an overwhelming choice of products on a supermarket shelf, a consumer might grab the packet that says “natural” on it in big letters. That consumer might not know what “natural” means in this context, but it sounds like it might satisfy their desire for healthy food; in reality, the word may simply be a meaningless marketing term. This kind of thinking, where decisions are made without considering all the relevant information, has consequences in business, says Sadiq.

“We are bounded by what we are doing, and don’t look at the bigger picture,” she says.

The result, says Sadiq, is that much of the data that underpins high-level processes, such as optimising business protocols and improving digital adoption, is flawed. This is why algorithms, such as the one created by Amazon, can go so awry.

But the problem goes back to how that information is collected in the first place, and packaged and handled by various teams, which – unwittingly or otherwise – assert their own biases, agendas and shortcomings in the process, says Sadiq. “People buy it, companies buy it. As it goes through the pipelines, it’s curated, it’s transformed. Decisions are made.”

Information resilience

The way to disrupt this cycle of bias, says Sadiq, is “information resilience”, which means understanding how information is collected, and identifying all opportunities for bias to be introduced.

Sadiq is working with a team of researchers at an industry transformation research program. Launched in 2020 with support from the Australian Research Council, it is based at the University of Queensland and has input from researchers from Swinburne University in Melbourne, as well as partners from local government, law enforcement and industry.

“All of our partners are going to work on different aspects of information resilience, such as curation at scale and responsible use,” says Sadiq.

The goal is to improve data management and analytics. One challenge that the team would like to solve is streamlining how different government agencies share data. Right now, many agencies cannot work effectively together because they each use a data-management system that has been developed specifically for them. This is particularly problematic when it comes to managing health data – when information is siloed within different hospitals, for example, information that could be vital for community health is unavailable.

“It’s just a spreadsheet somewhere and nobody wants to look at it, because it’s not directly helping them with their patients,” says Sadiq. Being able to collate and analyse large swathes of health data could reveal crucial insights into diseases such as cancer and viral infections.

Mentor and advocate on a mission

Sadiq is also focussed on ensuring that more women feel like they have a place in the information technology sector. “In the 1990s, heaps of girls were interested in computer science,” she says.

Today, girls tend to drop out of STEM subjects around Year 9. And even if they do make it to university, says Sadiq, they face many barriers.

Sadiq has spent many years mentoring young women, encouraging them to ‘get their hands dirty’ with coding. She says she wants technology students to think deeply about what will happen with the data they are collecting.

“Data tells a story,” she says. “That story has to be based on rigour and facts. But it’s also about how you present the data.”

Through her teaching, advocacy and research, Sadiq is helping to transform data itself – from the way it’s collected, analysed, visualised and managed today, to who will collect it in the future, and what questions they will ask of it.

That, hopefully, will lead to data scientists creating algorithms that help businesses optimise their processes, rather than reinforcing biases that already exist.

Article by Felicity Carter