How to Avoid Data Deserts with Your ML Models
Nathan Benaich and Elliot Branson explain how to avoid training your machine learning models with unrepresentative information by putting rigorous evaluation practices in place.
Just as plants need water to grow, AI and machine learning models need data to thrive. Many times, however, those models are fed unrepresentative information and find themselves in a data desert.
"A data desert reflects a situation where your model is being trained on a dataset that doesn't represent the entirety of the population you want your model to serve," said Nathan Benaich, founder and general partner at Air Street Capital, a venture capital firm that invests in AI-first technology and life science companies.
Speaking at a recent online fireside chat with Elliot Branson, director of artificial intelligence and engineering at Scale AI, Benaich offered an example of a data desert: a model for cancer patients across the United States whose dataset includes only people who live in California and New York. “You’re missing the vast majority of the U.S. population,” he said.
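One way to spot such a desert before training is a simple representation audit. Here is a minimal sketch in Python with pandas, assuming a hypothetical patient table with a `state` column and a dict of census population shares; the data, threshold, and function name are illustrative, not something from the chat.

```python
# A minimal sketch of a representation audit (illustrative assumptions).
import pandas as pd

def audit_geographic_coverage(patients: pd.DataFrame,
                              census_share: dict[str, float],
                              tolerance: float = 0.5) -> pd.DataFrame:
    """Flag states whose share of the training cohort falls below
    `tolerance` times their share of the target population."""
    cohort_share = patients["state"].value_counts(normalize=True)
    rows = []
    for state, pop_share in census_share.items():
        share = cohort_share.get(state, 0.0)
        rows.append({
            "state": state,
            "cohort_share": share,
            "population_share": pop_share,
            "underrepresented": share < tolerance * pop_share,
        })
    return pd.DataFrame(rows).sort_values("cohort_share")

# Example: a cohort drawn only from CA and NY leaves the rest of the
# country as a data desert. Population shares here are illustrative.
patients = pd.DataFrame({"state": ["CA"] * 700 + ["NY"] * 300})
census = {"CA": 0.12, "NY": 0.06, "TX": 0.09, "FL": 0.07}
print(audit_geographic_coverage(patients, census))
```

Any state missing from the cohort shows up with a share of zero, making the gap explicit before the model ever trains.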
The notion of representative data is becoming more important, Branson noted:
“We're starting to see a bunch of companies developing imaging models and applying to the FDA for clearance and to Medicare for reimbursement. If you want to scale the impact of these models across jurisdictions, addressing data desert issues is really important.”
Patients Underrepresented in Studies
In their State of AI 2021 report, Benaich and angel investor Ian Hogarth noted that data deserts in biomedical AI research are likely to result in model bias in the clinic, and that AI models often perform poorly on populations not represented in the training data. “It is critical for AI training data to mirror the populations [that the] model[s] are ultimately serving,” the report said.
The report cites a Stanford University study by Amit Kaushal, Russ Altman, and Curt Langlotz. Their work examined 56 studies, published between 2015 and 2019, in which a deep learning algorithm was trained to perform an image-based diagnostic task and benchmarked against human physicians, spanning six clinical disciplines.
The researchers found that, in clinical applications of deep learning across multiple disciplines, algorithms trained on U.S. patient data were disproportionately trained on cohorts from California, Massachusetts, and New York, with little to no representation from the remaining 47 states. California, Massachusetts, and New York may have economic, educational, social, behavioral, ethnic, and cultural features that are not representative of the entire nation, they noted.
Algorithms trained primarily on patient data from these states may generalize poorly, which is an established risk when implementing diagnostic algorithms in new geographies, the researchers said.
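A complementary check on an already-trained model is to break its test metrics out by geography. The sketch below assumes a fitted scikit-learn-style classifier plus hypothetical arrays of features, labels, and each example's home state; it is illustrative, not the researchers' methodology. Uneven scores across states are one symptom of a data desert.

```python
# A minimal sketch of per-geography evaluation (illustrative assumptions).
import numpy as np
from sklearn.metrics import roc_auc_score

def score_by_state(model, X, y, states):
    """Report AUROC separately for each state in the test set.

    `states` is a NumPy array of state codes aligned with X and y;
    `model` is any fitted classifier exposing predict_proba.
    """
    scores = {}
    for state in np.unique(states):
        mask = states == state
        if len(np.unique(y[mask])) < 2:
            continue  # AUROC is undefined when a subgroup has one class
        probs = model.predict_proba(X[mask])[:, 1]
        scores[state] = roc_auc_score(y[mask], probs)
    return scores
```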
“Advances in machine learning, specifically the subfield of deep learning, have produced algorithms that perform image-based diagnostic tasks with accuracy approaching or exceeding that of trained physicians,” the researchers wrote.
“Despite their well-documented successes,” they continued, “these machine learning algorithms are vulnerable to cognitive and technical bias, including bias introduced when an insufficient quantity or diversity of data is used to train an algorithm.”
Benaich cited another example involving gene expression datasets, where researchers found certain ethnicities significantly underrepresented and the data skewed toward particular genders and ages. “That causes problems because there’s a huge amount of human diversity that these models are not trained to process and learn from,” he said.
Datasets for People with Disabilities
In an attempt to water some existing data deserts, some companies are building the missing datasets themselves. Microsoft, for example, has launched its Seeing AI app to help blind people better “see” things around them. Not only can the software identify generic objects, but it can also recognize objects specific to an individual.
According to Microsoft, until recently there hasn’t been enough data to train a machine learning algorithm for personalized object recognition for people who are blind or who have low vision. However, City, University of London, through Microsoft’s AI for Accessibility program, has launched its Object Recognition for Blind Image Training (ORBIT) project to create a public dataset from scratch from videos submitted by people who are blind or who have low vision. That dataset can be used to identify personal objects, as well as generic ones.
“Without data, there is no machine learning,” Simone Stumpf, senior lecturer at the Centre for Human-Computer Interaction Design at the university and the ORBIT lead, said in a Microsoft blog. “And there’s really been no dataset of a size that anyone could use to introduce a step-change in this relatively new area of AI,” she added.
Microsoft’s blog explained that researchers and developers working on intelligent solutions that can assist people with disabilities in everyday tasks have been hampered by a dearth of machine learning datasets that represent that community.
The blog also pointed out several areas where poor datasets could be harmful to people with disabilities. For example, if a self-driving car’s training data doesn’t include people in wheelchairs as objects to be avoided, or doesn’t account for how much longer it takes an older person to cross the street than a younger one, an accident could occur.
Make Sure to Vary Your Data
When designing AI models, Benaich said, developers need to adopt rigorous evaluation practices to make sure models aren’t trained and tested on overlapping or overly similar datasets.
“It’s all about robustness and generalizability,” he said. “If there's a lack of generalizability, it could be because of a data desert.”
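In scikit-learn terms, one common way to enforce this separation is a group-aware split such as GroupKFold, which holds out entire collection sites (or states) so the model is always evaluated on geographies absent from its training folds. The sketch below uses synthetic, purely illustrative data:

```python
# A minimal sketch of group-aware evaluation with GroupKFold:
# each fold tests on sites the model never saw during training.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                                # stand-in features
y = rng.integers(0, 2, size=1000)                              # stand-in labels
sites = rng.choice(["CA", "MA", "NY", "TX", "FL"], size=1000)  # collection site

model = RandomForestClassifier(random_state=0)
scores = cross_val_score(model, X, y, groups=sites, cv=GroupKFold(n_splits=5))
print(scores)  # a large spread across folds suggests poor generalizability
```

A random split, by contrast, would scatter every site across both training and test sets, letting site-specific quirks leak into the evaluation and masking exactly the generalizability failures Benaich describes.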
Just as a focus on accessibility has made many websites more useful to those with disabilities, the increased focus on collecting and augmenting previously sparse data will make AI helpful to all.
See the full fireside chat with Elliot Branson and Nathan Benaich: