Why You Should Shift from Model-Centric to Data-Centric AI

Better AI requires working more on the data and less on the code, says Landing AI’s Andrew Ng.

John P. Mello Jr.

Data, and not code, should be at the heart of artificial intelligence development, said Landing AI founder and CEO Andrew Ng. The idea is to use continuously updated data to create and train AI systems, instead of using stagnant data that rarely if ever changes and then tweaking the code for performance.

Ng, a Stanford University adjunct professor of computer science, said a major shift is needed to achieve AI's fullest potential. Ng’s AI experience spans more than 20 years and includes founding Google Brain, building Baidu's AI group into a team of several thousand people, and, most recently, founding Landing AI, which provides AI-powered SaaS products to companies.

By placing data at the center of AI development, Ng said, tools can be developed to engineer data that, in turn, improves the code's performance. It's easier to get an algorithm to do what you want it to do if you feed it high-quality data rather than mixed-quality data.

At TransformX 2021, an AI and machine learning conference sponsored by Scale AI, Ng delivered a keynote address about the benefits of data-centric AI. Here are key takeaways from the session.

Data-Centric Tools Needed

Ng explained that many experts have already been intuitively working with data-centric AI. Now is the time to identify and codify the principles behind this concept, he said, so developers and others can build on them and the practice can proliferate. Once those principles are established, engineers will need tools they can use to apply those concepts systematically and automatically, Ng said. For example, it would be useful to have a tool that allows data labelers to continuously compare the differences in how they're labeling data until they can reach a consistent set of labeling instructions.

Improving the quality of data in that way takes the onus off an application to make sense of data that’s been imprecisely labeled.
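As a rough illustration of the kind of comparison such a tool might make, the sketch below computes how often each pair of labelers agrees on the items they both labeled. The function name `pairwise_agreement`, the labeler names, and the sample labels are all hypothetical; Ng did not describe a specific implementation.

```python
from itertools import combinations


def pairwise_agreement(labels_by_labeler):
    """Return the fraction of shared items on which each pair of labelers agrees.

    labels_by_labeler maps a labeler name to a dict of {item_id: label}.
    This is an illustrative helper, not an API from any real labeling tool.
    """
    rates = {}
    for a, b in combinations(sorted(labels_by_labeler), 2):
        labels_a, labels_b = labels_by_labeler[a], labels_by_labeler[b]
        # Only compare items that both labelers have actually labeled.
        shared = [item for item in labels_a if item in labels_b]
        if not shared:
            continue
        matches = sum(labels_a[item] == labels_b[item] for item in shared)
        rates[(a, b)] = matches / len(shared)
    return rates


# Made-up sample data: three labelers, four images.
labels = {
    "labeler_1": {"img1": "chip", "img2": "scratch", "img3": "chip", "img4": "none"},
    "labeler_2": {"img1": "chip", "img2": "chip", "img3": "chip", "img4": "none"},
    "labeler_3": {"img1": "scratch", "img2": "scratch", "img3": "chip", "img4": "none"},
}

rates = pairwise_agreement(labels)
for pair, rate in rates.items():
    print(pair, rate)
```

A pair with a low agreement rate is a signal that the two labelers are reading the instructions differently and that the instructions may need to be tightened.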

Quality Is Paramount for Small Datasets

With consumer applications, where a dataset may contain millions of images, you can include noisy data with confidence that it will be averaged out by your algorithm. For industrial applications, though, datasets are much smaller, sometimes just 50 or 100 items, so quality is much more important, as are data-centric tools that improve data quality and ensure an application’s optimal performance. Having more low-quality data isn’t better than having less high-quality data, Ng said.

Ng offered five tips for better data-centric AI development:

Make labels consistent.
Use consensus labeling to spot inconsistencies.
Clarify labeling instructions with documentation.
Toss out noisy examples.
Use error analysis to focus on improvement.

When reviewing visual data to be fed into an algorithm, labelers’ evaluations of data can vary, Ng said. One labeler’s chip might be another’s scratch in a piece of furniture, for instance. Reconciling these inconsistencies improves the quality of the data.
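One simple way to spot such inconsistencies is to have several labelers label the same items and flag every item where they do not all agree. The sketch below is a minimal version of that idea using a majority vote; the function name `consensus`, the item IDs, and the votes are invented for illustration.

```python
from collections import Counter


def consensus(votes_by_item):
    """Summarize labeler votes per item with a simple majority vote.

    votes_by_item maps an item ID to the list of labels it received.
    Ties are broken by whichever label was seen first, which is an
    arbitrary choice made only to keep this sketch short.
    """
    results = {}
    for item, votes in votes_by_item.items():
        counts = Counter(votes)
        label, top_count = counts.most_common(1)[0]
        results[item] = {
            "label": label,
            "agreement": top_count / len(votes),
            "unanimous": len(counts) == 1,
        }
    return results


# Made-up votes from three labelers on two furniture images.
votes = {
    "table_01": ["chip", "chip", "scratch"],
    "table_02": ["scratch", "scratch", "scratch"],
}

results = consensus(votes)
# Items without unanimous agreement are candidates for review.
flagged = [item for item, summary in results.items() if not summary["unanimous"]]
print(flagged)
```

The flagged items are the ones worth discussing, since they show where the definitions of "chip" and "scratch" need to be pinned down.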

Ng said that having an imperfect standard is better than having no standard at all. Standards make life much easier for a learning algorithm and allow it to perform better.

After ironing out inconsistent labels, be sure to document the final decision in your labeling instructions. These instructions should include illustrations of the concepts around the labels. Such examples can make it easier for new labelers to consistently and systematically label data.

Because more data isn’t necessarily better data, noisy examples should be eliminated. If an example is unclear to a human, feeding it to an algorithm isn’t going to increase its clarity and will likely only hurt the algorithm’s performance.

When working on a machine learning system, there are many choices to be made and many ideas to try out. Ng said that the most effective development teams use error analysis to focus on where they should channel their time and energy. Error analysis will reveal to the team where an algorithm may be underperforming so they can concentrate on improving it.

The great part of the data-centric approach, Ng said, is that it allows you to improve the data just for the part of a learning algorithm flagged by error analysis. After that, you can go back and train the model further.
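A basic form of error analysis is to group evaluation results by category and rank the categories by error rate. The sketch below assumes each evaluation record is tagged with a slice name (here, a defect type); the function name `error_rate_by_slice` and the sample records are hypothetical.

```python
from collections import defaultdict


def error_rate_by_slice(records):
    """Rank slices of the evaluation set by error rate, worst first.

    Each record is a (slice_name, true_label, predicted_label) tuple.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for slice_name, truth, predicted in records:
        totals[slice_name] += 1
        if truth != predicted:
            errors[slice_name] += 1
    rates = {name: errors[name] / totals[name] for name in totals}
    return sorted(rates.items(), key=lambda pair: pair[1], reverse=True)


# Made-up evaluation results for a defect detector.
records = [
    ("scratch", "defect", "ok"),
    ("scratch", "defect", "defect"),
    ("scratch", "defect", "ok"),
    ("scratch", "defect", "defect"),
    ("chip", "defect", "defect"),
    ("chip", "defect", "ok"),
    ("chip", "defect", "defect"),
    ("chip", "defect", "defect"),
    ("dent", "defect", "defect"),
    ("dent", "defect", "defect"),
]

ranked = error_rate_by_slice(records)
for slice_name, rate in ranked:
    print(slice_name, rate)
```

In this made-up example the "scratch" slice has the highest error rate, so that is where a team would look first for inconsistent labels or missing examples.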

Ditch the Idea of ‘One and Done’ Data

The data-centric approach to AI undermines the common misconception that getting data right is a preprocessing step that’s done once and left alone thereafter.

Data should be a core part of the iterative AI development process, Ng said: you train a model, carry out an analysis, engineer the data, retrain, and keep the loop going until you have your finished product.
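The loop can be sketched as a small driver function. Everything here is an assumption made for illustration: the callables `train`, `analyze`, and `improve_data` stand in for whatever a team actually uses, and the toy versions below only track a made-up quality score per slice rather than training a real model.

```python
def data_centric_loop(data, train, analyze, improve_data, target, max_rounds=10):
    """Train, analyze, improve the data, and retrain until the target is met.

    train(data) returns a model; analyze(model, data) returns a score and a
    list of weak slices; improve_data(data, weak_slices) returns new data.
    All three are placeholders supplied by the caller.
    """
    history = []
    model = train(data)
    for _ in range(max_rounds):
        score, weak_slices = analyze(model, data)
        history.append(score)
        if score >= target or not weak_slices:
            break
        # Fix only the slices that the analysis flagged, then retrain.
        data = improve_data(data, weak_slices)
        model = train(data)
    return model, data, history


# Toy stand-ins: "data" is a made-up quality score (0-100) per slice.
def toy_train(data):
    return dict(data)


def toy_analyze(model, data):
    score = sum(model.values()) / len(model)
    weak = [name for name, quality in model.items() if quality < 90]
    return score, weak


def toy_improve(data, weak_slices):
    return {name: min(100, q + 20) if name in weak_slices else q
            for name, q in data.items()}


start = {"scratch": 50, "chip": 90, "dent": 70}
model, final_data, history = data_centric_loop(
    start, toy_train, toy_analyze, toy_improve, target=90
)
print(history)
```

The point of the structure is that data improvement sits inside the loop, next to training and analysis, rather than before it as a one-time step.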

The question developers should be asking is no longer, “How do I tune the model or the code to improve performance?” but, “How can I systematically change the data to improve performance?”

Learn More

See Ng's keynote address about the benefits of data-centric AI from TransformX 2021, "The Data-Centric AI Approach With Andrew Ng."