
The Data-Centric AI Approach With Andrew Ng

Posted Oct 06, 2021 | Views 6.4K
# Keynote
# TransformX 2021
Andrew Ng
Founder @ DeepLearning.AI, Founder and CEO of Landing AI

Andrew Ng is Founder of DeepLearning.AI, Founder and CEO of Landing AI, Managing General Partner at AI Fund, Chairman and Co-Founder of Coursera, and an Adjunct Professor at Stanford University. As a pioneer both in machine learning and online education, Dr. Ng has changed countless lives through his work in AI, authoring or co-authoring over 200 research papers in machine learning, robotics and related fields. Previously, he was chief scientist at Baidu, the founding lead of the Google Brain team, and the co-founder of Coursera – the world’s largest MOOC platform. Dr. Ng now focuses his time primarily on his entrepreneurial ventures, looking for the best ways to accelerate responsible AI practices in the larger global economy.

SUMMARY

Dr. Andrew Ng, Founder of DeepLearning.AI and Founder and CEO of Landing AI, shares real-world examples to show how switching from a model-centric to a data-centric AI development approach, by 'engineering the data', helps improve the performance of machine learning models. He shares his views on key AI best practices for those considering data-centric model development to unlock greater AI performance and efficiencies.

TRANSCRIPT

Nika Carlson (00:15): Next up, we're honored to welcome Andrew Ng. Andrew Ng is founder of deeplearning.ai, founder and CEO of Landing AI, Managing General Partner at AI Fund, chairman and co-founder of Coursera and an adjunct professor at Stanford University. Andrew is one of the most impactful educators, researchers, innovators, and thought leaders in the field of artificial intelligence. His courses on AI, machine learning and deep learning are some of the most popular and have helped countless engineers and developers break into the field of AI. Andrew, over to you.

Andrew Ng (00:58): Hi, it's nice to see you here at TransformX. I'm Andrew Ng, CEO of Landing AI and founder of deeplearning.ai, and I'm excited to share with you some thoughts about the rise of the data-centric approach to AI development. So what is data-centric AI? For the last 30 years or so, a lot of progress in AI has been driven through the model-centric or software-centric approach to AI development. And what that means is, you probably know that AI systems involve both writing code to implement some algorithm or some machine learning model, and then taking that code and running it or training it on a set of data. The dominant paradigm for AI over the last 30 years has been to maybe download the dataset, hold the data fixed, and work on the code to try to get to good performance.

Andrew Ng (01:47): Thanks to this paradigm, we collectively have made tremendous progress in AI. But what I see now is that, thanks to this, for a lot of practical applications, the code, an open-source neural network you can download off GitHub, is basically a solved problem. And I think it's time for us to shift to the data-centric approach to AI, in which you can even hold the code fixed, but instead find systematic tools and methodologies and principles and algorithms to systematically engineer the data, so that when the code is trained on the data, it gives you the performance you need. Here's a quick case study. My team at Landing AI was working with a steel manufacturing plant to inspect sheets of steel like these for defects. There are 39 types of defects, of which I'm showing four types here. The steel plant had achieved 76% accuracy and wants 90% accuracy.

Andrew Ng (02:39): A few teams went in and applied a model-centric approach: they took the steel plant's data, tweaked the model, ran hyperparameter searches, tried different neural architectures, and after months got no improvement. One of my engineers went in and, using one of our tools, helped the steel plant improve the performance to over 90% in about two weeks. And the secret sauce, really, is this: I find that if you expect every single application, say every steel plant, to invent a new neural network architecture, well, that's challenging. But taking a data-centric approach puts all of us as AI practitioners in a better position to empower even non-AI specialists, such as the staff in a steel manufacturing plant, to engineer the data systematically to feed to the algorithm, and that results in a much bigger performance improvement. So what does engineering the data mean? Let me share with you an example.

Andrew Ng (03:36): One of the defects in these sheets of steel is a foreign particle defect, which loosely means specks of lubricant on the sheets of steel. Out of roughly 10,000 images they had in the dataset, there were about 30 images that had the foreign particle defect, and it turned out that one labeler, when labeling this defect, was drawing rectangles or bounding boxes around these specks of lubricant like this. A second labeler was drawing rectangles that looked like this. Neither of them was wrong. Both of them were doing a reasonable job drawing rectangles to indicate the position of the foreign particle defect, but what other teams worked on for months and never noticed was the inconsistency between labeler one and labeler two. So when one of my team members used our tools to spot this problem, he was able to very quickly, respectfully, suggest to labeler two to label the data like this, and this consistency made the data much less confusing to the learning algorithm and allowed the steel plant to resolve in two days an issue that had otherwise gone unresolved for many months.

Andrew Ng (04:48): So I've seen the data-centric approach give better performance faster in many applications. We just looked at the steel defect detection example. In a solar defect detection project that we worked on, the data-centric approach gave much bigger improvements than the model-centric approach. Same with another surface inspection project. And I think that what many experienced AI practitioners have maybe intuitively sensed for years, and kind of started feeling our way around, is that with the maturity of machine learning models today, which is much more true today than even four or five years ago, there's a growing set of applications where engineering the data is going to be even more important than engineering the neural network architecture or the model. So my teams at Landing AI and deeplearning.ai have been working on this for a few years, and started talking about this publicly just in March. So I think there's an hour-long YouTube video of me doing an event, talking about MLOps and the data-centric AI movement.

Andrew Ng (05:54): And I've actually been really happy, since I started to talk about this on, I think, March 24th, with the number of very promising and successful businesses that are also embracing the data-centric AI movement. Before that talk back in March, data-centric AI didn't exist on the internet in its present form, but I was very happy that [Klee 00:06:13] Technology listed data-centric AI right there on their homepage and raised quite a lot of capital to work on this. Snorkel AI, led by a friend I have the deepest respect for, writes data-centric AI right on their homepage. And I was also very happy when Scale AI wrote about the importance of data-centric AI right there on Scale AI's homepage.

Andrew Ng (06:35): So to any of you watching this video, wondering if this data-centric AI movement is a real thing: I think this movement is rapidly accelerating, and this is a good time for all of you to jump into data-centric AI development if you have not yet. What I've seen is that when there's a new technology approach, it often evolves like this. First, there's a handful of experts that do it intuitively. For example, take deep learning over 10 years ago: there were a few people implementing neural networks in C++, right? Then eventually the ideas became more widespread, and many people were implementing neural networks in C++. And finally there came tools like TensorFlow and PyTorch that made the application of deep learning ideas much more systematic and less error-prone than when a bunch of us were hacking it up in C++.

Andrew Ng (07:36): So with the rise of data-centric AI, there has been some number of experts doing it intuitively, and I feel like right now we're maybe in the middle of identifying and then publicizing the principles so that more people can build on and apply them. And I think there is actually still one significant gap, still a lot of work to be done, to build the tools that enable data-centric AI principles to be applied more systematically. This is one problem my team is working on, but one that I hope hundreds or thousands of teams around the world will innovate on. To give an example of what I mean by making it systematic: today, if data is inconsistently labeled, like in the steel inspection example I mentioned, we are still sometimes counting on the skill, or luck, of the machine learning engineer to spot that problem.

Andrew Ng (08:38): But with the rise of MLOps, machine learning operations, a version of DevOps but for machine learning systems, I feel that we're starting collectively to evolve processes such as asking two independent labelers to label a sample of images, and then measuring consistency between labelers to discover where they disagree. Now you might be thinking, "Oh Andrew, I've heard people talk about this: get a bunch of labelers to label it and take the average." But with the data-centric philosophy, I get two or more labelers to label a set of images not to have them vote and just take the majority, but instead to identify where they disagree, so we can repeatedly revise the labeling instructions until the labels become consistent. And this is one of the tools, one of the many tools, we are developing to improve data quality.
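
As a rough illustration of that idea, not Landing AI's actual tooling, here is a minimal sketch of such a consistency check; the data layout and image ids are assumptions:

```python
# Minimal sketch: surface labeler disagreements rather than majority-voting.
# `labels_a` and `labels_b` map a hypothetical image id -> class name,
# one dict per independent labeler.

def find_disagreements(labels_a: dict, labels_b: dict) -> list:
    """Return the image ids where the two labelers disagree."""
    shared = labels_a.keys() & labels_b.keys()
    return sorted(i for i in shared if labels_a[i] != labels_b[i])

labels_a = {"img_001": "chip", "img_002": "scratch", "img_003": "ok"}
labels_b = {"img_001": "scratch", "img_002": "scratch", "img_003": "ok"}

disputed = find_disagreements(labels_a, labels_b)
agreement = 1 - len(disputed) / len(labels_a.keys() & labels_b.keys())
print(f"agreement: {agreement:.0%}, review instructions for: {disputed}")
```

The point, per the talk, is that the disputed ids, not an averaged label, are the output: they drive the revision of the labeling instructions.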

Andrew Ng (09:32): And it turns out that data quality issues arise in many problems, especially small data problems, where the dataset sizes are smaller and it's really important to get as many examples as you can to be clean. To share another example, this is a fun one. One of my friends, Kian Katanforoosh, likes iguanas, so I have a lot of iguana pictures for some reason. But if you're trying to build a system for detecting iguanas, you might take 500 pictures like this and send labelers instructions like, "Hey, use bounding boxes to indicate the position of iguanas." Well, you may have one labeler, working hard and trying to be diligent, label it like this. A second labeler may go, "Oh, the iguana on the left, the tail actually goes all the way to the right," so the second labeler may say, "One iguana, two iguanas," like this. And a third labeler may look at all five images and say, "Well, one bounding box, I'll show you where the iguanas are."

Andrew Ng (10:28): And so I see this in many practical computer vision applications. When I was building AI systems in some of the large consumer internet companies, with a hundred million images, you could take noisy data, just throw it at the algorithm, and let it average out. But I find that for a lot of applications outside consumer software internet, you just have to solve a problem with 50 or 100 images, and focusing on data quality, and having the right data-centric tools to improve the data quality, is the key to getting the performance you need for that application. This turns out to be an issue for speech recognition as well. So, a quick example. I used to work on voice search, right? For web search. And given an audio clip like this, which is quite typical for a web search engine, you might have...

Speaker 4 (11:18): "Today's weather".

Andrew Ng (11:20): And given that audio clip, maybe one labeler will transcribe it like this, one labeler will transcribe it like this, and one labeler might say, "The 'um' is just noise. Why would I transcribe noise?" and transcribe it like this. So none of these labels is wrong. Like the iguana example, they're all kind of okay, but the problem is the inconsistency makes the labels confusing for the learning algorithm, and having tools to spot and then hopefully correct the inconsistency is one of the keys to improving the quality of the data. Now, since I started to talk about the data-centric AI movement in March, there's been a real flowering of activity: a lot more people talking about it, trying to move the whole movement forward. I know that a lot of people in this audience are highly technical, so let me get quite technical and share with you five tips for data-centric AI development.

Andrew Ng (12:17): And I'll run through these one at a time very quickly. Make the labels y consistent; use consensus labeling to spot inconsistencies; clarify labeling instructions; toss out noisy examples, because more data is not always better; and use error analysis to focus on the subset of data to improve. Okay? So if anyone wants to take a screenshot and have the set of tips to share with your friends, this would be one reasonable slide to screenshot. I'll summarize the tips again at the end. So tip one, make the labels y consistent. It turns out that your learning algorithm will have an easier job learning the concept you want it to if there exists some deterministic, meaning non-random, function mapping from the inputs x, say the images, to the output labels y, and if the labels are consistent with this function. So for example, let's say you are doing a manufacturing visual inspection task, and you are looking at pills like this to see if they're scratched or otherwise defective, right? You want pills like this.

Andrew Ng (13:20): So maybe your labelers label it like this: here's a bunch of pictures, and they label some as okay, some as defective. How do you know if this is consistent or not? It turns out that if you were to take the dataset and plot scratch length on the horizontal axis and defective (zero or one) on the vertical axis, you'd see that this labeling is not that consistent, at least not as a function of scratch length. This curve kind of goes up and down, right? It doesn't look like there's a deterministic function. And when you spot a problem like this, it would be very helpful if you could sort your images in order of increasing scratch length, and then pick a standard where you say scratches smaller than a certain length are acceptable and scratches longer than a certain length are not acceptable. That corresponds to redefining your target function to have a clear threshold, below which scratches are considered too small to be significant and above which they are large enough to be considered a defect.
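
A minimal sketch of that relabeling step; the 2.5 mm cutoff (which he mentions next) and the data layout here are purely illustrative:

```python
# Sketch of tip one: relabel so that "defective" is a deterministic
# function of scratch length. Threshold and data are illustrative.
THRESHOLD_MM = 2.5

examples = [  # (scratch length in mm, original, possibly inconsistent label)
    (0.8, 0), (1.4, 1), (2.1, 0), (2.9, 0), (3.5, 1), (4.2, 1),
]

relabeled = [(length, int(length > THRESHOLD_MM))
             for length, _ in sorted(examples)]
print(relabeled)  # the new labels are monotone in scratch length
```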

Andrew Ng (14:29): Just to get a little bit theoretical... Those of you that don't like theory, feel free to ignore the next 30 seconds of this video. It turns out there's some theory, from learning theory, on the realizable versus the agnostic (or unrealizable) PAC learning scenarios. It suggests that in some cases, if your data is noisy, if there isn't a deterministic function like this one (less than 2.5 millimeters is zero, greater than 2.5 millimeters is one), then your generalization error decreases as order of one over square root of m, where m is the training set size. Whereas in the realizable case, in other words if you're just trying to learn this threshold, your error goes down as order of one over m. And the curve one over m goes down much more quickly than one over square root of m, which is why, when you have a clean and consistent training set, you can actually learn and generalize much better, sometimes with a surprisingly small dataset.
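
Written out, with constants and logarithmic factors omitted, the two rates he is contrasting are:

```latex
% Generalization error as a function of training set size m
% (constants and log factors omitted):
\epsilon(m) = O\!\left(\frac{1}{\sqrt{m}}\right)
  \quad \text{(agnostic case: noisy, inconsistent labels)}
\qquad
\epsilon(m) = O\!\left(\frac{1}{m}\right)
  \quad \text{(realizable case: clean, consistent labels)}
```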

Andrew Ng (15:36): In fact, at Landing AI, when working on manufacturing problems, even now I'm sometimes surprised by how good a model we can get with just 50 images of the concept we want to learn, if we can get clean and consistent images. Maybe 100 images, but sometimes even fewer than 50. So tip one: make the labels y consistent where you can. Tip two: use multiple labelers to spot inconsistencies. I'm going to illustrate this with computer vision, and it turns out there are some pretty common ways that labelers for computer vision tasks are sometimes inconsistent. One I've seen a lot is the label name: maybe one labeler will label a defect as a chip and a second one as a scratch. What I do is, whenever I suspect that there's an inconsistency, I ask two labelers to label the same image, to then measure the degree of consistency so we can find a way to fix it later.

Andrew Ng (16:38): Or the bounding box size. Given an image like this, maybe one labeler will label it like this and say, "That's discolored," and a second labeler will label it like that and say, "That's discolored." Or the number of bounding boxes: sometimes a labeler will label it like this, and sometimes like this. And what I've found is that in cases where you've spotted an inconsistency, almost any standard is usually better than no standard. So between the different inconsistent standards the different labelers are using, picking some standard and asking them to be consistent with it will make life much easier for your learning algorithm and allow it to perform much better, even on a modest-sized dataset (a sketch of such a consistency check follows below). Tip three: repeatedly clarify labeling instructions by tracking down ambiguous examples. When I'm trying to develop one of these systems, my workflow is often to repeatedly find where the labels are either ambiguous or inconsistent.
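
A rough sketch of the bounding-box consistency check from tip two; the box format, the IoU cutoff, and the idea of flagging mismatched box counts are all assumptions about what such a tool might do:

```python
# Sketch: flag images where two labelers' bounding boxes disagree.
# Boxes are (x1, y1, x2, y2) tuples in pixel coordinates.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def flag_inconsistent(boxes_a, boxes_b, min_iou=0.5):
    """Flag if the box counts differ, or any box lacks a good match."""
    if len(boxes_a) != len(boxes_b):
        return True
    return any(max(iou(a, b) for b in boxes_b) < min_iou for a in boxes_a)

# Same single object, slightly different boxes: consistent enough.
print(flag_inconsistent([(10, 10, 50, 50)], [(12, 12, 48, 52)]))   # False
# One big box vs. two small ones: flag for the labelers to discuss.
print(flag_inconsistent([(10, 10, 50, 50)],
                        [(10, 10, 30, 30), (30, 30, 50, 50)]))     # True
```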

Andrew Ng (17:39): And when I spot an inconsistency, I have the different labelers sit down and try to clarify, just make a decision about how this should be labeled. Is it one bounding box? Is it two? What's the appropriate size of the bounding box? And then I document that decision in the labeling instructions. In manufacturing, we sometimes call this the "defect book," but whatever you call it, document the decision clearly in your labeling instructions. As for best practices for labeling instructions, I've found that instructions illustrated with examples of the concept, such as examples of scratched pills, examples of borderline cases and near misses, and any other confusing examples, often allow labelers to become much more consistent and systematic in how they label the data.

Andrew Ng (18:26): And then tip four: especially if you work on small datasets, it's useful to toss out bad examples. It's not true that more data is always better. For example, if I asked you to train a classifier to detect the defects shown in these six images... I'm not sure how you'd do that. Maybe you can kind of tell what defect I'm asking you to recognize, but it turns out that of these six images, one is poorly focused and two have bad contrast. If you were to toss out the bad examples and just focus on the three remaining ones, it becomes much clearer, both to us as people and, it turns out, to the learning algorithm, what we want the system to learn. So do error analysis and ask humans to clarify examples. Some examples even humans find unclear, and sometimes that's a sign that it's a bad example, like a poorly imaged one, and tossing it out is key to improving learning algorithm performance.
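
One way such a screen might be automated, a hedged sketch rather than anything from the talk: the variance of the Laplacian is a common focus measure, pixel standard deviation a crude contrast measure, and both thresholds below are made-up values you would tune on your own data.

```python
import cv2  # OpenCV: pip install opencv-python

FOCUS_MIN = 100.0     # below this Laplacian variance, treat as blurry
CONTRAST_MIN = 25.0   # below this pixel std-dev, treat as low contrast

def is_bad_example(path: str) -> bool:
    """Heuristically flag poorly focused or low-contrast images."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:  # unreadable file: definitely toss it out
        return True
    blurry = cv2.Laplacian(gray, cv2.CV_64F).var() < FOCUS_MIN
    low_contrast = gray.std() < CONTRAST_MIN
    return blurry or low_contrast

paths = ["pill_01.png", "pill_02.png", "pill_03.png"]  # hypothetical files
kept = [p for p in paths if not is_bad_example(p)]
```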

Andrew Ng (19:33): Last tip. When we're working on machine learning systems, there's so much you could do. There are so many things you could try, so many ideas. And so I've found that the most effective teams almost always use error analysis to drive a systematic process for selecting the subset of data that is most worth your while to focus on improving. For example, given a dataset like this, there are lots of things you could do. You could use data augmentation, create more data, clean up the labels. But suppose you find, through error analysis, that the algorithm is doing poorly on just a subset of the data, say the subset with scratches. Then it would make sense to focus attention on just this subset and look at how consistent, or not, the quality of the scratch data is.

Andrew Ng (20:26): And this actually takes you back to the case we had in tip one, where focusing attention on the scratches allows us to spot, say, that the scratch labels are not consistent, which can initiate a process to make the labels more consistent, fix up the scratch labels, or get more images of scratches where they're needed. And that will improve the performance of the algorithm on scratches.
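
A minimal sketch of that kind of error analysis by slice; the class names and arrays are invented for illustration:

```python
import numpy as np

# Sketch: compute per-class accuracy to find the slice that needs work.
classes = np.array(["chip", "scratch", "discoloration"])
y_true = np.array(["chip", "scratch", "scratch", "discoloration", "scratch"])
y_pred = np.array(["chip", "chip", "scratch", "discoloration", "chip"])

for c in classes:
    mask = y_true == c
    acc = (y_pred[mask] == y_true[mask]).mean()
    print(f"{c:>14}: {acc:.0%} accuracy on {mask.sum()} examples")
# "scratch" comes out worst, so the scratch slice is where the data
# engineering effort (consistency checks, more images) should go.
```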

Andrew Ng (20:49): So the error analysis process allows you to focus your attention on, say, the subsets of the problem where the algorithm's performance is poorer and you want it to do better. In this case, focusing attention on the scratched pills takes us back to basically the workflow you saw for tip one: you can now look at these labels, spot that they are not consistent as a function of the size of the scratch, and trigger a workflow to make the labels more consistent and improve the performance of the learning algorithm on scratched pills. So in terms of data-centric AI development, this is the iterative workflow that I hope you will think of using for your machine learning work. Many of you know that machine learning development is a very iterative process, right?

Andrew Ng (21:46): So you train the model and then you carry out error analysis to decide on the next step. Whenever I train a neural network, there are so many ideas for what to try next, so error analysis is very useful for deciding, "Oh, we are doing well on these types of classes, but not so well on scratches, so let's focus attention on improving the scratches." And when you've decided to try to improve one part of your learning algorithm, one of the nice things about taking a data-centric approach is that you can improve the data for just that part, that slice of the data. And when you do that, you can then go back and train the model further. Now, here are some ways to systematically improve the data. One, you can ask multiple labelers to label the same data in order to measure the consistency of the labels.

Andrew Ng (22:39): So whenever I suspect that the labels for a class are inconsistent, that's when I would consider asking two or sometimes more labelers to label the same image, or the same audio clip, or something else, to measure consistency. Two, once you've spotted an inconsistency in the way different people are labeling, you can work to improve the label definitions and relabel the data more consistently, such as clarifying the length of scratch that makes a pill defective and relabeling consistently with that. So the first two tips are ways to improve the labels y. Three, you can toss out noisy examples, such as the poorly focused or poor-contrast images you saw, or improve the quality of the input x; for example, take the camera and just refocus it. And four, once you've spotted that you have insufficient data on, say, scratches or one type of thing you're trying to recognize, you can make a focused effort to collect more data through data collection, or use data augmentation or some of the more sophisticated data synthesis techniques to generate more data within that slice.
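
As a rough sketch of generating more data within a slice, with flips and a small brightness jitter standing in for whatever augmentations preserve the defect's appearance in your domain (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image: np.ndarray) -> list:
    """Produce a few label-preserving variants of one grayscale image."""
    jitter = np.clip(image.astype(np.int16) + rng.integers(-20, 21), 0, 255)
    return [np.fliplr(image), np.flipud(image), jitter.astype(np.uint8)]

# Placeholder for the under-represented slice (e.g., scratch images).
scratch_images = [rng.integers(0, 256, size=(64, 64), dtype=np.uint8)]
augmented = [aug for img in scratch_images for aug in augment(img)]
print(len(augmented), "augmented examples generated for the scratch slice")
```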

Andrew Ng (23:52): And so those are some ways to improve the quality of the inputs x, the images in this example. Most of the techniques I've described here are a little more applicable to problems on unstructured data, such as images or audio or text, where you rely on human labelers. There's a separate set of data-centric AI principles for structured data, more tabular or spreadsheet data, where you may not be counting on human labelers as much, that maybe I'll get to cover in a separate forum.

Andrew Ng (24:25): One of the most powerful things you can do with data-centric AI is this: when you train the learning algorithm and error analysis tells you that it does well on some classes, say it's detecting chips and discoloration in pills well, but it's doing poorly on other classes such as scratches, data-centric AI gives you a tool to engineer just the subset, the slice of data, on which you want to improve performance. Previously, when we were more in the model-centric paradigm, we didn't really have a lot of tools to look at the problem and decide, "I've got to improve performance just on this subset, this slice of the data." So it's actually a very powerful tool that the data-centric approach gives us.

Andrew Ng (25:09): And in fact, in terms of pushing forward responsible AI, making sure AI systems are reasonably free from bias and are fair in how predictions are made on different subsets of people, say: if you audit an AI system's performance and find that it is problematic on one slice of data, say the way it makes loan decisions for one minority group, then the data-centric approach gives you a set of tools to engineer that slice of data. It's not a panacea, but it is another tool in your toolbox to reduce bias on subsets of data that an auditing process may identify. So I think data-centric AI also gives us a powerful tool to enhance responsible AI.

Andrew Ng (26:02): So just to summarize: I think one common misconception is that getting the data right is some pre-processing step you do once, but I don't think that's true. With the data-centric approach to AI, it's a core part of the iterative process of model development, in which you train a model, carry out error analysis, engineer the data, retrain, and keep going around that loop. The five tips I presented are summarized down below. And one thing I did not get to talk about in this presentation is that engineering the data is not just a core part of the iterative process of model development; it's also a core part of post-deployment monitoring and system maintenance. After you deploy, if there's data drift or concept drift, having the right tools to spot the problem, bring the data back, and engineer the data to fix issues that arise after deployment is a key part of the data-centric workflow as well.
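
One hedged sketch of what such post-deployment monitoring might look like, comparing a single input feature's training-time distribution against recent production data with a two-sample Kolmogorov-Smirnov test; the feature, window sizes, and significance level are all assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Stand-ins for a monitored feature, e.g., mean image brightness.
train_brightness = rng.normal(loc=120, scale=15, size=5000)
prod_brightness = rng.normal(loc=135, scale=15, size=500)  # lighting changed

stat, p_value = ks_2samp(train_brightness, prod_brightness)
if p_value < 0.01:
    print(f"possible data drift (KS statistic {stat:.3f}); "
          "pull recent examples back for labeling and re-engineering")
```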

Andrew Ng (27:03): Just to wrap up, here are some key takeaways for data-centric AI development. We talked about how AI systems are code plus data, and with the maturity of neural network technology, there are more and more applications where it is now more fruitful to focus your attention on engineering the data. Not all, but many applications. In the model-centric, or sometimes "software-centric," approach to AI, people used to keep asking, "How do you tune the model or the code to improve performance?" But in data-centric AI, I think we can now ask, "How can you systematically change the data to improve performance?" And there's a [inaudible 00:27:45] discipline called MLOps, machine learning operations, still a discipline that a bunch of people are creating and trying to standardize, but when I stand up an MLOps team to help build, maintain, and deploy machine learning systems in production, I think the most important task of the MLOps team is to make sure high-quality data is available through all stages of the machine learning project lifecycle: from scoping, to collecting data, to training the model, to deploying to production.

Andrew Ng (28:17): And lastly, I think one of the most important research frontiers is to create data-centric technologies that make data-centric AI an efficient and systematic process. One common misconception about data-centric AI is that it's just a mindset or philosophy. I think it is that; it is an exhortation to pay more attention to engineering the data. But rather than everyone just trying harder to do it, I think there's a lot of room for invention, for the creative invention of tools and algorithms and principles that make this possible for everyone, just as the creation of TensorFlow and a lot of research papers on neural networks made the engineering of machine learning models more systematic. One thing I'm doing is helping organize an upcoming NeurIPS Data-Centric AI workshop, which I hope will be the premier event for research and cutting-edge work on data-centric AI. So I hope you'll check out our website and maybe join us at the workshop in December.

Andrew Ng (29:18): And I plan to continue to write tweets and so on about AI and data-centric AI regularly, so please keep in touch. This is my social media handle, and The Batch is a weekly newsletter published by my team; I write a letter there every week, and the deeplearning.ai team also covers cutting-edge news, so please keep in touch. I hope that we have many more opportunities to connect and to collectively move data-centric AI forward, which I think is going to be key to unlocking the value of AI for many, many people. I'm excited to be here to speak at this Scale AI event, and I think together, Landing AI, deeplearning.ai, all of us, can hopefully build the tools and the education and the processes and the mindsets that will empower many more people around the world to apply data-centric AI in a [inaudible 00:30:15]. Thank you.

