From Big Data to Good Data with Andrew Ng
Andrew Ng is Founder of DeepLearning.AI, Founder and CEO of Landing AI, Managing General Partner at AI Fund, Chairman and Co-Founder of Coursera, and an Adjunct Professor at Stanford University. As a pioneer both in machine learning and online education, Dr. Ng has changed countless lives through his work in AI, authoring or co-authoring over 200 research papers in machine learning, robotics and related fields. Previously, he was chief scientist at Baidu, the founding lead of the Google Brain team, and the co-founder of Coursera – the world’s largest MOOC platform. Dr. Ng now focuses his time primarily on his entrepreneurial ventures, looking for the best ways to accelerate responsible AI practices in the larger global economy.
Andrew Ng discusses the importance of data in developing AI applications and joins Scale AI CEO Alexandr Wang in a fireside chat.
Brad Porter: Thank you, Alex. Good morning and welcome to Scale Transform. My name is Brad Porter, and I'm Chief Technologist at Scale AI. We have an amazing lineup today and we're incredibly excited to get started. We're particularly pleased to kick off today's Transform 2021 with Dr. Andrew Ng.
Brad Porter: Andrew's one of the most impactful educators, researchers, innovators and thought leaders in the field of artificial intelligence. He's also the Founder of DeepLearning.AI, a General Partner at AI Fund as well as an Adjunct Professor at Stanford University.
Brad Porter: Andrew's courses on AI, machine learning and deep learning are some of the most popular on Coursera and have helped countless engineers and developers break into the field of AI. Today, Andrew will be giving a presentation followed by a fireside chat discussion with our CEO, Alex. Please join me in welcoming Andrew to Transform.
Andrew Ng: Thank you for having me at the Scale AI conference. Today I'd like to share with you some ideas I think will be important for how we build AI systems in the future. In particular, I hope to share with you some thoughts on how I think our mindset should shift from building big data to building good data. This will unlock how we can more efficiently build AI systems for many more industries, especially ones outside consumer internet.
Andrew Ng: First, let's take a snapshot of where we are in the AI industry. I think AI is one of the major inflection points of technology that we've seen in the last several decades. There was the mainframe and minicomputer revolution in the 1960s, then the PC revolution, then the internet revolution, then the mobile cloud revolution, and I think the AI revolution is still underway. Whereas the earlier technology revolutions disrupted many major industries, perhaps more and more as time went on, I think AI is on a path to disrupt every industry. And we're still in the early stages of that.
Andrew Ng: Further, many of the fundamental drivers that caused AI to take off in the last decade are still driving progress forward. Take the falling cost of compute. It is almost as if there was an old Moore's law in the pre-GPU era and now there's a new Moore's law, but the rise of compute is continuing unabated.
Andrew Ng: The world continues to become more and more digital, and this is creating data, and new algorithms continue to be invented practically every week. Let's dive a little bit more into these.
Andrew Ng: Take compute. One of the diagrams that's inspired me for quite some time is this benchmark that was created by the Stanford DAWNBench project, which showed that the cost to train ImageNet to a relatively high level of accuracy fell from over $2,000 to about $10 in just about a year.
Andrew Ng: If you look at the rising language model sizes, leading research groups are also continuing to build bigger and bigger models. So compute continues to become more and more available.
Andrew Ng: On the research front, the number of research papers published in AI continues to take off like never before. So, the compute, digitization of our world, and the creativity of researchers continues to drive forward AI progress.
Andrew Ng: What challenges still remain? I've been saying for many years now that AI is the new electricity. Similar to the rise of electricity starting about 100 years ago which transformed every industry, AI will do the same.
Andrew Ng: If you look at what AI has done so far, it has transformed the software internet industry, especially the consumer software internet industry. Web search, online advertising, language translation, social media. These businesses have become much more effective and valuable because of AI. But once you look outside the software industry, everything from manufacturing to agriculture to healthcare to logistics, I think AI today is still in the earliest stage of development. I think that in the future, AI applications outside the consumer internet will be even bigger than the applications that we have in consumer internet.
Andrew Ng: Sometimes working in tech it's easy to forget that the majority of the economy is actually not in the software consumer internet industry. I hope AI will go to those other industries and have an even greater impact than we've seen so far.
Andrew Ng: Why aren't we there yet, and why is building AI systems so hard today for so many industries? I think it would be useful as we build AI projects to more systematically think about the life cycle of a machine learning project or the life cycle of an AI project.
Andrew Ng: There's been a lot of attention in academia, in research and in the popular press about building AI models such as training your supervised learning model that can input an English sentence and output a French sentence, or input this and output that. Training models is something that AI and deep learning has gotten much better at over the last decade. That's something that we should absolutely celebrate.
Andrew Ng: Those of us who have built and shipped commercial systems know that there's a lot more to building an AI system than training the model. When I work with AI teams, I encourage them to think about the entire life cycle of a machine learning project. That starts with scoping the project: defining what's worth working on and which projects to pursue, which turns out to still be one of the hardest skills in today's AI world.
Andrew Ng: Next is collecting the data. This includes defining what data you want, making sure the inputs X are representative and the labels Y are unambiguous, then training the model, carrying out error analysis, and iteratively improving the model, by updating the model or in many cases getting more data, until your model performs well enough, and then deploying in production.
Andrew Ng: Some teams view deploying in production as the endpoint of an AI project. I don't view it that way. I think that often, deploying in production means you're only about halfway there. Only after you've deployed in production do you get live data flowing back into your system so it can continue to learn, improve, and be refined. For many teams today, I actually tell them that when you deploy in production, you're maybe only halfway there.
Andrew Ng: That's why I think when you look at the life cycle of a machine learning project, this is a very iterative process where when you're training the model, you often want to go back to collect more data. Even after you deploy the model, you may want to train the model, update it, or go back and update your data set.
Andrew Ng: Many teams will collect some data, train the model. When you get through that for the first time, you have what we sometimes call proof of concept. You have like a demo, maybe a prototype that does well on test sets that are stored in a data store somewhere. But after you've deployed in production, there is still an iterative process to go back to the earlier stages to keep on improving the system.
Andrew Ng: In terms of why building AI systems is hard, I think building the initial proof of concept is hard, especially when there's insufficient high-quality data. You never have enough high-quality data so that makes building POCs hard. But even after building a successful proof of concept, there's often still a gap between the proof of concept and production. So let me step through both of these issues in a little bit more detail.
Andrew Ng: It turns out that AI grew up in consumer internet companies that may have hundreds of millions or over a billion users. When you have a billion users, that gives you a huge data set, and it's great. You should use that data and learn from it. But once you look outside the consumer internet, many industries just do not have these giant data sets.
Andrew Ng: For example, I do a lot of work on manufacturing visual inspection and fortunately no manufacturing plant has manufactured a million or heaven forbid a billion scratched smartphones that would need to be thrown away. If you want to use computer vision to see if a smartphone is scratched, no one has a million pictures of distinct scratched smartphones.
Andrew Ng: Or take healthcare. I've worked on many healthcare problems where we may only have 100 X-rays of patients with a certain condition. So your ability to get an AI system to work with only 100 images or 1,000 images is key to breaking open these application areas.
Andrew Ng: I have been surprised myself at how often, if you do certain things right, it becomes possible to build a computer vision system not with a million images like ImageNet, but with around 100 images or maybe a few hundred. This small-data capability and technology, and frankly not just technology but practices and ways of thinking, will be important for expanding AI into these other industries that may not have giant data sets.
Andrew Ng: Let me give you an example. It turns out that if you have a small data set like this… I used to fly helicopters and drones, and if you want to predict the RPM of a motor as a function of the voltage, and you have five examples in a noisy data set like this, it's really difficult to fit a good function. Very difficult. Any number of curves could fit this data set.
Andrew Ng: If your data set is noisy but you have a lot of data, then it's actually okay. With a data set like this, you can relatively confidently fit a straight line or take a local average. You do just fine.
Andrew Ng: One very interesting case is when you have a small data set, say five examples, the same as the example on the left, but the examples are relatively noiseless, or low noise. In this case, even though you have a very small data set, the clean and consistent labels allow you to fit a function that does quite well.
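A minimal sketch of the point above, with invented voltage/RPM numbers and noise levels: with only five examples, noisy labels leave the fitted line badly constrained, while clean labels recover it almost exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: RPM grows linearly with voltage.
true_slope, true_intercept = 300.0, 50.0
voltages = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
true_rpm = true_slope * voltages + true_intercept

# Five noisy labels (e.g. an inconsistent measurement procedure) vs. five clean labels.
noisy_rpm = true_rpm + rng.normal(scale=200.0, size=voltages.shape)
clean_rpm = true_rpm + rng.normal(scale=5.0, size=voltages.shape)

for name, y in [("noisy labels", noisy_rpm), ("clean labels", clean_rpm)]:
    slope, intercept = np.polyfit(voltages, y, deg=1)   # least-squares line fit
    print(f"{name}: fitted slope={slope:7.1f} (true {true_slope}), "
          f"intercept={intercept:7.1f} (true {true_intercept})")
```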
Andrew Ng: I think big data is great. If you can get it, let's get it and let's use it. But even when you have small data sets, I've found that putting an emphasis on good data hygiene practices can have a huge impact on the quality of learning algorithms of models you can fit. Even today I'm surprised over and over at how we can build systems that work with 100 images or sometimes even fewer.
Andrew Ng: At the highest level, an AI system is made up of code plus data. By code I mean code including your choice of model or your choice of machine learning algorithm. We as an AI community have put a lot of work into inventing new practices and algorithms and ways of improving the code. I think the amount of work on improving the data should grow as well.
Andrew Ng: In fact, we tell jokes about this. We say as a joke that 80% of data science is data cleaning. We say it as a joke, but it's also true. To me, if 80% of a machine learning engineer's or a data scientist's work is data cleaning, then maybe data cleaning is, or has to be, a core part of the work of a data scientist or machine learning engineer.
Andrew Ng: When I speak with general audiences I sometimes use the analogy, data is food for AI. The goal in life is not to maximize the number of calories. Beyond a certain point, it's about having high-quality food. So, today we have to spend a lot of time sourcing, preparing high-quality data before we train the model. I think that just like a chef would spend a lot of time to source and prepare high-quality ingredients before they cook the meal, I feel like a lot of the emphasis of the work in AI should shift to systematic data preparation.
Andrew Ng: Recently I went onto arXiv and looked up the 100 most recent machine learning papers I could find. I just skimmed the abstracts; it was a very informal skimming. Of the 100 papers whose abstracts I skimmed, 99 were on machine learning models and literally one was on data augmentation. So arguably, from a very loose sample, 99% of the academic research work is on the 20% of the work of building these AI systems.
Andrew Ng: One idea I want to share with you is the idea of data iteration instead of model iteration. Today, many research teams will hold the data set fixed and iterate to try to improve on the model. I have found that this is a great way to do research. Download the benchmark data set, improve performance, maybe publish a paper. For most of the commercial projects I work on, I've actually often gone to the team to say, “Hey everyone, the model is just fine. It's good enough. Use RetinaNet or ResNet or U-Net or some new network. It's just good enough.” Instead, I've gone to my teams and said, “Please hold the code fixed. Let's not mess with the code anymore and let's just iterate on the data to get the performance we need for this application.”
Andrew Ng: For quite a lot of the projects that I've worked on where I gave that direction to the team, it actually accelerated the team's progress.
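A minimal sketch of this data-iteration loop, using a toy scikit-learn setup in which deliberately corrupted labels stand in for real labeling problems: the training code is held fixed, and only the data set changes between rounds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data standing in for a real application; 20% of training labels are flipped.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
noisy_y = y_train.copy()
flip_idx = rng.choice(len(noisy_y), size=len(noisy_y) // 5, replace=False)
noisy_y[flip_idx] ^= 1

def train_and_evaluate(X_tr, y_tr):
    """The 'code' we hold fixed: same model, same hyperparameters every round."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_tr, y_tr)
    return model.score(X_test, y_test)

# Data iteration: each round we fix a batch of mislabeled examples and retrain.
labels = noisy_y.copy()
for round_num, batch in enumerate(np.array_split(flip_idx, 4), start=1):
    labels[batch] = y_train[batch]          # stand-in for relabeling work
    acc = train_and_evaluate(X_train, labels)
    print(f"round {round_num}: test accuracy = {acc:.3f}")
```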
Andrew Ng: I think one of the challenges we have today is that a lot of how we iteratively improve the data is some engineer or a few engineers hacking around in a Jupyter notebook, rather than a repeatable, systematic process. Moving toward making data improvement more systematic and repeatable, rather than having engineers hack around until hopefully they have the right insight, will accelerate the development of many machine learning systems.
Andrew Ng: I've talked about the importance of data in the context of small data projects where the data set is smaller, in which case having really clean labels helps. I want to share with you that this also holds true for big data problems which are long-tail problems.
Andrew Ng: Specifically, if you are working on web search, which has lots of rare queries, you actually have only a small sample for each rare query, even if you have a giant data set in aggregate.
Andrew Ng: Or if you're working on self-driving cars, where you have to handle very rare corner cases, like that one time a pedestrian does something unusual. Even though the self-driving car industry in aggregate has a very large data set, the number of examples you have for those rare corner cases is actually very small.
Andrew Ng: Or product recommendations. Even if the company sells a lot of products, there are products that haven't sold much yet, or long-tail items that you may not have a lot of data on. Even for those, even if the data set in aggregate is huge, there are many small-data problems embedded in this big data problem, and I think that more systematic data practices for getting very clean data will still be critical for improving performance on these otherwise big data problems.
Andrew Ng: Building the proof of concept is hard when you don't have enough data, and you never have enough data. One of the directions I hope the field can get better in is to think about tools for systematically improving our data sets to drive machine learning progress, in addition to thinking about how to improve our models to drive machine learning progress.
Andrew Ng: Just for the record, for those of you that are highly technical engineers, hyperparameter search is important. You do have to get the hyperparameters chosen reasonably. But in many projects I see, once you have made a reasonable choice of hyperparameters, iterating on the data will often buy you at least as much progress as spending a comparable amount of time searching for the latest state-of-the-art algorithm, downloading it, and trying that out instead.
Andrew Ng: Again, the AI world is very diverse, so the advice I'm giving is more for commercial projects than for doing brand new research projects.
Andrew Ng: Even after you've built a proof of concept, we still often have a proof of concept to production gap. Accenture recently released a report saying that many companies' AI projects are still in the proof of concept stage. So why is that?
Andrew Ng: There was a very influential paper by Dave Sculley and others out of Google that pointed out that of all the code needed to build an AI system, only a relatively small portion is machine learning code. Unfortunately, building a working system often requires more than doing well on your hold-out test set.
Andrew Ng: For example, a lot of teams have published research papers like this. This is work that my collaborators and I were involved in. Work that says AI detects pathologies at a level comparable to a human radiologist. Even though there are research results like these, if you take an X-ray today chances are there's not going to be an AI system reading your X-ray. Why is this?
Andrew Ng: I think one major reason is that a POC result that works and goes into a published paper often does not work in a production setting because of concept drift and data drift. It turns out that if we collect data from Stanford Hospital, with high-end X-ray machines and well-trained technicians, and test on very similar data, then we can indeed publish papers showing that we are comparable or sometimes superior to human radiologists. But take that neural network you just published a paper about and walk it down the street to an older hospital with a slightly older X-ray machine, maybe a different protocol that the X-ray technician uses to image the patient, maybe the patient is at an angle, and the performance degrades significantly, in a way that is much less true of a human radiologist, who could walk down the street from Stanford Hospital to a different hospital and do just fine.
Andrew Ng: This is why I think deployment is a process, not an event, and beginning deployment only marks the beginning of flowing data back to enable continuous learning.
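As an illustration of the kind of drift monitoring this implies (not anything specific to the X-ray systems described above), the sketch below compares a summary statistic of incoming production inputs against the training distribution with a two-sample Kolmogorov-Smirnov test; the numbers are simulated.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Stand-ins for a per-image summary statistic (e.g. mean pixel intensity)
# computed on the training set and on a recent batch of production inputs.
train_stats = rng.normal(loc=0.55, scale=0.05, size=5000)
prod_stats = rng.normal(loc=0.48, scale=0.07, size=500)   # a different machine/protocol

statistic, p_value = ks_2samp(train_stats, prod_stats)
if p_value < 0.01:
    print(f"Possible data drift (KS={statistic:.3f}, p={p_value:.2e}); "
          "review recent inputs and consider collecting and labeling new data.")
else:
    print("No significant shift detected in this batch.")
```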
Andrew Ng: One of the things I'm most excited about is the rise of MLOps as a discipline to help us make building and deploying machine learning systems more repeatable and systematic. In the traditional software engineering world, we software engineers would write code and then hand it off to the DevOps team, which was responsible for ensuring quality, infrastructure, and deployment.
Andrew Ng: With AI, we now have not just code but code plus data. I think we need a new discipline, maybe called MLOps to help make the creation and deployment of AI systems more systematic.
Andrew Ng: If you look online, you'll find different inconsistent definitions for what MLOps is. Even large companies have definitions of MLOps on their webpage that will differ from each other. So this is an exciting emerging discipline. Let me share some thoughts on that.
Andrew Ng: One of the challenges with AI software is that it is not a linear process. In traditional software you could scope it, write the code, hand it over to DevOps, and then deploy in production. With AI software that doesn't work, because you scope it, collect data, train the model, and deploy in production, and then the process iterates, coming back to the earlier stages again and again.
Andrew Ng: That's why I think for most projects it is a mistake to think of MLOps as a team that you hand your model to for them to deploy and then they take care of it from then on. Instead, I think MLOps needs to support multiple stages in the life cycle of an AI project because it isn't this smooth hand-off. Even post-deployment, you have to keep on collecting data and keep on updating your model.
Andrew Ng: In the projects that I work on, I often think of MLOps as supporting all the stages, especially collecting data, training the model, and deploying in production. All of these stages of an AI project.
Andrew Ng: I feel like there are a lot of things that MLOps needs to do: manage data governance, ensure that the systems are reasonably fair, maybe audit performance, make sure your system is scalable. A lot of things that MLOps needs to do.
Andrew Ng: In terms of deciding what's the single most important thing that the MLOps team needs to do, I think it should be to ensure consistently high-quality data throughout all stages of the machine learning project life cycle.
Andrew Ng: There are other things that MLOps can do as well, but if they do only that one thing, it will significantly accelerate the creation and deployment of AI systems. So to any of you that are standing up MLOps teams, I would urge you to have those teams consider this as the number one principle to organize their work around; they can do other things as well.
Andrew Ng: Just to summarize: as AI systems are code plus data, for many projects… not all, but for many short-term commercial applications, I think the code is almost a solved problem. Download an open-source implementation from GitHub and it'll do fine. Not for all problems, but for many problems. The data, though, is a huge unsolved problem.
Andrew Ng: I think a shift in mindset from big data to good data, along with practices to develop good data, will be critical for accelerating the build-out of AI systems for many industries outside consumer internet.
Andrew Ng: What is good data? It covers the important cases, with a good input distribution X. It has accurate labels, high-quality Y. The data is timely, with timely feedback from deployment to manage data drift and concept drift. And of course, we need good governance of this data: reasonably free from bias, ensuring fairness, and satisfying privacy, data provenance and lineage, and regulatory requirements.
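A minimal sketch of the automated "good data" checks an MLOps pipeline might run before training; the record format, field names, and thresholds are invented for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def validate_dataset(records, required_classes, max_age_days=90):
    """Run simple 'good data' checks: coverage, label completeness, timeliness.
    Each record is assumed to be a dict with 'input_id', 'label', 'labeled_at'."""
    problems = []

    # Accurate/complete labels: flag missing or empty labels.
    unlabeled = [r["input_id"] for r in records if not r.get("label")]
    if unlabeled:
        problems.append(f"{len(unlabeled)} records have missing labels")

    # Coverage of important cases: every required class should appear at least once.
    counts = Counter(r["label"] for r in records if r.get("label"))
    missing = [c for c in required_classes if counts[c] == 0]
    if missing:
        problems.append(f"no examples for classes: {missing}")

    # Timeliness: data older than the cutoff may no longer reflect production.
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    stale = sum(1 for r in records if r["labeled_at"] < cutoff)
    if stale:
        problems.append(f"{stale} records older than {max_age_days} days")

    return problems

# Tiny usage example with made-up records.
now = datetime.now(timezone.utc)
records = [
    {"input_id": "img_001", "label": "scratch", "labeled_at": now},
    {"input_id": "img_002", "label": "", "labeled_at": now},
    {"input_id": "img_003", "label": "scratch", "labeled_at": now - timedelta(days=400)},
]
print(validate_dataset(records, required_classes=["scratch", "dent"]))
```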
Andrew Ng: If we can stand up MLOps teams as an industry to help solve this issue for us, then I hope we can accelerate the adoption of AI both in consumer internet and perhaps even more important, outside consumer internet.
Andrew Ng: Thank you.
Q&A
Alex: Andrew, really enjoyed your talk.
Andrew Ng: Thanks for having me here, Alex.
Alex: In your talk you discussed what good data entails and why that's so important. Could you speak through some of the implications of bad data?
Andrew Ng: Even today I've sometimes been surprised at how much you can do with a surprisingly small data set, so long as the data is clean, consistently defined, and well labeled. For example, take speech recognition. If you work on speech recognition for voice search, you may get an audio clip where someone says, "Um, today's weather." The question is, how do you transcribe that? Is it "Um, today's weather" with a comma? Or "Um... today's weather" with dot dot dot? Or is the "um" just noise that you don't want to transcribe, so you should just transcribe "today's weather"?
Andrew Ng: It turns out any of those three ways of labeling your data is just fine. I'd probably pick the first or the second, not the third, but any of them will work well enough.
Andrew Ng: The problem is if you have a lot of labelers and one third of your labelers choose the first way, one third choose the second way, and one third choose the third way. Then you have noisy data, and the learning algorithm just has to randomly guess which one of these transcription conventions your labeler happened to pick, which is very difficult for a learning algorithm.
Andrew Ng: I've been surprised how many times, when you go through error analysis of a machine learning problem, it turns out that there's a problem with the input X or with the definition of the labels Y, or sometimes the data is stale; it's just not timely data because you didn't deploy and close the loop. Fixing those issues, representative input X, well-defined and accurately labeled Y, and timely data, is sometimes even more important than having a giant data set. Although of course, if you can get a giant data set, we should totally do that too.
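A minimal sketch of enforcing one labeling convention before training, assuming (hypothetically) that the team agreed on the first option, "Um, today's weather" with a comma; the regex and example transcripts are invented.

```python
import re

def normalize_transcript(text: str) -> str:
    """Map labeler conventions onto one agreed convention: 'Um, <utterance>'."""
    text = text.strip()
    # "Um... today's weather" or "Um today's weather" -> "Um, today's weather"
    text = re.sub(r"^(um)\s*(\.\.\.|,)?\s*", "Um, ", text, flags=re.IGNORECASE)
    # If the labeler dropped the hesitation entirely, we leave the utterance as-is;
    # re-inserting it would require checking the audio (not shown here).
    return text

raw_labels = [
    "Um, today's weather",
    "Um... today's weather",
    "today's weather",
]
normalized = [normalize_transcript(t) for t in raw_labels]
changed = sum(1 for raw, norm in zip(raw_labels, normalized) if raw != norm)
print(normalized)
print(f"{changed}/{len(raw_labels)} labels were rewritten to match the agreed convention")
```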
Alex: How can organizations working on AI ensure that the data that they're working on is so-called good, and what kind of investments should they be making to ensure that this is the case?
Andrew Ng: One thing I mentioned in the talk is that a lot of teams, partly because of the way AI research is done, tend to hold the data set fixed and iterate on the code. I think we should for many problems also consider holding the code fixed and iterating on the data.
Andrew Ng: How we do this is quite ad hoc across the AI industry. Often you have some engineers hacking around in a Jupyter notebook, and hopefully the right engineer has the right insight that allows you to improve your data in the right way.
Andrew Ng: I do see that senior AI engineers often have a framework for doing this more systematically. I've seen, for example, that when my senior friends and I look at a project, we often give advice to the team that is actually very similar to each other's. So there is a methodology, but I think this methodology is still a little bit of folk wisdom in the heads of a relatively small group of senior AI people.
Andrew Ng: At a very high level, the things that I do include systematic error analysis: look at the ways the algorithm isn't quite working yet and then tag the different error classes. Now, instead of asking, "How can I improve my code to solve the problem?", consider asking, "How can I improve my data to solve the problem?"
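A minimal sketch of that tagging step: attach tags to misclassified examples during review and count them, so the most common error class tells you which slice of data to improve first. The tags and examples below are invented.

```python
from collections import Counter

# Misclassified examples from a (hypothetical) speech model, each hand-tagged
# during error analysis with the conditions the reviewer observed.
errors = [
    {"clip": "a001", "tags": ["car_noise"]},
    {"clip": "a002", "tags": ["car_noise", "accent"]},
    {"clip": "a003", "tags": ["far_field_mic"]},
    {"clip": "a004", "tags": ["car_noise"]},
    {"clip": "a005", "tags": ["accent"]},
]

tag_counts = Counter(tag for e in errors for tag in e["tags"])
for tag, count in tag_counts.most_common():
    share = count / len(errors)
    print(f"{tag:15s} appears in {count} of {len(errors)} errors ({share:.0%})")
# If 'car_noise' dominates, collecting or augmenting in-car audio is a data-side
# fix that is likely to move the metric more than tweaking the model.
```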
Andrew Ng: One example, to continue with speech recognition: I've built speech systems before and then found that they don't do well if there's a lot of car noise in the background. So if you're doing voice search in a car (don't do that while you're driving; do it from the passenger seat), you have a lot of background noise and the speech system doesn't work well.
Andrew Ng: One thing you could do is ask, how can I adjust the algorithm to make it do better? That could work. The other solution is to say, let's just use data augmentation or data collection or something else. Let's get a lot more data of what people sound like in cars.
Andrew Ng: I have done that, my friends have done that in a reasonably systematic way, but I think developing processes and ways of thinking will help the AI world move forward faster.
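A minimal sketch of that kind of augmentation: mix background car noise into clean speech at a chosen signal-to-noise ratio. The arrays below stand in for real audio loaded from disk, and the 5 dB target is an arbitrary choice for illustration.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested speech-to-noise ratio in dB."""
    noise = np.resize(noise, speech.shape)            # loop/trim noise to match length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + scaled_noise

# Stand-ins for real recordings (1 second at 16 kHz).
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)   # fake "speech"
car_noise = rng.normal(scale=0.3, size=8000)                   # fake "car noise"

augmented = mix_at_snr(speech, car_noise, snr_db=5.0)
print(augmented.shape, float(np.mean(augmented ** 2)))
```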
Alex: Andrew, it's so great to hear you talk about the importance of data when it comes to building these machine learning systems. Some people, such as Andrej Karpathy, have even gone so far as to say that data is really what does the programming in these machine learning systems, not the code, and in some sense, data is kind of the new code. How do you think about the right proportion of investment and effort that organizations should spend on data vs. code?
Andrew Ng: It's very problem dependent. There are definitely cutting-edge projects where we just need better algorithms, but I find that for a lot of commercial projects… The AI field is so broad now that it's hard to give one-size-fits-all answers, but I find that for a lot of commercial applications, if your goal is to build something and just make it work, then very often the code is basically a solved problem. There'll be some open-source GitHub project with an appropriate license that you just download and use. You still need to tweak it for your framework, but for a lot of computer vision problems, for example, if your data set is small, then a modern neural network is essentially a high-variance, low-bias learning algorithm when you have only thousands of examples, so it will fit your training set just fine.
Andrew Ng: There, if we put the vast majority of the effort into the data, I think that often leads to faster improvement.
Andrew Ng: I mentioned in the talk, I often go to teams and say, "Hey everyone, the code's good enough. Please stop messing with the code. The only thing we'll do for this project is work on the data to get the performance we need." I've been surprised how many times, when I give that direction to the team, it results in faster progress.
Andrew Ng: Again, not everything in the AI world is one size fits all because the field is now so diverse, but I do think that as a field we should shift more of our mindset to engineering the data as opposed to primarily engineering the code.
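A minimal sketch of the bias/variance check behind that advice, using a toy data set rather than a real computer vision problem: a flexible model nearly memorizes a small training set (low bias) while validation accuracy lags well behind (high variance), which is the usual signal to work on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for a small, modestly noisy data set (a few hundred examples).
X, y = make_classification(n_samples=400, n_features=40, n_informative=8,
                           flip_y=0.05, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)  # flexible, low-bias model
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)
print(f"train accuracy: {train_acc:.3f}")   # near 1.0: low bias
print(f"val accuracy:   {val_acc:.3f}")     # noticeably lower: high variance
# A large train/validation gap on a small data set points toward data work
# (more, cleaner, more consistent examples) rather than a fancier model.
```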
Real-World Applications of AI
Alex: One of the things that's super interesting about your experiences Andrew is that you've worked on such a range of problems. Everywhere from speech transcription at consumer internet companies to self-driving cars to now more recently manufacturing. Over a time horizon of 20 to 30 years, what are the industries that you think will be most deeply impacted by AI, and what are use cases that you're particularly excited about?
Andrew Ng: All of them. I have to say I think having led AI teams in a couple of very large consumer internet companies, a lot of my focus now is on trying to take AI to all of the other industries outside of consumer internet. So I'm excited about manufacturing, agriculture, healthcare, logistics. Also e-comm, retail. I think there are a lot of industries that do not have a billion users per company, so much smaller datasets where we can develop AI tools that will enable us to create a lot of value.
Andrew Ng: Most of the economy is not consumer internet and I think AI still has room to grow to make inroads there.
Andrew Ng: Right now, I do a lot of work in computer vision, specifically manufacturing visual inspection. Take something manufactured by a factory: can you build a computer vision system to inspect it, to see if there's a scratch on a smartphone, a crossed wire on a semiconductor wafer, a dent in an automotive component, or what have you?
Andrew Ng: I'll share one interesting design pattern that we use in manufacturing that's useful for many industries. One of the challenges the AI world faces is the cost of customization. Every factory makes something different, and the neural network for detecting scratches on smartphones is not particularly useful for detecting crossed wires on a semiconductor wafer, which in turn is not that useful for detecting a dent in an automotive part.
Andrew Ng: How can one roll out AI solutions to 10,000 factories without hiring 10,000 machine learning engineers to do all the customization work? Andreessen Horowitz had a very thoughtful article talking about one of the challenges of AI being the high cost of customization. That has made some parts of AI look like the SaaS business but with lower margins, so who wants to do that? Who wants to say, "Let's build a SaaS business but with lower margins" because of the per-customer cost of customization?
Andrew Ng: I think the only way out of this dilemma for the AI industry is for us to build vertical platforms that enable industry experts in these different industrial domains, in these different organizations that want to use AI, to do the customization themselves. So rather than trying to hire 10,000 machine learning engineers to build a custom AI model for every factory, I find a more promising path forward is to build a tool that enables the factory to do its own customization. This is what we do with LandingLens.
Andrew Ng: I think when we build tools that are transparent and empower the plant operators, in this country and in Asia, to update the model any time of day they want, then they can do the customization and can deploy and operate these AI tools with much more confidence. I think this design pattern of building vertical platforms will be important for other industries as well.
Andrew Ng: Take electronic health records. Every hospital or healthcare system has a slightly different way of storing patient data and coding their electronic health records. So if you want to build AI for reading EHRs, which I've done before, how do you deploy it to 10,000 hospitals without hiring 10,000 machine learning engineers? Again, I think the way out of this is to build a vertical platform that enables every hospital or healthcare system to do the customization they need to make the AI work for them. I think this design pattern will be useful for a lot of industries.
Data Customization vs Code Customization
Alex: Andrew, you bring up the super interesting point of customization. One of the big advancements that's happened in the field over the past year is transformers which have really standardized most of the neural network architecture code for machine learning algorithms. How close do you think we are to a world where most of the customizations actually come in the form of data customizations vs. code customizations?
Andrew Ng: I think for a lot of industries it will be possible to pick one model that will work for that use case. For manufacturing visual inspection, RetinaNet or U-Net seems just fine. There is still integration needed: the pre-processing and the output of the AI system need to be integrated into the manufacturing plant. There is that customization needed, but other than that, a lot of the further customization is indeed in making sure you find the right data to feed the learning algorithm.
Andrew Ng: Or take EHRs. Maybe you come up with some algorithm that will be good enough, maybe XGBoost or some flavor of boosted decision trees, for dealing with electronic health records. There's still the integration into the pipeline. What do you do when the AI system sends a notification to consider this drug for this patient or consider this consultation for this patient? That integration is needed, but again, I think a lot of this will be to think through the entire life cycle of the machine learning project: collect the data, train the model, and then deploy and close the loop to enable continuous learning, to make sure that the healthcare system, or whatever the application is, can manage its own AI system.
Andrew Ng: When the data set is small, a lot of modern learning algorithms are actually low-bias algorithms, so when we have a high-variance problem, figuring out the right data is often a promising thing to work on in order to address that.
Alex: For folks that are in businesses or industries where they're now trying to implement AI and have that impact their products and their businesses, what are some skills or frameworks that product managers, engineers, or business leaders should know or learn about so they can be maximally successful?
Andrew Ng: AI adoption is so complex, it's hard to distill it down into a couple bullet points. I actually wrote and published online a doc called The AI Transformation Playbook that synthesizes a lot of lessons I had learned leading AI teams and speaking to CEOs on how a corporation can adopt AI. So, The AI Transformation Playbook.
Andrew Ng: I'll share just a couple of ideas. I think that many enterprises will be well served to start small. When I was leading the Google Brain team, a lot of people were skeptical about AI and didn't know how to use modern deep learning. This is still the state of a lot of industries today.
Andrew Ng: My first internal customer within Google was the speech recognition team, where my team worked with the speech team to help make speech recognition more accurate. It wasn't the most important project in Google. It wasn't web search or online advertising, but by delivering that nice initial win, it opened the door for me to start a bigger project with the Maps team, where we used OCR to read house numbers in Google Street View to more accurately geolocate buildings and homes in Google Maps and improve the quality of the map data.
Andrew Ng: Only after making that second project work could I then start the more serious conversation with the advertising team, which clearly has a much more direct revenue impact. I encourage many organizations to start quickly, but it's okay to start small, and then plan for the organization to learn and ramp up its capabilities.
Andrew Ng: I still remember, and you'd be surprised to hear this, I still remember our first GPU server at Google. It was just a server under some guy's desk with a nest of wires. That server taught a lot of us how to operate GPUs in a multi-user environment and how to use them for deep learning. Now there are lots of GPUs around many companies, but that turned out to be a key step for me and my team.
Andrew Ng: The other thing that's very difficult for a lot of companies is scoping. I think there's still a global shortage of machine learning engineers; there aren't enough, especially experienced ones. And there's one role that's even harder to hire for than a machine learning engineer, which is the AI architect.
Andrew Ng: Having someone who can understand the business problem, understand the technology, form a point of view on what can and cannot be done and what the ROI is, and select the one or small handful of projects that are worth doing: that is still very challenging.
Andrew Ng: I find that when the business leaders are willing to learn a little bit about AI and when the AI tech people learn a little bit about the business, often that cross-functional brainstorming process could lead to healthy project selection.
Andrew Ng: I'm curious, Alex. You've helped many companies start to think about AI. What do you see in terms of best practices in helping companies transition to AI?
Alex: Andrew, it's a super interesting question. There's three points I would make.
Alex: I think the first one is like you've described. I think this on-ramp of problems is very important. I think it's important that organizations tackle problems that are important for their business, but also tractable with machine learning. I think all too often it's very easy for organizations to consider some big vision such as fully automating customer support without identifying the right road map of smaller problems along the way that'll allow the organization to build the muscle in AI and build a core competency in the technology before jumping all the way to the big vision. So, having this steady road map of problems I think is really important.
Alex: The second thing I would say is, solve problems where the organization already has the data to be able to work on the problem. As you know as well, it's very hard to do machine learning in a context where you don't have any of the data. Businesses naturally have some pools of useful data, or some pools of data that are going to be efficient to label, and it's important to start with problems where that's the case so that you can move quickly.
Alex: The last point I would make is that I actually think we're at a very different point in AI maturity today, where for most businesses and most industries, when it comes to the set of problems you can attack with AI and machine learning and the set of problems that are tractable, I actually think it's possible to just follow where the industry is going. I think that gives organizations a blueprint and a road map to work on problems that are already known to be tractable.
Alex: For example, areas like customer support automation, predictive maintenance, or AI for defect detection. These are use cases that are becoming more and more normalized, and some organizations are already successful with them. So it makes it much easier to apply AI. You don't need to be a genius. You don't need a lot of ingenuity in approaching the problems.
Andrew Ng: On the data piece, that's a very good point. I've met some CEOs and sometimes CIOs that have said to me things like, “Hey, Andrew. Give me two years, maybe three years max to upgrade my IT infrastructure. Then in three years, we'll have this beautiful IT infrastructure. We'll have all the data you could possibly want, and then we'll do AI then.” I always say I think that's a mistake. Much better to look at the applications that are a little bit more AI ready with some data.
Andrew Ng: I don't think anyone in the world has perfect data infrastructure. Always what I find is that every company looks at their data infrastructure and they go, “Boy, my data is so messy. I bet other companies are better organized than me.” But it turns out everyone says that. I find that it's better usually to start from the data you have. We all feel like our data could be less messy, more clean, more complete. Start with the data you have. Quickly try to leverage it using AI, and then the feedback from the AI team will be incredibly useful for prioritizing where to build out the data infrastructure.
Andrew Ng: So, in a manufacturing plant, do you want to upgrade the sensor to collect data ten times a second rather than merely once a second? It's only once you're into the AI project that you can make that decision well. Or for a self-driving car: do you want to add another camera here or there, or a lidar there? Only when you have cars driving on the road do you have the feedback that's really helpful for an engineering team to make that decision.
Andrew Ng: So I think some projects are more AI-ready, and those are often good candidates for the quick initial wins, to then ramp up and progress from there.
Advice to ML Researchers and Practitioners
Alex: For the next generation of people who are going to be shaping the world of AI, many of whom are in our audience today, what's one piece of advice you would give them to take to heart to be maximally impactful for the good of the world?
Andrew Ng: I'll share two thoughts. One is to keep learning. I find that the AI world is so vast today, and there's so much knowledge, that one's success is often more related to learning a little bit every week for a long time.
Andrew Ng: It turns out that if you read two research papers a week, and it's quite a lot of work to read two research papers a week, then you have read 100 papers a year. That is actually a very significant body of knowledge. So I would say keep learning.
Andrew Ng: The second is, in addition to studying and learning, also do project work. There was someone I spoke with at a company, a travel company. They had done a project building a chatbot. I won't describe the details, but they told me that their management team said, "Look, this chatbot has very little ROI. I don't see the revenue case. Why are you doing this?" I felt it was actually a pretty bad thing for their management chain to basically say to shut it down, because I find that doing projects is an important part of everyone's learning experience. Yes, of course we would love to do projects with a high ROI, and when you're a good AI architect you have a better chance of doing that, but for individual learning, it's often fine. Not forever, not for the rest of your life, but starting with some projects is a key part of your learning experience as well.
Andrew Ng: I think also, unfortunately, there is a gap between academic AI research and what it takes to build these things into production. So I find it very useful to nudge teams to think about not just the model training but the entire life cycle of an AI project, from scoping to collecting the data to building the model to deploying, and to gain practice with everything in the life cycle of a machine learning project.
Alex: Andrew, it was so great having you. Thank you so much for being here at Scale Transform. Hearing your thoughts about how AI is going to transform industries, what the bottlenecks are, and what the bottlenecks in the data are has been incredibly insightful. Thank you again.
Andrew Ng: Awesome. Thanks for having me, Alex. This has been great fun to join you here.