AI at Facebook Scale with Srinivas Narayanan
Srinivas Narayanan leads the Applied Research team at Facebook AI doing research and development in a wide range of areas such as computer vision, natural language, speech, and personalisation to push the state of the art in AI to advance Facebook products. He’s led several major efforts at Facebook, including creating the interest graph, launching the location product, and leading engineering for photos, where he also helped start Facebook’s efforts in computer vision and deep learning. Previously, he was a founding member of two startups and part of the database systems research group at IBM Research – Almaden.
Srinivas discusses how AI is used at large scale in Facebook’s products and shares some of the recent advancements in AI and how Facebook is deploying them into production quickly.
Brad Porter: Thank you, Alex, and thank you, Kevin, for spending time with us. Next up is Srinivas Narayanan. Srinivas is the Head of Applied Research at Facebook, where his research covers a diverse spectrum, from computer vision, natural language, speech, and personalization to pushing the state of the art in AI to advance Facebook’s products. Srinivas joins us today to share examples of how AI is used at large scale in Facebook’s products. He will also share some of the recent advancements in AI and how Facebook is deploying them into production. Srinivas, welcome, and thank you for joining us today.
Srinivas Narayanan: Hi, I’m Srinivas Narayanan, I lead the applied research team at Facebook AI. Today, I’m going to talk about some of the challenges in deploying AI at scale and Facebook’s approach to solving them. To start, I’d like to set some context and give a few examples of how we use AI in Facebook’s products. AI is core to Facebook. Facebook today could not exist without AI. And we use it across our family of apps. We use AI to provide individualized experiences for 2.8 billion people each month, helping people to connect to what matters to them and filtering out content that isn’t relevant.
Srinivas Narayanan: It helps us give people the power to build communities across languages. Our neural machine translation system powers more than six billion translations a day. AI helps our products be more accessible to everybody. For example, by using computer vision to understand images, our systems can generate descriptions for the visually impaired and the quality and detail of these descriptions have gotten continuously better with the advances in AI.
Srinivas Narayanan: AI also makes the social graph more valuable to people. For example, with social recommendations, our NLP systems understand when someone posts looking for advice, and we can automatically make their friends’ suggestions more useful. Our AI tools for content understanding also help to proactively identify and remove inappropriate, policy-violating content like share baiting, spam, hate speech, et cetera. AI also helps connect people interested in becoming blood donors. Today, more than 85 million people have signed up to get notifications on Facebook about donating blood.
Srinivas Narayanan: In addition to all this, we use AI to power innovative new experiences that otherwise wouldn’t be possible: bots and assistants, generated content, AR/VR experiences. Portal is a great example of one of these products. A key feature is Portal’s Smart Camera, which uses a full-body computer vision pose tracking model to automatically frame people in the shot and adjust as they move. And it’s highly optimized to run entirely on the device. To do this, we created a new version of Mask R-CNN, the breakthrough full-body pose detection model that Facebook AI researchers first released in 2017.
Srinivas Narayanan: And this new model, Mask R-CNN2Go, is 400 times faster than the original model. And it runs on the device’s mobile chip set. As you can see with these examples, AI is an extremely important technology, underpinning everything that we do. And we have deployed AI at a really large scale. There were a lot of interesting problems along the way and new challenges that are emerging. I’d like to share some of these challenges, how we address them and things we have learned that I hope will be useful for others as well.
Srinivas Narayanan: One of the biggest challenges in building AI systems is getting enough of the right training data. Getting labeled data for supervised learning can be difficult, expensive, and in some cases impossible. And so one way we have approached this at Facebook is to focus on techniques beyond supervised learning. And I’ll share a few examples. To be clear, it’s not that the need for labeled data is completely going away. What these techniques allow us to do is get a lot more bang for the buck from the labeled data that we have.
Srinivas Narayanan: On Instagram, images have hashtags. We’ve trained an object classification system using 3.5 billion publicly shared images, with the hashtags they were shared with as weak supervision. This weakly supervised learning approach helped us create the world’s best image recognition system. And this technique enabled us to leverage a much larger volume of data for training than would have been possible otherwise, and set a new state-of-the-art result. We released a pre-trained version of this model as open source, and it’s available in PyTorch Hub, so you can build off of it as well.
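As a rough illustration of using hashtags as weak supervision, the sketch below turns an image's hashtags into a soft training target. The hashtag-to-label mapping, the label set, and the equal-weight scheme are assumptions made for this example; the talk does not describe the actual target construction.

```python
# Hypothetical mapping from raw hashtags to canonical labels (illustration only).
CANONICAL = {
    "#dog": "dog",
    "#doggo": "dog",
    "#puppy": "dog",
    "#cat": "cat",
    "#beach": "beach",
}
LABELS = sorted(set(CANONICAL.values()))  # ['beach', 'cat', 'dog']


def weak_target(hashtags):
    """Convert an image's hashtags into a soft target over LABELS.

    Unknown hashtags are ignored. Each distinct matched label gets equal
    weight so the target sums to 1. Returns None if nothing matches.
    """
    matched = {CANONICAL[h.lower()] for h in hashtags if h.lower() in CANONICAL}
    if not matched:
        return None
    weight = 1.0 / len(matched)
    return [weight if label in matched else 0.0 for label in LABELS]


# An image shared with "#Puppy #beach #tbt" becomes a noisy two-label target.
example_target = weak_target(["#Puppy", "#beach", "#tbt"])
```

The labels are noisy (a hashtag may not describe what is in the picture), which is why this counts as weak rather than full supervision.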
Srinivas Narayanan: This was a case where we had large-scale data and we needed new techniques to make that data useful for training. Now, I will share another example where it wasn’t even obvious that the training data existed: translations. As I mentioned, at Facebook, we’re providing more than six billion translations a day in over 4,500 language pairs. For some pairs, like French to English, there are of course large bodies of training data available, but people on Facebook speak over 100 different languages. And for a majority of these, the pool of available translation training data is either non-existent or so small that it cannot be used with the existing systems.
Srinivas Narayanan: And so to solve this challenge, our researchers developed a way to train a machine translation model without access to translation resources at training time, also known as unsupervised translation. So for low-resource languages, there is now a way to translate between, say, Urdu and English by having access to text in English and completely unrelated text in Urdu, without having any of the respective translations. That’s really cool. This system is now used in production, and we use it for translations in many low-resource languages, including languages like Nepali, Sinhala, Burmese, et cetera.
Srinivas Narayanan: Many of you may be familiar with how self-supervision works in natural language processing. Large-scale language models, like GPT, have become really popular. And we’ve extended that across languages to build cross-lingual language models. Here we use the same idea as the masked language models that you see in GPT-like systems, but we do this for pairs of parallel sentences, that is, sentences with the same meaning in different languages. So to predict a masked English word, the model can look at both the English sentence and its French translation, and thus learn to align the English and the French representations, which makes these representations truly cross-lingual. And this is exciting because it shows promise for how we can scale language understanding tasks across languages, without the need for a lot of explicit labels in each and every language.
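A minimal sketch of how a masked training example over a pair of parallel sentences might be built. The token names `[MASK]` and `[SEP]`, the 15% masking rate, and the function name are assumptions for illustration; details such as language embeddings and subword tokenization are omitted.

```python
import random

MASK = "[MASK]"  # assumed placeholder token
SEP = "[SEP]"    # assumed separator between the two languages


def make_parallel_masked_example(src_tokens, tgt_tokens, mask_prob=0.15, rng=None):
    """Concatenate a sentence with its translation and mask random tokens.

    Returns (inputs, targets): targets holds the original token at masked
    positions and None elsewhere. Because both languages share one input,
    a model predicting a masked word can attend to its translation.
    """
    rng = rng or random.Random(0)
    tokens = list(src_tokens) + [SEP] + list(tgt_tokens)
    inputs, targets = [], []
    for tok in tokens:
        if tok != SEP and rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets


english = ["the", "cat", "sleeps"]
french = ["le", "chat", "dort"]
inputs, targets = make_parallel_masked_example(english, french, mask_prob=0.3)
```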
Srinivas Narayanan: We’ve now extended this idea of self-supervision to speech as well. Here, the model is called wav2vec, and it is trained to predict the correct speech unit for the masked parts of the audio. With just one hour of labeled training data, this model outperforms the previous state of the art on the hundred-hour subset of LibriSpeech, using a hundred times less labeled data. And like in the previous example on NLP, we have also developed a cross-lingual approach that can learn speech units that are common to several languages. This approach helps when we have even small amounts of unlabeled speech, since languages for which we have little data can benefit from languages for which more data is available. And these techniques have been used to improve the quality of video transcriptions on our products.
Srinivas Narayanan: Now let’s look at videos. We looked at individual modalities like images, text, and speech, and videos are a really interesting content form that brings all of these things together. In this model, we use an approach called generalized data transforms. The model is designed to learn audio and visual encoders of the video, such that the representations of audio and visual content taken from the same video at the same time are similar to each other, but the representations from different times, or from different videos altogether, are different. Once you learn to align the audio-visual representations this way, you can use them to find similar videos without a lot of supervision, and this has been really useful in products like Instagram Reels to recommend related videos to people.
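One common way to express "same clip should match, different clips should not" is a softmax-based contrastive loss over a batch. The sketch below is a generic version of that idea, not the generalized data transforms objective itself; the temperature value and function names are assumptions.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(audio, video, temperature=0.1):
    """Average cross-entropy of matching audio[i] to video[i].

    audio[i] and video[i] are embeddings from the same clip and time;
    every other video in the batch serves as a negative for audio[i].
    """
    audio = [_normalize(a) for a in audio]
    video = [_normalize(v) for v in video]
    total = 0.0
    for i, a in enumerate(audio):
        logits = [_dot(a, v) / temperature for v in video]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(audio)


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each audio embedding is closest to its own video embedding (`aligned`) and large when the pairing is wrong (`shuffled`).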
Srinivas Narayanan: And now, we’re extending this to cover text in videos as well. In this approach, we first learn the audio and visual representations using CNNs and transformer models, which are then combined to produce an overall audio-visual representation of the video. Separately, we process the text for the video, whether it’s captions, descriptions, et cetera, with a transformer model and a recurrent neural network to produce a representation for each word, and then we aggregate that across all the words. Then we use what we call contrastive training to match the representations across the audio-visual and the text modalities. Here, we are trying to make sure that the video and text encoders have similar representations for text and videos that are similar, and different representations for inputs of text and video that are unrelated. Once you have trained representations across modalities to be aligned, this approach is really useful for applications like video search. For example, let’s say you have a text query, “show me every time we sang to Grandma”. We can compute the text embedding of that query and find related videos just by looking at the nearest video neighbors in the embedding space. And this can allow us to find videos that might otherwise have been hard to find.
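The search step described above can be sketched as a nearest-neighbor lookup in a shared embedding space. The video IDs and embedding values below are made up, and a brute-force cosine scan stands in for whatever index a production system would use.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    numerator = sum(x * y for x, y in zip(a, b))
    denominator = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return numerator / denominator


def search(query_embedding, video_index, k=2):
    """Return the IDs of the k videos whose embeddings are closest to the query."""
    ranked = sorted(
        video_index.items(),
        key=lambda item: cosine(query_embedding, item[1]),
        reverse=True,
    )
    return [video_id for video_id, _ in ranked[:k]]


# Hypothetical video embeddings, assumed to come from a trained video encoder.
video_index = {
    "birthday_song": [1.0, 0.2, 0.0],
    "beach_trip": [0.0, 1.0, 0.1],
    "soccer_game": [0.0, 0.1, 1.0],
}
# Hypothetical embedding of a text query, assumed to come from the text encoder.
query = [0.9, 0.1, 0.0]
top_match = search(query, video_index, k=1)
```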
Srinivas Narayanan: So, now let’s say you have a large quantity of high-quality training data, whether it’s through techniques like weak supervision or self-supervision, like we talked about, or maybe even just traditional supervision. Now, there’s another challenge for you: the growing compute demands. This is a chart that our friends at OpenAI published showing how the compute behind their experiments has grown over time. This is just one example, but it illustrates an industry-wide problem. And at Facebook, we have seen significant growth in the compute used, driven by the trend toward self-supervision, as well as the growing use of AI across many products. And with this, it’s essential that we ensure we are operating as efficiently as possible. So let’s see how we’re addressing some of these efficiency challenges. As I mentioned earlier, personalizing News Feed is an example of AI in action at scale, and Facebook and Instagram’s feed systems are powered by large-scale deep learning models.
Srinivas Narayanan: Here’s an overview of the model architecture that we use for such recommendation systems. These models have a combination of dense features, which represent things like aggregate counters, such as the number of likes, comments, et cetera, as well as sparse features that represent things like pages a user may have liked.
Srinivas Narayanan: These sparse features are then mapped to dense representations using embedding tables that are learned by the network. And then you have more neural net layers on top. Here are some of the scaling challenges in this work. The models have to be trained on tens of billions of examples, the size of an embedding table can be hundreds of megabytes, and you can have hundreds of such tables. So the models can be hundreds of gigabytes in size. As you can see here, the system has a different set of bottlenecks at different levels. For example, the embedding look-ups are memory capacity dominated, but the higher layers are network communication or compute dominated.
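A toy sketch of the dense-plus-sparse structure and of the size arithmetic. The row count, embedding dimension, sum pooling, and the single logistic layer standing in for the upper neural net layers are all assumptions chosen for illustration.

```python
import math
import random


def table_bytes(num_rows, dim, bytes_per_value=4):
    """Memory footprint of one embedding table stored as 32-bit floats by default."""
    return num_rows * dim * bytes_per_value


class EmbeddingTable:
    """Maps sparse IDs (for example, liked pages) to dense vectors."""

    def __init__(self, num_rows, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.rows = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]

    def pooled_lookup(self, ids):
        """Sum the embedding rows for a variable-length list of IDs."""
        pooled = [0.0] * self.dim
        for row_id in ids:
            for j, value in enumerate(self.rows[row_id]):
                pooled[j] += value
        return pooled


def predict(dense_features, sparse_ids, table, weights, bias=0.0):
    """Concatenate dense features with the pooled embedding, then apply one logistic layer."""
    x = list(dense_features) + table.pooled_lookup(sparse_ids)
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Assumed sizes: one million rows of 64 floats is 256 MB for a single table.
one_table = table_bytes(1_000_000, 64)
# Hundreds of such tables put the model in the range of tens to hundreds of gigabytes.
many_tables = 400 * one_table

table = EmbeddingTable(num_rows=10, dim=4)
score = predict([3.0, 1.0], [2, 5, 7], table, weights=[0.1] * 6)
```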
Srinivas Narayanan: One key way we have addressed these training compute challenges is by custom designing servers for these workloads. And we do this in an open way, and we have released our hardware designs externally through the Open Compute Project. For inference, we do a variety of optimizations. We do FP16 quantization to reduce the cost of evaluating fully connected layers. And we do int8, or even int4, quantization for the embedding tables. That can reduce the model sizes to an eighth of the original size with minimal impact on accuracy.
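To show what low-bit quantization of an embedding row can look like, here is a minimal sketch using a per-row scale and offset. This particular scheme and the 32-bit float baseline are assumptions; the talk does not specify how the quantization is implemented.

```python
def quantize_row(row, bits=8):
    """Map floats to integer codes in [0, 2**bits - 1] using a per-row scale and offset."""
    levels = (1 << bits) - 1
    low, high = min(row), max(row)
    scale = (high - low) / levels if high > low else 1.0
    codes = [round((x - low) / scale) for x in row]
    return codes, scale, low


def dequantize_row(codes, scale, low):
    """Recover approximate float values from integer codes."""
    return [code * scale + low for code in codes]


row = [-0.5, 0.0, 0.25, 0.5]
codes8, scale8, low8 = quantize_row(row, bits=8)
restored8 = dequantize_row(codes8, scale8, low8)
codes4, scale4, low4 = quantize_row(row, bits=4)

# Assuming a 32-bit float baseline, 4-bit codes are 32 / 4 = 8 times smaller,
# ignoring the small per-row overhead of storing the scale and offset.
size_reduction = 32 // 4
```

The rounding error per value is at most half a quantization step, which is one reason the accuracy impact can be small when the values in a row span a narrow range.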
Srinivas Narayanan: We also do factored shared computations. So parts of the model that will be exactly the same within a given batch will be sent over the wire and evaluated only once. And these are some of the most used models that we run, and so these sorts of improvements are pretty significant wins for our infrastructure.
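A minimal sketch of the shared-computation idea, assuming a ranking request where the user-side input is identical for every candidate in the batch. The function names and the toy "towers" are hypothetical stand-ins for real sub-networks.

```python
calls = {"user": 0, "item": 0}  # counters to show how often each part runs


def user_tower(features):
    """Toy stand-in for the part of the model that depends only on the user."""
    calls["user"] += 1
    return sum(features)


def item_tower(features):
    """Toy stand-in for the part of the model that depends on each candidate."""
    calls["item"] += 1
    return sum(features)


def score_batch(user_features, candidates):
    """Score every candidate for one user, evaluating the shared part only once."""
    user_repr = user_tower(user_features)  # shared across the whole batch
    return [user_repr * item_tower(candidate) for candidate in candidates]


scores = score_batch([0.5, 1.5], [[1.0], [2.0], [3.0]])
```

Without factoring, the user-side computation would run once per candidate; here it runs once per batch, and only one copy of the user features needs to be sent.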
Srinivas Narayanan: We talked about shared computation within a model, but we have an entirely different challenge in computer vision, where we have lots of different models for different types of products and use cases. One of the ways we have improved efficiency there is by sharing the compute across all these different image classifier models. Most classifiers use just the last layer of a shared trunk, but for a few complex tasks where we need more accuracy, we can also fine-tune by branching off at an earlier point in the shared trunk. This is kind of difficult to pull off because you have lots of different demands across many use cases. But this approach of having a shared trunk with different branches allows us to trade off accuracy versus efficiency in a much more effective way.
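The shared-trunk arrangement can be sketched as one feature extractor whose outputs feed many small task heads, with each head choosing the depth it branches from. The trunk arithmetic and the task names are invented for illustration only.

```python
def shared_trunk(image):
    """Toy stand-in for a shared backbone; returns features at two depths."""
    early = [2.0 * x for x in image]       # earlier, more detailed features
    late = [sum(early), max(early)]        # final, more compact features
    return {"early": early, "late": late}


# Each head names the trunk layer it branches from (hypothetical tasks).
HEADS = {
    "is_bright": ("late", lambda f: f[0] > 5.0),
    "brightest_pixel": ("early", lambda f: f.index(max(f))),
}


def classify_all(image, heads=HEADS):
    """Run the trunk once, then evaluate every head on its chosen layer."""
    features = shared_trunk(image)
    return {name: head(features[layer]) for name, (layer, head) in heads.items()}


results = classify_all([1.0, 3.0, 0.5])
```

Adding another classifier then costs only the head, not another full pass over the image.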
Srinivas Narayanan: Another technique we use to make models efficient is a well-known approach called knowledge distillation. The idea here is to train a large teacher model, and then to train a much smaller and more efficient student model to mimic the predictions of the teacher. This technique has now been applied effectively across many domains, including computer vision, NLP, and speech. Over a 12-month window, we’ve seen 2X the number of engineers training ML models, 3X the number of workflows they’re running, and 3X the amount of compute they are using. In addition to optimizing the infrastructure to operate as efficiently as possible, we also need to invest in common tools to enable the engineers and researchers to operate efficiently. So how do you provide the tooling that enables you to increase the efficiency of researchers and engineers? It required us to build a new stack.
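To make the knowledge distillation idea mentioned above concrete, here is a sketch of one standard form of the student's training signal: the KL divergence between temperature-softened teacher and student predictions. The temperature value is an assumption, and practical recipes usually combine this term with a loss on the true labels.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [l / temperature for l in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions; zero when they match."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)


teacher_logits = [4.0, 1.0, 0.5]
matching = distillation_loss([4.0, 1.0, 0.5], teacher_logits)
mismatched = distillation_loss([0.5, 1.0, 4.0], teacher_logits)
```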
Srinivas Narayanan: Here’s a very high-level overview of the stack that we’ve built. On the left, you’ll see tools for preparing data in the right format. In the middle are the pieces you need for building and training models, going bottom up all the way from hardware, whether it’s CPUs or GPUs; frameworks like PyTorch that ease the model building environment; libraries that are specific to each domain, which again allow us to build models in those domains much faster; and finally, the models that are used in products. And on the right, once you have the trained model, you need to have the right tools and systems for deploying it in production, whether it’s in a data center or locally on the device. There are obviously a lot of pieces here, but we have built fully managed platforms for doing this. And many of these tools are also available in open source, so you can assemble them on your own as well.
Srinivas Narayanan: So, here’s a quick snapshot of some specific tools that we have used across all these different needs. So we have tools for data, for training and testing, for debugging, for deploying and serving, and of course, hardware as well. And these common tools and models across the company enable a range of product teams across our family of apps to more easily leverage the AI tech for what they’re building. One key challenge that we needed to solve here is that we really needed to make the research to production flow smooth, even though the needs of the research stage and the needs in the production stage for large scale deployment can be very different. And so to that end, we have built PyTorch as a single framework to enable rapid experimentation of new research ideas, and to bring those ideas to production seamlessly.
Srinivas Narayanan: Today we are using PyTorch in production at Facebook in our AI work, across most of our domains, from the research to production spectrum. PyTorch is open source, and Facebook and the community are actively expanding the ecosystem of tools to provide you with this new stack. You can see some of the high-level features that are really nice in PyTorch here, and my colleague Samit is going to be talking about PyTorch in more detail in another session at this conference.
Srinivas Narayanan: The next challenge I want to speak to is one that can’t be solved with just tooling: the challenge of ensuring that you can develop and deploy AI responsibly. As we develop and deploy AI at scale, we need to think about its responsible use. Responsibility has many facets, but I’ll talk about one that is particularly important: fairness and bias in AI models. At Facebook, we’re using AI to benefit billions of people around the world, and to do that well, we need to ensure the systems work fairly and equally well for everybody. Let me briefly share a story.
Srinivas Narayanan: I mentioned Portal’s smart camera as one example of how we are using AI to power new product experiences. In an early test of the pre-production hardware, a Nigerian-American member of our team, Lade, noticed a serious problem with the smart camera.
Srinivas Narayanan: It seemed to work fine for white male colleagues, but it wasn’t working for her. It didn’t automatically focus on her when she was the one speaking and gesturing. Why not? It turns out that the dataset that was used for training the model in that test version wasn’t inclusive enough. The data was good for some people, but not for everybody.
Srinivas Narayanan: In this case, the fix was creating a more inclusive dataset and then validating the results after improving the model. But the issue goes beyond just ensuring we are using inclusive datasets. We all have the responsibility to ensure that as we develop and deploy AI at scale, we’re not introducing or amplifying bias and creating unfair systems. And this is a challenge because it’s not as simple as using the right tools. Fairness is a process. At each step of the implementation process, there is a risk of bias creeping in. In data and labels, in algorithms, the predictions they make, and the resulting actions based on those predictions. And so at each step, we need to surface the risks to fairness, resolve those questions, which means defining fairness in that context, and document the decisions and processes behind it.
Srinivas Narayanan: Next I’d like to touch upon some process and cultural changes in scaling AI. We’re still in the early days of AI adoption, and we’re learning a lot as we are developing these best practices. Let me start with reproducibility. With the rapid pace of innovation in AI, you see some state-of-the-art result almost every day, and you wonder how you can build on top of it. To do that, you first need to consistently reproduce the advances that those papers claim. Unfortunately, that hasn’t been easy. So this is an important problem, and the academic community is starting to make reproducibility a part of the paper submission process, as well, as you can see here.
Srinivas Narayanan: Another challenge is making our engineering velocity faster and enabling continuous integration and continuous deployment for machine learning models. And there are a lot of challenges here as well. There are usually many differences between how models perform on offline datasets and how those same models perform online on live data. And this slows down the speed of innovation. We’ve been investing in techniques such as counterfactual evaluation to bridge this gap.
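The talk names counterfactual evaluation without saying which estimator is used. As one standard example, the sketch below uses inverse propensity scoring to estimate from logged data how a new policy would have performed online. The log format, policy, and values are assumptions for illustration.

```python
def ips_estimate(logs, new_policy_prob):
    """Inverse propensity scoring estimate of a new policy's average reward.

    Each log entry is (context, action, reward, logging_prob), where
    logging_prob is the probability the deployed policy chose that action.
    new_policy_prob(context, action) is the probability under the new policy.
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        total += reward * new_policy_prob(context, action) / logging_prob
    return total / len(logs)


# Hypothetical logs from a policy that showed items "a" and "b" uniformly at random.
logs = [
    ("user1", "a", 1.0, 0.5),
    ("user2", "b", 0.0, 0.5),
    ("user3", "a", 0.0, 0.5),
    ("user4", "b", 1.0, 0.5),
]


def always_a(context, action):
    """Hypothetical new policy that always shows item "a"."""
    return 1.0 if action == "a" else 0.0


estimated_reward = ips_estimate(logs, always_a)
```

In the logs, item "a" earned a reward half of the time it was shown, and the estimate for the always-"a" policy comes out at that same rate. Estimators of this kind need the logging policy to have given every relevant action a nonzero probability.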
Srinivas Narayanan: Next, ML models often cascade. For example, a computer vision model may provide signals to a different downstream model, and it becomes hard to assess the real impact of improvements downstream in such cases. And so you have to retrain and redeploy the entire cascade of models every time. And this is not easy to do without the right tooling and infrastructure improvements. A third problem is that ML models are inherently imperfect. A new model may perform better in aggregate, but may have worse results for some more important examples. And we don’t quite have the right rigorous definitions of model contracts that enable making quick and easy changes with confidence. These are some of the challenges that we’ve been working through over the last few years. AI is making an impact on lots of problems, but at the same time, many of these problems are still super hard.
Srinivas Narayanan: At Facebook AI, we believe the best solutions will come from open collaboration by experts across the entire AI community, and so we’ve been trying to release more open datasets for some of these hard problems. Now if you can release these open data sets externally that’s great, but even creating strong well designed data sets internally can spur more people inside the organization to advance the state of the art.
Srinivas Narayanan: We also really believe in bringing holistic thinking to AI problems. We bring together multiple disciplines, including product management, design, analytics, and user research, to help frame the product problems crisply, define the ML tasks that we need to solve more precisely, and design the right evaluation methodologies for these systems.
Srinivas Narayanan: So I’ve tried to codify some of these as learnings in how we approach AI as we have scaled it to many products and billions of users at Facebook. Being rigorous: we’re pushing all the AI experimentation to be more rigorous and looking at some of the challenges I mentioned earlier, like reproducibility and model evaluation, much more closely. We’re trying to create more open datasets to foster more open collaboration by experts. We’re encouraging a holistic approach to product thinking rather than just a technology-centric approach. AI is also a highly empirical science, and what seems to work in one setting doesn’t work in another. So it requires not just rigor, but also a lot of patience and determination to see the results. Some of our ideas, even the ones that we thought were obvious, have taken one to two years from the initial idea to proving value, after many rounds of experimentation and iteration.
Srinivas Narayanan: Lastly, being a finisher. In many cases, our work tends to be picked up by other teams, and we encourage the people on our teams to make sure that eventual product value is realized, instead of just handing off the technology. We often find very interesting problems in the integration process that create new, exciting work and also make us reformulate some of the problems we started on for the next stages of the work that we have to follow up on.
Srinivas Narayanan: So I hope you find some of the ideas and learnings I shared today useful in your own work, and thank you.