Growing With Open Source: From Torch to PyTorch With Soumith Chintala
Soumith Chintala is a Researcher at Facebook AI Research, where he works on high-performance deep learning. Soumith created PyTorch, a deep learning framework that has traction among researchers. Prior to joining Facebook in August 2014, he worked at MuseAmi, where he built deep learning models for music and vision targeted at mobile devices. He holds a Master's in CS from NYU, and spent time in Yann LeCun's NYU lab building deep learning models for robotics, pedestrian detection, natural image OCR, and depth images, among others.
Soumith Chintala, creator of PyTorch and AI researcher at Facebook AI Research (FAIR), discusses how open-source best practices can be used to run a successful development project with a collaborative community. He shares how these practices helped make PyTorch one of the most popular and widely used machine learning frameworks today. Join this session to learn how you can use tools, processes, and communities to run a highly effective open-source project.
Speaker 1: Thank you, Alex. And thank you, Sam. Our next presenter is Soumith Chintala. Soumith is a researcher at Facebook AI, where he focuses on high-performance deep learning. He's also the co-creator of PyTorch, a deep learning framework that has gained significant traction, first among researchers and now more broadly, to become one of the most popular machine learning frameworks available today. Prior to joining Facebook, he developed deep learning models for music and vision on mobile devices. Soumith joins us today to discuss the last five years of ML frameworks and future bets. Soumith, the stage is yours.
Soumith Chintala: Hi, I'm Soumith. I work on machine learning frameworks. I started the PyTorch project at Facebook, among other things, and I'm going to talk today about machine learning frameworks: how they've evolved along certain dimensions of interest, and, within that framing, how they will continue evolving going forward. I'll talk about the future as a distribution of things. This talk is 30 minutes, and this is a field you could talk about for days, so obviously the talk is going to be simplified in various ways. Please bear with me on that. But I still think picking a few dimensions and talking through machine learning frameworks and their evolution within those dimensions is going to be pretty useful. Okay, so let's start. I'm going to introduce three people, three personas.
Soumith Chintala: The first one is MODELER. A modeler is someone whose job is to look at the data and assess its quality. They ask, "Hey, do I need more labels?" Then they start doing pre-processing or feature engineering. Then they pick some way to do machine learning: they build an architecture, and they encode their priors into the learning, either via some trick of the architecture or some regularization scheme. And then they build a training pipeline. They do machine learning to solve some task, either of research interest or business interest. Then there is the second person, who I call PROD. Prod is typically the person the modeler goes to when they actually want to reliably ship something into some critical part of a product, to reliably ship it to what we generally call production. Prod usually makes sure you're able to version your models, so that in case something goes wrong they can roll back, and that you're able to version the data that comes in and goes into the models when they're trained.
Soumith Chintala: They also generally make sure that all of the metrics they monitor are within acceptable ranges, and that new models the MODELER has given them are within acceptable ranges of performance, to keep costs or power down. And they make sure to do that in coordination with the third person, who I call COMPILER. Now, what does compiler do? Compiler's job is to map the models the modeler has given them, either while the models are still being trained or when they enter production, as efficiently as possible onto hardware. That could be server hardware, accelerators, phones, embedded systems, the Mars Rover, anything. Their job is to squeeze the best performance out of the models, be it performance per watt, performance per second, or performance per dollar. That's pretty much it. Even though the term is compiler, they can even be a hardware implementer who just builds new hardware, somewhere like NVIDIA.
Soumith Chintala: So let's talk about the software stack. Don't forget the personas, but I'm going to quickly talk a little bit about how the software stack has evolved over time, which is kind of important, and then we'll tie it back to the personas. Before deep learning was popular, before 2012, you typically had a software stack that looked somewhat like this, where a lot of focus was on pre-processing, feature engineering, and post-processing, so you had domain-specific libraries for those. For the machine learning models themselves, there was a very small interface to the software packages or libraries that built those machine learning models and trained them for you.
Soumith Chintala: So if you've ever used XGBoost, or scikit-learn, or Vowpal Wabbit, you give some kind of configuration of what model you're building: what learning rate or regularizer, how many trees are in the forest, and so on. Once you build that config, you give it to a factory, along with your data in some pre-processed or clean form. Then the software engine that implements that particular machine learning algorithm handles the entire stack: the training loop and all the implementation details of the model. And pre-2012, these mostly mapped to CPUs. Something like XGBoost would specialize a lot for gradient-boosted trees, to always get the best performance out of CPUs, with all kinds of tricks that are very specialized to boosted trees and make them go faster. The one thing to recognize here is that the model, in this context, is typically a configuration that is generally small and usually human-readable, plus a blob of weights stored in some blob format, maybe on disk or in memory.
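To make that regime concrete, here is a minimal sketch in the scikit-learn style he describes; the particular estimator, hyperparameters, and synthetic data are illustrative choices, not taken from the talk:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in for data that has already been pre-processed and cleaned.
X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200)

# The "model" is a small, human-readable config handed to a factory...
config = {"n_estimators": 100, "learning_rate": 0.1, "max_depth": 3}
model = GradientBoostingClassifier(**config)

# ...and the engine owns the entire training loop and implementation details.
model.fit(X, y)
print(model.predict(X[:5]))  # the learned state is an opaque blob of weights
```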
Soumith Chintala: So, enter deep learning. In late 2012, deep learning got popular. Deep learning is nothing but neural networks, or differentiable learning. It got popular, and hence came the frameworks that enable modelers and compilers and prod to practice deep learning. In the post-deep-learning world, this is how the stack looks: you have a very large API surface in the middle. Mainstream deep learning frameworks like PyTorch or TensorFlow have thousands of functions in their API, and these thousands of functions are strung together by modelers to build models, which can come in all shapes and sizes. Below that you have data structures, typically tensors, say dense tensors or sparse tensors, and within dense tensors you can have memory layouts that might make computation more or less efficient. And then you have a bunch of hand-optimized functions, typically written by high-performance computing experts, that map these APIs efficiently onto accelerator hardware.
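As a small illustration of the layout point, a hedged PyTorch sketch, using the channels_last memory format as one example of two layouts for the same dense tensor:

```python
import torch

# One dense tensor, two memory layouts; hand-optimized kernels can prefer
# one layout over the other on a given accelerator.
x = torch.randn(8, 3, 224, 224)                  # default NCHW layout
x_cl = x.to(memory_format=torch.channels_last)   # NHWC-style layout
print(x_cl.is_contiguous(memory_format=torch.channels_last))  # True
```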
Soumith Chintala: You've also, in the last few years, been seeing compilers pop up. XLA, TorchScript, and TVM are examples of compilers that take whole models described in the APIs of these frameworks and map them to hardware more efficiently than stringing together hand-optimized functions. And lastly, you typically have a distributed transport layer that enables these models to run on multiple devices or multiple machines at once. On top of this API, you have domain-specific libraries that make it easy to train your models within particular domains. For example, you might have computer-vision-specific pre-processing or functionality that all computer vision people can use together; NLP and audio equivalents generally come in all flavors and sizes. You also have high-level frameworks such as fast.ai, Keras, or PyTorch Lightning that try to bring back that pre-deep-learning convenience of quickly describing what you want to do, or quickly fitting your data to your model, instead of verbosely implementing everything manually.
Soumith Chintala: And then on top, you have prod tooling, such as TFX or TorchServe or SageMaker, and Spark AI is starting to have some tooling. The general mainstream deep learning frameworks do a full vertical integration across this stack to make things pretty efficient. There are particular solutions from various parties that focus only on particular parts of the stack and interface cleanly with the rest. One thing to recognize here is that in these post-deep-learning mainstream machine learning frameworks, PyTorch and TensorFlow, a model is described as code: code in some language. It's not a configuration file or a JSON blob anymore. It's actually complicated code, which can have loops and the various structures you typically associate with a programming language. And then weights, which are just blobs of numbers stored somewhere.
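A minimal sketch of the model-as-code idea, using a hypothetical PyTorch module; the network and its stopping condition are invented for illustration:

```python
import torch
import torch.nn as nn

# The "model" is now a program: it can contain loops and data-dependent
# control flow that no static config file could express.
class TinyNet(nn.Module):
    def __init__(self, depth: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(16, 16) for _ in range(depth))

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
            if x.abs().mean() < 0.01:  # dynamic control flow, decided at runtime
                break
        return x

model = TinyNet()
out = model(torch.randn(4, 16))
torch.save(model.state_dict(), "weights.pt")  # the weights are still just a blob
```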
Soumith Chintala: So it wasn't always like this; that picture didn't always look like this. Just after deep learning got popular, you had various frameworks: cuda-convnet, which is the framework that started the revolution, then Caffe, and I used to use a framework called EBLearn. They had a much smaller API surface and fewer data structures; they had only hand-optimized functions; they didn't have compilers; they typically didn't have distributed support; they didn't have much going on. And they didn't have an ecosystem of domain-specific libraries or utilities on top of them. In that regime, a model was still described as a config, as a protobuf or a JSON or a custom-defined configuration file, plus weights. It was still basically a transition from that pre-deep-learning world, and that was what was most convenient. But you actually had counter-examples. Theano, which was actually way ahead of its time, described models as symbolic graphs. And a large API surface basically makes writing the framework really hard: Theano had a compiler, but the compiler was really slow, or wasn't very efficient, and that largely made things very difficult.
Soumith Chintala: Eventually, things evolved. There were, I think, tens of frameworks, and they narrowed down to only two surviving as mainstream frameworks. Those two are PyTorch and TensorFlow, and they both treat a model as code plus weights. One thing to ask ourselves is: why did we enter this model-equals-code regime? Why didn't we just stay with config files? One of the reasons is basically that modelers were pushing the limits of frameworks. They were implementing ideas that looked more and more like real programs. They had dynamic control flow. They had dynamic shapes, basically the shape of the input tensors changing from one iteration to the next, typically seen in object detection or NLP. Or look at, say, GAN training: it was very different from standard image classification, or any kind of classification, where you typically just did forward, backward, update, and then went to the next iteration: forward, backward, update. GAN training changed that loop, which means some internal details of these ML frameworks were no longer compatible with what modelers wanted.
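To see how GAN training breaks the single forward/backward/update loop, here is a hedged sketch with a toy generator and discriminator; the architectures and hyperparameters are placeholders, not from the talk:

```python
import torch
import torch.nn as nn

# Hypothetical tiny generator and discriminator, just to show the loop shape.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 16))
D = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real = torch.randn(32, 16)  # stand-in for a batch of real data
ones, zeros = torch.ones(32, 1), torch.zeros(32, 1)

for step in range(100):
    # The discriminator gets its own forward/backward/update...
    fake = G(torch.randn(32, 8)).detach()  # detach: no generator grads here
    loss_d = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # ...then the generator gets a separate one. Two interleaved updates per
    # iteration, not the single forward/backward/update of a classifier.
    loss_g = bce(D(G(torch.randn(32, 8))), ones)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```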
Soumith Chintala: And self-supervised learning takes that to even more extreme schemes, like BYOL or SimCLR, which became popular recently. They have a very complex training regime, and the training loop itself is very involved. So again, the whole field rolled towards the convenience of the modeler, the convenience of letting modelers express their ideas. And it did come at a big cost: both compiler and prod were generally unhappy, because their lives got worse. It became harder to write a compiler, or to map models efficiently to accelerator hardware, when you're dealing with more general programs. Same for prod: if a model is a config plus weights, prod can version models and such, but that wasn't the case anymore. With the model becoming code, prod had to figure out how to debug models in production and deal with all kinds of nasty issues; prod wasn't happy with this regime either, and still isn't. So, having asked how we ended up in this model-as-code stack, the second question you could ask is: why do we have such a large API surface? That's not where we started; Caffe or cuda-convnet typically had a very small API surface. And again, it has to do with the fact that every few months, people publish some disruptive new result that involves some new building block, or some new training regime, that has to be expressed in different terms than the previous mid-level building blocks.
Soumith Chintala: So for the large part, these ML frameworks rolled towards very low-level or mid-level building blocks, and a lot of them, to express all the mathematical functions and ideas that modelers had. Again, it was because of the convenience of the modeler, and it came at the cost of compiler and prod being even more unhappy. So why did the modeler get so much leverage? If there are three people in this ecosystem, why is the modeler getting so much importance? Why do they have so much leverage? That's a fairly important question to ask. The reason is that the modeler is largely credited with making progress in the field. AI after 2012 slowly increased in hype to a point where everyone wants AI to do everything in the world, and modelers have been credited with trying to keep up with that hyped-up world and making progress. So they've been the ones creating all the new value. The AI/ML/compiler/software stack, whatever you call it, has been evolving to take care of modelers, and that has been almost existential for compiler and prod to survive.
Soumith Chintala: For the large part, there still seems to be progress from whatever modelers do, so that's the way the field is going. Compiler isn't happy, right? Compiler is looking at themselves and they're like, "These three-year disruption cycles mean some fundamentally important architecture, one I thought would be important for the next 30 years, is no longer used." You can look at, say, LSTM, or AlexNet, or VGG, or Inception: all these very popular architectures of their time, which people were almost universally using, and only three to five years later no one uses them. I mean, LSTMs are old, but they started getting popular sometime in 2014, again because of work out of Google, and that's what I'm referring to. They got popular, and then once Transformers came out, no one is using LSTMs anymore. So someone somewhere, and I know of a few people who tried, builds specialized hardware, or compilers, or handwritten software implementations that are very specialized to, say, LSTM and ResNet-50. That's pretty much all it does, but it does that 100 times better, or some promise like that.
Soumith Chintala: But then to develop that software of hardware, they would take three years. Well, by the time they actually ship, these things are no longer used. And that's a problem. So compiler is generally not happy that the only stable primitives that they have been able to work with are convolution and general matrix multiply. That's also why GPUs are still extremely dominant and haven't really given up their market share to a more specialized hardware yet. So what does compiler actually want? What do they want that it's better? They want something that looks like an A1, they want a stable high level IR that is small and closed within itself. And they just want for it to not change. So they can build some specialized, high performance expertise to map that more efficiently to new hardware or just build new hardware that executes this high level set of programs more efficiently. But modelers keep expanding the operator set and the keep breaking all kinds of fundamental abstractions and keep going lower and lower down the stack. And they keep giving trouble to the compiler.
Soumith Chintala: And the other persona that's not happy is prod. Prod want easily version for DevOps like models, they want to be able to roll back, like they want to do very simple things so that they can keep, if something goes wrong, there's very few variables that actually change. They don't want you to pull some random Python function from some random Python package from the internet and then use that within your model because then that model has to ship to productions of prod. Between you're writing the model and then them shipping it, they have to figure out how to strip the model off that Python function, or figure out that it's actually safe and shippable. Or you have some constraints with production, right? You might say, let's see how to ship it to Android or something. Then it's a lot of work to ship Python into some app on Android. So prod isn't generally happy with doing crazy things and modeler just does crazy things. And so as I mentioned modelers leverage is that every three years they seem to have very big disruptions. And every few months, they seem to have incremental disruptions. And the pace of value creation has been slowing down, they're still seems to be a gas in the tank.
Soumith Chintala: So one of the reasons I would say PyTorch was successful is that it put modelers at the center of the universe. I used to give talks in the early days where I said, "Hey, I don't know if PyTorch is the fastest framework around, it might even be 10% slower, but it will give you more flexibility and debuggability and help you express your ideas better." What that did is make modelers' lives easier. And the compiler and prod people back then were like, "Yeah, but we will never ship this into production." What ended up happening was that because modelers created future value, and that future value depended on all this flexibility, compiler and prod actually had to come around to the new reality. So let's talk about the future. The modeler's leverage: when does it end? Will it end, or will it maybe still increase? Is there still gas in the tank for modelers to keep innovating and keep getting credit for progress in AI?
Soumith Chintala: And so compilers and prod will continue to under fit to the problem and be under leverage in doing a better efficiency job if they lived in a different more stable world. Whenever we talk about future, I typically think of it as a distribution over chains of events. You say, "Well, this thing can happen with the probability x. And then if that happens this next thing can happen." And then you just chain them. So I'll talk about a few events that could happen and how the ML frameworks stack would change. So let's see the effects of a few possible events. The first event is, let's say, today, Transformers and con nets make up for the majority of what people think are the answer to everything. Let's just hypothetically say that actually becomes true. And that they just become the stable dominant architectures where dominance, like they take all the heavy parts of the distribution of architectures that people use, then what would happen to this diagram from before? Well, the API surface of the four frameworks that are needed to be mainstream will actually reduce, the data structures then we'll shrink rated, we wouldn't need so many, like tensors with five layouts and all that. And then pretty much everything under the stack will just have a much, much easier time.
Soumith Chintala: So the number of hand optimized functions will shrink, the compiler will have an easier job, the hardware people can start specializing more to say the shapes and sizes of the types of convolutions or matrix multiplies or whole transformer blocks that they need to compute. I don't know if that will happen, but if it does happen, that there will be a next wave of frameworks, which will again look like the classical frameworks where they will just drive everything with config files and then specialize, you don't have to expose a much more generalized scientific computing framework to the general public. Hugging Face was already doing this, become more dominant temporarily, there will be other players that come in that try to take charge of this insight. Let's talk about a second event. Let's say there is some hardware that looks very different from all the existing simulators. And then there's some obscure, not obscure, but some not as used machine learning models, such as probabilistic graphical models or even some popular ones such as sparse networks that have not been mapped efficiently enough to the current accelerators.
Soumith Chintala: Let's say they were mapped onto some hardware that looks very different, like Cerberus. And there's some disruptive results that are shown, then pretty much the entire stack of machine learning has to be rethought from scratch. And it would be a very, very, very disruptive event. And new frameworks that actually enable that work will take the mantle. And it can actually end up being a transformative event that gives an opportunity for new languages like Julia to start taking charge of the field. Right now, no one wants to move from Python because they don't have enough incentive to. So it could create an incentive as such that can make such a change. And that would be interesting and exciting. And I would definitely look forward to something like that. The third event I want to discuss is let's say, you had a particular regime where models were actually first one together from a bunch of pre trained priors or weights. And then hence, models became much more data efficient. They didn't need as much labeled data. I think, typically, it depends on the priors and how they're expressed. But let's say the priors are neural networks, then PyTorch and TensorFlow will probably continue their status quo. But then there will be whole websites that are about selling priors and which are about discovering priors, websites that are going to democratize prior discovery and usage.
Soumith Chintala: And there will probably be new sets of people trained to know which priors are better than the other. And there might even be neural network architectures that predict which priors to string together for which problem. And if they are not, if priors and not just neural networks, but there could be neural networks or mathematical functions of various kinds then we need to figure out the way these deep priors like the pipeline, if you found pipeline the priors you need to be for how they interoperate and talk to each other. The only way for mainstream frameworks to stay relevant within all this is if they can keep a very high velocity, maintain main stream frameworks, so very large and complex pieces of software. And they are being worked upon by lots of people. So if there's a change in the field and they don't keep up fast enough, they eventually will die. So the only way they can actually keep up is they maintain a very high velocity. And there will be specialized tooling that comes in all the time because specialized tooling that is more niche, more specific, doesn't have the baggage that comes with moving slow. So they can just move faster, they can be more efficient, they won't have the advantages of full vertical integration.
Soumith Chintala: And so if mainstream frameworks do move faster then they will just be able to kill specialized tooling over time. So the last words I wanted to leave with my talk is in science progress is a combination of having great ideas and having the tools to execute those ideas. If either one is stuck in a local minima then all progress will stop. So let us continue to make progress by being open to both new ideas and new tools. Thank you.