Scale Events

A Machine Learning Infrastructure Playbook with Lambda

Posted Oct 06, 2021 | Views 1.8K
# TransformX 2021
# Breakout Session
SPEAKER
Stephen Balaban
Co-founder and CEO @ Lambda

Stephen Balaban is the co-founder and CEO of Lambda. He started using CNNs for face recognition in 2012 and was the first employee at Perceptio. (Acquired by Apple in 2015.) At Perceptio, he developed software that ran CNNs locally on the iPhone's GPU. He has published research in SPIE and NeurIPS.

SUMMARY

In this session, Stephen Balaban, CEO of Lambda, shares a playbook for standing up machine learning infrastructure. This session is intended for any organization that has to scale up its infrastructure to support growing teams of Machine Learning (ML) practitioners. Stephen describes how large ML models are often built with on-premise infrastructure. He explores the pros and cons of this approach and how workstations, servers, and other related resources can be scaled up to support larger workloads or numbers of users. How do you scale from a single workstation to a number of shared or dedicated servers for each ML practitioner? How can you use a single software stack across laptops, servers, clusters, and the cloud? What are the network, storage, and power considerations at each step in that journey? Join this session to hear best practices for scaling up your machine learning platform to serve the growing needs of your organization.

TRANSCRIPT

Speaker 1 (00:15): Next up, we're delighted to welcome Stephen Balaban. Stephen is the co-founder and CEO of Lambda. He started using CNNs for face recognition in 2012, and was the first employee at Perceptio, acquired by Apple in 2015. At Perceptio, he developed software that ran CNNs locally on the iPhone's GPU. He has published research in SPIE and NeurIPS. Stephen, take it away.

Stephen Balaban (00:50): Hi, I'm Stephen, CEO of Lambda. Today I'm going to be going over Lambda's ML Infrastructure Playbook. The playbook is a clear roadmap for how to get started standing up machine learning infrastructure for your team, and how to incrementally expand that machine learning infrastructure as you grow. So first, a little bit about me, and a little bit about Lambda. Lambda is building a future where scaling from a single GPU to an entire data center just works. We offer a wide variety of hardware and software products that go from our TensorBook laptop all the way up to the Echelon GPU Cluster, which is a multi-rack GPU cluster design with InfiniBand, and we also have a billed-by-the-minute GPU cloud service.

Stephen Balaban (01:53): Installed on all of our hardware systems, and on every single one of the cloud instances that you'd spin up, is Lambda Stack. Lambda Stack is a managed, always up-to-date software stack for deep learning. So before we go into the meat of the presentation, I want to tell you a little bit about the history of deep learning. And really in the beginning, there was a workstation. "Our network takes between five and six days to train on two GTX 580 3 gigabyte GPUs. All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger data sets to become available." Now, this is a quote from the AlexNet paper in 2012, and it really is a prescient image of the future.

Stephen Balaban (02:46): Really, the results can certainly be improved by waiting for faster GPUs and bigger data sets. Fast forward nearly a decade later, and massive models reign, but for most, the work is still happening on a workstation. At Lambda, we've done a survey of people who have downloaded or requested a quote from our website, and we send this survey out to them, and I wanted to share some of those results with you, because I think it really tells an interesting story about where most people are with their machine learning infrastructure. So the distribution of computation, or compute, like wealth, follows a power law. When asked what total number of GPUs their lab has available to train neural networks with, almost 85% of respondents said, "Less than 10 GPUs." And so you can see that the distribution of the number of GPUs available is clearly following a power law here.

Stephen Balaban (03:59): It's also true for the distribution of research and development head count. When asked, "How many ML researchers and engineers does your team have?" over 70% said, "Less than five," and the people who have 50 or more machine learning researchers or engineers make up less than 0.2% of the surveyed population. Now, obviously, there's some bias in this, because these are people who are coming to our website to download a quote, but it's still very interesting. So can you guess what percentage of respondents said that they primarily train on-prem versus in the cloud?

Stephen Balaban (04:49): The interesting thing here is that, at least for people who are downloading quotes from Lambda (and obviously, again, there's some bias here, because we're an on-prem GPU infrastructure company), more than 80% of people, when asked whether they train primarily on-prem or in a public cloud, said that they train on-prem. So here's a playbook for how to get started standing up machine learning infrastructure for your team. First of all, you're going to want to decide between cloud, on-premise, or some sort of hybrid between cloud and on-prem. Some good reasons to choose cloud would be that you need the GPUs now. You need [inaudible 00:05:39]. You need access right now. And the fastest way to get access to a hundred GPUs is to spin up some cloud instances.

Stephen Balaban (05:48): Furthermore, if you're doing production inference, so if you are primarily doing machine learning inference, and all of your data is already on AWS or GCP, it's pretty easy for you to pull that data from S3 and do the inference in the cloud where you are currently running your production application. So that's another good reason to be in the cloud. Furthermore, if you have a spiky or inconsistent workload, so maybe you use a bunch of GPUs and then don't use anything for a couple of months, or use a few GPUs and then spin it down for most of the year, then cloud is probably a good idea for you. However, if you've got a more consistent workload, then you should definitely be considering on-prem.

Stephen Balaban (06:38): So the advantages of on-prem are that you get significantly more compute for less money. It is a fraction of the cost, if you're a heavy user of compute, to be on-prem than it is to be in the cloud. Also, you have complete control of the data that you are gathering, using, and training on, so if you care about data sovereignty and security, that's another good reason to be on-prem. Also, if you're working with huge data sets, so multi-petabyte datasets that maybe you've gathered from an autonomy application, or from field work that your team has done, when you have big data sets like that, you run into expensive cloud storage and very expensive egress fees if you ever need to bring that data back on-prem. And so that's another good reason to be considering on-prem, but really it boils down to the fact that you get significantly more compute for less money, even if you just think about building out a single GPU workstation.

Stephen Balaban (07:38): So if you're thinking about just getting started, most engineers and researchers will just start with either a GPU laptop or a GPU workstation. And unless you've got a very good reason as to why you need to start off with building an entire cluster, or need to be spinning up thousands of GPUs, you should also be starting with just a single workstation for your machine learning researchers. When you're deciding on the GPU to select for your workstation, it's a good idea to look at the latest benchmarks that are out there. So Lambda runs some benchmarks, and you can go to lambdalabs.com/gpu-benchmarks to see those. And there are also some good benchmarks on MLPerf.org, but you can basically select the model that you're planning on training, and see which GPU performs the best on that particular model. And so we've done all of those benchmarks for all the major models, and for all of the production GPUs that are available today in the market.

Stephen Balaban (08:51): So this is an example of throughput results for BERT training, BERT-large on the SQuAD dataset. And as you can see, we've done very comprehensive benchmarks and shown you the speedup that you'd get over a reference GPU. So we use the V100 32 gigabyte as the reference GPU, and you can see that certain GPUs like the A100 and the A6000 perform really well on this particular model. And so again, figure out what model you're planning on training, and you can go and check out the benchmarks to choose the best GPU for your application. And after you've done that, you can just incrementally add compute for every new hire that you add. So maybe the new hire only needs a single GPU, in which case getting a TensorBook laptop might be a good idea.

Stephen Balaban (09:58): Maybe they need two to four GPUs, in which case a workstation would be a good idea for them, and you just incrementally scale out some on-premise infrastructure for them. You'll find that this is not only significantly less expensive than something like AWS, but it's also going to give you much faster training results, because you can use A6000 and A100 GPUs locally that, for the same price, you may not be able to get access to in the cloud. And once you've done that, let me walk through a playbook for expansion. So when your team starts to share resources, that is to say, they want to share a single system, it's usually a sign that you should start looking into getting a centralized server to provide compute to the whole organization.

Stephen Balaban (10:59): And you can start off with this sort of hack, if you will, for doing resource sharing, which is that if you use CUDA_VISIBLE_DEVICES, you can mask the GPUs that somebody sees, so you can have them prepend this environment variable to all their jobs. And so in this case, Joe is going to be running his training job only on GPUs 0 and 1, and Francie is going to be running hers on GPUs 3, 4, 5, 6, and 7, and maybe GPU 2 is still going to be available. But the point is that you can set this environment variable, and it will restrict the set of visible GPUs for that process. You can also do hard-coded allocations in each user's UNIX account's .bashrc file. So when they SSH into the server, you can export the CUDA_VISIBLE_DEVICES variable, and give them explicit access to particular GPUs.
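
As a rough illustration of that masking trick, here's a minimal Python sketch (the GPU indices, and the use of PyTorch for the check, are just illustrative assumptions):

```python
# A minimal sketch of the CUDA_VISIBLE_DEVICES masking trick from Python. The GPU
# indices are hypothetical; in practice you'd prepend the variable on the command
# line (e.g. CUDA_VISIBLE_DEVICES=0,1 python train.py) or export it from .bashrc.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # must be set before CUDA is initialized

import torch

# The process now only sees the two allowed GPUs, renumbered as cuda:0 and cuda:1.
print(torch.cuda.device_count())  # -> 2 on a machine with at least GPUs 0 and 1
```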

Stephen Balaban (12:07): And that's great at a small scale, but over time, that method becomes really unwieldy as you try to figure out who's running what jobs, and whether a given job is important. I'm sure you've all experienced that fear of: whose job is this, even, and will they be upset if I kill it? Is it an important experiment, or is it just some task that they're running? And at that point in time, you should probably consider using a job scheduler like SLURM or Kubeflow. A job scheduler will basically allow people to enqueue work into a job queue, and then the set of servers will pull those jobs down from that queue, run them, and mark them as finished when they're done. And this is a much more organized way of helping you prioritize your jobs. Instead of individuals getting pre-allocated access, it's just a function of whether there's currently space in the queue, or whether there are currently any jobs running, and it will do a better job of optimizing your resource allocation.
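
The talk doesn't walk through a specific scheduler configuration, but the enqueue/pull/run/finish cycle it describes looks conceptually like this deliberately simplified Python sketch (the commands and GPU count are hypothetical, and this is not SLURM or Kubeflow itself):

```python
# A toy version of the queue/worker cycle that a real scheduler such as SLURM or
# Kubeflow manages for you: enqueue a job, a worker pulls it, runs it pinned to a
# GPU, and marks it finished. The commands and GPU count below are hypothetical.
import os
import queue
import subprocess
import threading

job_queue: "queue.Queue[list[str]]" = queue.Queue()

def worker(gpu_id: int) -> None:
    """One worker per GPU: pull a job, run it on that GPU, mark it done."""
    while True:
        cmd = job_queue.get()  # blocks until a job is available
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
        subprocess.run(cmd, env=env)  # run the training command on this GPU
        job_queue.task_done()  # mark the job as finished

# Researchers enqueue work instead of grabbing GPUs directly.
job_queue.put(["python", "train.py", "--model", "resnet50"])
job_queue.put(["python", "train.py", "--model", "bert-large"])

for gpu in range(2):  # two workers for a hypothetical two-GPU server
    threading.Thread(target=worker, args=(gpu,), daemon=True).start()

job_queue.join()  # wait until every queued job has been completed
```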

Stephen Balaban (13:30): Another thing you can do, if your machine learning engineers or researchers aren't super comfortable always working in the command line, is set up separate Jupyter Notebook instances on the machines, and allocate each one of those to an individual. And again, this is all sort of your first draft of what really amounts to an MLOps platform. We'll go into that a little bit later, but before that, I want you to think about what the software stack is going to be at every single scale. So when you're small, I would say just use Lambda Stack. Lambda Stack is a single-line install. So on Ubuntu 20.04 LTS or 18.04 LTS, you just copy and paste this line in, and it will install your NVIDIA drivers, PyTorch, TensorFlow, CUDA, and cuDNN. Lambda Stack really just makes managing your deep learning environment so much simpler.
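
After an install like that, one quick way to confirm the environment works is a check along these lines (an illustrative snippet, not part of Lambda Stack itself):

```python
# Illustrative post-install sanity check: confirm the bundled frameworks see a GPU.
import torch
import tensorflow as tf

print("PyTorch", torch.__version__, "- CUDA available:", torch.cuda.is_available())
print("TensorFlow", tf.__version__, "- GPUs:", tf.config.list_physical_devices("GPU"))
```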

Stephen Balaban (14:43): And when you're a small company, and you're only a couple of people, you should just be using Lambda Stack on your laptops and on your workstations. It's something that you can install even if you don't buy hardware from Lambda. So the cool thing is that Lambda Stack provides that same environment anywhere you go. So if you install it on your TensorBook and on your Vector workstation, and if you spin up an instance in our GPU cloud, you'll have the exact same environment. So you can easily move models between machines, and it will just work.

Stephen Balaban (15:19): So as you grow, you're probably going to want to start using containers. Lambda Stack also supports the NVIDIA Container Toolkit. We have a hosted version of that, actually, and you can install Docker and the NVIDIA Container Toolkit, and run NGC containers or your own containers. We even have Lambda Stack containers that provide the same Lambda Stack environment inside the container. And finally, at a certain scale, you're going to want to use containers and consider using an MLOps platform. There are a lot of different MLOps platforms to choose from. There's Weights and Biases, Cnvrg, Determined, Run, Seldon, and many others. And that's where you're going to want to be in the long run.

Stephen Balaban (16:07): So as your team grows, you're going to want to move from just a couple of servers to a cluster. And as you start to move into building clusters, there are a few different considerations: network, power, colocation, where are you going to put these things? And you'll start to see you're going to be scaling from 1 Gbps Ethernet, to 10 Gbps SFP+, to maybe even 200 Gbps InfiniBand connections. And this is really a function of how much node-to-node communication you're going to need. And the end-game boss of the networking world are these things called director switches, which set up a spine-and-leaf topology all within one single chassis.

Stephen Balaban (17:07): And so this CS7520 is a 100 Gbps EDR InfiniBand director switch. It's got 216 ports available to use, and basically it's constructed of a chassis; some management modules, which run the subnet manager; spine switches, which are the spines in your spine-leaf topology; and leaf switches, which provide internal connections into the spines, and external connections to your nodes. And so this is providing 216 ports of non-blocking 100 Gbps InfiniBand. And this is the kind of thing that you're going to want to use if you're going to be doing distributed training at scale. This is an example of a network topology from the Lambda Echelon white paper, and you can see that it's not uncommon to have a couple of different network fabrics: a compute fabric for transferring gradients between nodes, a storage fabric for transferring data between your storage and your compute nodes, an in-band management network, and an out-of-band management network for IPMI.
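
To make "how much node-to-node communication" a bit more concrete, here's a rough back-of-envelope sketch; the 100M-parameter model size is a hypothetical example rather than a figure from the talk:

```python
# Back-of-envelope: time to move one full set of fp32 gradients between nodes.
# Real all-reduce implementations overlap communication with compute and often
# compress gradients, so read these numbers only as an order-of-magnitude argument.
PARAMS = 100_000_000  # hypothetical 100M-parameter model
GRADIENT_BYTES = PARAMS * 4  # fp32 gradients -> roughly 400 MB per exchange

for name, gbps in [("1 Gbps Ethernet", 1), ("10 Gbps SFP+", 10), ("200 Gbps InfiniBand", 200)]:
    seconds = GRADIENT_BYTES * 8 / (gbps * 1e9)  # bits transferred / link speed
    print(f"{name:>20}: {seconds:.3f} s per gradient exchange")
```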

Stephen Balaban (18:34): Again, going back to, well, what is a spine-leaf topology? Let's say you start with a single 40-port switch. You can support up to 40 different InfiniBand connections. And if you want to scale beyond that and run out of ports on that 40-port switch, let's say you want to go to 80 connections. Well, what you're going to need to do is set up this spine-leaf topology, where you have 80 external connections, and then internally you have 80 internal connections, where each of the leaves is connected to the two core spine switches. This provides, basically, non-blocking, full-bandwidth, port-to-port connectivity for all 80 of those connections, and this type of fat-tree or spine-leaf topology can be scaled out to very large networks.
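
As a rough illustration of how that port math scales, here's a simplified Python sketch. It assumes identical 40-port switches with half of each leaf's ports facing nodes, which reproduces the 80-connection, two-spine example above:

```python
import math

def two_tier_fat_tree(n_connections: int, ports_per_switch: int = 40):
    """Switch counts for a non-blocking two-tier spine-leaf (fat-tree) fabric.

    A simplified sketch: identical switches, with half of each leaf's ports
    facing nodes and half facing spines. Not a vendor design tool.
    """
    down_ports = ports_per_switch // 2  # leaf ports facing nodes
    leaves = math.ceil(n_connections / down_ports)  # enough leaves for all nodes
    uplinks = leaves * down_ports  # total leaf-to-spine links required
    spines = math.ceil(uplinks / ports_per_switch)
    return leaves, spines

# The 80-connection example from the talk: four leaves feeding two core spines.
print(two_tier_fat_tree(80))   # -> (4, 2)
# Scaling further, e.g. 400 InfiniBand connections out of 40-port switches:
print(two_tier_fat_tree(400))  # -> (20, 10)
```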

Stephen Balaban (19:40): So as you scale up, again, from individual servers to clusters, you're also going to want to look into your storage, and how your storage scales. There are tons of different options. At Lambda, I'd say we primarily work with the folks at WEKA, and we'll also frequently use something like FreeNAS. And when you're doing large-scale, InfiniBand-driven storage, the folks at [inaudible 00:20:14] have a really phenomenal storage platform for that. Finally, as you scale up and get into data-center-scale systems, and even when you're local in your office, you're going to run into power considerations.

Stephen Balaban (20:34): So there will come a point where you have too many servers in your office, and you are running up against the power limit that your office is able to provide, and you've got a couple of different options. You can either co-locate at that point in time, and put your servers into a data center, or you can upgrade your office's power. And if you are going to go the route of upgrading your office's power, which we've done plenty of times here at Lambda, you're really going to want to understand at least the basic math behind these systems. So as you may remember from elementary school physics, watts = volts x amps. It actually is a little bit more complicated for data center power, because oftentimes you'll be dealing with a three-phase system.

Stephen Balaban (21:27): And so for three-phase systems, you're actually calculating it across the three separate legs of the three-phase system. You do 3 x voltage x amperage, but then there's a factor that you need to divide by, which is √3, and which has to do with how far out of phase the three different legs are. And thankfully, 3/√3 simplifies to √3, so the formula can be simplified down to √3 x voltage x amperage. So you can calculate the total power of a three-phase system by doing √3 x voltage x amperage, and then derating it by 20%. So you multiply by the 0.8 regulatory derating factor, which is often applied to PDUs.
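
As a quick sanity check on that formula, here's a minimal Python sketch; the 208 V and 60 A inputs are just example values, matching the PDU example that follows:

```python
import math

def three_phase_capacity_va(volts: float, amps: float, derating: float = 0.8) -> float:
    """Derated three-phase capacity in volt-amps: sqrt(3) x volts x amps x derating."""
    return math.sqrt(3) * volts * amps * derating

# Example: a 208 V, 60 A three-phase circuit with the usual 80% derating factor.
print(round(three_phase_capacity_va(208, 60)))  # ~17293 VA, i.e. about 17.3 kVA
```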

Stephen Balaban (22:21): So how do PDU manufacturers calculate their power capacity? Well, they do exactly that formula. They'll derate the amperage, so you take the 60 amps x 0.8, and then you'll see that their maximum load capacity in volt-amps is √3 x 208 x 60 x 0.8, which gives you the 17,300 that you see listed by APC on this particular APC 8966 power distribution unit. Finally, I'm just going to go over some of the common plug types that you might see used in an HPC environment. So the blue one on the left is the IEC 60309, and that's a 60-amp, three-phase plug. It's pretty common to see. For example, this APC 8966 has that as an output.

Stephen Balaban (23:23): If you are dealing with 415-volt systems, you'll see this red IEC 60309. These IEC standards are an international standard, but some people will say it's a European standard, even though these are very commonly used in the United States as well. From the NEMA standard, which is the North American one, you'll often see the L15-30P, which is a 30-amp, three-phase plug. And then as far as the PDUs go, the PDUs will provide receptacles that look like these, for example, IEC C13 and IEC C19 receptacles. And you can see that basically this is the plug that's compatible with that particular type of receptacle. And these are the types of plugs that you will typically be dealing with when you're in a data center or co-located environment.

Stephen Balaban (24:23): And so if scaling out your office's power is not an option, then you should really look into colocation services. Lambda does provide colocation services, but so do a lot of other companies. And that's just a consideration that your team is going to want to make in terms of how hands-on you're going to want to be with your infrastructure. For a more comprehensive technical deep dive, you can check out our "Building a GPU Cluster for AI" video, which goes over our Lambda Echelon white paper and reference design. That's a more technical presentation that dives into larger-scale systems.

Stephen Balaban (25:15): So, some finishing thoughts. I've presented today a very incremental, easy-to-follow path going from just a couple of machine learning researchers to dozens of them, and from starting out with just laptops and workstations, to individual servers, then to clusters. And I think that that's probably the path that most people should follow. I don't think that people should just jump into building a multimillion-dollar cluster from day one. I think that you should start with incremental steps, and that's really in line with what we're trying to accomplish in the long term. In the long term at Lambda, I'd love to have a world where, when you're done training your network on your laptop and you've finished prototyping it, you can easily scale that out to an entire data center without even changing a single line of code.

Stephen Balaban (26:20): And I think that future is going to happen, and I'm looking forward to seeing a world where machine learning compute is treated more like a utility, where you plug it in and it just works, as opposed to the very complicated, hard-to-set-up infrastructure that we kind of have today. So if you've got any more questions, feel free to reach out to us at Lambda. We have contact information on our website, lambdalabs.com. You can also check out our YouTube channel, which has a bunch of great videos and tutorials on different topics related to machine learning infrastructure. I really hope that you've enjoyed this talk, and I look forward to interacting with you in the future. Take care.
