Panel: Building a Resilient MLOps Strategy Through Dataset Management

Posted Oct 06, 2021 | Views 2.9K

# TransformX 2021

speakers

Chun Jiang

Head of Product @ Unfolded.ai

Chun works on a large-scale geospatial analytics and location intelligence platform as Head of Product at Unfolded, now a part of Foursquare. Following her graduation from Cornell, Chun has worked across the intersection of mobility, autonomous driving, data visualization, and machine learning. She is an advocate for making AI explainable and for designing and building products focused on human-in-the-loop machine learning. As a proud alumna of Uber ATG, Uber, and Scale, she has had the opportunity to work among some of the brightest people and launch products like Uber's Michelangelo and Scale Nucleus.

Alessya (Labzhinova) Visnjic

CEO & Co-Founder @ WhyLabs

Alessya Visnjic is the CEO and co-founder of WhyLabs, the AI Observability company on a mission to build the interface between AI and human operators. Prior to WhyLabs, Alessya was a CTO-in-residence at the Allen Institute for AI (AI2), where she evaluated commercial potential for the latest advancements in AI research. Earlier in her career, Alessya spent 9 years at Amazon leading Machine Learning adoption and tooling efforts. She was a founding member of Amazon’s first ML research center in Berlin, Germany. Alessya is also the founder of Rsqrd AI, a global community of 1,000+ AI practitioners who are committed to making AI technology Robust & Responsible.

Adrian Macneil

Co-founder and CEO @ Foxglove

Adrian Macneil is the co-founder and CEO of Foxglove (foxglove.dev), providing powerful data management and visualization software for robotics. Prior to founding Foxglove, Adrian led Infrastructure and Developer Tools at Cruise. Adrian has a passion for improving developer productivity, and believes that optimizing the developer feedback loop is critical to unlocking rapid innovation in robotics. During his time at Cruise, Adrian faced firsthand the challenges of scaling a robotics team, especially in a vertical as data-intensive as autonomous vehicles. Data sits at the heart of robotics, and scaling the Cruise engineering organization necessitated development of many custom tools to manage, categorize, transform, label, and visualize data, both real-world and simulated. Foxglove offers a powerful solution for visualizing and collaborating on robotics data. It is optimized for the types of multimodal data common in robotics, seamlessly integrating images, point clouds, and robot state into a single interface. Foxglove’s open source visualization platform is web-based, accessible from anywhere, and easily extensible via custom plugins.

Ville Tuulos

Co-founder and CEO @ Outerbounds

Ville Tuulos has been developing tooling for machine learning in industry and academia for more than two decades. He is a co-founder and CEO of Outerbounds, a startup developing a human-centric platform for data science and machine learning. Prior to Outerbounds, he led the machine learning infrastructure team at Netflix where he started Metaflow, an open-source framework to support the full lifecycle of data science projects. He is also the author of an upcoming book, Effective Data Science Infrastructure, published by Manning.

Elliot Branson

Director of Machine Learning & Engineering @ Scale AI

Elliot Branson is the Director of AI and Engineering at Scale and leads the Machine Learning, Platform, Federal, 3D, and Mapping products. In his prior work, he helped create the Cruise Automation self-driving car and served as the first Head of Perception and AI. His interest in robotics and AI started with national and international robotics competitions in high school and continued in college and grad school where he published work on field robotics, localization, computer vision, and AI systems. His previous work includes stints on the Google Project Tango AR platform and Air Force MURI research programs.

SUMMARY

Dataset debugging, versioning, and augmentation are essential to building successful ML pipelines and models. Even with robust training and optimization of AI models deployed through well-defined CI/CD operational pipelines, high-quality data remains a critical part of the whole AI development process. Learn how different organizations collaborate to improve their datasets and debug errors in their data. See how doing so helps them unlock higher accuracies as well as other key benefits and efficiencies.

TRANSCRIPT

Elliot Branson (00:43): Hi, welcome everyone to our panel on dataset management and kind of why dataset management is essential to building successful ML operations. Without the automation of data quality, MLOps can fail even when the rest of your pipeline is really well thought out. We want to learn a little bit more about how different companies prioritize and improve their collaboration to improve their data set quality, debug model errors, and just build better models in general. I'd like to introduce our panelists today. So Alessya.

Alessya Visnjic (01:16): My name is Alessya. I'm the CEO and co-founder at WhyLabs, the AI observability company, and I'm really excited to be here to discuss data set quality.

Elliot Branson (01:26): Great to have you. Thank you so much. Chun, would you like to go?

Chun Jiang (01:30): Hey everyone. My name's Chun, currently head of product at Unfolded.ai. We just joined Foursquare, and we are a geospatial analytics and visualization company. Before this I was at Scale and before Scale I was at Uber ATG.

Elliot Branson (01:47): Thanks. Nice to see you again. Adrian, would you like to go?

Adrian Macneil (01:51): Hi, my name's Adrian Macneil, I'm the CEO and co-founder at Foxglove. We are a visualization and data management company for robotics and self-driving. And before that I was the director of infrastructure and develop [inaudible 00:02:07] Cruise.

Elliot Branson (02:11): Very nice and last but not least, Ville.

Ville Tuulos (02:14): Hey, my name is Ville Tuulos, I'm the CEO and co-founder at Outerbounds where we developed Metaflow, which is an open source project we started at Netflix, so prior to Outerbounds, I was leading machine learning infrastructure at Netflix.

Elliot Branson (02:27): Awesome. Thanks. Thanks so much everyone. And I guess to get started, one of the things that's really common across a lot of ML operations and different companies is working with a lot of really varied and different data types. So I wanted to kind of throw an initial question out there is how do you manage kind of the ingestion and working with data when your company can be working with computer vision and audio and text and LIDAR and all these different data types at the same time?

Adrian Macneil (02:56): Yeah, I think that's something that we see a lot of, and especially in the self-driving and robotics industry you have this kind of multimodal data. You're recording camera images, often LIDAR point clouds, and then other data that may be happening on device, often audio and radar. And I think it's important to look at how you are recording that data in the first place on the device, in such a way that you can easily align between camera images and point clouds, for example, and have a record of the exact timestamps at which those are coming in, and then have that flow through into your data storage and your data lake. So yeah, that's something that's very important. I can go into more detail if you want, but it's something that we found very important at Cruise and something we see a lot at Foxglove.

Elliot Branson (03:57): I guess one quick follow up is what's two different data types that are really hard to synchronize and work with at the same time or two that you have to have special care around?

Adrian Macneil (04:07): Yeah. I think it depends a lot on how people are approaching this problem. So we've noticed, for example, chatting to different companies, let's take LIDAR point clouds and images. There are different ways of approaching recording images. And we see some companies, especially ones that have come from a ROS background or are using ROS, the robotics framework, where often they will be recording image data as individual frames. And that makes it actually relatively easy, if you're recording individual frames of images and individual frames of point clouds and you have timestamps on those. You still need to deal with aligning the timestamps, but at least you have a complete image and a complete point cloud. However, what is common, especially again in the AV and robotics industry, is to record images as an actual video feed.

Adrian Macneil (05:04): So H.264 for example, and we've come across a lot of companies that are recording H.264 or some other video encoding on their device, and then coming back and realigning that can be quite challenging. For those that are unfamiliar, if you're using a video encoding like H.264, you don't have a perfect frame. Each individual frame is not recorded in full. The compression is using elements of previous frames. So you actually need past data from that stream to be able to even regenerate a frame at a single point in time. And then if that is getting recorded to literally a separate file on the device, then it's much more efficient space-wise, but it makes life very hard for you when you're coming along and trying to realign those with point clouds and you want to know other metadata, like the exact second that that particular frame was received.

Adrian Macneil (06:00): So what we've seen people do in that case is actually record a separate channel of information. Again, it depends whether you're saving things in ROS or in protobufs or something like that, but they're storing a separate data channel which is keeping timestamps and other information that's needed to draw a parallel between an exact frame in your video feed and the point cloud. So that's one, I guess, where I've seen it can be very easy or it can be very hard, but you have to make these trade-offs between how much space you're using recording-wise and matching it up later.
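A minimal sketch of the side-channel idea described here, assuming the side channel yields a sorted list of per-frame receive times. The function name, the `max_gap` tolerance, and the sample timestamps are hypothetical and not taken from any particular recording format.

```python
from bisect import bisect_left

def match_nearest(frame_times, cloud_times, max_gap):
    """For each point-cloud timestamp, find the index of the nearest video frame.

    frame_times: sorted frame receive times in seconds (from the side channel).
    cloud_times: point-cloud timestamps in seconds.
    max_gap:     largest acceptable time difference; beyond it there is no match.
    Returns a list of (cloud_index, frame_index or None).
    """
    matches = []
    for ci, t in enumerate(cloud_times):
        pos = bisect_left(frame_times, t)
        # Only the frames just before and just after t can be the nearest.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(frame_times)]
        best = min(candidates, key=lambda i: abs(frame_times[i] - t), default=None)
        if best is not None and abs(frame_times[best] - t) > max_gap:
            best = None
        matches.append((ci, best))
    return matches

frame_times = [0.000, 0.033, 0.066, 0.100]  # roughly 30 fps video
cloud_times = [0.010, 0.095, 0.500]         # the last cloud has no nearby frame
print(match_nearest(frame_times, cloud_times, max_gap=0.02))
```

Returning `None` for clouds with no frame inside the tolerance keeps gaps visible instead of silently pairing data that was recorded far apart.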

Elliot Branson (06:39): Awesome. And Alessya for WhyLabs you're working with a lot of different types of data to make it observable. How have you built the product to deal with that from the beginning?

Alessya Visnjic (06:48): One of the things that we do to accomplish that is focus on reducing dimensionality. So specifically what we capture from different data types is various summary statistics and metadata, and that allows us to kind of standardize across going from images to video to audio files. So for instance, the simplest idea is to capture metadata over time from all the devices that have been streaming or collecting data, and then watch that metadata as it evolves from day to day. That very simple thing would help you identify when you have new types of devices introduced, new resolutions introduced, or new video encoding formats introduced.
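As a rough illustration of watching metadata from day to day, the sketch below profiles one day's records as value counts and reports values that were never seen in a baseline day. The field names and sample records are assumptions made for the example, and this is not the WhyLabs API.

```python
from collections import Counter

FIELDS = ("device_model", "resolution", "codec")  # hypothetical metadata fields

def profile(records, fields=FIELDS):
    """Summarize one day's records as value counts per metadata field."""
    return {f: Counter(r[f] for r in records) for f in fields}

def new_values(baseline, today):
    """Return, per field, the values seen today that the baseline never saw."""
    report = {}
    for field, counts in today.items():
        unseen = sorted(set(counts) - set(baseline[field]))
        if unseen:
            report[field] = unseen
    return report

monday = [{"device_model": "cam-a", "resolution": "1920x1080", "codec": "h264"}] * 2
tuesday = monday + [{"device_model": "cam-b", "resolution": "3840x2160", "codec": "h264"}]
print(new_values(profile(monday), profile(tuesday)))
```

The profiles are small regardless of whether the underlying data is images, video, or audio, which is what makes this kind of comparison cheap to run every day.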

Elliot Branson (07:36): So with Metaflow, I know it's something that you guys have been building from the very beginning to scale up to basically any data type like a large scale company could want to handle. Is there anything special that you had to build in from the very beginning to handle that?

Ville Tuulos (07:49): Yeah, that's a good question. Obviously, Netflix has a lot of tabular data, a lot of structured data, as well as a lot of unstructured data, media in different formats. And definitely, I mean, the media encodings are an interesting question, like Adrian pointed out. Now it's an interesting question how much different data types can be unified. I mean, quite factually, we have had different data paths for different media types: we had a different data path for videos, images, and then tabular data. Now, metadata management is definitely something that you can unify. And we had somewhat consistent ways, although, I mean, not perfectly consistent ways, of handling metadata across different data modalities. So I don't know to which degree everything could be unified, but at least for the major categories you can define best practices for each one of them.

Elliot Branson (08:47): And so when people are building up storage and data access systems, are you focusing on trying to unify data at a raw, the-way-it's-recorded level, or is there value in lifting it and trying to do the unification at some embedded feature level or some compressed level? I think that Alessya mentioned working with sufficient statistics and trying to do the comparison at that level.

Chun Jiang (09:11): I think from the geospatial analytics perspective, usually people like using the timestamp as the core key to connect different data and different data formats. But one thing we heard a lot, working across the AV space and also [inaudible 00:09:25] space, is using geospatial location as the major key to connect all the information. So at least for our company, the way unification works is we assign a key to each geospatial location using our H3 indexing framework. And then we put the timestamp there. So in the future you can easily aggregate or join all the different data based on geolocation. So it's combining time and location that gives you a better sense of how the correlation across the space works.
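A small sketch of joining two datasets on a combined location and time key. To keep it dependency-free, a plain latitude/longitude grid stands in for H3 here, which is a deliberate simplification: a real H3 cell index would replace `cell_key`. The cell size, time bucket, field names, and sample records are all assumptions.

```python
import math
from collections import defaultdict

def cell_key(lat, lng, cell_deg=0.01):
    """Stand-in for a spatial index such as H3: snap a point to a grid cell."""
    return (math.floor(lat / cell_deg), math.floor(lng / cell_deg))

def space_time_key(lat, lng, ts, bucket_s=60, cell_deg=0.01):
    """Combine the grid cell with a time bucket to form a join key."""
    return cell_key(lat, lng, cell_deg) + (int(ts // bucket_s),)

def join_on_key(left, right):
    """Pair records from two datasets that share a cell and a time bucket."""
    index = defaultdict(list)
    for rec in right:
        index[space_time_key(rec["lat"], rec["lng"], rec["ts"])].append(rec)
    return [
        (l, r)
        for l in left
        for r in index.get(space_time_key(l["lat"], l["lng"], l["ts"]), [])
    ]

speeds = [{"lat": 37.7749, "lng": -122.4194, "ts": 125, "speed_mps": 8.0}]
weather = [
    {"lat": 37.7741, "lng": -122.4199, "ts": 170, "weather": "rain"},   # same cell, same 60 s bucket
    {"lat": 37.8049, "lng": -122.4194, "ts": 125, "weather": "clear"},  # different cell
]
joined = join_on_key(speeds, weather)
print(joined)
```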

Adrian Macneil (10:09): Yeah. I agree with that. One thing I would add is that I would draw a distinction for when you're first ingesting data. Again, maybe this is more specific to the AV industry or robotics, but when you're ingesting data, I would typically advocate for keeping it as close to the raw format as possible, ideally in some kind of very standardized container format where you're keeping track of timestamps and the channels or topics that the data came from. But one thing that's really important there is making sure that, in that raw form in your data lake or whatever you have, you're keeping the message definitions that are required to actually decode this data.

Adrian Macneil (11:01): So if you're using protobufs or something, it's important that these files are kind of self-contained and they've got the definitions that are required to decode that, because those tend to evolve over time. Same thing if you're using ROS or ROS messages, you want to keep that over time and then have that base to build on, and you can create libraries or abstractions that make it easy for people to access that raw data in jobs. And then bring that into a more structured form, because with the raw data, the schema tends to evolve a lot over time. So you then want to be able to have post-processing jobs, which are taking that and transforming it into more structured data in your data warehouse or as it moves along to a feature store and things. And that's when it'd be really good to be indexing on things such as geolocation and making it easy to find in a structured format that changes less frequently.

Elliot Branson (11:55): Yeah, I guess building off that point. A lot of people when they start out, they're recording their data to hard drives and walking around between the team handing out hard drives, or they get an S3 bucket that everyone shares and dumps data onto. I can see a lot of people have done this before. How should people think about how to store their data for discoverability, to understand what's inside of it, to make it easy for multiple different people to use it at once?

Adrian Macneil (12:21): Yeah, again this is something that we sort of offer as a product, but advice I would give to anyone building a system like this is again, make sure you're recording on device that you're recording along with any... Every kind of file that you're recording should ideally be self-contained with everything that's needed to decode it. So if you're recording in protobufs or ROS messages, assuming you've got those definitions, have a structured ingestion process. So it's fine if you're recording to a hard drive or something on device, some people record locally and then use a wifi offload, some people record locally and then literally sort of pop the hard drive and ship it somewhere or carry it somewhere. But make sure you build a structured ingestion process, which can take that data again, possibly transform it.

Adrian Macneil (13:11): For example, we recommend actually, on ingestion, splitting these files up per topic or per channel. Typically, especially in ROS-based systems, but quite often in robotics, you see people just recording all the topics into one file, which is great on device. It's simple, easy to reason about, but when you're storing it long term, you want to make it more efficient to access. So you want to be breaking those up based on different channels or topics, maybe breaking them up into smaller files based on timestamps. You can even go as far as keeping a secondary index on what data is contained in which file and where, and adding additional metadata like locations or certain test conditions and things like that. And then again, as I mentioned, build a library on top of that so that people can access that data. There are different ways of achieving that, whether that library is something that hits an API to request, give me a certain time range, certain topics, certain events that I'm interested in, or an even simpler thing is just having all of that in a SQL database and having people query that directly. But have a structured way for people to access that data and then be able to transform that into more interesting formats that are useful for training or labeling.
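The secondary index in a SQL database could look something like the sketch below, which uses an in-memory SQLite table. The table layout, file paths, and topic names are hypothetical; they only illustrate asking for one topic over one time range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (path TEXT, topic TEXT, start_ts REAL, end_ts REAL)"
)
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?, ?)",
    [
        ("drive42/camera_front/0000.bin", "/camera/front", 0.0, 60.0),
        ("drive42/camera_front/0001.bin", "/camera/front", 60.0, 120.0),
        ("drive42/lidar/0000.bin", "/lidar/points", 0.0, 60.0),
    ],
)

def find_chunks(conn, topic, start_ts, end_ts):
    """Return paths of chunks for a topic that overlap [start_ts, end_ts]."""
    rows = conn.execute(
        "SELECT path FROM chunks "
        "WHERE topic = ? AND start_ts <= ? AND end_ts >= ? "
        "ORDER BY start_ts",
        (topic, end_ts, start_ts),
    )
    return [row[0] for row in rows]

print(find_chunks(conn, "/camera/front", 50.0, 70.0))
```

Because the files are already split per topic and per time slice, a query like this touches only the chunks a job actually needs.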

Alessya Visnjic (14:26): I would add to that. A good practice that we see occasionally in organizations that have a lot of data is investing in some kind of data cataloging solution or approach. The implementation and the solution depend on what type of data you're working with and what type of problems you are solving. But in order to improve discoverability, it really helps if there is some kind of very structured and easily queryable location where you can learn what kind of data you have. It would be even better if you have some sense of the quality of that data, because oftentimes what's frustrating is you start processing and writing some kind of feature processors on the data. And then three weeks later you discover that there's just a schema that doesn't match and lots of missing records. And so you basically wasted three weeks. So what becomes important is keeping some kind of meaningful cataloging of all the data that you have.

Elliot Branson (15:31): Awesome.

Alessya Visnjic (15:32): I see everybody smiling, sounds like it resonated with everybody.

Adrian Macneil (15:35): Yeah, yeah, absolutely. I would even add to that, actually. To the extent that you have data pipelines which depend on some schema, and that schema is something that's happening on your device, especially again in the AV or robotics industry, but parallels probably apply to other industries, catch those regressions as early as possible. I'd even go so far as to say that if there are particular schemas or topics or things getting published that are going to potentially break a data pipeline, put those checks further upstream in your CI. Even in the build process for creating your primary software that's generating these messages, just put a check there and say, this schema has to be like this, otherwise fail the check. You can just very simply say, if any of these files change, fail your CI with a message saying, please come talk to us. It can be very, very simple, but move that as far upstream as possible so that you don't find out either by getting paged in the middle of the night because something broke, or three weeks later that it's been silently failing.
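One very simple way to implement the "if any of these files change, fail your CI" check is to pin a hash of each schema file and compare on every build. The sketch below assumes that approach; the function names, the `.proto` file, and the failure message are illustrative only.

```python
import hashlib
import tempfile
from pathlib import Path

def digest(path):
    """SHA-256 of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def changed_schemas(pinned):
    """Return pinned schema files that are missing or whose content changed."""
    return [
        path for path, expected in pinned.items()
        if not Path(path).exists() or digest(path) != expected
    ]

def ci_check(pinned):
    """Return a process exit code: 0 if nothing changed, 1 otherwise."""
    changed = changed_schemas(pinned)
    if changed:
        print("Pinned schema files changed: " + ", ".join(changed))
        print("Please come talk to the data pipeline team before merging.")
        return 1
    return 0

# Demo with a throwaway schema file (hypothetical name and content).
with tempfile.TemporaryDirectory() as tmp:
    schema = Path(tmp) / "pose.proto"
    schema.write_text("message Pose { double x = 1; }")
    pinned = {str(schema): digest(schema)}
    exit_before = ci_check(pinned)  # schema untouched
    schema.write_text("message Pose { float x = 1; }")
    exit_after = ci_check(pinned)   # schema edited
```

In a real pipeline the pinned hashes would live in a reviewed file, so updating them becomes the moment when the schema owner and the pipeline owner have that conversation.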

Chun Jiang (16:46): Specifically on device logging, I think one thing that's really important is always logging the device version, because every week the device keeps getting updated, as well as all the ML models and the software in general. So it's really hard to read a message across different versions. It takes a lot of effort from one person to think about, oh, maybe in the next 10 versions, can this message really carry across the entire time. And these are really hard to do, I guess, especially for a larger team. So always be logging what the device ID is, the device version, and at the same time who's the one recording it. And then, is this human behavior or is this car behavior? So just add all the metadata there, so when anything happens or when you want to go debugging, you can find exactly the thing that you want to focus on.

Alessya Visnjic (17:39): I see this come up quite frequently in medical device problems. Again, not recording the device oftentimes leads to surprises during deployment, because then you lose a little bit of control over, for example, the resolution of the x-rays that you're capturing. You deploy the model, and if you're not keeping track of the devices, debugging that kind of drift is really frustrating. And you brought up the interesting point that who is capturing the data also becomes very important. There's one organization that we worked with that was developing handheld ultrasound devices that would capture images, and the quality of the model really, really correlated with who was capturing. If the person was properly trained, the model would do really well. And if they weren't, the model would do very poorly. So if you're keeping track of that, that would make debugging a lot easier.

Ville Tuulos (18:42): And maybe one more boring thing that I would like to add is that if you think that you have any kind of issues with data governance or data privacy, also make those distinctions upfront. Because when you have 10 petabytes of data and you realize that actually you have to distinguish whether something is personal information or not, and you have to scan through everything again and re-categorize everything, that can be quite painful. So if you have any inkling of what might be more sensitive and what not, it definitely makes sense to make the distinction upfront.

Elliot Branson (19:15): Yeah. Makes sense. And how much of a role should machine learning itself play in people's ability to search or pivot or break apart the data that they're working with?

Alessya Visnjic (19:28): That's an interesting question. I can take a stab at that. I think the answer from my perspective would be, it really depends. If you can, especially for discoverability, I would start with something very stable and statistically simple. So if you're extracting some kind of metadata or statistics, it should be really easy to explain and really easy to reproduce. But of course doing some sophisticated kind of dimensionality reduction is absolutely a best practice, things like [inaudible 00:20:11] and UMAP, and then doing some kind of visual similarity, potentially for cataloging, might be interesting. But I would say always start with something very simple.

Chun Jiang (20:21): I remember one thing we were working on at [inaudible 00:20:30] was the similar image search, which was the biggest search of similar images. It's just a really easy way, when you have some dataset, to find more similar data from open datasets. At the same time, it's very efficient. But I guess back to the question of how much, it also very much depends. If you're answering the wrong question, I don't feel like having a larger scale of data is going to help you with that. But if you are solving a very specific problem, like Alessya said, then start small, and then using all the ML power to search and to help you speed up your process would be really smooth and just amazing.

Adrian Macneil (21:15): Another thing I would add, I guess, is that you shouldn't jump into ML discoverability until you've done some of the basics around just metadata that you could be cataloging that you already have. So an example from the self-driving industry would be, you have all of this data and you can be very easily annotating it with properties like, is it raining? Is it daytime or nighttime? Are you in an intersection? What kind of maneuvers are you currently making? Are there pedestrians around, are there cyclists around, are there pedestrians within five meters of the car or something like that? And you already have all of that information from the models that are running inference on the device, on the vehicle. So you can very easily be logging those and adding metadata to make your core data very discoverable, and you don't need to jump into ML discoverability around your raw data sets at least, because that would be a huge amount of data to be processing.

Elliot Branson (22:19): I guess to that point, how much should the way you're storing the data already know some of that derived metadata, for example, how many pedestrians are around the car? It's very easy for a [inaudible 00:22:31] since they're adding more and more and more derived features versus having the search and the [inaudible 00:22:36] for it to be able to compute that on the fly.

Adrian Macneil (22:38): Yeah. It's not super resource intensive if you calculate that kind of metadata on ingestion, because on ingestion you're typically already doing one loop through your entire data set, if you're transcoding it into a different format or even repacking it into different container formats and splitting files and things like that. So you're already doing a pass over your data. Just for the purpose of minimizing your compute costs, you certainly want to limit the number of passes that you're doing over data on ingestion. But if you come up with metadata that you agree is important, and log it in a structured way into a metadata store with references, it's pretty easy to keep annotations of metadata over particular segments. For any period of time, you can have either booleans or key-value pairs, like your number of pedestrians, or whether you're in an intersection, or an enum for weather or something like that.
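The segment annotations described here (booleans, key-value pairs, an enum for weather) could be modeled roughly as below. The `Segment` fields, the `Weather` values, and the `find_segments` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from enum import Enum

class Weather(Enum):
    CLEAR = "clear"
    RAIN = "rain"
    FOG = "fog"

@dataclass
class Segment:
    """Metadata annotation over one time range of one log."""
    log_id: str
    start_ts: float
    end_ts: float
    weather: Weather
    in_intersection: bool
    num_pedestrians: int
    extra: dict = field(default_factory=dict)  # free-form key-value pairs

def find_segments(segments, **conditions):
    """Return segments whose attributes equal every given condition."""
    return [
        s for s in segments
        if all(getattr(s, key) == value for key, value in conditions.items())
    ]

segments = [
    Segment("drive42", 0.0, 12.5, Weather.RAIN, True, 3),
    Segment("drive42", 12.5, 40.0, Weather.RAIN, False, 0),
    Segment("drive43", 0.0, 30.0, Weather.CLEAR, True, 1),
]
rainy_intersections = find_segments(segments, weather=Weather.RAIN, in_intersection=True)
print(rainy_intersections)
```

The records are tiny compared with the raw logs they point at, so they can be written during the single ingestion pass and searched without touching the raw data again.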

Elliot Branson (23:47): For designing Metaflow, one question I wanted to dive into is how much did you take into account more meta-learning models, where you're trying to do weak learning or model-in-the-loop training and storage and things like that?

Ville Tuulos (24:01): Yeah. Well, I mean, I don't think that we are super opinionated about it, so I think that they are definitely valid use cases. And if that's what makes sense for your use case, I mean, go for it. I don't think we did anything really special. So we were concerned about really the data layer, starting from the very fundamental questions, like data discovery, like how do you access data quickly? And of course overall, the versioning, the experiment tracking and so forth. And then these questions of how you actually want to use data tend to be quite project specific. So it really depends.

Elliot Branson (24:38): And what about the idea of bringing compute to the data versus bringing data to the compute? How important is that when people think about designing their machine learning systems?

Ville Tuulos (24:49): Yeah. Well, I can quickly start with that. So that's actually a really interesting question because I know that it used to be maybe 10 years back when everybody was using MapReduce. The kind of definitely the idea was that you bring the compute to data and that was the kind of the prevailing paradigm at the time. Now, if you think about the cloud, I think what we are seeing is that we take a bit of a hybrid approach that definitely you do want to take the compute pretty close to the data basically meaning something like the same region let's say in the cloud, in the AWS. But I mean then you definitely don't have to couple data and compute on the same box. And I think that that being able to decouple data and compute is actually really powerful. So you want to be close enough, but it doesn't have to be physically exactly on the same device on the same server. And I think that's a pretty powerful paradigm overall.

Elliot Branson (25:38): So kind of changing topics a little bit. I want to kind of get into some of the model debugging and kind of how people should think about using their own data when debugging ML models. So first kind of thing to dive into is when someone's sitting down and not getting the performance they want out of their model, where's the place they should spend their time? Is it looking at the model layer level itself? Is it looking kind of summary statistics like F1 and confusion matrices or is it diving down and looking at specific instances of the data itself?

Alessya Visnjic (26:08): I can jump in here maybe. So I would say the first question I would ask when you're trying to figure out what to start looking at is what changed? What do you suspect has changed in order to potentially cause the performance to change? So if you're running inference in production, there are various things that you can suspect first and rule out. So if your model hasn't changed, if your hyperparameters haven't changed, and if your feature pipeline hasn't changed, then I would start looking at the data, because likely you're experiencing some kind of concept or distribution drift. If you're experimenting and iterating on the pipeline or the model architecture, then the type of debugging that you'd want to do would be very different. Again, if you're iterating continuously on tuning and hyperparameters, I would start there with really careful cataloging, or if you're changing your data pipeline itself, your feature pipeline, debugging that becomes very important. So it really depends on what's staying constant and what's changing.
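One simple way to check a single numeric feature for distribution drift is the population stability index (PSI), sketched below in plain Python. PSI is a swapped-in example technique, not something the panel prescribes, and the bin count, the floor for empty bins, and the sample data are assumptions.

```python
import math

def psi(expected, actual, bins=10):
    """Population stability index of `actual` against the `expected` baseline.

    Bins are equal-width over the range of the baseline. The result is 0 when
    both samples fill the bins identically and grows as they diverge.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

training = [i / 100 for i in range(100)]           # roughly uniform on [0, 1)
production = [0.5 + i / 200 for i in range(100)]   # shifted toward the upper half
print(psi(training, training), psi(training, production))
```

Any alerting threshold on a score like this needs tuning per feature, so it is better treated as a prompt to look at the data than as a verdict.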

Adrian Macneil (27:29): Something that might sound obvious. But especially again in an AV context or robotics, we have multiple models that are all interacting and you're shipping those to production. Often, getting enough metrics to be statistically significant immediately, before you even land a model in production or merge a PR or something, getting metrics to prove that it has actually made an effect can be really difficult. You might need to be driving thousands of miles or something. So yeah, I would say that it's really important to make sure you actually have a good solid process for how you're validating and landing changes to models, individual models and the interaction between those different things, before you are even getting things into production, ideally with simulation and things like that that are confidently predicting on-road driving performance before you merge it. Because otherwise, you'll find yourself in a position where you're complaining that it's a month later and some metrics have degraded and meanwhile, half a dozen models have changed. And it's even a bit premature to start looking around, like, is there a problem with my data set or something. It might not even necessarily be one model's fault or the other model's fault if you have the interaction of several models causing behavior.

Chun Jiang (28:52): Debugging flows are always very fun to design because everyone's debugging workflow is just so different. And the way I always think about it is model-centric debugging versus data-centric debugging. So at ATG, we had this data-centric debugging, where we visualized all the logging of human events and car behaviors. You could also see how the captured camera images looked different from the real raw data log, and you could derive guesses about what might be wrong there. So that's one way of bringing a lot of visualization forward to come up with a good solution for debugging with simulation imaging or debugging with geospatial analytics based on different data formats. That's for data debugging.

Chun Jiang (29:45): And then at the same time for model debugging, one of the open source tools I was working on is called Manifold. So it's more of a model performance-level debugging, but at the same time, feature-level. So if you figure out, oh, my model A is performing better than B, I want to look at B more and then segment the features and the data set into smaller and smaller batches to find which feature might contribute to this set of the [inaudible 00:30:15] performance there. So these are the two aspects when I think about debugging tools. But I don't know, I think in an ideal world, everyone should be able to jump easily between different approaches, and I hope we can have a unified way or a better playground for all the ML engineers to easily debug with their different approaches.
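
The segment-and-compare idea can be sketched in a few lines. This is not Manifold's API; it is a hedged illustration that assumes each evaluation row is a dict carrying a per-example error for two models under the hypothetical keys `err_a` and `err_b`, plus the feature values to slice on.

```python
from collections import defaultdict
from statistics import mean


def error_by_segment(rows, feature):
    """Group rows by a feature's value; return {segment: (mean_err_a, mean_err_b)}."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[feature]].append(row)
    return {
        segment: (
            mean(r["err_a"] for r in group),
            mean(r["err_b"] for r in group),
        )
        for segment, group in buckets.items()
    }


def worst_segments_for_b(rows, feature):
    """Order segments by how much worse model B is than model A, largest gap first."""
    summary = error_by_segment(rows, feature)
    return sorted(summary, key=lambda s: summary[s][1] - summary[s][0], reverse=True)
```

Running `worst_segments_for_b` once per candidate feature gives a rough ranking of where model B falls behind, which is the starting point for slicing into smaller batches.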

Ville Tuulos (30:37): Yeah. And maybe to kind of plus-one what Alessya was saying before about looking at the deltas and changes over time, and with that, really understanding all the sources of change that can leak into your models and pipelines. So if you don't have really well-defined processes and infrastructure, it's easy to have really surprising elements. I mean, starting with the fact that, let's say, your libraries change and suddenly somebody updated your version of the optimizer and you are getting different results. And technically the models are the same. The data is the same. But someone just changed the optimizer, and maybe you didn't notice that, and suddenly the results change. So you want a very controlled setup where you can control everything in the environment. And then you know that, okay, now the only thing that could have possibly changed is the data, or the only thing that could have possibly changed is the model. I mean, that makes things much more sane than trying to fly blind.
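
One way to notice the silently updated optimizer Ville mentions is to record the library versions alongside every run and compare them later. The following is a minimal sketch using only the Python standard library; the snapshot layout and helper names are assumptions made for illustration, not how Metaflow or any other tool does it.

```python
import hashlib
import json
import platform
from importlib import metadata


def environment_snapshot(packages):
    """Record the interpreter version and the installed version of each named package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {"python": platform.python_version(), "packages": versions}


def fingerprint(snapshot):
    """Stable hash of a snapshot, cheap to store next to each training run."""
    payload = json.dumps(snapshot, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def changed_packages(old, new):
    """Names of packages whose recorded version differs between two snapshots."""
    names = set(old["packages"]) | set(new["packages"])
    return sorted(
        n for n in names if old["packages"].get(n) != new["packages"].get(n)
    )
```

Comparing fingerprints answers "did the environment change at all?", and `changed_packages` narrows it down to the specific library.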

Adrian Macneil (31:30): Yeah. Lineage is critical there, I guess. You really need to be able to say, this sensor data that we recorded just driving on the road or whatever it is led to this extraction that happened, went into this labeling, went into this feature extraction job, went into this training job. And then this is where that model came from. Otherwise, hey, we've got this model running in production and no one can explain it. Whether it's better or worse doesn't matter if no one can say what data that model came from.
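
The chain Adrian walks through (recording, extraction, labeling, feature extraction, training, model) can be represented as a simple parent map. This is a minimal sketch, and every artifact name in the usage example is made up for illustration.

```python
def trace_lineage(artifact, parents):
    """Return every upstream artifact of `artifact`, nearest ancestors first.

    `parents` maps an artifact ID to the list of artifact IDs it was built from.
    """
    seen, order, stack = set(), [], [artifact]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                stack.append(parent)
    return order


# Hypothetical lineage for one production model.
PARENTS = {
    "model:v7": ["training-job:881"],
    "training-job:881": ["features:2021-09-30"],
    "features:2021-09-30": ["labels:batch-12"],
    "labels:batch-12": ["extraction:run-45"],
    "extraction:run-45": ["recording:drive-0193"],
}
```

With such a map in place, `trace_lineage("model:v7", PARENTS)` answers the question of what data a production model came from without anyone having to reconstruct it by hand.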

Elliot Branson (32:03): I like this idea that every single ML engineer's workflow is a little bit different. And so one question I'd love to ask is what piece of the ML workflow is kind of missing or really underserved right now in terms of working with your data or understanding how machine learning models work.

Chun Jiang (32:20): I think that there are two gaps there, from talking with the different stakeholders. First, the way ML engineers come up with metrics is very different from the business in general. So for example, if you want to recommend that someone, I don't know, purchase a product, then the ML engineers' metrics might be very, very specific and technical, but the business side will be like, oh, I just want this person to purchase this product in two seconds. So there's a gap between coming up with good ML metrics and business metrics. And I don't know, I feel like there should be an individual voice there, or a third party, combining them and saying, oh, if I hit this ML metric, I have an 80% chance of hitting that business metric.

Chun Jiang (33:10): I'm still trying to figure this out. But I think this is just a very interesting thing to think about. I think another gap, from working in different companies: I always find data platforms also have a kind of gap with ML platforms. With the ML platform or the ML products that we are trying to build, we always think about the ML workflow. And then once you start debugging the data or seeing what can be wrong across the data world, you just want to automate getting the data you need to debug your tools and debug your models. So what's the best way of using data as the core to design a flow across data platform teams and ML platform teams, and have all the tools talk really well with each other?

Chun Jiang (34:00): And at the same time with operations, what is the best way for the operations team to say, oh, I think there might be something wrong, or your model is not performing well on this segment of the street in the city? How should this message be recorded and then interpreted by the ML engineers? This is another gap there. I think ideally there should be a tool or a more opinionated product to solve these questions. And I definitely believe ML engineers or MLOps or data scientists will be a really strong piece in filling these gaps.

Alessya Visnjic (34:36): Yeah, absolutely agree with everything that Chun said. And I would anecdotally say, from talking to a really large number of machine learning engineers, one of the hardest questions that they need to answer every day, and it's always a surprise to me that this remains one of the hardest questions, is: what was the distribution of my data yesterday? Or what was the distribution of my data last week? Is it different from the distribution of my training data? Fundamentally, for machine learning systems, it's such an important question to answer, but I think there's a really big tooling gap because, A, we're dealing with really large volumes of data, and oftentimes the data that went through inference doesn't get persisted in any way. And if you are trying to answer the question of what happened yesterday, what was the distribution of my features, if you're doing this data-centric debugging approach, oftentimes what you'll have to do is replay the entire thing, reproduce the entire thing for that day in order to answer that question. Another complexity that comes into play is if the data that your system is processing is highly confidential and your operator, a machine learning engineer who's just on the ops team, doesn't have access to that highly confidential data. For them, answering that question then becomes a wild goose chase of getting access and permissions.

Alessya Visnjic (36:03): So I think the real big gap is figuring out how to maybe get inspired by some of the sophisticated DevOps systems that keep track of a lot of artifacts and metadata that capture the health of the system and the state of the system at any given point. So you can easily go back and get at least a vague understanding of what was happening yesterday or last hour or last week.
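
As a rough illustration of keeping lightweight metadata instead of raw data, the sketch below stores only per-bin counts for a numeric feature and compares two such profiles with the population stability index (PSI). The bin edges, the profile layout, and the choice of PSI are assumptions made for the example; this is not the API of any particular logging library.

```python
import math
from bisect import bisect_right


def profile(values, edges):
    """Summarize a numeric feature as counts per bin; the raw values are not kept.

    With sorted `edges` [e0, e1, ...], bin 0 holds v < e0, bin 1 holds
    e0 <= v < e1, and the last bin holds everything at or above the final edge.
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return {"count": len(values), "bins": counts}


def psi(expected, actual, eps=1e-6):
    """Population stability index between two non-empty profiles with the same bins."""
    total = 0.0
    for e, a in zip(expected["bins"], actual["bins"]):
        p = max(e / expected["count"], eps)  # eps avoids log(0) for empty bins
        q = max(a / actual["count"], eps)
        total += (q - p) * math.log(q / p)
    return total
```

A profile like this is a few integers per feature per day, so it can be persisted even when the raw inference data cannot, and it contains no individual records for an operator to need special access to. Comparing yesterday's profile with the training profile then gives at least a coarse answer to "was the distribution different?".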

Ville Tuulos (36:30): Yeah, no, I mean, definitely. And then that's what we have been trying to promote with Metaflow as well. And I think that's super powerful, but overall, to answer the question that, where are the gaps? I think that there are gaps everywhere. So I think always the kind of the way I've been thinking about the question is that, is there something different in your process that you would do differently if you let's say had only 10 megabytes of data? And if you feel that the process would be very different or you could be doing so much more, if you had only 10 megabytes of problem data, then it clearly signals that it's kind of a scale problem. And it's a tooling problem. And quite often, for instance, the iterations are just way too slow and the fact that you can't compare different models, you can't reproduce things easily because things are just way too slow or cumbersome. I mean, that's definitely a signal that there are gaps in the tooling and in the infrastructure and that's hopefully something that we can technically fix over time.

Elliot Branson (37:23): Yeah, Alessya, I know that you and your team have spent a lot of time working on proactive alerts around machine learning in production. What are your thoughts on the differences between production ML debugging versus training ML debugging?

Alessya Visnjic (37:42): I would say there are a lot of similarities, and ideally it's the same pattern in your training environment and your production environment for feature extraction or for model serving. The more similar your tools are across these two paradigms, the better, because the more familiar you would be, and you would be forced to use the same techniques for debugging training and production pipelines. I think the big difference is that in production your data itself is constantly changing, just by the nature of it. So recognizing how that affects your systems is really hard, and oftentimes that's what the majority of the debugging is focused on. So, as Adrian was mentioning, if you are recording stuff on devices and you can't persist all of the raw data that you have seen, recording as much information as possible about this data in your production inference pipelines becomes very important, and very, very difficult to do. There's this shift: you're probably starting with a lot of data for training, and you have to scale out your system to accommodate a lot of training data, because the more data you have, the more accurate your model is.

Alessya Visnjic (39:15): However, after some amount of time that your system has been in production, I think that shifts and you are all of a sudden sitting on potentially months or years of the raw data that you have been processing and at that point, if you haven't been carefully storing it and taking good care of how it's cataloged, then you're sitting on some cold storage with terabytes of data and you don't even know what to do with it and how to use it for any good purpose. So I think thinking about kind of those two paradigms becomes very important.

Elliot Branson (39:49): And the thing that's unique about machine learning is the data distribution drift: the problem domain that your model's running on is slowly evolving over time as different people use it. How should your debugging and ML monitoring systems take that into account?

Alessya Visnjic (40:09): Well, I think one of the interesting aspects about machine learning debugging that doesn't come into play in traditional software debugging is that it often becomes important to involve an SME, a subject matter expert who maybe understands how the problem that you're solving is solved without machine learning. I think in medical devices, this becomes very important because as a machine learning engineer or machine learning scientist, you're building your model, but you maybe have seen an x-ray twice if you liked snowboarding and ended up in the ER unfortunately, but otherwise you don't understand what x-rays are or what they are capturing. So when you're debugging, it's important to make the tools accessible and understandable to non-technical SMEs who can come in and help you understand what you are missing, and make sure that your model is performing according to your expectations. And I think that's probably one of the trickiest aspects of machine learning debugging: making these tools for surfacing how the model is performing, and what exactly it is doing on particular examples, accessible to subject matter experts who potentially do not understand how the model works but do need to inject their knowledge into debugging.

Elliot Branson (41:32): Thanks everyone for taking time to chat. I just want to give everyone a chance to kind of give any quick closing thoughts or closing takeaways. Chun, do you want to start off?

Chun Jiang (41:44): I always have a [inaudible 00:41:45] of pushing all our ML platforms and tools to a kind of consumer-facing level. And that's the thing I'm really passionate about. And we talked about different ownership, stakeholders for collaboration, the gap between data and ML platforms, and the gap between training and production-level debugging. So yeah, I'm really excited to hear everyone's thoughts here, and I think we have so much to do. Pushing our tools to the next level to make them more accessible and more explainable will be the thing we keep working on.

Elliot Branson (42:21): Alessya?

Alessya Visnjic (42:26): I think it's very exciting that there's this new focus in our community on the tools that would make our systems more robust and more reliable. I'm really excited to be hearing these data-centric approaches, because I think that whole area of thinking has been really underserved. So I really loved the conversation today. And one thing, maybe a little bit self-serving: we talked quite a bit about logging, and one of the things that my team at WhyLabs has put out as open source is a logging library that deals with different data modalities in a really lightweight manner, focusing just on extracting statistics. And what I find exciting is having collaborators from different fields of machine learning chipping in and giving feedback, especially from the field that deals with on-edge or on-device learning. So we'll be really excited to have more of the community come in and give feedback on the approaches that we're taking with the library. It's called WhyLogs and it's available on GitHub.

Elliot Branson (43:35): Awesome, Adrian?

Adrian Macneil (43:38): Yeah, definitely really enjoyed the discussion, and it was great to hear different perspectives. I think we talked a lot about the recording and ingestion of data. That's something that we're really excited about at Foxglove. The other thing that I would emphasize is that the tooling is so critical to get right to make ML engineers productive. I've seen time after time people just struggling to find data that's relevant to them, struggling to get it into the right place. If you don't have your data set up, indexed, and cataloged in the right places, then people not only spend a huge amount of time writing their own custom jobs to go and find this data.

Adrian Macneil (44:24): But they also can waste a whole lot of money on compute very easily if you haven't properly de-normalized the structure of your data, if people need to go through terabytes or petabytes of data and cross-reference different pieces of information to even find a bunch of samples in some geolocation or some time range or some particular set of events that were happening. So yeah, again, I would just emphasize how important it is to have a data structure and to make the tools really accessible to people, so that ML engineers and ML scientists can focus on building high-quality models and debugging high-quality models without writing their own software to extract data.
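
To make the indexing point concrete, here is a minimal sketch of a catalog keyed by a coarse location cell and an hour bucket, so that a lookup touches only the relevant buckets instead of scanning every sample. The grid size, the sample fields (`id`, `lat`, `lon`, `timestamp` in Unix seconds), and the function names are all assumptions made for illustration.

```python
import math
from collections import defaultdict

CELL_DEG = 0.01  # hypothetical grid size, roughly 1 km of latitude
HOUR = 3600


def cell_of(lat, lon):
    """Map a coordinate to a coarse grid cell."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def build_index(samples):
    """Index sample IDs by (grid cell, hour bucket)."""
    index = defaultdict(list)
    for s in samples:
        key = (cell_of(s["lat"], s["lon"]), s["timestamp"] // HOUR)
        index[key].append(s["id"])
    return index


def find(index, lat, lon, start_ts, end_ts):
    """IDs of samples in the same cell whose hour bucket overlaps [start_ts, end_ts].

    The result is coarse: it is bucket-level, so callers still filter by the
    exact timestamp and position if they need precision.
    """
    cell = cell_of(lat, lon)
    ids = []
    for hour in range(start_ts // HOUR, end_ts // HOUR + 1):
        ids.extend(index.get((cell, hour), []))
    return ids
```

The index is built once at ingestion time; after that, finding samples for a location and time range costs a handful of dictionary lookups rather than a custom job over the whole data set.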

Elliot Branson (45:12): Thanks. And last one is Ville.

Ville Tuulos (45:18): Well, I mean, yeah, definitely plus-one-ing what Adrian was just saying about the importance of having the tooling. Well, of course, I mean, that's what we are developing, and we believe that that's very important. I think the couple of main things that we believe in is definitely having these stable environments. I think that this is the very nature of doing any kind of data-centric development, data-centric programming: that you can really control the environment. You can have a very structured, good process for running these different experiments and also making sure that those iterations are fast enough. So you have the compute layer, you have the compute capacity, you have a data layer that can supply data fast enough, so that you are maximally productive as a data scientist or machine learning engineer when going through these iterations.

Ville Tuulos (46:04): And then the other part is definitely the production. And when you are in production, you have the monitoring best practices. And also what Alessya was saying before about being able to monitor the deltas over time. I mean, it seems like table stakes, but that goes surprisingly far. But at the same time, I know that today, of course, we are still missing much of the tooling, so it's easy to say that these things are important; actually making it work in practice still takes quite a bit of work. But yeah, at least it seems that we roughly know what the right ingredients are.

Elliot Branson (46:35): Awesome. Well, I think that's probably about a wrap, but I want to thank everyone so much for joining us and thank all of our speakers for coming in and sharing their perspectives and some of the experiences that they've worked on in the past time.

Adrian Macneil (46:47): Thanks a lot.

Alessya Visnjic (46:48): Thank you.

Chun Jiang (46:48): Thank you.

Ville Tuulos (46:48): Thanks.
