At TransformX, we brought together a community of leaders, visionaries, practitioners, and researchers across industries to explore the shift from research to reality within Artificial Intelligence (AI) and Machine Learning (ML).
In this session, Clement describes the pervasiveness and capabilities of Transformers. He explores how we might ensure they are used in an ethical and transparent manner.
Clement Delangue is the CEO of Hugging Face, a 100,000+ member community using open source libraries for natural language processing.
Transformers power many digital experiences that we often take for granted, for example Google search query predictions and the translation capabilities we see on many social media platforms. Most recently, GitHub’s Copilot was launched to help automate the writing of software code.
In 2017, the seminal paper on Transformers entitled ‘Attention Is All You Need’ was released. It proposed the use of attention mechanisms instead of recurrent or convolutional ones. In 2018 a team from Google published ‘BERT’ and released it as open-source.
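To make the idea of an attention mechanism concrete, here is a minimal sketch of scaled dot-product attention, the core operation described in that paper. It is written in plain Python for readability rather than speed. The function names and the toy vectors are illustrative assumptions and are not taken from the talk.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy self-attention: three 2-dimensional token vectors attend to each other.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = attention(tokens, tokens, tokens)
```

Each output vector is a blend of all the input vectors, weighted by how similar they are to the query. This is what lets every position in a sequence look at every other position in a single step, without recurrence or convolution.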
The following year saw the release of Hugging Face Transformers, arguably one of the fastest-growing open-source projects of the past three years. This growth was made possible by three major trends:
Hardware has become more powerful and cheaper
Large datasets (often generated from scraping the web) have become available
Transfer learning has made it possible to reuse what a model has already learned
In Machine Learning (ML), models originally learn weights and biases from training on an initial dataset that is normally fairly general for a particular problem domain, e.g., object detection.
In transfer learning, we optimize to train for particular (often niche or highly-specific) use-cases by reusing the previously learned weights and performing minimal (re)training on the new task.
Think of transfer learning as teaching someone to drive a fork-lift truck who already knows how to drive a car. There are some knowledge representations that were needed for the initial task (car driving) that can be reused for the niche task (fork-lift driving).
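Transformers learn through ‘sequential transfer learning’, which normally consists of the following steps.
1. Pre-training a base model on a very large corpus, usually a slice of the web. This can take weeks of training and millions of dollars of compute.
2. Fine-tuning/Adaptation on a small dataset for your own problem domain. This is often far cheaper than initial pre-training.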
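The two steps above can be sketched with a deliberately tiny example. This is not a Transformer: it assumes a one-variable linear model trained by plain gradient descent, and the datasets, step counts, and function names are all made up for illustration. The point is only to show the pattern of reusing pre-trained weights as the starting point for a small, cheap second round of training.

```python
def train(weights, data, steps, lr=0.01):
    """Plain gradient descent on mean squared error for y = w * x + b."""
    w, b = weights
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

def mse(weights, data):
    """Mean squared error of the model on a dataset."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

# Step 1: pre-training on a large, general dataset (expensive, done once).
# Hypothetical source task: y = 2x + 1.
source_data = [(x / 10, 2.0 * (x / 10) + 1.0) for x in range(-50, 51)]
pretrained = train((0.0, 0.0), source_data, steps=2000)

# Step 2: fine-tuning on a small dataset from a related task (cheap).
# Hypothetical target task: y = 2x + 1.5.
target_data = [(-1.0, -0.5), (0.0, 1.5), (1.0, 3.5), (2.0, 5.5)]
fine_tuned = train(pretrained, target_data, steps=20)

# Baseline: the same small training budget, but starting from scratch.
from_scratch = train((0.0, 0.0), target_data, steps=20)
```

Comparing mse(fine_tuned, target_data) with mse(from_scratch, target_data) shows the effect in this toy setup: with the same 20 steps on the small dataset, the model that starts from pre-trained weights ends up with a much lower error than the one that starts from zero.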
Transformers have pushed the boundaries of modern machine learning. This has become apparent in multiple ways, one of which is greater prediction accuracy across a variety of use-cases.
Consider the task of Question Answering on SQuAD 1.1, where, given a question and a context, a model should find an answer within that context. For example:
Question: What types of engines are steam engines?
Ground Truth Answers: external combustion, external combustion engines
In just six months, the accuracy of Transformers on SQuAD 1.1 went from 70% to 90%. This caused researchers to ask themselves whether SQuAD 1.1 was too easy for Transformers. So, they developed a harder dataset, SQuAD 2. However, in another six months, Transformer accuracy on this problem again went from ~70% to ~90%.
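In the talk, these figures are described as exact match scores. As a rough sketch of how such a score might be computed, the snippet below normalizes a predicted answer and checks whether it equals any of the ground truth answers. The normalization steps (lowercasing, dropping punctuation and English articles, collapsing whitespace) are an assumption based on common practice for this kind of benchmark, and the function names are illustrative.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truths):
    """Return 1 if the prediction matches any reference answer, else 0."""
    normalized = normalize_answer(prediction)
    return int(any(normalized == normalize_answer(gt) for gt in ground_truths))

def exact_match_score(predictions, references):
    """Percentage of questions answered with an exact match."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)

# Ground truth answers from the steam engine example above.
ground_truths = ["external combustion", "external combustion engines"]
```

With this sketch, a prediction such as "External combustion." counts as a match for the steam engine question, while "internal combustion" does not. A score of 90 would mean that nine out of ten predictions matched one of their reference answers after normalization.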
Transformers were the first to surpass human performance on GLUE, a collection of tasks for training, evaluating, and analyzing Natural Language Processing (NLP) models. Transformers also perform well on non-NLP tasks like computer vision, chemistry, and many others. Transformers also perform well in multi-modal tasks that require the understanding of different forms of information, like sound, images, or text. For example, DALL·E takes a text prompt and uses it to produce graphical images.
Clement explored how these models can become biased simply through the selection of training data and hyperparameters.
When BERT is used for next-word prediction and given the prompt ‘this man works as…’ it might predict the next possible word in the sentence as, ‘lawyer, doctor, mechanic’. However, Clement shows that when given the prompt ‘this woman works as…’, the same model might predict ‘nurse, teacher’. This appears to indicate a possible gender bias which could be a reflection of the nature and distribution of the training data used to train that Transformer.
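A probe like this could be reproduced along the following lines. This is a sketch under stated assumptions: it assumes the Hugging Face transformers library is installed and uses its fill-mask pipeline with the bert-base-uncased checkpoint, which may not be the exact model or prompt format shown in the talk. The helper names are hypothetical. The two word lists are the predictions quoted in the talk, so the comparison at the end runs without downloading any model.

```python
def top_predictions(prompt, top_k=5):
    """Ask a masked language model for its top guesses at the [MASK] position.

    Assumes the `transformers` library is installed and that the
    `bert-base-uncased` checkpoint can be downloaded from the model hub.
    """
    from transformers import pipeline  # imported lazily: heavy optional dependency

    unmasker = pipeline("fill-mask", model="bert-base-uncased")
    return [result["token_str"] for result in unmasker(prompt, top_k=top_k)]

def shared_fraction(first, second):
    """Fraction of distinct predictions that appear in both lists (Jaccard overlap)."""
    a, b = set(first), set(second)
    return len(a & b) / len(a | b) if a | b else 1.0

# Predictions quoted in the talk for the two prompts.
man = ["lawyer", "printer", "doctor", "waiter", "mechanic"]
woman = ["nurse", "waitress", "teacher", "maid", "prostitute"]
overlap = shared_fraction(man, woman)
```

To probe a live model, one could call top_predictions("this man works as a [MASK].") and top_predictions("this woman works as a [MASK].") and compare the two lists in the same way. An overlap close to zero, as with the lists quoted in the talk, suggests the model associates very different occupations with each gender.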
To help address possible sources of bias, Meg Mitchell joined Hugging Face to create tools for improving AI fairness. Hugging Face also promotes Model Cards as a standardized way to report potential limitations and biases in models. Model Cards are important because the frank disclosure of potential bias helps ML practitioners make pragmatic and informed choices as to how they might operationalize that model in a responsible way. Hugging Face has incentivized contributors to add model cards, resulting in 5,000 model cards and datasets added to the Hugging Face model hub.
Another approach to increase the ethical performance of Transformer models involves democratizing information by developing multilingual transformers. The English language is dominant in our world today. For example, a large proportion of research papers are published in English. To address this, Hugging Face has organized a large community sprint with over 500 scientists across the world to bring Transformer benefits to other languages. This effort especially includes ‘low resource languages’ that have relatively small amounts of suitable training data available, if any at all. This has allowed the creation of speech and translation models in over 100 languages. The utility and power of transformer models are clear, but, as the saying goes, ‘With great power comes great responsibility’.
Clement Delangue is co-founder and CEO of Hugging Face, a 100,000+ member community advancing and democratizing AI through open source and open science. Clement started his career in product at Moodstocks, a machine learning startup for computer vision (acquired by Google in 2016). He is passionate about building AI products.
Next up, we're excited to welcome Clement Delangue. Clement is the co-founder and CEO of Hugging Face, a 100,000 plus member community advancing and democratizing AI through open source and open science. Hugging Face has more than a thousand companies using their library in production. Clement started his career in product at Moodstocks, a machine learning startup for computer vision, which was acquired by Google. He is passionate about building AI products. Clement, over to you.
Hi everyone. Super happy to be here for the Scale TransformX conference. I'm going to start this keynote with some of the latest important mainstream features that you might have noticed in some of the most popular products out there. And you'll see, afterwards I'll tell you what the common denominator is between all of them.
The first one: you might have noticed, first, that Google got better and better in the last few months, and second, that they're starting to release new features. Like, for example, the ability to get an answer to your question right away on the results page. You know, instead of showing you a list of links where you can see some answers, they are finding one answer that they think is the right one and showing it to you right away, before all the results.
Something else that you might have noticed is that when you're writing an email on Gmail, or when you're writing on Google Docs or Microsoft Word, you're getting these very personalized auto-completes now. It's not just like what you had on your phone, just auto-completing one word and the same thing for everyone. It's taking into account what you wrote before, how you've been writing, and some of your information to suggest some very good and appropriate text for you to write much faster.
Speaking of text, you might have seen that translation is now present on most social networks and platforms. It's super important for me, because I'm French, as you might have heard from my accent. It's starting to translate automatically on Facebook, on Twitter, on LinkedIn, on most platforms, in hundreds of languages. Last, you might have heard of something called GitHub Copilot. Obviously GitHub is the leading software engineering platform. Now, what it introduced with Copilot is a way for developers to get an auto-complete of their code, a little bit like what you have in text editors.
You'll ask me, "What's the relation between all of these? Why am I talking about all these new features?" It's because it's all powered by transformers, or transformer models as we call them. This is one of the reasons why, during this keynote, I want to talk to you about why I believe transformers are becoming the most impactful tech of the decade.
Let's go back to the beginning first, to see how this started. In 2017, few people noticed the release of the paper called "Attention Is All You Need" by Vaswani et al. Coming out of Google Brain, it proposed a new network architecture called the Transformer, based solely on attention mechanisms. It would change the field of machine learning forever. Directly inspired by the paper, BERT, or Bidirectional Encoder Representations from Transformers, was released a year later by a team from Google.
The same year, Hugging Face Transformers was created and released on GitHub. It became, in just a few months, the most popular library for machine learning, one that more than 5,000 companies are using today. Here, for example, you can see a graph of the growth of the number of GitHub stars for Hugging Face Transformers, in red, compared to Apache Spark, which gave birth to the company Databricks, Apache Kafka, which gave birth to the company Confluent, and Mongo from MongoDB.
Hugging Face Transformers is arguably the fastest growing open source project of the past three years. It has even been described by Sebastian Ruder, from DeepMind, as the most impactful tool for current NLP. And NLP, if you don't know it, is the machine learning field that relates to text, compared, for example, to vision, time series, or speech.
So how did that happen? To me, this was made possible mostly by three major trends. First, hardware has gotten exponentially better, especially GPUs or TPUs. Leading companies like Nvidia or Google have made compute not only cheaper, but also more powerful and more accessible. Previously only large companies with special hardware were able to process the high levels of computing power needed to train very large transformer models.
Second, we've seen that the web, mostly composed of text but also with images and voice, provided a large, open, and diverse dataset. Sites like Wikipedia or Reddit provided gigabytes of text and datasets for models to be properly trained.
And the last missing piece of the puzzle was transfer learning, for finally being able to create these new transformer models that are going to change the field forever. Before going further, I wanted to focus a little bit on that last piece, on transfer learning, to understand how different it is from traditional machine learning techniques. I think that explains some of the success of these models, why they're so promising, and why they could be the major technology trend of the decade.
The traditional way we train a machine learning model is to gather a dataset, initialize our weights from scratch, and then train our model on that data. If you're faced with a second task, task two here, you gather a new dataset and randomly initialize your weights from scratch again. Same thing if you have a third task. Now, the way transfer learning works is quite different.
We use all the knowledge from the past task to learn a new task. As you can see here, you see the source task, the knowledge, the learning system. And it works especially well to move from one task to another, taking advantage of all the learning and knowledge that has been gathered. It helps to learn from a very limited number of new data points and get more accurate results, and we can fill in the gaps.
There are many different ways to do transfer learning, but the way we are doing it now in transformers is called sequential transfer learning, as there are two steps. The first step is called pre-training. You take a base model and train it on a very large corpus, usually a slice of the web. You spend millions of dollars on compute. Weeks, not days, weeks of training. And you get a pre-trained language model, or pre-trained transformer model.
Once you have your pre-trained model, it becomes much, much easier to do the second step, which is adaptation, or fine-tuning, on a small dataset, your own data. For example, for your own task, for your own domain, or for your own language. At the end, you get your own fine-tuned model that you can use in your application. For example, for Hugging Face, over 5,000 companies are using these fine-tuned language models.
This new architecture, and the fact that it's super easy to fine-tune these models on your own domain, on your own task, on your own language, led to the emergence not only of BERT, but of tons of similarly architected transformer models, like GPT, RoBERTa, XLNet, and other BERT variants, and led to thousands of fine-tuned models that have been published on the Hugging Face model hub. Here, for example, you have the most popular ones. You can see RoBERTa base, BERT, T5, GPT, DistilBERT, and many, many more. There are more than 20,000 models that have been shared on the Hugging Face model hub.
So the next question is, why have these models become so popular? I think part of the answer is, first, that they constantly elevated the state of the art. What it means for your use case in the industry is much more accuracy for your predictions. For example, on SQuAD that we're seeing here, the Stanford Question Answering Dataset. Question answering is the task that we've seen on my first slides from Google: you give it a text as a context, then you ask a question, and it finds the answer to your question in the context text. Here, you can see that the exact match, so the right answer score, went from 70 to almost 90 in just six months, thanks to transformer models.
The funny thing is that, as the authors realized that the benchmark might be too easy for transformer models, they decided to design a new version, SQuAD 2, that was supposed to be harder. However, these new models just nailed this benchmark as easily as the first one. Here, you can see, they went from 70 to 90 exact match in just a matter of months again.
Interestingly, this progress didn't just apply to one type of task, here question answering, but across the board. First to most of the NLP tasks, so the text ones. And then, too, starting with computer vision, biology, chemistry, time series, and going to other domains.
The interesting thing is that these models learn sufficient general knowledge to perform well on a huge range of benchmarks. For example, the GLUE, General Language Understanding Evaluation, benchmark, which evaluates models on a collection of tasks, was topped right away by BERT when it was released. In just a year, models were surpassing human baselines, promisingly moving us from a world where only humans could understand natural language to a world where algorithms can start to help in classifying, understanding, and generating natural language, too. I'm not saying that we are at human level for all the tasks, but the fact that we're getting there for a specific benchmark, for specific tasks like that, is very exciting.
And what we've seen is that these models progressively went from being useful in the tasks that we know, like text classification, text generation, information extraction, to new use cases that weren't possible at all before. Some of those are multimodal models, right? So it's the ability to use transformer models not only for text, but sometimes text with image, video with text, speech plus video. For example, you can see here, from the DALL·E model from OpenAI, which has an open source version on the Hugging Face model hub, that you give it a text, a prompt, "an armchair in the shape of an avocado", and it's going to create a brand new image based on that. This opens the door to so many new use cases in the industry, so many new contexts where we'll be able to apply these transformer models. And as we start to apply them everywhere, something I wanted to talk about is the ethics of them.
We can't talk about the impact of transformer models without talking about ethics. Because the truth is, these models are biased. They're biased because of their datasets, and because of the way we're building them. An example of this is gender bias. Here, you can see, for example, the BERT model for next word prediction. If you give the BERT model a prompt, "this man works as", and ask it to predict the next word, you see that it is going to give you lawyer, printer, doctor, waiter, mechanic. Whereas if you prompt it with "this woman works as", you're going to get nurse, waitress, teacher, maid, prostitute. You can see here a clear gender bias. This is not acceptable, and it's time for the field not only to think about it, but also to take action.
This is one of the reasons why we've been lucky to see Meg Mitchell, one of the most renowned researchers on the topic, join us. She will help Hugging Face create tools for AI fairness. We've already started working on it. The most interesting thing that I've seen in the past few months is the model cards. What are they? They are a standardized way for researchers to communicate and share the potential limitations and biases of their models.
On the Hugging Face model hub, we've tried to incentivize contributors to add model cards. We've had quite a lot of success. Now we have 5,000 datasets and model cards that have been added to the Hugging Face hub. Why is it important? It is because, for example, now that we know that BERT has some gender bias issues and communicate that well in the model card, the companies or the engineers using BERT can know, for example, not to use it for resume filtering, because it's going to be biased. This is very important.
Another initiative that we've been taking is to try to democratize access to information by allowing these new models to work in more languages. As we know, English is very dominant in the world today, and a lot of the research resources are dedicated to English. This is something we need to change. So something that we've organized is a big community sprint with over 500 scientists all over the world, working on bringing these new transformers to other languages, especially what we call "low resource languages", which are languages that have very small datasets and that very few people are working on. It allows us, for example, to create more speech models and more translation models in over, I think, a hundred languages, thanks to this community sprint. For example, to do translation and give access to more information to more people. And this is just the beginning. I think for a lot of today's society's problems, from climate change to vaccines to toxicity on public platforms, transformers need to help, and they could help.
For example, on social networks, with all the toxicity that you have, with all the violence, the biases, the fake news comments, you can't have a human check all of them. First, because they would be depressed after a few weeks. And second, because the quantity of it is too big. We need machine learning and transformers to help. That's why every single company today needs to be intentional about driving more social impact with transformers. This is why I'm super excited about the impact of transformer models. For them not only to be the most impactful tech of the decade, but also perhaps the most useful one.
If you're interested in keeping the conversation going, feel free to check out Hugging Face. Reach out to me directly on Twitter or LinkedIn. Thanks everyone, and enjoy the rest of the event. Bye-bye.