Scale Events
October 22, 2021

Netflix Explains Recommendations and Personalization

A TransformX Highlight

At TransformX, we brought together a community of leaders, visionaries, practitioners, and researchers across industries to explore the shift from research to reality within AI and Machine Learning.

Introducing Justin Basilico

For Justin Basilico, Director of Machine Learning and Recommender Systems at Netflix, ‘everything is a recommendation’.

In this TransformX session, Justin Basilico, Director of Machine Learning and Recommender Systems at Netflix, describes how ‘everything at Netflix is a recommendation’. He explores recent trends in recommender systems and how they are applied for each of Netflix's 200 million users.

- Justin Basilico, Director of Machine Learning and Recommender Systems at Netflix

What Are Justin’s Key Takeaways?

In this session, Justin describes how ‘everything at Netflix is a recommendation’. Recommendations at Netflix need to be personalized, but this is a challenging and non-trivial task. Justin describes four key approaches that Netflix uses to tackle personalization: Deep Learning, Causality, Bandits & Reinforcement Learning, and Objectives. If these terms aren't familiar to you, consider the following brief explanations:

  • Deep Learning - A field of machine learning that learns from data using multi-layered artificial neural networks, loosely inspired by the way the human brain processes information.

  • Causality - The analysis of cause-and-effect relationships, e.g., did you watch a movie because Netflix recommended it, because you wanted to watch it anyway, or both?

  • Bandits - A type of machine learning problem where you repeatedly choose among several options (‘arms’) with uncertain outcomes, and must balance trying new options against sticking with the ones that have worked well, in order to maximize your overall result.

  • Reinforcement Learning - A type of machine learning problem where, rather than making a single decision, you have to make multiple sequential decisions as part of a strategy.

  • Objectives - An objective is what you are optimizing your recommendation system for. When making movie recommendations, should you optimize for:

    • Finding enjoyable movies quickly?

    • The enjoyment of a single movie?

    • The enjoyment of multiple movies over a longer period of time?

Personalization is Key to the Netflix User Experience

Netflix aims to maximize its users’ satisfaction by presenting entertaining and relevant video content. It applies personalization to many aspects of the user experience, including content search and ranking. It’s no surprise, then, that at Netflix ‘everything is a recommendation’, according to Justin.

Personalization is Non-Trivial

Personalization is extremely challenging at scale, and especially at the scale Netflix operates at. Every person is unique, and sometimes multiple people share the same customer profile. Offering relevant content recommendations is a problem that is context-dependent, mood-dependent, and non-stationary: the ‘right’ recommendation for a person may depend on who they are watching with or the time of day, i.e., there is no single ‘right’ answer that always remains ‘right’.

There are also objectives beyond accuracy that must be considered. What should Netflix optimize for? Diversity, novelty, freshness, fairness, or something else entirely? It’s in this context that Justin summarizes Netflix’s approach to these problems.

Is Deep Learning the Best Option?

Deep Learning has shown great success in computer vision, natural language processing, and reinforcement learning, but that trend seems to differ for recommendation systems.

Deep Learning became popular around 2012, but it only took hold in recommender systems around 2017. Then, in 2019, several papers found that well-tuned traditional methods do about as well as deep learning for recommendations, which may help explain the slower adoption. Those traditional methods, such as collaborative filtering, recommend items that similar users have chosen; this is the ‘people who liked this movie also watched...’ style of recommendation. A typical collaborative filtering solution uses matrix factorization, which decomposes the sparse user-item interaction matrix into two dense matrices of user and item factors. The simplest neural network formulation instead learns embedding representations of the categorical user and item variables and combines them with a dot product.
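To make the contrast concrete, here is a minimal sketch (in Python with NumPy, on made-up toy data rather than anything from Netflix) of matrix factorization trained by gradient descent. Note that the final scoring step is just a dot product between a user embedding and an item embedding, which is exactly what the simplest neural formulation computes.

```python
import numpy as np

# Toy implicit-feedback matrix: rows are users, columns are titles, 1 = watched.
# A real matrix would be enormous and extremely sparse; this is only illustrative.
R = np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
], dtype=float)

n_users, n_items = R.shape
k = 2  # number of latent factors
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, k))  # dense user factors ("embeddings")
V = rng.normal(scale=0.1, size=(n_items, k))  # dense item factors

lr, reg = 0.05, 0.01
for _ in range(500):
    err = R - U @ V.T                # reconstruction error of the sparse matrix
    U += lr * (err @ V - reg * U)    # gradient step on squared error + L2 penalty
    V += lr * (err.T @ U - reg * V)

# Scoring user 0 against every title is a dot product of embeddings --
# the same computation a simple two-tower neural network would perform.
scores = U[0] @ V.T
print(np.argsort(-scores))  # titles ranked for user 0
```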

Deep Learning Can Be Combined With Traditional Methods

But does this similar performance mean deep learning is not useful? No. At Netflix, ideas from neural networks and from matrix factorization can be used to augment each other. One example is the Embarrassingly Shallow Autoencoder (EASE), a very shallow autoencoder for the collaborative filtering setting that is extremely efficient to train because it has a closed-form solution.
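EASE learns an item-to-item weight matrix with a zero diagonal directly from the Gram matrix of interactions. The sketch below follows the published EASE formulation on toy data; the regularization value is arbitrary and nothing here is Netflix's production code.

```python
import numpy as np

def ease(X, lam=10.0):
    """Embarrassingly Shallow Autoencoder (EASE).

    X is a (users x items) implicit-feedback matrix, lam is the L2 penalty.
    Returns an item-by-item weight matrix B with a zero diagonal.
    """
    G = X.T @ X + lam * np.eye(X.shape[1])  # regularized item Gram matrix
    P = np.linalg.inv(G)
    B = -P / np.diag(P)                     # closed-form item-item weights
    np.fill_diagonal(B, 0.0)                # zero diagonal rules out the trivial identity solution
    return B

# Toy data: 4 users x 5 titles.
X = np.array([
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
], dtype=float)

B = ease(X)
scores = X @ B                 # reconstruct the matrix to score every title per user
print(np.argsort(-scores[0]))  # ranked recommendations for user 0
```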

Netflix combines rich sources of data to make relevant recommendations. For example, a member’s data might include what they watched, when they watched it, and what device they watched it on. This user data, combined with categorical data about the titles themselves, is fed into a deep learning model. Netflix saw a large improvement in sequence prediction relative to baselines built on traditional methods like matrix factorization.
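As an illustration only (the embedding tables below are random and untrained, and the feature set is a guess at the kind of context Justin describes), a contextual sequence model boils down to: embed the watch history and the context, pool them into one representation, and put a softmax over all titles on top.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_devices, n_countries, d = 50, 4, 10, 16

# Embedding tables for the categorical inputs (sizes are made up).
item_emb = rng.normal(scale=0.1, size=(n_items, d))
device_emb = rng.normal(scale=0.1, size=(n_devices, d))
country_emb = rng.normal(scale=0.1, size=(n_countries, d))
W_out = rng.normal(scale=0.1, size=(d, n_items))  # softmax layer over all titles

def predict_next(watch_history, device_id, country_id):
    """Score every title given a member's recent watches plus context."""
    # Pool the watched-title embeddings (a production model would use an RNN,
    # CNN, or transformer here) and add in the context embeddings.
    h = item_emb[watch_history].mean(axis=0) + device_emb[device_id] + country_emb[country_id]
    logits = h @ W_out
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

probs = predict_next(watch_history=[3, 17, 42], device_id=1, country_id=5)
print(np.argsort(-probs)[:5])  # top-5 predicted next titles
```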

If You Had to Watch a Movie, Would You Like It?

A recommender can make biased recommendations that make it look like it is performing better than it actually is. The models described so far are correlational recommendation algorithms, which raises questions like, ‘Did you watch a movie because Netflix recommended it to you? Because you liked it? Or both?’.

The causal question Netflix asks is, ‘If you had to watch a movie, would you like it?’. If causality is not accounted for, it is very easy to fall into a feedback loop in which the recommendation system reinforces its own biases (simulations show this can reduce the usefulness of the system).

Netflix doesn’t have just one recommender; it has many, so there can be many unintended feedback loops. To handle this, Netflix uses a technique called propensity correction: the model predicts not only what someone might watch but also what the system would have shown that user in the past. That propensity is then used to correct the model and help prevent unintentional feedback loops.
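The talk doesn't spell out the exact architecture, but the general idea of using propensities to correct for exposure bias can be illustrated with standard inverse-propensity scoring. The sketch below (fabricated numbers, not Netflix data) uses logged propensities to estimate offline how a different recommendation policy would have performed.

```python
import numpy as np

# Logged bandit feedback: the title the old recommender showed, whether it was
# played, and the probability (propensity) with which the old policy showed it.
shown = np.array([2, 0, 1, 2, 0, 1])
reward = np.array([1, 0, 1, 1, 0, 0], dtype=float)
logging_propensity = np.array([0.7, 0.15, 0.15, 0.7, 0.15, 0.15])

# A hypothetical new policy to evaluate offline, expressed as the probability
# of showing each of the 3 titles (context-free here only to keep it short).
new_policy = np.array([0.2, 0.3, 0.5])

# Inverse-propensity scoring: reweight each logged reward by how much more (or
# less) often the new policy would have shown that title than the old one did.
weights = new_policy[shown] / logging_propensity
ips_estimate = np.mean(weights * reward)
print("estimated play rate under the new policy:", ips_estimate)
```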

How Do You Optimize for ‘Long-Term Joy’?

Contextual bandits can break feedback loops by exploring different choices. This also helps account for uncertainty about members’ interests and about new titles with little viewing history. Netflix treats the member as the context and each title as an arm, or ‘choice’. The context and the choice are passed to models that predict the probability of enjoyment.
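Here is a minimal sketch of the bandit idea, stripped of the context for brevity: each title's enjoyment probability gets a Beta posterior, and Thompson sampling shows the title whose sampled value is highest, so uncertain titles still get explored. The enjoyment probabilities below are invented; Netflix's version additionally conditions the predictions on the member (the context).

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms = 5  # one arm per candidate title
true_enjoyment = np.array([0.10, 0.25, 0.05, 0.40, 0.30])  # unknown to the bandit

# Beta posterior over each arm's probability of enjoyment.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

for _ in range(2000):
    # Thompson sampling: draw a plausible enjoyment rate for every title from
    # its posterior and show the title with the highest draw.
    samples = rng.beta(alpha, beta)
    arm = int(np.argmax(samples))
    enjoyed = rng.random() < true_enjoyment[arm]  # simulated member feedback
    alpha[arm] += enjoyed
    beta[arm] += 1 - enjoyed

print("posterior mean enjoyment per title:", alpha / (alpha + beta))
```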

There are challenges with using bandits in production. The data collected by bandits isn’t necessarily independently and identically distributed (IID). Because bandits collect data adaptively, initial noise can cause an arm (a title) to be chosen less often than it should be, keeping its observed value artificially low. To address this, Netflix took inspiration from Doubly Robust estimators, which led to an approach called Doubly Adaptive Thompson Sampling (DATS).

The Secret to ‘Long-Term Joy’? Delayed Rewards

To optimize for user enjoyment over the long term, the recommendations must be optimized for ‘rewards’ that pay off over time. This helps the model build a strategy that focuses on long-term enjoyment as opposed to immediate gratification.

To handle delayed rewards, reinforcement learning is used. It can be applied within a page or session, or across sessions. However, there are many challenges in applying reinforcement learning, especially because the problem space of users, titles, and engagement is highly dynamic. To help with this, Netflix developed Accordion, a simulator for evaluating recommendations that accounts for that dynamic nature.
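One way to see why delayed rewards change the objective is to compare two hypothetical strategies on a discounted sum of per-session enjoyment, the standard reinforcement learning notion of return. The numbers below are invented for illustration.

```python
import numpy as np

def discounted_return(rewards, gamma=0.95):
    """Sum of future rewards, discounted so that enjoyment far in the future
    counts slightly less than enjoyment today."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.sum(gamma ** np.arange(len(rewards)) * rewards))

# Per-session enjoyment for two hypothetical strategies: a quick hit now
# versus recommendations that pay off over several later sessions.
immediate_gratification = [1.0, 0.1, 0.1, 0.1, 0.1]
long_term_strategy = [0.3, 0.6, 0.7, 0.8, 0.8]

print(discounted_return(immediate_gratification))  # ≈ 1.35
print(discounted_return(long_term_strategy))       # ≈ 2.84
```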

Want to learn more?

  • See more insights from AI researchers, practitioners, and leaders at Scale Exchange

About Justin Basilico

Justin Basilico is a research/engineering director at Netflix. He leads an applied research team that creates machine learning-based personalization algorithms that power Netflix’s recommendations. Prior to Netflix, he worked in the Cognitive Systems group at Sandia National Laboratories. He has an MS in Computer Science from Brown University and a BA in Computer Science from Pomona College.

Full Transcript

Nika Carlson (00:15):

Next up, we're excited to welcome Justin Basilico. Justin Basilico is the Director of Machine Learning and Recommender Systems at Netflix, where he leads an applied research team that creates the algorithms used to personalize the Netflix homepage through machine learning, recommender systems and large scale software engineering. Prior to Netflix, he worked in the Cognitive Systems Group at Sandia National Laboratories. He has an M.S. in computer science from Brown University and a BA in computer science from Pomona College. Justin, over to you.

Justin Basilico (00:54):

Hi, my name's Justin. I lead a machine learning team here at Netflix, working on the algorithms that decide what recommendations to show on people's homepage. Now, having worked at Netflix for over 10 years now, there's a common set of questions I get about the work we're doing. And today, I'm going to go through some of those questions and really highlight some of the recent areas of work we've been doing here at Netflix. So the first question I get from people at a high level is, "Well, why does Netflix spend so much time focusing on personalization?" And the answer to that is we really want to help members find entertainment that they really want to watch and they're going to enjoy watching so we can maximize their satisfaction and the value that they get out of Netflix. And then that in turn hopefully maximizes their retention so they'll stay customers with us and we can keep making a better product and more content to then fuel this nice cycle.

Justin Basilico (01:51):

But I like to think of it as beyond that. I really think about what we're trying to do is trying to really spark joy with people, help them find that next great TV show or movie that they absolutely love and becomes their new favorite and just make sure that they're spending their time in the way that they get the most out of what they're watching. So in order to achieve this goal, what is it that we try to personalize? The most obvious way you can see the personalization is in how we rank items on our service. So there's rows like Top Picks, where we rank things in a personalized way. But also, things like the genres and how we rank them based on people's preferences. But we go beyond that and actually construct the page and choose the rows and the sets of recommendations in a personalized way so that people can easily browse and find something great to watch no matter what it is they're in the mood for.

Justin Basilico (02:48):

We also take the search problem and go beyond just doing a textual search to turn it into a recommendation problem of helping them find, given the query, what are the relevant set of TV shows and movies that might be interesting for them? We've also gone on to try out new modes of interaction. For example, we recently launched this Play Something feature, which is when people are having trouble finding something to watch, they can go into this experience and the algorithm will try to suggest something for them, and then they can interact with it and provide feedback in order to find better items as well. We also personalize how we reach out to people even when they're not on the service by highlighting new shows we think that they might like, or other favorites that might be in our catalog that they haven't seen yet. So we like to say at Netflix, really everything is a recommendation and everything is about helping those members find those new pieces of content that's really going to spark joy for them.

Justin Basilico (03:48):

So a common question I then get is, isn't this solved yet? That Netflix has been working on this for a long time, that the Netflix Prize started in 2006, which was a long time ago. [inaudible 00:03:59] diminishing returns and there must not be that much to work on anymore. And the answer to this is no, not at all. Personalization is a really hard problem because fundamentally, it deals with people and understanding and trying to predict what people want is extremely hard. Just think about yourself. Sometimes, it's very hard for you to predict even what you want and you know the most about yourself. And what we're trying to do is predict what people want all around the world when we have a whole variety of people and each person is unique with a variety of interests. And then sometimes, people do things like deciding, "Oh, multiple people want to watch together," and we need to be able to recommend and personalize for those cases as well.

Justin Basilico (04:38):

And sometimes, people also decide they're going to share the same profile and not keep separate profiles so they can keep their recommendations separate. And we still want the recommendation system to work across all these different types of cases. Also, recommendation is hard because, well, it's really useful in helping people find something to watch when they're not sure what they want to watch. If someone really knows what they're looking for, a specific piece of content, hopefully we can get it up on the homepage for them and it's right there. But if it's not, it's very easy for them to go into search and find it there. But what the recommendations are really helpful for is when people aren't quite sure what it is they want to watch, they're looking around and we need to figure out what it is that they're really looking for, get it up in front of them and also help them understand why something that they might have never heard of before is really a great thing to watch in their current circumstance.

Justin Basilico (05:28):

And we have to do this handling the fact that people's tastes change over time, so they're non-stationary, they depend on the context they're in. Like I mentioned, maybe they're watching with other people, also can be very dependent on people's mood. So we need to be adaptive to that. From the machine learning side, we have large data sets because we have a lot of people we're dealing with. But then also, a small amount of data per user because with long form content like movies and TV shows, you don't get a huge number of interactions to learn from. And we also have to deal with the fact that the interactions we do get are potentially biased by the output of the system, because when we show people recommendations, it's going to influence what they're going to watch. We're also surrounded by cold start problems on all sides. So we have a brand new person come on the service and they have very little data about what they've already watched, or we have new titles come onto the service and we need to figure out who's interested in them.

Justin Basilico (06:21):

We also have to think about new UI paradigms and devices coming on and how we can make sure the personalization works across all of those. Also, recommendation is more than just accuracy and getting the right recommendation in front of people, but also about things like making sure there's a good diversity of TV shows and movies that we're showing each person that'll help people find novel things, not just things that they've already heard of, that there's freshness so that the recommendations don't get stale. And also, that they're presented in a way that's fair. And I could keep going on and on, but I hope you feel like you understand that there's a huge variety of problems that we're trying to work on in this space. So that leads to this next question of okay, hopefully you're convinced that it's hard. What are we trying to do about it? So what I'm going to do now is go through and highlight some of the recent avenues that we're looking at for improving our personalization and these also correspond to some recent areas of research within the recommendation system community.

Justin Basilico (07:17):

And this is just some of them, there's obviously a lot of other work being done at Netflix, but I think these are the most interesting to people from the machine learning community these days. So the first trend is that of using deep learning for recommendations. So deep learning became popular back, say, around 2012 when there were a lot of breakthroughs in computer vision, and then NLP, showing how those models can perform really well. And it took a while for deep learning to start taking off in recommender systems, it took about five years. It wasn't until around 2017 that that happened. So you might ask a question of what took so long. And before we go into that, there is another twist which is that a few years later, there were several papers looking at some of this initial deep learning research and finding out that traditional methods might actually do as well as deep learning methods, or sometimes even better, for recommendation systems. So that leaves us with this head scratcher of what's going on with recommendations? Why is it following a different pattern than a lot of other machine learning problems?

Justin Basilico (08:17):

To understand that, let's take a step back and look at what the traditional way of framing a recommendation problem would be. So this is a classic formulation where you have a matrix where you take users you have on one side and items on another, and then you'd fill in this matrix based on when a user has had an interaction with an item. Typically, say, if it's implicit feedback, it might say that there's a positive, a one, when there was a good interaction. And then you have this very sparse matrix and you might think of most of the elements being missing or as being very soft negatives, because if someone didn't interact with something, they might indicate that they could have, but didn't. And the typical way of solving this type of recommendation problem falls in this collaborative filtering framing, where the idea is to recommend items that similar users have chosen or liked in the past.

Justin Basilico (09:13):

So a typical solution for this, which is something that became very popular via the Netflix Prize, is to take a matrix factorization view. So you take this big sparse matrix and then you break it down into two dense matrices that correspond to the user and item dimensions. And then you can learn this factorization where you basically try to reconstruct it from these two low rank matrices, minimizing a squared error plus some regularization term. So now, to step from this towards neural networks, you can then try to look at this problem from a different perspective. So another way of framing the same equation is thinking about having a network where you have two categorical variables, one with a very high dimensional, say, user ID and another one with a high dimensional item ID. And then you learn embeddings for them. And then rather than putting it through a network, you just take the two embeddings, you take a dot product and then you use that to train with, again, the squared error and add in a regularization term.

Justin Basilico (10:19):

So this is a very simple network approach. You can then take it a step further and say, "Okay, well I have these two embeddings. Instead of just doing the dot product, what if I plug in all my favorite types of deep learning models and then use that to then do the prediction?" And when you do this and you tune and try out lots of different approaches [inaudible 00:10:42] and you tune the matrix factorization as well, what you find is that you can get them to about the same performance level. Sometimes, the matrix factorization can do a little better if you overfit too much in the feedforward. But this is really reflective. This is what we have seen when we were initially trying deep learning in our data sets. But also, what was shown in these papers that came out in 2019 really was reflective of people not necessarily choosing and tuning their baselines well, or changing what metrics are being looked at. So this sounds like, well, is there really anything that can be done with deep learning and recommendations?

Justin Basilico (11:26):

So to understand that, you have to think about well really what we just showed is that there's a really clear correspondence between typical recommendation techniques like matrix factorization and a bunch of different variants and neural networks. And really what that means is that we can take things we learned about those approaches either way and apply them. So we can take ideas that work with matrix factorization and transform to neural networks. We can also take what we've learned from neural networks and try them in matrix factorization. And one way that we have come up with doing this, that was done by one of my colleagues, Harald, is to really just take the idea of auto-encoders and the recommendation approach SLIM, and really just boil it down to a really simple problem of trying to take this initial rating matrix and then learn an item-by-item matrix X, where you just make sure that the diagonal is zero to avoid a trivial solution of learning the identity and just use that to reconstruct the original matrix. So it creates a very, very simple, shallow auto-encoder.

Justin Basilico (12:28):

And this can work really well and can be done really fast because there's a closed form solution. Does that mean that deep learning isn't useful in recommender systems? No. It doesn't mean that. What it means is that this problem where you just are looking at this user item matrix formulation is really just not representative of what we currently are trying to do today. Modern recommendation data is much more nuanced. We have much more rich information about the context and types of events that we're collecting when people are interacting with the recommendation system. And this includes both the impressions in terms of what is it we're actually showing the users, but also then the variety of interactions which can be with the items or also things like searches that we can use to understand what it is that people are looking for. We can then go beyond that and look at actually the item data itself, which is typically ignored in a matrix factorization approach because you're learning just embeddings from the interaction data only.

Justin Basilico (13:29):

And here, what we can do is look at the actual video, the audio, the text subtitles, or scripts along with metadata such as tags or popularities that we can feed into the recommender. On the user side, we can look at profile settings and while we don't collect or use any demographic data, we can look at things like is this a kid's profile? Or what maturity level is set? As well as what languages are set on the profile to understand what would be a good recommendation in our model? Zooming into one element of that, you can look at the user data when you have a contextual sequence and formulate the problem like this. So we have a sequence of data for each member. We might have the user, what country they're in, what type of device, and also what time and day it was when they visited. And then what was the item, the TV show or movie that they actually ended up watching?

Justin Basilico (14:27):

We can create this sequence where we have both the sequential and time information, and then given a new context when someone shows up, use this to predict what a good recommendation for them would be. And from that, you easily come up with a neural network architecture for it by taking these input interactions, putting in embedding tables for the various categorical variables like what video it is, and then use your favorite approaches from deep learning, be it traditional feed forward networks, recurrent networks, convolutional networks, or transformers to be able to come up with a good representation for this contextual sequence, that then you feed through something like a softmax layer over items to then predict what it is that we think someone will likely want to watch and enjoy in that session. And when we do this, we're able to see that these deep learning approaches are able to really increase the accuracy of the system, especially on offline metrics, by providing this additional context and time information, which is much harder to put into traditional recommendation approaches like factorization machines.

Justin Basilico (15:40):

So this leads to now the second trend which is that of causality. So what I've been talking about so far is really about using a correlational model for doing recommendations, and really most recommendation algorithms are correlational. And some of the early ones were only correlations where you just literally computed correlations between users and items to create them. But when you're looking at real world data, you have to ask the question did someone watch a movie because we recommended it to them? Did they watch it because they actually wanted to watch it and liked it? Or was it because of both? And really the causal question that I think about asking is if you had to watch a movie, would you end up actually liking it? And this is moving from this correlational problem to this problem of what happens in terms of what is the outcome when we actually do the recommendation for the user? And what we find is that if you don't take causality into account, it's very easy for the problems to come up like feedback loops in the system.

Justin Basilico (16:38):

So what we find is that if you have a system that might have a little bit of bias towards impressing certain items, it might show them more and then inflate the number of plays, which then can lead to inflated popularity and create a feedback loop. And there was a nice paper showing that in simulations, you can see how this impacts the quality of the recommender system. And we found in reality that we do observe these types of oscillations, where here I have an example of something where there's oscillations in the genre of recommendations being served, and this didn't impact the overall quality of the model, but you can see this increasing set of oscillations that had us worried. And we had to go in and do something to fix it. And for us, because we don't have just one recommender, but a lot of different recommender systems, we have to deal with not just one feedback loop, but lots of feedback loops where each recommender can be feedbacking on itself. And then also be having feedback loops via each other.

Justin Basilico (17:33):

So what do we do? Why doesn't everything just break down? So in this closed loop where you show recommendations and people can only watch what's recommended, and that influences the training data, we can see that that's a real danger zone where you can potentially have these types of bad effects. However, a lot of times when we're training recommenders, we're in more of this open loop scenario where we have some search data that makes us think, "Hey, people are still able to watch things outside the control of our recommender. What can we do about using that data to actually then help stabilize and make things work?" One way of doing this is by adding a propensity correction term to the model. So taking the type of model I showed earlier, we can add in this additional head where what we're trying to do is not just predict what it is someone's going to watch, but also try to predict what it is that the system would've shown that user in the past.

Justin Basilico (18:25):

And understand that probability so that we can use it to correct what the model is learning to make it do so in a more causal way. And in particular, when we train this additional head, we want to make sure that when we're doing gradient descent, we're not passing the gradient back from the propensity softmax, we're just using the same representation in order to be able to do this adjustment. Of course, there's a lot of challenges in using causal models for recommendation. The first one [inaudible 00:18:58] is really how you handle unobserved confounders and how you come up with the right causal graph. We can understand the causal graph from the perspective of understanding what's happening within our system, but because we're dealing with people, there's always this challenge of trying to figure out the right causal graph to use for them as well. We also find that causal models can have high variance, especially these propensity-based ones, which can cause challenges with using them in practice.

Justin Basilico (19:24):

There's also challenges in trying to scale them up and also how you evaluate these methods off-policy. And as I alluded to earlier, when you have these open loop problems, there's also this question of when it makes sense and how we want to introduce exploration. And that's what leads us to the third trend I want to talk about, which is that of using bandits and reinforcement learning for recommendation. So why use contextual bandits for recommendations? Well first answer is that it helps with that previous problem I was discussing of helping to break feedback loops. But they're also nice because they introduce exploration into the system where we can then use the exploration to learn about new members or new interests or new items over time because we can take into account the uncertainty that we have in people's interests. They're also good about dealing with sparse and indirect feedback that we collect in the system and can also be modified to be able to handle and track changing trends over time.

Justin Basilico (20:23):

As an example of how you can frame recommendation as a bandit problem, we can think about if we wanted to decide what TV show or movie we want to highlight at the very top of the page when someone logs in that we might be able to play a trailer for and help them understand really well what this piece of content and recommendation might be so they can decide if that's a great thing for them to watch. And from that, we can decide from the whole set of videos we have in our library what is the one we want to show there. Mapping this to a contextual bandit problem then is pretty easy. So we can think of here the environment the bandit is in is this Netflix homepage. The context is the member plus all of the information we have in terms of what people have watched in the past and past impressions and so on. We can then think of the arm of this bandit as being the video we're going to decide to display at the top of the page. So there's one arm for each video.

Justin Basilico (21:20):

And then the policy we're trying to learn is deciding from this context what video we want to choose to recommend there. And then the reward is looking at, once we do this recommendation, does someone decide to engage with this and play it and enjoy this video once we've recommended it? Going a little deeper inside the bandit itself, we can think of it as having this setup where we pass in the member, that's the context. We then have an arm for each video. And from that combination of member, context and features... Sorry, member, context and video, we can create features and then pass those features to either separate models for each video or one big model, like one of those neural networks we had earlier, to do the prediction. And then when we predict, we can think of not just doing a point wise prediction, but actually trying to predict the distribution of the probability of enjoyment.

Justin Basilico (22:11):

And then use that distribution and either sample it with Thompson sampling or use something like UCB to balance how we explore plus pick the one we think is most likely to be the best to then decide which one we actually want to show the member. Of course, there's challenges with using bandits in production. And one of them is that the data that you collect from them can violate the IID assumption that's pretty fundamental to a lot of machine learning models. And the reason for this is because bandits collect data adaptively and decide what they want to show, and due to some initial noise in the system, that can mean choosing a specific arm less often, which can then over time keep its overall sample mean low. And we can try to adjust this like we did before by introducing propensity weighting. But this typically then creates a very high variance, especially in a case where we want to understand a whole distribution, it can be problematic.

Justin Basilico (23:07):

So to address this, what we can do is take inspiration from approaches such as Doubly Robust estimators. And this is something that one of my colleagues, Maria, did with some collaborators and came up with an approach called Doubly Adaptive Thompson Sampling, which replaces the distribution for the posterior with an Adaptive Doubly Robust estimator. And what we find with this is that this doubly adaptive approach is able to have lower regret in practice while still matching the theoretical bound that you might get from Thompson sampling itself. Of course, there are lots of challenges with using bandits in the real world. It's hard to design a good exploration, especially when you're not just trying to minimize your [inaudible 00:23:57] for a single bandit, but also thinking about how you want to support future algorithm or even user interface innovation in the future.

Justin Basilico (24:04):

It's also challenging to do member level A/B tests where we're trying to compare one full on-policy bandit at scale with all the feedback loops that can influence versus another, because you don't want the exploration that one is trying to do to then allow the other one to learn from it. Bandits are also challenging in this domain because they can be over large action spaces. So this simple bandit that's just choosing one item already has a huge number of arms it can select from, but then if you want to think about selecting a whole ranking or whole slate, which would be potentially sets of rankings, those spaces grow combinatorially. We also have layers of bandits that can influence each other. And then we need to also handle delayed rewards, because it's not that we can always observe the reward of whether someone really enjoyed something in a very quick amount of time.

Justin Basilico (24:53):

So to handle this problem of this long-term reward, it sounds a bit like a reinforcement learning problem, because here what we have is an objective where we want to optimize people's long-term member joy, that involves many visits and interactions with the recommendation system across time. And so we want to be able to try to use approaches from reinforcement learning to take this on. So there's a few ways you can think about reinforcement learning for recommendations. The first one is to actually use it to take on the problem of creating a ranking or a slate that we might have with a bandit. So we can treat choosing each item in a list or each item in a slate as a sequential problem and apply reinforcement learning there. We can also look at it within a session. So using the reinforcement learning to try to optimize the likelihood of a good interaction across multiple canvases or pages in a session, or we could optimize it across multiple sessions to go for a really long timeframe.

Justin Basilico (25:59):

And ideally, we'd have reinforcement learning algorithms that scale really well across all of these different timeframes. Of course, there's a lot of challenges in doing reinforcement learning for recommendations. It's a very high-dimensional problem. As I mentioned, the item space is very large, but we also have the challenge that the state representation which are member histories can also be a really large space. So handling these high-dimensional spaces is a challenge. We're also doing everything off-policy, which means that we don't want to just take a new reinforcement learning algorithm and deploy it and have to learn everything from scratch. But we need to be able to learn from all the data we already have to be able to get it to a place where hopefully it's even better than what's already been observed in production using the system actions. It's a concurrent problem where we don't observe full trajectories, but we're learning simultaneously from many interactions.

Justin Basilico (26:47):

The action space is evolving as new actions come in and some leave over time. And then there's non-stationarity in terms of the rewards you might get from different actions. And then finally, we have the challenge of reward function design, which is how do we express the objective in a good way to get these models to learn? One thing we try to do in this space of reinforcement learning is also think about how do we do this simulation to understand when these models are working in terms of their long-term effects. The typical recommendation simulation is pretty easy, because it's just about taking some items, scoring them, ranking them, and then looking at what people interact with. We've done work to extend that to look beyond just a single ranking to actually create whole pages and understand if we change up any piece of the recommendation system, what that does to the page at the page level.

Justin Basilico (27:35):

And recently, we have a paper at [inaudible 00:27:38] looking at extending this even further to look at what happens across time, by including a user model and the recommendation. And in addition, figure out how do we put in a visit model because now if we're looking at what happens across different sessions, what we show in one session is going to impact whether or not someone visits in the future. So we need to have a good model to handle that so we can see what the long-term effects would be. Of course, there's a lot of active work in the area of doing reinforcement learning for recommendations in the community. There's some great work on looking at how to learn embeddings for actions, to handle the high dimensional space, using adversarial models for simulation, using approaches like policy gradient to train recommenders, combining reinforcement learning and other approaches by treating it as a multitask problem, using it to handle diversity, training slates and using multiple recommenders. And I think there can still continue to be a lot of really interesting work in this avenue in the future.

Justin Basilico (28:42):

This leads to my final trend which is that of objectives. And the question here is well what is our recommender really trying to optimize? And I mentioned this quickly before, but what we really want to do is try to optimize for this long-term member joy, but we have to do that while accounting for a bunch of other factors like avoiding trust busters, cold-starting, fairness and making sure that things are findable so people don't feel like we're hiding something from them. And if there wasn't this recommender there that they would be able to find something else. And we need to be able to do this objective by working through a whole set of different layers of metrics. So starting with the training objective all the way through the offline metric, the online metric we measure in an A/B test, and then a goal, which would be to optimize for long-term member joy. And for anyone who works in the industry, you find there's this problem where there's often this misaligned metric.

Justin Basilico (29:41):

So you might start off with a really simple thing of saying well I have RMSE, but then I evaluate it as a ranking metric. And then I look at just member engagement, but I really want to optimize for this joy. And each of these things can actually be not very well aligned and that can lead to problems because really the recommenders that we build can only be as good as the metrics we measure them on. So if we're measuring the wrong things and we're using that to decide which ones to roll out or how to optimize them, that's going to be a limitation on how well they can perform. So that's why we spend a lot of time thinking about how do we improve these metrics and objectives. And one piece of work we had done in this space is by leveraging bandit style replay metrics, but then addressing the problems that we have of having a low number of matches by taking inspiration from what's done typically with ranking systems to reward policies that are able to rank arms that have high reward, even when they're not the top one.

Justin Basilico (30:41):

And they're able to push down ones that are low reward. And what we find is when you use metrics like this, we're able to improve the correlation of this type of metric with the A/B test results. Of course, there's a lot of challenges with metrics in terms of how do you deal with the difference between what you want and what you encapsulate in a metric, these really deep questions like where does enjoyment come from? How does that vary by person? How do we measure that at scale? How do we look at effects beyond a typical A/B test horizon, which might be a month or two? But are we going to see those gains continue into the future? We also want to be able to incorporate fairness into the objectives. And there's a couple pieces of work looking at this both from a calibration dimension of user tastes and also fairness of user and item cold-start.

Justin Basilico (31:25):

And then even just going beyond the fairness of the algorithms, just making sure that no matter what we're doing here in the objective space, we're going to be sure that we're going to have a positive impact on society. So to conclude, I went through a few recent trends of work we've been doing in the personalization space: deep learning, causality, bandits and reinforcement learning, and objectives. And if these types of problems sound interesting to you, we're hiring. You can go visit our research site and look for more information, look at jobs. We're also looking for summer 2022 interns. And with that, I thank you so much for coming and watching this talk. Please reach out to me if you have any questions.
