Before we can tune a large language model, we need to prepare our data. RLHF requires two datasets: a preference dataset and a prompt dataset. In this course, we're going to be tuning the open-source Llama 2 model on a summarization task. Each example in our dataset is a post from Reddit and a corresponding summary. So, let's take a look at the data.

Let's start with the preference dataset. As a quick reminder, this is the dataset used to tune the reward model, and it's often one of the trickiest parts of RLHF because it's the dataset that's been annotated by humans, and different people have different preferences. There's usually a lot of work that goes into creating one of these datasets, but for this course, we're going to use a dataset that's already been created and preprocessed.

So, let's start by defining the path to this dataset. I've gone ahead and created a dataset for you called sample_preference.jsonl. This is a small version of the dataset that we're actually going to tune the model on. For best results, we recommend a dataset of around 5,000 to 10,000 examples, but since we're just doing a bit of data exploration, we're only going to load a tiny sample of the data into memory. First, I'm going to import json so that we can load this data, and then we'll create an empty list called preference_data. Next, we're going to loop over this JSONL file, and as we loop over the file we'll append each record to our preference_data list. So, let's execute the cell and then we can take a look at what the data looks like.

I'm going to define sample_1 as the first element in this preference_data list. If we look at the type, we can see that this is a dictionary, and if we look at the keys, you can see that this dictionary has four keys: the input text, candidate zero, candidate one, and the choice. Let's print the input text. This right here is our prompt. The prompt says, "I live right next to a huge university. I've been applying for a variety of jobs," et cetera, et cetera. You'll notice that if you get to the end of this prompt, it ends with this bracket, summary, close bracket, colon, that is, "[summary]:". And in fact, all of the samples in this dataset end this way.

If we look at a different sample in our original preference_data list, we can extract the input text key and just look at the last few characters. If we print this, you can see that this prompt also ends with the summary indicator. Let's bump this index one more time and take a look at another sample in this dataset. This one also ends with the summary indicator. So, for all of the examples in this dataset, the prompts end this way. The reason this is important is that your dataset examples need to match your expected production traffic. During training, this dataset contains the specific formatting, the specific keyword or instruction of summary. It's important that at inference time our data is formatted in the same way and contains the same instruction. So later, when we look at inference data, when we actually use this tuned model, we'll see that we include the same summary indicator there, so that the model can recognize the pattern. All right, so that's our first key, the input text, which is our prompt.
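The steps described so far look roughly like this in code. This is a minimal sketch; the file name and the key names (input_text, candidate_0, candidate_1, choice) are assumptions based on the walkthrough, not verified against the course notebook.

```python
import json

# Path to the small sample of the preference dataset (file name assumed)
preference_dataset_path = "sample_preference.jsonl"

# Load each JSON line into a list of dictionaries
preference_data = []
with open(preference_dataset_path) as f:
    for line in f:
        preference_data.append(json.loads(line))

# Inspect the first example
sample_1 = preference_data[0]
print(type(sample_1))    # <class 'dict'>
print(sample_1.keys())   # assumed keys: input_text, candidate_0, candidate_1, choice
print(sample_1["input_text"])

# Check that each prompt ends with the same "[summary]:" indicator
print(preference_data[1]["input_text"][-32:])
print(preference_data[2]["input_text"][-32:])
```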
Now let's take a look at the next two keys in our dictionary, which are candidate zero and candidate one. I'm going to go ahead and print out both of these. These are two possible completions for the same prompt; the task was to summarize the input text. Candidate zero's summary is, "When applying through a massive job portal, is it just one HR person seeing all of them?" And candidate one is, "When applying to many jobs through a single university jobs portal, is it just one HR person reading all my applications?" The human labeler was shown both of these candidates and asked to pick which one they preferred. We can see that preference in the final key of this dictionary, which is the choice. So, let's go ahead and print out the choice key. If we do that, you'll see that the value for this key is one, which means the labeler preferred candidate one. They thought that this summary was better than candidate zero. In this case, we would refer to candidate one as the winning candidate and candidate zero as the losing candidate, since candidate one was preferred by the human labeler.

Now, this is what the labeler of this particular example thought was the better summary, but you might have a different preference. So, take a minute and read through the entire input text and see if you agree. You can also look at the other samples in this preference dataset, read the corresponding summaries, and see whether you agree with the labelers. Remember that it's okay if you have a different opinion; picking the right labelers and providing the right criteria for your specific problem is difficult, and it depends a lot on your use case. But this is essentially what the preference dataset looks like. We're going to train our reward model on these triplets: the input text, which again is the prompt, the winning candidate, and the losing candidate. Once trained, the reward model will output a scalar value indicating how good a completion is, but we'll look at that more deeply in the next lesson.

For now, let's take a look at the second dataset that we need: the prompt dataset. Once the reward model has been trained, we're going to use it in the reinforcement learning loop to tune the base large language model. This process requires a prompt dataset, which consists of sample prompts. Like before, I've created a smaller version of this dataset, which we'll load into memory and take a look at in this notebook. First we'll define a path to this small dataset; I've called it sample_prompt.jsonl. Then again, we'll make an empty list and loop over this JSONL file, appending each record to the prompt_data list as we go. When we do that, we can take a look at how big this list is, and you'll see that it's very tiny. We're just loading in six examples of the much larger prompt dataset that we'll use in the next lesson, when we actually tune the base large language model.
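Continuing from the previous sketch, inspecting the candidates and loading the prompt data might look something like this. Again, the key names, the file name, and the assumption that choice is 0 or 1 come from the walkthrough rather than the actual notebook.

```python
import json

# Compare the two candidate summaries and see which one the labeler chose
sample_1 = preference_data[0]
print("candidate_0:", sample_1["candidate_0"])
print("candidate_1:", sample_1["candidate_1"])
print("choice:", sample_1["choice"])

# Identify the winning and losing candidates from the choice field
# (assumes choice is 0 or 1)
choice = int(sample_1["choice"])
winning_candidate = sample_1[f"candidate_{choice}"]
losing_candidate = sample_1[f"candidate_{1 - choice}"]

# Load the small prompt dataset the same way as the preference data
prompt_dataset_path = "sample_prompt.jsonl"  # file name assumed
prompt_data = []
with open(prompt_dataset_path) as f:
    for line in f:
        prompt_data.append(json.loads(line))

print(len(prompt_data))  # six examples in this small sample
```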
Now, a quick note on your prompts: it's important that the prompts in the preference dataset and this prompt dataset come from the same distribution. In this case, all of the prompts come from a dataset of Reddit posts, so they do come from the same distribution.

So now, we can take a look at some examples in this dataset. To help us visualize the data, I'm going to define a function called printD, for "print dictionary." It takes each key and value in a dictionary and prints a label along with the actual key, and another label along with the corresponding value. This will just help us visualize the information in this prompt_data list a little better. Let's define this function, and then we can use printD to print out the first element in our prompt_data list. We'll extract the first element and execute the cell. You can see here that we have the key input text, and the value is, "I noticed this the very first day I took a picture to see if it was one of my friends," et cetera, et cetera. You might notice that this again ends with the same summary indicator. So, this looks fairly similar to the preference dataset, but we just have one single key, the input text field, a.k.a. the prompt.

If we take a look at another example in this dataset, we can use the same printD function, this time extracting the second element in the list. If we print this, you can see again that there's only one key, the input text, and the corresponding value is a prompt: "No, I loved my health class. My teacher was amazing. Most days we just went outside," et cetera, et cetera. And it also ends with the summary indicator. So, that is our prompt dataset; it's just a dataset of prompts.

I encourage you to take a look at the other samples in this prompt dataset, and also, again, in the preference dataset. But essentially, these are the two main datasets that we're going to need in our RLHF tuning workflow. In the next lesson, we're going to use both of these datasets to actually tune our base large language model. So, I will see you there.
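For reference, here is a minimal sketch of the printD helper described above (written here as print_d); the exact label formatting is an assumption, and prompt_data comes from the previous sketch.

```python
def print_d(d):
    # Print each key and its value on labeled lines so long prompts are easier to scan
    for key, value in d.items():
        print(f"key: {key}")
        print(f"value: {value}")

# Inspect the first two prompts in the sample prompt dataset
print_d(prompt_data[0])
print_d(prompt_data[1])
```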