Now that you've had a chance to get some intuition about embeddings, let's check out some different applications. As we've done before, we'll start by setting up our credentials and authenticating. We'll also need to specify the region we'll be running this service in, and then we can import the Vertex AI Python SDK and initialize it.

With that setup done, we're ready to load in our data. For this tutorial, we're going to use the Stack Overflow dataset of questions and answers, which is hosted in BigQuery, a serverless data warehouse. We'll start by importing the BigQuery Python client, and then we can write a function that takes in a SQL query as a string and executes it in BigQuery. Let's go ahead and paste in that function. Again, this function takes in some SQL, runs the query in BigQuery, and returns the results as a pandas DataFrame, which we can then use in our notebook for some different applications. You don't really need to know much about how this BigQuery function works, so don't worry if you don't understand the details; we're just using it to get the dataset for this lesson.

Next, we'll create a list of the language tags we want to query. This dataset is very large and won't fit in memory, so we don't want to pull in all the data; we just want a small subset of Stack Overflow posts for a few different programming languages. To get our data, we'll start by creating an empty DataFrame, and then we'll loop over the list of languages, executing a SQL query in BigQuery for each one. Here is the query we'll be running. If you're not familiar with SQL, again, don't worry too much about it; it basically pulls the first 500 posts for a particular language tag from the Stack Overflow dataset. Once we've done that, we concatenate all the results into one single DataFrame, which we'll use in this notebook. Let's run the cell; this might take a minute or so. You can see that it's pulling the data for each of the four languages we're interested in.

Note that if you ran into any errors while executing the BigQuery code, or you just don't want to use BigQuery at all, you can run this cell right here instead, which will pull in the data for you from a CSV file. That's an optional step in case the BigQuery code didn't work for you or you'd rather skip it.

Now that we have all of our data, let's examine it and see what it actually looks like. Here's our DataFrame. We can see at the bottom that it's 2,000 rows by three columns, so that's 2,000 different Stack Overflow posts, 500 for each of the languages we queried. As for the three columns: first is the input text, which is the title of the Stack Overflow post concatenated with the question body; next, the output text column is the accepted answer from the community for that post; and finally, there's a column for the category, which is the programming language.

Now that we've got our data, we can start embedding it and using it for some different applications. We'll first load in our text embedding model, which we've done before; this is the textembedding-gecko model.
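To make that data-loading step concrete, here's a rough sketch of what the notebook is doing. The helper name run_bq_query, the exact SQL, the table paths, the model version string, and the fourth language tag are assumptions for illustration; the lesson only names Python, HTML, and R, and the real query may differ.

```python
from google.cloud import bigquery
import pandas as pd
from vertexai.language_models import TextEmbeddingModel

def run_bq_query(sql: str) -> pd.DataFrame:
    """Execute a SQL query in BigQuery and return the results as a pandas DataFrame."""
    client = bigquery.Client()
    return client.query(sql).result().to_dataframe()

# Pull a small subset of posts for a few language tags; the full dataset
# is far too large to fit in memory.
languages = ["python", "html", "r", "css"]  # fourth tag is a guess; the lesson names Python, HTML, and R

so_df = pd.DataFrame()
for language in languages:
    sql = f"""
        SELECT CONCAT(q.title, q.body) AS input_text,
               a.body AS output_text,
               '{language}' AS category
        FROM `bigquery-public-data.stackoverflow.posts_questions` q
        JOIN `bigquery-public-data.stackoverflow.posts_answers` a
          ON q.accepted_answer_id = a.id
        WHERE q.tags LIKE '%{language}%'
        LIMIT 500
    """
    so_df = pd.concat([so_df, run_bq_query(sql)], ignore_index=True)

print(so_df.shape)  # expect (2000, 3): input_text, output_text, category

# Load the text embedding model used in the previous lessons.
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
```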
Now, in the past labs, we went ahead and just started using the embeddings function to create embeddings. But because we have a lot of data in this notebook, we need to do a little bit of extra work: we'll need some helper functions to batch the data and send it to the embedding API.

First, we'll define those helper functions. The first one is called generate_batches, and it takes in our data and creates batches of size five. The reason we need this is that, according to the documentation, the API we're using can (at this point) only handle up to five text instances per request, so we'll need to split our data into batches of five in order to get our text embeddings. We can try out this function on just the first 200 rows of our DataFrame: we'll call generate_batches() on those 200 rows and see what the result looks like. You can see that it creates nice batches of size five, which will be very useful when we want to embed all of the data in our DataFrame.

The next helper function we'll define is called encode_text_to_embeddings, which is a wrapper around get_embeddings, the function you've used in the previous labs to get embeddings for text input. Let's run this function on a batch of our data. We just created a batch of five sentences, so we can run encode_text_to_embeddings and print out the length of the result. We get back five embeddings, because we passed in five text instances, and each one is of size 768. That number 768 should look familiar by now: it's the number of dimensions the textembedding-gecko model returns for any text input you provide.

In addition to batching your instances into groups of five, there's one other thing to be aware of: most Google Cloud services do have rate limits on how many requests you can send per minute. So, we've written a helper function for you called encode_text_to_embedding_batched, and this function manages both batching the data and handling the rate limits. If you want to use this for your own projects, this is the code you would actually execute. But because we want to be mindful of rate limits in this online classroom, we're not going to generate embeddings for all 2,000 rows of our data right now; instead, we're just going to load them in. Again, for your own projects, you can use encode_text_to_embedding_batched: you pass in the data you want embedded, and it handles both batching and the rate limits for you.

Next, we'll load in the embeddings that we generated ahead of time. To make sure these embeddings map properly to the Stack Overflow questions, we'll also reload a CSV file of Stack Overflow questions and check that it matches what we loaded from BigQuery. Then, using pickle, we can load in the pickle file that has all of our embeddings. Let's take a look at this embeddings array; we'll print out the shape and the array itself. We can see that our data is now of size 2,000 by 768.
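Here's roughly what those two helpers might look like, plus loading the precomputed embeddings. The exact signatures, the model version string, and the pickle filename are assumptions; the notebook's versions may differ slightly.

```python
import pickle
from typing import Generator, List

from vertexai.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

def generate_batches(sentences: List[str], batch_size: int = 5) -> Generator[List[str], None, None]:
    # The embedding API accepts at most five text instances per request,
    # so yield the input in chunks of (at most) five.
    for i in range(0, len(sentences), batch_size):
        yield sentences[i : i + batch_size]

def encode_text_to_embeddings(sentences: List[str]) -> List[List[float]]:
    # Thin wrapper around the SDK's get_embeddings call, returning plain lists of floats.
    embeddings = model.get_embeddings(sentences)
    return [embedding.values for embedding in embeddings]

# Try the helpers on the first 200 rows of the DataFrame.
batches = list(generate_batches(so_df.input_text.tolist()[:200]))
print(len(batches), len(batches[0]))  # 40 batches, each of size 5

first_batch_embeddings = encode_text_to_embeddings(batches[0])
print(len(first_batch_embeddings), len(first_batch_embeddings[0]))  # 5 embeddings, 768 dimensions each

# In the classroom we load precomputed embeddings instead of calling the API
# for all 2,000 rows. The filename here is a placeholder.
with open("question_embeddings.pkl", "rb") as f:
    question_embeddings = pickle.load(f)
print(question_embeddings.shape)  # (2000, 768)
```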
So that's our 2,000 Stack Overflow posts, each one represented by a 768-dimensional embedding vector. Now that we have this data and we have it embedded, we can get started with some different applications.

The first application we'll try out is clustering our data, and we're going to use the k-means algorithm to cluster these posts. First, we'll import KMeans from scikit-learn, and we'll also import PCA, which we used in an earlier lesson and which will help us visualize our clusters. Just to make the visualization a little easier, we're only going to look at the first 1,000 rows of our dataset, so we'll take our full set of question embeddings and keep just the first 1,000. These are the posts that were tagged as Python or HTML.

For our clustering, we first need to define the number of clusters, and we'll set that to two. Then we'll create and fit our k-means model on the clustering dataset we just created, the first 1,000 rows of our Stack Overflow data. Once we've done that, we can extract the labels, which tell us, for each item in our dataset, which of the two clusters it belongs to. As before, we can't visualize all 768 dimensions, so we'll use PCA to represent our data in 2D. This is code we used in a previous lesson: we'll create our PCA model with the number of components set to 2, fit it, and then transform our clustering dataset. Once we've done that, we can import matplotlib and plot the data. Let's go ahead and visualize what our clusters look like.

Here we have our data, and you can see that it forms two pretty distinct clusters. The questions tagged as HTML are over here on the left with the red circles, and on the right we have the questions tagged as Python, so it's done a pretty good job of dividing these Stack Overflow posts into two distinct categories. As a reminder, the clustering model never saw these two labels; all it had were the embeddings of the Stack Overflow posts, but it was still able to separate the data into two fairly distinct clusters. We've just added the labels back in to make the plot easier to read.

So far, we've talked a lot about how embeddings can help us find similar data points, but that also means we can use embeddings to identify points that are different, or outside of our data distribution. So, we'll now use embeddings to help us with anomaly detection, or outlier detection. To do this, we're going to use the IsolationForest class in scikit-learn, which returns an anomaly score for each sample in our dataset using the isolation forest algorithm. You don't need to know the details of how this algorithm works, just that it's an unsupervised learning algorithm that detects data anomalies.

All of the questions in our dataset are about programming, so we're going to add in a question about a very different topic: baking. Here's some input text. Let's say someone's asking, "I'm making cookies, but I don't remember the correct ingredient proportions, and I've been unable to find anything on the web." This is definitely pretty different from the questions we have in our dataset. Once we've defined this input text, we can embed it, and because it's just a single instance, we can call the get_embeddings function on our embedding model directly.
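Before we get to the outlier-detection code, here's a minimal sketch of the clustering and visualization step we just walked through, assuming the embeddings live in an array called question_embeddings. The plot styling is illustrative; the lesson's figure colors points by their original language tag rather than by cluster assignment.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Cluster only the first 1,000 embeddings (the Python- and HTML-tagged posts).
clustering_dataset = question_embeddings[:1000]

# Fit k-means with two clusters and grab the cluster assignment for each post.
kmeans = KMeans(n_clusters=2, random_state=0).fit(clustering_dataset)
kmeans_labels = kmeans.labels_

# Project the 768-dimensional embeddings down to 2D for plotting.
pca = PCA(n_components=2)
pca_vectors = pca.fit_transform(clustering_dataset)

# Color each point by its k-means cluster assignment.
plt.scatter(pca_vectors[:, 0], pca_vectors[:, 1], c=kmeans_labels, cmap="coolwarm", s=10)
plt.title("Stack Overflow posts: k-means clusters in PCA space")
plt.show()
```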
Now, we'll take this embedding and append it to our array of other embeddings for our Stack Overflow data. We'll also need to convert this data into an array, which will help us later with visualization and with running the isolation forest model. Let's take a look at this new embeddings array. The shape is now 2,001 by 768: before, we had 2,000 Stack Overflow questions, and now we have 2,001 because we've added this extra question about baking.

Before we can fit our isolation forest model, we need to do one more thing, which will just make it a little easier to visualize the results: we'll add a new row to our DataFrame that contains this baking question. We'll add in the input text; there's no output text for this particular question, because it wasn't a real Stack Overflow post; and we'll say that the category is baking. Let's add this to our DataFrame. Again, we're only doing this because it will help us visualize the results in just a little bit.

Now we're ready to create our isolation forest model using scikit-learn. Once we've created the model, we can fit and predict on our embeddings array, and the model will return -1 for outliers and 1 for inliers. Once it's been fit, we can take our Stack Overflow dataset and filter for all the rows that were predicted as -1. Let's see what the results look like. The last question here is our baking question, and it was, in fact, identified as an outlier; that's because it's pretty different from all the other examples in this dataset. You might notice that there are also some programming questions about the programming language R that were identified as outliers. If you're interested, you can go in and check out the input text and see why these particular posts might have been flagged. Maybe they were mislabeled and weren't actually about the programming language R, or maybe it was some other reason.

Finally, before we move on to the next application, we'll drop this baking question from our dataset, because we won't need it for the last application in this lesson. That's what this cell here does: we just remove it, and our Stack Overflow dataset is back to containing only programming questions, with 2,000 rows.

Now, for our final application in this lesson, we'll see how we can also use these embedding vectors as features for supervised learning. Embeddings take text as input and produce structured, numerical output that a machine can process, which means we can pass these vectors to any of our favorite supervised classification algorithms. In this lesson, we'll be using a random forest, but feel free to swap it out for another classifier in scikit-learn if there's one you prefer.

There are many different ways we could frame a classification problem around this dataset. We could try to predict whether a post mentions pandas, or we could re-query the data to get the score for each post and try to predict how many upvotes it had. But in this notebook, what we'll try is predicting the category of the post. To do that, we'll need a couple of other things from scikit-learn: we'll also import accuracy_score and the train_test_split utility.
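Before we build the classifier, here's a rough sketch of the outlier-detection steps described above. It assumes the model, so_df, and question_embeddings from earlier; the contamination value and the exact wording of the baking question are illustrative choices, not necessarily what the notebook uses.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Embed the off-topic baking question (a single instance, so no batching needed).
input_text = ("I am making cookies but don't remember the correct ingredient proportions "
              "and I have been unable to find anything on the web.")
baking_embedding = model.get_embeddings([input_text])[0].values

# Append it to the existing question embeddings: shape goes from (2000, 768) to (2001, 768).
embeddings_array = np.vstack([question_embeddings, baking_embedding])

# Add a matching row to the DataFrame so the flagged rows are easy to inspect.
# Columns: input_text, output_text, category.
so_df.loc[len(so_df)] = [input_text, None, "baking"]

# Fit the isolation forest; fit_predict returns -1 for outliers and 1 for inliers.
clf = IsolationForest(contamination=0.005, random_state=2)
preds = clf.fit_predict(embeddings_array)

# Show the rows flagged as outliers; the baking question should be among them.
print(so_df[preds == -1])

# Drop the baking question again before moving on to classification.
so_df = so_df.drop(so_df.index[-1]).reset_index(drop=True)
embeddings_array = embeddings_array[:-1]
```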
For this prediction task, we'll define an array called X, which will be our embeddings; this is just the embeddings data we created previously in this lesson. For the labels, we'll extract the categories for each of these posts by pulling the category column from our Stack Overflow DataFrame. Now we have our X and our y, but we need to do one more thing: shuffle the data and split it into training and testing sets. We'll use scikit-learn's train_test_split utility with an 80-20 split, which means our test data will be 20% of the original dataset.

Once we've done that, we have X_train, X_test, y_train, and y_test datasets, which means we're ready to fit our random forest classifier. We'll start by creating the classifier and setting the number of estimators to 200, but feel free to change this if you like. Once we've created the classifier, we can fit the model on our training embeddings and their corresponding categories. When the model has finished training, we can predict on some test data: we'll call predict and pass in our X_test dataset, which is just the embeddings in our test set. Finally, we can compute the accuracy score to see how well the model did. So, 0.70, not bad for such minimal preprocessing.
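If you want to reproduce this last step end to end, here's a minimal sketch, assuming the question_embeddings array and so_df DataFrame from earlier; the random_state value is an arbitrary choice for reproducibility.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Features are the embedding vectors; labels are the language category of each post.
X = question_embeddings            # shape (2000, 768)
y = so_df["category"].values       # "python", "html", "r", ...

# Shuffle the data and hold out 20% for testing (an 80-20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=2
)

# Fit a random forest with 200 trees on the training embeddings and their categories.
clf = RandomForestClassifier(n_estimators=200)
clf.fit(X_train, y_train)

# Predict the category for the held-out embeddings and measure accuracy.
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))   # around 0.70 in the lesson
```

So now you've seen a few different ways that we can apply embeddings: we can cluster them, use them for classification, or use them to detect data points outside of our data distribution. In the next tutorial, we're going to take a quick break from embeddings and talk a little bit about text generation. So, I will see you there.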