Welcome to the short course on pandas for data processing. Pandas is a powerful tool for machine learning and data science. In this short course, we'll go over the basics of pandas so that you have a sense of when it might be useful for your machine learning or data science project. Let's jump in.
In this short course, you'll learn how to use pandas to explore your data set, to filter the data to find a subset that's more interesting to you, to visualize the data, to compute summaries, and to carry out data cleaning and data preparation, an important step for many machine learning and data science applications. You'll also see how indexing works in pandas.
Let's start by looking at how you can use the pandas library to explore a data set. In particular, in the next few minutes, you'll learn how to load data from a file, view different rows and columns of your data set, and have pandas show you a few summaries of your data. The running example I'm going to use is a data set that includes the titles, locations, and so on of a group of machine learning engineers and data scientists, as well as their salaries. In pandas, you store this type of data in a data structure called a data frame. Here's a small sample of five examples. You'll notice that this data set has a few columns. For example, salaries are stored in one column. It also has, in this case, five rows. For example, this person's data is captured in this row. This data was provided by Workhelix in partnership with Revelio Labs. I'm also grateful to Daniel Rock for helping us get access to this data.
So let's look at how to load and explore the data in code. In order to use pandas, the first thing you have to do is import the pandas library, and the standard command for that is import pandas as pd. Then, to load the data set, this is the command to load a CSV file, where CSV stands for comma-separated values. It loads the CSV file into a data frame, which I'm calling df. So let's just run that.
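The transcript does not give the name of the CSV file, so here is a minimal sketch of the loading step under stated assumptions: the CSV text, the column names location and the specific rows are hypothetical stand-ins, and the data is read from an in-memory string rather than a file on disk. With a real file, you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the course's CSV file. The real file name and
# values are not given in the transcript; these rows are made up.
csv_text = """title,location,salary
Data Scientist,New York,120000
Machine Learning Engineer,San Francisco,150000
Data Scientist,Austin,110000
Machine Learning Engineer,Seattle,145000
Data Analyst,Chicago,85000
"""

# pd.read_csv accepts a file path or any file-like object, so an
# in-memory buffer works the same way a file on disk would.
df = pd.read_csv(io.StringIO(csv_text))

# Printing the data frame shows the rows, the columns, and the index
# (0, 1, 2, ...) on the left.
print(df)
```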
In order to see what's in the data frame df, we can use the head command. So df.head(6) will show you the first six rows of this data frame. In fact, if you don't specify any parameter, head defaults to showing you the first five rows of the data frame. Here, head means you're looking at the examples at the very top of the table. There's also the tail method, which lets you look at the examples at the bottom of the data set. Similarly, you can say how many examples you want to see from the bottom of the table.
If you want to know the size of your data set, you can use df.shape. Notice that it is df.shape, and I don't add parentheses here, because this is an attribute. It is just a piece of data stored in df, not a function or method. If you want to get a quick summary of what's in your data set, use df.describe(). For the numeric columns of the data set, it tells you how many values there are, the mean value, the standard deviation, and so on. So in this data set, it looks like the mean salary is this, the standard deviation is this, and then you have the minimum, the 25th percentile, and so on. df.info() also shows useful information about the data frame df. It tells you, for each of the columns, how many non-null values there are. In this case, all of the values are non-null. We'll see later what to do about missing values. It also tells you the types of the different columns. We can also look at df.columns to see the names of the different columns. Again, this is an attribute, not a function, so there are no parentheses after it.
Now, when you're examining data, one thing you often want to do is look at a subset of the columns. The syntax is df, square brackets, and then I'm going to pass it a Python list of the names of the columns I want to pull out. So I want to look at just the titles and salaries.
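The inspection commands described so far can be sketched as follows. This is an illustration under stated assumptions, not the course notebook: the eight rows are made up, and only the column names title and salary come from the lesson. The last statement is the two-column selection just described.

```python
import pandas as pd

# Hypothetical data; only the "title" and "salary" column names are
# taken from the lesson.
df = pd.DataFrame({
    "title": ["Data Scientist", "ML Engineer", "Data Analyst", "ML Engineer",
              "Data Scientist", "Data Analyst", "ML Engineer", "Data Scientist"],
    "location": ["New York", "Seattle", "Chicago", "Austin",
                 "Boston", "Denver", "San Francisco", "Atlanta"],
    "salary": [120000, 150000, 85000, 140000, 115000, 80000, 160000, 125000],
})

print(df.head(6))     # first six rows
print(df.head())      # no argument: first five rows
print(df.tail(3))     # last three rows

print(df.shape)       # attribute, no parentheses: (rows, columns)
print(df.describe())  # count, mean, std, min, quartiles, max for numeric columns
df.info()             # prints non-null counts and dtypes per column
print(df.columns)     # attribute, no parentheses: the column names

# Outer brackets select from df; the inner brackets are a Python list
# of the column names to keep.
subset = df[["title", "salary"]]
print(subset)
```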
Then I do this, and it prints out this new data frame with just the title column and the salary column. Notice that here on the left, there are these numbers 0, 1, 2, 3, 4, and so on. This is an index into the data set, and we'll talk more about indices later in this course. But for now, I hope you take away that this syntax lets you pick one or more columns to look at. So if you just want to look at the titles, you can type this, and you get out a new data frame with the titles.
Just a heads up: if you look at other people's code, you may see that when they pull out just one column, they leave out the inner square brackets. That works, but it returns a different data structure, a Series, that looks like this. So if you want to get back a data frame, include those inner square brackets, so that title is passed as a Python list of just one element. This gets you a nice data frame like this. In the command that you just saw, the outer square brackets are the pandas bracket notation on the data frame df, and the inner square brackets create a list with two elements, a string that says title and a string that says salary. This tells pandas to pull out the two columns with the data for title and salary and return them as a data frame.
Finally, one nifty command I often use is to look at a sample of the data. So df.sample(5) will pick five examples at random and show you what they are. You don't always want to look at the first five or the last five examples. This will show you a different randomly chosen sample every time, which helps if you want to get a sense of your data set. If you want to fix the random number seed, you can also add this. This will give you the exact same five examples every time, no matter how many times you run it. For machine learning applications where the data may not already be in random order, I find this gives you a more representative look at the data than always looking at the head or the tail of your data set.
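Here is a minimal sketch of the single-column selection and the sampling step, again with made-up rows. The transcript does not show how the seed is fixed on screen. In pandas this is done with the random_state parameter of sample, and the value 42 below is an arbitrary choice.

```python
import pandas as pd

# Hypothetical data; only the "title" and "salary" column names are
# taken from the lesson.
df = pd.DataFrame({
    "title": ["Data Scientist", "ML Engineer", "Data Analyst", "ML Engineer",
              "Data Scientist", "Data Analyst", "ML Engineer", "Data Scientist"],
    "salary": [120000, 150000, 85000, 140000, 115000, 80000, 160000, 125000],
})

# A list with one column name returns a DataFrame with one column.
titles_df = df[["title"]]
# A bare column name returns a Series instead.
titles_series = df["title"]
print(type(titles_df).__name__)
print(type(titles_series).__name__)

# Five rows chosen at random; the rows differ from run to run.
print(df.sample(5))

# Fixing the seed with random_state makes the sample repeatable.
first = df.sample(5, random_state=42)
second = df.sample(5, random_state=42)
print(first)
print(first.equals(second))
```

The index values printed on the left of each sample are the original row labels, so they show which rows were picked.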
You'll notice that the indices here also tell you which five examples the sample method has chosen to show you.
Just to summarize, in this lesson, we learned how to read a data set from a CSV file, how to use head, tail, shape, describe, info, and columns, how to extract one or more columns, and how to use the sample method. Just to note again, shape and columns are attributes. They're part of the data structure for the data frame df, which is why there are no parentheses after shape and columns. So you now know how to load a data set and browse and explore it. In the next lesson, let's take a look at how you can filter a data set.