P is for Pandas
Pandas is a very popular data analysis library for python. It’s an invaluable tool for transforming and munging data. Some of my favourite features include:
- It can read data from a ton of formats including csv, json, and database tables
- There are lots of convenience features built in such as easy plotting, filling in missing data, dropping duplicates, filtering data etc
- It’s compatible with a lot of other python data science libraries (e.g. numpy, scikit-learn, etc)
It makes it really easy to read and explore your data since you can read, filter, and plot your data in just a few lines of code. This blog is not going to be a comprehensive tutorial (there are actual tutorials linked in the Other Resources section below). It’s intended to give you a brief idea of the kinds of things that pandas can do (to save you from rewriting code to do that yourself). The documentation is quite good and will give you an idea of the types of analysis you should use pandas for (hint: most of them).
The main data types in pandas are series and dataframes. You can think of a dataframe as a spreadsheet or a table in a database. Dataframes are made up of series (just think of a series as a column in a spreadsheet/table).
Using pandas
Installing and importing pandas
If you’re already familiar with installing python packages use conda install pandas
or pip install pandas
. For more in depth instructions there is a guide here. Once you have it installed you can import it using
import pandas as pd
People conventionally rename pandas to pd (e.g. pd.DataFrame instead of pandas.DataFrame) since you will end up typing pandas/pd a lot.
Here I’m going to use the movie data used in Brandon Rhode’s “Pandas From the Ground Up” tutorial. This includes CSV (comma separated value) files with movie titles and release dates as well as casting. After we’ve downloaded the data we want to load it into a dataframe. We can do that using
titles = pd.read_csv('titles.csv')
If we want to look at the first 10 rows of the dataframe we can use the head
method
print(titles.head(10))
which will output the first N rows (in this case 10). This is an easy way to see what the data looks like and if the data is well formatted.
title | year | |
---|---|---|
0 | Lusty Neighbors | 1970 |
1 | The Adventures of Priscilla, Queen of the Desert | 1994 |
2 | Captain Sindbad | 1963 |
3 | Cold Dark Mirror | 2015 |
4 | Ostatnia Rawa Ryska Riedla | 1997 |
5 | Forever and Ever | 2018 |
6 | Western Conviction | 2018 |
7 | 5 Estrellas | 2018 |
8 | Ropewalk | 2000 |
9 | Sally in Our Alley | 1931 |
Selecting and filtering data
We can easily select rows in our dataframe. For example if we wanted to find movies released in 1991
titles[titles['year'] == 1991]
title | year | |
---|---|---|
71 | Le voleur d’enfants | 1991 |
132 | Madreseye piremardha | 1991 |
196 | Pyat pokhishchennykh monakhov | 1991 |
217 | Canh bac | 1991 |
509 | Itakwil man ako ng langit | 1991 |
Let’s break down the above statement a bit. We use titles[‘year’]
to select the year column in the data frame. titles[‘year’] == 1991
returns a series of booleans (True if the year is 1991 and False otherwise). Finally, titles[titles['year'] == 1991]
says “give me the rows from titles where the condition is True”. We don’t just have to look for rows with exact matches. Let’s search for movies containing “The Hobbit” in the title
titles[titles['title'].str.contains('The Hobbit')]
title | year | |
---|---|---|
121727 | The Hobbit: The Battle of the Five Armies | 2014 |
146926 | The Hobbit: An Unexpected Journey | 2012 |
166752 | The Hobbit: The Desolation of Smaug | 2013 |
179646 | The Hobbit: The Swedolation of Smaug | 2014 |
Counting values
Let’s say we want to count how many movies were released per year. We can do this using
titles['year'].value_counts()
which gives a sorted list from highest count to lowest. In our data set we see that 2017 released the most movies followed by 2016. I suspect that this dataset was compiled in 2017 and that 2018 had more movies released in it than the previous year.
year | |
---|---|
2017 | 9888 |
2016 | 8198 |
2015 | 7564 |
2014 | 7159 |
2013 | 6896 |
The value_counts
function also has a normalize parameter which divides the counts by the total number of rows.
titles['year'].value_counts(normalize=True)
year | |
---|---|
2017 | 0.0437497 |
2016 | 0.0362722 |
2015 | 0.0334671 |
2014 | 0.0316752 |
2013 | 0.0305115 |
Here we can see that about 4% of all movies ever released were released in 2017.
Plotting data
If we want to plot the number of movies released per year over time how would we do that? In pandas we can just do
counts_per_year = titles['year'].value_counts()
# This just sorts it so that the years will be in order
counts_per_year = counts_per_year.sort_index()
counts_per_year.plot()
It looks like the number of movies being released has been increasing over time. There are also movies in the dataset that are scheduled for release (including one in 2115!) which explains the sharp dropoff. The .plot()
methods that wrap matplotlib so you can customize your plots as much as you want.
Grouping and merging data together
A common (but slightly more advanced) use case for pandas is grouping data together. Imagine we want to see which actors/actresses mostly played leading roles in their career (who appeared in at least 50 movies). To do so we will first load in the cast dataset
cast = pd.read_csv('cast.csv')
print(cast.head())
title | year | name | type | character | n | |
---|---|---|---|---|---|---|
0 | Closet Monster | 2015 | Buffy #1 | actor | Buffy 4 | 31 |
1 | Suuri illusioni | 1985 | Homo $ | actor | Guests | 22 |
2 | Battle of the Sexes | 2017 | $hutter | actor | Bobby Riggs Fan | 10 |
3 | Secret in Their Eyes | 2015 | $hutter | actor | 2002 Dodger Fan | nan |
4 | Steve Jobs | 2015 | $hutter | actor | 1988 Opera House Patron | nan |
Here we have the title of the movie, the year it was released, the name of the actor/actress, if they were an actor/actress, the character name, and the listing in the credits (n). A listing number of 1 means they played a leading role and higher numbers usually correspond to extras. Some of the values of n are “nan”, which stands for “not a number”. We want to drop those rows to only get listed roles.
listed_roles = cast.dropna(subset=['n'])
Now we want to find how many movies easy actor appeared in
num_roles = cast.groupby('name').size()
name | 0 |
---|---|
James Millican | 67 |
Ei Kimura | 1 |
Brian Torpe | 1 |
Steve Edis | 1 |
Carlos Esteban Fonseca | 3 |
Note that this is roughly equivalent to the value_counts
method. Now we want to select the actors who appeared in at least 50 movies
in_lots_of_roles = num_roles[num_roles>50].reset_index(name='total_roles')
name | total_roles | |
---|---|---|
0 | A. Bromley Davenport | 52 |
1 | A.K. Hangal | 87 |
2 | Abdur Razzak | 63 |
3 | Abhi Bhattacharya | 63 |
4 | Abhishek Bachchan | 52 |
Now we need to get all the roles each of those actors played. We can do this by merging this dataset with the listed_roles
dataframe.
merged = pd.merge(in_lots_of_roles, listed_roles)
name | total_roles | title | year | type | character | n | |
---|---|---|---|---|---|---|---|
0 | A. Bromley Davenport | 52 | A Maid of the Silver Sea | 1922 | actor | Old Tom Hamon | 3 |
1 | A. Bromley Davenport | 52 | A Sister to Assist ‘Er | 1927 | actor | Jim Harris | 4 |
2 | A. Bromley Davenport | 52 | Bonnie Prince Charlie | 1923 | actor | Sir John Cope | 11 |
3 | A. Bromley Davenport | 52 | Boy Woodburn | 1922 | actor | Matt Woodburn | 3 |
4 | A. Bromley Davenport | 52 | Captivation | 1931 | actor | Colonel Jordan | 6 |
Finally, we can see which actors played mostly leading roles in their career. We will group by the actors name, then take the average of their listing number. The closer to 1.0 the number is the more leading roles they played. We will sort the values to see the actors closest to 1.0 at the top.
merged.groupby('name')['n'].mean().sort_values()
name | n |
---|---|
William S. Hart | 1 |
Yilmaz Güney | 1.01724 |
Mary Pickford | 1.03509 |
Tom Mix | 1.06186 |
Mary Miles Minter | 1.09804 |
It appears that William S. Hart had a leading role in every film that he appeared in.
Summary
Pandas is an extremely powerful python library for doing data analysis. There can be a bit of a learning curve but it is worth spending some time exploring the library. Once you can perform some basic tasks (sorting, filtering, groupbys, plotting) then you will become a data analysis wizard!