Matplotlib: Quick and pretty (enough) to get you started.

Dorjey Sherpa
6 min read · Oct 7, 2020

Python is so popular among data scientists largely because of its rich ecosystem of libraries.

In data science, effective data visualizations are key to communicating your findings. After a series of data cleaning and data analysis steps, you have to communicate what you found, and that is usually done through visual aids: graphs and charts.

“Visualizing information can give us a very quick solution to problems. We can get clarity or the answer to a simple problem very quickly.” — David McCandless

One of those many libraries is Matplotlib, a plotting tool for creating data visualizations. Other data visualization tools, such as Seaborn and the pandas DataFrame plot method, are built upon Matplotlib. In this article, I will strictly illustrate static plots, which I believe Matplotlib is best for. It’s quick, easy and intuitive.

If you are new to data science or even well versed in Python, I would highly recommend reading the documentation on Matplotlib, especially if you are reading this to learn data visualizations.

First you would import Matplotlib with pyplot (a Matplotlib module).

#imports
import pandas as pd
import matplotlib.pyplot as plt

Pandas will help import your data and make it readable.

I found this article to be very thorough and easy to understand for beginners. I would especially recommend reading it if you cannot follow the code.

When you import a library as something else, you are creating a shortcut for calling that library. pd and plt are the conventional aliases for pandas and matplotlib.pyplot. If you want, you can name them anything, as long as it is understandable to you and the people viewing your code.
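
To convince yourself that an alias is only a second name for the same module, here is a tiny sketch. The alias my_plots is made up for illustration; plt remains the convention you will see everywhere.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere; not needed in a notebook

import matplotlib.pyplot as plt
import matplotlib.pyplot as my_plots  # hypothetical alias, only for this illustration

# both names point at the very same module object
same_module = plt is my_plots
print(same_module)
```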

You would then load your data in the correct format. For the graphs below, I used the data provided by Flatiron School for a project. It consists of data from the Internet Movie Database, Box Office Mojo and The Movie Database.

# Internet Movie Database (IMDb)
title_akas = pd.read_csv('../zippeddata/imdb.title.akas.csv.gz')
# Box Office Mojo database
movie_db = pd.read_csv('../zippeddata/movie_db.csv')

Once you import your data and do some exploratory analysis, you will want to present your findings as graphs and visual aids to show the relationship in the data.

# getting the top 10 regions with the highest number of movies released
movies_released_region = title_akas.region.value_counts()[:10]
# plotting the top 10 regions with the highest number of movies released
movies_released_region.plot(kind="bar", figsize=(10, 6))
Figure 1

As you can see, it is very easy to get a visual aid to show your findings. In this case we used the pandas .plot method. However, a lot of things are missing, and the graph does not tell us what the x-axis and y-axis represent. So we need to add a title and axis labels, and rotate the region labels to make them more legible.
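
If it helps to see what the pandas .plot method hands back, here is a minimal sketch. The region codes are made up to stand in for title_akas.region, so the counts mean nothing; the point is only that .plot returns a regular Matplotlib Axes, which is why Matplotlib calls can keep working on it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere; not needed in a notebook

import matplotlib.axes
import pandas as pd

# made-up region codes standing in for title_akas.region
regions = pd.Series(["US", "US", "GB", "FR", "US", "GB"])

# count how often each region appears and keep the top 2
counts = regions.value_counts()[:2]

# pandas draws the bars with Matplotlib and returns the Axes it drew on
ax = counts.plot(kind="bar", figsize=(10, 6))

is_axes = isinstance(ax, matplotlib.axes.Axes)
n_bars = len(ax.patches)  # one rectangle per bar
```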

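# text style shared by the title and labels (font was not defined in the snippet; these values are assumed)
font = {"fontsize": 14, "fontweight": "bold"}
movies_released_region.plot(kind="bar", figsize=(10, 6))
plt.title("Movies Released by Top 10 Countries", fontdict=font)
plt.xlabel("Countries", fontdict=font)
plt.ylabel("Number of Movies Released", fontdict=font)
plt.xticks(rotation=0);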
Figure 2

And BAM! This figure is easier to read and understand than the previous one. To make it more legible, we called plt (Matplotlib) to add the title and axis labels and even rotate the region names.

This is one easy way to see what your data is saying and to understand the relationships in it.
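
Why do plt calls change a graph that pandas drew? As far as I understand it, plt functions act on the "current" axes, and right after a pandas plot that is the axes pandas just drew on. Here is a small sketch with made-up numbers:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere; not needed in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# made-up counts, only for illustration
counts = pd.Series({"US": 3, "GB": 2})

# pandas draws the bars and returns the Axes
ax = counts.plot(kind="bar")

# these plt calls act on the current axes, which is the one pandas just drew on
plt.title("Movies Released by Region")
plt.xlabel("Regions")

title_on_axes = ax.get_title()
xlabel_on_axes = ax.get_xlabel()
```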

Let’s look at another example.

Figure 3

Figure 3 was created using Matplotlib’s figure and axes approach, through the built-in plt.subplots() function. Here is the code:

# defining what studio_df is: the 25 studios with the highest summed ROI
# (selecting several columns after groupby needs a list, hence the double brackets)
studio_df = (
    movie_db.groupby("studio")[["roi_percentage", "total_gross", "total_profit"]]
    .sum()
    .sort_values(by="roi_percentage", ascending=False)[:25]
)
# setting the "playground"
fig, ax = plt.subplots(figsize=(18, 9))
x = studio_df.index
y = studio_df["roi_percentage"]
# "Prettify-ing" and making the graph readable
# assigning xlabel, ylabel and title and rotating xticklabels to make it more readable
ax.bar(x, y, linewidth=1)
ax.set_title("Top 25 Studios with highest ROI from 2010-2018 by Studio", fontdict=font)
ax.set_ylabel("ROI Percentage", fontdict=font)
ax.set_xlabel("Studios", fontdict=font)
plt.xticks(rotation=45);

One might think that this looks very similar to the graph above… so what’s the point of this method? To understand the importance, we need to first understand what figure and axes are.

Figure 4

The best way for me to understand this was to think of figure and axes as a children’s playground. The figure is the fence around the playground.

The axes are the playground where you will plot all your graphs. You can then define how many playgrounds you want by calling the subplots method.
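
To make the fence-and-playground picture concrete, here is a minimal sketch with made-up numbers. It only checks that the figure knows about its axes and the axes know which figure they belong to.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere; not needed in a notebook

import matplotlib.pyplot as plt

# one fence (figure) with one playground (axes) inside it
fig, ax = plt.subplots(figsize=(6, 4))

# the plotting itself happens on the axes, not on the figure
ax.plot([1, 2, 3], [2, 4, 8])

axes_in_figure = fig.axes           # list of all playgrounds inside the fence
figure_of_axes = ax.figure is fig   # the playground knows its fence
```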

matplotlib.pyplot.subplots() takes many arguments, and I would highly suggest visiting the documentation. The main ones are nrows and ncols. The “n” stands for “number of”, so we are defining the number of rows and the number of columns. Below I have stated that I want 1 row and 2 columns, along with the figure size.

fig, ax = plt.subplots(1,2, figsize = (10,8))
Figure 5

In Figure 3, I used the figure and axes method but did not define how many rows and columns I wanted. I left them blank, so Matplotlib fell back on its defaults (1 row and 1 column) and gave me a single axes to plot 1 graph. But as defined in the code for Figure 5, we have 2 playgrounds where you can plot two different graphs.
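
One detail worth checking for yourself is what plt.subplots() returns in each case. In this small sketch (the variable names are my own), the default call gives back a single Axes, while asking for more than one playground gives back a NumPy array of Axes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere; not needed in a notebook

import matplotlib.pyplot as plt
from matplotlib.axes import Axes

# nothing specified: 1 row, 1 column, so a single Axes comes back
fig_default, ax_default = plt.subplots()

# 1 row, 2 columns: a 1-D array holding 2 Axes
fig_pair, ax_pair = plt.subplots(1, 2, figsize=(10, 8))

# 2 rows, 2 columns: a 2-D array holding 4 Axes
fig_grid, ax_grid = plt.subplots(2, 2, figsize=(10, 8))

default_is_single = isinstance(ax_default, Axes)
pair_shape = ax_pair.shape
grid_shape = ax_grid.shape
```

This is why the single-graph code can call ax.bar(...) directly, while the side-by-side code has to pick a playground first with ax[0] or ax[1].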

But when would you use it?

You would use subplots when it makes sense to have two graphs side by side so it’s easier to compare two different things.

For example, we know that certain movie studios make a high profit, but what about their return on investment? Are the studios that make a high profit also the ones that make a high return on investment? To illustrate this, we deploy subplots!

# defining what studio_df is: the 5 studios with the highest total profit
# roi_percentage is included so it can be plotted in the second graph
studio_df = (
    movie_db.groupby("studio")[["domestic_gross", "foreign_gross", "total_gross", "total_profit", "roi_percentage"]]
    .sum()
    .sort_values(by="total_profit", ascending=False)[:5]
)
# setting the "playground"
fig, ax = plt.subplots(1, 2, figsize=(20, 9))
# assigning what the x and y values are for the first graph
x1 = studio_df.index
y1 = studio_df["total_profit"]
# "Prettify-ing" and making the graph readable
# assigning xlabel, ylabel and title and rotating the x tick labels to make it more readable
ax[0].bar(x1, y1, linewidth=1)
ax[0].set_title("Total Profit from 2010-2018 by Studio", fontdict=font)
ax[0].set_ylabel("USD in Millions", fontdict=font)
ax[0].set_xlabel("Studios", fontdict=font)
ax[0].tick_params(axis="x", rotation=90)

# assigning what the x and y values are for the second graph
x2 = studio_df.index
y2 = studio_df["roi_percentage"]
# "Prettify-ing" and making the graph readable
# assigning xlabel, ylabel and title and rotating the x tick labels to make it more readable
ax[1].bar(x2, y2, linewidth=1)
ax[1].set_title("ROI of the Top 5 Studios by Profit from 2010-2018", fontdict=font)
ax[1].set_ylabel("ROI Percentage", fontdict=font)
ax[1].set_xlabel("Studios", fontdict=font)
ax[1].tick_params(axis="x", rotation=45);
Figure 6

The ax[0] and ax[1] indices let Matplotlib know which graph each call goes to. To plot on multiple “playgrounds”, or subplots, arranged in rows and columns, it would look something like this:

fig, ax = plt.subplots(2,2, figsize = (10,8))
ax[0,0].set_title("I start here") #row 0 column 0
ax[0,1].set_title("I am here") #row 0 column 1
ax[1,0].set_title("Now I am here") #row 1 column 0
ax[1,1].set_title("I stop here") #row 1 column 1
Figure 7

ax[row, column] is how you tell Matplotlib and Python where the information goes, with both positions counted from 0. In the example above, I only added the title, but the same indexing applies when actually plotting a graph, labeling the axes, adding colors, etc.
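
Here is a rough sketch of that idea with actual graphs instead of just titles. The numbers are made up; the only things that matter are the ax[row, column] indexing and the loop over ax.flat, which visits every playground in turn.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere; not needed in a notebook

import matplotlib.pyplot as plt

# made-up values, only for illustration
x = [1, 2, 3, 4]

fig, ax = plt.subplots(2, 2, figsize=(10, 8))

# each playground is addressed as ax[row, column]
ax[0, 0].plot(x, [1, 2, 3, 4], color="tab:blue")
ax[0, 0].set_title("Line")

ax[0, 1].bar(x, [4, 3, 2, 1], color="tab:orange")
ax[0, 1].set_title("Bar")

ax[1, 0].scatter(x, [2, 1, 4, 3], color="tab:green")
ax[1, 0].set_title("Scatter")

ax[1, 1].barh(x, [1, 3, 2, 4], color="tab:red")
ax[1, 1].set_title("Horizontal bar")

# ax.flat walks through all four axes, handy for settings they share
for single_ax in ax.flat:
    single_ax.set_xlabel("x")

# keep titles and labels from overlapping
fig.tight_layout()
```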

I hope this was helpful in understanding Matplotlib. Once you master or feel 95% comfortable with the logic of Matplotlib, you can head over to Seaborn, where you can make your graphs prettier. However, I should warn you before you head to Seaborn: Matplotlib is much more flexible when it comes to customizability. Additionally, don’t forget that Seaborn is based on Matplotlib. So check out the documentation for Matplotlib and all its unused potential.
