Python Lesson 30: MATPLOTLIB Histograms, Pie Charts, and Scatter Plots (MATPLOTLIB pt. 3)

Hello everybody,

Michael here, and today’s post will be on creating histograms, pie charts, and scatter plots in MATPLOTLIB (this is the third lesson in my MATPLOTLIB series).

For this lesson, we’ll be using this dataset:

This dataset contains information on the 125 highest-grossing movies in the US in 2021 (data from BoxOfficeMojo.com, a great source if you want to do data analyses on movies). Let’s read it into our IDE and learn about each of the variables in this dataset:

import pandas as pd
films2021 = pd.read_excel(r'C:/Users/mof39/OneDrive/Documents/2021 movie data.xlsx')
films2021.head()

Now, what does each of these variables mean? Let’s take a look:

  • Rank-In terms of overall US gross during theatrical run, where the movie ranks (from 1-125)
  • Movie-The name of the movie
  • Total Gross-The movie’s total US gross over the course of its theatrical run (as of January 25, 2022)
  • Screens played in (overall)-The number of US theaters the movie played in during its theatrical run
  • Opening Weekend Gross-The movie’s total opening weekend US gross
  • Opening Weekend % of Total Gross-The movie’s US opening weekend gross’s percentage of the total US gross
  • Opening Weekend Theaters-The number of US theaters the movie played in during its opening run
  • Release Date-The movie’s US release date
  • Distributor-The studio that distributed the movie
  • Rotten Tomatoes Score-The movie’s Rotten Tomatoes score, where 0 represents a 0% score and 1 represents a 100% score. For movies that had no Rotten Tomatoes score, a 0 was listed.
    • These are the critics’ scores, not the audience ratings (which, if you’ve read Rotten Tomatoes reviews, can vary widely from the critics’ scores).

Now, let’s get started with the visualization creations! First off, let’s explore creating histograms in MATPLOTLIB. For our first MATPLOTLIB histogram, let’s use the Rotten Tomatoes Score column to analyze the Rotten Tomatoes score distribution among the 125 movies on this list:

import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(10, 8))
plt.hist(films2021['Rotten Tomatoes Score'])
plt.ylabel('Frequency', size = 15)
plt.xlabel('Rotten Tomatoes Score', size=15)
plt.title('Rotten Tomatoes score distribution among 2021 movies', size=15)

First of all, remember that since we’re using MATPLOTLIB to create these visualizations, you’ll need to include the lines import matplotlib.pyplot as plt and %matplotlib inline in your code before you create the plot (the %matplotlib inline line only applies if you’re working in a Jupyter/IPython notebook).

Now, to create the histogram, I used five lines of code. The first line simply sets the figure size to 10×8 inches; this line is optional, and you can change the dimensions as you wish. The plt.hist() line takes one required parameter: the column you want to use for the histogram. Since a histogram shows the distribution of a single variable, you only need to pass in one column; in this case, I used the Rotten Tomatoes Score column. The next three lines set the text and size of the graph’s y-label, x-label, and title, respectively.

So, what conclusions can we draw from this graph? First of all, plt.hist() uses 10 equal-width intervals by default, which is why there are 10 bars in the graph. The intervals span the column’s minimum to maximum value, which here is roughly 0 to 1, so the Rotten Tomatoes score frequencies are being distributed in roughly 10% intervals (e.g. 0-10%, 10-20%, 20-30%, and so on). We can also conclude that most of the 125 movies in this dataset fall in either the 80-90% interval or the 90-100% interval, so critics seemed to enjoy most of the movies on this list (e.g. Spider-Man: No Way Home, Dune, Free Guy). On the other hand, there are very few movies on this list that critics didn’t enjoy: only 11 of the movies either had no critic score or had a score in the 0-10% or 10-20% intervals (e.g. The House Next Door: Meet The Blacks 2), and most of the 0s on this list are movies with no Rotten Tomatoes critic score at all.
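If you want to double-check where those intervals come from, here is a minimal sketch using a handful of made-up scores (not the actual movie data). It relies on NumPy's np.histogram(), which by default applies the same rule as plt.hist(): 10 equal-width bins stretching from the smallest value to the largest. The sample_scores list is purely an assumption for illustration.

```python
import numpy as np

# Made-up scores on the same 0-1 scale as the Rotten Tomatoes Score column
sample_scores = [0.0, 0.05, 0.35, 0.62, 0.81, 0.88, 0.93, 0.97, 1.0]

# With no bins argument, the data is split into 10 equal-width bins
# running from min(sample_scores) to max(sample_scores)
counts, edges = np.histogram(sample_scores)

print(counts)               # how many scores landed in each bin
print(np.round(edges, 2))   # the 11 edges that separate the 10 bins
```

Keep in mind that the edges depend on the data's minimum and maximum, so the neat 10% steps only appear when the values span the full 0-to-1 range.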

Now, the graph looks great, but what if you wanted fewer frequency intervals? In this case, let’s cut down the number of intervals from 10 to 5. Here’s the code to do so:

import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(10, 8))
plt.hist(films2021['Rotten Tomatoes Score'], bins=5)
plt.ylabel('Frequency', size = 15)
plt.xlabel('Rotten Tomatoes Score', size=15)
plt.title('Rotten Tomatoes score distribution among 2021 movies', size=15)

As you can see, our graph now only has 5 bars rather than 10. How did I manage to make this change? Pay attention to this line of code:

plt.hist(films2021['Rotten Tomatoes Score'], bins=5)

I still passed the same column into the plt.hist() function. However, I added the optional bins parameter, which allows me to customize the number of intervals in the histogram (I used five intervals in this case). Since there are only five intervals in this graph rather than 10, each interval covers a 20% score range (0-20%, 20-40%, 40-60%, 60-80%, 80-100%).

  • You can use as many intervals as you want for your histogram, but my suggestion is that you take a look at the maximum value of any column you want to use for your histogram and pick a bins value that divides evenly into that maximum value (in this case, I used 5 for the bins since 5 divides evenly into 100).
  • Speaking of maximum value, you’ll only want to use quantitative (numerical) values for your histogram, since a histogram works by grouping numbers into ranges and counting how many fall into each one.
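One thing worth knowing when picking a bins value: the intervals only line up with round numbers if the data itself spans a round range. Here is a small sketch, again with made-up scores, that compares the automatic edges with edges pinned down by the range parameter (plt.hist() accepts the same bins and range arguments as np.histogram(), which is used here so the edges are easy to print).

```python
import numpy as np

# Made-up scores whose minimum is 0.1 and maximum is 0.9 (not 0 and 1)
sample_scores = [0.1, 0.25, 0.4, 0.55, 0.7, 0.9]

# Five bins stretched between the data's own min and max
_, auto_edges = np.histogram(sample_scores, bins=5)

# Five bins forced to cover 0 to 1, giving clean 20% intervals
_, fixed_edges = np.histogram(sample_scores, bins=5, range=(0, 1))

print(np.round(auto_edges, 2))
print(np.round(fixed_edges, 2))
```

In the plotting code, the equivalent would be plt.hist(films2021['Rotten Tomatoes Score'], bins=5, range=(0, 1)), which should only be needed if the column doesn't already run from 0 to 1.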

Awesome work so far! Next up, let’s explore pie-charts in MATPLOTLIB. To start with pie-charts, let’s create one based off the Distributor column in the data-frame:

import matplotlib.pyplot as plt
%matplotlib inline

distributors = films2021['Distributor'].value_counts()
distributors = distributors[:6]

plt.figure(figsize=(10,8))
plt.title('Major Movie Distributors in 2021', size=15)
distributors.plot(kind='pie')

So, how did I manage to generate this nice looking pie chart? First of all, to create the pie chart, I wanted to get a count of how many times each distributor appears in the list, so I used PANDAS’ handy-dandy .value_counts() method and stored its results in the distributors variable. As for the distributors[:6] line of code, I included it because there were over 20 distributors on this list and I only wanted the top 6 (the 6 distributors that appear the most on this list) to create a neater-looking pie chart. This works because .value_counts() sorts its results from most frequent to least frequent.
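To see what those two lines do on their own, here is a minimal sketch with a tiny made-up Distributor column (the letters A through D are placeholders, not real distributors).

```python
import pandas as pd

# Tiny stand-in for films2021['Distributor']; the values are placeholders
sample = pd.DataFrame({'Distributor': ['A', 'B', 'A', 'C', 'A', 'B', 'D']})

# Count how often each value appears, sorted from most to least frequent
counts = sample['Distributor'].value_counts()

# Keep only the first two rows, i.e. the two most frequent values
top2 = counts[:2]

print(type(counts).__name__)
print(top2)
```

Writing counts.head(2) would select the same rows as counts[:2], if you find that easier to read.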

You’ll recognize the plt.figure() and plt.title() lines from the histogram example, as their functionalities are to set the figure size of the graph and the graph’s title, respectively. However, pay attention to the distributors.plot(kind='pie') line. When you store value counts in a variable (as I did with distributors), you get a PANDAS Series rather than a data-frame, and the simplest way to plot it is with its own .plot() method, using the syntax series.plot(kind='kind of graph you want to create'). And yes, remember to pass in the value for kind as a string. (Calling plt.pie() directly also works, but then you have to pass in the slice labels yourself.)

So, what can we infer from this pie-chart? For one thing, Warner Bros. had the most top-grossing US movies of 2021, with 18 movies on this list coming from Warner Bros. (Dune, Space Jam: A New Legacy, Godzilla vs. Kong). Surprisingly, there are only 7 Disney movies on this list (well, 14 if you count the 7 Searchlight Pictures films, since Searchlight Pictures has been a subsidiary of Disney since March 2019). Even more surprising? Warner Bros. released all of their 2021 films on a day-and-date model, meaning that all of their 2021 films were released in theaters AND on their streaming service HBO Max, so I’m surprised that they (not Disney) have the most movies on this list.

OK, so our pie chart looks good so far. But what if you wanted to add in the percentages along with the corresponding values (values referring to the number of times a distributor’s name appears in the dataset)? Change this line of code from the previous example:

distributors.plot(kind='pie', autopct=lambda p : '{:.0f}%  ({:,.0f})'.format(p,p * sum(distributors)/100))

In the .plot() function, I added an extra parameter: autopct. What does autopct do? Well, I could say this parameter displays the percentage of the time each distributor appears in the list, but that’s oversimplifying it. Granted, all percentages are displayed alongside their corresponding values (e.g. the Lionsgate slice shows 12% alongside the 7 label, indicating that Lionsgate appears 7 times, or 12% of the time, in distributors). However, this is accomplished with the help of a handy-dandy lambda function (for a refresher on lambda functions, refer to this lesson: Python Lesson 12: Lambdas & List Comprehension). MATPLOTLIB hands each slice’s percentage to the lambda as p, and the lambda converts that percentage back into a count (p times the total count, divided by 100) and returns a label showing both the percentage and the count for that slice.
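If the lambda looks cryptic, here is a rough sketch of the same formatting logic written as a regular function and run on made-up counts (the numbers in sample_counts are placeholders, not the real distributor totals). The loop simply mimics what happens for each slice: a percentage goes in, and a label string comes out.

```python
# Made-up slice counts standing in for the distributors values
sample_counts = [18, 12, 10, 8, 7, 5]
total = sum(sample_counts)

def slice_label(pct):
    """Turn a slice's percentage into a 'percent  (count)' label."""
    # pct * total / 100 converts the percentage back into the original count
    return '{:.0f}%  ({:,.0f})'.format(pct, pct * total / 100)

# Mimic the per-slice call: each slice's percentage is passed to the function
for count in sample_counts:
    pct = 100 * count / total
    print(slice_label(pct))
```

The {:.0f} format rounds to a whole number, which is why a slice that is really 11.67% of the pie gets displayed as 12%.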

Awesome work so far! Now, last but not least, let’s create a scatterplot using the Total Gross and Screens played in (overall) columns to analyze the relationship between a movie’s total US gross and how many US theaters it played in during its run:

import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(10,8))
plt.title('Screens played in', size=15)
plt.xlabel('Total screens played in during theatrical run', size=15)
plt.ylabel('Total US gross (in hundreds of millions of dollars)', size=15)
plt.scatter(films2021['Screens played in (overall)'], films2021['Total Gross'])
  • I could only get part of the scatter plot since the output was too big to be displayed without needing to scroll down.

So, how did I manage to generate this output? First of all, as I’ve done with every MATPLOTLIB visual I’ve created in this post, I included the .figure(), .title(), .xlabel(), and .ylabel() functions to help with the plotting of this graph. To actually generate and plot the scatterplot, I used the .scatter() function and passed in two parameters: the x-axis values (Screens played in (overall)) and the y-axis values (Total Gross).

So, what can we conclude from this scatterplot? It appears that the more screens a movie played on during its theatrical run, the higher its total gross. However, this trend isn’t noticeable for movies that played on fewer than 2,000 screens nationwide (namely the foreign films and limited-release films). Oh, and in case you’re wondering, there is one point in the scatterplot that you can’t see, which corresponds to Spider-Man: No Way Home (which still has a handful of showings left at my local movie theater as of February 3, 2022). It’s not surprising that the Spider-Man: No Way Home point is all the way at the top, since it grossed approximately $677 million in the US during its (still-ongoing) theatrical run. Just for perspective, the #2 ranked movie on this list, Shang-Chi and the Legend of the Ten Rings, grossed approximately $224 million during its theatrical run (and played on just 36 fewer screens than Spider-Man: No Way Home). The highest grossing non-MCU (Marvel Cinematic Universe for those unaware) movie, F9: The Fast Saga (ranked at #5), grossed approximately $173 million in comparison.
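If you’d like a number to go along with the visual impression, one option (not something I did above) is to compute the correlation between the two columns with PANDAS’ .corr() method. Here is a minimal sketch on made-up values; the figures in sample are placeholders and are not taken from the movie dataset.

```python
import pandas as pd

# Made-up screens/gross pairs, just to show the call
sample = pd.DataFrame({
    'Screens played in (overall)': [500, 1200, 2500, 3600, 4300],
    'Total Gross': [1_000_000, 3_000_000, 40_000_000, 120_000_000, 220_000_000],
})

# Pearson correlation: close to 1 means the two columns rise together
r = sample['Screens played in (overall)'].corr(sample['Total Gross'])
print(round(r, 2))
```

On the real data, the same call would be films2021['Screens played in (overall)'].corr(films2021['Total Gross']). Just remember that a single extreme point like Spider-Man: No Way Home can pull this number around quite a bit.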

Thanks for reading,

Michael
