May 2023 - Michael's Programming Bytes

R Lesson 29: An Integral Part of R Calculus

Hello everybody,

Michael here, and in today’s post, I’ll be discussing an integral part of R calculus: integrals (see what I did there?).

The integral facts about integrals

What are integrals, exactly? Well, we did spend the last two posts discussing derivatives, which are metrics used to measure the rate at which one quantity changes with respect to another quantity (e.g. the change in Rotten Tomatoes critic scores from one MCU movie to the next, as we discussed in R Lesson 27: Introductory R Calculus). Integrals, on the other hand, measure how much of a certain quantity accumulates over time.

Simple enough, right? Well, is it possible to think of integrals as reverse derivatives? Yes it is! I did mention that derivatives measure the change of a given quantity from point A to point B, while integrals measure the accumulation of a quantity over a given time period (which could be from point A all the way to point XFD1048576). In the context of mathematical functions, derivatives break up a function into smaller pieces to measure the rate of change at each point, while integrals put the pieces back together to measure the total accumulated across a whole range of the function (a range which can even stretch to infinity).

Calculating integrals, the manual way, part 1

Before we dive into R calculations of integrals, let’s first see how to calculate integrals, the manual way, by hand.

Let’s take this polynomial as an example: 3x^3-2x^2+4x-7

Now, how would we calculate the integral of this polynomial? Take a look at the illustration below:

Just so you know, the polynomial in green is the integral of the original polynomial. With that said, how did we get the polynomial 3/4x^4-2/3x^3+2x^2-7x+C as the integral of the original polynomial?

First of all, just as we did with derivatives, we would need to calculate the integral of each term in the polynomial one-by-one. To do so, you’d need to add one to the exponent of each term, then divide that term by the new exponent. Still confused? Let me explain it another way:

The integral for 3x^3 would be 3/4x^4 since 3+1=4 and 3 divided by 4 is, well, 3/4.
The integral for -2x^2 would be -2/3x^3 since 2+1=3 and -2 divided by 3 is, well, -2/3.
The integral for 4x would be 2x^2 since 1+1=2 and 4 divided by 2 is 2.
The integral for -7 would be -7x since constants have a power of 0 and 0+1=1 (and anything divided by 1 equals itself).
You will likely have noticed an additional value in the integral that you may not be aware of: the constant C. I’ll explain more about that right now.

So, you may be wondering what’s up with the C at the end of the integral equation. Remember how earlier in this post I mentioned that you can think of integrals as reverse derivatives? You may recall that in our previous lesson (R Lesson 28: Another Way To Work With Derivatives in R) I discussed that during the process of calculating a derivative for a polynomial, the derivative of any constant in that polynomial is 0. This means that when finding the derivative of any polynomial, the constant disappears.

Now, since integrals can be considered as reverse derivatives, we should remember that when we integrate a polynomial, there may have been a constant that disappeared during differentiation (which I forgot to mention is the name of the process used to find the derivative of a polynomial). The C at the end of an integral stands in for that unknown constant, which could be any one of an infinite number of possible values.
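As a quick sanity check on the hand calculation above, here’s a minimal sketch that asks R to differentiate the integral we found and compares the result with the original polynomial. The variable names (handIntegral, original, testPoints) and the stand-in constant 10 are my own choices for illustration; D() is R’s built-in derivative function from R Lesson 28.

```r
# The integral worked out by hand, with the constant C left off
handIntegral <- expression((3/4)*x^4 - (2/3)*x^3 + 2*x^2 - 7*x)

# The polynomial we started with
original <- function(x) { 3*x^3 - 2*x^2 + 4*x - 7 }

# Differentiating the integral should bring us back to the original polynomial
backToStart <- D(handIntegral, "x")

# Tacking on a constant (10 is an arbitrary stand-in for C) changes nothing
withConstant <- D(expression((3/4)*x^4 - (2/3)*x^3 + 2*x^2 - 7*x + 10), "x")

# Compare everything at a few arbitrary test points
testPoints <- c(-2, 0, 1, 3.5)
x <- testPoints
matchesOriginal <- isTRUE(all.equal(eval(backToStart), original(testPoints)))
constantVanishes <- isTRUE(all.equal(eval(withConstant), eval(backToStart)))
print(matchesOriginal)
print(constantVanishes)
```

Both checks should print TRUE, which is the "reverse derivatives" idea in action: whatever value C takes, it vanishes again on differentiation.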

An integral illustration to this lesson

For my more visual learners, here is an illustration of how integrals work:

Just as a derivative would measure the change from one point to another in this curve, the integral would measure the area under the curve. The area in yellow represents the negative integral while the area in red represents the positive integral.

Still confused? Don’t worry-we’ll definitely go more in depth in this lesson!

Calculating integrals, the R way, part 2

OK, now that we’ve discussed how to calculate integrals the manual way, let’s explore how to calculate integrals the R way. You’ll notice that R won’t just spit out the integral of a given polynomial but rather calculate the integral using an upper and lower limit. Don’t worry-I’ll explain this more later, but for now, let’s see how the magic is done:

integral <- function(x) { 3*x^3-2*x^2+4*x-7 }
result <- integrate(integral, lower=1, upper=2)
result
5.583333 with absolute error < 6.6e-14

In this example, I’m showing you how R does definite integration. What is definite integration? Let me explain it like this.

In the previous section of this post, all we were trying to do was to calculate the integral polynomial of a certain expression. This is known as indefinite integration since we were simply trying to find the integration function of a given polynomial with the arbitrary constant C. As I mentioned in the previous section, the constant C could represent nearly anything, which means there are infinite possible integrals for any given polynomial.

However, with definite integration (like I did above), you’ll be calculating the integral between a lower and an upper limit. This is certainly helpful if you’re looking for the accumulated value over a specific range of the polynomial function rather than just a general integral function. In R, to calculate the integral of a function over a given range, specify values for the lower and upper parameters of integrate() (in this case I used 1 and 2). As you can see from the result I obtained, I got ~5.58 with an absolute error below 6.6e-14, which indicates a very, very, very small margin of error for the integral calculation. In other words, R does a great job with definite integration.

Keep in mind that this polynomial can only be integrated over a finite range (e.g. lower=1, upper=2). R’s integrate() function does accept -Inf and Inf as limits, but it only gives a meaningful answer when the area under the curve is finite, and the area under this polynomial keeps growing without bound.
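If you want to work with the pieces of that printed result separately, here’s a small sketch. It assumes the same polynomial and limits as above; value, abs.error and message are components of the object that integrate() returns.

```r
# Same polynomial and limits as in the example above
integral <- function(x) { 3*x^3-2*x^2+4*x-7 }
result <- integrate(integral, lower=1, upper=2)

# The estimate itself, as a plain number you can reuse in later calculations
print(result$value)

# R's estimate of the upper bound on the numerical error
print(result$abs.error)

# A status message; "OK" means integrate() reported no problems
print(result$message)
```

Pulling out result$value is handy whenever you need the number itself rather than the printed summary.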

Plotting an integration function

Now that we know how to calculate integrals, the next thing we’ll explore is plotting the integration function. Here’s how we’d do so, using the polynomial from the first section and an integration range of (0,50):

# The polynomial we want to integrate
integral <- function(x) { 3*x^3-2*x^2+4*x-7 }
# Area under the polynomial from 0 up to x (upper=x, so the result changes with x)
integrated <- function(x) { integrate(integral, lower=0, upper=x)$value }
# integrate() takes one upper limit at a time, so wrap the function to accept a whole vector
vectorIntegral <- Vectorize(integrated)
x <- seq(0, 50, 1)
plot(x,vectorIntegral(x), xlim=c(0,50), xlab="X-values", ylab="Y-values", main="Definite Integration Example", col="blue", pch=16)

So, how did I manage to create this plot? Let me give you a step-by-step explanation:

I first set the function I wish to integrate as the value of the integral variable.
I then wrote a second function that integrates this polynomial from 0 up to whatever x-value it is given and grabs the value of the result. I stored this function in the integrated variable.
I then vectorized the integrated function with Vectorize() so that it can accept a whole vector of x-values at once, and stored the vectorized function in the vectorIntegral variable.
I then created an x-axis sequence to use in my plot that contained the parameters (0, 50, 1), which represent the lower limit, upper limit, and x-axis increment for my plot, respectively. This sequence is stored in the x variable.
Last but not least, I used the plot() function to plot the integral of the polynomial 3x^3-2x^2+4x-7. One thing you may be wondering about is the x, vectorIntegral(x) parameter in the function. The x parameter gathers all the x-values for the plot (in this case the integers 0 to 50) while the vectorIntegral(x) parameter calculates the corresponding y-value for each x-value and gathers them into a vector, or array, for the plot.
- Why choose vectorization to calculate the corresponding y-values? Well, it’s easier than writing a loop over each x-value in the integral range, since the vectorized function simply takes in all the x-values (0-50 in this case) as the input array and returns an output array containing the y-value for each x-value (which in this case start at 0 and climb to roughly 4.6 million at x=50).
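For comparison, here’s a minimal sketch of the same idea using sapply() in place of Vectorize(), with the upper limit of integration following each x-value. The names areaUpTo, yValues and handIntegral are hypothetical ones I picked for this sketch, and the last few lines check the numerical results against the integral we worked out by hand earlier (with C set to 0).

```r
# The polynomial to integrate
integral <- function(x) { 3*x^3 - 2*x^2 + 4*x - 7 }

# Area under the polynomial from 0 up to a single upper limit
areaUpTo <- function(upperLimit) {
  integrate(integral, lower = 0, upper = upperLimit)$value
}

# Apply areaUpTo() to every x-value, one at a time
x <- seq(0, 50, 1)
yValues <- sapply(x, areaUpTo)

# The hand-derived integral from earlier in this post, with C = 0
handIntegral <- function(x) { (3/4)*x^4 - (2/3)*x^3 + 2*x^2 - 7*x }

# Largest disagreement between the numerical and by-hand answers
largestGap <- max(abs(yValues - handIntegral(x)))
print(largestGap)
```

Passing x and yValues to plot() should draw the same kind of curve; whether sapply() or Vectorize() reads better is mostly a matter of taste.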

Calculating integrals, the manual way, part 3

So, now that I’ve shown you how to do definite integration the R way, let me show you how to do so the manual way. Let’s examine this illustration:

So in this illustration, I’m trying to calculate the integral for the polynomial 3x^3-2x^2+4x-7 using the (7,10) range. How do I do so? Well, first I perform some indefinite integration by finding the integral of the given polynomial; the only thing here is that I don’t need the constant C (it would cancel out in the subtraction anyway). Next, since my integration range is (7,10), I evaluate the integral function for x=10 and then subtract the result I get after evaluating the integral function for x=7. After all my calculations are complete, I get 5342.25 as the value of my integral (rounded to two decimal places) at the integration range of (7,10).

If you’re wondering what that weird-looking S means, that’s just a standard integral writing notation.
To calculate the integral of any given expression for a given range, always remember to first find the integral of the polynomial and then evaluate that integral at both the upper and lower limits. Subtract the result of the lower limit evaluation from the result of the upper limit evaluation. Unlike R’s integrate() function, which approximates the answer numerically and reports a very, very, very small margin of error, the manual method gives an exact answer.
In calculus function notation, the capital F represents the integral function while the lowercase f represents the original function, which is the derivative of F.
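To double-check the 5342.25 figure, here’s a short sketch that does the manual steps in R and then compares them with integrate(). The function name handIntegral is my own label for the capital-F function (with C left off).

```r
# Capital F: the integral of 3x^3-2x^2+4x-7, without the constant C
handIntegral <- function(x) { (3/4)*x^4 - (2/3)*x^3 + 2*x^2 - 7*x }

# Evaluate F at both ends of the (7,10) range
atUpper <- handIntegral(10)
atLower <- handIntegral(7)

# The definite integral is the difference between the two evaluations
byHand <- atUpper - atLower
print(byHand)

# The same range handed to R's numerical integrator
byR <- integrate(function(x) { 3*x^3-2*x^2+4*x-7 }, lower=7, upper=10)$value
print(byR)
```

The two printed numbers should agree to many decimal places, which is a nice way to catch arithmetic slips in a by-hand calculation.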

Thanks for reading!

Michael

R Lesson 28: Another Way To Work With Derivatives in R

Hello everybody,

Michael here, and in today’s lesson, I’ll show you another cool way to work with derivatives in R.

In the previous post (R Lesson 27: Introductory R Calculus) I discussed how to work with derivatives in R. However, that post didn’t cover the built-in methods R has to calculate derivatives (there are two ways of approaching this). What might those methods look like? Well, let’s dive in!

Calculating derivatives, the built-in R way, part 1

Now, how can we calculate derivatives with the built-in R way? First, we’ll explore the deriv() function. Let’s take a look at this code below, which contains a simple equation for a parabolic curve:

curve <- expression(3*x^2+4*x+5)
print(deriv(curve, "x"))

expression({
    .value <- 3 * x^2 + 4 * x + 5
    .grad <- array(0, c(length(.value), 1L), list(NULL, c("x")))
    .grad[, "x"] <- 3 * (2 * x) + 4
    attr(.value, "gradient") <- .grad
    .value
})

In this example, I’m using the polynomial 3x^2+4x+5 to represent our hypothetical parabolic curve. To find the derivative function of this polynomial, I ran R’s built-in deriv() function and passed in both the curve expression and "x" (yes, in double quotes) to find the derivative expression, as the derivative is taken with respect to x (or whatever character you used to represent an unknown quantity). Now, as you see from the output given, things don’t look too understandable. However, pay attention to the second .grad line (in this case, 3 * (2 * x) + 4), as this will provide the derivative function of the parabolic curve equation: 6x+4. We would use this equation to evaluate the rate of change at any point on the parabolic curve.

Let’s say we wanted to evaluate the derivative polynomial-6x+4-at x=5. If we use this x value, then the derivative of the parabola at x=5 would be 34.

The output reads 3 * (2 * x) + 4, which R doesn’t simplify for you. Remember to multiply the 3 by the 2x to get the simplified answer of 6x (6x+4 to be exact).
When you’re writing a polynomial in R, remember to include all the multiplication signs (*). Yeah, I know it’s super annoying.
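Rather than plugging x=5 into 6x+4 by hand, you can let R evaluate the deriv() output for you. Here’s a minimal sketch of one way to do it; the variable names (parabola, parabolaDeriv, evaluated) are my own, and I’ve avoided reusing the name curve since R already has a built-in function called curve().

```r
# The same parabola as above, stored under a different name
parabola <- expression(3*x^2+4*x+5)
parabolaDeriv <- deriv(parabola, "x")

# eval() runs the generated expression using whatever x currently holds
x <- 5
evaluated <- eval(parabolaDeriv)

# The plain value is the height of the parabola at x=5: 3*25+4*5+5 = 100
curveHeight <- as.numeric(evaluated)
print(curveHeight)

# The "gradient" attribute holds the derivative at x=5: 3*(2*5)+4 = 34
slopeAtFive <- as.numeric(attr(evaluated, "gradient"))
print(slopeAtFive)
```

So a single eval() call gives you both the value of the curve and its rate of change at the chosen point.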

In case you wanted a visual representation of this parabola, here it is:

I created this illustration of a parabola using a free online tool called DESMOS, which allows you to quickly create a visual representation of a line, parabola, or other curve. Here’s the link to DESMOS: https://www.desmos.com/calculator/dz0kvw0qjg.

Calculating derivatives, the built-in R way, part 2

Now let’s explore R’s other built-in way to calculate derivatives-this time using the D() function. For this example, let’s use the same parabolic curve equation we used for the deriv() example:

curve <- expression(3*x^2+4*x+5)
print(D(curve, "x"))

3 * (2 * x) + 4

As you can see, we passed in the same parameters for the D() function that we used for the deriv() function and got a much simpler version of the output we got for the deriv() function-simpler as in the derivative expression itself was the only thing that was returned.
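Because D() hands back a plain expression, evaluating it is straightforward. The sketch below (with hypothetical names parabola and slopeFormula) evaluates the derivative at several x-values at once and compares the results with 6x+4.

```r
# The same parabola as above, stored under a different name
parabola <- expression(3*x^2+4*x+5)
slopeFormula <- D(parabola, "x")

# Evaluate the derivative at x = 0, 1, 2, 3, 4, 5 in one go
x <- 0:5
slopes <- eval(slopeFormula)
print(slopes)

# The simplified form 6x+4 gives exactly the same numbers
matchesSimplified <- all(slopes == 6*x + 4)
print(matchesSimplified)
```

The last slope printed is the one for x=5, which is the 34 we found earlier.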

Since the derivative expression returned from the D() function is the same as the expression returned from the deriv() function, it simplifies to the same derivative: 6x+4.

Calculating derivatives, the manual way, part 3

Yes, I know I was mostly going to focus on the two built-in R functions used to calculate derivatives-deriv() and D()-but I thought I’d include this bonus section for those of you (especially those who enjoy exploring calculus) who were wondering how to manually calculate derivatives.

Let’s take the same polynomial we were working with for the previous two examples: 3x^2+4x+5

How would we arrive at the derivative expression of 6x+4? Check out this illustration below:

From this picture, here are some things to keep in mind when calculating derivatives of polynomials (and it’s not the same as calculating derivatives of regular numbers like we did in the previous post R Lesson 27: Introductory R Calculus):

Go term-by-term when calculating derivatives of polynomials. In this example, you’d calculate the derivative of 3x^2, then the derivative of 4x and lastly the derivative of 5.
How would you calculate the derivatives of each term? Here’s how (and it’s quite easy).
The derivative of 3x^2 would be 6x, as you would multiply the number (3) by the power (2) to get 6. You would then reduce the power of the x^2 by 1 to get x (any variable in a polynomial without an exponent is raised to the power of 1). Thus, the derivative of 3x^2 would be 6x.
The derivative of 4x would simply be 4. Just as we did with 3x^2, we’d multiply the number (4) by the power (1 in this case) to get 4. We would also reduce the power of x by 1, since x has a power of 1 and 1-1=0. Anything raised to the power of 0 equals 1, so 4x^0 is simply 4. In a polynomial, constants (numbers without variables next to them) have a power of 0.
I mean, we could’ve written the derivative polynomial as 6x+4x^0, but 6x+4 looks a lot nicer.
As for the constant in this polynomial (5), it has a derivative of 0 since the derivative of a constant is always 0 (after all, any constant in a polynomial has a power of 0, and multiplying the constant by that power gives 0, so this makes perfect sense). Thus, the derivative of 5 isn’t included in the derivative polynomial of 6x+4.
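If it helps to see the term-by-term rule as code, here’s a rough sketch that applies it to the polynomial’s coefficients. Storing the polynomial as a vector of coefficients (listed from the constant upwards) is just a convention I’m assuming for this illustration, as are the variable names.

```r
# Coefficients of 3x^2+4x+5, listed from the constant upwards: 5, 4x, 3x^2
coefs <- c(5, 4, 3)

# The power attached to each coefficient: 0, 1, 2
powers <- seq_along(coefs) - 1

# Step 1: multiply each number by its power (the constant becomes 5*0 = 0)
multiplied <- coefs * powers

# Step 2: reduce every power by 1, which drops the old constant's slot
derivativeCoefs <- multiplied[-1]
print(derivativeCoefs)
```

The result reads 4 and 6 from the constant upwards, in other words 4+6x, which is the same 6x+4 we got from deriv() and D().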

Thanks for reading,

Michael

R Lesson 27: Introductory R Calculus

Hello everybody,

Michael here, and in today’s post, I’m going to revisit an old friend of ours: the language R. As you readers may recall, R was the first language I covered on this blog, and since we’re only a few posts away from the blog’s fifth anniversary, I thought it would be fun to revisit this blog’s roots as an analytics blog (remember the Michael’s Analytics Blog days, everyone?).

Today’s post will provide a basic introduction to doing calculus with R (including graphing). Why am I doing R calculus? Well, I wanted to do some more fun R posts leading up to the blog’s fifth anniversary, and I had so much fun writing the trigonometry portion of my previous post (Python Lesson 41: Word2Vec (NLP pt.7/AI pt.7)) that I wanted to dive into more mathematical programming topics. With that said, let’s get started with some R calculus!

Setting ourselves up

In this lesson, we’ll be using this dataset: MCU-movies (available as a CSV download).

This dataset contains the Rotten Tomatoes scores for all MCU (Marvel Cinematic Universe) movies from Iron Man (2008) to Guardians of the Galaxy Vol. 3 (2023). Both critic and audience Rotten Tomatoes scores are included for all MCU movies.

Now, let’s open up our R IDE and read in this CSV file:

MCU <- read.csv("C:/Users/mof39/OneDrive/Documents/MCU movies.csv", fileEncoding="UTF-8-BOM")
> MCU
                                         Movie Year RT.score Audience.score
1                                     Iron Man 2008     0.94           0.91
2                              Incredible Hulk 2008     0.67           0.69
3                                   Iron Man 2 2010     0.71           0.71
4                                         Thor 2011     0.77           0.76
5           Captain America: The First Avenger 2011     0.80           0.75
6                                 The Avengers 2012     0.91           0.91
7                                   Iron Man 3 2013     0.79           0.78
8                          Thor The Dark World 2013     0.66           0.75
9          Captain America: The Winter Soldier 2014     0.90           0.92
10                     Guradians of the Galaxy 2014     0.92           0.92
11                     Avengers: Age of Ultron 2015     0.76           0.82
12                                     Ant-Man 2015     0.83           0.85
13                  Captain America: Civil War 2016     0.90           0.89
14                              Doctor Strange 2016     0.89           0.86
15               Guardians of the Galaxy Vol 2 2017     0.85           0.87
16                      Spider-Man: Homecoming 2017     0.92           0.87
17                              Thor: Ragnarok 2017     0.93           0.87
18                               Black Panther 2018     0.96           0.79
19                      Avengers: Infinity War 2018     0.85           0.92
20                        Ant-Man and the Wasp 2018     0.87           0.80
21                              Captain Marvel 2019     0.79           0.45
22                           Avengers: Endgame 2019     0.94           0.90
23                   Spider-Man: Far From Home 2019     0.90           0.95
24                                 Black Widow 2021     0.79           0.91
25   Shang-Chi and the Legend of the Ten Rings 2021     0.91           0.98
26                                    Eternals 2021     0.47           0.77
27                     Spider-Man: No Way Home 2021     0.93           0.98
28 Doctor Strange in the Multiverse of Madness 2022     0.74           0.85
29                      Thor: Love and Thunder 2022     0.63           0.77
30              Black Panther: Wakanda Forever 2022     0.84           0.94
31           Ant-Man and the Wasp: Quantumania 2023     0.47           0.83
32               Guardians of the Galaxy Vol 3 2023     0.81           0.95

As you can see, we have read the data-frame into R and displayed it on the IDE (there are 32 rows here, one per movie).

Now, before we dive into the calculus of everything, let’s explore our dataset:

Movie-the name of the movie
Year-the movie’s release year
RT.score-the movie’s critic score on Rotten Tomatoes
Audience.score-the movie’s audience score on Rotten Tomatoes
R tip: when you are reading a CSV file into R, it might help to add the fileEncoding="UTF-8-BOM" parameter to the read.csv() function, as this parameter removes the byte order mark (the junk text that would otherwise appear in the name of the dataframe’s first column).
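To see what that tip does without needing the real dataset, here’s a small self-contained sketch. It writes a tiny two-column CSV to a temporary file with a byte order mark in front (the file contents and the names csvPath and withBomHandling are made up for this demo), then reads it back with the encoding spelled "UTF-8-BOM".

```r
# Create a throwaway CSV that starts with a byte order mark (bytes EF BB BF)
csvPath <- tempfile(fileext = ".csv")
con <- file(csvPath, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("Movie,Year\nIron Man,2008\n"), con)
close(con)

# Reading with fileEncoding="UTF-8-BOM" strips the mark before parsing
withBomHandling <- read.csv(csvPath, fileEncoding = "UTF-8-BOM")
print(names(withBomHandling))

# Tidy up the temporary file
unlink(csvPath)
```

The first column name should come back as a clean Movie. Without the parameter, the stray bytes can end up glued to the front of that name, and exactly how they show up depends on your system’s settings.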

Calculus 101

Now, before we dive headfirst into the fun calculus stuff with R, let’s first discuss calculus and derivatives, which are the topics of this post.

What is calculus? Simply put, calculus is a branch of mathematics that deals with the study of change. Calculus is a great way to measure how things change over time, like MCU movies’ Rotten Tomatoes scores over the course of its 15-year, 32-movie run.

There are two main types of calculus-differential and integral calculus. Differential calculus focuses on finding the rate of change of, well, any given thing over a period of time. Integral calculus, on the other hand, focuses on the accumulation of any given thing over a certain period of time.

A good example of differential calculus would be modelling changes in a city’s population over a certain period of time; differential calculus would be used in this scenario to find the city’s population change rate over time. A good example of integral calculus would be modelling the spread of a disease over time (e.g. COVID-19) in a certain geographic region to find that region’s total number of infections over a certain time period.

Now, what is a derivative? In calculus, the derivative is the metric used to measure the rate of change at any given point in the measured example. In this example, the derivative (or rather derivatives since we’ll be using two derivatives) would be the change in Rotten Tomatoes scores (both critic and audience) from one MCU movie to the next.

It’s R calculus time!

Now that I’ve explained the gist of calculus and derivatives to you all, it’s time to implement them into R! Here’s how to do so (and yes, we will be finding the derivatives of both critic and audience scores). First, let’s start with the critic scores derivatives:

criticScores <- MCU$RT.score
criticDerivatives <- diff(criticScores)
criticDerivatives

[1] -0.27  0.04  0.06  0.03  0.11 -0.12 -0.13  0.24  0.02 -0.16  0.07  0.07 -0.01 -0.04  0.07  0.01  0.03 -0.11  0.02 -0.08  0.15 -0.04 -0.11  0.12 -0.44  0.46 -0.19 -0.11  0.21 -0.37
[31]  0.34

To calculate the derivatives for each critic score, I first placed all of the critics’ scores (stored in the column MCU$RT.score) into the vector criticScores. I then used R’s built-in diff() function to calculate the difference in critic scores from one MCU movie to the next and-voila!-I have my 31 derivatives.

Even though there are 32 MCU movies, there are only 31 differences to calculate and thus only 31 derivatives that appear.
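To make the "one fewer difference than movies" point concrete, here’s a tiny sketch using just the first four critic scores from the table above (the variable names are my own).

```r
# Critic scores for Iron Man, Incredible Hulk, Iron Man 2 and Thor
firstFour <- c(0.94, 0.67, 0.71, 0.77)

# diff() subtracts each score from the one that follows it
changes <- diff(firstFour)
print(changes)

# The same thing spelled out by hand: every score but the first,
# minus every score but the last
byHand <- firstFour[-1] - firstFour[-length(firstFour)]
print(byHand)

# Four scores give three differences
print(length(changes))
```

Each difference is the change per step of one movie, which is why we can treat it as a stand-in for the derivative at that point.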

Calculating the derivatives of the audience scores works exactly the same way, except you’ll just need to pull your data from the MCU$Audience.score column:

audienceScores <- MCU$Audience.score
audienceDerivatives <- diff(audienceScores)
audienceDerivatives
 [1] -0.22  0.02  0.05 -0.01  0.16 -0.13 -0.03  0.17  0.00 -0.10  0.03  0.04 -0.03  0.01  0.00  0.00 -0.08  0.13 -0.12 -0.35  0.45  0.05 -0.04  0.07 -0.21  0.21 -0.13 -0.08  0.17 -0.11
[31]  0.12

Plotting our results

Now that we’ve calculated the derivatives of both the critic and audience scores, let’s plot them!

Here’s how we’d plot the critic scores:

plot(1:(length(criticScores)-1),criticDerivatives, type = "l", xlab = "MCU Movie Number", ylab = "Change in critic score")

In this example, I used R’s plot() function (which doesn’t require installation of the ggplot2 package) to plot the derivatives of the critic scores. The y-axis represents the change in critic scores, while the x-axis represents the index for a specific MCU movie (e.g. 1 would be Incredible Hulk while 31 would be Guardians of the Galaxy Vol. 3).

However, this visual doesn’t seem too helpful. Let’s see how we can fix it!

First, let’s create a vector of the MCU movies to use as labels for this plot:

movies <- MCU$Movie

Next, let’s remove Iron Man from this vector since it won’t have a derivative (after all, it’s the first MCU movie).

movies <- movies[! movies %in% c('Iron Man')]

Great! Now let’s revise our plot to first add a title:

plot(1:(length(criticScores)-1),criticDerivatives, type = "l", main="Changes in MCU movie critic reception", xlab = "MCU Movie Number", ylab = "Change in critic score")

You can see that the plot() function’s main parameter allows you to add a title to the graph.

Next let’s add some labels to our data points-remember to only run this command AFTER you have the initial graph open!

text(1:(length(criticScores)-1),criticDerivatives, labels=movies, pos=3, cex=0.6)

Voila! With the text() function, we’re able to add labels to our data points so that we can tell which movie corresponds with which data point!

Remember to include the same X and Y axes in the text() function as you did in the plot() function! In this case, the X axis would be 1:(length(criticScores)-1) and the Y axis would be criticDerivatives.
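Here’s a stripped-down sketch of the whole plot-then-label routine using only the first four movies, so you can run it without the CSV file. Dropping the first movie by position with movies[-1] is an alternative I’m assuming here in place of the %in% approach above; both leave one label per derivative.

```r
# A four-movie sample taken from the table above
movies <- c("Iron Man", "Incredible Hulk", "Iron Man 2", "Thor")
criticScores <- c(0.94, 0.67, 0.71, 0.77)
criticDerivatives <- diff(criticScores)

# Drop the first movie by position, since it has no derivative
movieLabels <- movies[-1]

# Use the same x and y values for both plot() and text()
xValues <- 1:(length(criticScores)-1)
plot(xValues, criticDerivatives, type = "l", main="Changes in MCU movie critic reception", xlab = "MCU Movie Number", ylab = "Change in critic score")
text(xValues, criticDerivatives, labels=movieLabels, pos=3, cex=0.6)
```

Storing the x-values in a variable first is one way to make sure plot() and text() can’t drift out of sync.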

Now that we have a title and labelled data points in our graph, let’s gather some insights. From our graph, we can see that the critical reception for the MCU’s Phases 1 & 2 was up-and-down (these include movies from Iron Man to Ant-Man). The critical reception for MCU’s Phase 3 slate (from Captain America: Civil War to Spider-Man: Far From Home) was its most solid to date, as there are no major swings in either direction. The most interesting area of the graph is Phases 4 & 5 (from Black Widow onwards), as this era of the MCU has seen some sharp jumps in critical reception from movie to movie. Some of the sharpest changes can be seen from Shang-Chi and the Legend of the Ten Rings to Eternals (a 44-point drop in critic score) and from Eternals to Spider-Man: No Way Home (a 46-point rise in critic score).

All in all, one insight we can gain from this graph is that MCU Phase 3 was its most critically well-received (and as some fans would say, the MCU’s prime) while the entries in Phase 4 & 5 have been hit-or-miss critically (ahem, Eternals).
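You don’t have to eyeball the graph to find the sharpest swings. Here’s a small sketch that types in the 32 critic scores from the table above (so it runs without the CSV file) and asks R where the biggest drop and the biggest rise occur; the names biggestDropRow and biggestRiseRow are my own.

```r
# Critic scores for all 32 movies, in the same order as the table above
criticScores <- c(0.94, 0.67, 0.71, 0.77, 0.80, 0.91, 0.79, 0.66,
                  0.90, 0.92, 0.76, 0.83, 0.90, 0.89, 0.85, 0.92,
                  0.93, 0.96, 0.85, 0.87, 0.79, 0.94, 0.90, 0.79,
                  0.91, 0.47, 0.93, 0.74, 0.63, 0.84, 0.47, 0.81)
criticDerivatives <- diff(criticScores)

# which.min()/which.max() give the position of the derivative;
# adding 1 converts that to the row of the movie that caused the change
biggestDropRow <- which.min(criticDerivatives) + 1
biggestRiseRow <- which.max(criticDerivatives) + 1
print(biggestDropRow)
print(biggestRiseRow)

# The sizes of those two changes
print(round(min(criticDerivatives), 2))
print(round(max(criticDerivatives), 2))
```

Rows 26 and 27 in the table are Eternals and Spider-Man: No Way Home, matching what the graph shows.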

Now that we’ve analyzed critic derivatives, let’s turn our attention to analyzing audience score derivatives. Here’s the plot we’ll use-and it’s pretty much the same code we used to create the updated critic score derivative plot (except replace the word critic with the word audience in each axis variable and in the title):

plot(1:(length(audienceScores)-1),audienceDerivatives, type = "l", main="Changes in MCU movie audience reception", xlab = "MCU Movie Number", ylab = "Change in audience score")

text(1:(length(audienceScores)-1),audienceDerivatives, labels=movies, pos=3, cex=0.6)

The change in audience reception throughout the MCU’s 15-year, 32-movie run looks a little different than the change in critic reception over that same time period. For one, there are fewer sharp changes in audience score from movie to movie. Also interesting is how much gentler the audience swings are for the MCU’s Phase 4 & 5 movies: audience and critic scores rise and fall the same number of times from Black Widow onwards, but the critic swings are noticeably larger (this is also interesting because many fans on MCU social media accounts that I follow have griped about the MCU’s quality post-Avengers: Endgame). One more interesting insight is that the sharpest changes in audience reception came during the peak of Phase 3 (namely from Black Panther to Avengers: Endgame). As you can see from the graph above, the audience score rises from Black Panther to Avengers: Infinity War, then drops from Avengers: Infinity War to Ant-Man and the Wasp. The audience score drops even further from Ant-Man and the Wasp to Captain Marvel before sharply rising from Captain Marvel to Avengers: Endgame. I personally found this insight interesting as some of my favorite MCU movies come from Phase 3 (like Black Panther with its 96% critic score on Rotten Tomatoes), though I do recall Captain Marvel wasn’t well liked when it came out in March 2019 (but boy oh boy was Avengers: Endgame one of the most hyped things of 2019).
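To put a number on "fewer sharp changes", here’s a rough sketch that counts how many movie-to-movie swings are at least 0.2 (20 points) for each score. The 0.2 cutoff is purely my own assumption for what counts as "sharp", and both score vectors are typed in from the table above so the sketch runs on its own.

```r
# Critic and audience scores for all 32 movies, in table order
criticScores <- c(0.94, 0.67, 0.71, 0.77, 0.80, 0.91, 0.79, 0.66,
                  0.90, 0.92, 0.76, 0.83, 0.90, 0.89, 0.85, 0.92,
                  0.93, 0.96, 0.85, 0.87, 0.79, 0.94, 0.90, 0.79,
                  0.91, 0.47, 0.93, 0.74, 0.63, 0.84, 0.47, 0.81)
audienceScores <- c(0.91, 0.69, 0.71, 0.76, 0.75, 0.91, 0.78, 0.75,
                    0.92, 0.92, 0.82, 0.85, 0.89, 0.86, 0.87, 0.87,
                    0.87, 0.79, 0.92, 0.80, 0.45, 0.90, 0.95, 0.91,
                    0.98, 0.77, 0.98, 0.85, 0.77, 0.94, 0.83, 0.95)

# Hypothetical cutoff for a "sharp" change: 0.2 in either direction
sharpCutoff <- 0.2
sharpCritic <- sum(abs(diff(criticScores)) >= sharpCutoff)
sharpAudience <- sum(abs(diff(audienceScores)) >= sharpCutoff)
print(sharpCritic)
print(sharpAudience)

# Row of the movie behind the biggest audience drop and biggest audience rise
print(which.min(diff(audienceScores)) + 1)
print(which.max(diff(audienceScores)) + 1)
```

With this cutoff, the critic scores show seven sharp swings against five for the audience scores, and the two biggest audience swings land on rows 21 and 22 (Captain Marvel and Avengers: Endgame). A different cutoff would give different counts, so treat the exact numbers as illustrative.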

Thanks for reading,

Michael