Hello everybody,
It’s Michael, and welcome to the roaring (20)20s! With that said, it’s fitting that my first post of the 2020s will be about the 2010s, or more specifically, Billboard’s Top 200 Albums of the 2010s, which I will analyze using linear regression.
Here’s the dataset: Billboard 200.
As always, let’s open up R, upload the file, and learn more about our dataset:

This dataset shows all of the albums that ended up on Billboard’s Top 200 Decade-End Chart along with information about each album (such as number of tracks).
In total, there are 200 observations of 7 variables:
- Rank: Where the album ranked on Billboard’s Top 200 Decade-End Chart (anywhere from 1 to 200)
- Album: The name of the album
- Artist: The singer/group (or distributor, in the case of some movie soundtracks) who created the album
- Genre: The main genre of the album (note that albums can fit into several sub-genres, which I will explore in this analysis)
- Tracks: How many tracks are on the album
- Metacritic: The album’s Metacritic score. For those that don’t know, Metacritic is a movie/TV show/music review site, much like Rotten Tomatoes (except RT doesn’t review music).
- Release Date: The album’s release date. Even though this is a 2010s Decade-End Chart, there are interestingly a handful of albums from the late ’00s. Guess they still held up into the ’10s.
Now, let’s check to see if there’s missing data (remember to install the Amelia package):
- Also remember to type the command missmap(file) (or whatever variable you called your file) to see the missingness map.
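If you’d like a quick numeric check to go along with the map, base R can count the NAs for you. Here’s a minimal sketch on a made-up four-album data frame (the `toy` data frame and its values are invented purely for illustration; they are not the Billboard file). It also shows why the share of missing cells across the whole dataset can be much smaller than the share of missing values within a single column:

```r
# Hypothetical toy data, NOT the Billboard dataset
toy <- data.frame(
  Album = c("A", "B", "C", "D"),
  Tracks = c(10, 12, 14, 16),
  Metacritic = c(80, NA, 70, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Share of ALL cells (rows x columns) that are missing
overall_missing <- mean(is.na(toy))
print(overall_missing)

# Share of missing values within the Metacritic column alone
metacritic_missing <- mean(is.na(toy$Metacritic))
print(metacritic_missing)
```

In this toy example only 2 of the 12 cells are missing overall, yet half of the Metacritic column is empty, so it’s worth checking the per-column figure too.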
As you can see, 97% of observations are present while 3% are missing. All of the missing observations are in the Metacritic column. This is because not all albums have a Metacritic score (there are plenty of other music review sites such as HipHopDX, but I only went off of Metacritic reviews to maintain consistency in the dataset).
Now, I don’t know if I’ve mentioned this before, but when there are missing values in a column in R, there are three things you can do:
- Don’t use the column in analysis
- Fill in missing column values with the mean of the column (meaning the mean you get from all non-NA values in the column)
- Fill in the missing column values with an arbitrary fixed value
The first option sometimes works, but not for this dataset, as I want to use the Metacritic column in some way in this analysis. The second option might work, but I won’t use it since I feel that imputing the mean Metacritic score for any NAs in the column wouldn’t make much sense (plus this option won’t work with non-numeric columns). In this case, the third option is my best bet; I will fill in any missing values in the Metacritic column using the number 0. You could pick any number; I just chose 0 since doing so gives me an easy way to spot the albums without Metacritic scores. Keep in mind, though, that R treats those zeros as real scores, so they do pull the column’s mean down and will feed into any model that uses the column.
Here’s the line of code to make the magic happen:
file$Metacritic[is.na(file$Metacritic)] <- 0
Once we run this line of code, here’s what the Metacritic column looks like now:
All the NAs are filled with zeroes, which, in my opinion, makes the column a lot neater looking.
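To see how the second and third options differ, here’s a small sketch using a made-up vector of scores (the `scores` vector is hypothetical, not taken from the dataset). It fills the NAs once with the mean of the non-NA values and once with 0, then compares the resulting means:

```r
# Hypothetical scores, NOT taken from the Billboard dataset
scores <- c(80, NA, 70, NA, 90)

# Mean of the scores that are actually present
mean_present <- mean(scores, na.rm = TRUE)

# Option 2: fill the NAs with the mean of the non-NA values
filled_mean <- scores
filled_mean[is.na(filled_mean)] <- mean_present

# Option 3: fill the NAs with a fixed value (0, as in this post)
filled_zero <- scores
filled_zero[is.na(filled_zero)] <- 0

print(mean_present)       # 80
print(mean(filled_mean))  # 80
print(mean(filled_zero))  # 48
```

Mean-filling leaves the column mean where it was, while zero-filling drags it down; how much depends on how many values were missing.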
Now, let’s do some linear regression. Here’s a simple model with one independent and one dependent variable:
> model1 <- lm(file$Metacritic~file$Genre)
> summary(model1)

Call:
lm(formula = file$Metacritic ~ file$Genre)

Residuals:
Min 1Q Median 3Q Max
-65.345 -7.384 2.661 13.667 55.692

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.222e-14 1.546e+01 0.000 1.000000
file$GenreChristmas 2.433e+01 2.187e+01 1.113 0.267277
file$GenreCountry 3.581e+01 1.619e+01 2.211 0.028262 *
file$GenreElectronic 4.700e+01 2.046e+01 2.298 0.022708 *
file$GenreFolk 6.650e+01 2.445e+01 2.720 0.007156 **
file$GenreJazz 3.750e+01 2.445e+01 1.534 0.126801
file$GenreMovie Soundtrack 2.231e+01 1.715e+01 1.300 0.195099
file$GenreMusical 8.500e+01 3.093e+01 2.748 0.006584 **
file$GenreOpera 5.400e+01 3.093e+01 1.746 0.082465 .
file$GenrePop 6.534e+01 1.586e+01 4.121 5.71e-05 ***
file$GenreR&B 4.925e+01 1.729e+01 2.849 0.004889 **
file$GenreRap 6.133e+01 1.591e+01 3.855 0.000160 ***
file$GenreReggae -6.545e-14 3.093e+01 0.000 1.000000
file$GenreRock 6.908e+01 1.729e+01 3.996 9.31e-05 ***
file$GenreSoul 7.550e+01 2.046e+01 3.691 0.000294 ***
file$GenreVarious -1.459e-13 2.445e+01 0.000 1.000000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 26.78 on 184 degrees of freedom
Multiple R-squared: 0.314, Adjusted R-squared: 0.2581
F-statistic: 5.614 on 15 and 184 DF, p-value: 2.224e-09
In this model, I used Metacritic as the dependent variable (I did say I was going to use it in this analysis) and Genre as the independent variable. I chose these two variables because I wanted to analyze whether certain genres tend to get higher/lower Metacritic scores.
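When the independent variable is a category like Genre, R picks one genre as the baseline and reports every other genre relative to it. Here’s a minimal sketch of that behavior on a made-up data frame (the `toy` data and the `toy_model` name are hypothetical; I’m also using the `data =` argument of `lm()` here, which keeps the coefficient names shorter than the `file$` style):

```r
# Hypothetical toy data, NOT the Billboard dataset
toy <- data.frame(
  Genre = factor(c("Country", "Country", "Pop", "Pop", "Rap", "Rap")),
  Metacritic = c(60, 70, 80, 90, 70, 80)
)

# Same kind of model as model1, fitted on the toy data
toy_model <- lm(Metacritic ~ Genre, data = toy)
coefs <- coef(toy_model)
print(coefs)

# Average score per genre, for comparison with the coefficients
genre_means <- tapply(toy$Metacritic, toy$Genre, mean)
print(genre_means)
```

The intercept equals the average score of the baseline genre (Country, the first one alphabetically), and each remaining coefficient is that genre’s average minus the baseline’s average. Read this way, an intercept of basically 0 (like the one in model 1) would mean the baseline genre’s average score is 0, which is what you’d get if every album in that genre had its missing score filled in with 0.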
What does all of the output mean? Since my last linear regression post was over a year ago, let me give you guys a refresher:
- The residual standard error (RSE) measures how far, on average, the actual values of the dependent variable (Metacritic) fall from the fitted regression line. In this case, the RSE is 26.78, meaning the Metacritic scores deviate from the regression line by about 27 points (rounded to the nearest whole number). Since Metacritic scores only go up to 100, 27 is quite a large RSE.
- The R-squared is the measure of a model’s goodness-of-fit; the closer to 1, the better the fit. The difference between the Multiple R-Squared and the Adjusted R-Squared is that the former never goes down when you add variables to the model, while the latter is penalized for the number of variables. In this case, the Multiple R-Squared is 31.4% while the Adjusted R-Squared is 25.81%. This implies that there isn’t much of a correlation between an album’s genre and Metacritic score.
- It’s not an official rule, but I’d say the Multiple R-Squared should be at least 51% for there to be any correlation between a dependent variable and any independent variable(s). The Adjusted R-Squared can be slightly lower than 51%.
- In the post R Analysis 2: Linear Regression & NFL Attendance, I mentioned the idea that “correlation does not imply causation”, which holds true here. In the context of this model, just because there is a slight correlation between an album’s genre and its Metacritic score, this doesn’t mean that an album’s genre is what causes it to score higher/lower on Metacritic.
- Disregard the F-statistic and corresponding p-value. However, if you want more context on both of these things, please check out the link in the previous bullet point for a more in-depth explanation.
- Notice how the independent variable, Genre, is split up into several different sub-categories. These sub-categories represent all of the album genres listed in this dataset (except for the baseline genre, which R folds into the intercept).
- The asterisks right by the subcategories (after the Pr(>|t|) column) are significance codes, which in this case represent each individual genre’s significance to the album’s Metacritic score. In other words, the significance codes show which genres are likely to have an impact on an album’s Metacritic score. Any genre with two or three asterisks will significantly influence an album’s Metacritic score; such genres include:
- Folk
- Musical
- Pop
- Rap
- R&B
- Rock
- Soul
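As a quick sanity check on the R-squared figures, the Adjusted R-Squared can be rebuilt by hand from the Multiple R-Squared, the number of observations, and the residual degrees of freedom. The sketch below plugs in the numbers printed in the model 1 summary (the variable names are my own):

```r
# Values copied from the summary(model1) output above
r_squared <- 0.314   # Multiple R-squared
n <- 200             # number of observations (albums)
resid_df <- 184      # residual degrees of freedom

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (residual df)
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / resid_df
print(round(adj_r_squared, 4))  # 0.2581
```

The result matches the 0.2581 that R reported. The more variables a model uses, the fewer residual degrees of freedom are left, and the bigger the gap between the two R-squared values becomes.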
Now, I want to try another model using Metacritic, but this time I will use two independent variables, Artist and Genre, and see if this improves the accuracy of the model:
model2 <- lm(file$Metacritic ~ file$Genre + file$Artist)
summary(model2)
Call:
lm(formula = file$Metacritic ~ file$Genre + file$Artist)

Residuals:
Min 1Q Median 3Q Max
-41.667 0.000 0.000 0.687 47.333

Coefficients: (6 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.167e+01 2.844e+01 -0.762 0.448674
file$GenreChristmas 2.167e+01 3.399e+01 0.637 0.525909
file$GenreCountry 1.502e+01 3.425e+01 0.438 0.662358
file$GenreElectronic -6.473e+01 3.785e+01 -1.710 0.091599 .
file$GenreFolk 1.977e+01 3.549e+01 0.557 0.579287
file$GenreJazz 5.917e+01 3.134e+01 1.888 0.063118 .
file$GenreMovie Soundtrack 2.167e+01 2.150e+01 1.008 0.316958
file$GenreMusical 1.067e+02 3.399e+01 3.138 0.002477 **
file$GenreOpera 7.567e+01 3.399e+01 2.226 0.029188 *
file$GenrePop 1.727e+01 2.719e+01 0.635 0.527498
file$GenreR&B 3.297e+01 2.963e+01 1.112 0.269683
file$GenreRap 2.167e+01 3.134e+01 0.691 0.491590
file$GenreReggae 2.167e+01 3.399e+01 0.637 0.525909
file$GenreRock 9.467e+01 3.399e+01 2.785 0.006858 **
file$GenreSoul 1.427e+01 3.549e+01 0.402 0.688886
file$GenreVarious 2.167e+01 3.876e+01 0.559 0.577891
file$Artist21 Pilots 7.000e+00 2.633e+01 0.266 0.791120
file$Artist21 Savage 8.100e+01 2.280e+01 3.552 0.000683 ***
file$ArtistA Boogie wit da Hoodie -1.858e-13 2.280e+01 0.000 1.000000
file$ArtistAdele 7.940e+01 3.115e+01 2.549 0.012978 *
file$ArtistAlicia Keys -1.130e+01 3.331e+01 -0.339 0.735394
file$ArtistAriana Grande 8.115e+01 2.666e+01 3.044 0.003271 **
file$ArtistBarbara Streisand 5.940e+01 3.115e+01 1.907 0.060612 .
file$ArtistBeyonce 7.720e+01 2.428e+01 3.180 0.002182 **
file$ArtistBillie Eilish 8.640e+01 3.115e+01 2.773 0.007083 **
file$ArtistBlake Shelton 7.065e+01 3.747e+01 1.886 0.063440 .
file$ArtistBob Marley NA NA NA NA
file$ArtistBrantley Gilbert 6.650e+00 3.747e+01 0.177 0.859634
file$ArtistBruno Mars 7.140e+01 2.719e+01 2.626 0.010585 *
file$ArtistBryson Tiller -1.130e+01 3.331e+01 -0.339 0.735394
file$ArtistCamila Cabello 8.240e+01 3.115e+01 2.645 0.010052 *
file$ArtistCardi B 8.400e+01 2.280e+01 3.684 0.000445 ***
file$ArtistCarrie Underwood 6.865e+01 3.508e+01 1.957 0.054280 .
file$ArtistCast of Hamilton NA NA NA NA
file$ArtistChris Brown -1.130e+01 3.331e+01 -0.339 0.735394
file$ArtistChris Stapleton 9.165e+01 3.747e+01 2.446 0.016922 *
file$ArtistColdplay 6.940e+01 3.115e+01 2.228 0.029075 *
file$ArtistDaBaby 1.931e-13 2.280e+01 0.000 1.000000
file$ArtistDaft Punk 1.734e+02 4.079e+01 4.251 6.37e-05 ***
file$ArtistDisney -8.179e-14 2.150e+01 0.000 1.000000
file$ArtistDJ Khaled 6.100e+01 2.280e+01 2.675 0.009266 **
file$ArtistDrake 7.500e+01 1.493e+01 5.024 3.63e-06 ***
file$ArtistDrake & Future 7.000e+01 2.280e+01 3.070 0.003033 **
file$ArtistDreamWorks 8.081e-14 2.633e+01 0.000 1.000000
file$ArtistDuck Dynasty -7.019e-13 2.633e+01 0.000 1.000000
file$ArtistEd Sheeran 6.890e+01 2.824e+01 2.440 0.017179 *
file$ArtistEminem 6.567e+01 1.700e+01 3.864 0.000244 ***
file$ArtistEric Church 8.615e+01 3.508e+01 2.456 0.016503 *
file$ArtistFall Out Boy -1.000e+00 2.633e+01 -0.038 0.969811
file$ArtistFetty Wap 6.800e+01 2.280e+01 2.982 0.003920 **
file$ArtistFlorida Georgia Line 6.650e+00 3.508e+01 0.190 0.850186
file$Artistfun 6.440e+01 3.115e+01 2.067 0.042367 *
file$ArtistFuture 7.350e+01 1.862e+01 3.948 0.000184 ***
file$ArtistG-Eazy 7.400e+01 2.280e+01 3.245 0.001791 **
file$ArtistGoyte -4.000e+00 2.633e+01 -0.152 0.879682
file$ArtistHozier 8.640e+01 3.861e+01 2.238 0.028365 *
file$ArtistHunter Hayes 6.650e+00 3.747e+01 0.177 0.859634
file$ArtistImagine Dragons -2.300e+01 2.280e+01 -1.009 0.316560
file$ArtistJ Cole 7.033e+01 1.700e+01 4.138 9.49e-05 ***
file$ArtistJason Aldean 3.715e+01 3.382e+01 1.098 0.275734
file$ArtistJay-Z 6.000e+01 2.280e+01 2.631 0.010426 *
file$ArtistJohn Legend 6.070e+01 3.331e+01 1.823 0.072583 .
file$ArtistJuice WRLD 3.050e+01 1.862e+01 1.638 0.105806
file$ArtistJustin Bieber 7.040e+01 2.666e+01 2.641 0.010160 *
file$ArtistJustin Timberlake 7.190e+01 2.824e+01 2.546 0.013054 *
file$ArtistKane Brown 6.650e+00 3.747e+01 0.177 0.859634
file$ArtistKanye West 7.500e+01 2.280e+01 3.289 0.001566 **
file$ArtistKanye West & Jay-Z 7.600e+01 2.280e+01 3.333 0.001367 **
file$ArtistKaty Perry 6.090e+01 2.824e+01 2.157 0.034405 *
file$ArtistKelly Clarkson 7.300e+01 2.633e+01 2.773 0.007099 **
file$ArtistKendrick Lamar 9.300e+01 1.862e+01 4.995 4.06e-06 ***
file$ArtistKesha 1.404e+02 4.079e+01 3.442 0.000972 ***
file$ArtistKevin Gates 8.100e+01 2.280e+01 3.552 0.000683 ***
file$ArtistKhalid 1.770e+01 3.059e+01 0.579 0.564710
file$ArtistLady Antebellum 6.965e+01 3.508e+01 1.986 0.050951 .
file$ArtistLady Gaga 7.780e+01 2.428e+01 3.205 0.002025 **
file$ArtistLana Del Rey 6.640e+01 3.115e+01 2.131 0.036524 *
file$ArtistLil Baby 2.779e-14 2.280e+01 0.000 1.000000
file$ArtistLil Baby & Gunna 7.600e+01 2.280e+01 3.333 0.001367 **
file$ArtistLil Uzi Vert 7.500e+01 2.280e+01 3.289 0.001566 **
file$ArtistLil Wayne 6.633e+01 1.700e+01 3.903 0.000214 ***
file$ArtistLionel Richie 7.840e+01 3.115e+01 2.517 0.014113 *
file$ArtistLittle Big Town 8.165e+01 3.747e+01 2.179 0.032638 *
file$ArtistLizzo 8.400e+01 2.280e+01 3.684 0.000445 ***
file$ArtistLMFAO 1.334e+02 4.079e+01 3.270 0.001659 **
file$ArtistLorde 8.340e+01 3.115e+01 2.677 0.009219 **
file$ArtistLuke Bryan 4.832e+01 3.425e+01 1.411 0.162647
file$ArtistLuke Combs 6.650e+00 3.747e+01 0.177 0.859634
file$ArtistMacklemore & Ryan Lewis 7.400e+01 2.280e+01 3.245 0.001791 **
file$ArtistMaroon 5 6.007e+01 2.719e+01 2.209 0.030413 *
file$ArtistMeek Mill 7.700e+01 2.280e+01 3.377 0.001193 **
file$ArtistMeghan Trainor 6.340e+01 3.115e+01 2.035 0.045580 *
file$ArtistMetalica -8.437e-14 2.633e+01 0.000 1.000000
file$ArtistMichael Buble NA NA NA NA
file$ArtistMigos 7.400e+01 1.862e+01 3.975 0.000167 ***
file$ArtistMiley Cyrus 6.540e+01 3.115e+01 2.099 0.039351 *
file$ArtistMiranda Lambert 9.265e+01 3.747e+01 2.473 0.015804 *
file$ArtistMumford & Sons -7.500e+00 2.280e+01 -0.329 0.743190
file$ArtistNicki Minaj 6.900e+01 1.862e+01 3.706 0.000414 ***
file$ArtistOf Monsters and Men 6.790e+01 3.861e+01 1.759 0.082931 .
file$ArtistOne Direction 6.840e+01 2.666e+01 2.566 0.012402 *
file$ArtistOneRepublic -8.000e+00 2.633e+01 -0.304 0.762141
file$ArtistPanic! At the Disco 7.440e+01 3.115e+01 2.388 0.019597 *
file$ArtistPentatonix 2.167e+01 3.134e+01 0.691 0.491590
file$ArtistPharrell 7.140e+01 3.115e+01 2.292 0.024884 *
file$ArtistPhillip Phillips 6.540e+01 3.115e+01 2.099 0.039351 *
file$ArtistPink 8.140e+01 3.115e+01 2.613 0.010953 *
file$ArtistPost Malone 2.550e+01 1.862e+01 1.370 0.175117
file$ArtistQueen 7.000e+01 2.633e+01 2.659 0.009690 **
file$ArtistRihanna 7.440e+01 2.824e+01 2.635 0.010324 *
file$ArtistRIhanna 6.690e+01 2.824e+01 2.369 0.020542 *
file$ArtistRobin Thicke 4.400e+00 3.115e+01 0.141 0.888085
file$ArtistSade 8.640e+01 3.861e+01 2.238 0.028365 *
file$ArtistSam Hunt 6.650e+00 3.747e+01 0.177 0.859634
file$ArtistSam Smith 7.140e+01 2.824e+01 2.529 0.013672 *
file$ArtistScotty McCreery 6.650e+00 3.747e+01 0.177 0.859634
file$ArtistShawn Mendes 4.090e+01 2.824e+01 1.449 0.151874
file$ArtistSia 7.140e+01 3.115e+01 2.292 0.024884 *
file$ArtistSony Pictures -8.989e-14 2.633e+01 0.000 1.000000
file$ArtistSusan Boyle NA NA NA NA
file$ArtistSZA 7.470e+01 3.331e+01 2.243 0.028026 *
file$ArtistTaylor Swift 7.965e+01 2.666e+01 2.988 0.003854 **
file$ArtistThe Band Perry 6.650e+00 3.747e+01 0.177 0.859634
file$ArtistThe Black Eyed Peas 6.440e+01 3.115e+01 2.067 0.042367 *
file$ArtistThe Black Keys 1.000e+01 2.280e+01 0.439 0.662319
file$ArtistThe Lumineers NA NA NA NA
file$ArtistThe Weeknd 5.920e+01 3.059e+01 1.935 0.056961 .
file$ArtistThomas Rhett 6.650e+00 3.747e+01 0.177 0.859634
file$ArtistTravis Scott 7.450e+01 1.862e+01 4.001 0.000153 ***
file$ArtistUniversal Studios 2.167e+01 2.150e+01 1.008 0.316958
file$ArtistUsher 4.570e+01 3.331e+01 1.372 0.174332
file$ArtistVarious 3.005e-14 2.280e+01 0.000 1.000000
file$ArtistWarner Bros. -1.194e-13 2.633e+01 0.000 1.000000
file$ArtistXXXTentacion NA NA NA NA
file$ArtistZac Brown Band 3.032e+01 3.425e+01 0.885 0.379001
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18.62 on 71 degrees of freedom
Multiple R-squared: 0.8721, Adjusted R-squared: 0.6415
F-statistic: 3.782 on 128 and 71 DF, p-value: 3.626e-09
So how does this model differ from the previous model? Let’s analyze:
- The residual standard error is smaller than that of the previous model by about 8 points (rounding model 1’s RSE to 27 and model 2’s RSE to 19). In the context of Metacritic scores, 19 is a much smaller RSE than 27, so this helps improve the model’s accuracy.
- This model’s Multiple R-Squared and Adjusted R-Squared are considerably larger than those of the previous model (87.21% and 64.15%, respectively, while the previous model’s Multiple R-Squared and Adjusted R-Squared were 31.4% and 25.81%, respectively). This model’s R-Squared (both multiple and adjusted) implies that there is a strong correlation between an album’s genre and Metacritic score if the album’s artist is factored into the analysis.
- Just as with the previous model, the independent variables are divided into sub-categories which encompass all the possible values for that variable. All of the genres listed in the dataset are present, as are all of the artists listed in the dataset. The significance codes are also present here, but this time they are present for both the genre and artist subcategories. Any genre that has either two or three asterisks beside it does significantly affect an album’s Metacritic score, and the same logic applies for any artists with two or three asterisks beside their name.
- So, what are the genres that most significantly impact an album’s Metacritic score (this is different from the previous model):
- Musical
- Rock
- Now, which artists are most likely to have a significant impact on an album’s Metacritic score?
- 21 Savage
- Ariana Grande
- Beyonce
- Billie Eilish
- Cardi B
- Daft Punk
- DJ Khaled
- Drake
- Drake & Future (they did an album together so I listed them both as the artist)
- Eminem
- Fetty Wap
- Future
- G-Eazy
- J Cole
- Kanye West
- Kanye West & Jay-Z
- Kelly Clarkson
- Kendrick Lamar
- Kesha
- Kevin Gates
- Lady Gaga
- Lil Baby & Gunna
- Lil Uzi Vert
- Lil Wayne
- Lizzo
- LMFAO
- Lorde
- Macklemore & Ryan Lewis
- Meek Mill
- Migos
- Nicki Minaj
- Queen
- Taylor Swift
- Travis Scott
- Yes, 34 of the 120 artists listed have a significant impact on an album’s Metacritic score. And 23 of them are rappers (though keep in mind that if two artists made an album/mixtape together, I listed them on the same bullet point).
- Personally, I think it’s interesting that Queen is on this list. But that could be just because of the 2018 Queen movie Bohemian Rhapsody.
- Remember how I said to disregard the F-statistic and corresponding p-value when I was analyzing the previous model? Since this linear regression model has two independent variables, the F-statistic and corresponding p-value are important. The F-statistic tests whether the independent variables, taken together, explain the dependent variable better than a model with no independent variables at all. However, the F-statistic must be analyzed in conjunction with the p-value in order to get a sense of the independent variables’ relationship with the dependent variable.
- The concepts of null and alternative hypotheses are important here.
- The null hypothesis states that the independent variables DON’T have a significant impact on the dependent variable, while the alternative hypothesis states the opposite.
- If the p-value is less than 0.001, you can safely reject the null hypothesis. Since the p-value in this model (3.626e-09) is much less than 0.001, you can reject the null hypothesis. This means that the combination of an album’s genre and respective artist does impact the album’s Metacritic score.
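For anyone curious where the F-statistic and its p-value come from, both can be rebuilt from numbers already printed in the model 2 summary. This is just a sketch with my own variable names; `pf()` is base R’s F-distribution function:

```r
# Values copied from the summary(model2) output above
r_squared <- 0.8721  # Multiple R-squared
num_df <- 128        # numerator degrees of freedom (estimated coefficients, minus the intercept)
den_df <- 71         # residual degrees of freedom

# F = (R^2 / num_df) / ((1 - R^2) / den_df)
f_stat <- (r_squared / num_df) / ((1 - r_squared) / den_df)
print(round(f_stat, 2))  # 3.78

# The p-value is the upper-tail probability of the F distribution
p_value <- pf(f_stat, num_df, den_df, lower.tail = FALSE)
print(p_value < 0.001)   # TRUE
```

Because the R-squared in the summary is rounded, the rebuilt F-statistic only agrees with the reported 3.782 to a couple of decimal places, but that’s close enough to see how the pieces fit together.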
So, which model is better? After analyzing each model, I’d say model #2 is much better than model #1, and not just because model #2 has two independent variables (though that certainly helps make the model more accurate). Model #2 has a lower RSE, higher Multiple/Adjusted R-Squared, 36 statistically significant subcategories (2 for genre and 34 for artist), and a p-value low enough to reject the null hypothesis, all of which help make it the better model.
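One last tip: with a coefficient table as long as model 2’s, counting asterisks by eye gets tedious. Here’s a rough sketch of how you could pull out the two-or-three-asterisk subcategories (p-values below 0.01) with code instead. It runs on a made-up data frame, so `toy`, `toy_model`, and the genre labels are all hypothetical:

```r
# Hypothetical toy data, NOT the Billboard dataset
toy <- data.frame(
  Genre = factor(rep(c("A", "B", "C"), each = 3)),
  Metacritic = c(10, 11, 12, 50, 51, 52, 10.5, 11.5, 12.5)
)

toy_model <- lm(Metacritic ~ Genre, data = toy)

# The coefficient table is stored as a matrix inside the summary object
coef_table <- summary(toy_model)$coefficients
p_values <- coef_table[, "Pr(>|t|)"]

# Leave out the intercept, then keep anything with p < 0.01 (** or ***)
p_values <- p_values[names(p_values) != "(Intercept)"]
significant <- names(p_values)[p_values < 0.01]
print(significant)  # "GenreB"
```

Swapping `toy_model` for your own model object and using `length(significant)` would give you the count directly.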
Thanks for reading and here’s to lots of great content in 2020,
Michael