Hello everybody,
It’s Michael here, and today’s post is an R analysis about the COVID-19 pandemic using ANOVA & K-means clustering. I haven’t done many analyses on current events, so I thought I’d change things up here a little (especially since COVID-19 is an important moment in modern history).
Here’s the dataset: COVID-19 Worldwide (4-12-20)
First of all, let’s load our dataset into R and take a look at its structure:
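A minimal sketch of the loading step, assuming the dataset has been saved as a CSV file. The tiny three-row file written to a temporary location is made up for illustration; only the data frame name file comes from the code used later in this post.

```r
# Stand-in for the real dataset: a tiny CSV written to a temporary file.
# (The values are made up; point csv_path at your own copy of the data.)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Country,Total Cases,New Cases,Lockdown",
  "A,100,5,Curfew",
  "B,250,,Lockdown",
  "C,40,1,None"
), csv_path)

# read.csv() turns spaces in column names into dots (Total Cases -> Total.Cases);
# stringsAsFactors = TRUE stores text columns such as Lockdown as factors
file <- read.csv(csv_path, stringsAsFactors = TRUE)

str(file)  # variable names, types, and the first few values
dim(file)  # number of rows (samples) and columns (variables)
```

Empty cells in a numeric column (like the second row's New Cases here) are read in as NA, which is where missing values come from.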

This dataset contains 213 samples and 14 variables. Let’s learn more about each variable:
- Country: The name of the country whose COVID-19 cases are being analyzed. Keep in mind that not every value in this column is a country. Some exceptions include:
  - World: The total number of coronavirus cases worldwide
  - Reunion: A French island located in the Indian Ocean
  - Diamond Princess: A Carnival-owned cruise ship that departed from Yokohama, Japan
- Total Cases: The total number of COVID-19 cases in a particular country.
  - Keep in mind that these are the totals as of April 12, 2020. The numbers have (very likely) changed a lot since this writing.
- New Cases: The number of new COVID-19 cases in a particular country on April 12, 2020.
- Total Deaths: The total number of people who have died in a particular country from COVID-19.
- New Deaths: The number of people in a particular country who died from COVID-19 on April 12, 2020.
- Total Recovered: The total number of people in a particular country who have recovered from COVID-19.
- Active Cases: The number of active cases in a particular country as of April 12, 2020.
- Serious Critical: The number of serious or critical cases in a particular country.
- Total Cases 1M pop: The total number of cases in a particular country per 1 million residents.
- Deaths 1M pop: The total number of deaths in a particular country per 1 million residents.
- Total Tests: The total number of COVID-19 tests a particular country has administered.
- Tests 1M pop: The total number of tests a particular country has administered per 1 million residents.
- Continent: The continent a country is located on (if applicable).
- Lockdown: The measures a particular country has taken to stop the spread of COVID-19, which can include:
  - Curfew: country has enacted a curfew for its citizens, which restricts their movement at certain hours of the day
  - Lockdown: country has enacted a mass lockdown/quarantine (terms are used interchangeably), requiring all citizens to stay at home at all hours of the day with few exceptions (e.g. getting groceries, performing an essential job)
  - No large gatherings: country bans all large gatherings (e.g. concerts, festivals)
    - OK, so almost every country I’ve analyzed bans or strongly discourages large gatherings. This label is for countries that have only banned large gatherings without enacting any other significant restrictions.
  - None: country has implemented no major restrictions on its citizens
  - Partial curfew: country has enacted a curfew only in certain regions or for certain groups of people (e.g. elderly)
  - Partial lockdown: country has enacted a lockdown only in certain regions or for certain groups of people
  - Social distancing: country has enacted social distancing measures, such as closing schools, bars, and restaurants
    - Almost every affected country has enacted some form of social distancing, but this label only applies to those countries that have only enacted social distancing rules without any other major restrictions.
  - Stay-at-home: country has issued a nationwide stay-at-home order, which requires citizens to stay at home except for essential activities (e.g. grocery shopping, working at an essential business) and outdoor exercise, provided social distancing is maintained.
  - Travel restrictions: country has enacted travel restrictions, restricting citizens from leaving and/or foreigners from entering.
    - President Trump’s Europe travel ban doesn’t factor into this label.
  - Various: country has enacted various measures throughout the nation
    - The US is the only country that has this label, as all 50 states have responded differently to the COVID-19 crisis (for instance, 43 states have issued stay-at-home orders, but 7 haven’t). Likewise, each state has its own definition of “essential businesses”.
Something I want to mention about factor variables that I didn’t mention earlier: levels in factor variables are organized alphabetically. So for the Lockdown variable, Curfew would have a factor level of 1, as it’s the first value alphabetically. Likewise, Various would have a factor level of 11, as it’s the last value alphabetically.
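Here is a small sketch of that alphabetical ordering, using only four of the Lockdown labels (so the level numbers are smaller than in the full dataset). The vector name lockdown is made up for this example.

```r
# A few of the Lockdown labels, deliberately out of alphabetical order
lockdown <- factor(c("Various", "Curfew", "Social distancing", "Lockdown", "Curfew"))

levels(lockdown)      # levels are sorted alphabetically by default
as.integer(lockdown)  # the integer code (factor level) behind each value
```

Curfew gets code 1 because it sorts first, and Various gets the highest code because it sorts last.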
Now, let’s see how tidy our dataset is by seeing how many values are missing (remember to install the Amelia package and use the line missmap(file) to obtain the missingness map):
As you can see, 19% of all values in the dataset are missing, while the other 81% of values are present.
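If you want to double-check the percentages from the missingness map without any extra package, base R can compute them directly. This is only a sketch on a made-up four-row data frame, so the numbers it prints are not the ones from the real dataset.

```r
# Made-up stand-in for the real data frame
file <- data.frame(
  Total.Cases = c(100, 250, 40, 900),
  New.Cases   = c(5, NA, 1, NA),
  New.Deaths  = c(NA, NA, NA, 2)
)

overall_missing <- mean(is.na(file))      # share of all cells that are NA
missing_by_var  <- colMeans(is.na(file))  # share of NA values per variable

round(100 * overall_missing)  # percentage of missing cells
missing_by_var
```

The per-variable shares make it easy to spot columns (like New Deaths) that are mostly empty.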
- Yes, I know the dataset is not as tidy as the ones I’ve used in other posts. But I think this is a good thing, as many real-life datasets you’ll encounter could be considerably messier than this one.
So, what should I do about the missing values? First of all, since the New Deaths variable has too many missing values, I won’t use it in any of my analyses. As for the rest of the variables with missing values, I will simply replace those missing values with zeroes. As to why I’d do that: it’s the simplest way to keep every row usable. Keep in mind, though, that the zeroes do count toward each variable’s mean (missing values skipped with na.rm = TRUE wouldn’t), so the means come out lower than they would otherwise. Here’s how to do that:
file$New.Cases[is.na(file$New.Cases)] <- 0
file$Serious.Critical[is.na(file$Serious.Critical)] <- 0
file$Tests..1M.pop[is.na(file$Tests..1M.pop)] <- 0
file$Total.Tests[is.na(file$Total.Tests)] <- 0
file$Deaths..1M.pop[is.na(file$Deaths..1M.pop)] <- 0
file$Total.Deaths[is.na(file$Total.Deaths)] <- 0
file$Total.Recovered[is.na(file$Total.Recovered)] <- 0
These seven lines of code will replace any missing values within these seven variables with zeroes.
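The same replacement can also be written as a loop over column names, which saves some typing when many variables are involved. This is just an alternative sketch on a made-up data frame with two of the seven columns; the name cols_to_fill is my own.

```r
# Made-up stand-in containing two of the seven columns
file <- data.frame(
  New.Cases   = c(5, NA, 1),
  Total.Tests = c(NA, 2000, NA)
)

# Extend this vector with the other column names as needed
cols_to_fill <- c("New.Cases", "Total.Tests")

for (col in cols_to_fill) {
  file[[col]][is.na(file[[col]])] <- 0  # replace NA with 0 in this column
}

colSums(is.na(file[cols_to_fill]))  # remaining NA count per listed column
```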
Now, for this analysis I will be doing two one-way ANOVA models. Here’s my first one-way ANOVA model:
In this model, I used Total Cases as the dependent variable and Lockdown as the independent variable; I’m analyzing the relationship between a country’s containment measures and its total coronavirus cases.
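For reference, here is a sketch of what fitting such a model looks like. The name model1 and the formula are taken from the TukeyHSD output shown later in this post; the data frame below is made up (three labels, ten rows each), so its output will not match the real results.

```r
set.seed(1)  # make the toy data reproducible

# Made-up data: three containment labels, ten countries each
file <- data.frame(
  Lockdown    = factor(rep(c("Curfew", "Lockdown", "None"), each = 10)),
  Total.Cases = c(rnorm(10, 1000, 200), rnorm(10, 1500, 200), rnorm(10, 1100, 200))
)

# One-way ANOVA: Total.Cases is the dependent variable, Lockdown the independent one
model1 <- aov(file$Total.Cases ~ file$Lockdown)
summary(model1)  # F value, p-value, and significance codes
```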
As you can see, Lockdown isn’t that significant of a variable, as it only has a single period beside it, which means its p-value is between 0.05 and 0.1, which is too large to be considered statistically significant. The F-value is also low (1.749), which, along with the high p-value, means that there isn’t enough evidence to reject the null hypothesis (that a country’s containment measures don’t have a significant impact on their coronavirus cases).
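If you would rather read the F-value and p-value as numbers instead of from the printed table, they can be pulled out of the summary object. A sketch on made-up data, so the values differ from the 1.749 reported above; the names anova_table, f_value, and p_value are my own.

```r
set.seed(2)
file <- data.frame(
  Lockdown    = factor(rep(c("Curfew", "Lockdown", "None"), each = 8)),
  Total.Cases = rnorm(24, mean = 1000, sd = 300)
)
model1 <- aov(file$Total.Cases ~ file$Lockdown)

anova_table <- summary(model1)[[1]]     # the ANOVA table as a data frame
f_value <- anova_table[["F value"]][1]  # first row is the Lockdown term
p_value <- anova_table[["Pr(>F)"]][1]

f_value
p_value
p_value < 0.05  # TRUE would mean significant at the 5% level
```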
Nonetheless, let’s perform a Tukey’s HSD test to see if we can get some interesting insights:
TukeyHSD(model1, conf.level=0.99)
Tukey multiple comparisons of means
99% family-wise confidence level
Fit: aov(formula = file$Total.Cases ~ file$Lockdown)
$`file$Lockdown`
diff lwr upr p adj
Lockdown-Curfew 15932.078 -112699.14 144563.30 0.9999954
No large gatherings-Curfew 22064.565 -477726.47 521855.60 1.0000000
None-Curfew 10803.501 -99222.47 120829.47 0.9999995
Partial curfew-Curfew 53433.565 -446357.47 553224.60 0.9999989
Partial lockdown-Curfew 4838.065 -355855.05 365531.18 1.0000000
Self-quarantine-Curfew -3134.435 -502925.47 496656.60 1.0000000
Social distancing-Curfew 61926.065 -298767.05 422619.18 0.9999023
Stay-at-home-Curfew 20215.565 -479575.47 520006.60 1.0000000
Travel restrictions-Curfew 3225.565 -496565.47 503016.60 1.0000000
Various-Curfew 546553.565 46762.53 1046344.60 0.0027658
No large gatherings-Lockdown 6132.487 -489368.42 501633.40 1.0000000
None-Lockdown -5128.577 -93648.53 83391.37 1.0000000
Partial curfew-Lockdown 37501.487 -457999.42 533002.40 1.0000000
Partial lockdown-Lockdown -11094.013 -365818.68 343630.65 1.0000000
Self-quarantine-Lockdown -19066.513 -514567.42 476434.40 1.0000000
Social distancing-Lockdown 45993.987 -308730.68 400718.65 0.9999929
Stay-at-home-Lockdown 4283.487 -491217.42 499784.40 1.0000000
Travel restrictions-Lockdown -12706.513 -508207.42 482794.40 1.0000000
Various-Lockdown 530621.487 35120.58 1026122.40 0.0038187
None-No large gatherings -11261.064 -502260.94 479738.81 1.0000000
Partial curfew-No large gatherings 31369.000 -660560.36 723298.36 1.0000000
Partial lockdown-No large gatherings -17226.500 -616454.91 582001.91 1.0000000
Self-quarantine-No large gatherings -25199.000 -717128.36 666730.36 1.0000000
Social distancing-No large gatherings 39861.500 -559366.91 639089.91 1.0000000
Stay-at-home-No large gatherings -1849.000 -693778.36 690080.36 1.0000000
Travel restrictions-No large gatherings -18839.000 -710768.36 673090.36 1.0000000
Various-No large gatherings 524489.000 -167440.36 1216418.36 0.1478284
Partial curfew-None 42630.064 -448369.81 533629.94 0.9999999
Partial lockdown-None -5965.436 -354375.13 342444.26 1.0000000
Self-quarantine-None -13937.936 -504937.81 477061.94 1.0000000
Social distancing-None 51122.564 -297287.13 399532.26 0.9999773
Stay-at-home-None 9412.064 -481587.81 500411.94 1.0000000
Travel restrictions-None -7577.936 -498577.81 483421.94 1.0000000
Various-None 535750.064 44750.19 1026749.94 0.0028637
Partial lockdown-Partial curfew -48595.500 -647823.91 550632.91 0.9999999
Self-quarantine-Partial curfew -56568.000 -748497.36 635361.36 0.9999999
Social distancing-Partial curfew 8492.500 -590735.91 607720.91 1.0000000
Stay-at-home-Partial curfew -33218.000 -725147.36 658711.36 1.0000000
Travel restrictions-Partial curfew -50208.000 -742137.36 641721.36 1.0000000
Various-Partial curfew 493120.000 -198809.36 1185049.36 0.2174452
Self-quarantine-Partial lockdown -7972.500 -607200.91 591255.91 1.0000000
Social distancing-Partial lockdown 57088.000 -432179.94 546355.94 0.9999974
Stay-at-home-Partial lockdown 15377.500 -583850.91 614605.91 1.0000000
Travel restrictions-Partial lockdown -1612.500 -600840.91 597615.91 1.0000000
Various-Partial lockdown 541715.500 -57512.91 1140943.91 0.0327333
Social distancing-Self-quarantine 65060.500 -534167.91 664288.91 0.9999987
Stay-at-home-Self-quarantine 23350.000 -668579.36 715279.36 1.0000000
Travel restrictions-Self-quarantine 6360.000 -685569.36 698289.36 1.0000000
Various-Self-quarantine 549688.000 -142241.36 1241617.36 0.1052700
Stay-at-home-Social distancing -41710.500 -640938.91 557517.91 1.0000000
Travel restrictions-Social distancing -58700.500 -657928.91 540527.91 0.9999995
Various-Social distancing 484627.500 -114600.91 1083855.91 0.0914743
Travel restrictions-Stay-at-home -16990.000 -708919.36 674939.36 1.0000000
Various-Stay-at-home 526338.000 -165591.36 1218267.36 0.1443175
Various-Travel restrictions 543328.000 -148601.36 1235257.36 0.1149652
Even though Lockdown isn’t statistically significant in the overall model, three of these pairwise differences do fall below the 0.01 threshold, and all three involve Various (i.e. the US): Various-Curfew (adjusted p-value of 0.0028), Various-None (0.0029), and Various-Lockdown (0.0038). The next closest to being statistically significant is Various-partial lockdown, which has an adjusted p-value of 0.033 (rounded to three decimal places).
Various and partial lockdown are somewhat similar to each other, as some of the US’s (the only country with the various label) containment measures include stay-at-home orders in 43 states, which have the same intent as partial lockdowns but allow for outdoor exercise.
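With this many rows, it helps to sort the Tukey table by adjusted p-value instead of scanning it by eye. A sketch on made-up data with four labels (so only six pairs); the names tukey and pairs are my own, and the formula here uses the data argument so that the result can be accessed as tukey$Lockdown.

```r
set.seed(3)
file <- data.frame(
  Lockdown    = factor(rep(c("Curfew", "Lockdown", "None", "Various"), each = 6)),
  Total.Cases = c(rnorm(18, 1000, 200), rnorm(6, 5000, 200))
)
model1 <- aov(Total.Cases ~ Lockdown, data = file)

tukey <- TukeyHSD(model1, conf.level = 0.99)
pairs <- tukey$Lockdown  # matrix with columns diff, lwr, upr, p adj

# Smallest adjusted p-values first
pairs[order(pairs[, "p adj"]), ]
```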
Now, let’s see what happens when we change the dependent variable but keep the same independent variable, Lockdown (keep in mind I’m still doing one-way ANOVA):
Ok, so this time I used Total.Tests as the dependent variable, which means I’m analyzing the effect of a country’s containment measures on their total number of COVID-19 tests. And this time, it looks like Lockdown is a very statistically significant variable, as it has the highest significance code (three asterisks) and a p-value far less than 0.001, as well as a correspondingly high F-value of 54.95. In this case, we can reject the null hypothesis and conclude that a country’s containment measures do have an effect on the number of COVID-19 tests they have administered.
Since Lockdown has more statistical significance this time, the Tukey’s HSD test should have plenty of statistically significant pairwise differences. Let’s find out:
TukeyHSD(model2, conf.level=0.99)
Tukey multiple comparisons of means
99% family-wise confidence level
Fit: aov(formula = file$Total.Tests ~ file$Lockdown.)
$`file$Lockdown.`
diff lwr upr p adj
Lockdown-Curfew 96151.24 -32955.08 225257.552 0.1662125
No large gatherings-Curfew 73822.83 -427814.18 575459.833 0.9999766
None-Curfew -10792.37 -121224.72 99639.985 0.9999995
Partial curfew-Curfew 348388.83 -153248.18 850025.833 0.2508108
Partial lockdown-Curfew 606940.83 244915.50 968966.152 0.0000001
Self-quarantine-Curfew 18835.83 -482801.18 520472.833 1.0000000
Social distancing-Curfew 644769.83 282744.50 1006795.152 0.0000000
Stay-at-home-Curfew 373840.83 -127796.18 875477.833 0.1654945
Travel restrictions-Curfew 41059.83 -460577.18 542696.833 0.9999999
Various-Curfew 2737994.83 2236357.82 3239631.833 0.0000000
No large gatherings-Lockdown -22328.41 -519659.44 475002.620 1.0000000
None-Lockdown -106943.60 -195790.50 -18096.706 0.0005266
Partial curfew-Lockdown 252237.59 -245093.44 749568.620 0.7128433
Partial lockdown-Lockdown 510789.59 154754.75 866824.425 0.0000104
Self-quarantine-Lockdown -77315.41 -574646.44 420015.620 0.9999610
Social distancing-Lockdown 548618.59 192583.75 904653.425 0.0000014
Stay-at-home-Lockdown 277689.59 -219641.44 775020.620 0.5800281
Travel restrictions-Lockdown -55091.41 -552422.44 442239.620 0.9999984
Various-Lockdown 2641843.59 2144512.56 3139174.620 0.0000000
None-No large gatherings -84615.19 -577428.56 408198.178 0.9999022
Partial curfew-No large gatherings 274566.00 -419918.99 969050.989 0.9230867
Partial lockdown-No large gatherings 533118.00 -68323.64 1134559.643 0.0400545
Self-quarantine-No large gatherings -54987.00 -749471.99 639497.989 0.9999999
Social distancing-No large gatherings 570947.00 -30494.64 1172388.643 0.0190437
Stay-at-home-No large gatherings 300018.00 -394466.99 994502.989 0.8706277
Travel restrictions-No large gatherings -32763.00 -727247.99 661721.989 1.0000000
Various-No large gatherings 2664172.00 1969687.01 3358656.989 0.0000000
Partial curfew-None 359181.19 -133632.18 851994.561 0.1904062
Partial lockdown-None 617733.19 268036.66 967429.727 0.0000000
Self-quarantine-None 29628.19 -463185.18 522441.561 1.0000000
Social distancing-None 655562.19 305865.66 1005258.727 0.0000000
Stay-at-home-None 384633.19 -108180.18 877446.561 0.1202463
Travel restrictions-None 51852.19 -440961.18 544665.561 0.9999991
Various-None 2748787.19 2255973.82 3241600.561 0.0000000
Partial lockdown-Partial curfew 258552.00 -342889.64 859993.643 0.8741210
Self-quarantine-Partial curfew -329553.00 -1024037.99 364931.989 0.7887362
Social distancing-Partial curfew 296381.00 -305060.64 897822.643 0.7474791
Stay-at-home-Partial curfew 25452.00 -669032.99 719936.989 1.0000000
Travel restrictions-Partial curfew -307329.00 -1001813.99 387155.989 0.8523867
Various-Partial curfew 2389606.00 1695121.01 3084090.989 0.0000000
Self-quarantine-Partial lockdown -588105.00 -1189546.64 13336.643 0.0133169
Social distancing-Partial lockdown 37829.00 -453246.04 528904.045 1.0000000
Stay-at-home-Partial lockdown -233100.00 -834541.64 368341.643 0.9320348
Travel restrictions-Partial lockdown -565881.00 -1167322.64 35560.643 0.0211145
Various-Partial lockdown 2131054.00 1529612.36 2732495.643 0.0000000
Social distancing-Self-quarantine 625934.00 24492.36 1227375.643 0.0058011
Stay-at-home-Self-quarantine 355005.00 -339479.99 1049489.989 0.7029526
Travel restrictions-Self-quarantine 22224.00 -672260.99 716708.989 1.0000000
Various-Self-quarantine 2719159.00 2024674.01 3413643.989 0.0000000
Stay-at-home-Social distancing -270929.00 -872370.64 330512.643 0.8377204
Travel restrictions-Social distancing -603710.00 -1205151.64 -2268.357 0.0095176
Various-Social distancing 2093225.00 1491783.36 2694666.643 0.0000000
Travel restrictions-Stay-at-home -332781.00 -1027265.99 361703.989 0.7785425
Various-Stay-at-home 2364154.00 1669669.01 3058638.989 0.0000000
Various-Travel restrictions 2696935.00 2002450.01 3391419.989 0.0000000
In this Tukey’s HSD test, there are several statistically significant pair-wise differences (those with a p-value of 0.01 or less). One thing I’ve noticed is that most of the statistically significant pair-wise differences have a lot in common. Here are three examples of this:
- Partial lockdown-curfew (p-value of 0.0000001): Partial lockdowns and curfews are similar since both are only partial restrictions of citizens’ movements outside of their homes, whether at certain hours of the day (in the case of curfew) or only for certain (usually high-risk) people (in the case of partial lockdown).
- Social distancing-lockdown (p-value of 0.0000014): Social distancing merely asks people to drastically limit their contact with others outside of the home and to stay at least 6 feet away from people when in public places like parks and grocery stores. A lockdown is a stricter form of social distancing which asks people to stay in their homes 24/7 except for essential activities (e.g. grocery shopping, working in essential jobs); for the most part, you can’t go outside for exercise during a lockdown (though countries under lockdown like the UK allow for outdoor exercise).
- Social distancing-self quarantine (p-value of 0.0058011): Self-quarantine is a stricter form of social distancing where someone wouldn’t leave their residence (or invite anybody into their residence) for at least two weeks; self-quarantining could also mean staying in a separate room for at least two weeks from other people living in your house. Self-quarantining is usually only necessary if you’ve been to a COVID-19 hotspot or have been in contact with someone who has COVID-19 within the last two weeks (e.g. coworkers, friends).
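To pick out the significant pairs without reading every row, you can filter the Tukey table at the 0.01 level. The sketch below uses made-up data and my own variable names; it also shows that a pair is significant at this level when its 99% interval (lwr to upr) does not contain zero, which is a handy way to read the table above.

```r
set.seed(4)
file <- data.frame(
  Lockdown    = factor(rep(c("Curfew", "Lockdown", "None", "Various"), each = 6)),
  Total.Tests = c(rnorm(18, 1000, 200), rnorm(6, 5000, 200))
)
model2 <- aov(Total.Tests ~ Lockdown, data = file)
pairs  <- TukeyHSD(model2, conf.level = 0.99)$Lockdown

sig_by_p  <- pairs[, "p adj"] < 0.01                  # adjusted p-value below 0.01
sig_by_ci <- pairs[, "lwr"] > 0 | pairs[, "upr"] < 0  # 99% interval excludes zero

pairs[sig_by_p, , drop = FALSE]  # only the significant pairs
```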
Alright, now since there are so many variables in this set, I thought I’d do another analysis that I haven’t done in quite some time: a k-means clustering. And yes, this is the first time I’m doing two different analyses in one post.
Anyway, let’s start building our cluster, first using the variables Total Cases and Total Tests:
In this example, I created a cluster using the second and eleventh variables in the dataset (Total Cases and Total Tests). I also displayed the head, or first six observations in the cluster (use tail if you want to see the last six observations).
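A sketch of that subsetting step. The name cluster1 comes from the plot call later in this post; the 14-column data frame below is a made-up stand-in that only fills in columns 2 and 11.

```r
# Made-up stand-in with the same column positions as described above:
# column 2 is Total.Cases and column 11 is Total.Tests
file <- as.data.frame(matrix(0, nrow = 8, ncol = 14))
names(file)[c(2, 11)] <- c("Total.Cases", "Total.Tests")
file$Total.Cases <- c(10, 20, 15, 400, 420, 390, 5000, 12)
file$Total.Tests <- c(100, 150, 120, 9000, 9500, 8800, 300, 110)

cluster1 <- file[, c(2, 11)]  # keep only the second and eleventh variables
head(cluster1)                # first six observations
```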
Now let’s do some k-means clustering:
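A sketch of the clustering call itself. The object name covid19 and the choice of six clusters are inferred from the plot call and the output described in this post; the 30 observations below are made up, and the seed is my own addition so the result is repeatable.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed for repeatability

# Made-up stand-in for cluster1: 30 observations scattered around three centers
cluster1 <- data.frame(
  Total.Cases = c(rnorm(10, 100, 10), rnorm(10, 5000, 300), rnorm(10, 20000, 1000)),
  Total.Tests = c(rnorm(10, 1000, 100), rnorm(10, 40000, 2000), rnorm(10, 90000, 5000))
)

covid19 <- kmeans(cluster1, centers = 6)  # ask for six clusters
covid19  # prints sizes, cluster means, the clustering vector, and sums of squares
```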
So, since it’s been a while since I’ve done a k-means clustering analysis, here’s what each of these output blocks means:
- The first line always goes K-means clustering with (X) clusters of sizes _____. This tells you how many clusters were created and the sizes of each cluster.
  - In this example, the clusters have sizes of 3, 9, 1, 33, 166, and 1, which means that cluster 1 has 3 observations, cluster 2 has 9 observations, clusters 3 and 6 have only one observation each, cluster 4 has 33 observations, and cluster 5 has 166 observations.
- The next block of output gives you all of the cluster means for each variable used in the clustering analysis: Total Cases and Total Tests. Means for each variable are grouped by cluster, so the means for Total Cases and Total Tests in cluster 1 are calculated from all the values for Total Cases and Total Tests in cluster 1 (the same logic applies to the other 5 clusters).
- After the cluster means, the next block of output contains the clustering vector, which shows you which observations belong to which cluster. Granted, the positions of the observations aren’t explicitly mentioned, but this vector starts with the first observation and ends with the 213th observation, so you can tell which observation you’re looking at. For instance, the first three observations belong to cluster 5 (which correspond to Afghanistan, Albania, and Algeria); the last three observations also happen to belong to cluster 5 (which correspond to Yemen, Zambia, and Zimbabwe).
- After the clustering vector, the next block contains the WCSSBC (within cluster sum-of-squares by cluster), which measures the variability of observations in a certain cluster. The smaller the WCSSBC is, the more compact the cluster. I’m guessing the two WCSSBC of 0 belong to clusters 3 and 6, as each cluster only has one observation. The WCSSBC of the other clusters are all over 10 million.
- In the same block of output as the WCSSBC, you’ll see the between_SS/total_SS amount, which is the ratio of the between sum-of-squares to the total sum-of-squares. In simple terms, this ratio measures the overall goodness-of-fit of the model: the higher this ratio is, the better the model fits the data. Since this ratio is 98.3%, the model is an excellent fit for the data.
- The available components section isn’t relevant for our analysis, so I won’t go into detail here.
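Each of the output blocks described above can also be accessed directly as a component of the k-means object. A sketch on made-up data with two clusters (the real analysis uses six), reusing the covid19 name from this post:

```r
set.seed(7)
cluster1 <- data.frame(
  Total.Cases = c(rnorm(10, 100, 10), rnorm(10, 5000, 300)),
  Total.Tests = c(rnorm(10, 1000, 100), rnorm(10, 40000, 2000))
)
covid19 <- kmeans(cluster1, centers = 2)

covid19$size      # number of observations in each cluster
covid19$centers   # cluster means for each variable
covid19$cluster   # clustering vector: the cluster assigned to each observation
covid19$withinss  # within-cluster sum of squares, one value per cluster

# The ratio printed as between_SS / total_SS, expressed as a percentage
100 * covid19$betweenss / covid19$totss
```

A cluster with a single observation has a within-cluster sum of squares of 0, since that observation is its own mean.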
Now, let’s graph our clusters (the plot function used here is part of base R, so the ggplot2 package isn’t actually needed for this graph):
Here’s the code to create the graph:
plot(cluster1, col=covid19$cluster, main="COVID-19 tests & cases", xlab = "COVID-19 cases", ylab="COVID-19 tests", pch=19)
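A note on how the coloring works: col=covid19$cluster passes the cluster numbers (1 through 6) as colors, and base R graphics treats an integer color as a position in the current palette. The sketch below lists which color each cluster number maps to; the variable names are my own, and the exact colors depend on your R version, since R 4.0.0 changed the default palette.

```r
cluster_ids    <- 1:6                     # the six cluster numbers produced by kmeans
cluster_colors <- palette()[cluster_ids]  # the color each number is drawn in

data.frame(cluster = cluster_ids, color = cluster_colors)
```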
In this graph, there are six clusters, which are grouped by color. Let’s analyze what each cluster means:
- The bright green cluster represents the worldwide total of COVID-19 cases on April 12, 2020, which was roughly 1.8 million at the time (but is roughly 2.4 million as of April 19, 2020). Since the worldwide total of tests administered wasn’t factored into this dataset, 0 is used as the worldwide total of tests administered.
- The pink cluster represents the COVID-19 cases in the US, as well as the number of tests the US has administered. The reason why the pink cluster sits so far from the rest is that the US had administered roughly 2.7 million tests as of April 12, 2020, and was also the only country that had over 500,000 cases (the number is now closer to 800,000 as of April 19, 2020).
- Not counting the worldwide total, the US still has the most COVID-19 cases as of this writing.
- The black cluster represents the countries that had anywhere between 0 and 200,000 COVID-19 cases as well as anywhere between a million and 1.5 million tests administered.
- The other three clusters (which are red, dark blue, and light blue) represent the countries that had anywhere between 0 and 200,000 COVID-19 cases but only administered between 0 and 750,000 tests.
All in all, the clusters are quite compact, but this graph does show that even though a country may have administered plenty of COVID-19 tests, this doesn’t mean that that country will have fewer cases. For instance, Italy had administered just over a million tests but had approximately 156,000 cases. On the other hand, the US had administered roughly 2.7 million tests but had over 500,000 cases (so in other words, the US administered more than twice as many tests as Italy but still had over triple Italy’s cases).
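The comparison above is simple arithmetic, sketched here with the rounded figures quoted in this post. Note the assumptions: Italy’s tests are approximated as exactly 1 million ("just over a million") and US cases as exactly 500,000 ("over 500,000"), so the real ratios differ slightly.

```r
# Rounded figures quoted in this post (April 12, 2020)
us_tests    <- 2.7e6
italy_tests <- 1.0e6   # approximation of "just over a million"
us_cases    <- 500000  # lower bound: the US had "over 500,000"
italy_cases <- 156000

tests_ratio <- us_tests / italy_tests  # how many times more tests the US ran
cases_ratio <- us_cases / italy_cases  # how many times more cases the US had

tests_ratio
cases_ratio
```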
Sure, administering more COVID-19 tests would help keep a country’s case count in check (just look at Germany), but if you only analyze the amount of tests a country has given, you’d be ignoring other factors that affect a country’s COVID-19 case count (such as whether or not a country has imposed a nationwide lockdown or something as simple as the country’s total population).
Thanks for reading, and remember to stay safe, wash your hands, and practice good social distancing. Together, we can flatten the curve!
Also, thank you to all the grocery store workers, firefighters, policemen, scientists, doctors, and anybody else who is helping us get through these rough times.
Michael