COVID-19 Data Analysis: What is the Data Really Telling Us? A Closer Look
--
COVID cases increasing, COVID cases decreasing. Infection rate increasing, infection rate decreasing. Death rate increasing, death rate decreasing. These three sentences have dominated headlines for almost a year now, but what does it all mean? What statistics should I analyze to determine my risk of catching COVID-19?
Before we continue, as a gentle reminder to all, it’s important to stay safe by taking precautions to minimize the risks of transmission. In this article, I will walk through what different statistics mean, and show you how to work with publicly available COVID-19 data sets. I hope that this article will provide you with insight on how to make sense of the data available to us.
Finding Data
First of all, where do you find data? Information should be posted regularly on the websites of your county and state departments of health. In the State of California, you can visit covid19.ca.gov to view numbers of cases, deaths, COVID-19 hospitalized patients, ICU beds available, and the testing positivity rate under “Dashboard”. A county-by-county breakdown of these statistics, as well as what is considered open, is also available under “Dashboard” and “County map”.
The next places you can check are the CDC COVID Data Tracker (for national data) and the Johns Hopkins Coronavirus Resource Center (for national and international data). This information is updated daily, and the JHU data is even part of a publicly available data set you can use to generate your own graphs, comparisons, and machine learning experiments in a tool like CoCalc.com or Kaggle.com.
Finally, Google Maps has recently added COVID-19 cases per 100,000 people (a 7-day average). To access this data in Google Maps, click on the menu (“hamburger”) button located on the search bar and click on “COVID-19 Info”. If you zoom into the map, you will be able to see state/provincial/regional data in some countries and county-wide data in the United States. Otherwise, country-level data will be displayed.
Getting Started with Data Analysis
The first thing you want to do is upload your files. If you wish to work with COVID data through CoCalc.com, follow the steps in the two figures below to get started. If you want to use Kaggle instead, search Kaggle for COVID-19 datasets from Johns Hopkins, copy the dataset, and simply press the “Edit” button.
To view the data, we’re going to insert the lines below into hello_world.ipynb. First, we import the pandas library, then tell the program which data file to read (in this case, it’s called ‘cases_country.csv’). The last line tells the program to display the first ten rows of the data. Once you have inserted these lines of code, press run and the table below should be displayed.
import pandas as pd
df_country = pd.read_csv('cases_country.csv')
df_country.head(10)
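Before sorting or plotting, it can help to confirm that the columns the later steps rely on are actually in the file. The following is a minimal sketch, assuming the file has the Country_Region, Deaths, and Mortality_Rate columns used in this article. The helper name missing_columns and the sample rows are hypothetical placeholders, not real figures.

```python
import io
import pandas as pd

# Columns the later steps in this article rely on (assumed from the code shown here).
EXPECTED_COLUMNS = ["Country_Region", "Deaths", "Mortality_Rate"]

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]

# Placeholder rows standing in for cases_country.csv; the numbers are made up.
sample_csv = io.StringIO(
    "Country_Region,Confirmed,Deaths,Mortality_Rate\n"
    "Country A,1000,50,5.0\n"
    "Country B,200,4,2.0\n"
)
df_sample = pd.read_csv(sample_csv)

# An empty list means every expected column was found.
print(missing_columns(df_sample))
```

With your own data, you would call missing_columns(df_country) right after reading the file; a non-empty result points to a renamed or missing column before it causes an error further down.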
Next, instead of keeping the rows in their original order, we’ll sort by a single statistic; for now, the number of deaths. We’ll also add a color gradient to emphasize the “intensity” of each number. In the table generated below, from data updated on June 4th at 8:33 pm, the number of deaths in the US was far greater than in the country with the second-largest number of deaths, the UK. As a result, the shade of red used for the UK is much closer to the shade used for Italy (the country with the third-largest number of deaths) than to the shade used for the United States. We’ve also applied a color gradient to the mortality rate, and you’ll see that many more data points share a similarly dark red, since several mortality rates are around 14–16 percent.
df_deaths_sorted = df_country.sort_values('Deaths', ascending=False)
df_deaths_sorted.head(10).style.background_gradient(cmap='Reds', subset=['Deaths', 'Mortality_Rate'])
Let’s put the data we have into a graph to put things into perspective. As I mentioned, the number of deaths in the US (in this sample) is much larger than in the other countries within the top ten. This is easier to see when you import plotly.express and create a graph with Country_Region on the x-axis and Deaths on the y-axis (as shown below).
import plotly.express as px
fig = px.scatter(df_deaths_sorted.head(10), x='Country_Region', y='Deaths', size='Deaths', size_max=60, color='Country_Region')
fig.show()
If you wanted to return to your table, you could take your project to the next step by adding deaths/1M pop back in, and setting a different color for your mortality rate gradient to distinguish “deaths/1M pop” and “mortality_rate”.
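One way to sketch that next step is shown here. It assumes you supply your own population figures, since the code above does not read any; the function names, the Deaths_per_1M column name, and the sample values are all hypothetical and made up for illustration.

```python
import pandas as pd

def add_deaths_per_million(df, population_by_country):
    """Return a copy of df with a Deaths_per_1M column.

    population_by_country maps Country_Region names to population counts.
    Countries missing from the mapping end up with NaN.
    """
    out = df.copy()
    population = out["Country_Region"].map(population_by_country)
    out["Deaths_per_1M"] = out["Deaths"] * 1_000_000 / population
    return out

def style_two_gradients(df):
    """Shade Deaths_per_1M in red and Mortality_Rate in blue."""
    return (
        df.style
        .background_gradient(cmap="Reds", subset=["Deaths_per_1M"])
        .background_gradient(cmap="Blues", subset=["Mortality_Rate"])
    )

# Made-up rows and populations, for illustration only.
df_demo = pd.DataFrame({
    "Country_Region": ["Country A", "Country B"],
    "Deaths": [50, 4],
    "Mortality_Rate": [5.0, 2.0],
})
populations = {"Country A": 1_000_000, "Country B": 40_000}
df_demo = add_deaths_per_million(df_demo, populations)
print(df_demo)
```

In a notebook, ending a cell with style_two_gradients(df_demo) should display the table with the two columns shaded in different colors, which makes it harder to confuse one measure with the other.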
Analyzing the Data
After looking at the figures above, you may have already noticed a difference between “deaths”, “deaths/1M pop”, and “mortality rate”. “Deaths” counts the total number of COVID-related deaths in the country, while “deaths/1M pop” scales that total to the country’s population. However, there is more than one way to calculate a fatality rate for COVID-19. Here’s how BBC News explains the two kinds of fatality rate:
There are, in fact, two kinds of fatality rate. The first is the proportion of people who die who have tested positive for the disease. This is called the “case fatality rate”. The second kind is the proportion of people who die after having the infection overall; as many of these will never be picked up, this figure has to be an estimate. This is the “infection fatality rate”.
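This explains why San Marino, despite having a much higher deaths/1M pop, has an extremely low mortality rate (in fact, on the low end for the ten countries/regions listed here), while Belgium’s mortality rate is extremely high. Deaths/1M pop is measured against the whole population, while a fatality rate is measured against the people who were infected or tested positive.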
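To make the “case fatality rate” from the quote concrete, here is a minimal sketch that computes it as deaths per 100 confirmed cases. It assumes the table has Confirmed and Deaths columns; the function name, the CFR_percent column, and the sample numbers are hypothetical. Whether the Mortality_Rate column in your file is defined the same way is something to check against the data set’s documentation.

```python
import pandas as pd

def case_fatality_rate(deaths, confirmed):
    """Deaths per 100 confirmed cases (a percentage)."""
    return deaths * 100 / confirmed

# Made-up numbers, for illustration only.
df_cfr = pd.DataFrame({
    "Country_Region": ["Country A", "Country B"],
    "Confirmed": [1000, 200],
    "Deaths": [50, 4],
})
df_cfr["CFR_percent"] = case_fatality_rate(df_cfr["Deaths"], df_cfr["Confirmed"])
print(df_cfr)
```

Note that this only uses confirmed cases. The “infection fatality rate” cannot be computed this way, because the number of undetected infections has to be estimated.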
In addition, you may have noticed that the US is no longer in the lead. The US led in the total number of deaths, but not once deaths are put in proportion to population. It is San Marino, which was not even in the top ten for deaths, that had the highest deaths/1M pop; this is likely because San Marino is extremely small, with a population of about 34,000 people. Each additional case or death in San Marino therefore produces a much larger change in per-capita COVID-19 measures than it would in a country like the US, where the population is around 330 million.
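The effect of population size can be checked with a little arithmetic. This sketch uses the two approximate populations mentioned above and asks how much a single additional death moves the deaths/1M pop figure in each place; the function name per_million is just an illustrative choice.

```python
def per_million(count, population):
    """Scale a raw count to a rate per one million people."""
    return count * 1_000_000 / population

# Approximate populations quoted in this article.
SAN_MARINO_POPULATION = 34_000
US_POPULATION = 330_000_000

# Change in deaths/1M pop caused by one additional death.
san_marino_change = per_million(1, SAN_MARINO_POPULATION)  # roughly 29.4
us_change = per_million(1, US_POPULATION)                  # roughly 0.003

print(f"San Marino: +{san_marino_change:.1f} deaths/1M pop")
print(f"United States: +{us_change:.4f} deaths/1M pop")
```

One death shifts San Marino’s figure by thousands of times more than it shifts the US figure, which is why per-capita numbers for very small countries can jump so sharply.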
TL;DR
As we analyze COVID-19 data, it’s important not to let headlines get in the way, because one line isn’t telling the whole story. There are numerous ways to put the numbers into perspective, such as scaling them to the population of a region, but ultimately there is no “perfect” metric for the impact of the coronavirus. And this article isn’t even going into the economic data behind the impacts of COVID-19, which deserves an article or two of its own.
Note: This article referenced slides and images that are part of the “COVID-19 Data Analysis with Python: What is the Data Really Telling Us?” workshop offered by the TeenTechSF STEM Workshop Team. For updates on when this workshop (or others) will be offered, check out the TeenTechSF Eventbrite page and social media.