Unleash the Power of Data using R with Covid Trial Datasets

Anu Ganesan
4 min read · Mar 25, 2021

Data is the universal truth guiding today’s world of Data Science. At the same time, extracting actionable insights from it is not as simple as it looks. Turning data into meaningful insights requires a deep understanding of the domain and the environment from which the data originates.

The year 2020 turned into a painful one. Covid has changed our lives forever. These difficult times have taught us how to be resilient and empathetic. Hope is gaining a foothold with the arrival of Covid vaccines. Let’s learn the power of data using the Covid Trial Datasets, the record of the trials that led to the emergence of vaccines in record time.

We will be using R as the programming language to explore and visualize Covid Trial Datasets.

Below is a visualization of the Covid Vaccine Trial Datasets that we will be analyzing:

Covid Vaccine Trial Datasets

Step 1: Load the Covid Trial Dataset

CovidDF <- read.csv("CovidTrials.csv")

Step 2: Analyze the Dataset and Its Variables

> str(CovidDF)
'data.frame': 5061 obs. of 27 variables:

There are 5061 observations (rows) with 27 variables (columns).

Analyze Gender Variable:

> unique(CovidDF$Gender)
[1] "All" "Female" "Male" ""

> length(which(CovidDF$Gender == "All"))
[1] 4881

> length(which(CovidDF$Gender == "Male"))
[1] 40

> length(which(CovidDF$Gender == "Female"))
[1] 131

> length(which(CovidDF$Gender == ""))
[1] 9

> nrow(CovidDF)
[1] 5061

In total there are 5061 observations (rows): 4881 with gender “All”, 131 “Female”, 40 “Male”, and 9 with no gender specified.

The value “All” for the gender variable indicates that a trial accepts participants of any gender. Because almost every row carries this value, it is hard to achieve precise results when modelling the data by gender. This is the nightmare of every data engineer: dealing with incomplete data.
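The four counts above can also be obtained in a single call with base R's table(). The sketch below uses a small stand-in vector with made-up values (not the real dataset) so that it runs on its own; with the real data you would pass CovidDF$Gender instead.

```r
# Stand-in for CovidDF$Gender: illustrative values only, not the real data.
gender <- c("All", "All", "Female", "Male", "", "All")

# Relabel empty strings so the missing entries are easy to spot in the output.
gender[gender == ""] <- "Unspecified"

# table() counts every distinct value in one pass.
genderCounts <- table(gender)
print(genderCounts)

# Sanity check: the counts add up to the number of rows.
sum(genderCounts) == length(gender)
```

Relabelling the empty string is a design choice for readability; table() would also count "" as its own category, only with a blank header.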

Analyze Study Results Variable:

> unique(CovidDF$Study.Results)
[1] "No Results Available" "Has Results"

> length(which(CovidDF$Study.Results == "No Results Available"))
[1] 5033

> length(which(CovidDF$Study.Results == "Has Results"))
[1] 28

The above result indicates that the Covid Trial dataset was captured while most studies were still in progress: 5033 of the 5061 trials are marked “No Results Available”.
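To put those two counts in proportion, here is a minimal sketch that works directly from the numbers reported above. The variable names are only illustrative; with the data frame loaded, prop.table(table(CovidDF$Study.Results)) should give the same shares as fractions.

```r
# Counts reported by the two queries above.
noResults  <- 5033
hasResults <- 28

# The two categories together account for every row in the dataset.
total <- noResults + hasResults

# Percentage of trials that already have results.
shareWithResults <- 100 * hasResults / total
round(shareWithResults, 2)
```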

Step 3: Analyze Age Variable

The Age variable consists of free-form values, with different formats used to capture the age of volunteers in different locations.

Below are sample age values captured in the trial datasets:

18 Years and older (Adult, Older Adult)

18 Years and older (Adult, Older Adult)

Child, Adult, Older Adult

18 Years to 48 Years (Adult)

It would have been much better if age had been captured as an integer or as a numeric range. Because the values come in different formats, we will use a convertAge function to approximate each age with a single numeric value.

library(stringr)

# Approximate a free-form age description with a single number.
convertAge <- function(age) {
  newage <- age
  # Strip descriptive words and punctuation, one literal pattern at a time.
  for (word in c("Child", "Adult", "Older", "older", "and", "Years", "to", "up", "(", ")", ",")) {
    newage <- str_replace_all(newage, fixed(word), "")
  }
  newage <- str_trim(newage)
  # Ages given in months or days are mapped to the placeholder values 2 and 1.
  newage <- ifelse(grepl("Month", newage, fixed = TRUE), "2", newage)
  newage <- ifelse(grepl("Days", newage, fixed = TRUE), "1", newage)
  # Keep the last two characters, i.e. the last number in the string
  # (the upper bound of a range; ages of 100 or more would be truncated).
  newage <- substr(newage, nchar(newage) - 1, nchar(newage))
  # Strings with no number left become NA.
  as.numeric(newage)
}

ageVector <- sapply(CovidDF$Age, convertAge)

CovidDF$NewAge <- unname(ageVector)

# Rows whose age could not be parsed get 0 as a placeholder.
CovidDF$NewAge[is.na(CovidDF$NewAge)] <- 0
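As an alternative to collapsing each value into one number, the lower and upper bounds can be pulled out separately with a regular expression. This is a minimal sketch using base R only, not the method used in this article. extractAgeBounds is a hypothetical helper, and it assumes that open-ended upper limits are written as “up to N Years”. Units other than years, months and days are ignored.

```r
# Hypothetical helper (not part of the original article): extract the lower
# and upper age bounds, in years, from a free-form age string.
extractAgeBounds <- function(age) {
  # Find every "<number> <unit>" fragment, e.g. "18 Year" or "6 Month".
  hits <- regmatches(age, gregexpr("[0-9]+ (Year|Month|Day)", age))[[1]]
  if (length(hits) == 0) {
    return(c(min = NA_real_, max = NA_real_))
  }
  parts <- strsplit(hits, " ", fixed = TRUE)
  value <- as.numeric(sapply(parts, `[`, 1))
  unit  <- sapply(parts, `[`, 2)
  # Convert months and days to (approximate) years.
  perYear <- c(Year = 1, Month = 12, Day = 365)
  years <- unname(value / perYear[unit])
  if (length(years) == 1) {
    # Assumption: "up to N Years" is an upper bound; anything else with a
    # single number (e.g. "N Years and older") is treated as a lower bound.
    if (grepl("up to", age, fixed = TRUE)) {
      return(c(min = NA_real_, max = years))
    }
    return(c(min = years, max = NA_real_))
  }
  c(min = min(years), max = max(years))
}

# The sample values shown earlier in this step.
samples <- c("18 Years and older (Adult, Older Adult)",
             "18 Years to 48 Years (Adult)",
             "Child, Adult, Older Adult")
bounds <- t(sapply(samples, extractAgeBounds, USE.NAMES = FALSE))
bounds
```

Keeping both bounds avoids the 0 placeholder and leaves missing limits as NA, at the cost of having two columns to model instead of one.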

Step 4: Visualize using plots in R

hist(CovidDF$NewAge, xlab = "Age group", main = "Age of Volunteers in Covid Vaccine Trials")

Covid Vaccine Trial Age Group Frequency

The histogram above shows the number of trials grouped by the approximated age of their volunteers.

A box plot of the same variable shows the median and the 1st and 3rd quartiles.

Age of volunteers for Covid Vaccine Trial
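The call that produced the box plot is not shown in the article, so the following is only a sketch of how it could be drawn with base R's boxplot(). The ages vector holds made-up values that stand in for CovidDF$NewAge so that the snippet runs on its own.

```r
# Stand-in for CovidDF$NewAge: illustrative values only, not the real data.
ages <- c(0, 18, 18, 18, 48, 65, 80, 18, 2, 99)

# Draw the box plot; the returned list also holds the plotted statistics.
bp <- boxplot(ages, ylab = "Age", main = "Age of volunteers for Covid Vaccine Trial")

# Median and quartiles as numbers rather than read off the picture.
quartiles <- quantile(ages, probs = c(0.25, 0.5, 0.75))
quartiles
```

With the real data, replacing ages with CovidDF$NewAge should give the plot described above. Keep in mind that the 0 placeholder for unparsed ages is included in the quartiles unless it is filtered out first.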

Convert the Conditions variable to a factor, which stores a vector of integer codes together with a corresponding set of character labels. The integer codes of the Conditions factor can then be plotted in a scatter plot against the new age variable, which is also numeric.

ConditionFactor = as.factor(CovidDF$Conditions)

CovidDF$NewConditions = as.numeric(ConditionFactor)

plot(x = CovidDF$NewAge, y = CovidDF$NewConditions, xlab = "Age")

The above scatter plot shows that the most conditions are registered for ages 18 and 0. Age 0 indicates that the trial record did not contain a numeric age value.
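To make the factor conversion behind the y-axis concrete, here is a toy example. The condition labels are invented for illustration and are not taken from the dataset.

```r
# Illustrative condition labels only, not taken from the real dataset.
conditions <- c("Covid19", "Pneumonia", "Covid19", "Anxiety")

conditionFactor <- as.factor(conditions)

# The levels are the distinct labels in sorted order.
levels(conditionFactor)

# Each element is replaced by the position of its label in levels().
codes <- as.numeric(conditionFactor)
codes

# The codes map back to the original labels.
levels(conditionFactor)[codes]
```

Because the codes only reflect the sort order of the labels, the vertical position of a point in the scatter plot identifies a condition but carries no numeric meaning of its own.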

Data is the oil for all new-age technologies, but it must also be properly engineered; without that, highly accurate machine learning models are nearly impossible to build.

Follow us to stay updated on Predera AIQ Services and Solutions in the field of Data Engineering, Machine Learning and MLOps.
