5 Simple R Functions to Know Your Data Better
It only takes 5 simple R functions to understand data for better decision making.
Use Case 1: IT Salary Survey EU 2020
The data for this use case was obtained from Kaggle.
R Function #1
> Salary = data.frame(read.csv("Salary.csv"))
> str(Salary)
'data.frame': 1253 obs. of 23 variables:
There are 1253 observations (rows) with 23 variables (columns).
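As a minimal sketch of what str() reports, the snippet below builds a small made-up data frame (the column names and values are hypothetical, not taken from the survey) and reads the same row and column counts with dim(), nrow() and ncol().

```r
# Hypothetical toy data standing in for the survey file
toy <- data.frame(
  Age = c(26, 31, 45),
  City = c("Berlin", "Munich", "Berlin"),
  AnnualSalary = c(55000, 72000, 90000)
)

str(toy)          # compact overview: one line per column with its type
dims <- dim(toy)  # integer vector: rows first, then columns
print(dims)
print(nrow(toy))  # number of observations
print(ncol(toy))  # number of variables
```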
> which.max(Salary$Age)
[1] 1104
> Salary$Age[1104]
[1] 69
The oldest IT professional in the survey is 69.
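Two details of which.max() are worth knowing: it skips missing values, and it returns only the first position when several rows share the maximum. The sketch below illustrates both on a hypothetical vector of ages (the values are made up for illustration).

```r
# Hypothetical ages, including a missing value and a tie for the maximum
ages <- c(34, NA, 69, 28, 69)

idx <- which.max(ages)   # position of the FIRST maximum; NA is skipped
oldest <- ages[idx]      # value at that position

# Every position that shares the maximum value
all_idx <- which(ages == max(ages, na.rm = TRUE))

print(idx)
print(oldest)
print(all_idx)
```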
R Function #2
> names(Salary)[names(Salary) == "Annual.brutto.salary..without.bonus.and.stocks..one.year.ago..Only.answer.if.staying.in.the.same.country"] <- "AnnualSalary"
> which.max(Salary$AnnualSalary)
[1] 854
> Salary$AnnualSalary[854]
[1] 500000000
> which.min(Salary$AnnualSalary)
[1] 713
> Salary$AnnualSalary[713]
[1] 11000
> plot(x=Salary$Age, y=Salary$AnnualSalary, xlab="Age", ylab="Salary", type="l")
We obtain the scatter plot below by limiting the y-axis so that the anomaly (the 500000000 salary) is left out of view.
> plot(x=Salary$Age, y=Salary$AnnualSalary, xlab="Age", ylab="Salary", ylim=c(0, 500000))
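Setting ylim only hides the anomaly from the plot; the row is still in the data. An alternative, sketched here on hypothetical toy data, is to filter the row out with subset() before plotting. The cutoff of 500000 is an assumption that simply mirrors the ylim value used above.

```r
# Hypothetical toy data with one implausible salary, mimicking the anomaly
toy <- data.frame(
  Age = c(25, 30, 35, 40, 45),
  AnnualSalary = c(48000, 60000, 500000000, 75000, 82000)
)

cutoff <- 500000  # assumed threshold, same value as the ylim above

# Keep rows at or below the cutoff; rows with NA salaries are dropped too
clean <- subset(toy, AnnualSalary <= cutoff)

print(nrow(clean))
print(range(clean$AnnualSalary))
```

Passing clean$Age and clean$AnnualSalary to the same plot() call would then draw the scatter plot without needing ylim.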
R Function #3
> hist(Salary$AnnualSalary, xlab="Salary group", main="IT Salaries")
> hist(Salary$Age, xlab="Age group", main="IT Age Group")
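To read the numbers behind a histogram rather than only looking at the bars, hist() can be called with plot = FALSE. The sketch below uses hypothetical salaries and an assumed bin width of 20000; neither comes from the survey.

```r
# Hypothetical salaries; not taken from the survey
salaries <- c(42000, 55000, 58000, 61000, 73000, 79000, 95000)

# plot = FALSE returns the bins instead of drawing them
h <- hist(salaries, breaks = seq(40000, 100000, by = 20000), plot = FALSE)

print(h$breaks)  # bin edges
print(h$counts)  # number of observations in each bin
```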
R Function #4
> boxplot(Salary$Age, main = "IT Age Range", ylab = "Age")
> boxplot(Salary$AnnualSalary, main = "IT AnnualSalary Range", ylab = "AnnualSalary")
The box plot above does not show the median or the 1st and 3rd quartiles of IT employees' annual salaries, because the anomaly stretches the y-axis and flattens the box. Let's hide the anomaly by providing a salary range for the y-axis.
> boxplot(Salary$AnnualSalary, main = "IT AnnualSalary Range", ylab = "AnnualSalary", ylim=c(0,500000))
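The median and quartiles that a box plot draws can also be read as numbers. The following is a minimal sketch on hypothetical salaries, using quantile() for the quartiles and boxplot.stats() to list the points a box plot would flag as outliers.

```r
# Hypothetical salaries with one extreme value, mimicking the anomaly
salaries <- c(48000, 60000, 66000, 75000, 82000, 500000000)

# 1st quartile, median and 3rd quartile
q <- quantile(salaries, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

# Values beyond the whiskers (default: 1.5 times the box height)
outliers <- boxplot.stats(salaries)$out

print(q)
print(outliers)
```

Unlike the mean, the median and quartiles barely move when a single extreme value is present, which is why the box itself stays meaningful once the axis range is sensible.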
R Function #5
The tapply command below finds the average salary by age.
> tapply(Salary$AnnualSalary, Salary$Age, mean)
43 years old Average Salary = 77860
49 years old Average Salary = 61500
50 years old Average Salary = 28800
51 years old Average Salary = 50000
54 years old Average Salary = 95000
59 years old Average Salary = 69000
65 years old Average Salary = 50000
66 years old Average Salary = 50000
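Group averages are sensitive to missing values and to anomalies like the one seen earlier. As a hedged sketch on hypothetical toy data, tapply() can pass na.rm = TRUE through to the summary function, and median can be swapped in for mean when a group contains an extreme value.

```r
# Hypothetical toy data: one missing salary and one extreme salary
toy <- data.frame(
  Age = c(30, 30, 30, 43, 43, 43),
  AnnualSalary = c(50000, 60000, NA, 70000, 80000, 500000000)
)

# Extra arguments after the function are passed on to it
mean_by_age <- tapply(toy$AnnualSalary, toy$Age, mean, na.rm = TRUE)
median_by_age <- tapply(toy$AnnualSalary, toy$Age, median, na.rm = TRUE)

print(mean_by_age)    # the age 43 mean is dominated by the extreme value
print(median_by_age)  # the age 43 median is not
```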
Use Case 2: Air Traffic Passenger List
The data for this use case was obtained from Kaggle.
R Function #1
> AirTraffic = data.frame(read.csv("AirTraffic.csv"))
> str(AirTraffic)
'data.frame': 680985 obs. of 16 variables:
There are 680985 observations (rows) with 16 variables (columns).
> which.max(AirTraffic$Total)
[1] 680985
> AirTraffic$Total[680985]
[1] 150195
The maximum number of passengers flown in a single air carrier record is 150195.
R Function #2
> plot(x=AirTraffic$Year, y=AirTraffic$Total, xlab="Year", ylab="No of Passengers", type="l")
The above line plot shows the drop in the number of passengers in 2020 due to COVID-19.
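With hundreds of thousands of rows, a line drawn through every record can be hard to read. One option, sketched here on hypothetical toy rows, is to total the passengers per year first and plot one point per year.

```r
# Hypothetical rows: several carrier records per year
toy <- data.frame(
  Year = c(2019, 2019, 2019, 2020, 2020, 2020),
  Total = c(120, 150, 130, 40, 35, 45)
)

yearly <- tapply(toy$Total, toy$Year, sum)  # one total per year
years <- as.integer(names(yearly))          # group labels back to numbers

print(yearly)
```

Calling plot(years, yearly, type = "l") on the result would then draw a single line with one point per year.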
R Function #3
> hist(AirTraffic$Scheduled, xlab="Metric Flown", main="Air Traffic")
R Function #4
> boxplot(AirTraffic$Total, main = "Air Traffic", ylab = "Total Flown Metric")
R Function #5
The tapply command below finds the average flown metric by year.
> tapply(AirTraffic$Total, AirTraffic$Year, mean)
2014 = 7426.346
2015 = 7511.870
2016 = 7563.536
2017 = 7604.168
2018 = 7845.246
2019 = 8024.988
2020 = 5809.144
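To put a number on the 2020 drop, the yearly averages can be turned into year-over-year percentage changes. The sketch below types in the 2016 to 2020 averages printed above by hand; the variable names are illustrative only.

```r
# Yearly averages copied from the tapply output above
avg <- c("2016" = 7563.536, "2017" = 7604.168, "2018" = 7845.246,
         "2019" = 8024.988, "2020" = 5809.144)

# Change relative to the previous year, in percent
pct_change <- diff(avg) / head(avg, -1) * 100

print(round(pct_change, 2))
```

On these figures the average grows by a few percent at most each year until 2019, then falls by about 27.6% in 2020.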
Understanding data not only improves domain knowledge but also acts as the fuel to improve data engineering and machine learning projects.