5 Simple R Functions to Know Your Data Better

Anu Ganesan
4 min read · Mar 28, 2021

It only takes 5 simple R functions to understand data for better decision making.

Use Case 1: IT Salary Survey EU 2020

The data for this use case comes from the following Kaggle URL:

https://www.kaggle.com/parulpandey/2020-it-salary-survey-for-eu-region?select=IT+Salary+Survey+EU++2020.csv

R Function #1

> Salary <- read.csv("Salary.csv")

> str(Salary)
'data.frame': 1253 obs. of 23 variables:

There are 1253 observations (rows) and 23 variables (columns).
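As a quick illustration of what str() reports, here is a minimal sketch on a tiny, made-up data frame (the `toy` name and its values are assumptions for illustration, not part of the survey). dim() and summary() give the same row and column counts plus a numeric overview.

```r
# Made-up stand-in for the survey data (not the real values)
toy <- data.frame(Age  = c(31, 45, 69),
                  City = c("Berlin", "Munich", "Berlin"))

str(toy)          # type and first values of every column
dim(toy)          # number of rows, then number of columns
summary(toy$Age)  # min, quartiles, mean and max of one column
```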

> which.max(Salary$Age)
[1] 1104
> Salary$Age[1104]
[1] 69

The oldest IT professional in the survey is 69 years old.
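which.max() returns the position of the maximum, not the value itself, and it skips missing values. The sketch below shows both on made-up data (the `toy` data frame is an assumption for illustration) and pulls the complete record for that row.

```r
# Made-up data with missing values, for illustration only
toy <- data.frame(Age          = c(31, NA, 69, 45),
                  AnnualSalary = c(60000, 72000, 50000, NA))

oldest_row <- which.max(toy$Age)          # index of the first maximum; NAs are ignored
oldest_age <- max(toy$Age, na.rm = TRUE)  # the maximum value itself

toy[oldest_row, ]                         # the whole record of the oldest respondent
```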

R Function #2

> names(Salary)[names(Salary) == "Annual.brutto.salary..without.bonus.and.stocks..one.year.ago..Only.answer.if.staying.in.the.same.country"] <- "AnnualSalary"
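Typing a very long column name by hand is error-prone. One possible alternative, sketched here on a made-up data frame with a hypothetical long column name, is to match the column by a prefix pattern and check that exactly one column matches before renaming.

```r
# Made-up data frame; the long column name is hypothetical
toy <- data.frame(
  Age = c(31, 45),
  Annual.brutto.salary..hypothetical.long.survey.question = c(60000, 72000)
)

# Find the column by its prefix instead of typing the full name
hit <- grepl("^Annual\\.brutto\\.salary", names(toy))
stopifnot(sum(hit) == 1)   # guard: rename only if exactly one column matches

names(toy)[hit] <- "AnnualSalary"
names(toy)
```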

> which.max(Salary$AnnualSalary)
[1] 854
> Salary$AnnualSalary[854]
[1] 500000000
> which.min(Salary$AnnualSalary)
[1] 713
> Salary$AnnualSalary[713]
[1] 11000
> plot(x=Salary$Age, y=Salary$AnnualSalary, xlab="Age", ylab="Salary", type="l")

Limiting the y-axis to the range 0 to 500,000 keeps the anomaly (the 500,000,000 salary) out of view and gives a readable scatter plot:

> plot(x=Salary$Age, y=Salary$AnnualSalary, xlab="Age", ylab="Salary", ylim=c(0, 500000))
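Note that ylim only changes what is drawn; the anomaly is still in the data. If you want to actually drop it, a minimal sketch looks like this (the `toy` data and the 500,000 cut-off are assumptions for illustration, chosen to match the axis limit above).

```r
# Made-up data containing one extreme value and one missing value
toy <- data.frame(Age          = c(25, 31, 40, 52),
                  AnnualSalary = c(45000, 500000000, 70000, NA))

cap <- 500000  # assumed cut-off, same as the upper y-axis limit

# Keep only rows with a known salary at or below the cut-off
clean <- subset(toy, !is.na(AnnualSalary) & AnnualSalary <= cap)

range(clean$AnnualSalary)  # smallest and largest remaining salary
```

The same plot() call can then be run on `clean` without any ylim argument.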

R Function #3

> hist(Salary$AnnualSalary, xlab="Salary group", main="IT Salaries")

> hist(Salary$Age, xlab="Age group", main="IT Age Group")
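A histogram is just a count per bin, and hist() can return those counts without drawing anything. The sketch below uses made-up ages and assumed 10-year bins to show the numbers behind the bars.

```r
# Made-up ages, for illustration only
ages <- c(22, 25, 27, 31, 33, 35, 38, 41, 47, 69)

# plot = FALSE returns the bin edges and counts instead of drawing
h <- hist(ages, breaks = seq(20, 70, by = 10), plot = FALSE)

# One row per bin: lower edge, upper edge, number of values
data.frame(from  = head(h$breaks, -1),
           to    = tail(h$breaks, -1),
           count = h$counts)
```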

R Function #4

> boxplot(Salary$Age, main = "IT Age Range", ylab = "Age")

> boxplot(Salary$AnnualSalary, main = "IT AnnualSalary Range", ylab = "AnnualSalary")

The box plot above does not show the median or the 1st and 3rd quartiles of IT employees' annual salaries, because the anomaly stretches the y-axis. Let's redraw the box plot with a restricted salary range on the y-axis.

> boxplot(Salary$AnnualSalary, main = "IT AnnualSalary Range", ylab = "AnnualSalary", ylim=c(0,500000))
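The values a box plot draws can also be read off as numbers. Here is a small sketch on made-up salaries (the `salary` vector is an assumption for illustration) that uses quantile() for the quartiles and boxplot.stats() to list the points a box plot would mark as outliers.

```r
# Made-up salaries with one extreme value
salary <- c(11000, 45000, 60000, 70000, 85000, 500000000)

# 1st quartile, median and 3rd quartile
q <- quantile(salary, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
q

# Points beyond 1.5 times the box height, i.e. what boxplot() draws as outliers
out <- boxplot.stats(salary)$out
out
```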

R Function #5

The tapply command below finds the average salary by age.

> tapply(Salary$AnnualSalary, Salary$Age, mean)
43 years old Average Salary = 77860

49 years old Average Salary = 61500

50 years old Average Salary = 28800

51 years old Average Salary = 50000

54 years old Average Salary = 95000

59 years old Average Salary = 69000

65 years old Average Salary = 50000

66 years old Average Salary = 50000
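Two things are worth knowing about tapply() here: a single missing salary turns a group's mean into NA unless na.rm = TRUE is passed on to mean(), and single ages with few respondents give noisy averages. A minimal sketch on made-up data, with assumed 10-year age bands built by cut():

```r
# Made-up data with one missing salary
toy <- data.frame(Age          = c(24, 28, 33, 37, 45, 49),
                  AnnualSalary = c(40000, 50000, 60000, NA, 80000, 90000))

# Group ages into bands (20,30], (30,40], (40,50]
band <- cut(toy$Age, breaks = c(20, 30, 40, 50))

# Extra arguments after the function are passed on to mean()
avg <- tapply(toy$AnnualSalary, band, mean, na.rm = TRUE)
avg
```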

Use Case 2: Air Traffic Passenger List

The data for this use case comes from the following Kaggle URL:

https://www.kaggle.com/parulpandey/2020-it-salary-survey-for-eu-region?select=IT+Salary+Survey+EU++2020.csv

R Function #1

> AirTraffic <- read.csv("AirTraffic.csv")

> str(AirTraffic)
'data.frame': 680985 obs. of 16 variables:

There are 680985 observations (rows) and 16 variables (columns).

> which.max(AirTraffic$Total)
[1] 680985
> AirTraffic$Total[680985]
[1] 150195

The maximum number of passengers flown by an air carrier in a single record is 150195.

R Function #2

> plot(x=AirTraffic$Year, y=AirTraffic$Total, xlab="Year", ylab="No of Passengers", type="l")

The plot above shows the drop in the number of passengers in 2020 due to COVID-19.
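With hundreds of thousands of rows, a line drawn through every record can be hard to read. One option, sketched here on made-up numbers (the `toy` data frame is an assumption for illustration), is to total the passengers per year first and plot one point per year.

```r
# Made-up records: several rows per year
toy <- data.frame(Year  = c(2019, 2019, 2020, 2020, 2018),
                  Total = c(100, 150, 40, 60, 200))

# One total per year; tapply() returns the groups in sorted order
yearly <- tapply(toy$Total, toy$Year, sum)
years  <- as.integer(names(yearly))

yearly
```

Plotting `years` against `yearly` with type = "l" then draws a single line with one point per year.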

R Function #3

> hist(AirTraffic$Scheduled, xlab="Metric Flown", main="Air Traffic")

R Function #4

> boxplot(AirTraffic$Total, main = "Air Traffic", ylab = "Total Flown Metric")

R Function #5

The tapply command below finds the average flown metric by year.

> tapply(AirTraffic$Total, AirTraffic$Year, mean)

2014 = 7426.346

2015 = 7511.870

2016 = 7563.536

2017 = 7604.168

2018 = 7845.246

2019 = 8024.988

2020 = 5809.144
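To put a number on the 2020 drop, the yearly averages listed above can be turned into year-over-year percentage changes. This is a small sketch that re-enters those seven values by hand (the `avg` and `pct` names are just for illustration).

```r
# Yearly averages copied from the tapply() output above
avg <- c("2014" = 7426.346, "2015" = 7511.870, "2016" = 7563.536,
         "2017" = 7604.168, "2018" = 7845.246, "2019" = 8024.988,
         "2020" = 5809.144)

# Change versus the previous year, in percent
pct <- diff(avg) / head(avg, -1) * 100
round(pct, 1)
```

On these figures, the 2020 average is roughly 27.6% below the 2019 average.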

Understanding data not only improves domain knowledge but also acts as the fuel to improve data engineering and machine learning projects.
