5 Simple R Functions to Know Your Data Better

Anu Ganesan

Mar 28, 2021

It only takes 5 simple R functions to understand data for better decision making.

Use Case 1: IT Salary Survey EU 2020

The data for this use case comes from the following Kaggle URL:

https://www.kaggle.com/parulpandey/2020-it-salary-survey-for-eu-region?select=IT+Salary+Survey+EU++2020.csv

R Function #1

> Salary <- read.csv("Salary.csv")

> str(Salary)
'data.frame': 1253 obs. of 23 variables:

There are 1253 observations (rows) with 23 variables (columns).
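As a minimal, self-contained sketch of the same loading step, the snippet below writes a tiny made-up CSV to a temporary file (the file contents and the name toy are assumptions for illustration, not the survey data) and reads it back. read.csv() already returns a data frame, and dim() gives the same row and column counts that str() reports.

```r
# Write a tiny made-up CSV to a temporary file so the example runs anywhere
path <- tempfile(fileext = ".csv")
writeLines(c("Age,City", "34,Berlin", "41,Munich"), path)

# read.csv() returns a data.frame, so no extra data.frame() call is needed
toy <- read.csv(path)
is_df <- is.data.frame(toy)

# dim() returns the number of rows and columns, in that order
dims <- dim(toy)

# str() prints the same counts plus the type of each column
str(toy)
print(dims)
```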

> which.max(Salary$Age)
[1] 1104
> Salary$Age[1104]
[1] 69

The oldest IT professional in the survey is 69.
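The same lookup can be sketched on a toy data frame (made-up values, not the survey data). which.max() returns the position of the first maximum and ignores missing values, so it can be used to pull out the whole row, while max() returns the value itself.

```r
# Toy data frame with made-up values (not the Kaggle survey data)
toy <- data.frame(
  Age = c(34, 69, NA, 41),
  City = c("Berlin", "Munich", "Berlin", "Hamburg")
)

# which.max() discards NA and returns the position of the first maximum
oldest_row <- which.max(toy$Age)

# Index the data frame with that position to see the whole record
oldest <- toy[oldest_row, ]

# max() returns the value directly; na.rm = TRUE drops the missing age
max_age <- max(toy$Age, na.rm = TRUE)

print(oldest)
print(max_age)
```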

R Function #2

> names(Salary)[names(Salary) == "Annual.brutto.salary..without.bonus.and.stocks..one.year.ago..Only.answer.if.staying.in.the.same.country"] <- "AnnualSalary"

> which.max(Salary$AnnualSalary)
[1] 854
> Salary$AnnualSalary[854]
[1] 500000000
> which.min(Salary$AnnualSalary)
[1] 713
> Salary$AnnualSalary[713]
[1] 11000
> plot(x=Salary$Age, y=Salary$AnnualSalary, xlab="Age", ylab="Salary", type="l")

We obtain the scatter plot below by limiting the y-axis so that the anomalous 500000000 salary falls outside the plotted range.

> plot(x=Salary$Age, y=Salary$AnnualSalary, xlab="Age", ylab="Salary", ylim=c(0, 500000))
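Note that ylim only hides the anomaly from the plot; the row is still in the data. One possible alternative, sketched here on made-up values, is to filter the rows first. The cutoff of 500000 is an assumption chosen to match the ylim above, and the names toy, cutoff and clean are hypothetical.

```r
# Toy data with made-up values; the last salary mimics the 500000000 outlier
toy <- data.frame(
  Age = c(25, 30, 35, 40, 45),
  AnnualSalary = c(45000, 60000, 72000, 80000, 500000000)
)

# Hypothetical cutoff, chosen to match the ylim used for the plot
cutoff <- 500000

# Keep only rows at or below the cutoff; subset() also drops rows
# where the condition evaluates to NA
clean <- subset(toy, AnnualSalary <= cutoff)

print(nrow(clean))
print(range(clean$AnnualSalary))
```

Plotting clean$Age against clean$AnnualSalary then needs no ylim, and later summaries such as the mean are no longer distorted by the outlier.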

R Function #3

> hist(Salary$AnnualSalary, xlab="Salary group", main="IT Salaries")

> hist(Salary$Age, xlab="Age group", main="IT Age Group")
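To see the numbers behind a histogram, hist() can also return the bins without drawing anything. The sketch below uses made-up ages and explicit 10-year bin edges (both are assumptions for illustration).

```r
# Made-up ages, not the survey data
ages <- c(22, 25, 27, 31, 33, 34, 38, 41, 45, 52)

# breaks sets the bin edges explicitly;
# plot = FALSE returns the histogram object without drawing it
h <- hist(ages, breaks = seq(20, 60, by = 10), plot = FALSE)

# Bin edges and the number of observations falling into each bin
print(h$breaks)
print(h$counts)
```

With the default settings each bin includes its right edge, so the four bins here are 20 to 30, 30 to 40, 40 to 50 and 50 to 60.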

R Function #4

> boxplot(Salary$Age, main = "IT Age Range", ylab = "Age")

> boxplot(Salary$AnnualSalary, main = "IT AnnualSalary Range", ylab = "AnnualSalary")

The box plot above does not show the median or the 1st and 3rd quartiles of IT employees' annual salaries, because the anomaly stretches the y-axis. Let's keep the anomaly out of view and redraw the box plot with a salary range on the y-axis.

> boxplot(Salary$AnnualSalary, main = "IT AnnualSalary Range", ylab = "AnnualSalary", ylim=c(0,500000))
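The values a box plot draws can also be read off directly with boxplot.stats(), which is unaffected by axis limits. This is a small sketch on made-up salaries, with one value mimicking the anomaly.

```r
# Made-up salaries; the last value mimics the 500000000 outlier
salaries <- c(45000, 55000, 60000, 70000, 80000, 500000000)

s <- boxplot.stats(salaries)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
print(s$stats)

# out: points lying beyond the whiskers, which boxplot() draws as outliers
print(s$out)
```

Here the median is 65000 and the only point flagged as an outlier is 500000000.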

R Function #5

The tapply command below finds the average salary by age.

> tapply(Salary$AnnualSalary, Salary$Age, mean)
43 years old Average Salary = 77860
49 years old Average Salary = 61500
50 years old Average Salary = 28800
51 years old Average Salary = 50000
54 years old Average Salary = 95000
59 years old Average Salary = 69000
65 years old Average Salary = 50000
66 years old Average Salary = 50000
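As a self-contained sketch of how tapply groups and averages, the toy example below uses made-up ages and salaries. Passing na.rm = TRUE through to mean() is an assumption about what is wanted when some respondents left the salary blank; without it, any group containing an NA would average to NA.

```r
# Toy data with made-up values; one salary is missing
toy <- data.frame(
  Age = c(30, 30, 35, 35, 35),
  AnnualSalary = c(50000, 70000, 60000, NA, 80000)
)

# tapply(values, groups, function, ...): extra arguments such as
# na.rm = TRUE are passed on to mean() for every group
avg_by_age <- tapply(toy$AnnualSalary, toy$Age, mean, na.rm = TRUE)

# The result is a named vector, with one mean per distinct age
print(avg_by_age)
```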

Use Case 2: Air Traffic Passenger List

The data for this use case comes from the following Kaggle URL:

https://www.kaggle.com/parulpandey/2020-it-salary-survey-for-eu-region?select=IT+Salary+Survey+EU++2020.csv

R Function #1

> AirTraffic <- read.csv("AirTraffic.csv")

> str(AirTraffic)
'data.frame': 680985 obs. of 16 variables:

There are 680985 observations (rows) with 16 variables (columns).

> which.max(AirTraffic$Total)
[1] 680985
> AirTraffic$Total[680985]
[1] 150195

The maximum number of air traffic passengers flown by an air carrier is 150195.

R Function #2

> plot(x=AirTraffic$Year, y=AirTraffic$Total, xlab="Year", ylab="No of Passengers", type="l")

The plot above shows the drop in the number of passengers in 2020 due to COVID-19.

R Function #3

> hist(AirTraffic$Scheduled, xlab="Metric Flown", main="Air Traffic")

R Function #4

> boxplot(AirTraffic$Total, main = "Air Traffic", ylab = "Total Flown Metric")

R Function #5

The tapply command below finds the average flown metric by year.

> tapply(AirTraffic$Total, AirTraffic$Year, mean)

2014 = 7426.346
2015 = 7511.870
2016 = 7563.536
2017 = 7604.168
2018 = 7845.246
2019 = 8024.988
2020 = 5809.144
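To put a number on the 2020 drop, the yearly averages listed above can be turned into year-over-year percentage changes. This is a small sketch that copies those averages into a named vector by hand; the names avg_total and pct_change are hypothetical.

```r
# Yearly averages copied from the tapply output listed above
avg_total <- c(
  "2014" = 7426.346, "2015" = 7511.870, "2016" = 7563.536,
  "2017" = 7604.168, "2018" = 7845.246, "2019" = 8024.988,
  "2020" = 5809.144
)

# diff() gives each year minus the previous year;
# dividing by the previous year's value turns it into a percentage change
pct_change <- 100 * diff(avg_total) / head(avg_total, -1)

print(round(pct_change, 1))
```

On these figures the average rises slightly every year up to 2019 and then falls by roughly 27.6% in 2020.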

Understanding data not only improves domain knowledge but also acts as the fuel to improve data engineering and machine learning projects.

