Revising R Data Tables

This is the R Script I used to give a Beginner’s R Workshop. The aim of the workshop, was not to give a introduction about the language. But to give an hands-on learning to the participant. Learning by doing. The R Script was written keeping in mind that participants don’t have any pre-knoweldge of R. However, some working experience of spreadsheets would help understanding how the same operations can be scaled to a large dataset.Main focus was R Data Tables and is no where exhuastive but hopes to act as a good starting point.

The dataset I used was from Lok Dhaba. The datasets are freely available for use.

#What and why R ?
# is a programming language that allows you to Store | Manipulate | Visualise data 

# 1. More powerful than Excel
# 2. Handle thousands of rows and many columns at a time 
# 3. Repetitive efforts can be reduced with programs 
# 4. Vast library of packages for data manipulation and visualisations
# 5. Cross-platform friendly : Can run on many operating systems.



# 1. Ctrl+L --> To clear the console
# 2. ls() to check the variables that are live in the environment
# 3. rm(variable_name) to remove variables

#Many packages are already part of RStudio, to install other packages , this command has to be run once only 

#install.packages("data.table"), already ran this command before sharing 

#After installing package in the RStudio, it needs to be brought to the environment , this means many packages can exist in the RStudio, 
#but needs to be extracted into the working environment 
library(data.table)

#For Documentation to any function and package
#1. ?function_name
#2. help(function_name/package_name)

#Tabular Data in R can be processed with 3 ways
#1.data.frame
#2.data.table (Advanced version of data.frame, today's focus)
#3.dplyr library

#Data Frames: 2 dimensional data structure
#Data frames store data as a sequence of columns and rows. Each column can be of a different data type.

#The file contains the dataset, this dataset has to be fed into a variable. function : read.csv("path_to_the_file",..,..), takes the dataset as a data.frame
dataframe <-read.csv("/cloud/project/TCPD_AE_Goa_2021-7-1.csv")
View(dataframe)

#Data Tables
#1. Data manipulation operations such as subset, group, update, join etc.
#2. reducing programming and compute time tremendously,

datatable <-fread("/cloud/project/TCPD_AE_Goa_2021-7-1.csv")
View(datatable)

#The way dataset is rendered, datatable and dataframe doesn't have difference
#However, in your console, you can see the left most column and find a ":", this is specific to datatable and is a 
#way to recognise

head(dataframe,2)
head(datatable,2)

#to view names of all the columns in the datatable
names(datatable)

#to understand the structure of datatable, variable types of each column
str(datatable)


#Use ctrl+l to clear the console

#Data Tables allow you to slice the dataset and update it

#DT[i, j, by]
#R: i j by
#i --> rows to select or operate on
#j --> select columns to operate on /update these columns ++ take these columns as a variable and operate on it 
#by ---> group by --> use the columns and rows and make unique groups

#Lets try to get the age of the candidate in the sort manner
#First lets find unique values 
unique(datatable$Age)

#Sort these unique values
sort(unique(datatable$Age))

#To see unique values in a column
unique(datatable$Election_Type)
unique(datatable$District_Name)

#Sort Names of Candidates Alphabetically, and update the datatable
order(datatable$Candidate, decreasing=FALSE, na.last=TRUE)
datatable <- datatable[order(datatable$Candidate, decreasing=TRUE, na.last=TRUE),]

datatable$Candidate <- sort(datatable$Candidate, decreasing = TRUE)
#na.last puts wherever a NA is encountered at the last of the dataset

#Now, we can see the age values, lets try selecting rows where age of candidate is 35 
result <- datatable[datatable$Age==35]


#Lets find what kind of variable is this 'result'
class(result)

#It is data.table, we can render to view
View(result)

#what is we wish to select some specific section of row, for example all rows between 20 to 40 serial number , 
#index starts from 1 as can be seen the datatable
datatable[20:40]

#Selecting single column and all rows 
result <- datatable[,~Month]
head(result)

#The error is because R is case sensitive

result <- datatable[,.(datatable$month)]
View(result)

#Multiple columns
result <- datatable[, c("month", "Age")]
head(result, 10)

#Data Tables allows to take columns as a variable, operate on it

#the below line of code, evaluate if sum of month and position is less than 3 for each row 
#and sends the results as TRUE/FALSE the period /. before the j part is used to deliver a list instead of a vector. 

#For now, you can just say that we will use this syntax to get vertical columns as output. 
#And see the difference between vectors and list, later.

result <- datatable[,.((month+Position)<3)]
View(result)

#Now if you want to add this new variable to the datatable, the variable will added in the last
datatable$new_var_MPos <- result


#Here the function is not subset/slicing based on columns it is decision making

# .N is a special character in R , containing the number of rows in the group.
#Useful when you dont know the columns names and want to work on the selected rows in i part

#Here we took all the rows, checked the unique values in the column District_Name and counted how many rows
#falls under which category
result =datatable[,.N,by=District_Name]
result

#More example, Now lets assume we want runner's up value only and discard winners' rows for now
# And we want to group by, how many candidates are runner up district wise

result =datatable[Position!=1,.N,by=District_Name]
result
#Here, we selected all the rows which has position not (!) equal to 1 , .N helped it taking all these rows and grouped by
#District_Name

#More example
result =datatable[Position=1,.N,by=.(District_Name,Candidate_Type)]
result

#Dependency for plotly sudo apt-get install libcurl4-openssl-dev
#install.packages("plotly")

library(plotly)

fig <- plot_ly(data = datatable, x = ~Party, y =~Position)
fig

fig <- plot_ly(data = datatable, x = ~District_Name, y =~Candidate_Type)
fig

fig <- plot_ly(data = datatable, x = ~District_Name, y =~Candidate_Type, type= histogram2d)
fig

fig <- plot_ly(data = datatable, x = ~Age, y =~Position)
fig <- fig %>% layout(title = 'Age versus Position', yaxis = list(zeroline = FALSE),xaxis = list(zeroline = FALSE))
fig

#Helpful Resources

MOOCs: Educative.io R Course, just one among many other great sources available online
R Project Documentation: Website; Functions: help() , ?
Stack Overflow
Plotly:Visualisation library

Incase you figure out any errors and have feedbacks, please send them across here