
Data Wrangling With R Programming Language

  • Writer: deepjraval
  • Jul 8, 2022
  • 3 min read

Updated: Jul 12, 2022


Data Wrangling

Data wrangling is the practice of removing errors and combining complex data sets, transforming the data into more accessible formats.

R programming language

The R programming language is used for statistical analysis and for cleaning, analyzing, and graphing data. It is widely regarded as well suited to exploring and experimenting with data.

R is an implementation of the S language, which was created in the 1970s with the purpose "to turn ideas into software, quickly and faithfully." R itself appeared in the early 1990s and is becoming increasingly popular in data science. Given this popularity, businesses often look to hire data scientists with strong knowledge of the R language.

R for data science

As a programming language, R provides objects, operators, and functions that let users explore, model, and visualize data. In data science it is used to handle, store, and analyze data, and for statistical modeling. Data wrangling in particular is often cited as one of the most fitting uses of R in data science.


Data Wrangling With R

The dplyr package for R has functions that let developers select and rename columns, sort and filter a data set, create and calculate new columns, and summarize values. Here, I have used the Gapminder data and other small data sets in the examples to make them easy for you to follow and adapt to your own data sets.

Functions of R for data wrangling

select()

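For selecting columns by name (here country, year, and lifeExp from the Gapminder data):


library(dplyr)
library(gapminder)  # provides the gapminder data set

# Keep three columns and show the first 5 rows
gapminder %>% 
  select(country, year, lifeExp) %>% 
  head(5)

For selecting all columns except one (here continent, the second column), prefix its name with a minus sign:

# Drop the continent column and show the first 5 rows
gapminder %>% 
  select(-continent) %>% 
  head(5)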

pull()

The pull() function extracts a single column from a data frame and returns it as a vector.


library(dplyr)
mtcars[["mpg"]]
mtcars %>% pull(mpg)
 
# more convenient than (mtcars %>% filter(mpg > 20))[[3L]]
mtcars %>%
  filter(mpg > 20) %>%
  pull(3)

rename()

To rename the column "A" to "B" with dplyr, put the new name on the left-hand side and the old name on the right:


rename(dataframe, B = A)
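
As a minimal sketch of the same call on real data, the example below renames two columns of the built-in mtcars data set. The new names miles_per_gallon and cylinders are illustrative choices of mine, not part of the example above.

```r
library(dplyr)

# rename() takes new_name = old_name pairs; all other columns are untouched
cars_renamed <- mtcars %>%
  rename(miles_per_gallon = mpg, cylinders = cyl)

# The first two column names change; the remaining ones stay as they were
head(names(cars_renamed), 3)
```

Renaming never reorders or drops columns, so the result has the same shape as the input.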

arrange()

The arrange() function in R programming is used to reorder the rows of a data frame/table using column names.

Here we are arranging students by their height in centimeters.


 
# Load the library
library(dplyr)
# Create the data frame
df <- data.frame(students = c("Joey", "Chandler", "Charlie", "Barney"),
                 height_in_cms = c(183, 176, 175, 170))
# Arrange the rows by height_in_cms (ascending by default)
df.students <- arrange(df, height_in_cms)
print(df.students)
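
arrange() sorts in ascending order by default. Here is a small sketch of sorting in descending order with desc(), and of breaking ties with a second column. It uses the built-in mtcars data set, which I chose only for illustration.

```r
library(dplyr)

# Sort by cylinder count (ascending), then by mpg from highest to lowest
cars_sorted <- mtcars %>%
  arrange(cyl, desc(mpg))

# Show the two sorting columns for the first few rows
head(cars_sorted[, c("mpg", "cyl")], 3)
```

Columns listed later in arrange() only matter for rows that tie on the earlier ones.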

filter()

The filter() function produces a subset of the data frame, retaining all rows that satisfy the specified conditions.


library(dplyr)
 
# sample data
df <- data.frame(x = c(12, 31, 4, 66, 78),
                 y = c(22.1, 44.5, 6.1, 43.1, 99),
                 z = c(TRUE, TRUE, FALSE, TRUE, TRUE))
 
# condition: keep rows where x is below 50 and z is TRUE
filter(df, x < 50 & z == TRUE)

mutate()

The mutate() function adds a new variable (column) to a data set, usually computed from existing variables. The existing columns are kept.


mutate(data_frame, new_var = [expression using existing_var])
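
The template above is only a pattern, so here is a runnable sketch on the built-in mtcars data set. The column name hp_per_wt is a hypothetical name I picked for the example; hp and wt are existing mtcars columns.

```r
library(dplyr)

# Add a derived column: horsepower divided by weight
# (wt in mtcars is recorded in units of 1000 lbs)
cars_ratio <- mtcars %>%
  mutate(hp_per_wt = hp / wt)

# The original columns are kept and the new one is appended at the end
head(cars_ratio[, c("hp", "wt", "hp_per_wt")], 3)
```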

group_by()

group_by() groups the rows of a data frame by the values of one or more columns, so that later operations run once per group.


library(dplyr)
 
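# Read the sales data (assumes Sample_Superstore.csv is in the working directory)
df <- read.csv("Sample_Superstore.csv")
 
# Total sales and total profit for each region
df_grp_region <- df %>%
  group_by(Region) %>%
  summarise(total_sales = sum(Sales),
            total_profits = sum(Profit),
            .groups = "drop")
 
# Open the result in the data viewer
View(df_grp_region)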

The group_by() function on its own does not change how the data looks; it only records the grouping. It should be followed by the summarise() function (or another verb) with an appropriate action to perform on each group.
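
To make that point concrete without needing the CSV file, here is a sketch on the built-in mtcars data set (my substitution for illustration). It shows that group_by() keeps every row and only attaches the grouping, while summarise() then collapses each group to one row.

```r
library(dplyr)

# group_by() alone: same rows and columns, but grouping metadata is attached
cars_grouped <- mtcars %>% group_by(cyl)
is_grouped_df(cars_grouped)   # TRUE
nrow(cars_grouped) == nrow(mtcars)  # TRUE, no rows were removed

# Followed by summarise(): one row per distinct value of cyl
cars_summary <- cars_grouped %>%
  summarise(mean_mpg = mean(mpg),
            n_cars = n(),
            .groups = "drop")

print(cars_summary)
```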

summarize()

The summarize() function (also spelled summarise()) reduces a data frame to a single row of summary values, or to one row per group when the data is grouped.


library(dplyr)

# One row holding the mean and the maximum of mpg
mtcars %>%
  summarize(mean_mpg = mean(mpg),
            max_mpg = max(mpg))
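
In practice these functions are usually chained with the pipe. The following is a rough sketch of such a chain on the built-in mtcars data set. The threshold of 100 hp and the derived column hp_per_wt are arbitrary choices of mine for the sake of the example.

```r
library(dplyr)

result <- mtcars %>%
  select(mpg, cyl, hp, wt) %>%          # keep four columns
  filter(hp > 100) %>%                  # keep rows with more than 100 hp
  mutate(hp_per_wt = hp / wt) %>%       # add a derived column
  group_by(cyl) %>%                     # one group per cylinder count
  summarise(mean_mpg = mean(mpg),
            mean_hp_per_wt = mean(hp_per_wt),
            .groups = "drop") %>%
  arrange(desc(mean_mpg))               # highest average mpg first

print(result)
```

Each step takes a data frame and returns a data frame, which is what lets the steps be combined in any order that makes sense for the question being asked.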


Sum up:

These are the main features that make R a strong choice for data wrangling and other data science projects, so make the most of them. They can be used by both data scientists and data engineers. If you wish to read more on data scientist vs data engineer, this article may give you proper insights into the responsibilities of these two professions.


