R Basics¶
This course includes 4 sections:
- Section 1: R Basics
- Section 2: Data Wrangling
- Section 3: Programming Basics
- Section 4: Importing Data
These four sections correspond to the first chapter of the first book.
I will take notes in the order of the book rather than the order of the video.
1. Getting Started¶
Key points from the online book "Introduction to Data Science - Getting Started with R":
Why R?¶
- Origin: Developed by statisticians for interactive data analysis, not for general software development like C or Java.
- Interactivity: Essential for quick data exploration in data science.
- Scripting: Ability to save work as scripts for record-keeping and reproducibility.
- Differences: Not conventional like other programming languages; offers unique power in data analysis and visualization.
- Other Attractive Features of R
    - Free and Open Source: Accessible to all users.
    - Cross-Platform: Runs on Windows, macOS, UNIX/Linux.
    - Seamless Sharing: Scripts and data objects can be shared across different platforms.
    - Active Community: Large, growing, and supportive with numerous learning resources.
    - Extensibility: Easy to contribute and share new data science methodologies across various disciplines.
The R Console¶
- Function: Used for interactive data analysis, executing commands as typed.
- Access: Can be accessed by starting R on a computer.
RStudio¶
- Role: Serves as a launching pad for data science projects with an editor, console, and other tools.
- Panes: Four main panes for different functionalities (code editor, console, environment/history/connections/tutorial, files/plots/packages/help/viewer/presentation).
- Key Bindings: Keyboard shortcuts for efficient task performance, recommended to memorize for common operations.
Installing R Packages¶
- Base R: Limited functionality; extended by add-ons from developers.
- CRAN and GitHub: Sources for packages, with thousands available on CRAN alone.
- Installation: Easy installation from within R or RStudio with `install.packages()`
- Loading Packages: Use `library()` to load installed packages into R sessions.
- Dependencies: Some packages install additional required packages automatically.
2. R Basics¶
Motivating Example: US Gun Murders
You can load the data with `data(murders)` after loading the dslabs package with `library(dslabs)`.
The Very Basics of R¶
- Objects: Storing values for later use with `<-` for assignment
    - Example: Solving quadratic equations using variables
- Workspace: The environment where objects are stored and can be accessed
    - Viewing workspace variables in RStudio's Environment tab
    - Methods to save and load workspaces using `save`, `save.image`, and `load`
- Prebuilt Functions and Objects
    - Prebuilt Functions
        - Using predefined functions like `sqrt` and `log` for calculations
        - You can get help by using the `help` function like this: `help("log")` or `?log`
    - Prebuilt Objects
        - You can see all available prebuilt datasets using `data()`
        - Mathematical quantities like `pi` and `Inf`
- Variable Names in R: start with a letter, can't contain spaces, and avoid conflicts with existing functions or reserved words in the language
- Comments: comments start with the symbol `#`
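As a minimal sketch of the points above, the following stores the coefficients of a quadratic equation as objects and reuses them in the quadratic formula. The coefficient values and the name `const` are assumptions chosen only for illustration.

```r
# Coefficients of x^2 + x - 1 = 0 (illustrative values)
a <- 1
b <- 1
const <- -1   # named `const` rather than `c` to avoid clashing with c()

# Quadratic formula built from the stored objects
disc  <- b^2 - 4 * a * const
root1 <- (-b + sqrt(disc)) / (2 * a)
root2 <- (-b - sqrt(disc)) / (2 * a)
print(c(root1, root2))

# Prebuilt functions and objects
log(8, base = 2)   # logarithm with an explicit base
pi                 # prebuilt mathematical constant
ls()               # names of the objects currently in the workspace
```

Changing `a`, `b`, or `const` and rerunning the script recomputes both roots, which is the main benefit of storing values in objects.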
Data Types in R¶
Use `class()` to identify the type of an object. The function `str` is useful for finding out more about the structure of an object.
- Data Frames
    - Storing datasets in data frames as tables with observations and variables
    - Accessing data frames and their structure using `str()` and `head()`
    - Using `$` to access variables within a data frame
    - Naming and accessing vector entries
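The data frame tools above can be tried on a small table. This is only a sketch: the data frame `df` and all of its values are made up for illustration and are not the murders dataset.

```r
# Toy data frame; the values are made up for illustration
df <- data.frame(
  state      = c("A", "B", "C"),
  population = c(1000, 2500, 400),
  total      = c(3, 10, 1)
)

class(df)      # type of the object
str(df)        # compact summary of every column
head(df, 2)    # first two rows

pop <- df$population   # `$` extracts one column as a vector
class(pop)             # type of that column
names(df)              # column names
```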
- Vectors
    - Creating and manipulating vectors with `c()` and accessing elements
    - Types: "numeric", "character", "logical", "factor", "integer"
    - Numbers default to the numeric class, even if they are whole numbers. To convert to integers, use `as.integer()` or suffix with `L` (e.g., `1L`).
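A short sketch of vector creation and the numeric versus integer distinction. The vector names and values are assumptions used only for illustration.

```r
codes   <- c(380, 124, 818)                # numeric vector
country <- c("italy", "canada", "egypt")   # character vector

class(codes)               # "numeric", even though the values are whole numbers
class(3L)                  # "integer", because of the L suffix
class(as.integer(codes))   # explicit conversion to integer

codes[2]                   # second entry
flags <- codes > 200       # a comparison produces a logical vector
class(flags)
```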
- Factor: Storing categorical data and ordering levels with `factor` and `reorder` (different from character)
- Lists
    - Data frames are a special case of lists. Lists are useful because you can store any combination of different types. You can create a list using the `list` function.
    - You can extract the components of a list with the accessor `$` or double square brackets (`[[`)
- Matrices
    - Defining matrices for matrix algebra operations and converting to data frames
    - Entries in matrices have to be all the same type
    - We can define a matrix using the `matrix` function
    - We can access elements in the matrix with `[row, column]` indexing, in a similar manner to numpy but with indices starting at 1
    - We can convert matrices into data frames using the function `as.data.frame`
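The following sketch shows a factor, a list, and a matrix side by side. All names and values are hypothetical and chosen only to make the behavior visible.

```r
# Factor: categorical data with a fixed set of levels
region <- factor(c("West", "South", "West", "North"))
levels(region)                    # levels are alphabetical by default

# reorder() sorts the levels by a summary (the mean by default) of a second vector
value <- c(10, 1, 20, 5)
levels(reorder(region, value))

# List: components may have different types and lengths
record <- list(name = "John Doe", id = 1234, grades = c(95, 82, 91))
record$name
record[["grades"]]

# Matrix: all entries share one type; filled column by column
mat <- matrix(1:12, nrow = 4, ncol = 3)
mat[2, 3]            # row 2, column 3
mat[, 1]             # entire first column
as.data.frame(mat)   # columns become V1, V2, V3
```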
More of Vectors¶
- names: entries of a vector can be given names, either inside `c()` or afterwards with `names()`
- Use `seq` to create a vector containing a sequence of numbers
- Subsetting: We use square brackets to access specific elements of a vector.
- Coercion: When an entry does not match the expected type, some of the prebuilt R functions try to guess what was meant before throwing an error.
    - `as.numeric` and `as.character` convert between types explicitly
    - When a value cannot be converted, R returns a special value called `NA`, for "not available"
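A brief sketch of naming, sequences, subsetting, and coercion. The vector contents are illustrative assumptions.

```r
# Named vector
codes <- c(italy = 380, canada = 124, egypt = 818)
names(codes)
codes["canada"]        # subset by name
codes[c(1, 3)]         # subset by position

# Sequences
odds <- seq(1, 10, by = 2)

# Coercion: mixing numbers and text turns everything into character
x <- c(1, "canada", 3)
class(x)

# Explicit conversion; "b" cannot be converted, so it becomes NA (with a warning)
y <- suppressWarnings(as.numeric(c("1", "b", "3")))
is.na(y)
```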
- Sorting and Ordering
    - `sort` returns the vector sorted in increasing order.
    - `order` takes a vector as input and returns the vector of indexes that sorts the input vector.
    - `rank` returns, for any given vector, a vector with the rank of the first entry, second entry, etc., of the input vector.
    - `max` and `min` return the value; `which.max` and `which.min` return the index.
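The difference between these functions is easiest to see on one small vector. The values below are an arbitrary example.

```r
x <- c(31, 4, 15, 92, 65)

sort(x)        # 4 15 31 65 92
order(x)       # 2 3 1 5 4  (positions that would sort x)
x[order(x)]    # same result as sort(x)
rank(x)        # 3 1 2 5 4  (rank of each entry in its original position)

max(x)         # 92
which.max(x)   # 4
min(x)         # 4
which.min(x)   # 2
```

`order` is the one to reach for when sorting one vector by the values of another, for example `names[order(scores)]`.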
- Vector arithmetics:
- arithmetic operations on vectors occur element-wise
- If we have two vectors of the same length, and we sum them in R, they will be added entry by entry as follows:
\[
\begin{pmatrix}a \\b \\c \\d\end{pmatrix}+\begin{pmatrix}e \\f \\g \\h\end{pmatrix}=\begin{pmatrix}a+e \\b+f \\c+g \\d+h\end{pmatrix}
\]
- If the vectors don't match in length, R recycles the entries of the shorter vector, with a warning when the longer length is not a multiple of the shorter one
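A small sketch of element-wise arithmetic and recycling, using made-up numbers:

```r
# A scalar operation is applied to every entry
inches <- c(69, 62, 66)
cm <- inches * 2.54

# Same length: entry-by-entry addition
a <- c(1, 2, 3)
b <- c(10, 20, 30)
a + b                          # 11 22 33

# Different lengths: the shorter vector is recycled
c(1, 2, 3, 4) + c(10, 20)      # 11 22 13 24
```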
- Indexing:
    - We can use logicals to index vectors
        - Logical operators: `==`, `!=`, `>`, `>=`, `<`, `<=`, `&`, `|`, `!`, `&&`, `||`
        - `&` and `|` are element-wise logical operators
        - `&&` and `||` are not vectorized: they combine single logical values and stop evaluating as soon as the result is known, which suits them to `if` conditions
    - The function `which` tells us which entries of a logical vector are TRUE
    - The function `match` tells us which indexes of a second vector match each of the entries of a first vector
    - If rather than an index we want a logical that tells us whether or not each element of a first vector is in a second, we can use the function `%in%`
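The indexing tools above, sketched on an arbitrary example vector:

```r
x <- c(3, 8, 1, 9)

big <- x > 2            # TRUE TRUE FALSE TRUE
x[big]                  # 3 8 9
x[x > 2 & x < 9]        # 3 8  (element-wise AND)

which(big)              # 1 2 4
match(c(9, 3), x)       # 4 1  (positions of 9 and 3 inside x)
c(9, 5) %in% x          # TRUE FALSE

# && works on single values and short-circuits, so it belongs in if conditions
if (length(x) > 0 && x[1] > 2) message("first entry exceeds 2")
```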
Basic Plots¶
- Creating scatterplots with `plot`, histograms with `hist`, boxplots with `boxplot`, and images with `image`
- The function `with` evaluates an R expression in an environment constructed from data, possibly modifying (a copy of) the original data.
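A minimal sketch of the base plotting functions together with `with`. The data frame and its values are invented for illustration, and the rate per 1,000 is just an example calculation.

```r
# Toy data; the values are made up
df <- data.frame(
  population = c(1000, 2500, 400, 1800),
  total      = c(3, 10, 1, 9)
)

# with() lets us refer to columns directly, without df$
rate <- with(df, total / population * 1000)

with(df, plot(population, total))   # scatterplot
hist(rate)                          # histogram
boxplot(rate)                       # boxplot
image(matrix(1:12, 4, 3))           # image of a matrix
```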
3. Programming Basics¶
- Conditional expressions
    - R uses if-else statements (with C-like syntax) for flow control
    - `ifelse` takes three arguments: a logical and two possible answers. If the logical is `TRUE`, the value in the second argument is returned and if `FALSE`, the value in the third argument is returned. It works element-wise on vectors.
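The two forms of conditional can be sketched as follows. The variable names and values are illustrative assumptions.

```r
# if-else: one condition, one branch taken
a <- 0
if (a != 0) {
  msg <- paste("the reciprocal is", 1 / a)
} else {
  msg <- "no reciprocal for 0"
}
print(msg)

# ifelse: vectorized, one answer per entry
x <- c(0, 1, 2, -4, 5)
result <- ifelse(x > 0, 1 / x, NA)
print(result)
```

`if` expects a single logical value, so `ifelse` is the tool to use when the condition is a whole vector.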
- Defining functions
    - In general, functions are objects, so we assign them to variable names with `<-`. The function `function` tells R you are about to define a function; it is followed by the arguments in parentheses and the body in curly braces.
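A sketch of a function definition. The function `avg` and its `arithmetic` argument are an example of the pattern, not something defined by R itself.

```r
# name <- function(arguments) { body }; the last evaluated value is returned
avg <- function(x, arithmetic = TRUE) {
  n <- length(x)
  if (arithmetic) {
    sum(x) / n          # arithmetic mean
  } else {
    prod(x)^(1 / n)     # geometric mean
  }
}

avg(c(1, 2, 4))                      # uses the default, arithmetic = TRUE
avg(c(1, 2, 4), arithmetic = FALSE)  # geometric mean
```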
- Namespaces: it is likely that two packages use the same name for two different functions
    - R will follow a certain order when searching for a function in these namespaces. You can see the order by typing `search()`
    - You can force the use of a specific namespace by using double colons (`::`), like `dplyr::filter` or `stats::filter`
    - If we want to use a function in a package without loading the entire package, we can use the double colon as well
    - If you want to see all the packages that have a function called, for example, `filter`, you can use double question marks: `??filter`
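A small sketch using only packages that ship with R, so nothing has to be installed:

```r
# Search path: the order in which R looks for a name
search()

# Package-qualified calls pick the function from the named package,
# regardless of what else is attached
s <- stats::sd(c(2, 4, 6))
first_three <- utils::head(letters, 3)
```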
- For-loops: the syntax is `for (i in values) { ... }`, with a braced body as in C, but the loop variable walks through the entries of a vector
- Vectorization and functionals: vectorization is preferred over for-loops
    - Functionals are functions that help us apply the same function to each entry in a vector, matrix, data frame, or list. Here we cover the functional that operates on numeric, logical, and character vectors: `sapply`.
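The same computation written three ways, as a sketch: a for-loop, `sapply`, and a fully vectorized call. The helper `compute_s_n` is a hypothetical name for a function that sums the integers from 1 to n.

```r
compute_s_n <- function(n) sum(1:n)
m <- 25

# For-loop: fill a preallocated vector one entry at a time
s_n <- numeric(m)
for (n in 1:m) {
  s_n[n] <- compute_s_n(n)
}

# Functional: apply the function to every entry of 1:m
s_n_sapply <- sapply(1:m, compute_s_n)

# Vectorized: cumulative sums give the same numbers in one call
s_n_vec <- cumsum(1:m)
```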
4. The tidyverse¶
- Tidy format permits the data analyst to focus on more important aspects of the analysis rather than the format of the data
- Refining data frames:
    - `mutate`: adding columns; it takes the data frame as a first argument and the name and values of the variable as a second argument using the convention `name = values`; to apply the same transformation to several variables, use `across`
    - `filter`: row-wise subsetting; it takes the data frame as the first argument and then a conditional statement as the second
    - `select`: column-wise subsetting; it takes the data frame as a first argument and the column names as the next arguments (or helpers such as `starts_with`, `where`, `ends_with`, `contains`, `matches`, `num_range`)
- Pipe: perform a series of operations with `%>%` or `|>`; the pipe sends the result of the left side of the pipe to be the first argument of the function on the right side of the pipe
- The `summarize` function in dplyr provides a way to compute summary statistics
- A common operation in data exploration is to first split data into groups and then compute summaries for each group; `group_by` does the splitting
- We can extract variables with `pull`
- With `arrange` we get to decide which column to sort by (also for nested sorting)
- The function `top_n` takes a data frame as its first argument, the number of rows to show in the second, and the variable to filter by in the third. If the third argument is left blank, `top_n` filters by the last column. In current dplyr releases `top_n` is superseded by `slice_max` and `slice_min`.
- A "tibble" is a special kind of data frame; we can transform a data frame into a tibble with `as_tibble`
    - Tibbles display better than data frames
    - Tibbles can have complex entries (even functions)
    - Tibbles can be grouped
    - Tibbles can be created directly with the `tibble` function
- Placeholder: if we want to pass the left-hand side to an argument of the right-hand side function other than the first, we should use a placeholder.
    - For the `|>` pipe the placeholder is `_` (R 4.2 or later; it must be passed to a named argument)
    - For the `%>%` pipe the placeholder is `.`
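A sketch of the native pipe and its placeholder using only base R functions. It assumes R 4.2 or later, since older versions do not accept `_`.

```r
# The left-hand side becomes the first argument of the next call
total <- c(1, 4, 9) |> sqrt() |> sum()      # same as sum(sqrt(c(1, 4, 9)))

# Placeholder: send the left-hand side to a named argument other than the first
steps <- 2 |> seq(from = 10, to = 20, by = _)
```

With the magrittr pipe the second line would use `.` in place of `_`.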
- purrr package: includes functions similar to `sapply` but that better interact with other tidyverse functions
    - `map` works very similar to `sapply` but always, without exception, returns a list
    - `map_dbl` always returns a vector of numeric values
    - `map_df` always returns a tibble data frame
- Tidyverse conditionals:
    - The `case_when` function is useful for vectorizing conditional statements. It is similar to `ifelse` but can output any number of values
    - The `between` function determines if a value falls inside an interval.
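The dplyr verbs from this section can be chained as in the sketch below. It assumes the dplyr package is installed; the data frame and its values are invented for illustration.

```r
library(dplyr)

# Toy data; the values are made up
df <- data.frame(
  state      = c("A", "B", "C", "D"),
  region     = c("West", "West", "South", "South"),
  population = c(1000, 2500, 400, 1800),
  total      = c(3, 10, 1, 9)
)

# mutate -> filter -> select -> arrange, chained with the pipe
rates <- df |>
  mutate(rate = total / population * 1000) |>
  filter(rate > 3.5) |>
  select(state, region, rate) |>
  arrange(desc(rate))

# Split into groups, summarize each group, then pull one column out as a vector
region_totals <- df |>
  group_by(region) |>
  summarize(total = sum(total)) |>
  pull(total)

# case_when and between for vectorized conditionals
sized <- df |>
  mutate(size = case_when(
    population < 500               ~ "small",
    between(population, 500, 2000) ~ "medium",
    TRUE                           ~ "large"
  ))
```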
5. data.table¶
- data.table is more efficient and can handle larger datasets more effectively.
- data.table is a separate package that needs to be installed with `install.packages("data.table")` and then loaded with `library(data.table)`
- Refining data tables:
    - `as.data.table` can convert the data frame into a data.table
    - Column-wise subsetting: list the columns in the second position inside the square brackets, for example `murders_dt[, .(state, region)]`
    - Adding or transforming variables: the data.table `:=` function permits us to update the variable by reference
- Summarizing data: `s <- heights |> summarize(avg = mean(height), sd = sd(height))` does the same as `s <- heights_dt[, .(avg = mean(height), sd = sd(height))]`
    - We simply add the `by` argument to split the data into groups based on the values in a categorical variable: `heights_dt[, .(avg = mean(height), sd = sd(height)), by = sex]`
- Sorting: `murders_dt[order(population)]`
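The data.table operations above, sketched on a toy table. This assumes the data.table package is installed; the table `dt` and its values are made up and stand in for `murders_dt` and `heights_dt`.

```r
library(data.table)

# Toy data; the values are made up
df <- data.frame(
  state      = c("A", "B", "C", "D"),
  region     = c("West", "West", "South", "South"),
  population = c(1000, 2500, 400, 1800),
  total      = c(3, 10, 1, 9)
)
dt <- as.data.table(df)

# Column-wise subsetting
cols <- dt[, .(state, region)]

# Add a column by reference: dt itself is modified, no reassignment needed
dt[, rate := total / population * 1000]

# Summaries, overall and by group
overall   <- dt[, .(avg = mean(total), sd = sd(total))]
by_region <- dt[, .(total = sum(total)), by = region]

# Sorting
sorted <- dt[order(population)]
```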
6. Importing data¶
- It is highly recommended to use only relative paths in your code
- `getwd` gets the full path of your working directory; `setwd` changes your working directory; `file.path` combines characters to form a complete path; `file.copy` copies a file from one path to another
- File types: text files and binary files
    - The most common delimiters are comma (`,`), semicolon (`;`), space (` `), and tab (a preset number of spaces or `\t`)
    - You can look at any number of lines from within R using the `readLines` function, which can reveal what the file's delimiter is.
    - R's `readBin` function can process any binary file
    - Encoding: ASCII, UTF-8, UTF-16, and UTF-32; RStudio typically uses UTF-8 as its default
- Parsers: most importing functions are in the readr, readxl, and data.table packages.
    - Base R: `read.csv`, `read.table`, and `read.delim`; `scan` is another flexible function
    - readr: it is part of the tidyverse. Its parsers permit us to specify an encoding through the `locale` argument, and its function `guess_encoding` tries to guess the encoding of a file
    - readxl: provides functions to read in Microsoft Excel formats. These functions read the first sheet by default. The `excel_sheets` function gives us the names of all the sheets in an Excel file; these names can then be passed to the `sheet` argument of the reading functions.
    - data.table: provides the `fread` function for large datasets, which automatically detects the format of the input (even gzip and zip)
Function | Format | Typical suffix |
---|---|---|
read_table | white space separated values | txt |
read_csv | comma separated values | csv |
read_csv2 | semicolon separated values | csv |
read_tsv | tab delimited values | tsv |
read_delim | general text file format, must define delimiter | txt |
Function | Format | Typical suffix |
---|---|---|
read_excel | auto detect the format | xls, xlsx |
read_xls | original format | xls |
read_xlsx | new format | xlsx |
- Downloading files: most parsers can read files on the internet directly from a URL
    - You can use the `download.file` function to have a local copy of the file
    - `tempdir` creates a directory with a random name that is very likely to be unique
    - `tempfile` creates a character string, not a file, that is likely to be a unique filename
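A base R sketch of the import workflow: build a path, peek at the raw lines, then parse. The file contents are made up, and the file is written to a temporary location so that no download is needed.

```r
# file.path builds a path from its parts
rel_path <- file.path("data", "murders.csv")   # "data/murders.csv"

# tempfile() only returns a name; the file exists once we write to it
tmp <- tempfile(fileext = ".csv")
writeLines(c("state,total", "A,3", "B,10"), tmp)

# Look at the first lines to find the delimiter and check for a header
readLines(tmp, n = 2)

# Parse with a base R reader
dat <- read.csv(tmp)
str(dat)

# Clean up the temporary file
removed <- file.remove(tmp)
```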
- Organizing data with spreadsheets:
- avoid Microsoft Excel format
- Be Consistent
- Choose Good Names for Things: don't use spaces or symbols; stick to letters and numbers
- Write Dates as YYYY-MM-DD
- No Empty Cells
- Put Just One Thing in a Cell
- Make It a Rectangle
- Create a Data Dictionary
- No Calculations in the Raw Data Files
- Do Not Use Font Color or Highlighting as Data
- Make Backups
- Use Data Validation to Avoid Errors
- Save the Data as Text Files