2 Jumping essentials
2.1 The essentials
Before you jump into the water, it helps to know more about it. What is the temperature? Is it really cold or just nice and warm? How high is the jump? Do you first need to jump 5 meters from a diving board, or can you already feel the water with your toes?
This first chapter gives some basic programming essentials that will make the jump easier. It can also be used as a reference for when you need to make the jump again.
2.1.1 Find your info online and in documentation
R has so many functions that it is impossible to know them all by heart. Documentation and the internet are always your best friends.
Stack Exchange is an excellent resource. Nearly all of your questions on how to use R and tidyverse functions have been asked before by others. The nice thing is that, most often, the questions include a snippet of code that makes the problem reproducible. More importantly, almost all questions have been accurately answered in multiple ways.
Other resources that often come up in my search results are the forums on Posit Community, Reddit, or GitHub discussions and issues. These are more forum-like threads, which lack the clear question-and-answer structure of Stack Exchange.
Then there are many more sites that somehow scrape the internet and collect basic info. Most of the time the info is correct but too simplistic, and real issues are not tackled. These are sites like GeeksforGeeks, Datanovia, and Towards Data Science. Some have better info than others, but most of them are commercial and in the end want to sell you courses or get your clicks.
The good news is that you don’t have to remember any of these sites by heart. Just type your question into your favorite search engine, ideally copy-pasting the exact error message, and you will very likely find your answer.
2.1.2 R and tidyverse documentation
Functions in R and the tidyverse are documented in detail. All their arguments are described, and the examples that are given are especially helpful. Packages often have even more documentation, called vignettes, that explains certain topics and contexts and shows how and when to use the functions.
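As a minimal sketch of how to reach this documentation from the console, the calls below use only functions that ship with R. The functions looked up here (mean, sd) and the grid package are picked purely for illustration.

```r
# Open the help page of a function (two equivalent forms)
?mean
help("mean")

# Run the examples from the bottom of a help page
example("mean")

# Look up argument names without opening the help page
arg_names <- names(formals(stats::sd))
print(arg_names)

# List the vignettes of a package (grid ships with R)
vignette(package = "grid")
```

For a tidyverse package the same calls apply, for example vignette(package = "dplyr"), assuming that package is installed.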
2.1.3 Style and layout
Code benefits from proper readability. Just like we lay out our texts, manuscripts and Excel data files, we also need a good layout for our code.
There are multiple ways to organize your code. I try to adhere to:
- short lines (max 60 characters per line)
- indent after the first line
- indent after ggplot
- each next function call aligns with the function call above
- each argument aligns with the previous argument
- each ggplot layer gets its own line
- the x and y aesthetics for the ggplot mapping go on one line
Other good practices are:
- use the package name before a function, like dplyr::mutate
- use comments to annotate the code; anything you put after a # is not executed
So here is an example of what not to do, and its correction.
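The original example is not reproduced here, so the following is only a sketch of the idea, assuming dplyr and ggplot2 are installed. The built-in mtcars dataset and the names p_bad and p_good are chosen for illustration. Both versions build the same plot; only the layout differs.

```r
library(dplyr)
library(ggplot2)

# What not to do: one long line, no spaces, no comments
p_bad <- mtcars %>% filter(cyl==4) %>% ggplot(aes(x=wt,y=mpg))+geom_point()+geom_smooth(method="lm",formula=y~x)

# Corrected: short lines, one step or layer per line,
# explicit package names, and comments
p_good <- mtcars %>%
  # keep only the cars with four cylinders
  dplyr::filter(cyl == 4) %>%
  ggplot2::ggplot(
    mapping = ggplot2::aes(x = wt, y = mpg)
  ) +
    # each layer gets its own line
    ggplot2::geom_point() +
    ggplot2::geom_smooth(
      method = "lm",
      formula = y ~ x
    )
```

Printing p_bad or p_good draws the plot.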
2.1.4 Calling packages and functions
There are two ways of calling functions in R. The most straightforward one is to call the function by its name. Out of the box, this works only for functions from base R and the other packages that R attaches at startup, such as stats and utils. When you want a function from another package, you can first load the package with library(your_favorite_package) and then call your function with my_favorite_function(my_argument). Another, preferred way is to always explicitly mention the package that the function comes from, as in your_favorite_package::my_favorite_function(my_argument). It can happen that two different packages implement a function with the same name, leading to confusion. In that case, you need to be careful with the order of the library calls, because a function is masked by a function with the same name from a package that is loaded later.
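A small sketch of both calling styles and of masking, using only base R. To avoid depending on extra packages, the masking is imitated by defining our own median function. This is a stand-in for what happens when a later-loaded package exports a function with an existing name (for example, dplyr::filter masks stats::filter).

```r
# Calling by name works for functions that R attaches at startup
by_name <- median(c(1, 5, 9))

# The explicit form names the package as well
explicit <- stats::median(c(1, 5, 9))

# Imitate masking: a new function with the same name
# hides the original when it is called by name only
median <- function(x) "not the real median"
masked <- median(c(1, 5, 9))

# The explicit form still reaches the original function
still_ok <- stats::median(c(1, 5, 9))

# Remove our stand-in again
rm(median)
```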
2.2 Basic R semantics
When you start using R and the tidyverse, the new language can be daunting. So here is a short primer on common semantics that are often not directly clear from the code.
I took some of these examples directly or indirectly from:
2.2.1 Assignment
The most common way of assigning in R is the <- symbol. Although = works in the same way at the top level of your code, most R users reserve it for other things. I tend to use it for assigning numbers to constants, and it is also used to pass values to function arguments.
2.2.2 Vectors and lists
A vector in R is a collection of items (elements) of the same kind (type). A list is a collection of items that can also have different types. We make a vector with c() and a list with list(). The c in c() stands for combine.
If you try to build a vector with elements of different types, R will try to convert all of them to a single type. You can see that when you specify a vector with numbers and characters, e.g. c(1, 2, "1", "2"). R forces the vector to be of character type. While it may look handy that R does this for us, it is a dangerous feature that might lead to wrong inputs going unnoticed.
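The coercion can be checked directly in the console. The sketch below uses base R only. The last call emits a warning, which is exactly the kind of signal that is easy to overlook.

```r
# Numbers and characters together become character
mixed <- c(1, 2, "1", "2")
mixed_class <- class(mixed)

# Numbers and logicals together become numeric
nums <- c(1, 2, TRUE)

# Converting back can silently lose information:
# "three" becomes NA, with only a warning
converted <- as.numeric(c("1", "2", "three"))
```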
Lists form the basis of most data structures other than vectors. Data frames are collections of related data with rows and columns, unique column names, and row names (or row numbers). A data.frame is actually a list of columns of equal length. Tibbles are the tidyverse equivalent of data frames, with some more handy properties. The items of a list can be named or not.
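A short base R sketch of these structures. All names and values are made up for illustration.

```r
# A list with named items of different types
person <- list(name = "Ann", age = 30, scores = c(8, 9))
age <- person$age

# A list without names
unnamed <- list("Ann", 30)
no_names <- is.null(names(unnamed))

# A data frame is a list of equally long columns
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
df_is_list <- is.list(df)
n_columns <- length(df)
```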
2.2.3 Common semantics
R differs from other programming languages in a few ways, so when you start learning it there are some rules and common practices worth knowing.
2.2.4 ~ (the “tilde”)
2.2.5 + (the plus)
Apart from simple arithmetic addition, + is also used with the ggplot functions. It adds each new layer to a ggplot.
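To see both meanings side by side, here is a small sketch that assumes ggplot2 is installed. The mtcars dataset and the chosen layers are only an illustration.

```r
library(ggplot2)

# Arithmetic addition
total <- 1 + 2

# A plot without layers
p <- ggplot(
  data = mtcars,
  mapping = aes(x = wt, y = mpg)
)
n_before <- length(p$layers)

# Each + adds one layer to the plot
p <- p +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
n_after <- length(p$layers)
```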
2.2.6 %>% (the pipe)
The %>% operator is used to forward an object to another function or expression. It was first introduced in the magrittr package, and since version 4.1 base R has its own pipe, |>. For simple cases the two behave the same, although they differ in some details, such as the placeholder syntax. See the blog post for more info.
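A minimal sketch of what piping does, written with the base R pipe so that it runs without extra packages (it assumes R 4.1 or newer). With magrittr or dplyr loaded, %>% could be used in the same places.

```r
# Nested calls read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# The piped version reads from left to right:
# take the vector, then take roots, then sum them
piped <- c(1, 4, 9) |>
  sqrt() |>
  sum()
```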
2.2.7 == (equal to)
The == is the equal-to operator, which compares two values. It is different from =, which is used for assignment and for passing function arguments.
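A quick comparison of the two, with illustrative values:

```r
# = assigns a value
x = 5

# == compares and returns TRUE or FALSE
is_five <- x == 5

# On a vector, == compares element by element
hits <- c(1, 2, 3) == 2
```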
2.2.8 aes (aesthetics in ggplot)
The aes function is important for telling ggplot what to plot. aes holds the aesthetics of the plot that need to be mapped to data. So a ggplot needs data and mappings.
The gg in ggplot comes from the grammar of graphics, after the book “The Grammar of Graphics” by Leland Wilkinson, which Hadley Wickham used as the basis for the ggplot package in 2005.
A ggplot consists of:
- data
- aesthetic mappings (like x, y, shape, color etc)
- geometric objects (like points, lines etc)
- statistical transformations (stat_smooth)
- scales
- coordinate systems
- themes and layouts
- faceting
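The components listed above can be put together in one plot. This is only a sketch, assuming ggplot2 is installed. The mtcars dataset and every specific choice (limits, facet variable, theme) are illustrative.

```r
library(ggplot2)

p <- ggplot(
  data = mtcars,                   # data
  mapping = aes(x = wt, y = mpg)   # aesthetic mappings
) +
  geom_point() +                   # geometric object
  stat_smooth(                     # statistical transformation
    method = "lm",
    formula = y ~ x
  ) +
  scale_x_continuous(              # scale
    limits = c(0, NA)
  ) +
  coord_cartesian() +              # coordinate system
  facet_wrap(~ gear) +             # faceting
  theme_minimal()                  # theme

# The mapped aesthetics are stored in the plot object
mapped <- names(p$mapping)
```

Printing p draws the plot, with one panel per value of gear.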
2.2.9 %in% (match operator)
This is handy to check whether specific elements occur in a vector, and to filter on them.
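A base R sketch of both uses, with made-up vectors:

```r
states <- c("Ohio", "Utah", "Iowa", "Texas")
wanted <- c("Utah", "Texas", "Maine")

# Check: which of the wanted states are present?
present <- wanted %in% states

# Filter: keep only the states that are wanted
kept <- states[states %in% wanted]
```

The same expression can be used inside dplyr::filter(), as in filter(states %in% wanted).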
2.3 Practical tips
2.3.1 Running your code
Webr code in the browser can be run as a complete code block by clicking on the Run code button, right above the block, when the webr status is Ready!
Another option is to select one or more lines of code and press Command+Enter (macOS) or Ctrl+Enter. This will execute only the lines that you have selected.
2.3.2 Simple troubleshooting of your pipelines and ggplots
Your code is rarely typed in perfectly right away, so you will get errors and warnings. It is good practice to break down your full code block or pipe into parts and check after which line the code stops working properly.
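As a sketch of this approach, the pipeline below is first written in one go and then run step by step, inspecting the result after each step. It uses base R with the |> pipe (R 4.1 or newer). The dataset and the steps are chosen only for illustration.

```r
# The full pipeline: if it fails, it is unclear where
result <- mtcars |>
  subset(cyl == 4) |>
  transform(ratio = mpg / wt) |>
  head(3)

# The same pipeline, one step at a time
step_1 <- mtcars |> subset(cyl == 4)
n_rows <- nrow(step_1)      # did the filter keep rows?

step_2 <- step_1 |> transform(ratio = mpg / wt)
col_names <- names(step_2)  # is the new column there?

step_3 <- step_2 |> head(3)
```

For a ggplot the same idea applies: add one layer at a time and print the plot after each addition.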
2.4 ADVANCED EXERCISE
2.4.1 Building your data visualisation step by step
Let’s take the built-in R dataset USArrests. We want to visualize how the relative number of murders in the state of Massachusetts relates to that in the other states with the highest urban population. In the dataset, the Murder column represents the number of murder arrests per 100,000 residents.
Make a plot that addresses the above dataviz problem.
Hints:
Do the following in your code:
- glimpse at the data and look at the top 5 rows using head()
- use tibble::rownames_to_column() to make a separate column called states
- clean the column names using janitor::clean_names()
- turn the data table into a tibble using as_tibble()
- take only the top states by filtering on the urban population (keep values higher than 74)
- plot the data using a geom_col
- label the x-axis and not the y-axis
- highlight the Massachusetts column using a separate geom_col layer, where you put a filter on the original data by passing data = . %>% filter(str_detect(states, "Mass")) to that geom_col. Also give this bar a red color.
- apply a nice theme so that there are only x-axis grid lines and no lines for the y- and x-axis
- also make sure that the x-axis starts at zero
- use forcats::fct_reorder() to sort the states on the y-axis from the highest to the lowest murder rate
Include all these aspects step by step.