Chapter 16 Extra Material
16.1 Data Types
Data frames require that every value within a column has the same type. That is to say, if a column is numeric, you can’t just change one value to a character string. Below are the most common data types that are used within R.
- Integers - These are the integer numbers \(\left(\dots,-2,-1,0,1,2,\dots\right)\).
To convert a numeric value to an integer you may use the function as.integer().
x <- 3
y <- c(1:3)  # note the c command to 'combine' data
z <- c(1:10, 12:20)
- Numeric - These could be any number (whole number or decimal). To convert another type to numeric you may use the function as.numeric().
x <- 3.14
y <- c(1, 2, 8, 21*pi)  # note the c command to 'combine' data
z <- c(1:10, 12:20, 17.75)
- Strings - These are a collection of characters (example: storing a student’s last name). To convert another type to a string, use as.character().
x <- 'Derek'
y <- c('Bonnie', 'Clyde')
z <- c('Rod', 'Jane', 'Freddie')
- Factors - These are strings that can only take values from a finite set. For example, we might wish to store a variable that records the home department of a student. Since the department can only come from a finite set of possibilities, I would use a factor. Factors are categorical variables, but R calls them factors instead. A vector of values of another type can always be converted to a factor using the as.factor() command.
y <- as.factor(c('Bonnie', 'Clyde'))
y
## [1] Bonnie Clyde
## Levels: Bonnie Clyde
- Logicals - These can only take the values TRUE and FALSE, much like a factor with exactly two levels. (Be careful to always capitalize TRUE and FALSE. Because R is case-sensitive, TRUE is not the same as true.) Using the function as.logical() you can convert numeric values to TRUE and FALSE, where 0 is FALSE and any other number is TRUE.
x <- FALSE
y <- c(TRUE, TRUE, FALSE, FALSE)
z <- as.logical(c(0, 0, 1, 1, 0, 1, 0))
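To check which of the types listed above a value actually has, the base R function class() is handy. The following is a minimal sketch; the variable names are made up for illustration. Note in particular that a bare number such as 3 is stored as numeric, and only becomes an integer if you ask for it.

```r
# Hypothetical variable names, chosen only for this illustration
a <- 3      # a bare number is stored as numeric (double precision)
b <- 3L     # the L suffix asks R for an integer
d <- 1:3    # the colon operator produces integers

class(a)    # "numeric"
class(b)    # "integer"
class(d)    # "integer"

# Explicit conversions between the types described above
a_int <- as.integer(3.99)          # drops the decimal part, giving 3
flags <- as.logical(c(0, 2, -1))   # FALSE TRUE TRUE
names <- as.character(c(1.5, 2))   # "1.5" "2"
dept  <- as.factor(c('Bonnie', 'Clyde'))

class(a_int)  # "integer"
class(flags)  # "logical"
class(names)  # "character"
class(dept)   # "factor"
```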
Depending on the command, R will coerce your data from one type to another if necessary, but it is a good habit to do the coercion yourself. If a variable is a number, R will automatically assume that it is a continuous numerical variable. If it is a character string, then R will generally treat it as a factor when doing a statistical analysis.
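As a small illustration of automatic coercion, the sketch below uses a made-up data frame (the names df, id and score are hypothetical). Placing a single character string into a numeric column does not produce a mixed column: R quietly converts the whole column to character instead.

```r
# A made-up data frame with one numeric column
df <- data.frame(id = 1:3, score = c(90.5, 72, 88))
before <- class(df$score)   # "numeric"

# Overwrite one value with a character string
df$score[2] <- "absent"
after <- class(df$score)    # "character": the whole column was coerced

df$score                    # "90.5" "absent" "88"

# Converting back by hand gives NA where the text is not a number
# (R also issues a warning, suppressed here to keep the output tidy)
back <- suppressWarnings(as.numeric(df$score))
back                        # 90.5 NA 88
```

This is one reason to do the coercion yourself: you decide what happens to values that do not convert cleanly.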
Most of these types are familiar to users of other software except for factors. Factors are how R keeps track of categorical variables. R does this in a two-step pattern. First, it figures out how many categories there are and remembers which category each observation belongs to; second, it keeps a vector of character strings that correspond to the names of the categories.
# A character vector
y <- c('B','B','A','A','C')
y
## [1] "B" "B" "A" "A" "C"
# convert the vector of characters into a vector of factors
z <- factor(y)
str(z)
## Factor w/ 3 levels "A","B","C": 2 2 1 1 3
Notice that the vector z is actually the combination of the group assignment vector 2,2,1,1,3 and the group names vector "A","B","C". So we could convert z to a vector of numerics or to a vector of character strings.
as.numeric(z)
## [1] 2 2 1 1 3
as.character(z)
## [1] "B" "B" "A" "A" "C"
Often we need to know what possible groups there are, and this is done using the levels() command.
levels(z)
## [1] "A" "B" "C"
Notice that the group names were ordered alphabetically, which we did not choose. This ordering of the levels has implications when we do an analysis or make a plot, and R will always display information about the factor levels using this order. It would be nice to be able to change the order. Also, it would be really nice to give more descriptive names to the groups rather than just the group code in my raw data. Useful functions for controlling the order and labels of a factor can be found in the forcats package.
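If you do not have forcats installed, the same two tasks can be sketched in base R using the levels and labels arguments of factor(). The descriptive labels below are invented purely for illustration; forcats offers functions such as fct_relevel() and fct_recode() for the same jobs.

```r
# The character vector from above
y <- c('B','B','A','A','C')

# Choose the level order ourselves, and attach descriptive labels.
# The labels are hypothetical: 'C' = Control, 'B' = Low dose, 'A' = High dose
z2 <- factor(y,
             levels = c('C', 'B', 'A'),
             labels = c('Control', 'Low dose', 'High dose'))

levels(z2)        # "Control" "Low dose" "High dose"
as.integer(z2)    # 2 2 3 3 1
as.character(z2)  # "Low dose" "Low dose" "High dose" "High dose" "Control"
```

The group assignment vector now follows the order we chose, so tables and plots will list Control first.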
16.2 Working with markdown
Markdown is a useful tool to quickly write good reports, presentations, websites, etc; it allows us to share our work with collaborators, for assessment, or publish for the world. As an example, these notes are written in R Markdown.
RStudio currently (confusingly) supports two markdown systems: R Markdown and Quarto. If you are working with R, the two are fairly equivalent, but Quarto is newer and also allows you to use other computer languages such as Python, and I suspect R Markdown will be phased out in favour of Quarto.
The first step in any new analysis or project is to create a new markdown file.
This can be done by selecting the File -> New File -> Quarto Document dropdown option; a menu will appear asking you for the document title, author, and preferred output type. For now, use HTML. In order to create a PDF, you’ll need to have LaTeX installed, but the HTML output nearly always works and I’ve had good luck with the MS Word output as well.
A Markdown document is just a text file with some basic structure specifying the typesetting information so that it can easily be converted into a webpage, PDF, or MS Word document. This syntax was extended to allow us to embed R commands directly into the document.
Perhaps the easiest way to understand the syntax is to look at the Help -> Markdown Quick Reference dropdown link in RStudio.
Whenever you create a new Markdown document, it is populated with code and comments that attempt to teach new users how to work with markdown. Critically, there are two types of regions:
Region Type | Description |
---|---|
Commentary | These are the areas with a white background. You can write nearly anything here and it will appear in your final document. I typically use these spaces to write commentary and interpretation of my data analysis project. |
Code Chunk | These are the grey areas. This is where your R code will go. When rendering the document, the code chunks are run one after another, from top to bottom, and the code in each chunk must run without error. |
The R code is nicely separated from regular text, and R knows that these code chunks are the parts that need to be evaluated. The output of this document looks good as an HTML, PDF, or MS Word document.
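To see how the two region types look in the underlying text file, here is a rough sketch that writes a very small Quarto document from R. The title, heading, and file location are assumptions made for this illustration; a document created through the RStudio menu will contain more than this.

```r
# Three backticks mark the start and end of a code chunk in the source file.
# They are built with strrep() here only to keep this example tidy.
fence <- strrep("`", 3)

# Hypothetical contents: a header, some commentary, and one R code chunk
qmd_lines <- c(
  "---",                                     # header: document settings
  'title: "My first report"',
  "format: html",
  "---",
  "",
  "## Analysis of data",                     # commentary: a heading
  "",
  "Commentary goes here as ordinary text.",  # commentary: a paragraph
  "",
  paste0(fence, "{r}"),                      # start of a code chunk
  "summary(sleep)",                          # R code inside the chunk
  fence                                      # end of the code chunk
)

# Write the file to a temporary folder and print it back
qmd_file <- file.path(tempdir(), "myfirst.qmd")
writeLines(qmd_lines, qmd_file)
cat(readLines(qmd_file), sep = "\n")
```

Opening the resulting file in RStudio should show the commentary on a white background and the chunk in grey. If the quarto R package and the Quarto command-line tool happen to be installed, quarto::quarto_render(qmd_file) should render it as well.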
While writing a markdown file, each of the code chunks can be executed in a couple of different ways.
- Press the green arrow at the top of the code chunk to run the entire chunk.
- The run button has several options.
- There are keyboard shortcuts: Ctrl-Enter on Windows (Cmd-Enter on a Mac) runs the current line or selection.
To insert a new code chunk, a user can type it in directly, use the green Insert button, or the keyboard shortcut.
Finally, we want to produce a nice output document that combines the code, output, and commentary. To do this, you’ll “Render” the document which causes all of the R code to be run in a new R session, and then weave together the output into your document. This can be done using the “Render” button at the top of the Editor Window.
Without Markdown, we must tediously copy and paste tables of output from the R console and figures into another document, such as Microsoft Word. Far too often we can make a small mistake, have to go back, correct the mistake, and then redo all the laborious copying.
With R markdown we can re-run the analysis with a click of a button and all the tables and figures will be updated by magic.
16.2.1 Exercises
- Create a new markdown document. Give it a title, and put your name in as author. I suggest sticking to the defaults, and using HTML format.
- Type in a heading “Analysis of data”.
- After this, add a code chunk (the green button with a “c” on it) and in this box, type the lines
mydata <- sleep
head(mydata)
boxplot(mydata)
summary(mydata)
- Hit the green Play button to the right of the code and check that it does what you expect.
- Write a paragraph below the code chunk summarising what you see in the output of the code chunk. You can format this with bold, italics, bullets and numbering, as you would in any basic editor.
- When you have finished, hit the Render button on the ribbon at the top of the editor window. You should be asked for a filename for your document. You could call it “myfirst.qmd”. (.qmd is the extension for Quarto Markdown documents, and .Rmd is the extension for R Markdown documents.)
- Your first markdown document should appear in a new browser window. Doesn’t it look professional?!
- Go back to your document, and change the first line of your code so it looks like
mydata <- iris
and then hit the run button for the chunk. The dataset will change, but the analysis is the same. You will need to update your description, but can then re-render the document and generate a webpage again.
- Note that Quarto comes with a visual and a source editor. I like to work with the visual editor, but occasionally the source editor is needed to make tweaks and do the detailed work.
- Start a new file, but this time start a Quarto Presentation. Just save the default file created, and hit Render to get an idea of how it works: simple presentations are very easy to make in Quarto. There is a good introduction here: https://quarto.org/docs/presentations/