Chapter 5 Data Import and Export
5.1 Best Practice Summary
5.1.1 Reading
When reading data the recommended approach is:
- Read data as it is without making any changes
- Validate the column names and types
- Possibly also validate values
- Apply transformations to make it tidy
Validating the data when reading it can save a lot of problems later: if the data changes, the process fails with clear error messages so the user knows exactly where the problem is.
Tidying the data after reading it makes it much easier to work with.
For a really robust process, you can either:
- Create a function per input containing all of the above steps, or
- Specify the structure of the inputs in a configuration file and create a general function that maps over all inputs, reading and validating them, as sketched below.
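As an illustration of the second approach, here is a minimal sketch: the structure of each input is declared in a configuration list, and a single function is mapped over it to read and validate everything. The file path and column specification here are hypothetical:
library(readr)
library(purrr)

# Hypothetical configuration: one entry per input, declaring where the
# file lives and what structure is expected
inputs <- list(
  chickens = list(
    path = "data/chickens.csv",
    col_types = cols(
      chicken = col_character(),
      sex = col_character(),
      eggs_laid = col_integer(),
      motto = col_character()
    )
  )
)

# General reader: read the file as it is, then fail loudly if any
# column did not parse as specified
read_input <- function(cfg) {
  df <- read_csv(cfg$path, na = "", col_types = cfg$col_types)
  stop_for_problems(df)
  df
}

all_data <- map(inputs, read_input)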
5.2 CSV
Perhaps the most convenient way to transfer tabular data between applications is the CSV (Comma-Separated Values) file. CSVs are just plain text files with a line per row and columns separated by commas. Reading and writing CSVs is straightforward with {readr}.
5.2.1 Reading
To read a CSV use read_csv(), supplying the path to the file:
library(readr)
chickens_csv <- readr_example("chickens.csv")
chickens <- read_csv(chickens_csv)
## Rows: 5 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): chicken, sex, motto
## dbl (1): eggs_laid
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Notice that read_csv() automatically detects the type of data in each column and lets you know how it has interpreted it. It is good practice when building a process to specify the columns and types you are expecting explicitly. This way you are alerted when new data does not conform to your specification. Doing so with read_csv() is straightforward with the col_types argument. You can even use the spec() function to extract the types readr guessed and modify them as required:
spec(chickens)
## cols(
## chicken = col_character(),
## sex = col_character(),
## eggs_laid = col_double(),
## motto = col_character()
## )
chickens <- read_csv(
  file = chickens_csv,
  na = "",
  col_types = cols(
    chicken = col_character(),
    sex = col_character(),
    eggs_laid = col_integer(),
    motto = col_character()
  )
)
If you want to override the column names, you can with the col_names argument, but that is not recommended as it can mask unexpected data changes. It is generally better to read the data in as it is, validate the columns, then rename them afterwards.
One thing to be careful of when reading CSVs is that read_csv() interprets "NA" as a missing value. This is particularly bad for Lloyd's as we have an "NA" Risk Code. To avoid this, set the na argument to "", so only empty values are interpreted as missing.
One other issue you may encounter when reading tables is headers that are difficult to work with. read_csv() preserves the column headers as they are, but if they contain spaces or special characters they have to be surrounded by backticks. Unless the preservation of names is absolutely necessary, it is a good idea to replace them with tidy names. You can use the rename() function from {dplyr} to rename them directly, or the {janitor} package has a useful function called clean_names() which converts the column names into snake_case.
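For example, with a hypothetical data frame with awkward headers:
library(dplyr)
library(janitor)

# Hypothetical data frame with awkward headers
messy <- tibble::tibble(`Chicken Name` = "Foghorn", `Eggs Laid (2023)` = 11L)

# Rename a single column directly with {dplyr}...
rename(messy, chicken_name = `Chicken Name`)

# ...or convert every header to snake_case with {janitor},
# e.g. "Eggs Laid (2023)" becomes "eggs_laid_2023"
clean_names(messy)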
5.2.2 Writing
Despite the name, the {readr} package also contains a function for writing CSVs. As you may expect, it is called write_csv():
write_csv(chickens, "chickens.csv", na = "")
As with read_csv(), you have to be careful with missing values (NA). The default behaviour of write_csv() is to write NAs as "NA" in the CSV. Since we have a Risk Code called "NA", this could have unforeseen issues. It is better to specify that write_csv() outputs NAs as empty strings ("").
5.3 Excel
Interacting with Excel from R is not always straightforward. Excel is not the best medium for data as it combines data with formatting and presentation. As a result, it can be fiddly to pick out the data you are interested in. In addition, due to the interactive nature of Excel, data is unlikely to be validated, and users can modify the workbook without realising it will break the downstream R processes. It is difficult to build a robust process that depends on Excel.
The following high-level rules should help working with R and Excel:
- Avoid using Excel. If another data format can be used, e.g. CSV or a database, then use it. It is better to have a more robust medium and import/export into R and Excel from the simpler format. Using the data import functionality in Excel makes this very simple.
- Keep it simple. Ideally you should only import/export tidy flat tables, preferably with a single table per sheet. If you must have multiple tables per worksheet then lay them out side-by-side. Avoid multiple, nested header rows: while they may look nice for presentation, they are much harder to work with. It is better to use flat tables in “input” worksheets and then use pivot tables or similar to present the data in another worksheet, or possibly another workbook.
- Validate data. Because the structure of Excel workbooks is not locked down, the most common issue when using R is changes to the layout, which break the process. While validation of inputs is recommended in all processes, it is even more important here. In particular, you should specify the column names and types expected, as well as the allowed values, as in the sketch below. Giving a good error message if the validation fails will save lots of time debugging later on.
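As a minimal sketch of such a validation, assuming the chickens data from earlier and a hypothetical rule that sex must be “rooster” or “hen”:
validate_chickens <- function(df) {
  # Check the column names match the specification exactly
  expected <- c("chicken", "sex", "eggs_laid", "motto")
  if (!identical(names(df), expected)) {
    stop(
      "Unexpected columns: ", paste(names(df), collapse = ", "),
      ". Expected: ", paste(expected, collapse = ", ")
    )
  }
  # Check types and allowed values, with clear messages on failure
  if (!is.numeric(df$eggs_laid)) {
    stop("Column 'eggs_laid' must be numeric")
  }
  if (!all(df$sex %in% c("rooster", "hen"))) {
    stop("Column 'sex' contains values other than 'rooster'/'hen'")
  }
  invisible(df)
}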
5.3.1 Packages
5.3.1.1 {readxl}
The best package for reading data from Excel is {readxl}. It is part of the Tidyverse and so works in a very similar way to {readr}. The {readxl} package is intentionally simple and most of the time you will only need to use two functions:
- excel_sheets(): List all sheets in an Excel spreadsheet.
- read_xlsx(): Read tabular data from an xlsx file.
read_xlsx() has lots of options for reading the data, similar to read_csv(). We will summarise some of them here, but for the full list check out the documentation.
5.3.1.1.1 Table location
In order to locate the required data within the spreadsheet you have a number of options (see the sketch after this list):
- sheet: The worksheet name. The function will extract a table based on the sheet contents. If there is only one table this will work fine, but if you have multiple tables you will need to specify a range.
- skip: Skip the specified number of rows before reading the table.
- range: Specify a range to read from. This can be in the format "B3:D87" or "R1C3:R27C12". Alternatively, several helper functions can be used for more flexible ranges:
  - cell_rows(): Specify rows only; columns are determined by non-empty cells.
  - cell_cols(): Specify columns only; rows are determined by non-empty cells.
  - cell_limits(): Specify the upper-left and lower-right cells; unspecified limits will be determined by non-empty cells.
5.3.1.1.2 Column Names and Types
read_xlsx() allows you to define the column names and types with a few arguments (see the sketch after this list):
- col_names: If unspecified, the first row will be used as column names. Otherwise, it can be used to override the column names in the spreadsheet. As with CSVs this is generally not recommended, as it may mask issues with the data, such as columns being moved.
- col_types: This can be used in the same way as with CSVs to set the expected column types. If the data does not conform to these types a warning is raised.
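For example, reading a hypothetical chickens.xlsx with the structure of the chickens data from earlier; note that for read_xlsx() the types are given as a character vector rather than a cols() specification:
library(readxl)

# A warning is raised if any column does not conform to these types
chickens_xl <- read_xlsx(
  "chickens.xlsx",
  col_types = c("text", "text", "numeric", "text")
)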
5.3.1.1.3 Examples
See the examples page for the read_xlsx() function.
5.3.2 {writexl}
Like {readxl}, {writexl} is intentionally simple. It only writes data to a new Excel workbook and the table is positioned on the top-left of the sheet.
library(writexl)
write_xlsx(chickens, "~/chickens.xlsx")
If you want to write multiple tables, you can use a named list, where each name becomes a sheet name. For example:
write_xlsx(
x = list(iris = iris, chickens = chickens),
path = "~/chickens.xlsx"
)
5.3.3 {openxlsx}
While most use-cases can be covered by the above packages, there are some that are not. For example, if you need to:
- Create a spreadsheet that has multiple tables per sheet, or multiple headers per table.
- Define the formatting for the output.
- Fill out an existing Excel template.
In these cases {writexl} will not be suitable, and {openxlsx} is the package to use.
Before using {openxlsx}, you should first review alternatives. For example, it might be better to export data as a separate Excel workbook using {writexl} then use lookups in the template. Using {openxlsx} is very powerful, but it can get complicated and difficult to maintain if the spreadsheet changes regularly.
{openxlsx} can read, write and update Excel files; it can also get and set formatting and styles. It can be used to build complicated workbooks from scratch or to update templates. Describing how it works in full is beyond the scope of this document, but if you would like to use it, it has a very good introduction. A brief flavour is sketched below.
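To give that flavour, here is a minimal sketch that writes two tables side-by-side on one sheet with a styled header row; the layout and styling are arbitrary:
library(openxlsx)

wb <- createWorkbook()
addWorksheet(wb, "summary")

# Bold, filled header style shared by both tables
hdr <- createStyle(textDecoration = "bold", fgFill = "#DDEBF7")

# First table at the top-left of the sheet
writeData(wb, sheet = "summary", x = head(iris),
          startCol = 1, startRow = 1, headerStyle = hdr)

# Second table laid out side-by-side, as recommended above
writeData(wb, sheet = "summary", x = mtcars[1:5, 1:3],
          startCol = 7, startRow = 1, headerStyle = hdr)

saveWorkbook(wb, "summary.xlsx", overwrite = TRUE)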