## +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
##
## 2. SharePoint Path ----
##
## +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# SharePoint path
if (Sys.info()["user"] == "johnDoe") {
  path2SP <- "/Users/johnDoe/OneDrive-WJP/Data Analytics/"
} else if (Sys.info()["user"] == "anaPerez") {
  path2SP <- "/CloudStorage/OneDrive-WJP/Data Analytics/"
} else if (Sys.info()["user"] == "kumikoNagato") {
  path2SP <- "/Users/kumikoNagato/Library/OneDrive-WJP/Data Analytics/"
} else {
  path2SP <- "PLEASE INSERT YOUR PERSONAL PATH TO THE SP DAU DIRECTORY"
}
1 Workflow
In this chapter, we cover the basic guidelines and issues related to the preferred workflow when programming in R. The chapter covers four aspects: a) Cloud Storage, b) Version Control System, c) RStudio Projects, and d) File Management practices.
1.2 Git
Git is free and open-source software that lets users set up a version control system for their projects. Given its features, it is normally used to develop code collaboratively while preserving the integrity of code and data. GitHub is a website and cloud-based service that helps developers store and manage their code, as well as track and control changes to it. For a gentle introduction to Git and GitHub, see the following post published by Kinsta. For a more in-depth introduction, please refer to the GitHub documentation or watch this video tutorial.
Using Git's features allows us to work simultaneously on the same project, and even on the same code, without worrying about interfering with other members of the team. As a rule, every project carried out by the DAU has a code administrator who is in charge of setting up the GitHub repository and adding other members of the team as project collaborators. Additionally, the code administrator is in charge of setting up the main branch and the initial structure of the code (see the [data-management](https://ctoruno.quarto.pub/wjp-r-handbook/workflow.html#data-management) section in this chapter). The GitHub repository is required to have its main branch in its respective SharePoint folder.
Note: It is highly recommended to create the repository from GitHub.com and not from the local machine. This avoids an initial commit that may include system files such as .DS_Store files.
We use convergence development to code collaboratively in the same project. For this, it is highly recommended that each team member works on a separate branch; once the data routine is done, the auxiliary branches can be merged into the main branch of the repository. Collaborators who are not the code administrator have the option to clone the GitHub repository to a local directory on their computer that is not synced to SharePoint, and to work on their respective branch from that local copy. In other words, only the final, functional version contained in the main branch is synced to SharePoint.
Important: GitHub is used to keep track of the code we use in each project. Under no circumstances will we include the data sources in the online repositories.
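One possible way to enforce the two notes above is a `.gitignore` file in the repository root. The sketch below is only an illustration: the handbook does not prescribe a `.gitignore`, the helper `write_gitignore()` is hypothetical, and the `Data/` folder name assumes the filing system described later in this chapter.

```r
# Hypothetical ignore rules; adjust the folder names to your own project.
ignore_rules <- c(
  "# Data sources never go to GitHub (assumes a top-level Data/ folder)",
  "Data/",
  "",
  "# System and session files",
  ".DS_Store",
  ".Rhistory",
  ".RData",
  ".Rproj.user/"
)

# Hypothetical helper: writes the rules to <repo_root>/.gitignore.
# Note that it overwrites an existing .gitignore file.
write_gitignore <- function(repo_root, rules = ignore_rules) {
  target <- file.path(repo_root, ".gitignore")
  writeLines(rules, target)
  invisible(target)
}

# Usage: a temporary directory stands in for the repository root.
repo_root <- file.path(tempdir(), "demo-repo")
dir.create(repo_root, showWarnings = FALSE, recursive = TRUE)
gitignore_path <- write_gitignore(repo_root)
readLines(gitignore_path)
```

Git only applies these rules to files that are not yet tracked, so the file is most useful when it is added before the first commit.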
1.3 Projects
When you load a data set or source an R script, you have to set the working directory where these files are located on your computer. However, the path to this working directory is different for each member of the team. RStudio allows us to enclose all of our analysis, code, and auxiliary files in a project.
A project is a feature that lets us work on an analysis without worrying about where its files are stored or who is working on them. Say goodbye to setwd("..."). Besides managing relative paths, R projects let users keep a history of the actions performed and even keep the objects in their environment. Because of this, projects are the cornerstone of our work when performing analysis with R.
As a rule, every project has a file named project-name.Rproj in its root directory, and opening it should be the first action when working on a project. For more information on working with R projects, refer to the Workflow section of the R for Data Science book.
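To illustrate why relative paths matter, here is a minimal sketch. The project layout, the file name `scores.csv`, and its contents are all hypothetical. In real work, opening the .Rproj file sets the working directory to the project root; the `setwd()` call below only simulates that step so the snippet can run on its own.

```r
# Hypothetical project root with a Data/clean sub-folder.
project_root <- file.path(tempdir(), "demo-project")
dir.create(file.path(project_root, "Data", "clean"),
           recursive = TRUE, showWarnings = FALSE)

# Simulate opening the .Rproj file (for this standalone demo only).
old_wd <- setwd(project_root)

# Write a small made-up data set using a path relative to the project root.
write.csv(
  data.frame(country = c("A", "B"), score = c(0.61, 0.47)),
  file.path("Data", "clean", "scores.csv"),
  row.names = FALSE
)

# The same relative path works for every collaborator, regardless of
# where the project folder lives on their machine.
scores <- read.csv(file.path("Data", "clean", "scores.csv"))

# Restore the previous working directory.
setwd(old_wd)
```

Building paths with `file.path()` instead of pasting strings keeps the code independent of the operating system's path separator.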
1.5 Data Management
The University of California San Diego (UC San Diego) has a Data Management Best Practices guide that reviews common guidelines for managing research data. In this handbook, I will focus on two topics mentioned in those guidelines: File Organization and Documentation.
1.5.1 File Organization
File organization involves two important elements: the filing system and naming conventions.
A filing system is basically the organizational structure (directories, folders, sub-folders, etc.) in which files are stored. There are no standard rules about how this should be done. However, the chosen filing system needs to make sense not only to the person currently working on a given project but also to anyone going through these files in the future. As a rule, each project has the following sub-folders:
- Code: Depending on the complexity of the project, you could choose to create separate directories within the Code folder for Stata, R or Python files.
- Data: Depending on the complexity and nature of the project, you could choose to create separate directories within the Data folder for RAW, INTERMEDIATE or CLEAN data sets.
- Outcomes: The outcomes of the project might have several different formats. For example, images could be in PNG and/or SVG format, reports might be created in PDF, some tables might be exported as TEX files, etc. The Outcomes folder should have a separate directory for each one of these formats.
- Markdown/Quarto: In some cases, the creation of a PDF report is key, for example, the Regional Country Reports. For these projects, we strongly advise creating a separate directory to store the code files used for the report. At the time of writing this handbook, these reports are created using R Markdown, but a migration to Quarto is feasible in the future.
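The folder structure above could be created in one step with a short helper. This is a minimal sketch, not a DAU tool: the function name `create_project_skeleton()` and the exact set of sub-directories (for instance, only an `R` folder under `Code`) are assumptions chosen for illustration.

```r
# Hypothetical helper: creates the sub-folders described above under `root`.
create_project_skeleton <- function(root) {
  sub_dirs <- c(
    file.path("Code", "R"),
    file.path("Data", c("RAW", "INTERMEDIATE", "CLEAN")),
    file.path("Outcomes", c("PNG", "PDF", "TEX")),
    "Markdown"
  )
  paths <- file.path(root, sub_dirs)
  for (p in paths) {
    # recursive = TRUE also creates the parent folders when missing
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}

# Usage: a temporary directory stands in for the project root.
project_root <- file.path(tempdir(), "demo-filing-system")
created <- create_project_skeleton(project_root)
list.dirs(project_root, full.names = FALSE, recursive = TRUE)
```

Running the helper again on an existing project is harmless, since existing directories are left untouched.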
Naming conventions are a set of rules designed to complement the filing system and help collaborators understand how the data is organized. Each project has the flexibility to use its own set of naming rules in the filing system. However, there are a few general rules to note:
- Use descriptive file names that are meaningful to you and your colleagues while also keeping them short.
- Avoid using spaces and make use of hyphens, snake_case, and/or camelCase.
- Avoid special characters such as $, #, ^, & in the file names.
- Be consistent not only within a project but also across different projects. If all data files and routines are named in the same way, it is easier for you to use those tools across projects and refactor routines.
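The general rules above can be checked mechanically. The following is a minimal sketch under stated assumptions: `check_file_name()` is a hypothetical helper, the 40-character limit is an arbitrary stand-in for "short", and any character other than letters, digits, hyphens, and underscores is treated as special.

```r
# Hypothetical helper: checks a file name against the general naming rules.
check_file_name <- function(file_name, max_chars = 40) {
  # Drop the directory and the extension; only the stem is checked
  stem <- tools::file_path_sans_ext(basename(file_name))
  c(
    no_spaces        = !grepl("[[:space:]]", stem),
    # Anything besides letters, digits, whitespace, "_" and "-" is "special"
    no_special_chars = !grepl("[^A-Za-z0-9[:space:]_-]", stem),
    short_enough     = nchar(stem) <= max_chars
  )
}

# Usage with two made-up file names
check_file_name("wjp_clean_data.csv")
check_file_name("final report $v2.xlsx")
```

Consistency across projects cannot be verified from a single name, so that rule is left out of the check.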
1.5.2 Documentation
When initializing the GitHub repository, we strongly suggest including a README file with it. This file should be a Markdown document that includes the following elements:
- A brief description of the project and the role of the DAU in it.
- Contact persons, who should be the project leader and the code administrator.
- A brief description of the filing system along with what to find in each sub-directory.
- Given that no data is uploaded into the online repositories, include the following descriptions:
  - Data needed to run the code: source and process to obtain it.
  - Location of the data in the SharePoint.
- A brief description of how to read the code with, if possible, visual aids.
For an example of one of these files, please check the following README file. If possible, use this example as a template for future projects.
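When the example README is not at hand, a bare skeleton covering the elements listed above can be generated from R. This is a generic sketch, not the DAU template: the function `write_readme_template()`, the section titles, and the placeholder text are all assumptions.

```r
# Hypothetical helper: writes a README.md skeleton to `repo_root`.
# Note that it overwrites an existing README.md file.
write_readme_template <- function(repo_root, project_name) {
  lines <- c(
    paste("#", project_name),
    "",
    "## Description",
    "Brief description of the project and the role of the DAU in it.",
    "",
    "## Contact",
    "- Project leader: <name>",
    "- Code administrator: <name>",
    "",
    "## Filing system",
    "What to find in each sub-directory.",
    "",
    "## Data",
    "- Data needed to run the code: source and process to obtain it.",
    "- Location of the data in the SharePoint.",
    "",
    "## How to read the code",
    "Brief walkthrough with, if possible, visual aids."
  )
  target <- file.path(repo_root, "README.md")
  writeLines(lines, target)
  invisible(target)
}

# Usage: a temporary directory stands in for the repository root.
readme_root <- file.path(tempdir(), "demo-readme")
dir.create(readme_root, showWarnings = FALSE, recursive = TRUE)
readme_path <- write_readme_template(readme_root, "Demo Project")
readLines(readme_path)
```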
Depending on the complexity of each project, the team can add any suitable documentation files if needed. Additionally, the DAU keeps records of significant issues encountered during each project in order to facilitate the solution process in future projects. These records are kept in the Issues Logbook of the DAU.