Research Project

The final project will a short analysis in R markdown that implements a Bayesian model using Stan to answer a research question of the student’s choosing.

Proposal

Fork https://github.com/UW-CSSS-564/2018-research-project.
Edit the README.md. Add your name, correct the title, and make any other appropriate customizations.
Write your proposal in the document doc/proposal.Rmd.
Submit the assignment by opening a pull request titled, “Review Proposal”.

The project proposal should be brief. The intent of the proposal is two-fold:

Encourage the student to think about the project with enough time to implement it.
Receive feedback early in order to avoid projects that will be unfeasible in the context of this course.

The proposal should consist of a paragraph or two describing.

State research question
Describe data that will be used to answer that question.
Describe the model to be estimated to the best of your current understanding.
Forecast anticipated problems with the project: data availability, data size, model complexity, etc.

Other notes on this proposal:

You should have data on hand. There is not enough time to collect new data before this is due.
If you will be using non-public or large (> 100 MB) datasets, let the instructors know. Using non-public data is acceptable for the project. However, the instructors will need to work with you to ensure that your workflow keeps that data non-public.
Replications are acceptable. However, you must do some original work when replicating a paper.

Draft

This is a draft of the final research project. This will consistent of whatever the student is able to produce at the time of the due date. We will spend the last week working on these projects so the intent of having a draft prior to that week is not to have a complete draft, but to provide an opportunity for the student to document the current state of the project from which instructors and peers can provide feedback. The student should write up whatever they have done, and their plans, in such a way to provide others the best possibility to provide feedback.

Peer Review

After the drafts are submitted, each student will be assigned one other student to peer review and provide feedback. Time will be allocated in class during the last week in order to do this.

Reading another researcher’s code is a skill, and it can be an eye-opening experience. Peer review is also an important part of the academic process, and being a good reviewer takes practice.

Instructions

These reviews will be submitted as comments on the pull request. The comments should address both substantive and programming issues.

Evaluation

Each peer review will be deemed “good” or “needs more”.

How to give good comments:

Address portions of the code and analysis that were incorrect or unclear. Also address parts that were done particularly well.
Give thoughtful, constructive and considerate comments.
Be specific and concise.
Try to learn something new and, if you succeed, point that out.
If you can’t find anything to praise or that you found helpful, then at least offer some suggestions in a kind way.

Features of “needs more”:

You do not address all parts of the analysis, both computational and substantive.
Your review is mean.
You can’t find anything to praise/learn and yet you don’t offer any suggestions either.

Final Research Project

Submission of the final research project. See Instructions.

Instructions

The final project should produce a computational narrative in R markdown which answers a relevant research problem using methods introduced in this course. A [computational narrative] is a document which combines prose, code, and visualizations to explain and communicate scientific results. Unlike a traditional article, the code and data producing the results transparently presented in the document itself. The Stan Case Studies provide good exemplars of this style.

This course asks requires a computational essay instead of an article in order to focus the students’ attention on the methodology, computation, and reproducibility issues which this course should promote.

The submission will consist of an R markdown document compiled into either a PDF or HTML file (if the document contains interactive material) and a repository containing the data and code necessary to reproduce the analysis. The details of this process are described in the final project details.
The computational narrative should be no longer than a research note in a political science journal, 5,000 words of text (excluding code).
The computational narrative should be self contained and written for a general social science audience.
The computational narrative should effectively and informatively communicate its research design, methods, and contribution using text, code, and visualizations. The computational narrative should include no unnecessary material.
The computational narrative must address an research problem or question using the appropriate methodology and design incorporating methods introduced and discussed in this course.
As with any project worked on during graduate school, the author should use it to further research that could lead to a publication. However, given the time constraints of the quarter, it is recognized that it may not be possible to develop a novel contribution to a body of knowledge. Thus, projects will be primarily assessed on the appropriate application of methodology while taking the question as given.
Replications of existing papers are acceptable provided that the author conducts their own analysis.
It is acceptable to use this work in other courses.
It is possible to collaborate on this assignment, but it requires pre-approval of the instructor.
The computational narrative should not include a lengthy discussion of previous literature. But if it makes an original contribution it should clearly identify a gap in this literature and state the original contribution.
Citations should use the citation capabilities in R markdown; see the R markdown documentation.
Projects will not be judged on their results. They will be judged on the appropriateness of Bayesian methods used to answer the question.
The use of visualization to convey results is preferable to tables or the printed output of code.
Equations and formulas are important for the presentation of statistical arguments. Authors should make the mathematical presentation as clear as possible. Clear and consistent notation and formatting of equations should be used. All symbols used in equations need to be clearly defined. To ensure readability of the paper, authors should choose a notation that makes the argument as easy to follow as possible. Equations are part of the text and thus they should contain appropriate punctuation. Equations should be numbered consecutively, with sub-numbering used as appropriate, e.g. equations 1a and 1b.
Mathematical notation should use the appropriate R markdown math capabilities. See this for an introduction.
To the extent possible, variable names should be readable and clearly denote what the variable is.
Provide clear a descriptions of the data and the context necessary to understand the data. A non-expert reader should be able to understand the meaning and scale of the relevant variables and important features of the analysis.

Advice

As in any graduate courses, strive to produce analysis that could lead to a publishable article.¹
The main focus of the computational narrative should be on data, methods, and results. Justify your modeling choices with reference to theory. Present findings in terms any intelligent person could understand, regardless of their statistical knowledge. This should not limit the sophistication of your methods. It does require you to explain results from complicated models in approachable terms.
Do not spend too much time on literature reviews or theory, but do not neglect hypothesis building. Hypotheses can be clearly explicated without recourse to numbered lists. The the time the reader reaches the results, they should know what to expect, what would be surprising, and why.
Research that ask substantively important, interesting, novel, or controversial questions are better—potentially much better—than research that do not, all else equal.
Papers that explain their empirical findings in ways non-specialists can understand are better than papers that do not, all else equal.
Model specifications informed by test statistics, substantive knowledge and theory are better than model specifications based solely on test statistics, all else equal.

Reproducibility

Code must run without error and successfully replicate.
The repository must contain a README.md file which describes the contents of the repositories, software and data dependencies and installation instructions, and instructions on how to reproduce the analysis.
If the analysis requires data too large to include in the repository, instructions on how to obtain the necessary must be included. If possible, a script to automate the process is preferred.
If the analysis uses data that cannot be distributed for privacy or intellectual property reasons, it must be documented. The student must discuss with the instructor prior to submission.

Organization

The repository must be organized as an R project.
To the extent possible, good scientific practices should be followed, as outlined in [Good enough practices in scientific computing] (http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510).
Do not use of absolute paths in the code. See this post for a discussion of this. Use the here package to
Do not use install.packages in the computational narrative. Instructions on installing dependencies should go in the README.md or a separate script.
Include a code chunk at the top of the R markdown document which loads the packages which will be used.
Include a code chunk at the end of the R markdown document which includes information from sessionInfo() to document the original computational environment.
Do not include unnecessary output (e.g. extraneous print statement) or messages. Each piece of code and output in the computational narrative should serve some purpose toward communicate the research.

Computational Narratives

A computational narrative is a document that combines text, code, and code output to communicate a data analysis or scientific research. R markdown is one program that can be used to produce computational narratives. Even if students are familiar with statistics and programming, they may be less familiar with computational narratives than with the format of journal articles and books.

The following references provide some discussion of computational narratives. Many of these references Jupyter, a similar, Python-based tool to produce notebooks.

Since students may not have seen computational narratives, the following examples are provided. Most of them are produced by R markdown. Some are produced by the Python program Jupyter. These are not meant to be definitive guides for how to structure your computational narrative. Many of these were produced for different purposes than the project for this course. However, they do provide examples of effectively combining text, code, and output to communicate data analysis.

Stan Case Studies are generally well formatted examples of this style.
Getting Started with GDELT
Brian Keegan, The Need for Openness in Data Journalism (python/Jupyter)
David Robinson, Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half
The Case Studies in the tidytext R package.
Words Growing or Shrinking in Hacker News Titles: A Tidy Analysis
Gender and Verbs across 10,000 stories: a tidy analysis
Brookman, David, and Kalla, Joshua, and Aronow, Peter. 2014. Irregularities in LaCour (2014) This is an unusual document that provides evidence of research fraud in the LaCour and Green paper. However, note how it uses knitr and R to do so. The paper is making and argument, with points supported by figures and results, with the code that produced them also visible.
Mapping the GDELT data and some Russian protests too

This advice is derived with minimal editing from Chris Adolph Writing Empirical Papers 6 Rules & 12 Recommendations.↩