diff --git a/2022-03-replication/replication.html b/2022-03-replication/replication.html
new file mode 100644
index 0000000..0e95f8d
--- /dev/null
+++ b/2022-03-replication/replication.html
@@ -0,0 +1,440 @@
+ + + +
+Making your work replicable
+ + + + + + + + + + +
+
+
+

Making your work replicable

+

James Brusey

7 March 2024

+
+ +
+
+

Motivation

+
    +
  • There is a replication crisis in science today
  • +
  • There are plenty of incentives to publish more papers
  • +
  • There are few incentives to publish better papers
  • +
  • Penalties for negligence, bias, and fraud are minor even if offenders are found out
  • + +
+ + +
+

science-fictions-book.jpg +

+
+ +
+
+
+
+

Our institutions are implicated

+
    +
  • Publishers charge outrageous fees for essentially running a web / archive service
  • +
  • Universities continue to promote staff based on questionable metrics
  • +
  • Professors teach their PhD students to continue to 'game' the system
  • +
  • Reviewers reject replication studies as not 'sufficiently novel'
  • +
  • Academics collude in citation cartels to bump up each other's citation rankings
  • + +
+ +
+
+
+
+

The public are starting not to trust academics

+
    +
  • Surprisingly, academics are still respected
  • +
  • Public trust is not a given and should not be taken for granted
  • +
  • We (academics) have a responsibility to fix things +
      +
    • Let's start by improving the replicability of our research
    • + +
  • + +
+ +
+
+
+
+

Ideas for improving replicability

+

+These ideas are focused on the analysis rather than the experimental work itself. +

+ +
+
+
+
+

Please stop using Word and Excel

+
    +
  • An old Excel file format (.xls, limited to 65,536 rows) caused thousands of Covid test results to be dropped from England's reporting in 2020 +
      +
    • but why were they using Excel?
    • + +
  • +
  • An analysis of genomics research shows that many studies have fallen prey to gene names such as MARCH1 being silently converted (for example, to dates) by Excel's autocorrect
  • +
  • There are many reasons not to use Word, but the main one is that it stops you from automating parts of your research: you end up cutting and pasting figures and tables rather than generating them automatically. Convert away before it is too late.
  • + +
+ +
+
+
+
+

Use the command line and GNU Make

+
    +
  • An analysis usually ends up having several steps +
      +
    • combining multiple data-sets into one
    • +
    • cleaning up NA entries
    • +
    • removing junk entries
    • +
    • summarising data to produce a table
    • +
    • producing a graph
    • + +
  • + +
+ +
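The cleaning steps in the list above can be sketched in a few lines of Python. This is only an illustration using the standard library: the inline data, the `NA` marker, and the `junk_below` threshold are hypothetical stand-ins for whatever a real data-set needs.

```python
import csv
import io

# Hypothetical raw data: one missing value ("NA") and one junk row (-999).
RAW = """value
3.2
NA
-999
4.1
"""


def clean_rows(text, junk_below=0.0):
    """Drop NA entries, then drop rows below an assumed junk threshold."""
    rows = csv.DictReader(io.StringIO(text))
    values = [float(r["value"]) for r in rows if r["value"] not in ("", "NA")]
    return [v for v in values if v >= junk_below]


def summarise(values):
    """Return the mean, the kind of figure that would end up in a table."""
    return sum(values) / len(values)


cleaned = clean_rows(RAW)
mean = summarise(cleaned)
```

Keeping each step in its own small function (or its own script) is what makes it possible to chain the steps into a pipeline later.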
+
+
+
+

Method for using Make

+
    +
  • Each step should be performed with a command or script (e.g., gnuplot)
  • +
  • Form multiple steps into a pipeline with GNU Make
  • +
  • Alongside the many online sources, see Data Science at the Command Line https://datascienceatthecommandline.com/
  • +
  • The Python tabulate library can convert a CSV file to a LaTeX table
  • +
  • In your LaTeX file, use \input to include the generated files
  • + +
+ +
+
+
+
+

Example—generating data

+

+For example, say we have a script called gen.py that generates three data files: a.csv, b.csv, and c.csv. +

+ +
+ +
import pandas as pd
+import numpy as np
+
+SZ = (20,)  # 20 samples per file
+
+# a.csv: integers drawn uniformly from 0..9
+df = pd.DataFrame(np.random.randint(0, 10, size=SZ), columns=["value"])
+df.to_csv("a.csv", index=False)
+# b.csv: standard normal samples
+df = pd.DataFrame(np.random.normal(0, 1, size=SZ), columns=["value"])
+df.to_csv("b.csv", index=False)
+# c.csv: normal samples with mean 5 and standard deviation 3
+df = pd.DataFrame(np.random.normal(5, 3, size=SZ), columns=["value"])
+df.to_csv("c.csv", index=False)
+
+
+ +
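One caveat with generated data: an unseeded random generator gives different files on every run, which works against replicability. A minimal sketch of the usual remedy follows, using only the standard library. The function name `draw_values` and the seed 2022 are hypothetical; with NumPy, the analogous step would be to create a generator with `np.random.default_rng(seed)`.

```python
import random


def draw_values(seed, n=20):
    """Return n pseudo-random integers in 0..9 from a privately seeded generator."""
    rng = random.Random(seed)  # same seed, same sequence, on every run
    return [rng.randint(0, 9) for _ in range(n)]


first = draw_values(seed=2022)
second = draw_values(seed=2022)  # identical to `first`
```

Recording the seed alongside the script means anyone re-running the pipeline gets the same data.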
+
+
+
+

Example—combine data

+

+We might then have another script comb.py to combine them. +

+
+ +
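import pandas as pd
+
+# Collect the "value" column of each input file under that file's name
+newframe = {}
+for f in ["a", "b", "c"]:
+    newframe[f] = pd.read_csv(f"{f}.csv")["value"]
+
+# One column per input file, written side by side
+df = pd.DataFrame(newframe)
+df.to_csv("all.csv", index=False)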
+
+
+
+
+
+
+

Example—table

+

+We can produce a table using the Python tabulate library in a script called maketable.py. +

+
+ +
import pandas as pd
+import tabulate
+
+df = pd.read_csv("all.csv")
+# Reshape to long form, then compute mean and std for each original column
+result = pd.melt(df).groupby("variable").agg(["mean", "std"])
+
+# Write the summary as a LaTeX tabular, ready for \input{result}
+with open("result.tex", "w") as f:
+    print(
+        tabulate.tabulate(result, tablefmt="latex",
+                          headers=["Class", "mean", "std"]),
+        file=f,
+    )
+
+
+ + +
+
+
+
+

Example—graph

+

+Next, we might use graph.py to plot a against b (admittedly, this is not a very meaningful graph). +

+
+ +
import pandas as pd
+import matplotlib.pyplot as plt
+
+df = pd.read_csv("all.csv")
+# Scatter plot of column a against column b
+df.plot(x="a", y="b", kind="scatter")
+plt.savefig("graph.png")
+
+
+ + +
+

graph.png +

+
+ +
+
+
+
+

Example—LaTeX doc

+

+Naturally, we need a LaTeX document: +

+
+ +
\documentclass{article}
+\usepackage{siunitx}
+\usepackage{graphicx}
+\title{My great article}
+\author{James Brusey}
+\begin{document}
+\maketitle
+\section{Introduction}
+Blah blah blah.
+\section{Results}
+\input{result}
+\begin{figure}
+  \includegraphics{graph.png}
+  \caption{A scatter plot}
+\end{figure}
+\end{document}
+
+
+ +
+
+
+
+

Example—Makefile

+

+Finally, we tie everything together with a Makefile +

+
+ +
article.pdf: article.tex result.tex graph.png
+        pdflatex article.tex
+
+graph.png: all.csv graph.py
+        python graph.py
+
+result.tex: all.csv maketable.py
+        python maketable.py
+
+all.csv: a.csv b.csv c.csv comb.py
+        python comb.py
+
+# gen.py writes all three data files, so list them all as targets
+a.csv b.csv c.csv: gen.py
+        python gen.py
+
+
+ + +
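The rule Make applies to each target in a Makefile like this one is simple: rebuild if the target is missing or if any prerequisite has a newer modification time. The following Python sketch illustrates that rule only; it is not how Make is implemented, and the function name `needs_rebuild` and the temporary files are hypothetical.

```python
import os
import tempfile


def needs_rebuild(target, prerequisites):
    """Return True if target is missing or older than any prerequisite.

    Unlike Make, this sketch assumes every prerequisite already exists.
    """
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_time for p in prerequisites)


with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "gen.py")
    out = os.path.join(d, "a.csv")
    open(src, "w").close()
    before = needs_rebuild(out, [src])      # target does not exist yet
    open(out, "w").close()
    os.utime(src, (1, 1))                   # set explicit timestamps:
    os.utime(out, (2, 2))                   # target newer than its source
    fresh = needs_rebuild(out, [src])
    os.utime(src, (3, 3))                   # source edited after the build
    after_edit = needs_rebuild(out, [src])
```

This is why editing only graph.py causes just the graph and the PDF to be regenerated, while the data files and table are left alone.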
+
+
+
+

Using RStudio

+
    +
  • RStudio allows you to put all the steps into a notebook form
  • +
  • The result can be exported to a LaTeX document
  • +
  • Best for R but difficult to format for a paper
  • +
  • A great resource for R and the tidyverse is R for Data Science https://r4ds.had.co.nz/
  • +
  • You can also use Pandoc separately from RStudio
  • + +
+
+
+
+
+

RStudio example

+
+ +
---
+title: "Example rmarkdown document"
+date: "24/03/2022"
+author:
+- James Brusey
+- Ann Other Author
+documentclass: scrartcl
+classoption: twoside
+geometry: false
+subtitle: false
+output:
+  pdf_document:
+    includes:
+      in_header: header.tex
+---
+
+# Introduction
+
+This is a sample markdown document. 
+
+I can make a new paragraph using a blank line and a numbered list just with:
+
+1. this item
+2. this other item
+3. and so forth
+
+Math symbols are also easy either inline $K: \Re \times \Re \rightarrow \{0, 1\}$ or as display math,
+$$ x = \int _{0} ^{\infty} \frac{1}{y^2}. $$
+
+In addition, you can use the power of R to process your data-set and display results as tables or graphs. 
+
+
+```{r cars, warning=FALSE}
+library(knitr)
+kable(summary(cars))
+```
+
+## Including Plots
+
+You can also embed plots, for example:
+
+```{r pressure, echo=FALSE}
+plot(pressure)
+```
+
+Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+## R Markdown
+
+This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
+
+When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.
+
+
+
+
+
+
+

Use Jupyter Notebook

+
    +
  • Jupyter Notebook supports Python and several other languages
  • +
  • As with RStudio, it can produce LaTeX by combining code, graphs, and markdown
  • +
  • There are no easy options for changing the document class, so it is not really a viable option for writing papers
  • + +
+ +
+
+
+
+

Use Emacs org-mode

+
    +
  • Org mode is a powerful editing environment that comes with Emacs
  • +
  • Org mode documents are similar to R Markdown (or pandoc) documents, with easy formatting instructions
  • +
  • More flexible than R Markdown
  • +
  • Easy to change document class
  • +
  • Can include many different programming languages within the one document
  • + +
+
+
+
+
+

Using a different LaTeX class

+
    +
  1. Your Org mode document will need #+latex_class: IEEEtran
  2. You'll need to configure Emacs using something like:
+
+ +
(add-to-list 'org-latex-classes
+             '("IEEEtran"
+               "\\documentclass[10pt]{IEEEtran}"
+               ("\\section{%s}" . "\\section*{%s}")
+               ("\\subsection{%s}" . "\\subsection*{%s}")
+               ("\\subsubsection{%s}" . "\\subsubsection*{%s}")
+               ("\\paragraph{%s}" . "\\paragraph*{%s}")
+               ))
+
+
+
+ +
+
+
+
+

Further reading

+
    +
  1. I thoroughly recommend Science Fictions (Ritchie, 2020).
  2. John Kitchin has a nice article on embedding data into PDFs (Kitchin, 2015).
  3. He also has a YouTube video describing Org mode for research: https://youtu.be/1-dUkyn_fZA
+
+
+
+
+ + + + + + + + +
diff --git a/2022-03-replication/replication.org b/2022-03-replication/replication.org
index bfceaf0..f288d73 100644
--- a/2022-03-replication/replication.org
+++ b/2022-03-replication/replication.org
@@ -1,16 +1,16 @@
+:REVEAL_PROPERTIES:
+#+REVEAL_ROOT: https://cdn.jsdelivr.net/npm/reveal.js
+#+REVEAL_REVEAL_JS_VERSION: 4
+#+REVEAL_THEME: white
+#+options: timestamp:nil toc:nil num:nil
+#+REVEAL_INIT_OPTIONS: width:1200, height:1000, margin: 0.1, minScale:0.2, maxScale:2.5, transition:'cube', slideNumber:true
+#+REVEAL_HEAD_PREAMBLE:
+:END:
 #+title: Making your work replicable
-#+date: 25 March 2022
+#+date: 7 March 2024
 #+property: header-args:ipython :session session1 :results output raw drawer :exports both
-#+options: toc:nil H:1
-#+startup: beamer
-#+latex_class: beamer
-#+latex_class_options:
-#+beamer_theme: Boadilla
-#+latex_header: \usepackage{natbib}
-#+description:
 #+keywords:
 #+subtitle:
-#+latex_compiler: pdflatex
 * Motivation
 + There is a replication crisis in science today
@@ -104,7 +104,8 @@ result = pd.melt(df).groupby("variable").agg(["mean", "std"])
 with open("result.tex", "w") as f:
     print(
-        tabulate.tabulate(result, tablefmt="latex", headers=["Class", "mean", "std"]),
+        tabulate.tabulate(result, tablefmt="latex",
+                          headers=["Class", "mean", "std"]),
         file=f,
     )
 #+END_SRC
@@ -200,8 +201,6 @@ a.csv: gen.py
                ))
 #+END_SRC
-* Formatting numbers
-
 * Further reading
@@ -211,7 +210,7 @@ a.csv: gen.py
-bibliographystyle:plainnat
-bibliography:~/Documents/zotero-export.bib
 * build :noexport:
 [[elisp:(org-beamer-export-to-pdf)]]
+bibliographystyle:plainnat
+bibliography:~/Documents/zotero-export.bib