MRSA case study example data

educational resource

posted on 2023-03-10, 12:39 authored by John Sundh

### Dataset description

This dataset contains FASTQ files for three Illumina HiSeq runs from an
RNA-seq analysis (see Osmundson J et al., PLoS One, 2013;8(10):e76572).

This data is used as a case study for the Tools in Reproducible Research
course. We previously used `fastq-dump` from the `sra-tools` package
to download a subsampled set of sequences from the Sequence Read Archive
(SRA). However, `sra-tools` has recently become very unreliable due to a
certificate/security issue when downloading from the National Center for
Biotechnology Information (NCBI). We have therefore created this dataset as
an alternative starting point for the course case study.

All three files were generated on the Rackham compute cluster by installing
sra-tools (v.3.0.3) from the bioconda channel:

```
mamba create -n sra-tools -c bioconda sra-tools
conda activate sra-tools
```

then running:

```
fastq-dump SRR935090 -X 100000 --gzip -Z > SRR935090.fastq.gz
fastq-dump SRR935091 -X 100000 --gzip -Z > SRR935091.fastq.gz
fastq-dump SRR935092 -X 100000 --gzip -Z > SRR935092.fastq.gz
```

Thus, each file contains a subset of 100,000 reads from one sample (the first
100,000 spots, as selected by `-X 100000`), taken from the original data in
the SRA. The original samples contain between 76.3 and 176.6 million reads
each. The idea is to let students download these subsampled files directly or
as part of the bioinformatic workflows taught during the course.
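
As a quick sanity check after downloading, the read count of each file can be compared with the 100,000 reads stated above. The sketch below is not part of the original workflow. It assumes the three files sit in the current directory under the names used in the `fastq-dump` commands, and that each FASTQ record spans exactly four lines. `count_reads` is a helper name introduced here for illustration.

```shell
# Hypothetical helper: count reads in a gzipped FASTQ file.
# Assumes the standard four-lines-per-record layout (no wrapped sequences).
count_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Report the read count for each sample file found in the current directory.
for f in SRR935090.fastq.gz SRR935091.fastq.gz SRR935092.fastq.gz; do
    if [ -f "$f" ]; then
        echo "$f: $(count_reads "$f") reads"
    else
        echo "$f: not found" >&2
    fi
done
```

If the files match the description, each one should report 100000 reads.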