<p>### Dataset description<br>
<br>
This dataset contains FASTQ files for three Illumina HiSeq runs from an<br>
RNA-seq analysis (see Osmundson J et al., PLoS One, 2013;8(10):e76572).<br>
<br>
This data is used as a case study for the Tools in Reproducible Research<br>
course. We previously used `fastq-dump` from the `sra-tools` package<br>
to download a subsampled set of sequences from the Sequence Read Archive<br>
(SRA). However, `sra-tools` has recently become very unreliable due to a<br>
certificate/security issue when downloading from the National Center for<br>
Biotechnology Information (NCBI). We have therefore created this dataset to<br>
use as an alternative starting point for the course case study.<br>
<br>
All three files were generated on the Rackham compute cluster by installing<br>
`sra-tools` (v.3.0.3) from the bioconda channel:<br>
<br>
```<br>
mamba create -n sra-tools -c bioconda sra-tools<br>
conda activate sra-tools<br>
```<br>
<br>
then running:<br>
<br>
```<br>
fastq-dump SRR935090 -X 100000 --gzip -Z > SRR935090.fastq.gz<br>
fastq-dump SRR935091 -X 100000 --gzip -Z > SRR935091.fastq.gz<br>
fastq-dump SRR935092 -X 100000 --gzip -Z > SRR935092.fastq.gz<br>
```<br>
<br>
Thus, each file contains a subset of 100,000 reads for one sample (`-X 100000`<br>
keeps only the first 100,000 spots) taken from the original data in the SRA.<br>
The original samples contain between 76.3 and 176.6 million reads each. The<br>
idea is to let students download these subsampled files directly or as part<br>
of the bioinformatic workflows taught during the course.</p>
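<p>As a quick sanity check after downloading, the read count of each file can be compared against the expected 100,000. The following is a minimal sketch, not part of the original procedure: the helper name `count_reads` is hypothetical, and it assumes the files sit in the current directory under the names used above and follow the standard four-lines-per-record FASTQ layout.</p>

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the number of reads in a gzip-compressed FASTQ file.
# Assumes four lines per record (header, sequence, separator, qualities).
count_reads() {
    local fastq_gz="$1"
    local n_lines
    # Decompress to stdout and count the lines without writing a temporary file.
    n_lines=$(gzip -cd "$fastq_gz" | wc -l)
    echo $(( n_lines / 4 ))
}

# Report the read count for each sample file present in the current directory.
for sample in SRR935090 SRR935091 SRR935092; do
    file="${sample}.fastq.gz"
    if [ -f "$file" ]; then
        echo "${sample}: $(count_reads "$file") reads"
    else
        echo "${sample}: ${file} not found, skipping" >&2
    fi
done
```

If the files match the description above, each sample should report 100000 reads. A different count would suggest a truncated or incomplete download.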
Funding
National Bioinformatics Infrastructure Sweden (NBIS)