11 files

References and test datasets for the Cactus pipeline

posted on 25.11.2022, 14:10 authored by Jerome Salignon, Lluis Milan Arino, Maxime Garcia, Christian Riedel


This item contains references and test datasets for the Cactus pipeline.

Cactus (Chromatin ACcessibility and Transcriptomics Unification Software) is an mRNA-Seq and ATAC-Seq analysis pipeline that aims to provide advanced molecular insights into the conditions under study.

Test datasets

The test datasets contain all data needed to run Cactus in each of the 4 supported organisms. This includes ATAC-Seq and mRNA-Seq data (*.fastq.gz), parameter files (*.yml) and design files (*.tsv). They were created for each species by downloading publicly available datasets with fetchngs (Ewels et al., 2020) and subsampling reads to the minimum number required to obtain enough DAE (Differential Abundant Entries) for enrichment analysis.

Datasets downloaded:

- Worm and human: GSE98758

- Fly: GSE149339

- Mouse: GSE193393
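The subsampling step described above can be illustrated with a short Python sketch. This is not the code used to build the test datasets: the function names `read_fastq` and `subsample_fastq`, the seed, and the choice of reservoir sampling are assumptions made for illustration only.

```python
import gzip
import random
from typing import Iterator, List, Tuple

# A FASTQ record is four text lines: header, sequence, separator, qualities.
FastqRecord = Tuple[str, str, str, str]


def read_fastq(path: str) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records from a gzip-compressed file."""
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:  # end of file
                break
            seq = handle.readline()
            plus = handle.readline()
            qual = handle.readline()
            yield (header, seq, plus, qual)


def subsample_fastq(in_path: str, out_path: str, n_reads: int, seed: int = 42) -> int:
    """Keep a uniform random sample of at most n_reads records.

    Uses reservoir sampling (hypothetical choice), so the input is read
    once and only n_reads records are held in memory.
    Returns the number of records written.
    """
    rng = random.Random(seed)
    reservoir: List[FastqRecord] = []
    for i, record in enumerate(read_fastq(in_path)):
        if i < n_reads:
            reservoir.append(record)
        else:
            # Replace a kept record with probability n_reads / (i + 1).
            j = rng.randint(0, i)
            if j < n_reads:
                reservoir[j] = record
    with gzip.open(out_path, "wt") as out:
        for record in reservoir:
            out.writelines(record)
    return len(reservoir)
```

Because the random draws depend only on the seed and the record index, running the function with the same seed on both mate files of a paired-end library (with equal record counts) keeps the same read pairs.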


References

One of the goals of Cactus is to make the analysis as simple and fast as possible for the user while providing detailed insights into molecular mechanisms. To this end, all needed references were pre-parsed for the 4 ENCODE (Dunham et al., 2012; Stamatoyannopoulos et al., 2012; Luo et al., 2020) and modENCODE (The modENCODE Consortium et al., 2010; Gerstein et al., 2010) organisms (H. sapiens, M. musculus, D. melanogaster and C. elegans). This parsing step was done with a Nextflow pipeline in which most tools are encapsulated within containers, which improves efficiency and reproducibility and allows the creation of customized references.

Genomic sequences and annotations were downloaded from the Ensembl FTP. The ENCODE API was used to download 2,714 TF ChIP-Seq profiles (Landt et al., 2012; Boyle et al., 2014) as well as 899 ChromHMM and 6 HiHMM chromatin state profiles (Boix et al., 2021; van der Velde et al., 2021; Ho et al., 2014). Slim annotations (cell, organ, development, and system) were parsed and used to group ChIP-Seq profiles that share the same annotations, so that users can analyze only the relevant ChIP-Seq profiles. 2,779 TF motifs were obtained from the Cis-BP database (Lambert et al., 2019). GO terms and KEGG pathways were obtained via the R packages AnnotationHub (Morgan and Shepherd, 2021) and clusterProfiler (Yu et al., 2012; Wu et al., 2021), respectively.
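The grouping of ChIP-Seq profiles by slim annotations can be pictured with the following minimal Python sketch. It is an illustration only and not the Cactus reference-parsing code: the record layout, the field names in `SLIM_FIELDS`, the function `group_profiles_by_slim` and the profile identifiers are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical field names for the four slim annotation types.
SLIM_FIELDS = ("cell", "organ", "development", "system")


def group_profiles_by_slim(profiles: Iterable[dict]) -> Dict[Tuple[str, str], List[str]]:
    """Map each (annotation type, term) pair to the sorted IDs of the
    profiles that carry that term."""
    groups = defaultdict(set)
    for profile in profiles:
        for field in SLIM_FIELDS:
            # A profile may have zero, one or several terms per type.
            for term in profile.get(field, []):
                groups[(field, term)].add(profile["id"])
    return {key: sorted(ids) for key, ids in groups.items()}


# Made-up example records (IDs and terms are placeholders).
example_profiles = [
    {"id": "profile_A", "organ": ["liver"], "system": ["digestive system"]},
    {"id": "profile_B", "organ": ["liver"], "cell": ["hepatocyte"]},
    {"id": "profile_C", "organ": ["brain"]},
]

groups = group_profiles_by_slim(example_profiles)
liver_profiles = groups[("organ", "liver")]  # profile_A and profile_B
```

With groups of this kind, an analysis can be restricted to the profiles that match the tissue or cell type under study rather than testing all available profiles.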


More information on how to use Cactus and how references and test datasets were created is available on the documentation website: https://github.com/jsalignon/cactus.



Karolinska Institutet