SciLifeLab
Browse

SBDI Sativa curated 16S GTDB database

Version 7 2024-06-28, 05:53
Version 6 2024-05-22, 10:50
Version 5 2022-10-14, 15:21
Version 4 2022-08-31, 15:22
Version 3 2021-11-10, 09:09
Version 2 2021-10-25, 13:56
Version 1 2021-08-05, 05:17
dataset
posted on 2024-06-28, 05:53 authored by Daniel LundinDaniel Lundin, Anders AnderssonAnders Andersson

The data in this [repository](https://doi.org/10.17044/scilifelab.14869077) is the result of vetting 16S sequences from the GTDB database release R09RS220 (r220) (https://gtdb.ecogenomic.org/; Parks et al. 2018) with the Sativa program (Kozlov et al. 2016) using the [sbdi-phylomarkercheck](https://github.com/biodiversitydata-se/sbdi-phylomarkercheck) Nextflow pipeline.

Using Sativa [Kozlov et al. 2016], 16S sequences from GTDB were checked so that their phylogenetic signal is consistent with their taxonomy.

Before calling Sativa, sequences longer than 2000 nucleotides or containing Ns were removed, and the reverse complement of each is calculated. Subsequently, sequences were aligned with HMMER [Eddy 2011] using the Barrnap [https://github.com/tseemann/barrnap] archaeal and bacterial 16S profiles respectively, and sequences containing more than 10% gaps were removed. The remaining sequences were analyzed with Sativa, and sequences that were not phylogenetically consistent with their taxonomy were removed.

Files for the DADA2 (Callahan et al. 2016) methods `assignTaxonomy` and `addSpecies` are available, in three different versions each. The `assignTaxonomy` files contain taxonomy for domain, phylum, class, order, family, genus and species. (Note that it has been proposed that species assignment for short 16S sequences require 100% identity (Edgar 2018), so use species assignments from `assignTaxonomy` with caution.) The versions differ in the maximum number of genomes that we included per species: 1, 5 or 20, indicated by "1genome", "5genomes" and "20genomes" in the file names respectively. Using the version with 20 genomes per species should increase the chances to identify an exactly matching sequence by the `addSpecies` algorithm, while using a file with many genomes per species could potentially give biases in the taxonomic annotations at higher levels by `assignTaxonomy`. Our recommendation is hence to use the "1genome" files for `assignTaxonomy` and "20genomes" for `addSpecies`.


All files are gzipped fasta files with 16S sequences, the assignTaxonomy associated with taxonomy hierarchies from domain to species whereas the `addSpecies` file have sequence identities and species names. There is also a fasta files with the original GTDB sequence names: sbdi-gtdb-sativa.r09rs220.20genomes.fna.gz.

Taxonomical annotation of 16S amplicons using this data is available as an optional argument to the nf-core/ampliseq Nextflow workflow from version 2.1: --dada_ref_taxonomy sbdi-gtdb (https://nf-co.re/ampliseq; Straub et al. 2020).

In addition to the fasta files, the workflow outputs phylogenetic trees by optimizing branch-lengths of the original phylogenomic GTDB trees based on a 16S sequence alignment. As not all species in GTDB will have correct 16S sequences, the GTDB trees are first subset to contain only species for which the species representative genome has a correct 16S sequence. Subsequently, branch lengths for the tree are optimized based on the original alignment of 16S sequences using IQTREE [Nguyen et al. 2015] with a GTR+F+I+G4 model. The alignment files end with .alnfna, the taxonomy files with .taxonomy.tsv and the tree files (newick-formatted) end with .brlenopt.newick. They will be made available in nf-core/ampliseq for phylogenetic placement.

The data will be updated circa yearly, after the GTDB database is updated.

Version history


  • v7 (2024-06-25): Update to GTDB R09-RS220 from R08-RS214.
  • v6 (2024-04-24): Replace manual procedure with Nextflow pipeline. Update to GTDB R08-RS214 from R07-RS207.
  • v5 (2022-10-07): Add missing fasta file with original GTDB names.
  • v4 (2022-08-31): Update to GTDB R07-RS207 from R06-RS202

Acknowledgements


The computations were enabled by resources in project [NAISS 2023/22-601, SNIC 2022/22-500 and SNIC 2021/22-263] provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) at UPPMAX, funded by the Swedish Research Council through grant agreement no. 2022-06725.

Computations were also enabled by resources provided by Dr. Maria Vila-Costa, Institute of Environmental Assessment and Water Research (IDAEA-CSIC), Barcelona.


Funding

Swedish Biodiversity Data Infrastructure

Swedish Research Council

Find out more...

History

Publisher

Swedish Biodiversity Data Infrastructure (SBDI)

Usage metrics

    Swedish Biodiversity Data Infrastructure (SBDI)

    Licence

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC