SBDI Sativa curated 16S GTDB database

dataset

Posted on 2025-04-29, 08:59. Authored by Daniel Lundin, Anders Andersson

The data in this [repository](https://doi.org/10.17044/scilifelab.14869077) is the result of vetting 16S sequences from the Genome Taxonomy Database (GTDB) release R10-RS226 (r226) (https://gtdb.ecogenomic.org/; Parks et al. 2018) with the Sativa program (Kozlov et al. 2016) using the [sbdi-phylomarkercheck](https://github.com/biodiversitydata-se/sbdi-phylomarkercheck) Nextflow pipeline.

Using Sativa [Kozlov et al. 2016], 16S sequences from GTDB were checked so that their phylogenetic signal is consistent with their taxonomy.

Before calling Sativa, sequences longer than 2000 nucleotides or containing Ns were removed, and the reverse complement of each sequence was calculated. Subsequently, sequences were aligned with HMMER [Eddy 2011] using the Barrnap [https://github.com/tseemann/barrnap] archaeal and bacterial 16S profiles, respectively, and sequences containing more than 10% gaps were removed. The remaining sequences were analyzed with Sativa, and sequences that were not phylogenetically consistent with their taxonomy were removed.
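The pre-Sativa filters described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used by sbdi-phylomarkercheck: the function names are hypothetical, and counting only `-` characters as gaps in the aligned sequence is an assumption.

```python
# Illustrative sketch of the pre-Sativa filters; names are hypothetical
# and not taken from the sbdi-phylomarkercheck pipeline.

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def passes_prefilter(seq: str, max_len: int = 2000) -> bool:
    """Keep sequences of at most max_len nucleotides that contain no Ns."""
    return len(seq) <= max_len and "N" not in seq.upper()


def gap_fraction(aligned: str, gap_char: str = "-") -> float:
    """Fraction of alignment columns that are gaps (assumes '-' marks a gap)."""
    if not aligned:
        return 1.0
    return aligned.count(gap_char) / len(aligned)


def passes_gap_filter(aligned: str, max_gap_fraction: float = 0.10) -> bool:
    """Keep aligned sequences with at most 10% gaps."""
    return gap_fraction(aligned) <= max_gap_fraction
```

A sequence would be kept for the Sativa step only if it passes both `passes_prefilter` (before alignment) and `passes_gap_filter` (after alignment).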

Files for the DADA2 (Callahan et al. 2016) methods `assignTaxonomy` and `addSpecies` are available, in three different versions each. The `assignTaxonomy` files contain taxonomy for domain, phylum, class, order, family, genus and species. (Note that it has been proposed that species assignment for short 16S sequences requires 100% identity (Edgar 2018), so use species assignments from `assignTaxonomy` with caution.) The versions differ in the maximum number of genomes that we included per species: 1, 5 or 20, indicated by "1genome", "5genomes" and "20genomes" in the file names, respectively. Using the version with 20 genomes per species should increase the chance that the `addSpecies` algorithm finds an exactly matching sequence, whereas using a file with many genomes per species could bias the taxonomic annotations that `assignTaxonomy` makes at higher levels. Our recommendation is hence to use the "1genome" files for `assignTaxonomy` and the "20genomes" files for `addSpecies`.
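To illustrate what a per-species cap means, here is a minimal Python sketch. It assumes the input records are already in the desired priority order and simply keeps the first ones for each species; how the actual pipeline ranks genomes within a species is not described here, and the function name is hypothetical.

```python
from collections import defaultdict


def cap_genomes_per_species(records, max_genomes):
    """Keep at most max_genomes entries per species.

    records: iterable of (species, genome_id) tuples, assumed to be
    sorted so that preferred genomes come first (an assumption of this
    sketch, not a documented property of the dataset).
    """
    counts = defaultdict(int)
    kept = []
    for species, genome_id in records:
        if counts[species] < max_genomes:
            counts[species] += 1
            kept.append((species, genome_id))
    return kept
```

Because the cap is a strict count, a species never contributes more than `max_genomes` entries, even when several genomes would rank equally.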

The fasta files are gzipped and contain 16S sequences. In the `assignTaxonomy` files, each sequence is associated with a taxonomy hierarchy from domain to species, whereas the `addSpecies` files have sequence identifiers and species names. There is also a fasta file with the original GTDB sequence names: sbdi-gtdb-sativa.r09rs220.20genomes.fna.gz.
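As a rough illustration of the two header layouts, the sketch below parses them in Python. It assumes the usual DADA2 conventions: a semicolon-separated rank list for `assignTaxonomy` and an identifier followed by the species name for `addSpecies`. The function names and the example headers are hypothetical, so check a few headers in the actual files before relying on it.

```python
RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")


def parse_assign_taxonomy_header(header: str) -> dict:
    """Map a '>Rank1;Rank2;...;' header to a rank -> name dictionary."""
    fields = header.lstrip(">").strip().rstrip(";").split(";")
    return dict(zip(RANKS, fields))


def parse_add_species_header(header: str) -> tuple:
    """Split a '>ID Genus species' header into (ID, species name)."""
    seq_id, _, name = header.lstrip(">").strip().partition(" ")
    return seq_id, name


# Hypothetical example headers, for illustration only.
tax = parse_assign_taxonomy_header(
    ">Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;"
    "Enterobacteriaceae;Escherichia;Escherichia coli;"
)
seq_id, species = parse_add_species_header(">EXAMPLE_ID_0001 Escherichia coli")
```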

Taxonomic annotation of 16S amplicons using this data is available in the nf-core/ampliseq Nextflow workflow through the optional argument `--dada_ref_taxonomy sbdi-gtdb` (https://nf-co.re/ampliseq; Straub et al. 2020).

In addition to the fasta files, the workflow outputs phylogenetic trees by optimizing the branch lengths of the original phylogenomic GTDB trees based on a 16S sequence alignment. As not all species in GTDB have correct 16S sequences, the GTDB trees are first subset to contain only species for which the species representative genome has a correct 16S sequence. Subsequently, branch lengths for the tree are optimized based on the original alignment of 16S sequences using IQ-TREE [Nguyen et al. 2015] with a GTR+F+I+G4 model. The alignment files end with .alnfna, the taxonomy files with .taxonomy.tsv and the tree files (newick-formatted) with .brlenopt.newick. They will be made available in nf-core/ampliseq for phylogenetic placement.

The data will be updated approximately once a year, after each GTDB update.

Version history

v10 (2025-04-30): Update versions in this text
v9 (2025-04-29): Update to GTDB R10-RS226
v8 (2025-02-18): Remove extra sequences from e.g. "1genome" files that appeared due to ties.
v7 (2024-06-25): Update to GTDB R09-RS220 from R08-RS214.
v6 (2024-04-24): Replace manual procedure with Nextflow pipeline. Update to GTDB R08-RS214 from R07-RS207.
v5 (2022-10-07): Add missing fasta file with original GTDB names.
v4 (2022-08-31): Update to GTDB R07-RS207 from R06-RS202

Acknowledgements

The computations were enabled by resources in projects NAISS 2023/22-601, SNIC 2022/22-500 and SNIC 2021/22-263 provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) at UPPMAX, funded by the Swedish Research Council through grant agreement no. 2022-06725.

Computations were also enabled by resources provided by Dr. Maria Vila-Costa, Institute of Environmental Assessment and Water Research (IDAEA-CSIC), Barcelona.