SBDI Sativa curated 16S GTDB database

dataset

posted on 2022-10-14, 15:21 authored by Daniel Lundin, Anders Andersson

The data in this repository is the result of vetting 16S sequences from the GTDB database release R07-RS207 (https://gtdb.ecogenomic.org/; Parks et al. 2018) with the Sativa program (Kozlov et al. 2016).

Files for the DADA2 (Callahan et al. 2016) methods `assignTaxonomy` and `addSpecies` are available, in three versions each. The `assignTaxonomy` files contain taxonomy for domain, phylum, class, order, family, genus and species. (Note that it has been proposed that species assignment for short 16S sequences requires 100% identity (Edgar 2018), so use species assignments from `assignTaxonomy` with caution.) The versions differ in the maximum number of genomes that we included per species: 1, 5 or 20, indicated by "1genome", "5genomes" and "20genomes" in the file names respectively. Using the version with 20 genomes per species should increase the chance that the `addSpecies` algorithm finds an exactly matching sequence, whereas using a file with many genomes per species could potentially bias the taxonomic annotations that `assignTaxonomy` makes at higher levels.
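The effect of the number of genomes per species on exact matching can be illustrated with a minimal Python sketch. This is not DADA2's implementation: the function `exact_species_hits`, the toy sequences and the rule "the query must occur verbatim inside a reference sequence" are assumptions made only for illustration.

```python
def exact_species_hits(query, references):
    """Return the sorted species names whose reference contains the query verbatim.

    `references` is a list of (species name, sequence) pairs (hypothetical layout).
    """
    query = query.upper()
    return sorted({species for species, seq in references if query in seq.upper()})


# Hypothetical toy references: one genome per species versus several.
refs_1genome = [("Escherichia coli", "ACGTACGTTAGC")]
refs_20genomes = refs_1genome + [("Escherichia coli", "ACGTACCTTAGC")]

# An amplicon that differs by one base from the single-genome reference.
query = "GTACCTTA"

print(exact_species_hits(query, refs_1genome))    # []
print(exact_species_hits(query, refs_20genomes))  # ['Escherichia coli']
```

With a single reference per species the one-base variant goes unmatched, while the larger reference set happens to contain it. This is the reason the 20-genome file is expected to help species-level matching.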

There is also a fasta file with the original GTDB sequence names: gtdb-sbdi-sativa.r07rs207.20genomes.fna.gz.

All files are gzipped fasta files with 16S sequences. In the `assignTaxonomy` files, each sequence is associated with its taxonomy hierarchy from domain to species, whereas the `addSpecies` files have sequence identities and species names.
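As a rough sketch of how the two file types could be inspected outside DADA2, the Python below reads a gzipped fasta file and splits its headers. The header layouts are assumptions based on the usual DADA2 conventions (semicolon-separated ranks for `assignTaxonomy`, `<id> <species name>` for `addSpecies`) and should be checked against the actual files. All function names and the example headers are hypothetical.

```python
import gzip

RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def read_fasta(path):
    """Yield (header, sequence) pairs from a gzipped fasta file."""
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_taxonomy_header(header):
    """Map a semicolon-separated header onto rank names (assumed layout)."""
    names = [name for name in header.split(";") if name]
    return dict(zip(RANKS, names))


def parse_species_header(header):
    """Split '<id> <species name>' into its two parts (assumed layout)."""
    seq_id, _, species = header.partition(" ")
    return seq_id, species


# Hypothetical headers, for illustration only.
tax = parse_taxonomy_header(
    "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;"
    "Enterobacteriaceae;Escherichia;Escherichia coli"
)
print(tax["genus"])
print(parse_species_header("seq0001 Escherichia coli"))
```

`read_fasta` can be pointed at any of the downloaded `.gz` files, and each yielded header can then be passed to the matching parser.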

Taxonomic annotation of 16S amplicons using this data is available as an optional argument to the nf-core/ampliseq Nextflow workflow from version 2.1: `--dada_ref_taxonomy sbdi-gtdb` (https://nf-co.re/ampliseq; Straub et al. 2020).

The data will be updated roughly once a year, after each update of the GTDB database.

Curation

After download, sequences longer than 2000 base pairs and sequences containing undetermined bases ('N') were removed. Subsequently, the sequences, as well as their reverse complements, were aligned to the archaeal and bacterial SSU profiles from Barrnap (https://github.com/tseemann/barrnap) with hmmalign from HMMER (Eddy 2011). Sequences aligning to fewer than 1000 bases of their respective profile in both the forward and the reverse-complement direction were deleted. Of the sequences passing the above filters, the longest sequence in each genome was kept. For each species, a maximum of 20 sequences was selected, giving highest priority to sequences from GTDB species-representative genomes, and then to longer sequences over shorter ones. These sequences were then analyzed with Sativa (Kozlov et al. 2016), and sequences misclassified at any level from genus to phylum were removed. From the remaining sequences, a maximum of 1, 5 or 20 sequences per species was selected for the respective file versions, as before prioritizing sequences from GTDB species-representative genomes, and longer sequences over shorter ones. A Perl script for filtering sequences before and after the Sativa analysis can be found in the scripts folder in the GitHub repo: https://github.com/biodiversitydata-se/sbdi-gtdb. Run `perl select_seq_sativa.pl --h` for documentation.
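The selection logic described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a port of `select_seq_sativa.pl`: the `Seq16S` record type, its field names and the toy data are hypothetical, and the hmmalign and Sativa steps are left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seq16S:
    seq_id: str
    genome: str
    species: str
    sequence: str
    is_representative: bool  # from a GTDB species-representative genome


def passes_basic_filters(rec, max_len=2000):
    """Keep sequences of at most max_len bases that contain no 'N'."""
    return len(rec.sequence) <= max_len and "N" not in rec.sequence.upper()


def longest_per_genome(records):
    """Keep only the longest sequence from each genome."""
    best = {}
    for rec in records:
        current = best.get(rec.genome)
        if current is None or len(rec.sequence) > len(current.sequence):
            best[rec.genome] = rec
    return list(best.values())


def select_per_species(records, max_per_species):
    """Pick up to max_per_species sequences per species:
    representative genomes first, then longer sequences before shorter."""
    by_species = {}
    for rec in records:
        by_species.setdefault(rec.species, []).append(rec)
    selected = []
    for recs in by_species.values():
        recs.sort(key=lambda r: (not r.is_representative, -len(r.sequence)))
        selected.extend(recs[:max_per_species])
    return selected


# Hypothetical toy input.
records = [
    Seq16S("s1", "g1", "sp A", "ACGT" * 300, False),  # 1200 bases
    Seq16S("s2", "g1", "sp A", "ACGT" * 350, False),  # 1400 bases, longest in g1
    Seq16S("s3", "g2", "sp A", "ACGT" * 310, True),   # representative genome
    Seq16S("s4", "g3", "sp A", "ACGN" * 300, False),  # contains N
    Seq16S("s5", "g4", "sp A", "ACGT" * 600, False),  # 2400 bases, too long
]

kept = longest_per_genome([r for r in records if passes_basic_filters(r)])
print([r.seq_id for r in select_per_species(kept, 1)])  # ['s3']
print([r.seq_id for r in select_per_species(kept, 5)])  # ['s3', 's2']
```

Sorting on the tuple `(not is_representative, -length)` encodes the two-level priority in a single key, so the same function can produce the 1, 5 and 20 genome selections by changing only `max_per_species`.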