COI reference sequences from BOLD DB

dataset

posted on 2023-01-16, 08:54 authored by John Sundh

Dataset description

This item contains COI (mitochondrial cytochrome oxidase subunit I) sequences collected from the BOLD database. The FASTA file bold_clustered_cleaned.fasta.gz has record IDs that can be queried in the Public Data Portal, and each FASTA header contains the taxonomic ranks and the BIN ID assigned to the record. The taxonomic information for each record is also given in the tab-separated file bold_info_filtered.tsv.gz.
The file bold_clustered.sintax.fasta.gz is directly compatible with the SINTAX algorithm in vsearch, while the files bold_clustered.assignTaxonomy.fasta.gz and bold_clustered.addSpecies.fasta.gz are directly compatible with the assignTaxonomy and addSpecies functions from DADA2, respectively. The dataset was last generated on December 16, 2022.

NOTE: We have noticed that the gzipped files in this upload have been compressed twice for some reason. A quick fix is to unzip any file with a ".gz" extension, rename the unzipped file by adding the ".gz" extension back, and then unzip it once more. Sorry for the inconvenience.
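For readers who would rather not rename files by hand, the following is a minimal Python sketch of the same workaround. It is not part of the coidb package. The function name `open_text` is hypothetical, and the sketch assumes that a doubly compressed file can be recognised by the gzip magic bytes (`0x1f 0x8b`) at the start of the once-decompressed stream, which should hold for text payloads such as FASTA and TSV.

```python
import gzip
import io
from contextlib import contextmanager

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


@contextmanager
def open_text(path):
    """Open a .gz file as text, removing a second gzip layer if present.

    Hypothetical helper: works for both singly and doubly compressed files.
    """
    with gzip.open(path, "rb") as outer:
        stream = outer
        # Peek at the decompressed bytes without consuming them. If they
        # look like another gzip stream, add a second decompression layer.
        if outer.peek(2)[:2] == GZIP_MAGIC:
            stream = gzip.GzipFile(fileobj=outer, mode="rb")
        text = io.TextIOWrapper(stream, encoding="utf-8")
        try:
            yield text
        finally:
            text.close()
```

Used as `with open_text("bold_info_filtered.tsv.gz") as fh:`, this yields the decompressed lines as a stream, so the full file is never held in memory.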

Methods

The code used to generate this dataset consists of a snakemake workflow wrapped into a python package that can be installed with conda (`conda install -c bioconda coidb`). First, sequence and taxonomic information for records in the BOLD database is downloaded from the GBIF Hosted Datasets. This data is then filtered to keep only records annotated as 'COI-5P' and assigned to a BIN ID. The taxonomic information is parsed to assign species names and resolve higher-level ranks for each BIN ID. Sequences are processed to remove gap characters and leading and trailing `N`s. After this, any sequences with remaining non-standard characters are removed. The remaining sequences are then clustered at 100% identity using vsearch (Rognes _et al._ 2016). This clustering is done separately for the sequences assigned to each BIN ID.
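The sequence-cleaning step described above can be sketched in Python as follows. This illustrates the described logic and is not the coidb implementation. The function name `clean_sequence` is hypothetical, and three details are assumptions: `-` is the only gap character, the "standard" characters are `A`, `C`, `G` and `T`, and sequences are uppercased before checking.

```python
import re

# Assumption: only unambiguous nucleotides count as "standard" characters.
STANDARD_ONLY = re.compile(r"[ACGT]+")


def clean_sequence(seq):
    """Return the cleaned sequence, or None if it should be discarded."""
    # Remove gap characters (assumed to be '-') and normalise case.
    seq = seq.upper().replace("-", "")
    # Trim leading and trailing N characters.
    seq = seq.strip("N")
    # Discard empty sequences and those with remaining non-standard characters.
    if not STANDARD_ONLY.fullmatch(seq):
        return None
    return seq
```

Under these assumptions, `clean_sequence("NN-AC-GTNN")` gives `"ACGT"`, while a sequence with an internal ambiguous base such as `"ACNGT"` is rejected.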

For more information, see https://github.com/biodiversitydata-se/coidb