# COI reference sequences from BOLD DB

## General information

Author: SBDI molecular data team

Contact e-mail: john.sundh@scilifelab.se

DOI: 10.17044/scilifelab.20514192

License: CC BY 4.0

Categories: Bioinformatics and computational biology not elsewhere classified, Computational ecology and phylogenetics, Ecology not elsewhere classified

Item type: Dataset

Keywords: COI sequence analysis, Ampliseq

Funding: Swedish Research Council (VR), grant number 2019-00242. National Bioinformatics Infrastructure Sweden

This README file was last updated: 2022-08-23

Please cite as: Swedish Biodiversity Data Infrastructure (SBDI; 2022). COI reference sequences from BOLD DB

## Dataset description

This item contains COI (mitochondrial cytochrome c oxidase subunit I) sequences collected from the [BOLD database](https://boldsystems.org/). The FASTA file `bold_clustered_cleaned.fasta.gz` has record ids that can be queried in the [Public Data Portal](https://boldsystems.org/index.php/Public_BINSearch?searchtype=records), and each FASTA header contains the taxonomic ranks plus the BIN ID assigned to the record. The taxonomic information for each record is also given in the tab-separated file `bold_info_filtered.tsv.gz`.

The dataset was last generated on February 18, 2022.

### Methods

The code used to generate this dataset consists of a Snakemake workflow wrapped into a Python package that can be installed with [conda](https://docs.conda.io/en/latest/miniconda.html) (`conda install -c bioconda coidb`).

First, sequence and taxonomic information for records in the BOLD database is downloaded from the [GBIF Hosted Datasets](https://hosted-datasets.gbif.org/ibol/). This data is then filtered to keep only records annotated as 'COI-5P' and assigned to a BIN ID. The taxonomic information is parsed in order to assign species names and resolve higher-level ranks for each BIN ID. Sequences are processed to remove gap characters and leading and trailing `N`s.
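
As a rough illustration of the gap removal and `N` trimming just described, the Python sketch below shows one way such a step could be written. It is not taken from the coidb code; the function name `clean_sequence` and the assumption that gaps are encoded as `-` are hypothetical and used for illustration only.

```python
def clean_sequence(seq: str, gap_chars: str = "-") -> str:
    """Remove gap characters, then strip leading and trailing N's.

    Assumption: gaps are encoded with the characters in `gap_chars`
    (only "-" by default). Internal N's are left untouched.
    """
    for gap in gap_chars:
        seq = seq.replace(gap, "")
    # str.strip only removes characters from the two ends of the string
    return seq.strip("N")


if __name__ == "__main__":
    # Made-up example sequence, not a record from the dataset
    print(clean_sequence("NN-ACGT--ACNGT-NNN"))  # prints ACGTACNGT
```

Gaps are removed before trimming so that an `N` sitting next to a leading or trailing gap is also stripped.
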
After this, any sequences with remaining non-standard characters are removed. Sequences are then clustered at 100% identity using [vsearch](https://github.com/torognes/vsearch) (Rognes _et al._ 2016). This clustering is done separately for sequences assigned to each BIN ID.

## References

Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584. doi: 10.7717/peerj.2584