Processed ASV data from the Insect Biome Atlas Project

Version 3 2024-10-30, 09:23
Version 2 2024-10-24, 20:32
Version 1 2024-10-23, 08:38
dataset
posted on 2024-10-30, 09:23, authored by Andreia Miraldo, Elzbieta Iwaszkiewicz-Eggebrecht, John Sundh, Lokeshwaran Manoharan, Emma Granqvist, Anders F. Andersson, Piotr Łukasik, Tomas Roslin, Ayco J. M. Tack, Fredrik Ronquist

The Insect Biome Atlas project was supported by the Knut and Alice Wallenberg Foundation (dnr 2017.0088). The project analyzed the insect faunas of Sweden and Madagascar, and their associated microbiomes, mainly using DNA metabarcoding of Malaise trap samples collected in 2019 (Sweden) or 2019–2020 (Madagascar).

Please cite this version of the dataset as: Miraldo A, Iwaszkiewicz-Eggebrecht E, Sundh J, Manoharan L, Granqvist E, Goodsell R, Andersson AF, Łukasik P, Roslin T, Tack A, Ronquist F. 2024. Processed ASV data from the Insect Biome Atlas Project, version 3. doi:10.17044/scilifelab.27202368.v3 or https://doi.org/10.17044/scilifelab.27202368.v3

This dataset contains the results from bioinformatic processing of version 1 of the amplicon sequence variant (ASV) data from the Insect Biome Atlas project (Miraldo et al. 2024), that is, the cytochrome c oxidase subunit 1 (CO1) metabarcoding data from Malaise trap samples processed using the FAVIS mild lysis protocol (Iwaszkiewicz-Eggebrecht et al. 2023). The bioinformatic processing involved: (1) taxonomic assignment of ASVs; (2) chimera removal; (3) clustering into OTUs; (4) noise filtering; and (5) cleaning. The clustering step involved resolution of the taxonomic annotation of each cluster and identification of a representative ASV. The noise filtering step involved removal of ASV clusters identified as potentially originating from nuclear mitochondrial DNA (NUMTs) or representing other types of error or noise. The cleaning step involved removal of ASV clusters present in >5% of negative control samples. ASV taxonomic assignments, ASV cluster designations, consensus taxonomies and summed counts of clusters in the sequenced samples are provided in compressed tab-separated files. Sequences of cluster representatives are provided in compressed FASTA format files. The bioinformatic processing pipeline is further described in Sundh et al. (2024). NB! All result files include ASVs and clusters that represent biological and synthetic spike-ins.

Methods

Taxonomic assignment

ASVs were taxonomically assigned using k-mer-based methods implemented in a Snakemake workflow available here. Specifically, ASVs were assigned a taxonomy using the SINTAX algorithm in vsearch (v2.21.2) with a CO1 database constructed from the Barcode of Life Data System (Sundh 2022). ASVs assigned to class 'Insecta' or 'Collembola' but unassigned at lower taxonomic ranks were then placed into a reference phylogeny of 49,325 insect species (represented by 49,338 sequences) using the phylogenetic placement tool EPA-NG, with subsequent taxonomic assignments using GAPPA. Assignments at the order level in this second pass were used to update the first k-mer-based assignments, but only at the order level, leaving child ranks with the ‘unclassified’ prefix.

Chimera removal

Chimeric ASVs in the input data were first identified using the ‘uchime_denovo’ method implemented in vsearch. This was done with a so-called ‘strict samplewise’ strategy, in which each sample was analysed separately (hence ‘samplewise’), comparing only ASVs present in the same sample. Further, an ASV had to be identified as chimeric in all samples where it was present (hence ‘strict’) in order to be removed.
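The ‘strict samplewise’ decision rule can be illustrated with a minimal sketch. It assumes that per-sample chimera calls are already available (for example, from running uchime_denovo on each sample separately); the function name and the input dictionaries are hypothetical and are not part of the published workflow.

```python
from collections import defaultdict


def strict_samplewise_chimeras(presence, flagged):
    """Return ASVs flagged as chimeric in every sample where they occur.

    presence: dict mapping sample -> set of ASV ids present in that sample
    flagged:  dict mapping sample -> set of ASV ids called chimeric in that sample
    (both inputs are hypothetical stand-ins for per-sample chimera results)
    """
    n_present = defaultdict(int)
    n_flagged = defaultdict(int)
    for sample, asvs in presence.items():
        sample_flags = flagged.get(sample, set())
        for asv in asvs:
            n_present[asv] += 1
            if asv in sample_flags:
                n_flagged[asv] += 1
    # 'strict': chimeric calls must cover all samples where the ASV is present
    return {asv for asv, n in n_present.items() if n_flagged[asv] == n}


# Made-up example data
presence = {"s1": {"a", "b", "c"}, "s2": {"a", "b"}, "s3": {"c"}}
flagged = {"s1": {"a", "c"}, "s2": {"a"}, "s3": set()}
chimeras = strict_samplewise_chimeras(presence, flagged)
print(chimeras)
```

In this toy input, ASV "a" is called chimeric in both samples where it occurs and is removed, whereas "c" is called chimeric in only one of its two samples and is therefore kept.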

ASV clustering

Non-chimeric sequences were then split by family-level taxonomic assignments and ASVs within each family were clustered in parallel using swarm (v3.1.0) with differences=15. Representative ASVs were selected for each generated cluster by taking the ASV with the highest relative abundance across all samples in a cluster. Counts were generated at the cluster level by summing over all ASVs in each cluster.
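The summing of counts and the choice of a representative ASV can be sketched as follows. This is an illustration rather than the workflow's code: the function names are hypothetical, and the abundance score used here (mean relative abundance across samples) is an assumption, since the exact statistic is not spelled out above.

```python
from collections import defaultdict


def cluster_counts(asv_counts, asv_to_cluster):
    """Sum ASV read counts per cluster and sample."""
    summed = defaultdict(lambda: defaultdict(int))
    for asv, per_sample in asv_counts.items():
        cluster = asv_to_cluster[asv]
        for sample, n in per_sample.items():
            summed[cluster][sample] += n
    return {c: dict(s) for c, s in summed.items()}


def pick_representatives(asv_counts, asv_to_cluster):
    """Pick, per cluster, the ASV with the highest mean relative abundance.

    Assumption: relative abundance = ASV count / total count in the sample,
    averaged over all samples. Ties go to the alphabetically first ASV id.
    """
    sample_totals = defaultdict(int)
    for per_sample in asv_counts.values():
        for sample, n in per_sample.items():
            sample_totals[sample] += n
    n_samples = len(sample_totals)
    best = {}
    for asv in sorted(asv_counts):
        score = sum(
            n / sample_totals[s]
            for s, n in asv_counts[asv].items()
            if sample_totals[s]
        ) / n_samples
        cluster = asv_to_cluster[asv]
        if cluster not in best or score > best[cluster][1]:
            best[cluster] = (asv, score)
    return {c: asv for c, (asv, _) in best.items()}


# Made-up example data
asv_counts = {
    "asv1": {"s1": 12, "s2": 0},
    "asv2": {"s1": 5, "s2": 5},
    "asv3": {"s1": 5, "s2": 15},
}
asv_to_cluster = {"asv1": "c1", "asv2": "c1", "asv3": "c2"}
counts = cluster_counts(asv_counts, asv_to_cluster)
reps = pick_representatives(asv_counts, asv_to_cluster)
```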

Consensus taxonomy

A consensus taxonomy was created for each cluster by taking into account the taxonomic assignments of all ASVs in a cluster as well as the total abundance of ASVs. For each cluster, starting at the most resolved taxonomic level, each unique taxonomic assignment was weighted by the sum of read counts of ASVs with that assignment. If a single weighted assignment made up 80% or more of all weighted assignments at that rank, that taxonomy was propagated to the ASV cluster, including parent rank assignments. If no taxonomic assignment was above the 80% threshold, the algorithm continued to the parent rank in the taxonomy. Taxonomic assignments at any available child ranks were set to the consensus assignment prefixed with ‘unresolved’.
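The consensus procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the workflow's implementation: the function name is hypothetical, assignments are compared as full lineages down to the rank being tested, the prefix is written as ‘unresolved.’ with a trailing dot, and a cluster with no consensus at any rank is labelled ‘unresolved’ throughout.

```python
from collections import defaultdict

RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species", "BOLD_bin"]


def consensus_taxonomy(asvs, threshold=0.8, ranks=RANKS):
    """Compute a read-weighted consensus taxonomy for one cluster.

    asvs: list of (taxonomy, read_count) pairs, where taxonomy maps rank -> label.
    """
    total = sum(count for _, count in asvs)
    # Start at the most resolved rank and walk towards the root.
    for i in range(len(ranks) - 1, -1, -1):
        weights = defaultdict(int)
        for tax, count in asvs:
            lineage = tuple(tax[r] for r in ranks[: i + 1])
            weights[lineage] += count
        lineage, weight = max(weights.items(), key=lambda kv: kv[1])
        if total > 0 and weight / total >= threshold:
            # Propagate the winning assignment together with its parent ranks.
            result = dict(zip(ranks, lineage))
            for child in ranks[i + 1:]:
                result[child] = "unresolved." + lineage[-1]
            return result
    return {r: "unresolved" for r in ranks}  # assumption: no consensus at any rank


# Made-up example with a shortened rank list
asvs = [
    ({"Family": "Apidae", "Genus": "Bombus", "Species": "Bombus terrestris"}, 70),
    ({"Family": "Apidae", "Genus": "Bombus", "Species": "Bombus lucorum"}, 30),
]
consensus = consensus_taxonomy(asvs, ranks=["Family", "Genus", "Species"])
```

Here neither species reaches 80% of the reads, so the consensus falls back to the genus level and the species rank becomes "unresolved.Bombus".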

Noise filtering and cleaning

The clustered data were further cleaned of NUMTs and other types of noise using the NEEAT algorithm, which takes taxonomic annotation, correlations in occurrence across samples (‘echo signal’) and evolutionary signatures into account, as well as cluster abundance (Sundh et al., 2024). We used default settings for all parameters in the evolutionary and distributional filtering steps, and removed clusters unassigned at the order level and with fewer than 3 reads summed across each dataset.

As a last clean-up step in the noise filtering, clusters containing at least one ASV present in more than 5% of blanks were removed. Further, we removed ASVs assigned to a reference sequence in the BOLD database annotated as Zoarces gillii (BOLD:AEB5125), a fish found between Japan and eastern Korea. Closer inspection revealed that this was a mis-annotated bacterial sequence, and ASVs assigned to this reference most likely represent bacterial sequences in our dataset. This record was deleted from BOLD after our custom reference database was constructed.
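The blank-based clean-up rule can be illustrated with a short sketch. The function name and inputs are hypothetical; the actual cleaning was done with the custom R script mentioned below, which may differ in detail.

```python
def clusters_to_remove(asv_blank_presence, asv_to_cluster, n_blanks, max_fraction=0.05):
    """Return clusters containing at least one ASV found in too many blanks.

    asv_blank_presence: dict mapping ASV id -> set of blank samples containing it
    n_blanks: total number of blank (negative control) samples
    """
    flagged = set()
    for asv, blanks in asv_blank_presence.items():
        # Strictly more than 5% of blanks, matching the rule stated above.
        if n_blanks and len(blanks) / n_blanks > max_fraction:
            flagged.add(asv_to_cluster[asv])
    return flagged


# Made-up example: 40 blanks in total
asv_blank_presence = {"asv1": {"b1", "b2", "b3"}, "asv2": {"b1"}, "asv3": set()}
asv_to_cluster = {"asv1": "c1", "asv2": "c2", "asv3": "c1"}
removed = clusters_to_remove(asv_blank_presence, asv_to_cluster, n_blanks=40)
```

With 40 blanks, "asv1" occurs in 7.5% of them, so its whole cluster "c1" is flagged, including "asv3", which never occurs in a blank.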

The chimera filtering and ASV clustering methods have been implemented in a Snakemake workflow available here. This workflow takes as input:


  1. The ASV sequences in FASTA format
  2. A tab-delimited file of counts of ASVs (rows) in samples (columns)

Data for 1) and 2) are available at https://doi.org/10.17044/scilifelab.25480681.v5

Cleaning of ASV clusters in controls and identification of spike-ins were done with a custom R script available here.

Available data

Processed ASV data files

ASV taxonomic assignments, non-chimeric ASV cluster designations, consensus taxonomies and summed counts of clusters in the sequenced samples are provided in compressed tab-separated files; sequences of cluster representatives are provided in compressed FASTA files. Files are organized by country (Sweden and Madagascar), marked by the suffixes SE and MG, respectively.

Taxonomic assignments

The files asv_taxonomy_[SE|MG].tsv.gz are tab-separated files with taxonomic assignments using SINTAX+EPA-NG for all ASVs. Columns:

  • ASV: The id of the ASV
  • Kingdom, Phylum, Class, Order, Family, Genus, Species, BOLD_bin: Taxonomic assignment for each rank.

If an ASV was unclassified at a particular rank, the taxonomic label is prefixed with ‘unclassified.’ followed by the taxonomic assignment of the most resolved parent rank.
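As a reading aid, the sketch below loads a taxonomy table and reports the most resolved classified rank for each ASV, using the ‘unclassified.’ prefix convention described above. The helper names are hypothetical, and the example row is made up rather than taken from the dataset.

```python
import csv
import gzip
import io

RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species", "BOLD_bin"]


def open_tsv(path):
    """Open a gzip-compressed TSV file (e.g. asv_taxonomy_SE.tsv.gz) as text."""
    return gzip.open(path, "rt", newline="")


def read_taxonomy(handle):
    """Read a taxonomy table into a dict mapping ASV id -> row dict."""
    return {row["ASV"]: row for row in csv.DictReader(handle, delimiter="\t")}


def most_resolved(row):
    """Return (rank, label) for the deepest classified rank, or None."""
    for rank in reversed(RANKS):
        label = row[rank]
        if not label.startswith("unclassified."):
            return rank, label
    return None


# Made-up in-memory stand-in for a real file
example = (
    "ASV\t" + "\t".join(RANKS) + "\n"
    "asv1\tAnimalia\tArthropoda\tInsecta\tDiptera"
    + "\tunclassified.Diptera" * 4 + "\n"
)
taxonomy = read_taxonomy(io.StringIO(example))
print(most_resolved(taxonomy["asv1"]))
```

For a downloaded file, the same reader could be used as `with open_tsv("asv_taxonomy_SE.tsv.gz") as handle: taxonomy = read_taxonomy(handle)`.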

The files asv_taxonomy_sintax_[SE|MG].tsv.gz, asv_taxonomy_epang_[SE|MG].tsv.gz and asv_taxonomy_vsearch_[SE|MG].tsv.gz have the same structure, but contain results from assignments with SINTAX, EPA-NG and VSEARCH, respectively.

Cluster assignments

The files cluster_taxonomy_[SE|MG].tsv.gz are tab-separated files containing all non-chimeric ASVs (that is, the ASVs passing the chimera-filtering step) with their corresponding taxonomic and cluster assignments. Columns:

  • ASV: ASV id
  • cluster: name of designated cluster
  • median: the median of normalized reads across all samples for each ASV
  • Kingdom, Phylum, Class, Order, Family, Genus, Species, BOLD_bin: taxonomic assignment of each ASV
  • representative: contains 1 if ASV is a representative of its cluster, otherwise 0
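The `representative` column makes it straightforward to look up the representative ASV of each cluster. The following is a minimal sketch; the function name is hypothetical, and the example table is made up and shows only a subset of the columns listed above.

```python
import csv
import io


def cluster_representatives(handle):
    """Map each cluster name to the id of its representative ASV."""
    reps = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["representative"] == "1":
            reps[row["cluster"]] = row["ASV"]
    return reps


# Made-up example with a subset of the real columns
example = (
    "ASV\tcluster\tmedian\tOrder\trepresentative\n"
    "asv1\tcluster1\t0.02\tDiptera\t1\n"
    "asv2\tcluster1\t0.01\tDiptera\t0\n"
    "asv3\tcluster2\t0.05\tColeoptera\t1\n"
)
reps = cluster_representatives(io.StringIO(example))
```

For the compressed files, the handle could be obtained with `gzip.open(path, "rt", newline="")` from the standard library.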

Cluster counts

The files cluster_counts_[SE|MG].tsv.gz are tab-separated files with read counts of ASV clusters (rows) in samples (columns). Counts have been summed over all ASVs belonging to each cluster. Note that these files contain counts for biological spike-ins and, for Sweden, also synthetic spike-ins.

Sequences of cluster representatives

The files cluster_reps_[SE|MG].fasta.gz are text files in FASTA format with representative sequences for each cluster. The FASTA headers have the format “>ASV_ID CLUSTER_NAME”.
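The header format can be parsed with a few lines of standard-library code. This is a minimal sketch with a hypothetical function name; it assumes every header contains both an ASV id and a cluster name separated by whitespace, and the sequences shown are made up.

```python
def parse_cluster_reps(lines):
    """Parse FASTA lines into a dict mapping cluster name -> (ASV id, sequence)."""
    reps = {}
    asv_id = cluster = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if asv_id is not None:
                reps[cluster] = (asv_id, "".join(chunks))
            asv_id, cluster = line[1:].split(maxsplit=1)
            chunks = []
        else:
            chunks.append(line)
    if asv_id is not None:
        reps[cluster] = (asv_id, "".join(chunks))
    return reps


# Made-up example records
fasta_lines = [">asv1 cluster1", "ACGT", "TTGA", ">asv2 cluster2", "GGCC"]
reps = parse_cluster_reps(fasta_lines)
```

For the compressed files, the lines could be read from `gzip.open(path, "rt")` and passed to the function directly.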

Consensus taxonomy

The files cluster_consensus_taxonomy_[SE|MG].tsv.gz are tab-separated files with the consensus taxonomy of each generated ASV cluster. Columns are the same as in asv_taxonomy_[SE|MG].tsv.gz.

Noise-filtered data

The files prefixed with 'noise_filtered' contain data that have been cleaned of NUMTs and other types of noise using the NEEAT algorithm. The files contain the same information as the cluster files, but only for clusters that passed the noise filtering step.

Cleaned noise-filtered data

The files prefixed with 'cleaned_noise_filtered' contain data that have been cleaned of NUMTs and other types of noise using the NEEAT algorithm, and further cleaned of clusters present in >5% of blanks. The files contain the same information as the cluster files, but only for clusters that passed the noise filtering and cleaning steps.

Additional files

The files removed_control_tax_[SE|MG].tsv.gz contain the ASV clusters removed from each dataset as part of cleaning.

The files spikeins_tax_[SE|MG].tsv.gz contain the taxonomic assignments of the biological spike-ins identified.

References:

  • Iwaszkiewicz-Eggebrecht, E., Łukasik, P., Buczek, M., Deng, J., Hartop, E. A., Havnås, H., ... & Miraldo, A. (2023). FAVIS: Fast and versatile protocol for non-destructive metabarcoding of bulk insect samples. PloS one, 18(7), e0286272.
  • Miraldo, A., Iwaszkiewicz-Eggebrecht, E., Sundh, J., Manoharan, L., Granqvist, E., Andersson, A., Łukasik, P., Roslin, T., Tack, A. J. M., & Ronquist, F. (2024). Amplicon sequence variants from the Insect Biome Atlas project (Version 5). SciLifeLab. https://doi.org/10.17044/scilifelab.25480681.v5
  • Sundh, J. (2022). COI reference sequences from BOLD DB (Version 4). SciLifeLab. https://doi.org/10.17044/scilifelab.20514192.v4


Funding

Insect Biome Atlas

Knut and Alice Wallenberg Foundation

National Bioinformatics Infrastructure Sweden (NBIS)

Publisher

Naturhistoriska riksmuseet
