
Amplicon sequence variants from the Insect Biome Atlas project

Posted on 2024-05-17, 07:34, authored by Andreia Miraldo, Elzbieta Iwaszkiewicz-Eggebrecht, John Sundh, Lokeshwaran Manoharan, Emma Granqvist, Anders Andersson, Piotr Łukasik, Tomas Roslin, Ayco J. M. Tack, Fredrik Ronquist

General information

The Insect Biome Atlas project was supported by the Knut and Alice Wallenberg Foundation (dnr 2017.0088). The project analyzed the insect faunas of Sweden and Madagascar, and their associated microbiomes, mainly using DNA metabarcoding of Malaise trap samples collected in 2019 (Sweden) or 2019–2020 (Madagascar).

Please cite this version of the dataset as: Miraldo A, Iwaszkiewicz-Eggebrecht E, Sundh J, Manoharan L, Granqvist E, Andersson AF, Łukasik P, Roslin T, Tack A, Ronquist F. 2024. Dataset of amplicon sequence variants (ASVs) from the Insect Biome Atlas Project, version 2.

Dataset description

This dataset contains amplicon sequence variants (ASVs) generated from high-throughput sequencing of the cytochrome c oxidase subunit I (COI) gene from Malaise trap samples (lysates, homogenates and preservative ethanol) and soil and litter samples. It includes ASV sequences and abundance information (number of reads) as well as metadata files that are needed to interpret and analyse the data further. Future versions of the dataset will include additional data.


Samples were sequenced using Illumina technology. Raw data are available at the European Nucleotide Archive (ENA) under project PRJEB61109. The raw sequence data were preprocessed using a Snakemake workflow. The preprocessed reads were then used as input to the AmpliSeq Nextflow pipeline (v.2.1.0) to generate ASVs.

Available data

Two types of files are provided: ASV files and metadata files. Files marked with 'SE' and 'MG' contain data from Sweden and Madagascar, respectively.

The file shasum.txt contains checksums for each of the files.
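
The integrity check can be sketched as follows. This is a minimal example under two assumptions that the description does not confirm: that shasum.txt uses the common "<digest>  <filename>" layout, and that the hash algorithm can be inferred from the digest length. The demo file name is hypothetical and is created in a temporary directory.

```python
import hashlib
import os
import tempfile

# Assumption: the digest length (in hex characters) identifies the algorithm.
ALGORITHMS = {32: "md5", 40: "sha1", 64: "sha256", 128: "sha512"}


def parse_checksum_line(line):
    """Split one '<digest>  <filename>' line into (digest, filename)."""
    digest, filename = line.strip().split(maxsplit=1)
    # Some tools mark binary mode with a leading '*' on the file name.
    return digest.lower(), filename.lstrip("*")


def file_digest(path, algorithm, chunk_size=1 << 20):
    """Hash a file in chunks so large .gz files need not fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_checksums(checksum_path, directory="."):
    """Return {filename: True/False} for every entry in the checksum file."""
    results = {}
    with open(checksum_path) as handle:
        for line in handle:
            if not line.strip():
                continue
            expected, filename = parse_checksum_line(line)
            algorithm = ALGORITHMS[len(expected)]
            actual = file_digest(os.path.join(directory, filename), algorithm)
            results[filename] = actual == expected
    return results


# Self-contained demo on a temporary file (no dataset files are needed).
with tempfile.TemporaryDirectory() as tmp:
    payload = b"ASV\tsample1\n"
    with open(os.path.join(tmp, "demo.tsv"), "wb") as fh:
        fh.write(payload)
    with open(os.path.join(tmp, "shasum.txt"), "w") as fh:
        fh.write(f"{hashlib.sha1(payload).hexdigest()}  demo.tsv\n")
    demo_result = verify_checksums(os.path.join(tmp, "shasum.txt"), tmp)

print(demo_result)
```

For the real files, calling verify_checksums("shasum.txt", directory) on the download directory should report True for every file if the assumptions above hold.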

ASV files

ASV sequences in fasta format are found in files CO1_asv_seqs_SE.fasta.gz and CO1_asv_seqs_MG.fasta.gz. Counts of ASVs in each sample are in CO1_asv_counts_SE.tsv.gz and CO1_asv_counts.MG.tsv.gz. The Swedish dataset contains 821,559 ASVs in 6,169 samples. The Madagascar dataset contains 701,769 ASVs in 2,287 samples.
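
A minimal sketch of how these gzipped files might be read without loading them fully into memory. It assumes that the counts table has one row per ASV, with the ASV id in the first column and one column per sample, and that counts are integers; the ids and sequences in the demo are made up.

```python
import csv
import gzip
import io


def read_fasta(handle):
    """Yield (asv_id, sequence) pairs from an open FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header line as the id.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def reads_per_sample(handle):
    """Sum read counts per sample column, one ASV row at a time."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # assumption: the first column holds the ASV id
    totals = dict.fromkeys(samples, 0)
    for row in reader:
        for sample, value in zip(samples, row[1:]):
            totals[sample] += int(value)
    return totals


# In-memory stand-ins for the real .fasta.gz and .tsv.gz files.
fasta_gz = gzip.compress(b">asv1\nACGT\nTT\n>asv2\nGGCC\n")
counts_gz = gzip.compress(b"ASV\tP1_001\tP1_002\nasv1\t10\t0\nasv2\t5\t7\n")

with gzip.open(io.BytesIO(fasta_gz), "rt") as fh:
    sequences = dict(read_fasta(fh))
with gzip.open(io.BytesIO(counts_gz), "rt", newline="") as fh:
    totals = reads_per_sample(fh)

print(sequences)
print(totals)
```

With the real data, gzip.open can be given the file path directly, for example gzip.open("CO1_asv_seqs_SE.fasta.gz", "rt"). Given the size of the tables (hundreds of thousands of ASVs by thousands of samples), streaming row by row as above is likely to be gentler on memory than reading a whole file at once.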

Metadata files

Three types of metadata files are included:

  1. sequencing_metadata files with information about samples that were processed in the lab and sequenced.
  2. samples_metadata files with information about samples that were collected in the field.
  3. sites_metadata files with information about sites where samples were collected.

Sequencing metadata files

Two sequencing metadata files are included (CO1_sequencing_metadata_SE.tsv and CO1_sequencing_metadata_MG.tsv) with information about samples that were sequenced. Columns in these files are as follows:

  • sampleID_NGI: Sample id given by the sequencing facility (matching the columns in the counts file)
  • sampleID_HISTORICAL: Custom user id
  • sampleID_FIELD: Sample id from field sampling
  • sampleID_LAB: Sample id from handling in the lab
  • dataset: Dataset designation for each sample
  • lab_sample_type: Type of sample, e.g. 'sample', 'buffer_blank', 'pcr_neg' etc.
  • country: Country of origin for sample
  • biological_spikes: True if sample has biological spike ins added
  • artificial_spikes: True if sample has artificial spike ins added at the time of DNA purification
  • sample_metadata_file: Corresponding metadata file for sample
  • lysate_rack_ID: Identification of 96-well plate where lysate aliquot is stored in the lab (internal use only)
  • lysate_well_ID: Identification of well position where lysate aliquot is stored in the lab (internal use only)
  • dna_plate_ID: Identification of 96-well plate where purified DNA is stored in the lab (internal use only)
  • dna_plate_well_ID: Identification of well position where purified DNA is stored in the lab (internal use only)
  • sequencing_batch: Custom user id for sequencing batch number
  • sequencing_batch_NGI: Sequencing batch number given by the sequencing facility
  • notes_lab: Additional information about sample processing in the lab (only for SE file)
  • sequencing_status: Additional information about sample sequencing status. If a sample has a value of “sequencing failed” in this column, then this sample will be missing from the ASV counts file
  • study_accession_ENA: Study identification at the European Nucleotide Archive
  • sample_accession_ENA: Sample identification at the European Nucleotide Archive
  • experiment_accession_ENA: Experiment identification at the European Nucleotide Archive
  • run_accession_ENA: Run identification at the European Nucleotide Archive
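
As an illustration of how these columns relate to the counts file, the sketch below selects field samples that should be present as count columns. It relies only on the column names and values quoted above ('sample' in lab_sample_type, "sequencing failed" in sequencing_status); the sample ids in the demo are invented, and the exact spelling of values in the real files should be checked first.

```python
import csv
import io


def usable_sample_ids(metadata_handle, counts_header):
    """Return (kept, missing) lists of sampleID_NGI values.

    kept:    true samples that were sequenced and appear in the counts header
    missing: true samples expected in the counts file but not found there
    """
    count_columns = set(counts_header)
    kept, missing = [], []
    for row in csv.DictReader(metadata_handle, delimiter="\t"):
        if row["lab_sample_type"] != "sample":
            continue  # skip buffer blanks, PCR negatives and other controls
        if row["sequencing_status"] == "sequencing failed":
            continue  # such samples are absent from the ASV counts file
        sample_id = row["sampleID_NGI"]
        (kept if sample_id in count_columns else missing).append(sample_id)
    return kept, missing


# Tiny in-memory stand-in for a sequencing metadata file.
demo_metadata = io.StringIO(
    "sampleID_NGI\tlab_sample_type\tsequencing_status\n"
    "P1_001\tsample\t\n"
    "P1_002\tbuffer_blank\t\n"
    "P1_003\tsample\tsequencing failed\n"
    "P1_004\tsample\t\n"
)
demo_counts_header = ["ASV", "P1_001", "P1_002"]

kept, missing = usable_sample_ids(demo_metadata, demo_counts_header)
print(kept, missing)
```

A non-empty "missing" list would point to a mismatch between the metadata and the counts table that is worth investigating before analysis.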

Samples metadata files

Two samples_metadata files are included (samples_metadata_malaise_SE.tsv and samples_metadata_malaise_MG.tsv) with information about each sample that was collected in the field. Columns in these files are as follows:

  • sampleID_FIELD: Sample id from field sampling
  • trapID: Malaise trap id from field sampling
  • biomass_grams: Wet weight of each bulk sample
  • placing_time: Time when sampling started
  • placing_date: Date when sampling started
  • collecting_time: Time when sampling ended
  • collecting_date: Date when sampling ended
  • duration_min: Total sampling duration in minutes
  • trap_condition_collection: Condition of the Malaise trap at the time of collecting the sample from the trap (good; acceptable; poor)
  • sample_ethanol_conc: Concentration of preservative ethanol at the time of DNA extraction (only for SE file)
  • processing_group: Processing batch id (for internal use only)
  • sample_accession_ENA: Sample identification at the European Nucleotide Archive
  • sample_status: Additional information about sample processing status in the lab
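
The placing and collecting columns can be used to cross-check duration_min. The sketch below is only illustrative: the date and time formats are not stated in this description, so they are exposed as parameters with assumed defaults (ISO dates, 24-hour times), and the demo row is invented.

```python
from datetime import datetime


def sampling_minutes(row, date_format="%Y-%m-%d", time_format="%H:%M"):
    """Minutes between placing and collecting, from one metadata row.

    The default formats are assumptions; adjust them to match the files.
    """
    fmt = f"{date_format} {time_format}"
    start = datetime.strptime(
        f"{row['placing_date']} {row['placing_time']}", fmt
    )
    end = datetime.strptime(
        f"{row['collecting_date']} {row['collecting_time']}", fmt
    )
    return int((end - start).total_seconds() // 60)


def duration_matches(row, tolerance_min=0, **formats):
    """True if the recorded duration_min agrees with the computed value."""
    recorded = int(float(row["duration_min"]))
    return abs(recorded - sampling_minutes(row, **formats)) <= tolerance_min


# Hypothetical row: one week plus 30 minutes of sampling.
demo_row = {
    "placing_date": "2019-06-01",
    "placing_time": "10:00",
    "collecting_date": "2019-06-08",
    "collecting_time": "10:30",
    "duration_min": "10110",
}

computed = sampling_minutes(demo_row)
consistent = duration_matches(demo_row)
print(computed, consistent)
```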

For arthropod samples collected from litter and soil, two files are included: samples_metadata_soil_litter_SE.tsv and samples_metadata_litter_MG.tsv. For Madagascar, we did not collect arthropod samples from soil. In Madagascar, we collected four leaf litter samples at each trap location, one on each side of the Malaise trap (front, back, left and right), whereas in Sweden we collected only one sample at each trap location. Columns in these files are as follows:

  • sampleID_FIELD: Sample id from field sampling
  • trapID: Malaise trap id from field sampling
  • sample_type: Identifies if a sample is a soil or litter sample (only for SE file)
  • transectID: Identification of transect where sample was collected: front of Malaise trap (transectID=1), right hand side of Malaise trap (transectID=2), back of Malaise trap (transectID=3), left hand side of Malaise trap (transectID=4) (only for MG file)
  • biomass_grams: Wet weight of each bulk sample (only for MG file)
  • date: Date when sample was collected
  • time: Time when sample was collected
  • sample_accession_ENA: Sample identification at the European Nucleotide Archive
  • sample_status: Additional information about sample processing status in the lab
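
For the Madagascar litter file, the transectID codes listed above can be translated into positions and used to check that each trap has its four litter samples. This is a small sketch using only the column names described here; the sample and trap ids in the demo are invented.

```python
import csv
import io

# Mapping taken from the transectID column description (MG file only).
TRANSECT_POSITION = {"1": "front", "2": "right", "3": "back", "4": "left"}


def litter_samples_by_trap(handle):
    """Return {trapID: {position: sampleID_FIELD}} from a litter metadata TSV."""
    by_trap = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        position = TRANSECT_POSITION[row["transectID"].strip()]
        by_trap.setdefault(row["trapID"], {})[position] = row["sampleID_FIELD"]
    return by_trap


def incomplete_traps(by_trap):
    """List traps with fewer than the four expected litter samples."""
    return sorted(trap for trap, found in by_trap.items() if len(found) < 4)


# Tiny in-memory stand-in for the litter metadata file.
demo_litter = io.StringIO(
    "sampleID_FIELD\ttrapID\ttransectID\n"
    "L1\tT1\t1\n"
    "L2\tT1\t2\n"
    "L3\tT1\t3\n"
    "L4\tT1\t4\n"
    "L5\tT2\t1\n"
    "L6\tT2\t3\n"
)

by_trap = litter_samples_by_trap(demo_litter)
print(by_trap["T1"])
print(incomplete_traps(by_trap))
```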

Sites metadata files

Two files contain information about sampling sites: sites_metadata_SE.tsv and sites_metadata_MG.tsv. Columns in these files are as follows:

  • siteID: Sampling site id number. Note that some sites have several Malaise traps (malaise_trap_type=Multitrap)
  • trapID: Malaise trap id from field sampling
  • latitude_WGS84: Latitude in WGS84 coordinate system. This info specifies the Malaise trap location at the sampling site
  • longitude_WGS84: Longitude in WGS84 coordinate system. This info specifies the Malaise trap location at the site
  • trap_habitat: Habitat where the Malaise trap was located
  • malaise_trap_type: Identifies if there are multiple traps assembled at the sampling site (Multitrap) or only one (Single_trap)
  • parkID: Name of national park (for MG only)
  • provinceID: Name of province (for MG only)
  • NILS_mhabitat: Habitat of the National Inventory of Landscapes in Sweden (NILS) plot nearest to the Malaise trap location (only for SE file). For more information about NILS sampling design see here.
  • NILS_square: Identification of nearest NILS square for sampling site (only for SE file)
  • NILS_plot: Identification of nearest NILS plot to the Malaise trap location (only for SE file)
  • trap_orientation_degrees_S: Orientation in degrees of the collection head of the Malaise trap
  • notes: Notes associated with the Malaise trap (only for SE file)
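
The three metadata levels share key columns: sampleID_FIELD links sequencing metadata to samples metadata, and trapID links samples metadata to sites metadata. The sketch below shows one way to follow that chain. It assumes that each trapID occurs on a single row of the sites file; all ids and coordinates in the demo are invented, and the rows are plain dictionaries of the kind csv.DictReader would produce.

```python
from collections import defaultdict


def index_by(rows, key):
    """Build a {key value: row} lookup from an iterable of dict rows."""
    return {row[key]: row for row in rows}


def site_for_sequenced_sample(seq_row, samples_by_field_id, sites_by_trap):
    """Follow sampleID_FIELD -> trapID -> site row; None if a link is missing."""
    sample = samples_by_field_id.get(seq_row["sampleID_FIELD"])
    if sample is None:
        return None  # e.g. lab controls that have no field sample
    return sites_by_trap.get(sample["trapID"])


def traps_per_site(site_rows):
    """Group trap ids by siteID, since Multitrap sites hold several traps."""
    grouped = defaultdict(list)
    for row in site_rows:
        grouped[row["siteID"]].append(row["trapID"])
    return dict(grouped)


# Hypothetical rows standing in for the three metadata files.
sequencing_rows = [
    {"sampleID_NGI": "P1_001", "sampleID_FIELD": "F1"},
    {"sampleID_NGI": "P1_002", "sampleID_FIELD": ""},
]
sample_rows = [{"sampleID_FIELD": "F1", "trapID": "T1"}]
site_rows = [
    {"siteID": "S1", "trapID": "T1", "latitude_WGS84": "59.0",
     "longitude_WGS84": "18.0", "malaise_trap_type": "Multitrap"},
    {"siteID": "S1", "trapID": "T2", "latitude_WGS84": "59.0",
     "longitude_WGS84": "18.1", "malaise_trap_type": "Multitrap"},
]

samples_by_field_id = index_by(sample_rows, "sampleID_FIELD")
sites_by_trap = index_by(site_rows, "trapID")

linked = [
    site_for_sequenced_sample(row, samples_by_field_id, sites_by_trap)
    for row in sequencing_rows
]
grouped = traps_per_site(site_rows)
print(linked[0]["siteID"], linked[1], grouped)
```

Indexing the sites by trapID rather than siteID keeps the coordinates specific to each trap, which matters at Multitrap sites where several traps share one siteID.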


