SciLifeLab
Files (13):

  • CO1_asv_seqs_SE.fasta.gz (67.63 MB)
  • CO1_asv_counts_SE.tsv.gz (32.47 MB)
  • CO1_asv_seqs_MG.fasta.gz (63.97 MB)
  • CO1_asv_counts_MG.tsv.gz (23.35 MB)
  • CO1_sequencing_metadata_SE.tsv (1.06 MB)
  • CO1_sequencing_metadata_MG.tsv (469.47 kB)
  • samples_metadata_malaise_SE.tsv (620.08 kB)
  • samples_metadata_malaise_MG.tsv (304.31 kB)
  • sites_metadata_SE.tsv (13.88 kB)
  • sites_metadata_MG.tsv (5.21 kB)
  • shasum.txt (0.67 kB)
  • MANIFEST.txt (0.46 kB)
  • README.txt (7.63 kB)

Amplicon sequence variants from the Insect Biome Atlas project

Version 6 2024-11-25, 19:00
Version 5 2024-10-24, 20:45
Version 4 2024-10-21, 14:44
Version 3 2024-10-15, 08:31
Version 2 2024-05-17, 07:34
Version 1 2024-05-06, 08:32
Dataset posted on 2024-05-06, 08:32, authored by Andreia Miraldo, Elzbieta Iwaszkiewicz-Eggebrecht, John Sundh, Lokeshwaran Manoharan, Emma Granqvist, Anders Andersson, Piotr Łukasik, Tomas Roslin, Ayco J. M. Tack, Fredrik Ronquist

General information

The Insect Biome Atlas project was supported by the Knut and Alice Wallenberg Foundation (dnr 2017.0088). The project analyzed the insect faunas of Sweden and Madagascar, and their associated microbiomes, mainly using DNA metabarcoding of Malaise trap samples collected in 2019 (Sweden) or 2019–2020 (Madagascar).

Please cite this version of the dataset as: Miraldo A, Iwaszkiewicz-Eggebrecht E, Sundh J, Manoharan L, Granqvist E, Andersson AF, Łukasik P, Roslin T, Tack A, Ronquist F. 2024. Dataset of amplicon sequence variants (ASVs) from the Insect Biome Atlas Project, version 1. https://doi.org/10.17044/scilifelab.25480681

Dataset description

This dataset (version 1) contains amplicon sequence variants (ASVs) generated by high-throughput sequencing of the cytochrome c oxidase subunit I (CO1) gene from Malaise trap samples. Samples were processed with mild lysis; for 15 samples, we also provide sequencing data from homogenates and preservative ethanol. The dataset includes the ASV sequences, their abundances (number of reads), and the metadata files needed to interpret and analyse the data. Future versions of the dataset will include additional data.

Methods

Samples were sequenced using Illumina technology. Raw data are available at the European Nucleotide Archive (ENA) under project PRJEB61109. The raw sequence data were preprocessed using a Snakemake workflow available at https://github.com/biodiversitydata-se/amplicon-multi-cutadapt. Preprocessed reads were then used as input to the AmpliSeq Nextflow (v.2.1.0) pipeline to generate Amplicon Sequence Variants (ASVs).

Available data

In this dataset we provide two types of files: ASV files and metadata files. Files marked with 'SE' contain data from Sweden while those marked with 'MG' contain data from Madagascar.

The file shasum.txt contains checksums for each of the files. After downloading, run the following from the directory that contains the files:
shasum -c shasum.txt
to check file integrity.
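
Where the shasum command is not available, the same check can be sketched in Python. This is a minimal sketch, not part of the dataset: it assumes shasum.txt uses the usual "digest  filename" line layout and guesses the SHA variant from the length of the hex digest, since the algorithm is not stated here. The function names are hypothetical.

```python
import hashlib
from pathlib import Path

# Assumption: the SHA variant is inferred from the hex digest length.
_ALGO_BY_HEX_LEN = {40: "sha1", 56: "sha224", 64: "sha256", 96: "sha384", 128: "sha512"}


def file_digest(path, algo, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_checksums(checksum_file, base_dir="."):
    """Return {filename: True/False} for every entry in the checksum file."""
    results = {}
    for line in Path(checksum_file).read_text().splitlines():
        if not line.strip():
            continue
        # Assumed line layout: "<hex digest>  <filename>" (or "<digest> *<filename>").
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*").strip()
        algo = _ALGO_BY_HEX_LEN.get(len(expected))
        if algo is None:
            raise ValueError(f"Unrecognised digest length for {name!r}")
        path = Path(base_dir) / name
        results[name] = path.exists() and file_digest(path, algo) == expected.lower()
    return results
```

Calling verify_checksums("shasum.txt") in the download directory returns a dictionary in which any file mapped to False is missing or does not match its listed checksum.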

ASV files

This dataset contains ASV sequences in fasta format (CO1_asv_seqs_SE.fasta.gz and CO1_asv_seqs_MG.fasta.gz) and counts of ASVs in each sample (CO1_asv_counts_SE.tsv.gz and CO1_asv_counts_MG.tsv.gz). The Swedish dataset contains 636,297 ASVs in 4,873 samples (including negative and positive control samples). The Madagascar dataset contains 559,023 ASVs in 2,081 samples (including negative and positive control samples).
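
As a quick sanity check after download, the number of ASVs and samples can be counted without loading the files into memory. The sketch below uses only the Python standard library. It assumes the counts table has one header row, one row per ASV, an ASV identifier in the first column and one column per sample; that layout is inferred from the column descriptions in this record, not confirmed. The function names are hypothetical.

```python
import csv
import gzip


def count_fasta_records(path):
    """Count sequences in a gzipped FASTA file by counting '>' header lines."""
    with gzip.open(path, "rt") as fh:
        return sum(1 for line in fh if line.startswith(">"))


def counts_table_shape(path):
    """Return (number of ASV rows, number of sample columns) for a gzipped TSV.

    Assumption: the first row is a header and the first column holds the ASV id.
    """
    with gzip.open(path, "rt", newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        header = next(reader)
        n_rows = sum(1 for _ in reader)
    return n_rows, len(header) - 1
```

The results of count_fasta_records("CO1_asv_seqs_SE.fasta.gz") and counts_table_shape("CO1_asv_counts_SE.tsv.gz") can then be compared with the ASV and sample totals stated above.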

Metadata files

There are three types of metadata files included in this dataset:

  1. sequencing_metadata files with information about samples that were processed in the lab and sequenced.
  2. samples_metadata files with information about samples that were collected in the field.
  3. sites_metadata files with information about sites where samples were collected.

Sequencing metadata files

Two sequencing metadata files are included in this dataset (CO1_sequencing_metadata_SE.tsv and CO1_sequencing_metadata_MG.tsv) with information about samples that were sequenced. Columns in these files are as follows:

  • sampleID_NGI: Sample id given by the sequencing facility (matching the columns in the counts file)
  • sampleID_HISTORICAL: Custom user id
  • sampleID_FIELD: Sample id from field sampling
  • sampleID_LAB: Sample id from handling in the lab
  • dataset: Dataset designation for each sample
  • lab_sample_type: Type of sample, e.g. 'sample', 'buffer_blank', 'pcr_neg' etc.
  • country: Country of origin for sample
  • biological_spikes: True if sample has biological spike ins added
  • artificial_spikes: True if sample has artificial spike ins added at the time of DNA purification
  • sample_metadata_file: Corresponding metadata file for sample
  • lysate_rack_ID: Identification of 96-well plate where lysate aliquot is stored in the lab (internal use only)
  • lysate_well_ID: Identification of well position where lysate aliquot is stored in the lab (internal use only)
  • dna_plate_ID: Identification of 96-well plate where purified DNA is stored in the lab (internal use only)
  • dna_plate_well_ID: Identification of well position where purified DNA is stored in the lab (internal use only)
  • sequencing_batch: Custom user id for sequencing batch number
  • sequencing_batch_NGI: Sequencing batch number given by the sequencing facility
  • notes_lab: Additional information about sample processing in the lab (only for SE file)
  • sequencing_status: Additional information about sample sequencing status. A sample with the value “sequencing failed” in this column is missing from the ASV counts file
  • study_accession_ENA: Study identification at the European Nucleotide Archive
  • sample_accession_ENA: Sample identification at the European Nucleotide Archive
  • experiment_accession_ENA: Experiment identification at the European Nucleotide Archive
  • run_accession_ENA: Run identification at the European Nucleotide Archive
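
The columns above can be used to pick out which columns of the counts file to analyse. The following is a minimal sketch that keeps real samples (excluding blanks and controls) whose sequencing did not fail. The value 'sample' for lab_sample_type and the status string “sequencing failed” are taken from the column descriptions; the function name and the exact matching rules are assumptions.

```python
import csv


def sequenced_sample_ids(metadata_path, sample_type="sample"):
    """Return sampleID_NGI values for rows of the given lab_sample_type
    whose sequencing_status is not 'sequencing failed'."""
    ids = []
    with open(metadata_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if row["lab_sample_type"] != sample_type:
                continue  # skip buffer blanks, PCR negatives and other controls
            status = (row.get("sequencing_status") or "").strip().lower()
            if status == "sequencing failed":
                continue  # such samples are absent from the ASV counts file
            ids.append(row["sampleID_NGI"])
    return ids
```

The returned ids correspond to column names in the matching counts file, for example sequenced_sample_ids("CO1_sequencing_metadata_SE.tsv") for CO1_asv_counts_SE.tsv.gz.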

Samples metadata files

Two samples_metadata files are included in this dataset (samples_metadata_malaise_SE.tsv and samples_metadata_malaise_MG.tsv) with information about each sample that was collected in the field. Columns in these files are as follows:

  • sampleID_FIELD: Sample id from field sampling
  • trapID: Malaise trap id from field sampling
  • biomass_grams: Wet weight of each bulk sample
  • placing_time: Time when sampling started
  • placing_date: Date when sampling started
  • collecting_time: Time when sampling ended
  • collecting_date: Date when sampling ended
  • duration_min: Total number of minutes over which the sample was collected
  • trap_condition_collection: Condition of the malaise trap at the time of collecting the sample from the trap (good; acceptable; poor)
  • sample_ethanol_conc: Concentration of preservative ethanol at the time of DNA extraction (only for SE file)
  • processing_group: Processing batch id (for internal use only)
  • sample_accession_ENA: Sample identification at the European Nucleotide Archive
  • sample_status: Additional information about sample processing status in the lab
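
As an illustration of how these columns might be used, the sketch below summarises sampling effort per trap by counting samples and summing duration_min for each trapID. It is a minimal sketch: the function name is hypothetical, and it assumes that duration_min may be empty or non-numeric for some rows, in which case the row is counted but adds no minutes.

```python
import csv
from collections import defaultdict


def sampling_effort_per_trap(samples_path):
    """Return {trapID: (number of samples, total minutes of collection)}."""
    totals = defaultdict(lambda: [0, 0.0])
    with open(samples_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            entry = totals[row["trapID"]]
            entry[0] += 1
            try:
                entry[1] += float(row["duration_min"])
            except (TypeError, ValueError):
                pass  # assumption: missing or non-numeric durations are skipped
    return {trap: (n, minutes) for trap, (n, minutes) in totals.items()}
```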

Sites metadata files

There are two files that contain information about sampling sites, one for each country: sites_metadata_SE.tsv and sites_metadata_MG.tsv. Columns in these files are as follows:

  • siteID: Sampling site id number. Note that for some sites there can be several Malaise traps assembled (malaise_trap_type=Multitrap)
  • trapID: Malaise trap id from field sampling
  • latitude_WGS84: Latitude in WGS84 coordinate system. This info specifies the Malaise trap location at the sampling site
  • longitude_WGS84: Longitude in WGS84 coordinate system. This info specifies the Malaise trap location at the site
  • trap_habitat: Habitat where the Malaise trap was located
  • malaise_trap_type: Identifies if there are multiple traps assembled at the sampling site (Multitrap) or only one (Single_trap)
  • parkID: Name of national park (only for MG file)
  • provinceID: Name of province (only for MG file)
  • NILS_mhabitat: Habitat for nearest plot of the National Inventory of Landscapes in Sweden (NILS) from the malaise trap location (only for SE file). For more information about NILS sampling design, check: https://www.slu.se/centrumbildningar-och-projekt/nils_old/Datainsamling/bakgrund-och-mal/
  • NILS_square: Identification of nearest NILS square for sampling site (only for SE file)
  • NILS_plot: Identification of nearest NILS plot to the Malaise trap location (only for SE file)
  • trap_orientation_degrees_S: Orientation in degrees of the collection head of the Malaise trap
  • notes: Notes associated with the Malaise trap (only for SE file)
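
The three metadata file types share key columns, so a column in the counts file can be traced back to a trap location: sampleID_NGI to sampleID_FIELD (sequencing metadata), sampleID_FIELD to trapID (samples metadata), and trapID to coordinates (sites metadata). The sketch below follows that chain. It assumes that sampleID_FIELD is unique in the samples file, that trapID is unique in the sites file, and that coordinates are present and numeric; the function names are hypothetical. Rows without a matching field sample or trap, such as lab controls, are skipped.

```python
import csv


def read_tsv_indexed(path, key):
    """Read a TSV file into a dict of rows keyed by the given column."""
    with open(path, newline="", encoding="utf-8") as fh:
        return {row[key]: row for row in csv.DictReader(fh, delimiter="\t")}


def coordinates_by_sequenced_sample(sequencing_path, samples_path, sites_path):
    """Return {sampleID_NGI: (latitude, longitude)} for samples with a known trap."""
    samples = read_tsv_indexed(samples_path, "sampleID_FIELD")
    sites = read_tsv_indexed(sites_path, "trapID")
    coords = {}
    with open(sequencing_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            sample = samples.get(row["sampleID_FIELD"])
            site = sites.get(sample["trapID"]) if sample else None
            if site is None:
                continue  # no field sample or trap record, e.g. a lab control
            coords[row["sampleID_NGI"]] = (
                float(site["latitude_WGS84"]),
                float(site["longitude_WGS84"]),
            )
    return coords
```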


Funding

Insect Biome Atlas

Knut and Alice Wallenberg Foundation

National Bioinformatics Infrastructure Sweden (NBIS)

Swedish Research Council

Publisher

Fredrik Ronquist
