SciLifeLab
6. Ecological genomics of the Northern krill: Genome-scale variation

dataset
posted on 2024-03-28, 00:10 authored by Andreas Wallberg, Per Unneberg

This item holds multiple tar archives with genome-scale genetic variation data for the Northern krill, ranging from VCF files with SNPs and genotypes to results of population genetic analyses, including estimates of the levels of genetic variation, allele frequencies and divergence, and extended haplotype signatures. Depending on the estimate, values are given on a per-SNP or per-window basis.

Genome chunks

Because of the large genome (>19 Gb) and the large number of SNPs (>750 M), these data were typically produced in chunks. Based on preliminary mappings of RNA-seq data, scaffolds/contigs were divided into two groups: those containing expressed genes ("gene-rich sequences") and those that did not ("gene-poor sequences"). Each group was divided into 80 chunks of approximately the same length across the genome. Chunks are numbered from 1 to 80 and follow the sorting order of the genome assembly: chunks with low numbers contain data across longer but fewer sequences.
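As an illustration of the chunking idea only, the following Python sketch assigns sequences, in assembly order, to chunks of roughly equal total length. It is not the procedure used to build this dataset: the function name split_into_chunks, the assignment rule (by start offset on the concatenated sequences) and the sequence names are all hypothetical.

```python
from typing import List, Tuple


def split_into_chunks(seqs: List[Tuple[str, int]], n_chunks: int) -> List[List[str]]:
    """Assign (name, length) records, in the given order, to n_chunks groups
    of roughly equal total length. Hypothetical helper for illustration."""
    total = sum(length for _, length in seqs)
    target = total / n_chunks  # ideal amount of sequence per chunk
    chunks: List[List[str]] = [[] for _ in range(n_chunks)]
    offset = 0  # start of the current sequence on the concatenated genome
    for name, length in seqs:
        # A sequence goes to the chunk in which its start offset falls.
        idx = min(int(offset / target), n_chunks - 1)
        chunks[idx].append(name)
        offset += length
    return chunks


# Toy assembly sorted by decreasing sequence length (made-up names and lengths).
toy = [("s1", 100), ("s2", 50), ("s3", 50),
       ("s4", 25), ("s5", 25), ("s6", 25), ("s7", 25)]
chunks = split_into_chunks(toy, 3)
```

In this toy case the first chunk holds a single long sequence and the last chunk holds four short ones, which mirrors the pattern described above for low-numbered chunks.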

Population definitions

The population genetic data were grouped differently depending on the tests. Four main structures were used:

  1. "all" = all 74 diploid individuals/samples were grouped into a single population called "all" (i.e. chromosome sample size 2*74 = 148). This was used for example to measure the levels of genetic variation across the genome and in different genomic regions.
  2. "populations" = the 74 samples were grouped according to the eight populations / geographic regions of origin: bar=Barents Sea (n=10); brc=Barcelona (n=7); can=Canada (n=10); ice=Iceland (n=10); mai=Maine (n=10); nor=Norway (n=10); sva=Svalbard (n=9); swe=Sweden (n=8).
  3. "at vs. me" = Atlantic Ocean samples (n=67) vs. the Mediterranean (i.e. Barcelona) samples (n=7).
  4. "we vs. ea" = South-West North Atlantic Ocean (n=20) vs. North-East North Atlantic Ocean (n=47). In files using this contrast, sometimes the label "wa" is used instead of "we" for the South-West North Atlantic Ocean samples.
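The sample sizes of the four groupings can be cross-checked with a few lines of Python. The population codes and counts are taken from the list above; assigning Canada and Maine to the South-West group is an assumption inferred from the counts (10 + 10 = 20) and is not stated explicitly in this description.

```python
# Sample sizes per population, as listed above.
POPULATIONS = {
    "bar": 10, "brc": 7, "can": 10, "ice": 10,
    "mai": 10, "nor": 10, "sva": 9, "swe": 8,
}

n_all = sum(POPULATIONS.values())   # grouping 1: "all"
n_chromosomes = 2 * n_all           # diploid samples

# Grouping 3: "at vs. me" (Mediterranean = Barcelona).
n_me = POPULATIONS["brc"]
n_at = n_all - n_me

# Grouping 4: "we vs. ea". ASSUMPTION: "we" = Canada + Maine.
n_we = POPULATIONS["can"] + POPULATIONS["mai"]
n_ea = n_at - n_we

print(n_all, n_chromosomes, n_at, n_me, n_we, n_ea)
```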

Note: originally, 75 specimens were sequenced but one library failed and was excluded from all analyses.

Contents:

  1. bed_files.gene_rich_sequences.tar.gz, contains 80 BED files specifying which gene-rich sequences belong to which chunk, as well as the length of each sequence.
  2. bed_files.gene_poor_sequences.tar.gz, contains 80 BED files specifying which gene-poor sequences belong to which chunk, as well as the length of each sequence.
  3. population_definitions.tar.gz, contains TXT files specifying the sample groupings described above.
  4. vcfs.gene_rich_sequences.tar, contains 80 gzipped VCF files with variants on gene-rich sequences, as detected with Freebayes.
  5. vcfs.gene_poor_sequences.tar, contains 80 gzipped VCF files with variants on gene-poor sequences, as detected with Freebayes.
  6. vcfs.gene_rich_sequences.annotated.tar, contains 80 gzipped VCF files with decomposed and biallelic SNP variants on gene-rich sequences and further filtered for quality, imputed and phased with Beagle and functionally annotated with SnpEff.
  7. vcfs.gene_poor_sequences.annotated.tar, contains 80 gzipped VCF files with decomposed and biallelic SNP variants on gene-poor sequences and further filtered for quality, imputed and phased with Beagle and functionally annotated with SnpEff.
  8. diversity.gene_rich_sequences.all.tar.gz, contains estimates of the levels of genetic variation across gene-rich sequences in TSV format in 80 chunks.
  9. diversity.gene_poor_sequences.all.tar.gz, contains estimates of the levels of genetic variation across gene-poor sequences in TSV format in 80 chunks.
  10. diversity_regions.gene_rich_sequences.all.tar.gz, contains estimates of the levels of genetic variation across gene-rich sequences in TSV format in 80 chunks, subdivided by different gene regions (e.g. intergenic, CDS, intron, UTRs).
  11. diversity_regions.gene_poor_sequences.all.tar.gz, contains estimates of the levels of genetic variation across gene-poor sequences in TSV format in 80 chunks, subdivided by different gene regions (e.g. intergenic, CDS, intron, UTRs).
  12. freebayes.ld_prune.vcf.gz, contains a subset of LD-pruned SNPs in VCF format that were used in Admixture and PCA analyses.

bed_files.gene_rich_sequences.tar.gz and bed_files.gene_poor_sequences.tar.gz

The BED format specification is available here: https://github.com/samtools/hts-specs

These files contain the fields:

  1. name of sequence
  2. start position of sequence (0)
  3. end position of sequence (sequence length - 1)
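A minimal Python sketch for reading one of these three-column files is given below. It assumes whitespace-separated columns without a header line, and it follows the description above in treating the end position as sequence length - 1 (so length = end - start + 1). The function name read_chunk_bed and the sequence names in the example are hypothetical.

```python
import io
from typing import Dict, TextIO


def read_chunk_bed(handle: TextIO) -> Dict[str, int]:
    """Return {sequence name: sequence length} for one chunk BED file."""
    lengths: Dict[str, int] = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, start, end = line.split()[:3]
        # Per the field description above, end = sequence length - 1.
        lengths[name] = int(end) - int(start) + 1
    return lengths


# In-memory example with made-up sequence names.
example = io.StringIO("seq_1\t0\t999\nseq_2\t0\t499\n")
lengths = read_chunk_bed(example)
chunk_size = sum(lengths.values())
```

For a real file, open it with open(path) after extracting the archive and pass the handle to the function.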

vcfs.gene_rich_sequences.tar and vcfs.gene_poor_sequences.tar

Contains variants saved in standard VCFv4.2 format. The VCF format specification is available here: https://samtools.github.io/hts-specs/VCFv4.2.pdf

Each VCF file was produced using Freebayes and the output was directly piped to vcflib's vcffilter to keep only variants with QUAL>20 before being saved to disk. At this stage, variants could be SNPs or small haplotypes or indels, and contain more than two alleles.
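The archived files are already filtered, so nothing below needs to be run to use them. As a sketch of what a QUAL>20 filter does, the following Python generator keeps header lines and records whose QUAL column (the sixth VCF column) exceeds a threshold. It is a plain-Python stand-in written for illustration, not the vcffilter command used by the authors; the function name and the example records are hypothetical.

```python
from typing import Iterable, Iterator


def filter_by_qual(lines: Iterable[str], min_qual: float = 20.0) -> Iterator[str]:
    """Yield VCF header lines and records with QUAL strictly above min_qual."""
    for line in lines:
        if line.startswith("#"):
            yield line  # meta-information and column header lines
            continue
        qual = line.split("\t")[5]  # columns: CHROM POS ID REF ALT QUAL ...
        if qual != "." and float(qual) > min_qual:
            yield line


# Made-up records: one passes, one fails, one is multi-allelic and passes.
vcf_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "seq_1\t10\t.\tA\tG\t35.2\t.\t.\n",
    "seq_1\t20\t.\tC\tT\t12.0\t.\t.\n",
    "seq_1\t30\t.\tG\tA,T\t88.0\t.\t.\n",
]
kept = list(filter_by_qual(vcf_lines))
records = [l for l in kept if not l.startswith("#")]
# Records with more than two alleles have a comma in the ALT column.
multiallelic = [l for l in records if "," in l.split("\t")[4]]
```

To stream a gzipped chunk, the lines can come from gzip.open(path, "rt") in the standard library.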

vcfs.gene_rich_sequences.annotated.tar and vcfs.gene_poor_sequences.annotated.tar

Contains variants saved in standard VCFv4.1 format.
Each VCF file was processed by decomposing haplotypes of closely spaced variants into individual SNPs using bcftools and vt, and was filtered for depth and sample missingness/coverage using a custom script. Each VCF file was then imputed and phased with Beagle and annotated with SnpEff using the non-redundant set of protein-coding genes.

diversity.gene_rich_sequences.all.tar.gz and diversity.gene_poor_sequences.all.tar.gz

Contains per-base estimates of variation in TSV format.

The statistics are per-base Watterson's theta (population mutation rate) and Pi (nucleotide diversity), as well as Tajima's D, and were computed across all 74 samples in the dataset ("all"). Estimates are available as window-based averages or sequence-wide averages and were corrected by the number of accessible bases in each region using the genome mask.
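For orientation, the textbook definitions of these three statistics can be sketched in Python as follows. This is not the authors' pipeline: it assumes biallelic SNPs summarised as alternate-allele counts, a fixed number of sampled chromosomes per site (148 for the "all" grouping), and division by the number of accessible bases to obtain per-base values. The function names are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def harmonic_sums(n: int) -> Tuple[float, float]:
    """Return a1 = sum(1/i) and a2 = sum(1/i^2) for i = 1 .. n-1."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    return a1, a2


def diversity_stats(alt_counts: Sequence[int], n_chrom: int,
                    accessible: int) -> Dict[str, float]:
    """Per-base Watterson's theta, per-base Pi and Tajima's D for one region."""
    s = sum(1 for c in alt_counts if 0 < c < n_chrom)  # segregating sites
    a1, a2 = harmonic_sums(n_chrom)
    pairs = n_chrom * (n_chrom - 1)
    # Expected pairwise differences, summed over sites.
    pi_total = sum(2.0 * c * (n_chrom - c) / pairs for c in alt_counts)
    theta_total = s / a1
    # Variance terms from Tajima (1989).
    b1 = (n_chrom + 1) / (3.0 * (n_chrom - 1))
    b2 = 2.0 * (n_chrom ** 2 + n_chrom + 3) / (9.0 * n_chrom * (n_chrom - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n_chrom + 2) / (a1 * n_chrom) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    var = e1 * s + e2 * s * (s - 1)
    d = (pi_total - theta_total) / math.sqrt(var) if var > 0 else float("nan")
    return {"theta": theta_total / accessible,
            "pi": pi_total / accessible,
            "tajimas_d": d}


# Toy region: 5 SNPs among 10 chromosomes and 1,000 accessible bases.
rare = diversity_stats([1] * 5, n_chrom=10, accessible=1000)    # singletons
common = diversity_stats([5] * 5, n_chrom=10, accessible=1000)  # 50% frequency
```

An excess of rare alleles gives a negative Tajima's D and an excess of intermediate-frequency alleles gives a positive one, which is the behaviour of the two toy calls above.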

Each chunk is represented by three files:

  1. *.wattersons_theta_pi_tajimas_D.any.csv = sequence-wide average levels of variation.
  2. *.wattersons_theta_pi_tajimas_D.window_1000.any.csv = 1,000 bp non-overlapping window-wide average levels of variation.
  3. *.wattersons_theta_pi_tajimas_D.window_100000.any.csv = 100,000 bp window-wide average levels of variation.

These files contain the following fields for each window/region:

  1. CHROM = the name of the scaffold or contig
  2. POS = the start position along the sequence
  3. GLOBAL_POS = the start position of the window/region in the file, counting incrementally across all sequences
  4. LENGTH = length of the window (last window on each sequence is typically shorter than the rest).
  5. COVERED = the number of accessible sites in the window/region
  6. COVERED_PROP = the proportion of accessible sites
  7. all_N = this field may optionally contain the sample size, or be left blank
  8. all_VARIABLE = number of variable sites (SNPs)
  9. all_THETA = average Watterson's theta per base
  10. all_PI = average nucleotide diversity (Pi)
  11. all_TD = Tajima's D
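A minimal sketch for summarising one of these files with the Python standard library follows. It assumes the files are tab-delimited with a header row that uses the field names listed above, and that missing estimates are blank or coded as NA or nan; none of this is verified against the archived files. The helper name weighted_mean and the example rows are hypothetical.

```python
import csv
import io
from typing import TextIO


def weighted_mean(handle: TextIO, value_field: str = "all_PI",
                  weight_field: str = "COVERED") -> float:
    """Mean of value_field across windows, weighted by accessible sites."""
    reader = csv.DictReader(handle, delimiter="\t")
    num = 0.0
    den = 0.0
    for row in reader:
        value = row[value_field].strip()
        covered = float(row[weight_field])
        if value.lower() in ("", "na", "nan") or covered == 0:
            continue  # skip windows without an estimate
        num += float(value) * covered
        den += covered
    return num / den if den > 0 else float("nan")


# In-memory example with made-up values for three 1,000 bp windows.
example = io.StringIO(
    "CHROM\tPOS\tLENGTH\tCOVERED\tall_PI\n"
    "seq_1\t1\t1000\t800\t0.01\n"
    "seq_1\t1001\t1000\t200\t0.02\n"
    "seq_1\t2001\t1000\t0\tNA\n"
)
mean_pi = weighted_mean(example)
```

Weighting by COVERED rather than taking a plain average keeps poorly covered windows from dominating the summary.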

diversity_regions.gene_rich_sequences.all.tar.gz and diversity_regions.gene_poor_sequences.all.tar.gz

Contains per-base estimates of variation in TSV format, subdivided by different gene regions.

The statistics and formats are as above, and window-based statistics are provided in 1 kb, 10 kb and 100 kb intervals.

Funding

Climate genomics in the Northern krill: the past, present and future of an important marine species

Swedish Research Council for Environment, Agricultural Sciences and Spatial Planning

Publisher

Uppsala University
