SciLifeLab

7. Ecological genomics of the Northern krill: Genome-scale comparisons of adaptive divergence

dataset
posted on 2024-03-28, 00:10 authored by Andreas Wallberg, Per Unneberg

This item holds multiple tar archives with genome-scale comparisons of divergence between Northern krill populations, including estimated allele frequencies and divergence (e.g. FST), and extended haplotype signatures (XP-nSL estimates). Many analyses were performed in "chunks" (160 in total across both gene-rich and gene-poor sequences), which are described in a previous item.

Population definitions

Population definitions are the same as described in a different item:

  1. "at vs. me" = Atlantic Ocean samples (n=67) vs. the Mediterranean (i.e. Barcelona) samples (n=7).
  2. "we vs. ea" = South-West North Atlantic Ocean (n=20) vs. North-East North Atlantic Ocean (n=47). In files using this contrast, sometimes the label "wa" is used instead of "we" for the South-West North Atlantic Ocean samples.
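Because the "wa" and "we" labels refer to the same group, it can help to normalise them before combining files. The following is a minimal sketch; the helper names and the `_vs_` separator handling are illustrative assumptions, not part of the dataset.

```python
# Hypothetical helpers: map the alternative label "wa" onto "we".
LABEL_ALIASES = {"wa": "we"}


def canonical_population(label: str) -> str:
    """Return the canonical population label ("wa" becomes "we")."""
    return LABEL_ALIASES.get(label, label)


def canonical_contrast(name: str, sep: str = "_vs_") -> str:
    """Normalise both sides of a contrast name such as "wa_vs_ea"."""
    return sep.join(canonical_population(part) for part in name.split(sep))


print(canonical_contrast("wa_vs_ea"))
print(canonical_contrast("wa/ea", sep="/"))
```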

Contents:

  1. allele_freqs_fst.gene_rich_sequences.at_vs_me.tar, contains per-SNP estimates of allele frequencies and FST between "at" and "me" groups along gene-rich sequences.
  2. allele_freqs_fst.gene_rich_sequences.we_vs_ea.tar, as above but between "we" and "ea" groups.
  3. allele_freqs_fst.gene_poor_sequences.at_vs_me.tar, contains per-SNP estimates of allele frequencies and FST between "at" and "me" groups along gene-poor sequences.
  4. allele_freqs_fst.gene_poor_sequences.we_vs_ea.tar, as above but for "we" and "ea" groups.
  5. allele_freqs_fst.merged_sequences.at_vs_me.csv.gz, contains per-SNP estimates of allele frequencies and FST between "at" and "me" merged into a single TSV file.
  6. allele_freqs_fst.merged_sequences.we_vs_ea.csv.gz, as above but for "we" and "ea".
  7. allele_freqs_fst.gene_rich_sequences_windows.at_vs_me.tar.gz, contains per-window estimates of FST between "at" and "me" groups along gene-rich sequences.
  8. allele_freqs_fst.gene_rich_sequences_windows.we_vs_ea.tar.gz, as above but for "we" and "ea" groups.
  9. allele_freqs_fst.gene_poor_sequences_windows.at_vs_me.tar.gz, contains per-window estimates of FST between "at" and "me" groups along gene-poor sequences.
  10. allele_freqs_fst.gene_poor_sequences_windows.we_vs_ea.tar.gz, as above but for "we" and "ea" groups.
  11. selscan_xpnsl.gene_rich_sequences.tar.gz, contains per-SNP cross-population XP-nSL statistics for gene-rich sequences.
  12. selscan_xpnsl.gene_poor_sequences.tar.gz, contains per-SNP cross-population XP-nSL statistics for gene-poor sequences.
  13. selscan_xpnsl.gene_rich_sequences_windows.tar.gz, contains per-window cross-population XP-nSL statistics for gene-rich sequences.
  14. selscan_xpnsl.gene_poor_sequences_windows.tar.gz, as above but for gene-poor sequences.
  15. fst_vs_xpnsl.per_snp.at_vs_me.csv.gz, contains per-SNP FST, genomic region and XP-nSL values in a single file for the "at vs. me" contrast.
  16. fst_vs_xpnsl.per_snp.we_vs_ea.csv.gz, contains per-SNP FST, genomic region and XP-nSL values in a single file for the "we vs. ea" contrast.
  17. fst_vs_xpnsl_vs_diversity_vs_regions.merged_sequences.at_vs_me.tsv.tar.gz, integrates window-based statistics into a single file for the "at vs. me" contrast.
  18. fst_vs_xpnsl_vs_diversity_vs_regions.merged_sequences.we_vs_ea.tsv.tar.gz, as above but for the "we vs. ea" contrast.

allele_freqs_fst.gene_(rich|poor)_sequences.(at_vs_me|we_vs_ea).tar

The TSV files in these archives contain per-SNP estimates of allele frequencies and FST, along with SNP annotations. There are nine main fields/columns with overlapping/redundant information to accommodate flexible parsing. Large fields have nested subfields that are separated by "|" (first level) and by ":" or "," (second level), as shown in the examples below.

  1. name of sequence (e.g. "seq_s_1")
  2. position of SNP (e.g. "448878")
  3. reference allele (e.g. "A")
  4. alternate allele (e.g. "G")
  5. main data column holding the FST value, allele counts and allele frequencies for each population (described below)
  6. type of SNP (e.g. intron, synonymous, missense, intergenic, ...) and label of associated gene (e.g. missense|REF_STRG_1_4_XLOC_012878)
  7. FST tag and value (e.g. fst|0.0653)
  8. region, type of SNP and gene label (e.g. region|missense|REF_STRG_1_4_XLOC_012878)
  9. gene annotation derived from EnTAP annotations and Drosophila homologs, which are described below. Uses comma-separated sub-fields.

Subfields in field 5:

Example:

at/me:0.0653:148:1.0000:1.0000:1.0000|at,134,133.0000,1.0000,0.9925,0.0075|me,14,13.0000,1.0000,0.9286,0.0714

This field splits into three major subfields on "|": one about the pairwise comparison and two with metadata about each population.

1st subfield (at/me:0.0653:148:1.0000:1.0000:1.0000)

  1. name of contrast (at/me)
  2. FST of SNP (0.0653)
  3. Total sample size (148), counted in chromosomes, i.e. the sum over both populations (134 + 14)
  4. Proportion of observed data given overall sample size (1.0000), <1 if there are missing genotypes.
  5. Proportion of observed data given sample size of population 1 (1.0000)
  6. As above but for population 2 (1.0000)

2nd and 3rd subfields (at,134,133.0000,1.0000,0.9925,0.0075 and me,14,13.0000,1.0000,0.9286,0.0714)

  1. name of population
  2. sample size, counted in chromosomes (e.g. 134 for the 67 "at" samples)
  3. number of observed reference alleles
  4. number of observed alternate alleles
  5. frequency of reference allele
  6. frequency of alternate allele
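The layout of field 5 can be unpacked as in the following minimal sketch, which uses the example value shown above. The function, class and key names are illustrative assumptions; only the separators and subfield order come from the description.

```python
from dataclasses import dataclass


@dataclass
class PopulationCounts:
    name: str
    sample_size: int
    ref_count: float
    alt_count: float
    ref_freq: float
    alt_freq: float


def parse_field5(field: str) -> dict:
    """Split field 5 into the contrast summary and per-population records."""
    contrast_part, *population_parts = field.split("|")

    # First subfield: colon-separated summary of the pairwise comparison.
    name, fst, size, prop_all, prop_pop1, prop_pop2 = contrast_part.split(":")
    result = {
        "contrast": name,
        "fst": float(fst),
        "sample_size": int(size),
        "observed_prop": float(prop_all),
        "observed_prop_pop1": float(prop_pop1),
        "observed_prop_pop2": float(prop_pop2),
        "populations": {},
    }

    # Remaining subfields: comma-separated counts and frequencies per population.
    for part in population_parts:
        pop, n, ref_n, alt_n, ref_f, alt_f = part.split(",")
        result["populations"][pop] = PopulationCounts(
            pop, int(n), float(ref_n), float(alt_n), float(ref_f), float(alt_f)
        )
    return result


example = (
    "at/me:0.0653:148:1.0000:1.0000:1.0000"
    "|at,134,133.0000,1.0000,0.9925,0.0075"
    "|me,14,13.0000,1.0000,0.9286,0.0714"
)
parsed = parse_field5(example)
print(parsed["contrast"], parsed["fst"], parsed["populations"]["me"].alt_freq)
```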

Subfields in field 9:

Example: annotation|entap,XP_037775362.1 uncharacterized protein LOC119572362 [Penaeus monodon]|blast,FBgn0002526,FBtr0077014,CG10236,LanA,Laminin

  1. annotation tag
  2. entap annotation (comma separated sub-fields)
  3. blast annotation (comma separated sub-fields)
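Field 9 can be split in a similar way. This is a sketch under the assumption that the EnTAP description is free text that may itself contain commas, so only the leading source tag is split off, while the blast part is treated as a plain comma-separated list. The function name is hypothetical.

```python
def parse_field9(field: str) -> dict:
    """Split field 9 into its tag, EnTAP description and blast identifiers."""
    tag, *parts = field.split("|")
    annotation = {"tag": tag}
    for part in parts:
        # Each part starts with its source name ("entap" or "blast").
        source, _, rest = part.partition(",")
        if source == "blast":
            annotation[source] = rest.split(",")
        else:
            # Assumption: keep the description whole, it may contain commas.
            annotation[source] = rest
    return annotation


example = (
    "annotation|entap,XP_037775362.1 uncharacterized protein LOC119572362 "
    "[Penaeus monodon]|blast,FBgn0002526,FBtr0077014,CG10236,LanA,Laminin"
)
parsed_annotation = parse_field9(example)
print(parsed_annotation["blast"])
```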

These datasets are provided for each chunk and in a single merged TSV file for each contrast.
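As a rough illustration of reading one record, the sketch below splits a line into the nine tab-separated columns and extracts the simpler tagged fields (6 to 8). The sample line is a composite assembled from the examples in this description, not a verbatim line from the archives, and the function and key names are assumptions. Whether the files carry a header row is not stated here, so check before parsing.

```python
def parse_snp_line(line: str) -> dict:
    """Parse one tab-separated per-SNP record into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    # Assumption: all nine columns are present on every line.
    seq, pos, ref, alt, per_pop, snp_type, fst, region, annotation = cols[:9]

    # Field 6: "type|gene"; intergenic SNPs may have no gene label.
    effect, _, gene = snp_type.partition("|")
    return {
        "sequence": seq,
        "position": int(pos),
        "ref": ref,
        "alt": alt,
        "per_population": per_pop,          # field 5, left unparsed here
        "effect": effect,
        "gene": gene or None,
        "fst": float(fst.split("|")[1]),    # field 7: "fst|0.0653"
        "region": region.split("|")[1:],    # field 8 without its tag
        "annotation": annotation,           # field 9, left unparsed here
    }


# Composite example line built from the per-field examples in this description.
sample_line = "\t".join([
    "seq_s_1", "448878", "A", "G",
    "at/me:0.0653:148:1.0000:1.0000:1.0000"
    "|at,134,133.0000,1.0000,0.9925,0.0075|me,14,13.0000,1.0000,0.9286,0.0714",
    "missense|REF_STRG_1_4_XLOC_012878",
    "fst|0.0653",
    "region|missense|REF_STRG_1_4_XLOC_012878",
    "annotation|entap,XP_037775362.1 uncharacterized protein LOC119572362 "
    "[Penaeus monodon]|blast,FBgn0002526,FBtr0077014,CG10236,LanA,Laminin",
])
record = parse_snp_line(sample_line)
print(record["sequence"], record["position"], record["effect"], record["fst"])
```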

allele_freqs_fst.gene_(rich|poor)_sequences_windows.(at_vs_me|we_vs_ea).tar.gz

The TSV files in these archives contain FST estimates across 100 bp or 1,000 bp non-overlapping windows. Each TSV file has four fields:

  1. CHROM = name of sequence
  2. POS = window start position
  3. N_(contrast) = number of SNPs in the window
  4. FST_(contrast) = average Reynolds' FST of the window.

selscan_xpnsl.gene_rich_sequences.tar.gz and selscan_xpnsl.gene_poor_sequences.tar.gz

The TSV files in these archives contain raw and normalized per-SNP cross-population XP-nSL output from selscan, which was used to test for selective sweeps. The format and meaning of the fields are documented with the original tool selscan: https://github.com/szpiech/selscan

selscan_xpnsl.gene_rich_sequences_windows.tar.gz and selscan_xpnsl.gene_poor_sequences_windows.tar.gz

The TSV files in these archives contain per-window average XP-nSL computed from the normalized SNP-estimates at 1,000 or 10,000 bp resolution. The TSV files have the following headers:

  1. CHROM = name of sequence
  2. START = start of window
  3. STOP = stop of window
  4. N = number of SNPs with XP-nSL estimates
  5. N_CRIT = number of SNPs with critical XP-nSL estimates (>=2 or <=-2)
  6. PROP_CRIT = proportion of critical SNPs
  7. MIN = minimal XP-nSL value in window
  8. MAX = maximal XP-nSL value in window
  9. MEAN = mean XP-nSL value in window
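The per-window columns can be illustrated with a small sketch that summarises normalized per-SNP XP-nSL values. It is not the code used to produce the archives: the 1-based window coordinates, the function names and the input values are assumptions made for illustration; only the column names and the critical threshold (>=2 or <=-2) come from the list above.

```python
from statistics import mean


def summarise_scores(scores, threshold=2.0):
    """Compute N, N_CRIT, PROP_CRIT, MIN, MAX and MEAN for one window."""
    n_crit = sum(1 for s in scores if s >= threshold or s <= -threshold)
    return {
        "N": len(scores),
        "N_CRIT": n_crit,
        "PROP_CRIT": n_crit / len(scores),
        "MIN": min(scores),
        "MAX": max(scores),
        "MEAN": mean(scores),
    }


def window_summaries(snps, window_size=1000):
    """Group (position, normalized XP-nSL) pairs into non-overlapping windows.

    Assumption: positions are 1-based, so the first window spans 1..window_size.
    """
    windows = {}
    for pos, score in snps:
        start = (pos - 1) // window_size * window_size + 1
        windows.setdefault(start, []).append(score)
    return [
        {"START": start, "STOP": start + window_size - 1, **summarise_scores(scores)}
        for start, scores in sorted(windows.items())
    ]


# Made-up values, for illustration only.
snps = [(10, 2.5), (500, -0.5), (999, -2.0), (1500, 1.0)]
for row in window_summaries(snps):
    print(row)
```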

fst_vs_xpnsl.per_snp.at_vs_me.csv.gz and fst_vs_xpnsl.per_snp.we_vs_ea.csv.gz

Per-SNP FST and XP-nSL data that have been merged into a single TSV file. Fields:

  1. name of sequence
  2. position of SNP
  3. FST of SNP
  4. gene region
  5. XP-nSL
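One possible use of the merged per-SNP files is to list SNPs that are extreme for both statistics. The sketch below assumes tab-separated columns in the order given above and skips lines whose FST or XP-nSL value is not numeric (for example a header or a missing value). The FST cutoff is an arbitrary placeholder, and the sample rows are made up.

```python
import csv
import io


def candidate_snps(handle, fst_min=0.5, xpnsl_abs_min=2.0):
    """Yield SNPs with FST >= fst_min and |XP-nSL| >= xpnsl_abs_min."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 5:
            continue
        seq, pos, fst, region, xpnsl = row[:5]
        try:
            fst_value, xpnsl_value = float(fst), float(xpnsl)
        except ValueError:
            continue  # header line or non-numeric placeholder
        if fst_value >= fst_min and abs(xpnsl_value) >= xpnsl_abs_min:
            yield seq, int(pos), fst_value, region, xpnsl_value


# Made-up rows, for illustration only.
sample = (
    "seq_s_1\t100\t0.80\tintron\t2.50\n"
    "seq_s_1\t200\t0.10\tintergenic\t3.00\n"
    "seq_s_1\t300\t0.90\tmissense\tNA\n"
)
hits = list(candidate_snps(io.StringIO(sample)))
print(hits)
```

For the real files, the handle could be opened with `gzip.open(path, "rt", newline="")` from the standard library, where `path` points to one of the two `.csv.gz` files.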

fst_vs_xpnsl_vs_diversity_vs_regions.merged_sequences.(at_vs_me|we_vs_ea).tsv.tar.gz

Merged TSV files that integrate window-based FST, XP-nSL, variation and genomic region data at 1,000 bp resolution. Fields in the TSV files are:

  1. CHROM = name of sequence
  2. START = start of window
  3. N_at_vs_me = number of SNPs
  4. FST_at_vs_me = average FST
  5. MEAN = mean XP-nSL.
  6. LENGTH = length of window
  7. COVERED = accessible bases
  8. COVERED_PROP = proportion of accessible bases
  9. all_THETA = Watterson's theta all data
  10. all_PI = Pi all data
  11. all_TD = Tajima's D all data
  12. pop1_VARIABLE = polymorphic sites population 1
  13. pop1_THETA = Watterson's theta population 1
  14. pop1_PI = Pi population 1
  15. pop1_TD = Tajima's D population 1
  16. pop2_VARIABLE = polymorphic sites population 2
  17. pop2_THETA = Watterson's theta population 2
  18. pop2_PI = Pi population 2
  19. pop2_TD = Tajima's D population 2
  20. intergenic_COVERED = accessible sites of this region
  21. intergenic_all_THETA = theta for this region across all data
  22. five_prime_utr_COVERED
  23. five_prime_utr_all_THETA
  24. cds_COVERED
  25. cds_all_THETA
  26. intron_COVERED
  27. intron_all_THETA
  28. three_prime_utr_COVERED
  29. three_prime_utr_all_THETA
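When working with the merged window tables, poorly covered windows can be excluded before comparing statistics. The sketch below assumes the files start with a header row using the column names listed above (shown here for the "at vs. me" contrast); the cutoff, the function name and the sample rows are illustrative assumptions, and only a subset of the columns is included in the sample.

```python
import csv
import io


def well_covered_windows(handle, min_covered_prop=0.5):
    """Yield (CHROM, START, FST, mean XP-nSL) for windows with enough accessible bases."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Assumption: COVERED_PROP is a plain number between 0 and 1.
        if float(row["COVERED_PROP"]) < min_covered_prop:
            continue
        yield (
            row["CHROM"],
            int(row["START"]),
            float(row["FST_at_vs_me"]),
            float(row["MEAN"]),
        )


# Made-up rows with only the first eight columns, for illustration only.
sample = (
    "CHROM\tSTART\tN_at_vs_me\tFST_at_vs_me\tMEAN\tLENGTH\tCOVERED\tCOVERED_PROP\n"
    "seq_s_1\t1\t4\t0.12\t-0.35\t1000\t900\t0.9\n"
    "seq_s_1\t1001\t1\t0.40\t1.10\t1000\t200\t0.2\n"
)
kept = list(well_covered_windows(io.StringIO(sample)))
print(kept)
```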

Funding

Climate genomics in the Northern krill: the past, present and future of an important marine species

Swedish Research Council for Environment Agricultural Sciences and Spatial Planning

Publisher

Uppsala University
