<p>This item holds multiple tar archives with genome-scale comparisons of divergence between Northern krill populations, including estimated allele frequencies and divergence (e.g. <em>F</em><sub>ST</sub>), and extended haplotype signatures (XP-nSL estimates). Many analyses were performed in "chunks" (160 in total across both gene-rich and gene-poor sequences), which are described in a previous item.</p>
<p><strong>Population definitions</strong></p>
<p>Population definitions are the same as described in a different item:</p>
<ol>
<li>"at vs. me" = Atlantic Ocean samples (n=67) vs. the Mediterranean (i.e. Barcelona) samples (n=7).</li>
<li>"we vs. ea" = South-West North Atlantic Ocean (n=20) vs. North-East North Atlantic Ocean (n=47). In files using this contrast, sometimes the label "wa" is used instead of "we" for the South-West North Atlantic Ocean samples.</li>
</ol>
<p><strong>Contents:</strong></p>
<ol>
<li>allele_freqs_fst.gene_rich_sequences.at_vs_me.tar, contains per-SNP estimates of allele frequencies and <em>F</em><sub>ST</sub> between "at" and "me" groups along gene-rich sequences.</li>
<li>allele_freqs_fst.gene_rich_sequences.we_vs_ea.tar, as above but between "we" and "ea" groups.</li>
<li>allele_freqs_fst.gene_poor_sequences.at_vs_me.tar, contains per-SNP estimates of allele frequencies and <em>F</em><sub>ST</sub> between "at" and "me" groups along gene-poor sequences.</li>
<li>allele_freqs_fst.gene_poor_sequences.we_vs_ea.tar, as above but for "we" and "ea" groups.</li>
<li>allele_freqs_fst.merged_sequences.at_vs_me.csv.gz, contains per-SNP estimates of allele frequencies and <em>F</em><sub>ST</sub> between "at" and "me" merged into a single TSV file.</li>
<li>allele_freqs_fst.merged_sequences.we_vs_ea.csv.gz, as above but for "we" and "ea".</li>
<li>allele_freqs_fst.gene_rich_sequences_windows.at_vs_me.tar.gz, contains per-window estimates of <em>F</em><sub>ST</sub> between "at" and "me" groups along gene-rich sequences.</li>
<li>allele_freqs_fst.gene_rich_sequences_windows.we_vs_ea.tar.gz, as above but for "we" and "ea" groups.</li>
<li>allele_freqs_fst.gene_poor_sequences_windows.at_vs_me.tar.gz, contains per-window estimates of <em>F</em><sub>ST</sub> between "at" and "me" groups along gene-poor sequences.</li>
<li>allele_freqs_fst.gene_poor_sequences_windows.we_vs_ea.tar.gz, as above but for "we" and "ea" groups.</li>
<li>selscan_xpnsl.gene_rich_sequences.tar.gz, contains per-SNP cross-population XP-nSL statistics for gene-rich sequences.</li>
<li>selscan_xpnsl.gene_poor_sequences.tar.gz, contains per-SNP cross-population XP-nSL statistics for gene-poor sequences.</li>
<li>selscan_xpnsl.gene_rich_sequences_windows.tar.gz, contains per-window cross-population XP-nSL statistics for gene-rich sequences.</li>
<li>selscan_xpnsl.gene_poor_sequences_windows.tar.gz, as above but for gene-poor sequences.</li>
<li>fst_vs_xpnsl.per_snp.at_vs_me.csv.gz, contains per-SNP <em>F</em><sub>ST</sub>, genomic region and XP-nSL values in a single file for the "at vs. me" contrast.</li>
<li>fst_vs_xpnsl.per_snp.we_vs_ea.csv.gz, contains per-SNP <em>F</em><sub>ST</sub>, genomic region and XP-nSL values in a single file for the "we vs. ea" contrast.</li>
<li>fst_vs_xpnsl_vs_diversity_vs_regions.merged_sequences.at_vs_me.tsv.tar.gz, integrates window-based statistics into a single file for the "at vs. me" contrast.</li>
<li>fst_vs_xpnsl_vs_diversity_vs_regions.merged_sequences.we_vs_ea.tsv.tar.gz, as above but for the "we vs. ea" contrast.</li>
</ol>
<p><strong>allele_freqs_fst.gene_(rich|poor)_sequences.(at_vs_me|we_vs_ea).tar</strong></p>
<p>The TSV files in these archives contain per-SNP estimates of allele frequencies and <em>F</em><sub>ST</sub>, along with SNP annotations. There are nine main fields/columns with overlapping/redundant information to accommodate flexible parsing. Large fields have nested subfields that are separated by "|" (first level) or ":" (second level).</p>
<ol>
<li>name of sequence (e.g. "seq_s_1")</li>
<li>position of SNP (e.g. "448878")</li>
<li>reference allele (e.g. "A")</li>
<li>alternate allele (e.g. "G")</li>
<li>composite field holding the <em>F</em><sub>ST</sub> value together with sample sizes, allele counts and allele frequencies for each population (described below)</li>
<li>type of SNP (e.g. intron, synonymous, missense, intergenic, ...) and label of associated gene (e.g. missense|REF_STRG_1_4_XLOC_012878)</li>
<li><em>F</em><sub>ST</sub> tag and value (e.g. fst|0.0653)</li>
<li>region, type of SNP and gene label (e.g. region|missense|REF_STRG_1_4_XLOC_012878)</li>
<li>gene annotation derived from EnTAP annotations and <em>Drosophila</em> homologs, which are described below. Uses comma-separated sub-fields.</li>
</ol>
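<p>As a rough illustration of how these archives could be read without unpacking them to disk, the following Python sketch streams every file in a tar archive and splits each line into the nine fields listed above. The keys in <code>FIELD_NAMES</code> are hypothetical labels chosen for this example, and the sketch assumes tab-separated, UTF-8 encoded text; whether the files carry a header line is not stated here, so a nine-column header would be yielded like any other row.</p>
<pre><code class="language-python">import io
import tarfile

# Hypothetical labels for the nine fields; the files themselves may not name them.
FIELD_NAMES = (
    "seq", "pos", "ref", "alt", "pop_stats",
    "snp_type", "fst", "region", "annotation",
)


def iter_snp_rows(archive_path):
    """Yield one dict per nine-column line from every regular file in a tar archive."""
    # mode "r:*" lets tarfile detect compression (none, gzip, ...) automatically.
    with tarfile.open(archive_path, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            handle = tar.extractfile(member)
            for raw in io.TextIOWrapper(handle, encoding="utf-8"):
                line = raw.rstrip("\r\n")
                if not line or line.startswith("#"):
                    continue
                columns = line.split("\t")
                if len(columns) != len(FIELD_NAMES):
                    continue  # skipped in this sketch: not a nine-column line
                yield dict(zip(FIELD_NAMES, columns))
</code></pre>
<p>A call such as <code>iter_snp_rows("allele_freqs_fst.gene_rich_sequences.at_vs_me.tar")</code> would then yield one dictionary per SNP, with all values left as strings.</p>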
<p><em><strong>Subfields in field 5:</strong></em></p>
<p>Example:</p>
<p>at/me:0.0653:148:1.0000:1.0000:1.0000|at,134,133.0000,1.0000,0.9925,0.0075|me,14,13.0000,1.0000,0.9286,0.0714</p>
<p>This field splits into three major subfields on "|": one about the pairwise comparison and two with metadata about each population.</p>
<p><em>1st subfield</em> (at/me:0.0653:148:1.0000:1.0000:1.0000)</p>
<ol>
<li>name of contrast (at/me)</li>
<li><em>F</em><sub>ST</sub> of SNP (0.0653)</li>
<li>Sample size (148), which in this example equals the sum of the two per-population sample sizes (134 + 14)</li>
<li>Proportion of observed data given overall sample size (1.0000), &lt;1 if there are missing genotypes.</li>
<li>Proportion of observed data given sample size of population 1 (1.0000)</li>
<li>As above but for population 2 (1.0000)</li>
</ol>
<p><em>2nd and 3rd subfields</em> (at,134,133.0000,1.0000,0.9925,0.0075 and me,14,13.0000,1.0000,0.9286,0.0714)</p>
<ol>
<li>name of population</li>
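<li>sample size, counted in allele copies (e.g. 134 for the 67 Atlantic samples and 14 for the 7 Mediterranean samples)</li>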
<li>number of observed reference alleles</li>
<li>number of observed alternate alleles</li>
<li>frequency of reference allele</li>
<li>frequency of alternate allele</li>
</ol>
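<p>The nesting described above can be unpacked with plain string splitting. The following Python sketch parses the example value of field 5; the function name and dictionary keys are hypothetical, and the sketch assumes that every numeric subfield can be converted with <code>float()</code> or <code>int()</code> (how missing values are encoded is not documented here).</p>
<pre><code class="language-python">def parse_field5(value):
    """Split field 5 into a contrast summary and per-population records."""
    # First level: "|" separates the contrast summary from the population records.
    contrast_part, *population_parts = value.split("|")

    # Second level of the contrast summary: ":"-separated values.
    name, fst, n, prop_all, prop_pop1, prop_pop2 = contrast_part.split(":")
    contrast = {
        "contrast": name,
        "fst": float(fst),
        "sample_size": int(n),
        "observed_prop": float(prop_all),
        "observed_prop_pop1": float(prop_pop1),
        "observed_prop_pop2": float(prop_pop2),
    }

    # Population records use "," as their separator.
    populations = {}
    for part in population_parts:
        pop, size, n_ref, n_alt, f_ref, f_alt = part.split(",")
        populations[pop] = {
            "sample_size": int(size),
            "ref_count": float(n_ref),
            "alt_count": float(n_alt),
            "ref_freq": float(f_ref),
            "alt_freq": float(f_alt),
        }
    return contrast, populations


example = (
    "at/me:0.0653:148:1.0000:1.0000:1.0000"
    "|at,134,133.0000,1.0000,0.9925,0.0075"
    "|me,14,13.0000,1.0000,0.9286,0.0714"
)
contrast, populations = parse_field5(example)
print(contrast["fst"], populations["me"]["alt_freq"])  # prints: 0.0653 0.0714
</code></pre>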
<p><em><strong>Subfields in field 9</strong></em>:</p>
<p>Example: annotation|entap,XP_037775362.1 uncharacterized protein LOC119572362 [Penaeus monodon]|blast,FBgn0002526,FBtr0077014,CG10236,LanA,Laminin</p>
<ol>
<li>annotation tag</li>
<li>entap annotation (comma separated sub-fields)</li>
<li>blast annotation (comma separated sub-fields)</li>
</ol>
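<p>Field 9 can be split in a similar way. The sketch below is one possible approach, not a documented parser: it keeps the EnTAP part as a single string, on the assumption that free-text descriptions may themselves contain commas, and splits the blast part on every comma. The function name is hypothetical.</p>
<pre><code class="language-python">def parse_field9(value):
    """Split field 9 into its tag, EnTAP description and blast sub-fields."""
    tag, *parts = value.split("|")
    result = {"tag": tag}
    for part in parts:
        # The text before the first comma names the source ("entap" or "blast").
        source, _, payload = part.partition(",")
        if source == "blast":
            result[source] = payload.split(",")
        else:
            result[source] = payload
    return result


example = (
    "annotation|entap,XP_037775362.1 uncharacterized protein "
    "LOC119572362 [Penaeus monodon]"
    "|blast,FBgn0002526,FBtr0077014,CG10236,LanA,Laminin"
)
parsed = parse_field9(example)
print(parsed["blast"])
</code></pre>
<p>Splitting the blast part on every comma would break a gene name that contains a comma; if such names occur, a limited split may be safer.</p>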
<p>These datasets are provided for each chunk and in a single merged TSV file for each contrast.</p>
<p><strong>allele_freqs_fst.gene_(rich|poor)_sequences_windows.(at_vs_me|we_vs_ea).tar.gz</strong></p>
<p>The TSV files in these archives contain <em>F</em><sub>ST</sub> estimates across 100 bp or 1,000 bp non-overlapping windows. Each TSV file has four fields:</p>
<ol>
<li>CHROM = name of sequence</li>
<li>POS = window start position</li>
<li>N_(contrast) = number of SNPs in the window</li>
<li>FST_(contrast) = average Reynolds' <em>F</em><sub>ST</sub> of the window.</li>
</ol>
<p><strong>selscan_xpnsl.gene_rich_sequences.tar.gz</strong> and <strong>selscan_xpnsl.gene_poor_sequences.tar.gz</strong></p>
<p>The TSV files in these archives contain raw and normalized per-SNP cross-population XP-nSL output from selscan, which was used to test for selective sweeps. The format and meaning of the fields are documented with the original tool selscan: <a href="https://github.com/szpiech/selscan" target="_blank">https://github.com/szpiech/selscan</a></p>
<p><strong>selscan_xpnsl.gene_rich_sequences_windows.tar.gz</strong> and <strong>selscan_xpnsl.gene_poor_sequences_windows.tar.gz</strong></p>
<p>The TSV files in these archives contain per-window average XP-nSL computed from the normalized SNP-estimates at 1,000 or 10,000 bp resolution. The TSV files have the following headers:</p>
<ol>
<li>CHROM = name of sequence</li>
<li>START = start of window</li>
<li>STOP = stop of window</li>
<li>N = number of SNPs with XP-nSL estimates</li>
<li>N_CRIT = number of SNPs with critical XP-nSL estimates (&gt;=2 or &lt;=-2)</li>
<li>PROP_CRIT = proportion of critical SNPs</li>
<li>MIN = minimal XP-nSL value in window</li>
<li>MAX = maximal XP-nSL value in window</li>
<li>MEAN = mean XP-nSL value in window</li>
</ol>
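<p>To make the relationship between these columns concrete, the following Python sketch recomputes the summary for one window from a list of normalized per-SNP XP-nSL scores. It is an illustration under stated assumptions, not the code used to produce the files: it assumes PROP_CRIT is N_CRIT divided by N, and the input scores are made up.</p>
<pre><code class="language-python">from statistics import mean


def summarize_window(scores, threshold=2.0):
    """Summarize the normalized XP-nSL scores of the SNPs in one window."""
    n = len(scores)
    if n == 0:
        return None  # no SNPs with XP-nSL estimates in this window
    # A SNP counts as critical when its score is at or beyond +/- threshold.
    n_crit = sum(1 for score in scores if abs(score) >= threshold)
    return {
        "N": n,
        "N_CRIT": n_crit,
        "PROP_CRIT": n_crit / n,  # assumption: proportion relative to N
        "MIN": min(scores),
        "MAX": max(scores),
        "MEAN": mean(scores),
    }


# Made-up scores for illustration only; not taken from the dataset.
summary = summarize_window([-2.5, -0.5, 0.5, 2.0])
print(summary)
</code></pre>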
<p><strong>fst_vs_xpnsl.per_snp.at_vs_me.csv.gz</strong> and <strong>fst_vs_xpnsl.per_snp.we_vs_ea.csv.gz</strong></p>
<p>Per-SNP <em>F</em><sub>ST</sub> and XP-nSL data that have been merged into a single TSV file. Fields:</p>
<ol>
<li>name of sequence</li>
<li>position of SNP</li>
<li><em>F</em><sub>ST</sub> of SNP</li>
<li>gene region</li>
<li>XP-nSL</li>
</ol>
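<p>A minimal Python sketch for reading one of these files is given below. It assumes the content is tab-separated (as described above, despite the <code>.csv.gz</code> extension) with the five fields in the listed order. Lines whose numeric fields cannot be converted, such as a possible header or a missing-value placeholder, are skipped in this sketch, which may not be appropriate for every analysis.</p>
<pre><code class="language-python">import csv
import gzip


def iter_fst_xpnsl(path):
    """Yield (sequence, position, fst, region, xpnsl) tuples from a merged per-SNP file."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) != 5:
                continue
            seq, pos, fst, region, xpnsl = row
            try:
                yield seq, int(pos), float(fst), region, float(xpnsl)
            except ValueError:
                # Header line or non-numeric placeholder: skipped in this sketch.
                continue
</code></pre>
<p>For example, <code>iter_fst_xpnsl("fst_vs_xpnsl.per_snp.at_vs_me.csv.gz")</code> can be consumed lazily in a loop, so the whole file never has to be held in memory.</p>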
<p><strong>fst_vs_xpnsl_vs_diversity_vs_regions.merged_sequences.(at_vs_me|we_vs_ea).tsv.tar.gz</strong></p>
<p>Merged TSV files that integrate window-based <em>F</em><sub>ST</sub>, XP-nSL, genetic variation and genomic region data at 1,000 bp resolution. Fields in the TSV files are:</p>
<ol>
<li>CHROM = name of sequence</li>
<li>START = start of window</li>
<li>N_at_vs_me = number of SNPs</li>
<li>FST_at_vs_me = average <em>F</em><sub>ST</sub></li>
<li>MEAN = mean XP-nSL</li>
<li>LENGTH = length of window</li>
<li>COVERED = accessible bases</li>
<li>COVERED_PROP = proportion of accessible bases</li>
<li>all_THETA = Watterson's theta all data</li>
<li>all_PI = Pi all data</li>
<li>all_TD = Tajima's D all data</li>
<li>pop1_VARIABLE = polymorphic sites population 1</li>
<li>pop1_THETA = Watterson's theta population 1</li>
<li>pop1_PI = Pi population 1</li>
<li>pop1_TD = Tajima's D population 1</li>
<li>pop2_VARIABLE = polymorphic sites population 2</li>
<li>pop2_THETA = Watterson's theta population 2</li>
<li>pop2_PI = Pi population 2</li>
<li>pop2_TD = Tajima's D population 2</li>
<li>intergenic_COVERED = accessible sites of this region</li>
<li>intergenic_all_THETA = theta for this region across all data</li>
<li>five_prime_utr_COVERED</li>
<li>five_prime_utr_all_THETA</li>
<li>cds_COVERED</li>
<li>cds_all_THETA</li>
<li>intron_COVERED</li>
<li>intron_all_THETA</li>
<li>three_prime_utr_COVERED</li>
<li>three_prime_utr_all_THETA</li>
</ol>
<p><strong>Funding</strong></p>
<p>Climate genomics in the Northern krill: the past, present and future of an important marine species</p>
<p>Swedish Research Council for Environment Agricultural Sciences and Spatial Planning</p>