SciLifeLab

1. Ecological genomics of the Northern krill: Genome assembly DNA sequences

dataset
posted on 2024-03-28, 00:10 authored by Andreas Wallberg, Per Unneberg
This item holds genome assembly reference sequences, i.e. the main output from the genome assembly.

Contents:

  1. northern_krill.genome_assembly.tar.gz, the major gzipped tar archive that contains seven DNA sequence files in FASTA format. These files represent the finished genome assembly of the Northern krill, produced using the "K20" reference specimen.
  2. non_reference_preliminary_mitochondrial_sequence.tar.gz, a minor archive with resources used to assemble a preliminary mitochondrial sequence from a non-reference specimen ("K4").
  3. README.genecovr_instructions.txt, a text file with instructions for using genecovr and GMAP to evaluate the quality of the genome assembly with RNA transcript sequences.

northern_krill.genome_assembly.tar.gz

This archive contains the final genome assembly sequences. The most important file is "1.m_norvegica.main_w_mito.fasta", which is the main nuclear genome assembly plus the mitochondrial sequence. This is the primary genome assembly resource and was annotated for genes, repeats and DNA methylation. Genome-scale patterns of genetic variation among individuals and populations were measured using this resource as the reference.

Archived contents:

  1. 1.m_norvegica.main_w_mito.fasta, main genome assembly including mitochondrial chromosome, 216568 sequences, 19.7 Gb.
  2. 2.m_norvegica.short.fasta, very short assembly fragments below 200 bp, 228 sequences, 27 kb.
  3. 3.m_norvegica.artefacts.fasta, sequences flagged as artefacts by Purge_Haplotigs due to very low or high mapping depths, 49868 sequences, 556 Mb.
  4. 4.m_norvegica.haplotigs.fasta, sequences flagged as putative haplotigs by Purge_Haplotigs due to depth similarity to other sequences, 168305 sequences, 1.95 Gb.
  5. 5.m_norvegica.mitochondrion.fasta, mitochondrial sequence, 1 sequence, 17944 bp.
  6. 6.m_norvegica.mitochondrial_artefacts.fasta, mitochondrial-like sequences and potential assembly artefacts, 8 sequences, 81.6 kb.
  7. 7.m_norvegica.bacterial.fasta, putative sequences from bacterial contaminants, 113 sequences, 8.7 Mb.

Sequence naming follows this convention:

  • seq_s_X: "s" indicates this sequence is a scaffold of contigs and contains gaps ("N")
  • seq_c_X: "c" indicates this sequence is a contig and contains no gaps
  • seq_a_X: indicates this is an "artifact" (see above)
  • seq_h_X: indicates this is a "haplotig" (see above)
  • seq_m_X: indicates this is the mitochondrial sequence (see above)
  • seq_r_X: indicates this is a "mitochondrial artifact" sequence (see above)
  • seq_b_X: indicates this is a "bacterial" sequence (see above)

Each FASTA file comes with two accessory files, both tab-separated spreadsheets:

FASTA.lengths.csv: contains information about the order and lengths of sequences.

  • File = name of file
  • Sequence = name of sequence
  • Length = length of sequence
  • Length no N = length of sequence, not counting N
  • Start = counting incrementally across the whole file, this is the start position of the sequence
  • End = counting incrementally across the whole file, this is the stop position of the sequence
  • Total = counting incrementally across the whole file, the total amount of sequence seen at this stage

FASTA.Nx_stats_1.csv: contains information about the length distribution of sequences, for example the N50. The sequences were sorted by length in order to produce these statistics.

  • x = the N-level, from 1 to 100.
  • Nx = the N-level, from 1 to 100 and written as N1 to N100.
  • LENGTH[Nx] = the length of the shortest sequence at this level
  • n[Lx] = the number of sequences at this level
  • SUM = counting incrementally across N-levels, the sum of the sequence lengths at this level
  • BIN_SUM = the sum of sequence lengths for this particular level/bin

Four additional statistics are printed as keys and values on the first line:

  • TOT = total length of sequences in the file
  • SEQS = number of sequences in the file
  • MAX = the length of the longest sequence
  • MEAN = the mean length of the sequences
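The naming convention and the columns of the two accessory files can be illustrated with a minimal Python sketch. It is not the authors' script: the function names (sequence_type, length_rows, nx_levels) are hypothetical, Start and End are assumed to be 1-based inclusive positions, and the N-levels follow the conventional Nx/Lx definition (sequences sorted from longest to shortest, level x reached once the running sum covers x% of the total), which may differ in detail from how the published tables were produced.

```python
from itertools import accumulate

# Letter codes taken from the naming convention seq_<code>_X.
SEQUENCE_TYPES = {
    "s": "scaffold",
    "c": "contig",
    "a": "artefact",
    "h": "haplotig",
    "m": "mitochondrion",
    "r": "mitochondrial artefact",
    "b": "bacterial",
}


def sequence_type(name):
    """Map a name such as 'seq_s_12' to its sequence type."""
    parts = name.split("_", 2)
    if len(parts) == 3 and parts[0] == "seq":
        return SEQUENCE_TYPES.get(parts[1], "unknown")
    return "unknown"


def length_rows(records):
    """Yield (name, length, length_no_n, start, end, total) per sequence.

    Assumption: Start and End are 1-based, inclusive positions counted
    across the whole file; Total is the running sum of lengths.
    """
    total = 0
    for name, seq in records:
        length = len(seq)
        length_no_n = length - seq.upper().count("N")
        start = total + 1
        total += length
        yield name, length, length_no_n, start, total, total


def nx_levels(lengths):
    """Return {x: (length_nx, n_lx, cum_sum, bin_sum)} for x = 1..100.

    Assumption: conventional Nx/Lx definition on lengths sorted from
    longest to shortest.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        return {}
    running = list(accumulate(ordered))
    grand_total = running[-1]
    levels = {}
    idx = 0
    previous_sum = 0
    for x in range(1, 101):
        # Advance until the running sum covers at least x% of the total.
        # Integer arithmetic avoids floating-point rounding at the boundary.
        while running[idx] * 100 < grand_total * x:
            idx += 1
        cum_sum = running[idx]
        levels[x] = (ordered[idx], idx + 1, cum_sum, cum_sum - previous_sum)
        previous_sum = cum_sum
    return levels


if __name__ == "__main__":
    # Toy records, not taken from the dataset.
    toy = [("seq_s_1", "ACGTNNAC"), ("seq_c_2", "GGCC")]
    for row in length_rows(toy):
        print(sequence_type(row[0]), *row, sep="\t")
    levels = nx_levels([80, 70, 50, 40, 30, 20])
    print("N50 =", levels[50][0], "L50 =", levels[50][1])  # N50 = 70 L50 = 2
```

With these assumptions the BIN_SUM values add up to the total length, and a level whose threshold is already covered by the previous level gets a BIN_SUM of 0.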

non_reference_preliminary_mitochondrial_sequence.tar.gz

An archive with the Nanopore long reads (FASTQ format) used to produce a preliminary mitochondrial assembly from a non-reference specimen ("K4"), as well as the resulting sequence (FASTA format). In addition, the archive contains the MITOS2 gene annotations for this preliminary assembly.

Funding

Climate genomics in the Northern krill: the past, present and future of an important marine species

Swedish Research Council for Environment, Agricultural Sciences and Spatial Planning


Publisher

Uppsala University
