
3. Ecological genomics of the Northern krill: Genome assembly annotations (genes and repeats)

dataset
posted on 2024-03-28, 00:10 authored by Andreas Wallberg, Per Unneberg

This item holds multiple gene and repeat model and annotation files, including coordinates in GFF/GTF formats, TXT/TSV tables and sequences in FASTA format. It also contains some accessory RNA-seq gene resources, such as Trinity-assembled transcripts and Nanopore cDNA sequences, that were used at various stages of assembly and annotation.

Coordinates refer to the main genome assembly reference sequence (1.m_norvegica.main_w_mito.fasta) but focus on the nuclear genome assembly and rarely include features of the mitochondrial assembly. Mitochondrial annotations are provided separately (see below).

Contents:

  1. trinity_transcripts.tar.gz, an archive with n=573,869 RNA transcripts in FASTA format, assembled with Trinity from Illumina RNA-seq data.
  2. trinity_transcript.16509_single_isoforms.cds.fasta.tar.gz, a subset of 16,509 single (longest) isoforms of putatively protein-coding transcripts, used to assess genome assembly metrics such as duplication and base-level error. Sequences are in FASTA format.
  3. nanopore_cDNA.representative_sequences_vsearch.tar.gz, n=25,484 cDNA Nanopore sequence reads used to filter gene models and scaffold the genome.
  4. annotations.all_genes_and_isoforms.redundant.tar.gz, an archive with all (n=202,138) gene models and isoforms/alternative splice variants, also including non-protein-coding genes.
  5. annotations.protein_coding_gene_models.non_redundant.gff3, a non-redundant (i.e. single-isoform) set of putative protein-coding gene bodies (n=42,227) in standard GFF3 format.
  6. annotations.protein_coding_gene_models.non_redundant.CDS.fasta, the matching set of putative protein-coding genes in FASTA format (CDS nucleotide sequences).
  7. annotations.protein_coding_gene_models.non_redundant.PEP.fasta, the matching set of putative protein sequences in FASTA format (PEP peptide sequences).
  8. annotations.protein_coding_gene_models.non_redundant.PEP.fasta.BLAST.DROSOPHILA.tsv.tar.gz, output from BLASTP analyses between Northern krill and Drosophila peptide sequences (BLAST outfmt 6).
  9. annotations.protein_coding_gene_models.non_redundant.PEP.fasta.EnTAP.final_annotations_lvl1.tsv, main output from EnTAP functional annotations of protein coding genes.
  10. annotations.protein_coding_gene_models.non_redundant_added_stop_codons.gff, non-redundant protein-coding models as above, but missing stop-codons have been added if detected in the reference genome assembly (GFF format).
  11. annotations.protein_coding_gene_models.non_redundant_added_stop_codons.CDS.fasta, the matching CDS nucleotide sequences, in which missing stop-codons have been added if detected in the reference genome assembly (FASTA format).
  12. mitochondrion.tar.gz, an archive with gene coordinates and sequences of tRNAs, rRNAs, protein-coding genes and repeat features on the mitochondrial chromosome, as inferred using MITOS2. Files are standard BED/GFF/TSV/TXT/FASTA files and more information about formats can be found on the site for the original tool: http://mitos2.bioinf.uni-leipzig.de/help.py
  13. annotations.repeat_library.fasta, a custom set of n=10,909 non-redundant repeat sequences in FASTA format that were used to annotate the genome for repeats using RepeatMasker.
  14. annotations.repeats_across_the_genome_repeatmasker.tbl, the standard RepeatMasker masking overview output table.
  15. annotations.repeats_across_the_genome_repeatmasker.out.tar.gz, the full set of masked repeats and their coordinates across the genome.
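
Item 8 above is described as BLAST outfmt 6, a headerless tab-separated table. As a minimal sketch of how such a table might be read, the Python below keeps the highest-scoring Drosophila hit per krill query. It assumes the file uses the 12 default outfmt 6 columns (the description does not say whether custom columns were requested), and the identifiers in the toy rows are made up for illustration.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
# Assumption: the TSV in this item uses these 12 default columns, no header.
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return the highest-bitscore hit per query as {qseqid: row dict}."""
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(OUTFMT6_COLUMNS, fields))
        row["pident"] = float(row["pident"])
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best


# Toy rows for illustration only; both identifiers are hypothetical.
example = io.StringIO(
    "krill_gene_1\tfly_prot_A\t45.2\t210\t100\t5\t1\t210\t3\t212\t1e-50\t180.0\n"
    "krill_gene_1\tfly_prot_B\t60.0\t200\t80\t0\t1\t200\t1\t200\t1e-80\t250.5\n"
)
hits = best_hits(example)
print(hits["krill_gene_1"]["sseqid"])
```

For the real file, the archive would first need to be unpacked, and the extracted TSV opened with `open(path, newline="")` in place of the `io.StringIO` stand-in.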

trinity_transcripts.tar.gz

This archive contains the transcripts assembled from RNA-seq data produced from six RNA extractions/tissues of the reference specimen. There are three FASTA files:

  • trinity_transcripts.all_genes_and_isoforms.fasta = all assembled transcripts (n=573,869)
  • trinity_transcripts.metazoan_genes_and_isoforms.CDS.fasta = a subset of n=60,677 assembled and putatively coding transcripts with best hits against Metazoan sequences (CDS nucleotide sequences)
  • trinity_transcripts.metazoan_genes_and_isoforms.PEP.fasta = the n=60,677 corresponding peptide sequences.

nanopore_cDNA.representative_sequences_vsearch.tar.gz

This archive contains putatively full-length cDNA reads in three FASTA files:

  • clusters.fa = VSEARCH cluster representatives (i.e. cluster centroids with low error rates) that retain the original Nanopore sequence headers (n=25,484)
  • clusters.renamed.fa = as above, but renamed with simple incrementing headers.
  • clusters.renamed.min_500bp.fa = as above, but only reads longer than 500 bp (n=24,632). These reads were used to scaffold the genome.
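
The length filter described for the last file can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure that produced the files. It assumes a strict "longer than 500 bp" cut-off, following the wording above (the file name could also be read as an inclusive minimum), and the function names are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longer_than(records, min_length=500):
    """Keep records whose sequence is strictly longer than min_length.

    Assumption: the cut-off is exclusive; change `>` to `>=` if the
    threshold in the original filtering was inclusive.
    """
    return [(h, s) for h, s in records if len(s) > min_length]


# Toy input: one read of 501 bases and one of exactly 500 bases.
toy = io.StringIO(">read_1\n" + "A" * 501 + "\n>read_2\n" + "C" * 500 + "\n")
kept = longer_than(read_fasta(toy))
print([header for header, _ in kept])
```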

annotations.all_genes_and_isoforms.redundant.tar.gz

This archive contains gene models in four files:

  • annotations.all_genes_and_isoforms.redundant.gtf, coordinates in GTF format
  • annotations.all_genes_and_isoforms.redundant.gff3, coordinates in GFF3 format
  • annotations.all_genes_and_isoforms.redundant.fasta, sequences in FASTA format
  • annotations.all_genes_and_isoforms.redundant.transcripts.tsv, a TSV table with three fields specifying: 1) the final name of the isoform/splice variant; 2) the name of the gene model it belongs to; 3) the original name of the isoform.

These models were consolidated into loci using GFFCOMPARE from multiple sources of data, including RNA and comparative data. The names of the original isoforms indicate source:

  • STRG = HISAT/STRINGTIE RNA-seq gene model. Tagged "REF_STRG" in the final gene model.
  • mRNA = Assembled Trinity transcript. Tagged "REF_TRIN" in the final gene model.
  • COMPARATIVE_SPALN = Comparative model derived from other crustaceans.
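
As a rough sketch of how the three-field transcripts TSV and the source tags above could be combined, the Python below tallies isoforms per data source. It assumes that each tag appears at the start of the original isoform name (third field) and that the table has no header line; neither is stated explicitly here. The names in the toy rows are hypothetical.

```python
import csv
import io
from collections import Counter

# Source tags listed above, mapped to their descriptions.
# Assumption: the tag is a prefix of the original isoform name.
SOURCE_PREFIXES = {
    "STRG": "HISAT/STRINGTIE RNA-seq gene model",
    "mRNA": "Assembled Trinity transcript",
    "COMPARATIVE_SPALN": "Comparative model",
}


def classify_source(original_name):
    """Return the source description for an original isoform name."""
    for prefix, label in SOURCE_PREFIXES.items():
        if original_name.startswith(prefix):
            return label
    return "unknown"


def count_sources(handle):
    """Count isoforms per source from the three-field transcripts TSV."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        final_name, gene_name, original_name = row[:3]
        counts[classify_source(original_name)] += 1
    return counts


# Toy rows for illustration only; all names are hypothetical.
toy = io.StringIO(
    "iso_1\tgene_1\tSTRG.1.1\n"
    "iso_2\tgene_1\tmRNA_example_1\n"
    "iso_3\tgene_2\tCOMPARATIVE_SPALN_example_1\n"
    "iso_4\tgene_2\tSTRG.2.1\n"
)
counts = count_sources(toy)
print(counts)
```

Names that match none of the listed tags are counted as "unknown" rather than guessed.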

GFF and GTF format specifications are available here:

https://www.ensembl.org/info/website/upload/gff.html

https://www.ensembl.org/info/website/upload/gff3.html

annotations.protein_coding_gene_models.non_redundant.(gff3|CDS.fasta|PEP.fasta)

These files contain a filtered set of the "best" isoform model for each locus (n=42,227 in total), determined by comparison to NCBI RefSeq. These models were used to annotate SNPs and to infer homology/orthology, gene family evolution and molecular evolution.
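
A minimal sketch of reading the GFF3 coordinates is given below, relying only on the standard GFF3 layout (nine tab-separated columns, 1-based inclusive coordinates, `key=value` attributes separated by semicolons). Which feature types the file actually contains is not stated here, so the toy records and their IDs are hypothetical.

```python
import io
from collections import Counter


def parse_gff3(handle):
    """Yield one dict per feature line of a GFF3 file."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature line (e.g. embedded FASTA)
        seqid, source, ftype, start, end, score, strand, phase, attrs = cols
        attributes = dict(
            pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
        )
        yield {
            "seqid": seqid,
            "type": ftype,
            "start": int(start),  # 1-based, inclusive
            "end": int(end),      # 1-based, inclusive
            "strand": strand,
            "attributes": attributes,
        }


# Toy records for illustration only; sequence names and IDs are hypothetical.
toy = io.StringIO(
    "##gff-version 3\n"
    "seq_1\texample\tgene\t100\t900\t.\t+\t.\tID=gene_1\n"
    "seq_1\texample\tmRNA\t100\t900\t.\t+\t.\tID=mrna_1;Parent=gene_1\n"
    "seq_1\texample\tCDS\t150\t800\t.\t+\t0\tID=cds_1;Parent=mrna_1\n"
)
features = list(parse_gff3(toy))
type_counts = Counter(f["type"] for f in features)
gene_lengths = {
    f["attributes"]["ID"]: f["end"] - f["start"] + 1
    for f in features
    if f["type"] == "gene"
}
print(type_counts, gene_lengths)
```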

annotations.repeat_library.fasta

This FASTA file contains the representative and non-redundant template repeat sequences that were used to annotate the Northern krill genome for interspersed repeats. The sequence headers indicate several aspects of each repeat.

Example: "seq_c_98391_5186_12351_FIN_ReC99C#LTR/Pao"

This indicates that the template:

  • is located on sequence seq_c_98391 with start/stop coordinates 5186/12351
  • was originally detected using LTR_Finder ("FIN")
  • was classified as "LTR/Pao" using RepeatClassifier ("ReC")
  • has 99% identity between the 5' and 3' LTRs ("99") and was considered complete with respect to the expected protein domains detected along the repeat.

Additional tags and nomenclature are described in the paper methods.
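
The header layout in the example can be unpacked with a regular expression, as sketched below in Python. The pattern is inferred from that single example header only, so it is an assumption that other headers follow the same layout; headers using the additional tags mentioned above may not match, in which case the function returns `None`. Any letters after the identity value are returned uninterpreted as `suffix`.

```python
import re

# Pattern inferred from the single example header above (an assumption):
# <sequence>_<start>_<end>_<detector>_<classifier><identity><suffix>#<class>
HEADER_PATTERN = re.compile(
    r"^(?P<seqid>.+)_(?P<start>\d+)_(?P<end>\d+)"
    r"_(?P<detector>[A-Za-z]+)"
    r"_(?P<classifier>[A-Za-z]+)(?P<ltr_identity>\d+)(?P<suffix>[A-Za-z]*)"
    r"#(?P<repeat_class>.+)$"
)


def parse_repeat_header(header):
    """Split a repeat library header into its parts, or return None."""
    match = HEADER_PATTERN.match(header.lstrip(">"))
    if match is None:
        return None  # header uses a layout this sketch does not cover
    parts = match.groupdict()
    parts["start"] = int(parts["start"])
    parts["end"] = int(parts["end"])
    parts["ltr_identity"] = int(parts["ltr_identity"])
    return parts


parsed = parse_repeat_header("seq_c_98391_5186_12351_FIN_ReC99C#LTR/Pao")
print(parsed)
```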

Funding

Climate genomics in the Northern krill: the past, present and future of an important marine species

Swedish Research Council for Environment Agricultural Sciences and Spatial Planning

Publisher

Uppsala University