Comparative population transcriptomics in krill: reference transcriptomes (FASTA, GFF, TSV files)

dataset

posted on 2023-10-19, 13:31 authored by Andreas Wallberg

This item holds one major gzipped tar archive that contains 20 nested tar archives, each of which contains the reference transcriptomes and associated metadata for one species of krill (20 species in total).

Archive:

krill.transcriptomes.tar.gz

Contents of major archive (FILE,TAG,SPECIES,SIZE):

earm.transcriptomes.tar,earm,Euphausia similis var. armata,491.6M
ecry.transcriptomes.tar,ecry,Euphausia crystallorophias,89.7M
edin.transcriptomes.tar,edin,Euphausia distinguenda,496.5M
efri.transcriptomes.tar,efri,Euphausia frigida,345.2M
elam.transcriptomes.tar,elam,Euphausia lamelligera,515.9M
elos.transcriptomes.tar,elos,Euphausia longirostris,234M
emuc.transcriptomes.tar,emuc,Euphausia mucronata,360.4M
epac.transcriptomes.tar,epac,Euphausia pacifica,357.1M
erec.transcriptomes.tar,erec,Euphausia recurva,114.8M
esim.transcriptomes.tar,esim,Euphausia similis,417.9M
espi.transcriptomes.tar,espi,Euphausia spinifera,425.1M
esup.transcriptomes.tar,esup,Euphausia superba,520.6M
etri.transcriptomes.tar,etri,Euphausia triacantha,396M
eval.transcriptomes.tar,eval,Euphausia vallentini,635.1M
mnor.transcriptomes.tar,mnor,Meganyctiphanes norvegica,469M
nmeg.transcriptomes.tar,nmeg,Nematoscelis megalops,429M
tine.transcriptomes.tar,tine,Thysanoessa inermis,594.6M
tlon.transcriptomes.tar,tlon,Thysanoessa longicaudata,328.8M
tmac.transcriptomes.tar,tmac,Thysanoessa macrura,253.4M
trac.transcriptomes.tar,trac,Thysanoessa raschii,231.2M
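
A minimal sketch of how the nested layout could be inspected without unpacking everything to disk, using only the Python standard library. The function name `list_nested_members` is hypothetical, and the sketch assumes that the nested archives are stored as plain `.tar` members of the major archive, as the listing above suggests; any directory prefix inside the archive would appear as part of the returned names.

```python
import tarfile


def list_nested_members(outer_path):
    """Map each nested .tar member of the major archive to the files it holds."""
    contents = {}
    with tarfile.open(outer_path, "r:gz") as outer:
        for member in outer.getmembers():
            # Skip anything that is not a regular file ending in .tar
            if not (member.isfile() and member.name.endswith(".tar")):
                continue
            nested_stream = outer.extractfile(member)
            # Open the nested archive directly from the stream (uncompressed tar)
            with tarfile.open(fileobj=nested_stream, mode="r:") as nested:
                contents[member.name] = [
                    m.name for m in nested.getmembers() if m.isfile()
                ]
    return contents
```

Calling `list_nested_members("krill.transcriptomes.tar.gz")` should return one entry per species archive. Reading through the gzip stream can be slow for an archive of this size, so extracting the major archive once and opening the nested tar files from disk may be preferable for repeated use.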

Contents of nested archives:

Each nested tar archive contains the following set of files (the "TAG" is prepended to the filenames according to the list of species tags above):

TAG.trinity.fasta

The full Trinity transcriptome, including non-coding transcripts and alternative isoforms.

TAG.trinity.longest_isoforms.fasta.renamed.list.tsv:

A TSV table to translate between original Trinity transcript sequence names (field 3) and names used throughout the analyses (field 2). This table contains the longest isoforms, i.e. the resulting transcripts after removing redundant shorter isoforms.

field 1: number
field 2: species-specific transcript sequence names used in analyses. The sequence names follow the format "TAG_NUMBER" for non-coding transcripts and "TAG_NUMBER_OTHER_NUMBER" for coding transcripts (the last number indicates which reading frame was selected by TransDecoder as the best).
field 3: original Trinity transcript sequence names
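
The three fields above can be read with a few lines of Python. This is a sketch under stated assumptions, not a description of how the table was produced: it assumes the file is tab-separated with no header line, that species tags contain no underscore, and that coding names therefore have more than two underscore-separated parts. The helper names `read_rename_table` and `is_coding` are hypothetical.

```python
import csv


def read_rename_table(handle):
    """Return {analysis_name: original_trinity_name} from the three-field TSV."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        _number, analysis_name, trinity_name = row[:3]
        mapping[analysis_name] = trinity_name
    return mapping


def is_coding(analysis_name):
    """Assumed rule: TAG_NUMBER is non-coding, anything longer is coding."""
    return len(analysis_name.split("_")) > 2
```

For example, opening `esup.trinity.longest_isoforms.fasta.renamed.list.tsv` and passing the file handle to `read_rename_table` should give a dictionary that translates the names used in the other files back to the original Trinity names.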

TAG.trinity.longest_isoforms.coding.fasta

The filtered transcriptome, including only the longest isoform of each coding transcript.

TAG.trinity.longest_isoforms.coding.fasta.transdecoder.gff3

A GFF coordinate file that specifies where features such as CDS and UTRs start and stop along the coding transcripts.
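
A minimal sketch of reading such coordinates, assuming the standard nine tab-separated GFF3 columns and the feature type names "CDS", "five_prime_UTR" and "three_prime_UTR" (the types TransDecoder normally writes; check the actual files before relying on this). The function name `read_features` is hypothetical.

```python
def read_features(handle, wanted=("CDS", "five_prime_UTR", "three_prime_UTR")):
    """Collect (type, start, end, strand) tuples per transcript from a GFF3 file."""
    features = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a GFF3 feature line
        seqid, _source, ftype, start, end, _score, strand, _phase, _attrs = cols
        if ftype in wanted:
            # GFF3 coordinates are 1-based and inclusive
            features.setdefault(seqid, []).append(
                (ftype, int(start), int(end), strand)
            )
    return features
```

The returned dictionary is keyed by the transcript name in column 1, so the coordinates can be looked up by the same names used in the FASTA files.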

TAG.trinity.longest_isoforms.fasta.transdecoder.cds.fasta

The CDS of the open reading frame of coding transcripts, as specified by the TAG.trinity.longest_isoforms.coding.fasta.transdecoder.gff3 GFF file and the TAG.trinity.longest_isoforms.coding.fasta file.
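
The relationship between the three files can be illustrated by cutting a CDS out of its transcript using GFF coordinates. This is a sketch only, with hypothetical helper names (`read_fasta`, `slice_cds`); it assumes 1-based inclusive coordinates as in GFF3 and reverse-complements the segment when the feature lies on the minus strand.

```python
def read_fasta(handle):
    """Return {sequence_name: sequence}, using the first word of each header."""
    parts = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def slice_cds(transcript, start, end, strand="+"):
    """Cut a 1-based inclusive interval; reverse-complement minus-strand features."""
    segment = transcript[start - 1:end]
    if strand == "-":
        segment = segment.translate(_COMPLEMENT)[::-1]
    return segment
```

Comparing the output of `slice_cds` for a given transcript with the matching record in the CDS FASTA file is one way to confirm that the files in an archive are consistent with each other.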

TAG.trinity.longest_isoforms.fasta.transdecoder.pep.fasta

The corresponding peptide sequence encoded by each CDS.

The GFF files follow the GFF3 standard:

https://www.ensembl.org/info/website/upload/gff3.html

The FASTA files follow the FASTA standard:

https://www.ncbi.nlm.nih.gov/genbank/fastaformat/

Note: Compared to the files used in analyses, these files have been edited to reflect the species names and abbreviations used in publication figures.