SciLifeLab
Supporting data tracks for: "Breaking insect genome records: sequencing of <i>Stylops ater</i> (Strepsiptera) reveals a minute, compact genome with a reduced set of genes".

dataset
posted on 2025-11-14, 15:03 authored by Johannes Bergsten, Martin Pippel, Meri Lähteenaro, Julia Heintz, Genevieve Diedericks, Henrique G. Leitão, Carlos Leyton Rotella, Mahesh Binzer-Panchal, Christian Tellgren-Roth, Mai-Britt Mosbech, Hannes Svardal, Alice Mouton, Giulio Formenti, Ann M. Mc Cartney, Henrik Lantz, Olga Vinnere Pettersson
<p dir="ltr">This data set contains supporting data tracks for the manuscript: "Breaking insect genome records: sequencing of <i>Stylops ater</i> (Strepsiptera) reveals a minute, compact genome with a reduced set of genes". The assembly and gene annotation are available on ENA under the umbrella project PRJEB71963. Here we publish the following additional resources:</p><ul><li>repeatmasker.gff - repeat track. A repeat library was modelled with the RepeatModeler2 [1] <code>v2.0.2a</code> package. As repeats can be part of actual protein-coding genes, the candidate repeats modelled by RepeatModeler2 were vetted against our protein set (minus transposons) to exclude any nucleotide motif stemming from low-complexity coding sequences. Repeat sequences present in the genome were then identified from this library using <a href="https://www.repeatmasker.org/" target="_blank">RepeatMasker</a> <code>v4.1.5</code> [2].</li><li>repeatrunner.gff - repeat track generated with RepeatRunner [3]. RepeatRunner integrates RepeatMasker with BLASTX, which allows it to analyse highly divergent repeats and divergent portions of repeats, and to identify divergent protein-coding portions of retro-elements and retroviruses that RepeatMasker does not detect.</li><li>trna.gff - tRNA track. tRNAs were predicted with tRNAscan-SE version 1.4 [4].</li><li>rfam.gff - ncRNA track. The main source of information is the RNA family database Rfam (version 14.9) [5]. Rfam provides curated covariance models (CMs), which can be used together with the Infernal [6] package to predict ncRNAs in genomic sequences. By default, we limit the set of CM profiles to broadly conserved, eukaryotic ncRNA families. Note: in general, Rfam-derived ‘annotations’ should rather be seen as ‘predictions’. 
With the exception of some very well conserved ncRNA families, many of the resulting Rfam predictions need to be considered with some care.</li><li>ST_1.gff3 - transcriptome assembly of Illumina RNA-seq library ST_1 (ENA: SAMEA12922144, ERX11689259), assembled using our in-house pipeline <a href="https://github.com/NBISweden/pipelines-nextflow/tree/master/subworkflows/transcript_assembly" rel="noreferrer" target="_blank">transcript_assembly</a> [7]. To minimise gene fusions in this parasite genome, the maximum intron length was reduced from 500000 to 20000 (<code>hisat2 --max-intronlen 20000</code>). Otherwise, default parameters were used.</li><li>ST_2.gff3 - transcriptome assembly of Illumina RNA-seq library ST_2 (ENA: SAMEA12922144, ERX11689261), assembled using our in-house pipeline <a href="https://github.com/NBISweden/pipelines-nextflow/tree/master/subworkflows/transcript_assembly" rel="noreferrer" target="_blank">transcript_assembly</a> [7]. To minimise gene fusions in this parasite genome, the maximum intron length was reduced from 500000 to 20000 (<code>hisat2 --max-intronlen 20000</code>). Otherwise, default parameters were used.</li><li>rc2_evidence_abinitio.gff - gene models created from the second MAKER run (rc2), combining evidence (from run 1, i.e. rc1) and <i>ab initio</i> predictors. Specifically, <b>AUGUSTUS</b> was the <i>ab initio</i> predictor used for rc2_evidence_abinitio.gff.</li><li>rc2_evidence_genemark.gff - gene models created from the second MAKER run (rc2), combining evidence (from run 1, i.e. rc1) and <i>ab initio</i> predictors. Specifically, <b>GeneMark</b> was the <i>ab initio</i> predictor used for rc2_evidence_genemark.gff.</li></ul><p dir="ltr">References:</p><p dir="ltr">[1] - Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, Smit AF. (2020) RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences 117(17): 9451-9457. 
<a href="https://doi.org/10.1073/pnas.1921046117" target="_blank">https://doi.org/10.1073/pnas.1921046117</a></p><p dir="ltr">[2] - Smit AFA, Hubley R, Green P. (2013-2015) RepeatMasker Open-4.0.</p><p dir="ltr">[3] - Yandell Lab: https://www.yandell-lab.org/software/repeatrunner.html</p><p dir="ltr">[4] - Lowe TM, Eddy SR. (1997) tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research 25(5): 955–964. <a href="https://doi.org/10.1093/nar/25.5.955" target="_blank">https://doi.org/10.1093/nar/25.5.955</a></p><p dir="ltr">[5] - Kalvari I, Nawrocki EP, Ontiveros-Palacios N, Argasinska J, Lamkiewicz K, Marz M, Griffiths-Jones S, Toffano-Nioche C, Gautheret D, Weinberg Z, Rivas E, Eddy SR, Finn RD, Bateman A, Petrov AI. (2021) Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Research 49(D1): D192–D200. <a href="https://doi.org/10.1093/nar/gkaa1047" target="_blank">https://doi.org/10.1093/nar/gkaa1047</a></p><p dir="ltr">[6] - Nawrocki EP, Eddy SR. (2013) <a href="http://eddylab.org/publications.html#Nawrocki13c" target="_blank">Infernal 1.1: 100-fold faster RNA homology searches</a>. Bioinformatics 29: 2933-2935.</p><p dir="ltr">[7] - GitHub: https://github.com/NBISweden/pipelines-nextflow/tree/master/subworkflows/transcript_assembly</p>
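<p dir="ltr">For readers who want to inspect the tracks listed above, the following is a minimal Python sketch, not part of the published pipeline. It reads a GFF/GFF3 file, counts features by type, and reports the longest gap between consecutive exons of the same transcript, which can be compared against the 20000 bp intron limit used for ST_1.gff3 and ST_2.gff3. The function names, the example records and the scaffold name are hypothetical. It also assumes that exons carry a single <code>Parent</code> attribute, which may not hold for every track.</p>

```python
import io
from collections import Counter, defaultdict


def parse_gff(handle):
    """Yield one dict per 9-column GFF feature line; skip comments and blanks."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a standard feature line
        attrs = {}
        for field in cols[8].split(";"):
            if "=" in field:
                key, value = field.strip().split("=", 1)
                attrs[key] = value
        yield {
            "seqid": cols[0],
            "source": cols[1],
            "type": cols[2],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "strand": cols[6],
            "attributes": attrs,
        }


def count_types(features):
    """Number of features per value of the GFF 'type' column."""
    return Counter(f["type"] for f in features)


def max_intron_length(features):
    """Longest gap between consecutive exons sharing a Parent (0 if none).

    Assumption: each exon has exactly one Parent value.
    """
    exons = defaultdict(list)
    for f in features:
        if f["type"] == "exon" and "Parent" in f["attributes"]:
            exons[f["attributes"]["Parent"]].append((f["start"], f["end"]))
    longest = 0
    for spans in exons.values():
        spans.sort()
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            longest = max(longest, next_start - prev_end - 1)
    return longest


# Hypothetical records for illustration only; not taken from the data set.
EXAMPLE = (
    "##gff-version 3\n"
    "scaffold_1\texample\ttranscript\t100\t900\t.\t+\t.\tID=tx1\n"
    "scaffold_1\texample\texon\t100\t300\t.\t+\t.\tID=tx1.e1;Parent=tx1\n"
    "scaffold_1\texample\texon\t501\t900\t.\t+\t.\tID=tx1.e2;Parent=tx1\n"
)

features = list(parse_gff(io.StringIO(EXAMPLE)))
print(count_types(features))
print(max_intron_length(features))
```

<p dir="ltr">To run this on a downloaded track, replace <code>io.StringIO(EXAMPLE)</code> with an opened file, for example <code>open("ST_1.gff3")</code>. The same feature-type count works for the repeat, tRNA and ncRNA tracks.</p>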

Funding

The Swedish Taxonomy Initiative (Dha 2019.4.3-7 and Dha 2019.4.3-218)

The Royal Swedish Academy of Sciences (BS2022-0020)

European Reference Genome Atlas pilot-funds

Science for Life Laboratory, Sweden

The Swedish Research Council, Council for Research Infrastructure

Publisher

Swedish Museum of Natural History

SciLifeLab acknowledgement

  • Bioinformatics platform (NBIS)