Supplemental data from the genome assembly and annotation of the Clouded Apollo Butterfly (Parnassius mnemosyne)

posted on 2024-06-26, 11:34 authored by Jacob Höglund, Guilherme Dias, Remi-André Olsen, André Soares, Ignas Bunikis, Venkat Talla, Niclas Backström

This dataset contains supplementary data from the genome sequencing of the Clouded Apollo Butterfly (Parnassius mnemosyne), published in:

Höglund, J., Dias, G., Olsen, R. A., Soares, A., Bunikis, I., Talla, V., & Backström, N. (2024). A Chromosome-Level Genome Assembly and Annotation for the Clouded Apollo Butterfly (Parnassius mnemosyne): A Species of Global Conservation Concern. Genome Biology and Evolution, 16(2), evae031.

Previous data from the project has been deposited at the European Nucleotide Archive (ENA) in the umbrella project PRJEB76269.

The data contained in this archive at SciLifeLab Data Repository describe the genome assembly (ENA accession: GCA_963668995.1) and the mitochondrial genome assembly (ENA accession: OZ075093.1).

Below follows a brief description of each file. The information on the methods used to generate the files was adapted from Höglund et al. 2024.

  • pmne_functional_edit1.gff.gz

contains the functional annotation (protein-coding genes) of the primary genome assembly (GCA_963668995.1). This is the original file that was submitted to ENA. A derived version of the file is available from NCBI; the NCBI version was generated from the EMBL records of each annotated gene and differs, for instance, in using a different naming scheme for the seqid column and the locus tags. The NCBI version is available at this link.

The genes were predicted using BRAKER (v3.03), GALBA (v1.0.6), and GeneMarkS-T (v5.1). The resulting gene models were combined and filtered using TSEBRA (version: long_reads branch commit 1f2614). The combined gene model was functionally annotated by the NBIS nextflow pipeline v2.0.0 (

  • pmne_Illumina_RNAseq_StringTie_sorted-transcripts_match.gff.gz

contains a transcript assembly of the Illumina RNAseq reads (ENA accession: ERX11559451). The reads were aligned to the genome with HiSat2 (v2.1.0) and then assembled with StringTie (v2.2.1).

  • pmne_mtdna.gff.gz

contains the functional annotation of the mitochondrial genome assembly (ENA accession: OZ075093.1). This is the original file that was submitted to ENA. The annotation was generated using MitoFinder (v1.4.1).

  • pmne_ncRNAs.gff.gz

contains the annotation of putative non-coding RNA (ncRNA) genes. The prediction was done with Infernal (v1.1.4) and the Rfam (v14.1) covariance models.

  • pmne_tRNAs_and_pseudogenes.gff.gz

contains the annotation of putative tRNA genes and pseudogenes. The prediction was done with tRNAscan-SE (v2.0.12).

  • pmne_PacBio_isoseq.sorted.bam

contains the PacBio IsoSeq transcripts (ENA accession: ERX11559436) aligned to the primary genome assembly.

  • pmne_repeat_library.fa.gz

contains the nucleotide sequences of the predicted repeats in FASTA format. The prediction was done with RepeatModeler2 (v2.0.2a).
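
Most of the files listed above are gzip-compressed GFF3 files. As a quick sanity check after download, the feature types in such a file can be tallied with a few lines of Python. The following is only a minimal sketch using the standard library; the function names (parse_gff3_line, count_feature_types, count_feature_types_in_file) are hypothetical and not part of the dataset or of any tool used by the authors. It assumes the standard nine tab-separated GFF3 columns and does not decode URL-escaped attribute values.

```python
import gzip
from collections import Counter

# The nine columns defined by the GFF3 format, in order.
GFF3_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]


def parse_gff3_line(line):
    """Return one feature line as a dict, or None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds semicolon-separated key=value pairs (e.g. ID=...;Parent=...).
    attributes = {}
    for pair in record["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = value
    record["attributes"] = attributes
    return record


def count_feature_types(handle):
    """Count how often each value of the 'type' column occurs in an open text handle."""
    counts = Counter()
    for line in handle:
        if line.startswith("##FASTA"):
            break  # an optional embedded sequence section follows; stop here
        record = parse_gff3_line(line)
        if record is not None:
            counts[record["type"]] += 1
    return counts


def count_feature_types_in_file(path):
    """Open a gzip-compressed GFF3 file in text mode and count its feature types."""
    with gzip.open(path, "rt") as handle:
        return count_feature_types(handle)
```

For example, count_feature_types_in_file("pmne_ncRNAs.gff.gz") would return a Counter keyed by feature type, assuming the file has been downloaded to the working directory.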

Available variables

For a description of the column headers of the files, please see the following links to the documentation of the different file formats.

The GFF3 format (.gff) is described here:

The BAM format (.bam) is a compressed version of the SAM format, both of which are described here:

The fasta (.fa) format is described here:
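
In addition to the format documentation, the layout of a FASTA file (a header line starting with ">" followed by one or more sequence lines) can be illustrated with a short reader. This is a minimal standard-library sketch, not a replacement for an established parser; the function names read_fasta and sequence_lengths are hypothetical and introduced only for illustration.

```python
import gzip


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA text handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; collect and join them later.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sequence_lengths(path):
    """Map each header in a gzip-compressed FASTA file to its sequence length."""
    with gzip.open(path, "rt") as handle:
        return {header: len(sequence) for header, sequence in read_fasta(handle)}
```

Calling sequence_lengths("pmne_repeat_library.fa.gz") would give the length of each predicted repeat sequence, assuming the file is present locally.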


