Phylogenomics of aquatic bacteria

dataset

posted on 2025-02-17, 14:20 authored by Krzysztof Jurdzinski, Maliheh Mehrshad, Stefan Bertilsson, Anders Andersson

Intermediate data files obtained during the work on the manuscript "Phylogenomics of aquatic bacteria reveal molecular mechanisms behind the limits of their adaptation to salinity". The files published here were used at various stages of the analysis (or summarize those stages) and should allow reproduction of the results as well as further investigation of the dataset.

These files are:

ar_mags_info.txt - information table for the collected archaeal MAGs. It contains the names of the MAGs in the format {data source 2-letter code}_{name of the MAG as in ENA}, the biome of origin and the taxonomic classification. For the brackish MAGs it also gives the basin of origin (Baltic/Caspian), and for the Baltic Sea MAGs additional metadata.

bac_mags_info.txt - information table for the collected bacterial MAGs. It contains the names of the MAGs in the format {data source 2-letter code}_{name of the MAG as in ENA}, the biome of origin and the taxonomic classification. For the brackish MAGs it also gives the basin of origin (Baltic/Caspian), and for the Baltic Sea MAGs additional metadata.
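The MAG naming scheme described above can be split back into its two parts. The following is a minimal Python sketch, assuming the 2-letter source code never contains an underscore; the function name and the example MAG name are hypothetical and not taken from the dataset.

```python
def split_mag_name(mag_name: str) -> tuple[str, str]:
    """Split '{2-letter source code}_{ENA name}' into (source code, ENA name)."""
    # Split on the FIRST underscore only, since the ENA name may contain more.
    source_code, sep, ena_name = mag_name.partition("_")
    if not sep or len(source_code) != 2 or not ena_name:
        raise ValueError(f"unexpected MAG name format: {mag_name!r}")
    return source_code, ena_name


# Hypothetical example name, used for illustration only.
print(split_mag_name("XX_example_bin_001"))
```

Splitting on the first underscore keeps any underscores that are part of the ENA name intact.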

CheckM_all_MAGs.csv - CheckM results for all the investigated MAGs (completeness, contamination, strain heterogeneity).

ani_file.txt - average nucleotide identity (ANI) between all pairs of investigated MAGs.

MAG-cluster-stats-interbiome-clusters.xlsx - Excel file with a table assigning MAGs to >95% ANI clusters, with the representatives chosen for further analysis marked. It also contains sheets with just the representatives, with the clusters shared between the brackish basins and between the biomes, and with MSG_table.tsv imported into the spreadsheet. The first sheet also contains accession numbers for the bacterial MAGs used in this study.
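To illustrate how pairwise ANI values can be turned into >95% ANI clusters, here is a minimal Python sketch. It uses single-linkage grouping with a union-find structure, which is an assumption: the description above does not state which clustering procedure was used. The input is a hypothetical in-memory list of (MAG, MAG, ANI) tuples, not the actual layout of ani_file.txt.

```python
def cluster_by_ani(pairs, threshold=95.0):
    """Group genomes into single-linkage clusters of pairs with ANI > threshold.

    pairs: iterable of (genome_a, genome_b, ani_percent) tuples (assumed layout).
    Returns a sorted list of sorted clusters.
    """
    parent = {}

    def find(x):
        # Register unseen genomes as their own cluster, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, ani in pairs:
        root_a, root_b = find(a), find(b)
        if ani > threshold and root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for genome in list(parent):
        clusters.setdefault(find(genome), set()).add(genome)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical ANI values for illustration.
example = [("A", "B", 97.0), ("B", "C", 96.0), ("C", "D", 90.0)]
print(cluster_by_ani(example))
```

With single linkage, A and C end up in the same cluster through B even if their direct ANI were below the threshold; a stricter linkage rule would behave differently.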

nozero.bifurc.bac.tree.nwk - phylogenetic tree of all the MAGs and GTDB reference genomes. Obtained using GTDB-Tk.

pruned95.nozero.bifurc.bac.tree.nwk - the phylogenetic tree (nozero.bifurc.bac.tree.nwk) pruned to contain only one representative per biome from each >95% ANI cluster. Does not contain GTDB reference genomes.

subsampled.pruned95.nozero.bifurc.bac.tree.nwk - the phylogenetic tree of >95% ANI cluster representatives, further randomly pruned to contain the same number of freshwater and marine representatives.
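The balancing step can be illustrated as follows. This is a minimal Python sketch, assuming that the larger of the freshwater and marine sets is randomly reduced to the size of the smaller one and that representatives from other biomes are left untouched; the function name, biome labels, seed and example IDs are all hypothetical. It selects tip names only and does not prune a tree.

```python
import random


def subsample_balanced(reps_by_biome, biomes=("freshwater", "marine"), seed=0):
    """Randomly subsample the listed biomes down to a common size.

    reps_by_biome: dict mapping biome name -> list of representative IDs.
    Biomes not listed in `biomes` are returned unchanged (an assumption).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    target = min(len(reps_by_biome[b]) for b in biomes)
    result = {}
    for biome, reps in reps_by_biome.items():
        if biome in biomes:
            result[biome] = sorted(rng.sample(sorted(reps), target))
        else:
            result[biome] = sorted(reps)
    return result


# Hypothetical representative IDs for illustration.
example = {
    "freshwater": ["f1", "f2", "f3", "f4"],
    "marine": ["m1", "m2"],
    "brackish": ["b1", "b2", "b3"],
}
print(subsample_balanced(example))
```

The retained IDs could then be passed to a tree library to drop all other tips from the pruned tree.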

timetree_evo_rate_100.nwk - the full phylogenetic tree (nozero.bifurc.bac.tree.nwk) with branch lengths adjusted to correspond to estimated times since divergence in mya (million years ago).

time_calibration.txt - constraint file used for estimating time since divergence, input for RelTime (MEGA11). Minimal estimates of the time since the host species diverged [mya], based on the fossil record, were used to set the constraints.

MSG_table.tsv - a table (tab-separated) with all the MAGs within identified monobiomic sister groups (MSGs), annotated to appropriate transition_ID, biome and transition type. Taxonomic classification and transition times and directions are also included.

make_MSG_table.R - R script used to make MSG_table.tsv.

assess_datetree.R - R script used to find the cross-biome transitions on the time-adjusted phylogenetic tree and to obtain the estimated time since they occurred.

transition_directions.R - R script used to estimate the ancestral biome states of the MRCAs (most recent common ancestors) of MSG pairs and thus infer the most probable transition directions.

all_MSG_ids.txt - a text file with names of all the representative MAGs within all the MSG pairs.

filter_MSGs.py - a Python script to extract the MAGs from within the MSGs (given all_MSG_ids.txt) from a folder containing a larger set of sequences.
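The kind of selection performed by such a filtering step can be sketched in Python as below. This is not the contents of filter_MSGs.py; it assumes, hypothetically, that each sequence file is named after its MAG ID with a single extension (for example `XX_mag1.fa`), and the function name is made up for illustration.

```python
from pathlib import Path


def select_msg_files(candidates, msg_ids):
    """Return the candidate paths whose file stem is one of the MSG MAG IDs.

    candidates: iterable of file paths (str or Path).
    msg_ids: iterable of MAG IDs, e.g. the lines of all_MSG_ids.txt.
    Assumes one file per MAG, named '<MAG ID>.<extension>' (hypothetical).
    """
    wanted = set(msg_ids)
    return [path for path in map(Path, candidates) if path.stem in wanted]


# Hypothetical file names and IDs for illustration.
files = ["seqs/XX_mag1.fa", "seqs/YY_mag2.fa", "seqs/ZZ_mag3.fa"]
print(select_msg_files(files, ["XX_mag1", "ZZ_mag3"]))
```

In practice the candidates could come from `Path(folder).iterdir()` and the IDs from reading the ID file line by line; files with double extensions such as `.fa.gz` would need the stem handled differently.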

Snakefile_proteins - Snakefile with a pipeline to go from nucleotide MAG sequences to pepstats statistics for the inferred proteins. Includes a proteome inference step using Prodigal (the same procedure was used to infer amino acid sequences for other purposes, including the taxonomic classification and the reconstruction of the phylogenetic tree).

MSGs_whole_proteomes.py - a Python script to concatenate inferred proteomes into continuous amino acid sequences (for amino acid usage statistics).

Snakefile_whole_proteome - Snakefile with a pipeline to obtain amino acid relative frequencies within proteomes. It takes as input proteomes in the form of one continuous sequence (the output of MSGs_whole_proteomes.py).
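The two steps above (concatenating a proteome into one sequence, then computing amino acid relative frequencies) can be illustrated with a short Python sketch. It is a pure-Python stand-in, not the published scripts or pipeline; the function names are hypothetical, and the sketch assumes FASTA input in which `*` may mark stop codons and only the 20 standard amino acids are counted.

```python
from collections import Counter

# The 20 standard amino acids (one-letter IUPAC codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def concatenate_proteome(fasta_text: str) -> str:
    """Join all protein sequences of a FASTA string into one continuous sequence."""
    parts = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        parts.append(line.replace("*", ""))  # drop stop-codon markers, if any
    return "".join(parts)


def aa_relative_frequencies(sequence: str) -> dict:
    """Relative frequency of each standard amino acid in the sequence."""
    counts = Counter(c for c in sequence.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Tiny made-up proteome for illustration.
proteome = concatenate_proteome(">p1\nMKK*\n>p2\nAA\n")
freqs = aa_relative_frequencies(proteome)
print(proteome, freqs["K"], freqs["A"], freqs["M"])
```

Non-standard or ambiguous residue codes are ignored here rather than counted, which is a design choice of this sketch.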

MSGs_pI_rel_freq_table.tsv - a table (tab-separated) with the relative frequencies of proteins whose pIs (isoelectric points) fall within 0.5 pH wide bins.
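Binning of isoelectric points can be sketched as follows. This minimal Python example assumes half-open bins starting at pH 0 ([0, 0.5), [0.5, 1.0), and so on), which is an assumption: the exact bin edges used for the table are not stated here. The function name and the pI values are hypothetical.

```python
import math
from collections import Counter


def pi_bin_relative_frequencies(pi_values, bin_width=0.5):
    """Relative frequency of proteins per pI bin, keyed by the bin's lower edge."""
    if not pi_values:
        raise ValueError("no pI values given")
    # floor(pI / width) gives the index of the half-open bin [k*width, (k+1)*width).
    counts = Counter(math.floor(pi / bin_width) for pi in pi_values)
    total = len(pi_values)
    return {index * bin_width: n / total for index, n in sorted(counts.items())}


# Made-up pI values for illustration.
print(pi_bin_relative_frequencies([4.2, 4.4, 6.9, 10.0]))
```

Only non-empty bins are returned; a full table would add zero entries for the remaining bins.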

aa_freqs_MSGs_list.json and assessed_aas.tsv - a JSON file with amino acid relative frequencies for each inferred proteome and a tab-separated list of IUPAC amino acid codes in the order corresponding to the values in the list.

aa_cat_freqs_MSGs_list.json and assessed_aa_cats.tsv - a JSON file with relative frequencies of amino acid categories for each inferred proteome and a tab-separated list of the category names, in the same order as in the JSON file.

pI_aa_statistics.xlsx - statistics (p-values and difference sizes) for pairwise comparisons of inferred proteome properties and composition, i.e. i) relative frequencies of acidic, neutral and basic proteins (isoelectric point (pI) categories); ii) genome sizes, defined as the number of inferred protein-coding genes; iii) amino acid relative frequencies; iv) relative frequencies of amino acid categories.

{transition type}.annotation.gz and MSG_ids_{transition type}_pairs.txt - annotation files (gzip-compressed) of inferred genes for random pairs of MAGs from each MSG pair, together with a text file listing the MSG representatives. There is a separate pair of files for each transition type. Used for investigating coannotation.

ko_annot_full_everything.tsv - table with multilevel annotation of KEGG orthologs, adopted from the KEGG Orthology website.

ko_anno.rar - compressed table with numbers of genes annotated to respective KEGG orthologs in representative genomes from all the >95% ANI bacterial clusters (used for gene gain/loss analysis).

iterate_rarefying.R - R script used to identify the differentially present genes.

gain_loss_tables.xlsx - Sheets 1-3: results of MSG-based (phylogeny-aware) gene content analysis. Tables with all the significant (FDR < 0.1, shaded in orange) differentially present genes across pairs of MSGs. For FB and FM type transitions additional genes were added to the table to show at least the top 25 most significant genes regardless of the FDR values. Sheets 4-6: Biome(s) in which the differentially present KOs were found across the identified transitions (MSG pairs), i.e. the data presented in Fig. 6 in text form and annotated to more specific taxa and single transition events. Includes taxonomic annotation of the transitions and numbers of bacterial species in MSGs from respective biomes. Sheets 7-9: Fraction of cases in which gene A (row) was also annotated as gene B (column), based on {transition type}.annotation.gz files. Sheets 10-12: Results of phylogeny-unaware gene content analysis. Tables with all the significant (FDR < 0.1) differentially present genes from an unpaired comparison of all bacterial species from each biome.
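For readers who want to apply the same kind of FDR < 0.1 cut-off to their own p-values, a minimal Python sketch follows. It uses the Benjamini-Hochberg procedure, which is an assumption: the description above does not state which FDR correction was used. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for illustration.
pvals = [0.01, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(pvals)
significant = [i for i, q in enumerate(adj) if q < 0.1]
print(adj, significant)
```

Entries whose adjusted value falls below 0.1 would correspond to the "significant" rows under this assumed procedure.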