IPR0220 - InterPepRank set

dataset

posted on 2021-04-26, 12:10 authored by Isak Johansson-Åkhe, Claudio Mirabello, Björn Wallner

Peptide-protein interactions between a smaller or disordered peptide stretch and a folded receptor make up a large part of all protein-protein interactions. A common approach for modelling such interactions is to exhaustively sample the conformational space by fast Fourier transform (FFT) docking, and then refine a top percentage of the decoys. Methods that can rank the decoys for selection quickly enough for larger-scale studies commonly rely on first-principles energy terms, such as electrostatics and van der Waals forces, or on pre-calculated statistical pairwise potentials.

InterPepRank is a machine-learning-based method for peptide-protein complex scoring and ranking. It encodes the structure of the complex as a graph, with physical pairwise interactions as edges and evolutionary and sequence features as node attributes. The graph network is trained to predict the ligand root-mean-square deviation (LRMSD) of decoys, using edge-conditioned graph convolutions on a large set of peptide-protein complex decoys.
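As an illustration of the training target, the following is a minimal sketch of an LRMSD calculation. It assumes that the decoy and the native complex already share the same receptor frame (so no superposition is needed) and that the two peptide coordinate lists contain the same atoms in the same order. The function name `lrmsd` and the choice of atoms are assumptions made for this example; they are not taken from the InterPepRank code.

```python
import math


def lrmsd(decoy_coords, native_coords):
    """Root-mean-square deviation between matched peptide atoms.

    Assumes both coordinate lists are expressed in the same receptor
    frame and list the same atoms in the same order (hypothetical setup).
    """
    if len(decoy_coords) != len(native_coords) or not decoy_coords:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared_sum = 0.0
    for decoy_atom, native_atom in zip(decoy_coords, native_coords):
        # Squared Euclidean distance between the matched atoms.
        squared_sum += sum((decoy_atom[i] - native_atom[i]) ** 2 for i in range(3))
    return math.sqrt(squared_sum / len(decoy_coords))


# Toy example: a two-atom "peptide" shifted by 2 units along z.
native = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
decoy = [[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]]
toy_lrmsd = lrmsd(decoy, native)
```

A decoy with a low value is close to the native peptide pose, so ranking decoys by predicted LRMSD amounts to ranking them by expected closeness to the native structure.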

Here we present the complete dataset used to train InterPepRank. The set contains 679 receptor-peptide pairs. For each pair, 50 different peptide conformations are docked using 70000 different rotations, giving approximately 2.5 billion conformations in total. This is too large to be distributed as flat files. Instead, the dataset is distributed as a set of ft-files describing which rotations and translations to apply to the corresponding peptide ligands to generate decoy poses docked to the receptor structures. The apply_ftresult_improved.py script is provided to generate these structures.
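To make the idea of regenerating a decoy from a rotation and a translation concrete, here is a minimal sketch of a rigid-body transform applied to peptide coordinates. It is not the apply_ftresult_improved.py script and does not read ft-files; the function names `centroid` and `apply_pose`, and the convention of rotating about the peptide centroid before translating, are assumptions made for this example only.

```python
def centroid(coords):
    """Mean position of a list of [x, y, z] coordinates."""
    n = len(coords)
    return [sum(atom[i] for atom in coords) / n for i in range(3)]


def apply_pose(coords, rotation, translation):
    """Rotate coords about their centroid, then translate.

    rotation is a 3x3 matrix given as nested lists, translation a
    length-3 vector. The rotate-about-centroid convention is an
    assumption of this sketch, not necessarily that of the dataset.
    """
    center = centroid(coords)
    posed = []
    for atom in coords:
        shifted = [atom[i] - center[i] for i in range(3)]
        rotated = [
            sum(rotation[row][k] * shifted[k] for k in range(3))
            for row in range(3)
        ]
        posed.append([rotated[i] + center[i] + translation[i] for i in range(3)])
    return posed


# Toy example: rotate a two-atom "peptide" by 90 degrees about z,
# then move it 5 units along x.
peptide = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
decoy = apply_pose(peptide, rot_z_90, [5.0, 0.0, 0.0])
```

Storing only the transform parameters per decoy, rather than full coordinates, is what keeps a set of this size distributable: each pose can be rebuilt on demand from the peptide conformation and its rotation and translation.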

In addition, the dataset contains a set of apo and holo models that were used to benchmark unbound docking.

All files and scripts are given as-is with no warranty.