Codebase of deep learning models for inferring stability of mRNA molecules

This repository contains deep learning models for inferring the stability of mRNA molecules. It accompanies the Kaggle Open Vaccine Challenge and the manuscript “Predictive models of RNA degradation through dual crowdsourcing”, Wayment-Steele et al. (2021) (full citation when available).

Models contained here are:

“Nullrecurrent”: A reconstruction of the winning solution by Jiayang Gao. Links to the original notebooks are provided below.

“DegScore-XGBoost”: A model based on the original DegScore model and XGBoost.

Note on other historical names for the models

  • The Nullrecurrent model was called the “OV” model in some instances, and the .h5 model files for the Nullrecurrent model are labeled “ov”.

  • The DegScore-XGBoost model was called the “BT” model in Eterna analysis.

Organization

scripts: Python scripts to perform inference.

notebooks: Python notebooks to perform inference.

model_files: The .h5 model files used at inference time.

data: Data corresponding to Kaggle challenge and to subsequent tests on mRNAs.

data/Kaggle_RYOS_data

This directory contains the training and test sets in .csv and .json form.

Kaggle_RYOS_trainset_prediction_output_Sep2021.txt contains predictions from the Nullrecurrent code in this repository.

Model MCRMSEs were evaluated by uploading submissions to the Kaggle competition website at www.kaggle.com/c/stanford-covid-vaccine.
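For a quick local check, MCRMSE (mean columnwise root mean squared error) can be sketched as below. This is only an illustration of the metric's definition, not the Kaggle scoring code; the function name and the row-by-column list layout are assumptions made for this example.

```python
import math


def mcrmse(y_true, y_pred):
    """Mean columnwise RMSE: compute the RMSE of each target column, then average.

    y_true and y_pred are lists of rows; each row holds one value per target
    column. (Hypothetical helper, not part of this repository.)
    """
    n_rows = len(y_true)
    n_cols = len(y_true[0])
    col_rmse = []
    for j in range(n_cols):
        # Sum of squared errors for column j across all rows.
        sq_err = sum((y_true[i][j] - y_pred[i][j]) ** 2 for i in range(n_rows))
        col_rmse.append(math.sqrt(sq_err / n_rows))
    return sum(col_rmse) / n_cols
```

The reported scores themselves still come from uploading submissions to the competition website.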

data/mRNA_233x_data

This directory contains the original data and scripts to reproduce the model analysis from the manuscript.

Because the original formats all differ slightly, the reformat_*.py scripts read in the original formats and reformat them into two forms for each prediction, “FULL” and “PCR”, in the directory formatted_predictions.

“FULL” contains per-nucleotide predictions for all nucleotides. “PCR” contains the same predictions with the regions outside the RT-PCR sequencing set to NaN.
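The relationship between the two forms can be sketched as follows. This is not the code in the reformat_*.py scripts; the function name and the `start`/`end` indices are hypothetical, and the actual region boundaries come from the data in this directory.

```python
import math


def mask_outside_region(full_preds, start, end):
    """Return a copy of per-nucleotide predictions with every position
    outside the half-open index range [start, end) set to NaN.

    start and end are hypothetical 0-based indices of the retained region.
    """
    nan = float("nan")
    return [p if start <= i < end else nan for i, p in enumerate(full_preds)]


full = [0.1, 0.2, 0.3, 0.4, 0.5]          # "FULL"-style predictions
pcr = mask_outside_region(full, 1, 4)     # "PCR"-style predictions
print(pcr)
```

NaN-aware statistics can then skip the masked positions when predictions are compared against measurements.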

Running python collate_predictions.py reads in all the data and writes all_predictions_233x.csv.

RegenerateFigure5.ipynb reproduces the final scatterplot comparisons.

posthoc_code_predictions contains predictions from the Nullrecurrent model contained in this repository. To generate these predictions, use the sequence file in the mRNA_233x_data folder and run the following command(s):

python scripts/nullrecurrent_inference.py -d deg_Mg_pH10 -i 233_sequences.txt -o 233x_nullrecurrent_output_Oct2021_deg_Mg_50C.txt

and likewise for the other degradation conditions.

Dependencies

Install via pip install -r requirements.txt or conda install --file requirements.txt.

EternaFold, ViennaRNA, and Arnie are not pip-installable; see Setup below.

Setup

  1. Install git-lfs (best to do before git-cloning this KaggleOpenVaccine repo).

  2. Install EternaFold (the nullrecurrent model uses this), available for free noncommercial use here.

  3. Install ViennaRNA (the DegScore-XGBoost model uses this), available here.

  4. Git clone Arnie, which wraps EternaFold in python and allows RNA thermodynamic calculations across many packages. Follow instructions here to link EternaFold to it.

  5. Set the path to this repository as KOV_PATH (so that the scripts can find the stored model files):

export KOV_PATH='/path/to/KaggleOpenVaccine'
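A minimal sketch of how a Python script could resolve the model directory from this variable is shown below. How the inference scripts actually read KOV_PATH is an assumption here; the helper name `model_dir` is hypothetical, and only the `model_files` directory name is taken from the Organization section.

```python
import os
from pathlib import Path


def model_dir(env=os.environ):
    """Return the model_files directory under KOV_PATH.

    Raises a clear error if the variable is missing, which is the most
    likely cause of "model file not found" failures.
    (Hypothetical helper, not part of this repository.)
    """
    root = env.get("KOV_PATH")
    if root is None:
        raise RuntimeError(
            "KOV_PATH is not set; export it to the KaggleOpenVaccine checkout"
        )
    return Path(root) / "model_files"
```

Passing the environment as an argument keeps the helper easy to test without changing the real shell environment.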

Usage

To run the Nullrecurrent winning solution on one construct, given in example.txt, run:

python scripts/nullrecurrent_inference.py [-d deg] -i example.txt -o predict.txt

where deg is one of the following options:

deg_Mg_pH10
deg_pH10
deg_Mg_50C
deg_50C
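To predict all four conditions for the same input, the command lines can be generated in a loop. The sketch below only builds and prints the commands; the output naming scheme (`<prefix>_<deg>.txt`) and the helper name are hypothetical, while the script path and flags are those shown above.

```python
import shlex

# The four -d options listed above.
DEG_OPTIONS = ["deg_Mg_pH10", "deg_pH10", "deg_Mg_50C", "deg_50C"]


def build_commands(input_file, output_prefix):
    """Build one nullrecurrent_inference.py command per degradation condition.

    Output files are named <output_prefix>_<deg>.txt (a hypothetical scheme).
    """
    commands = []
    for deg in DEG_OPTIONS:
        commands.append([
            "python", "scripts/nullrecurrent_inference.py",
            "-d", deg,
            "-i", input_file,
            "-o", f"{output_prefix}_{deg}.txt",
        ])
    return commands


for cmd in build_commands("example.txt", "predict"):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the repository root to actually run the inference.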


Similarly, for the DegScore-XGBoost model:

python scripts/degscore-xgboost_inference.py -i example.txt -o predict.txt

This writes a text file of output predictions to predict.txt:

(Nullrecurrent output)

2.1289976365, 2.650808962, 2.1869660805000004

(DegScore-XGBoost output)

0.2697107, 0.37091506, 0.48528114
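For downstream analysis, output in this form can be read back into Python with a few lines. The sketch assumes the file holds comma-separated floating-point values exactly as in the examples above; the helper names are hypothetical.

```python
from pathlib import Path


def parse_predictions(text):
    """Parse comma-separated prediction values into a list of floats.

    Empty tokens (e.g. from a trailing comma or newline) are skipped.
    """
    return [float(tok) for tok in text.split(",") if tok.strip()]


def read_predictions(path):
    """Read a prediction file such as predict.txt and parse its contents."""
    return parse_predictions(Path(path).read_text())


print(parse_predictions("0.2697107, 0.37091506, 0.48528114"))
```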

A note on energy model versions

The predictions in the Kaggle competition and for the manuscript were performed with EternaFold parameters and CONTRAfold-SE code. The currently available EternaFold code will result in slightly different values. For more on the difference, see the EternaFold README.

Individual Kaggle Solutions

This code is based on the winning solution for the Open Vaccine Kaggle Competition Challenge. The competition can be found here:

www.kaggle.com/c/stanford-covid-vaccine/overview

This code is also the supplementary material for the Kaggle Competition Solution Paper. The individual Kaggle writeups for the top solutions that have been featured in that paper can be found in the following table:
