Codebase of deep learning models for inferring the stability of mRNA molecules, corresponding to the Kaggle Open Vaccine Challenge and the accompanying manuscript “Predictive models of RNA degradation through dual crowdsourcing”, Wayment-Steele et al. (2021) (full citation when available).
Models contained here are:
“Nullrecurrent”: A reconstruction of the winning solution by Jiayang Gao. Link to original notebooks provided below.
“DegScore-XGBoost”: A model based on the original DegScore model and XGBoost.
NB on other historic names for models
The Nullrecurrent model was called the “OV” model in some instances, and the .h5 model files for the Nullrecurrent model are labeled “ov”.
The DegScore-XGBoost model was called the “BT” model in Eterna analysis.
scripts: Python scripts to perform inference.
notebooks: Python notebooks to perform inference.
model_files: Stores the .h5 model files used at inference time.
data: Data corresponding to the Kaggle challenge and to subsequent tests on mRNAs.
This directory contains the training set and test sets in .csv and .json form.
Kaggle_RYOS_trainset_prediction_output_Sep2021.txt contains predictions from the Nullrecurrent code in this repository.
Model MCRMSEs were evaluated by uploading submissions to the Kaggle competition website at www.kaggle.com/c/stanford-covid-vaccine.
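For readers unfamiliar with the metric, MCRMSE is the mean column-wise root mean squared error: an RMSE is computed for each target column and the results are averaged. The following is a minimal sketch of that formula only. The function name and the list-of-rows input layout are assumptions made for illustration, and the sketch does not reproduce how Kaggle selects which columns or positions are scored.

```python
import math


def mcrmse(y_true, y_pred):
    """Mean column-wise RMSE.

    y_true, y_pred: lists of rows; each row holds one value per target column.
    (Hypothetical helper for illustration, not part of this repository.)
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and have the same length")
    n_rows = len(y_true)
    n_cols = len(y_true[0])
    rmses = []
    for j in range(n_cols):
        # Squared error summed over all rows for target column j
        sq_err = sum((y_true[i][j] - y_pred[i][j]) ** 2 for i in range(n_rows))
        rmses.append(math.sqrt(sq_err / n_rows))
    # Average the per-column RMSE values
    return sum(rmses) / n_cols


# Two targets with constant errors of 3 and 4: RMSEs are 3 and 4, mean is 3.5
print(mcrmse([[0.0, 0.0], [0.0, 0.0]], [[3.0, 4.0], [3.0, 4.0]]))
```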
This directory contains the original data and scripts to reproduce the model analysis from the manuscript.
Because the original formats all differ slightly, the reformat_*.py scripts read in the original formats and reformat them into two forms for each prediction, “FULL” and “PCR”, in the directory
“FULL” contains per-nucleotide predictions for all nucleotides. “PCR” has the positions outside the RT-PCR sequenced region set to NaN.
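The relationship between the two forms can be sketched as a simple masking step. This is an illustration under stated assumptions, not the code in the reformat_*.py scripts: the function name, the half-open `[start, end)` window convention, and the example values are all hypothetical.

```python
import math


def mask_outside_window(full_preds, start, end):
    """Return a copy of per-nucleotide predictions with every position
    outside the half-open window [start, end) replaced by NaN.

    (Hypothetical helper; the window convention is an assumption.)
    """
    return [p if start <= i < end else math.nan for i, p in enumerate(full_preds)]


full = [0.1, 0.2, 0.3, 0.4]          # "FULL"-style predictions (made-up values)
pcr = mask_outside_window(full, 1, 3)  # "PCR"-style: only positions 1 and 2 kept
print(pcr)
```

Keeping the masked list the same length as the full one means per-position comparisons stay aligned between the two forms.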
python collate_predictions.py reads in all the data and outputs
RegenerateFigure5.ipynb reproduces the final scatterplot comparisons.
posthoc_code_predictions contains predictions from the Nullrecurrent model code contained in this repository. To generate these predictions, use the sequence file in the mRNA_233x_data folder and run the following command(s):
python scripts/nullrecurrent_inference.py -d deg_Mg_pH10 -i 233_sequences.txt -o 233x_nullrecurrent_output_Oct2021_deg_Mg_50C.txt
pip install -r requirements.txt or
conda install --file requirements.txt.
Not pip-installable: EternaFold, Vienna, and Arnie, see below.
Install git-lfs (best to do before git-cloning this KaggleOpenVaccine repo).
Install EternaFold (the nullrecurrent model uses this), available for free noncommercial use here.
Install ViennaRNA (the DegScore-XGBoost model uses this), available here.
Git clone Arnie, which wraps EternaFold in Python and allows RNA thermodynamic calculations across many packages. Follow the instructions here to link EternaFold to it.
Add the path to this repository as KOV_PATH (so that the scripts can find the path to the stored model files):
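A minimal sketch of how a script might consume this setting, assuming KOV_PATH is an environment variable that points at the repository root. The helper name `model_dir` and the optional `env` argument are hypothetical and exist only to make the example easy to check; the actual lookup in the inference scripts may differ.

```python
import os
from pathlib import Path


def model_dir(env=None):
    """Resolve the model_files directory from KOV_PATH.

    Assumption: KOV_PATH is an environment variable holding the repo root.
    `env` lets callers pass a plain dict instead of os.environ.
    """
    env = os.environ if env is None else env
    root = env.get("KOV_PATH")
    if not root:
        raise RuntimeError("KOV_PATH is not set; point it at the repository root")
    return Path(root) / "model_files"


# Example with an explicit mapping instead of the real environment
print(model_dir({"KOV_PATH": "/path/to/KaggleOpenVaccine"}))
```

Failing early with a clear message when the variable is missing is usually easier to debug than a later file-not-found error on a .h5 file.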
To run the nullrecurrent winning solution on one construct, given in
python scripts/nullrecurrent_inference.py [-d deg] -i example.txt -o predict.txt
deg is one of the following options:
deg_Mg_pH10 deg_pH10 deg_Mg_50C deg_50C
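To predict all four degradation conditions for the same input, the command above can be assembled once per option. The sketch below only builds and prints the command lines; it does not run them. The output naming scheme (`<prefix>_<deg>.txt`) and the helper name are assumptions for illustration.

```python
import shlex

# The four -d options listed above
DEG_OPTIONS = ["deg_Mg_pH10", "deg_pH10", "deg_Mg_50C", "deg_50C"]


def build_commands(input_path, output_prefix):
    """Build one nullrecurrent_inference.py command per degradation option.

    (Hypothetical helper; the output file naming is an assumption.)
    """
    commands = []
    for deg in DEG_OPTIONS:
        commands.append([
            "python", "scripts/nullrecurrent_inference.py",
            "-d", deg,
            "-i", input_path,
            "-o", f"{output_prefix}_{deg}.txt",
        ])
    return commands


for cmd in build_commands("example.txt", "predict"):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the repository root once the dependencies are installed.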
Similarly, for the DegScore-XGBoost model:
python scripts/degscore-xgboost_inference.py -i example.txt -o predict.txt
This writes a text file of output predictions to
2.1289976365, 2.650808962, 2.1869660805000004
0.2697107, 0.37091506, 0.48528114
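Output in the form shown above can be read back into Python with a few lines. This is a sketch assuming each non-empty line holds comma-separated floating-point predictions, as in the two example lines; the function name is hypothetical, and real output files may carry additional structure not handled here.

```python
def parse_predictions(text):
    """Parse lines of comma-separated floats into a list of lists.

    Assumption: one line per set of per-nucleotide predictions.
    Blank lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # float() tolerates the space that follows each comma
        rows.append([float(token) for token in line.split(",")])
    return rows


example = (
    "2.1289976365, 2.650808962, 2.1869660805000004\n"
    "0.2697107, 0.37091506, 0.48528114\n"
)
for row in parse_predictions(example):
    print(len(row), row)
```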
A note on energy model versions
The predictions in the Kaggle competition and for the manuscript were performed with EternaFold parameters and CONTRAfold-SE code. The currently available EternaFold code will result in slightly different values. For more on the difference, see the EternaFold README.
Individual Kaggle Solutions
This code is based on the winning solution for the Open Vaccine Kaggle Competition Challenge. The competition can be found here:
This code is also the supplementary material for the Kaggle Competition Solution Paper. The individual Kaggle writeups for the top solutions that have been featured in that paper can be found in the following table: