Hi,
I was trying to fold a large complex with over 7,000aa in total. I ran the job on our cluster with the following settings: 460G RAM, 48 CPUs, but the job failed in the end because of the following error: RuntimeError: Resource exhausted: Out of memory while trying to allocate 294970604000 bytes.
I don’t know if it’s a Jax/tensorflow error, but I tried lowering “XLA_PYTHON_CLIENT_MEM_FRACTION” to “0.1” and setting “XLA_PYTHON_CLIENT_PREALLOCATE=false”. However, these didn’t solve the problem.
Could this be caused by some internal memory exhaustions in the program, with software buffers being overflowed?
Any suggestions on how to solve it would be highly appreciated!
Thank you in advance!
The following is the tail information of the log file.
$ tail AF2_slurm_57293_snoke5_alphafold_57293_4294967294.l
sys.exit(main(argv))
File “/opt/bioxray/programs/anaconda3/envs/alphafold/alphafold/run_alphafold.py”, line 284, in main
predict_structure(
File “/opt/bioxray/programs/anaconda3/envs/alphafold/alphafold/run_alphafold.py”, line 149, in predict_structure
prediction_result = model_runner.predict(processed_feature_dict)
File “/opt/bioxray/programs/anaconda3/envs/alphafold/alphafold/alphafold/model/model.py”, line 133, in predict
result = self.apply(self.params, jax.random.PRNGKey(0), feat)
File “/opt/bioxray/programs/anaconda3/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py”, line 912, in _execute_compiled
out_bufs = compiled.execute(input_bufs)
RuntimeError: Resource exhausted: Out of memory while trying to allocate 294970604000 bytes.
Read more here: Source link