Increasing and excessive memory use with OpenCL on AMD GPUs – User discussions

GROMACS version: 2023.1

I successfully compiled GROMACS on an AMD GPU cluster. Each node has 8 GPUs, so I’m running 8 simultaneous simulations on each node. As the simulations proceed, memory usage keeps increasing until some of the jobs die. I noticed that this issue came up back in 2018: memory leak in OpenCL runs with – Redmine #2470 (#2470) · Issues · GROMACS / GROMACS · GitLab. Following the suggestion in that post, I set the GMX_DISABLE_GPU_TIMING environment variable. However, the problem persists. Any suggestions for avoiding the memory leak? Thanks!
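
For reference, the launch pattern described above looks roughly like the sketch below: one `gmx mdrun` process per GPU, each with GMX_DISABLE_GPU_TIMING set in its environment. This is only an illustration, not the actual job script. The run names (`run0` … `run7`), the 8 OpenMP threads per job, and the pin offsets are assumptions and would need to match the real inputs and core count. The function only builds the commands; it does not start anything.

```python
import os
import shlex

N_GPUS = 8  # from the post: 8 GPUs per node, one simulation per GPU


def build_jobs(n_gpus=N_GPUS, threads_per_job=8, base_env=None):
    """Return one (command, environment) pair per GPU. Nothing is executed.

    threads_per_job and the run names are hypothetical placeholders.
    """
    env_base = dict(os.environ if base_env is None else base_env)
    jobs = []
    for gpu in range(n_gpus):
        env = dict(env_base)
        # Workaround suggested in issue #2470; setting the variable is enough.
        env["GMX_DISABLE_GPU_TIMING"] = "1"
        cmd = [
            "gmx", "mdrun",
            "-deffnm", f"run{gpu}",         # hypothetical file prefix
            "-ntmpi", "1",                  # one thread-MPI rank per run
            "-ntomp", str(threads_per_job),
            "-gpu_id", str(gpu),            # bind this run to a single GPU
            "-pin", "on",
            "-pinoffset", str(gpu * threads_per_job),  # keep runs on separate cores
        ]
        jobs.append((cmd, env))
    return jobs


if __name__ == "__main__":
    # Print the commands for inspection; launching them, e.g. with
    # subprocess.Popen(cmd, env=env), is left to the job script.
    for cmd, _env in build_jobs():
        print(shlex.join(cmd))
```

Passing the environment per process, rather than exporting the variable once in the shell, makes it easy to confirm that every one of the 8 runs actually sees GMX_DISABLE_GPU_TIMING.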

GROMACS version: 2023.1
Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: OpenCL
NB cluster size: 8
SIMD instructions: AVX2_256
CPU FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library: VkFFT internal (1.2.26-b15cb0ca3e884bdb6c901a12d87aa8aadf7637d8) with OpenCL backend
Multi-GPU FFT: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/tce/bin/cc GNU 10.3.1
C compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler: /usr/tce/bin/c++ GNU 10.3.1
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-variable -Wno-newline-eof -Wno-old-style-cast -Wno-zero-as-null-pointer-constant -Wno-unused-but-set-variable -Wno-sign-compare -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
BLAS library: External – detected on the system
LAPACK library: External – detected on the system
OpenCL include dir: /opt/rocm-5.4.3/include
OpenCL library: /opt/rocm-5.4.3/lib/libOpenCL.so
OpenCL version: 2.2
