LAMMPS hangs with OpenMPI

Dear all,

I am compiling LAMMPS 8Feb23 on an old cluster. Here are the details:

OS: Linux "Ubuntu 16.04.4 LTS" 4.13.0-39-generic
Compiler: GNU C++ 5.4.0 20160609 with OpenMP not enabled
C++ standard: C++11
MPI v3.1: Open MPI v4.1.5, package: Open MPI otello@vikos Distribution, ident: 4.1.5, repo rev: v4.1.5, Feb 23, 2023 

I have compiled and installed a local copy of OpenMPI 4.1.5 and updated the PATH and LD_LIBRARY_PATH variables accordingly. I can compile and run a simple MPI program:

$ mpirun -version
mpirun (Open MPI) 4.1.5
$ mpirun -np 4 ./mpi_hello_world
Hello world from processor vikos, rank 0 out of 4 processors
Hello world from processor vikos, rank 2 out of 4 processors
Hello world from processor vikos, rank 3 out of 4 processors
Hello world from processor vikos, rank 1 out of 4 processors
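
The environment setup looks like this (the $HOME/openmpi-4.1.5 prefix is just illustrative of where I installed it; adjust to the actual location):

export PATH=$HOME/openmpi-4.1.5/bin:$PATH
export LD_LIBRARY_PATH=$HOME/openmpi-4.1.5/lib:$LD_LIBRARY_PATH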

I compiled LAMMPS with make, as the system CMake (3.5.1) is too old:

make yes-ASPHERE yes-CLASS2 yes-EXTRA-DUMP yes-KSPACE yes-MISC yes-MOLECULE yes-SRD yes-RIGID
make -j4 g++_openmpi

These are the compiling options:

mpicxx -std=c++11 -g -O3  -DLAMMPS_GZIP  -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1    -c 

The problem is that the binary runs fine on one processor but hangs at random points when using multiple cores. E.g. from examples/ASPHERE/ellipsoid

mpirun -np 1 lmp_8Feb23 -in in.ellipsoid # works fine :)
mpirun -np 2 lmp_8Feb23 -in in.ellipsoid # hangs :(

I know old machines and old OSes are not my friends, but I promise they will be used to do sound science! 🧮


Why not use the system-provided MPI library? LAMMPS’ requirements with respect to the MPI standard are quite moderate, so I doubt the issue you are seeing is LAMMPS-specific.

What kind of interconnects does the cluster have?
Have you tried using TCP/IP communication only?
mpirun -mca btl tcp,self ...
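
For example, using the binary and input from above:

mpirun -mca btl tcp,self -np 2 lmp_8Feb23 -in in.ellipsoid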


The system installation of OpenMPI is 3.0.0. When I compile LAMMPS with this one, I get the following error:

An error occurred in MPI_Init on a NULL communicator
MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
and potentially your MPI job) Local abort before MPI_INIT completed completed
successfully, but am not able to aggregate error messages, and not able to
guarantee that all other processes were killed!

I have no information about the cluster architecture (no documentation from the university hosting it), but forcing TCP/IP communication as you suggested solved the issue! Thank you so much.

I am not sure it solves the actual problem. But the fact that TCP/IP works while you get an MPI_Init error with the original MPI library hints that there is some issue with the configuration of the “fast” interconnect (assuming there is one, that is).

TCP/IP is not a good choice for communication for multi-node parallel jobs with LAMMPS. Its latency is too large and thus makes the many small communications of LAMMPS rather inefficient.

Running inside a single node, or perhaps across 2 to 4 nodes (assuming such an old machine won’t have too many cores per node), should still work. For more efficient intra-node communication you should use --mca btl tcp,vader,self.
To make this more permanent and save typing effort, you can also set an environment variable in your profile:
export OMPI_MCA_btl="tcp,vader,self"
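
For example, appending it to ~/.bashrc (assuming bash is your login shell):

echo 'export OMPI_MCA_btl="tcp,vader,self"' >> ~/.bashrc

You can list the BTL components your OpenMPI build actually provides with ompi_info | grep btl.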

P.S.: Why anyone in their right mind would want to run a cluster with Ubuntu as the OS is beyond my understanding. Ubuntu’s design and setup choices are acceptable (albeit at times quite irritating) in a desktop setting, but in a server environment it is pure folly. I have inherited a couple of servers running Ubuntu that I do not have the resources (yet) to replace, but those are driving me quite mad whenever I have to work on them. For the most part, though, we launch applications in containers there.

A look at the output of lspci (also on the compute node) can be very helpful.
Specifically, I would try:
lspci | grep -i control | grep -v -i memory
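
To look specifically for fast-interconnect hardware, one could also grep for common vendor and technology names (this list is not exhaustive):

lspci | grep -i -e infiniband -e mellanox -e omni-path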

Awesome! I will report how it works on multiple nodes.
P.S.: I am just grateful that someone is paying the electricity bill to let me play with my simulations 🙂
