Mpmtest error on TACC LS5

I compiled the code from the origin/develop branch using the following commands:

module load boost hdf5 qt5 vtk
module load intel/18.0.2
export CC=icc
export CXX=icpc
cmake -DBOOST_ROOT=$TACC_BOOST_DIR -DBOOST_INCLUDE_DIRS=$TACC_BOOST_INC -DCMAKE_BUILD_TYPE=Release -DEIGEN3_INCLUDE_DIR=$WORK/eigen -DNO_KAHIP=True -DHALO_EXCHANGE=Off ..               
make -j8

It compiled successfully, but when I run ./mpmtest I get the following error:

login1.ls5(1046)$ ./mpmtest
[Fri Jul 17 14:47:53 2020] [unknown] Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(638).......:  PMI2 init failed: 1
aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(638).......:  PMI2 init failed: 1

In addition, when I run my own simulation with this submit script:

#!/bin/bash
#SBATCH -J mpm_oso        # job name
#SBATCH -o hazel.o   # output and error file name (%j expands to jobID)
#SBATCH -N 2          # number of nodes requested
#SBATCH -n 4              # total number of mpi tasks requested
#SBATCH -p development   # queue (partition) -- normal, development, etc.
#SBATCH -t 2:00:00       # run time (hh:mm:ss)
#SBATCH -A Material-Point-Metho
# Slurm email notifications
#SBATCH --mail-user=user.email@utexas.edu
#SBATCH --mail-type=begin   # email me when the job starts
#SBATCH --mail-type=end     # email me when the job finishes
# run the executable named a.out
module load intel/18.0.2
module load boost hdf5 vtk
ibrun $SCRATCH/mpm/build/mpm -f $SCRATCH/path/ -i mpm.json

The simulation fails partway through at some timestep:

[2020-07-17 12:01:13.310] [MPMExplicit] [info] Step: 127 of 1000.
[2020-07-17 12:01:13.473] [MPMExplicit] [info] Step: 128 of 1000.

MPM main: map::at
Rank 3 [Fri Jul 17 12:01:14 2020] [c0-0c0s4n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
srun: error: nid00018: task 3: Exited with exit code 255
srun: Terminating job step 2962531.0
slurmstepd: error: *** STEP 2962531.0 ON nid00017 CANCELLED AT 2020-07-17T12:01:15 ***
srun: error: nid00018: task 2: Terminated
srun: error: nid00017: tasks 0-1: Terminated
srun: Force Terminated job step 2962531.0
TACC: MPI job exited with code: 143

TACC: Shutdown complete. Exiting.

I tried different numbers of nodes and MPI tasks; it always fails at a different timestep with the same error. However, there are no errors when the simulation runs on 1 node.

For testing, could you please log in to a compute node using idev and then run the test with ibrun -n 1 ./mpmtest?
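For example (the queue, node count, and time limit below are just placeholders; the build path is taken from your job script):

idev -p development -N 1 -n 1 -m 30
cd $SCRATCH/mpm/build
ibrun -n 1 ./mpmtest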

Are you sure you have the latest version? Could you print the git HEAD information: git rev-parse --short HEAD

Yeah, I tried it this morning.

login1.ls5(1057)$ git rev-parse --short HEAD
b14691d6

There are also errors on the compute node with ibrun -n 1 ./mpmtest.

Could you please load the impi/18.0.2 module, recompile the mpm code, and load these modules before you run it? Please see: Running issue on TACC
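Something along these lines, then reconfigure and rebuild from a clean build directory so CMake's MPI detection is redone against Intel MPI (the extra module names match what you loaded earlier):

module load intel/18.0.2 impi/18.0.2
module load boost hdf5 qt5 vtk
export CC=icc
export CXX=icpc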

Could you please post what modules are loaded, using module list?

Currently Loaded Modules:
  1) intel/18.0.2       4) autotools/1.2   7) TACC/1.0        10) hdf5/1.8.21  13) vtk/8.2.0
  2) cray_mpich/7.7.3   5) cmake/3.16.1    8) python2/2.7.15  11) qt5/5.11.2
  3) git/2.24.1         6) xalt/2.8        9) boost/1.64      12) swr/18.3.3

I can't load impi/18.0.2. Am I missing something?

login1.ls5(1125)$ module load impi/18.0.2
Lmod has detected the following error:  The following module(s) are unknown: "impi/18.0.2"

If I use module load impi-largemem/18.0.2, the following modules become inactive.

Currently Loaded Modules:
  1) intel/18.0.2   3) autotools/1.2   5) xalt/2.8   7) boost/1.64    9) impi-largemem/18.0.2
  2) git/2.24.1     4) cmake/3.16.1    6) TACC/1.0   8) hdf5/1.8.21  10) python2/2.7.15
Inactive Modules:
  1) qt5   2) swr   3) vtk

That also compiles successfully, but I can't run ./mpmtest or my job.

Could you try running on Stampede2?

I tried on Stampede2 with the following commands:

module load boost hdf5 qt5 vtk
module load intel/18.0.2
module load impi/18.0.2
export CC=icc
export CXX=icpc
cmake -DBOOST_ROOT=$TACC_BOOST_DIR -DBOOST_INCLUDE_DIRS=$TACC_BOOST_INC -DCMAKE_BUILD_TYPE=Release -DEIGEN3_INCLUDE_DIR=$HOME/eigen -DKAHIP_ROOT=$HOME/KaHIP/     -DHALO_EXCHANGE=Off ..
make -j

It compiled successfully. Then I logged in with idev and ran the test with ibrun -n 1 ./mpmtest. It shows the following:

TACC Stampede2 System
Provisioned on 24-May-2017 at 11:49

c455-133[knl](1)$ ibrun -n 1 ./mpmtest
TACC:  Starting up job 6075388
TACC:  Starting parallel tasks...
[2020-07-17 19:59:26.524] [cell2d::0] [error] /scratch/07277/lyowsn/mpm/include/cell.tcc #58: Specified number of nodes for a cell is not present
[2020-07-17 19:59:26.525] [cell2d::0] [error] /scratch/07277/lyowsn/mpm/include/cell.tcc #148: Number nodes in a cell exceeds the maximum allowed per cell
[2020-07-17 19:59:26.526] [cell2d::0] [error] /scratch/07277/lyowsn/mpm/include/cell.tcc #58: Specified number of nodes for a cell is not present
[2020-07-17 19:59:26.526] [cell2d::0] [error] /scratch/07277/lyowsn/mpm/include/cell.tcc #148: Number nodes in a cell exceeds the maximum allowed per cell
[2020-07-17 19:59:26.526] [cell2d::0] [error] /scratch/07277/lyowsn/mpm/include/cell.tcc #58: Specified number of nodes for a cell is not present
[2020-07-17 19:59:26.526] [cell2d::0] [error] /scratch/07277/lyowsn/mpm/include/cell.tcc #148: Number nodes in a cell exceeds the maximum allowed per cell
[2020-07-17 19:59:26.527] [cell2d::0] [error] /scratch/07277/lyowsn/mpm/include/cell.tcc #58: Specified number of nodes for a cell is not present

The module list is:

Currently Loaded Modules:
  1) git/2.24.1      4) xalt/2.8     7) intel/18.0.2     10) hdf5/1.10.4     13) vtk/8.1.1
  2) autotools/1.1   5) TACC         8) libfabric/1.7.0  11) impi/18.0.2
  3) cmake/3.16.1    6) qt5/5.11.2   9) boost/1.68       12) python2/2.7.15

That looks like everything is working fine.

Yes, it works fine on Stampede2. Many thanks!

For LS5, please use the cray_mpich/7.7.3 module instead of impi, and export:

export CC=icc
export CXX=icpc
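
For reference, a full rebuild on LS5 would look roughly like this (same cmake flags as your earlier LS5 configure; the build path is assumed from your job script, and the old CMake cache is wiped so MPI detection is redone against cray_mpich):

module load intel/18.0.2 cray_mpich/7.7.3
module load boost hdf5 qt5 vtk
export CC=icc
export CXX=icpc
cd $SCRATCH/mpm/build
rm -rf CMakeCache.txt CMakeFiles
cmake -DBOOST_ROOT=$TACC_BOOST_DIR -DBOOST_INCLUDE_DIRS=$TACC_BOOST_INC -DCMAKE_BUILD_TYPE=Release -DEIGEN3_INCLUDE_DIR=$WORK/eigen -DNO_KAHIP=True -DHALO_EXCHANGE=Off ..
make -j8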

Although this fails the test, the code seems to run. Let me know if you can run properly on LS5.

It says I have invoked an unsupported MPI job launch command:

[ ERROR ] You have invoked an unsupported MPI job launch command:
[ ERROR ]   /opt/apps/tacc/bin/mpiexec
[ ERROR ] Lonestar5 uses the ibrun MPI job launcher.
[ ERROR ] For more information on appropriate ibrun command options,
[ ERROR ] please visit our user guide here:
[ ERROR ] https://portal.tacc.utexas.edu/user-guides/lonestar5

But I changed nothing except -n and -N. It works fine with 1 node.

#!/bin/bash
#SBATCH -J mpm_oso        # job name
#SBATCH -o hazel.o   # output and error file name (%j expands to jobID)
#SBATCH -N 2          # number of nodes requested
#SBATCH -n 4              # total number of mpi tasks requested
#SBATCH -p development   # queue (partition) -- normal, development, etc.
#SBATCH -t 2:00:00       # run time (hh:mm:ss)
#SBATCH -A Material-Point-Metho
# Slurm email notifications
#SBATCH --mail-user=user.email@utexas.edu
#SBATCH --mail-type=begin   # email me when the job starts
#SBATCH --mail-type=end     # email me when the job finishes
# run the executable named a.out
module load intel/18.0.2
module load cray_mpich/7.7.3
module load boost hdf5 vtk
ibrun $SCRATCH/mpm/build/mpm -f $SCRATCH/oso/hazel_coarse_efficiency_develop/ -i mpm-hazel_stage1.json

If you are still facing an issue, could you open a support ticket with TACC and CC me? https://portal.tacc.utexas.edu/tacc-consulting

My simulation works fine with cray_mpich/7.7.3 on LS5, although mpmtest still fails. Many thanks!