I compiled the origin/develop branch with the following commands:
module load boost hdf5 qt5 vtk
module load intel/18.0.2
export CC=icc
export CXX=icpc
cmake -DBOOST_ROOT=$TACC_BOOST_DIR -DBOOST_INCLUDE_DIRS=$TACC_BOOST_INC -DCMAKE_BUILD_TYPE=Release -DEIGEN3_INCLUDE_DIR=$WORK/eigen -DNO_KAHIP=True -DHALO_EXCHANGE=Off ..
make -j8
It compiled successfully, but when I ran ./mpmtest, I got the following error:
login1.ls5(1046)$ ./mpmtest
[Fri Jul 17 14:47:53 2020] [unknown] Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(638).......: PMI2 init failed: 1
aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(638).......: PMI2 init failed: 1
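One thing I have not ruled out: the prompt above shows login1.ls5, which looks like a login node, and PMI2 initialization may not be available outside a Slurm job. Below is a minimal sketch of a guard for this, assuming SLURM_JOB_ID is set only inside a batch or interactive allocation. The helper names in_slurm_job and launch_hint are hypothetical and not part of mpm.

```shell
#!/bin/bash
# Hypothetical helpers (not part of mpm): check whether this shell runs inside a Slurm job.

in_slurm_job() {
  # Assumption: Slurm exports SLURM_JOB_ID inside batch and interactive allocations.
  [ -n "${SLURM_JOB_ID:-}" ]
}

launch_hint() {
  if in_slurm_job; then
    echo "inside Slurm job ${SLURM_JOB_ID}: launch with ibrun ./mpmtest"
  else
    echo "no Slurm job detected: request a compute node first (for example with sbatch or idev)"
  fi
}

launch_hint
```

If the hint reports no Slurm job, the MPI_Init failure might come from where the test was started rather than from the build itself.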
In addition, I ran my own simulation with the following submit script:
#!/bin/bash
#SBATCH -J mpm_oso # job name
#SBATCH -o hazel.o # output and error file name (%j expands to jobID)
#SBATCH -N 2 # number of nodes requested
#SBATCH -n 4 # total number of mpi tasks requested
#SBATCH -p development # queue (partition) -- normal, development, etc.
#SBATCH -t 2:00:00 # run time (hh:mm:ss) - 2 hours
#SBATCH -A Material-Point-Metho
# Slurm email notifications
#SBATCH --mail-user=user.email@utexas.edu
#SBATCH --mail-type=begin # email me when the job starts
#SBATCH --mail-type=end # email me when the job finishes
# run the mpm executable
module load intel/18.0.2
module load boost hdf5 vtk
ibrun $SCRATCH/mpm/build/mpm -f $SCRATCH/path/ -i mpm.json
The simulation fails partway through, at some timestep, with the following output:
[2020-07-17 12:01:13.310] [MPMExplicit] [info] Step: 127 of 1000.
[2020-07-17 12:01:13.473] [MPMExplicit] [info] Step: 128 of 1000.
MPM main: map::at
Rank 3 [Fri Jul 17 12:01:14 2020] [c0-0c0s4n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
srun: error: nid00018: task 3: Exited with exit code 255
srun: Terminating job step 2962531.0
slurmstepd: error: *** STEP 2962531.0 ON nid00017 CANCELLED AT 2020-07-17T12:01:15 ***
srun: error: nid00018: task 2: Terminated
srun: error: nid00017: tasks 0-1: Terminated
srun: Force Terminated job step 2962531.0
TACC: MPI job exited with code: 143
TACC: Shutdown complete. Exiting.
I tried different node and MPI task counts. The simulation always fails with the same error, but at a different timestep each time. However, it runs without errors on 1 node.
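To compare where each run fails, the last reported step can be pulled from the job output. This is a small sketch that assumes the log format shown above ("Step: N of M") and the output file hazel.o named in the submit script. The function name last_step is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the last step number reported in an MPMExplicit log file.
last_step() {
  # Keep only the "Step: N of M" fragments, take the final one, and print N.
  grep -o 'Step: [0-9]* of [0-9]*' "$1" | tail -n 1 | awk '{print $2}'
}

# Only query the job output if it exists in the current directory.
if [ -f hazel.o ]; then
  echo "last step reached: $(last_step hazel.o)"
fi
```

Running this after each failed job gives the step reached before the map::at abort, which makes it easy to see how the failure point varies with the node and task counts.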