I am having problems with parallelized simulations of ogs5.7.1 on the UFZ-EVE-cluster. My reactive transport simulations (using IPQC and MPI) crash when run on more than 8 cores, with the following error message:
ORTE has lost communication with its daemon located on node:
  hostname: node033
This is usually due to either a failure of the TCP network connection to the node, or possibly an internal failure of the daemon itself. We cannot recover from this failure, and therefore will terminate the job.
Has anyone experienced this error already?
It seems to be a problem related to node communication on the cluster; however, the simulations with 4 and 8 cores finished. The benchmark isofrac_2d using 20 cores finishes as well.
Sometimes the model crashed after 5 min, sometimes after 2 h.
I am a little confused now, as I do not change any input files between the simulations except *.ddc.