PardisoLU fails with unspecific message

joergbuchwald · April 8, 2020, 10:34pm

Hi,
I compiled OGS with MKL support in order to use PardisoLU from the Eigen library.
Unfortunately, the solver always fails with a very unspecific message:

info: ------------------------------------------------------------------
info: *** Eigen solver computation
info: -> scale
info: -> solve with PardisoLU
error: Failed during Eigen linear solver initialization
info: ------------------------------------------------------------------
info: [time] Linear solver took 2.27617 s.
error: Newton: The linear solver failed.
info: [time] Solving process #0 took 2.35626 s in time step #1 
error: The nonlinear solver failed in time step #1 at t = 5000 s for process #0.
info: [time] Output of timestep 1 took 0.00990845 s.
error: Time stepper cannot reduce the time step size further. at TimeLoop.cpp, line 667
info: OGS terminated on 2020-04-09 00:06:05+020
error: OGS terminated with error

I thought debug mode might help, but it does not give any further information.
Does anyone have an idea what might be the cause?

Thomas_Nagel · April 9, 2020, 3:36pm

For future reference:
@joergbuchwald and I solved the issue. The problem occurred because a second MKL installation from a package manager was present. One has to make sure to link against the native Intel MKL, i.e. that all paths in CMake are properly set to that version.

Rui_Feng · March 29, 2021, 2:28pm

Hi,
I am a novice at OGS6. I ran into a problem similar to yours, but I am not sure it is the same one.
I did not compile OGS myself; I just downloaded the zip from GitLab to run a prj file I wrote. I tried to solve the problem as you described, but it did not work.
Could you give me more details on how to deal with this problem, or any other suggestions?

Thank you

info: ------------------------------------------------------------------
info: [time] Linear solver took 3.51127 s.
info: Convergence criterion, component 0: |dx|=1.5533e-07, |x|=3.5545e+04, |dx|/|x|=4.3699e-12
info: Convergence criterion, component 1: |dx|=2.8061e+08, |x|=1.9345e+24, |dx|/|x|=1.4505e-16
info: Convergence criterion, component 2: |dx|=6.8381e-02, |x|=1.1999e+14, |dx|/|x|=5.6991e-16
info: Convergence criterion, component 3: |dx|=2.5835e-02, |x|=5.8525e+13, |dx|/|x|=4.4143e-16
info: [time] Iteration #50 took 3.94437 s.
info: [time] Solving process #0 took 200.756 s in time step #1
error: The nonlinear solver failed in time step #1 at t = 4.32e+06 s for process #0.
info: [time] Output of timestep 1 took 0.208888 s.
critical: E:/gitlab/builds/XBgsxgtH/0/ogs/ogs/ProcessLib/TimeLoop.cpp:735 ProcessLib::TimeLoop::solveUncoupledEquationSystems()
error: Time stepper cannot reduce the time step size further.
info: OGS terminated on 2021-03-29 22:13:36+0800.
error: OGS terminated with error.

tjlyh777 · March 29, 2021, 2:43pm

Here is Rui’s prj file. I helped him to update it. THM_prj.zip (367.2 KB)

joergbuchwald · March 29, 2021, 3:02pm

Hi Rui_Feng,
first of all, your problem is not related to the topic: you are not using the PardisoLU solver, and in your case it is also not the direct solver that failed.
I think there is no real big issue with your problem, as the relative error already seems to be quite low. The main problem is your convergence setting: you are using absolute tolerances with an extremely low tolerance for the pressure (second entry), which cannot be met numerically because you are already at 10^-16 for the relative error. If you use relative tolerances instead (<reltols> instead of <abstols>) with a threshold of e.g. 10^-10, it should work.
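
To illustrate the point about tolerances, here is a minimal Python sketch that uses the norms of component 1 from the log above. The values of abstol and reltol are hypothetical and chosen only for illustration, since the actual settings of the prj file are not shown in this thread.

```python
import sys

# Norms for component 1, copied from the log output above.
dx = 2.8061e+08
x = 1.9345e+24

# Double precision carries roughly 16 significant digits, so |dx| cannot
# realistically drop much below eps * |x|.
eps = sys.float_info.epsilon      # about 2.2e-16
roundoff_floor = eps * x          # rough lower bound for a meaningful |dx|

abstol = 1e-5    # hypothetical absolute tolerance, for illustration only
reltol = 1e-10   # relative tolerance of the size suggested above

abs_ok = dx < abstol        # absolute criterion: |dx| < abstol
rel_ok = dx / x < reltol    # relative criterion: |dx|/|x| < reltol

print(f"|dx|/|x|                 = {dx / x:.4e}")
print(f"round-off floor for |dx| = {roundoff_floor:.4e}")
print(f"absolute criterion met: {abs_ok}")
print(f"relative criterion met: {rel_ok}")
```

With these assumed values, |dx| is already at the round-off floor for this |x|, so further Newton iterations cannot push it below a small absolute tolerance, while the relative criterion is met by a wide margin.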

renchao.lu · September 13, 2021, 12:46pm

I encountered the same issue on envinf2 with the latest master. From the solution above, it is not clear to me what needs to be set in CMake.
Below are the MKL-related options I set in CMake:
MKL_DIR: /opt/intel/mkl
MKL_LIB_CORE: /opt/intel/mkl/lib/intel64/libmkl_core.so
MKL_LIB_INTEL: /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so
MKL_LIB_THREAD: /opt/intel/mkl/lib/intel64/libmkl_gnu_thread.so

Does anyone have an idea?

joergbuchwald · September 13, 2021, 1:40pm

For me, it was enough to set only MKL_DIR, without sourcing mklvars.
If I run source /opt/intel/mkl/bin/mklvars.sh intel64 prior to the CMake configure step (I think setting MKL_LIB_CORE, MKL_LIB_INTEL and MKL_LIB_THREAD will likely have the same effect), I get the above-mentioned error. The main difference seems to be that if the paths to the shared libraries are set prior to the CMake configure step, BLAS uses MKL, which seems to cause the error in my case.

renchao.lu · September 13, 2021, 6:56pm

@joergbuchwald Thanks for your prompt reply and for kindly sharing your workaround. Unfortunately, it doesn’t work for me…

joergbuchwald · September 13, 2021, 7:04pm

Do you also have your distro's MKL installed? Did you check with ccmake which paths were set?

renchao.lu · September 17, 2021, 12:06pm

Thanks to Dima for helping me out! To use the PardisoLU solver on envinf2, one may compile the source code with the command below:

CC=clang CXX=clang++ cmake -S path_to_ogs_source_directory -B path_to_ogs_build_directory --preset=release -DOGS_USE_MKL=On -DCMAKE_BUILD_SHARED_LIBS=On

Note that this issue seems to be caused by different OpenMP libraries being used, or something like that.

FZill · November 10, 2021, 12:02pm

I am experiencing the same problem right now on envinf1. The last command from @renchao.lu sadly did not make a difference for me. I also tried removing the paths for MKL_LIB_CORE, MKL_LIB_INTEL and MKL_LIB_THREAD, but this didn’t work either. Any help would be appreciated.

joergbuchwald · November 10, 2021, 12:30pm

I think it is about the MKL paths for the BLAS library. So you could try to delete them and reconfigure.
To be on the safe side, clean up the build directory first.

FZill · November 10, 2021, 1:08pm

@joergbuchwald If you mean the three MKL variables I listed above: I cleared my build directory and rebuilt OGS twice, once with their paths set and once with the paths deleted during configuration. Neither worked.

joergbuchwald · November 10, 2021, 1:28pm

No, I mean the BLAS variables that contain MKL paths.

FZill · November 10, 2021, 1:45pm

Thanks, it works now. Press t in ccmake to toggle the advanced mode and see the BLAS variables with the MKL paths. I deleted them, rebuilt, and now it runs fine.
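
As an alternative to scrolling through ccmake, the BLAS entries that point to MKL can also be listed from CMakeCache.txt in the build directory. This is only a rough Python sketch: it relies on the NAME:TYPE=VALUE layout of the cache file, and the entry names and paths in the sample are assumptions for illustration, not taken from an actual OGS build.

```python
def find_blas_mkl_entries(cache_text):
    """Return (name, value) pairs of BLAS/LAPACK cache entries whose value mentions MKL."""
    hits = []
    for line in cache_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines ('#' and '//' in CMakeCache.txt).
        if not line or line.startswith(("#", "//")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        name = key.split(":", 1)[0]  # drop the ':TYPE' suffix
        if name.upper().startswith(("BLAS", "LAPACK")) and "mkl" in value.lower():
            hits.append((name, value))
    return hits


# Illustrative cache excerpt; entry names and paths are assumptions.
sample = """\
// Path to a library.
BLAS_mkl_core_LIBRARY:FILEPATH=/opt/intel/mkl/lib/intel64/libmkl_core.so
MKL_DIR:PATH=/opt/intel/mkl
CMAKE_BUILD_TYPE:STRING=Release
"""

for name, value in find_blas_mkl_entries(sample):
    print(f"{name} -> {value}")
```

To check a real build, read the file instead of the sample, e.g. find_blas_mkl_entries(open("CMakeCache.txt").read()) when run from the build directory. The entries it lists are the candidates to clear before reconfiguring.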

joergbuchwald · May 18, 2022, 4:36pm

The error can be caused by the wrong order of libgomp.so and libiomp5.so. See GitLab issue #3296 in ogs / ogs, PardisoLU fails with "Failed during Eigen linear solver initialization" message, for further details.
If ninja is used for building, this script can be used to correct it. It needs to be run in the build directory. After that, ninja needs to be (re-)executed.
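
To see whether both OpenMP runtimes end up in a link command at all, and in which order, something like the following Python sketch may help. It is not the script mentioned above, and it does not decide which order is the correct one (see the linked issue for that). The sample link line is hypothetical.

```python
import re


def openmp_runtime_order(text):
    """Return the OpenMP runtimes (gomp, iomp5) in order of first appearance."""
    seen = []
    # Matches both full library names (libgomp.so) and linker flags (-lgomp).
    for match in re.finditer(r"(?:lib|-l)(gomp|iomp5)\b", text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


# Hypothetical link line, for illustration only.
sample = (
    "clang++ -o bin/ogs main.o /usr/lib/libgomp.so "
    "/opt/intel/lib/intel64/libiomp5.so -lmkl_core"
)

print(openmp_runtime_order(sample))
```

The same function can be applied to the text of build.ninja or to the output of ldd for the ogs binary. If it reports both gomp and iomp5, two different OpenMP runtimes are being mixed.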