Problem Context:
I've been running some chemical kinetics model-inadequacy calculations lately using QUESO. I built it successfully on our machine, and single-core runs seem to go well.
I want to run with many cores. I've read through the documentation, and I think I set everything up correctly; in fact, I am able to run successfully with 16 cores. Just to be clear, my forward solves are serial. When I say "run with many cores" I mean that I'm generating N_cores chains, each driven by a serial forward solve (section 4.3 of the manual).
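To be concrete, here's roughly how my input file and launch command request those chains. I'm sketching from memory: the env_numSubEnvironments option is how I understood the environment section of the manual, and the paths and values are just placeholders from my setup, so don't take this as an exact recipe:

```
# Launched as: mpirun -np 16 ./my_app input.txt
# One MPI process per subenvironment, so 16 independent chains,
# each driving its own serial forward solve.
env_numSubEnvironments = 16
env_subDisplayFileName = outputData/display
env_displayVerbosity   = 2
```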
Here's my problem:
One of the jobs gets stuck. This is not a QUESO problem; it's a known issue with the ODE solver I'm using and my mathematical formulation. It does impact my simulations, though. The other 15 jobs complete successfully, but because the one job doesn't finish within my requested wall time, QUESO never generates a combined chain from the 15 successful jobs.
Here's what I've tried:
I thought I saw a way to have QUESO write out the chain from each job that finishes successfully. For example, if I'm using 3 cores and 2 of them finish their jobs while 1 fails, the two successful ones should still write out their chains. Unfortunately, this isn't working for me; I might have misspecified something in my input file.
Section 5.3.7 of the QUESO manual says that specifying
ip_mh_rawChain_dataOutputAllowedSet
in the input file should allow each individual chain to be written as soon as its job completes.
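For reference, this is the relevant chunk of my input file as I currently have it. The ip_mh_rawChain_dataOutputFileName line is my reading of the companion option in the manual, and the file name and allowed-set values are just what I'm using, so the misspecification could easily be in here:

```
# Ask each subenvironment to write its own raw chain when it finishes.
# The allowed set should (as I understand it) list the subenvironment ids
# that are permitted to write output.
ip_mh_rawChain_dataOutputFileName   = outputData/ip_raw_chain
ip_mh_rawChain_dataOutputAllowedSet = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
```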
Why do you think that if one forward evaluation fails, the others should still produce chains? I feel like this will depend on the MPI stack; if one MPI process fails I don't know if MPI guarantees the other MPI processes can continue in a fault-tolerant way. I understand conceptually what you're asking, I'm just not sure if MPI is happy about it.
If MPI isn't happy about it, I wonder if there's something we (QUESO) can do to deal with it. I think this is an excellent use case for fault-tolerant parallel software.
My particular use-case is embarrassingly parallel. The forward solve isn't parallel. It was my understanding that in this situation the full chain is distributed across the processes but the processes do not communicate chain information until the very end when the entire chain is reconstructed. In this special case, some of the chains will complete before others. So I was wondering if QUESO could have a process write out its chain when it's done. The user could then construct a partial chain from the chains that were written to disk.
Embarrassingly parallel or not doesn't really matter; MPI doesn't care. MPI sees a process failure (presumably it exits with a nonzero exit code) and then what happens next is... I don't know. Is it up to the MPI stack to decide what to do when a single MPI process fails? Perhaps I need to look at the MPI standard.
The process isn't failing. MPI isn't exiting. The forward solve just hangs. Calculations are still being made (presumably). Eventually I get a time-out.
Then I presume the application would hang at MPI_Finalize(), since there's an implicit barrier there. If that were the case, your other chains should already have written their respective outputs before the hang. If they haven't, then I'm starting to think this is a QUESO bug.
Perhaps I can work on an MWE where one of the forward solves just calls sleep().
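Something along these lines, with QUESO stripped out entirely, might already show the behavior in question. This is only a hypothetical sketch in plain MPI (none of it is QUESO API): rank 0 stands in for the forward solve that hangs, and every other rank writes a stand-in "chain" file before reaching MPI_Finalize():

```c++
// mwe_hang.cpp -- hypothetical sketch, not an actual QUESO MWE.
// Build: mpicxx mwe_hang.cpp -o mwe_hang
// Run:   mpirun -np 4 ./mwe_hang
#include <mpi.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    // Stand-in for the forward solve that never returns.
    sleep(3600);
  } else {
    // Stand-in for a chain that finishes and writes its own output.
    char fname[64];
    std::snprintf(fname, sizeof(fname), "chain_rank_%d.txt", rank);
    if (std::FILE* f = std::fopen(fname, "w")) {
      std::fprintf(f, "pretend this is the raw chain for rank %d\n", rank);
      std::fclose(f);
    }
  }

  // Whether the non-hanging ranks actually block here depends on the MPI
  // implementation, but either way their files are already on disk, so the
  // question is whether QUESO writes its per-chain output before this point.
  MPI_Finalize();
  return 0;
}
```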