[Pkg-openmpi-maintainers] Bug#598553: Bug#598553: r-cran-rmpi: slave processes eat CPU when they have nothing to do

Jeff Squyres jsquyres at cisco.com
Mon Oct 11 19:26:59 UTC 2010


Sorry for the delay in answering.  I'll try to address all points:

1. Yes, the busy-poll design is intentional in Open MPI.  :-(
1a. Yes, it probably does cause some performance degradation when used with TCP.
1b. It quite definitely is a (major) performance win for non-TCP networks.  That's (unfortunately) why it's there -- you can't poll/select/epoll/whatever on these non-TCP kinds of networks (e.g., OpenFabrics networks) without killing performance.  So you have to busy-poll those networks with their native polling functions and then periodically select/poll/epoll/whatever on all file descriptors.  This unfortunately became a central architectural point of Open MPI's progression engine (because it's in the performance-critical code path).
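
To make that concrete, here is a rough sketch of the kind of progress loop this architecture implies.  All of the names (native_net_poll(), request_complete(), tcp_fds) are hypothetical placeholders, not Open MPI's actual internals:

#include <poll.h>

/* Hypothetical placeholders -- not real Open MPI symbols. */
extern int  native_net_poll(void);    /* e.g., drain an OpenFabrics completion queue */
extern int  request_complete(void);   /* has the request we're waiting on finished?  */
extern struct pollfd tcp_fds[];       /* transports that do have file descriptors    */
extern int  tcp_nfds;

void progress_until_complete(void)
{
    unsigned iter = 0;

    while (!request_complete()) {
        /* Busy-poll the native network: these interconnects expose no file
         * descriptor, so there is nothing to select()/poll()/epoll() on. */
        native_net_poll();

        /* Every so often, also check the fd-based transports (TCP) with a
         * zero timeout so they make progress without ever blocking. */
        if (++iter % 100 == 0)
            poll(tcp_fds, (nfds_t)tcp_nfds, 0);
    }
}

The point is that neither branch ever sleeps, which is why an idle process still burns a full core.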

2. The behavior you're seeing with yield_when_idle is also intentional.  We're busy polling, but we're yielding so that we play well with others.  It does not in any way reduce the CPU utilization; it just makes Open MPI share the CPU better.  But its effect was somewhat weakened when sched_yield() lost its meaning in recent kernels.
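
In terms of the sketch above, yield_when_idle only adds a sched_yield() to each pass of the loop (again, the names are hypothetical placeholders); the process still never sleeps:

#include <poll.h>
#include <sched.h>

extern int  native_net_poll(void);
extern int  request_complete(void);
extern struct pollfd tcp_fds[];
extern int  tcp_nfds;

void progress_until_complete_yielding(void)
{
    while (!request_complete()) {
        native_net_poll();
        poll(tcp_fds, (nfds_t)tcp_nfds, 0);  /* zero timeout: never blocks        */
        sched_yield();                       /* let other runnable processes have  */
                                             /* a turn; CPU usage stays near 100%  */
    }
}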

3. We do know how to make our progression engine switch between blocking and busy-polling (we've had many discussions about it over the years -- shared-memory message passing is the Big Problem), but no one has ever had the time / resources / motivation to implement it.  If anyone has some time, I would love to explain what would need to be done (it's not rocket science, but it is a bit tricky and will require getting into some of the minutiae in the guts of Open MPI :-\ ).
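
For anyone interested, the kind of change point 3 refers to is roughly a spin-then-block strategy.  The sketch below uses the same hypothetical placeholders as above and is emphatically not what Open MPI does today; the transports with no file descriptor -- shared memory in particular -- are exactly where it gets tricky, because they would need a separate wake-up mechanism:

#include <poll.h>

extern int  native_net_poll(void);
extern int  request_complete(void);
extern struct pollfd all_fds[];   /* every transport that does have an fd */
extern int  all_nfds;

void progress_spin_then_block(void)
{
    while (!request_complete()) {
        /* Spin for a bounded number of iterations so that short waits keep
         * their low latency. */
        for (int i = 0; i < 10000 && !request_complete(); i++) {
            native_net_poll();
            poll(all_fds, (nfds_t)all_nfds, 0);
        }
        if (request_complete())
            break;

        /* Nothing arrived: block (with a timeout) until some fd becomes
         * ready.  The fd-less transports are the hard part here. */
        poll(all_fds, (nfds_t)all_nfds, 100 /* ms */);
    }
}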

Does that help at least explain why the code is the way it is?


On Oct 2, 2010, at 6:30 PM, Manuel Prinz wrote:

> On Sat, Oct 02, 2010 at 01:37:42PM -0700, Zack Weinberg wrote:
>> I wrote a test MPI program that just calls MPI_Probe() once - this
>> should block forever, since there are no sends happening.  When run
>> with
>> 
>> $ mpirun -np 2 ./a.out
>> 
>> MPI_Probe never returns and the processes spin through poll(), which
>> is what I originally reported.  So far so good.  If I change the
>> invocation to
>> 
>> $ mpirun -np 2 --mca mpi_yield_when_idle 1 ./a.out
>> 
>> the behavior is the same, except that the processes alternate between
>> poll() and sched_yield().  This doesn't help anything; the scheduler
>> is still being thrashed, and the CPU is not allowed to go idle.  [In
>> fact, my understanding of the Linux scheduler is that a zero-timeout
>> poll() counts as a yield, so "Aggressive" mode isn't even doing
>> anything constructive!]
>> 
>> The desired behavior is for an idle cluster's processes to BLOCK in
>> poll().  So mpi_yield_when_idle does not do what I want.
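
For reference, a minimal reproducer along the lines described above (my reconstruction, not the actual test program) would be just:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Status status;

    MPI_Init(&argc, &argv);
    /* No matching send exists anywhere, so this call can never be
     * satisfied; an idle-friendly implementation would block in the
     * kernel here instead of spinning on the CPU. */
    MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    MPI_Finalize();
    return 0;
}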
>> 
>> Also, putting "mpi_yield_when_idle = 1" into
>> ~/.openmpi/mca-params.conf has no effect, contra the documentation --
>> this perhaps ought to be its own bug.  (I can set MCA parameters for R
>> with environment variables, but that's not nearly as convenient as the
>> host file.)
> 
> I'm out of ideas here. Jeff, could you please comment on the issue?
> You can find the full log here:
> 
>  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=598553
> 
> Thanks in advance!
> 
> Best regards,
> Manuel


-- 
Jeff Squyres
jsquyres at cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/