Bug#821778: autopkgtest: adt-run fails with QEMU - BrokenPipe

Sun Apr 24 11:40:50 UTC 2016

Control: tag -1 confirmed

Hello Neil,

Neil Williams [2016-04-19  9:36 +0100]:
> adt-run /tmp/tmp.lXjW2N5fd3/lava-server_2016.4-1.dsc -U --- qemu adt-sid-2.raw
> adt-run [09:21:28]: ERROR: testbed failure: cannot send to testbed: ['BrokenPipeError: [Errno 32] Broken pipe\n']

I confirm that too, with a freshly built sid testbed, thanks for the
report!

Some debugging notes (mostly for myself):

 - This happens because eofcat lingers for a long time after the
   executed command has finished. The exit.tmp stamp also exists by then.

 - Adding some logging to eofcat shows that it does not actually start
   running and polling until about half a minute has passed.

 - After the first runcmd fails, re-running it works fine.

 - I noticed that the time when the first eofcat finally finishes
   coincides with this kernel log entry:

   [    1.549882] [TTM] Initializing DMA pool allocator
   [   39.586483] random: nonblocking pool is initialized

   I.e. in this case it took 39s (after boot) to collect enough
   entropy, and that is exactly how long eofcat hangs.

 - So I attached strace to eofcat, and this confirms the suspicion
   above:

   437   11:21:36.118034 getrandom("/V#\200^O*HD+D_\32\345\223M\205a\336/\36x\335\246", 24, 0) = 24
   437   11:21:57.939999 ioctl(0, TCGETS, 0x7ffde1d152a0) = -1 ENOTTY (Inappropriate ioctl for device)

   so getrandom() is the call that blocks for that time.
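
As a rough cross-check, the gap between the two traced calls can be
computed from the timestamps. This is only a sketch: it assumes the
"PID timestamp syscall" layout shown in the excerpt above, the helper name
strace_gap_seconds is made up for illustration, and it does not handle a
trace that crosses midnight.

```python
from datetime import datetime


def strace_gap_seconds(first_line, second_line):
    """Return the seconds elapsed between two timestamped strace lines.

    Assumes each line starts with a PID followed by an HH:MM:SS.ffffff
    wall-clock timestamp, as in the excerpt above.
    """
    def stamp(line):
        # Field 0 is the PID, field 1 is the timestamp.
        return datetime.strptime(line.split()[1], "%H:%M:%S.%f")

    return (stamp(second_line) - stamp(first_line)).total_seconds()


# The two lines from the strace excerpt, copied verbatim (raw strings keep
# the backslash escapes in the buffer dump untouched).
blocked = strace_gap_seconds(
    r'437   11:21:36.118034 getrandom("/V#\200^O*HD+D_\32\345\223M\205a\336/\36x\335\246", 24, 0) = 24',
    r'437   11:21:57.939999 ioctl(0, TCGETS, 0x7ffde1d152a0) = -1 ENOTTY (Inappropriate ioctl for device)',
)
print("gap between the two calls: %.3f s" % blocked)
```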

So this comes down to a regression in python3.5 3.5.1-11 (which
Antonio and Felix confirmed):

With -10:
  $ strace -e getrandom python3 -c 'True'
  +++ exited with 0 +++

With -11:
  $ strace -e getrandom python3 -c 'True'
  getrandom("\300\0209\26&v\232\264\325\217\322\303:]\30\212Q\314\244\257t%\206\"", 24, 0) = 24
  +++ exited with 0 +++
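
The comparison above could also be scripted. A minimal sketch in Python,
assuming strace is installed and that "python3" on the PATH is the
interpreter under test; the helper names getrandom_calls and trace_startup
are made up for illustration:

```python
import subprocess


def getrandom_calls(strace_output):
    """Return the lines of strace output that record a getrandom() call."""
    return [line for line in strace_output.splitlines()
            if line.lstrip().startswith("getrandom(")]


def trace_startup(interpreter="python3"):
    """Run `<interpreter> -c True` under strace and return the trace text.

    strace writes its trace to stderr, so that is the stream captured here.
    """
    result = subprocess.run(
        ["strace", "-e", "trace=getrandom", interpreter, "-c", "True"],
        stderr=subprocess.PIPE, universal_newlines=True)
    return result.stderr
```

Going by the output above, len(getrandom_calls(trace_startup())) would be
0 with -10 and 1 with -11.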

This is really unfriendly -- it essentially means that you can no
longer use python3 early in the boot process. It would be better to
initialize the random state lazily, only when/if something actually
needs it. This could very well have the same cause as
https://bugs.debian.org/821877; I'll follow up there.
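
To illustrate what "lazily" means here, a minimal sketch in Python. This
is not how the interpreter implements its startup (that code is in C); the
class name LazySeed and its parameters are hypothetical, and os.urandom
merely stands in for whatever entropy source is used.

```python
import os


class LazySeed:
    """Hypothetical helper: fetch random bytes on first use, not at startup."""

    def __init__(self, nbytes=24, source=os.urandom):
        self._nbytes = nbytes
        self._source = source   # callable taking a byte count
        self._value = None      # nothing is read from the source yet

    def get(self):
        # The (potentially blocking) read happens only here, and only once.
        if self._value is None:
            self._value = self._source(self._nbytes)
        return self._value
```

Creating a LazySeed costs nothing; a program that never calls get() never
touches the entropy source and so cannot block on it.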

In the diff between -10 and -11 I do see some getrandom() fixes to
supply the correct buffer size (but that should be irrelevant, as in
-10 getrandom() wasn't called in the first place), and a new call
which should apply to Solaris only (#ifdef sun), so it's not entirely
clear where the new call comes from or how to work around it. The
"eofcat" helper can't be implemented in shell sensibly, so we need
some more powerful scripting language for it.

In the meantime I'll think about a workaround.

Martin

-- 
Martin Pitt                        | http://www.piware.de
Ubuntu Developer (www.ubuntu.com)  | Debian Developer  (www.debian.org)