[pkg-ntp-maintainers] Bug#802040: ntp: Please fix ntpd race when using "systemctl restart ntp"

Sat Oct 17 08:37:02 UTC 2015

Package: ntp
Version: 1:4.2.6.p5+dfsg-7
Severity: important
Tags: Patch

Dear Maintainer,

In the "stop" target, the ntp init script does not wait for ntpd to
terminate before exiting. As a result, when "stop" is followed shortly
by "start", the new daemon can try to bind to an interface while the
old one is still shutting down, and the bind fails.

This left a lot of systems in a major data center running without any
ntpd, since the "needrestart" program restarted ntp by invoking
"systemctl restart ntp", which itself behaves as above. These two,
however, are only the bearers of the bad news; the root problem lies
in the ntp init script.


How to repeat:

On a reasonably equipped system, run "systemctl restart ntp".
Afterwards, there might be no ntpd running at all. The precise
conditions for failure are not clear: a fast CPU is certainly one, and
probably also a long-running ntpd that is partially swapped out to disk.

Example:

Okt 17 09:58:58.607829 host systemd[1]: Stopping LSB: Start NTP daemon...
Okt 17 09:58:58.884818 host ntp[633715]: Stopping NTP server: ntpd.
Okt 17 09:58:58.891273 host systemd[1]: Starting LSB: Start NTP daemon...
Okt 17 09:58:58.921358 host ntpd[2871]: ntpd exiting on signal 15
Okt 17 09:58:58.925381 host ntpd[633732]: ntpd 4.2.6p5 at 1.2349-o Fri Apr 10 19:04:04 UTC 2015 (1)
Okt 17 09:58:58.927660 host ntpd[633733]: proto: precision = 0.251 usec
Okt 17 09:58:58.927815 host systemd[1]: Started LSB: Start NTP daemon.
Okt 17 09:58:58.958606 host ntpd[633733]: unable to bind to wildcard address 0.0.0.0 - another process may be running - EXITING
Okt 17 09:58:58.967355 host ntp[633725]: Starting NTP server: ntpd.

Note that the 37ms between the old ntpd's last message (it had not
exited yet at that time) and the new one's attempt to listen were not
sufficient. That is quite extreme, though; in other tests 13ms were
enough for a successful restart. Also, the "Starting LSB" message
appearing before "ntpd exiting" indicates the race.
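
The 37ms figure can be checked directly from the two journal timestamps
in the example. A minimal sketch, assuming timestamps of the form
HH:MM:SS.ffffff with exactly six fractional digits and both on the same
day; the helper name gap_us is made up for illustration:

```shell
#!/bin/sh
# gap_us T1 T2: print T2 - T1 in microseconds.
# Assumes HH:MM:SS.ffffff with exactly six fractional digits, same day.
gap_us() {
    printf '%s %s\n' "$1" "$2" | awk '
        function to_us(t,    p, s) {
            split(t, p, ":")      # p[1]=hours, p[2]=minutes, p[3]=seconds.fraction
            split(p[3], s, ".")   # s[1]=whole seconds, s[2]=microseconds
            return ((p[1] * 60 + p[2]) * 60 + s[1]) * 1000000 + s[2]
        }
        { printf "%d\n", to_us($2) - to_us($1) }'
}

# "ntpd exiting on signal 15" -> "unable to bind to wildcard address"
gap_us 09:58:58.921358 09:58:58.958606
```

For the two messages in the example this prints 37248, i.e. roughly
37ms.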

As an aside, the other bad news is that systemd does *not* mark the ntp
unit as "failed":

● ntp.service - LSB: Start NTP daemon
   Loaded: loaded (/etc/init.d/ntp)
   Active: active (exited) since Sa 2015-10-17 09:58:58 CEST; 11min ago
  Process: 633715 ExecStop=/etc/init.d/ntp stop (code=exited, status=0/SUCCESS)
  Process: 633725 ExecStart=/etc/init.d/ntp start (code=exited, status=0/SUCCESS)
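
Since the unit state cannot be trusted here, any check after a restart
has to look at the process itself. A minimal sketch, assuming the pid
file lives at /var/run/ntpd.pid (the PIDFILE value is not quoted in this
report, so treat the path as an assumption); pidfile_alive is a
hypothetical helper name:

```shell
#!/bin/sh
# pidfile_alive FILE: succeed only if FILE names a PID that currently exists.
# Run as root: for a process owned by another user, kill -0 fails with a
# permission error even though the process is alive.
pidfile_alive() {
    [ -r "$1" ] || return 1
    pid=$(head -n 1 "$1")
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;   # empty or not a plain number
    esac
    # Signal 0 sends nothing; it only tests whether the PID can be signalled.
    kill -0 "$pid" 2>/dev/null
}

PIDFILE=${PIDFILE:-/var/run/ntpd.pid}   # assumed location
if pidfile_alive "$PIDFILE"; then
    echo "ntpd is running"
else
    echo "ntpd is NOT running"
fi
```

This only shows that some process with that PID exists; it does not
prove that the process is ntpd.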


Suggested fix:

This uses the --retry option of start-stop-daemon (s-s-d) to wait until
ntpd has really exited. Additionally, the --exec check serves as a
safeguard against a stale pid file whose PID has been reused by another
process. That is unlikely to happen, but the check does no harm
otherwise.

--- /etc/init.d/ntp     2009-12-26 18:29:45.000000000 +0100
+++ /etc/init.d/ntp     2015-10-17 09:00:47.000000000 +0200
@@ -65,7 +65,7 @@
                ;;
        stop)
                log_daemon_msg "Stopping NTP server" "ntpd"
-               start-stop-daemon --stop --quiet --oknodo --pidfile $PIDFILE
+               start-stop-daemon --stop --quiet --oknodo --pidfile $PIDFILE --retry=TERM/30/KILL/5 --exec $DAEMON
                log_end_msg $?
                rm -f $PIDFILE
                ;;
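
For readers unfamiliar with --retry: the schedule TERM/30/KILL/5 means
send SIGTERM, wait up to 30 seconds for the process to go away, then
send SIGKILL and wait up to 5 more seconds. The following is only a
rough shell illustration of that sequence, not how start-stop-daemon is
implemented; stop_with_retry and wait_gone are hypothetical helper
names, and the once-per-second polling is an arbitrary choice.

```shell
#!/bin/sh
# wait_gone PID SECONDS: poll once per second until PID no longer exists.
# Returns 0 once it is gone, 1 if it is still there after SECONDS.
wait_gone() {
    remaining=$2
    while kill -0 "$1" 2>/dev/null; do
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# stop_with_retry PID TERM_WAIT KILL_WAIT: rough equivalent of
# --retry=TERM/TERM_WAIT/KILL/KILL_WAIT for a single PID.
stop_with_retry() {
    # If the signal cannot be delivered, treat the process as already gone.
    kill -TERM "$1" 2>/dev/null || return 0
    wait_gone "$1" "$2" && return 0
    kill -KILL "$1" 2>/dev/null || return 0
    wait_gone "$1" "$3"
}

# Demo on a throwaway process instead of ntpd.
sleep 60 &
stop_with_retry "$!" 30 5 && echo "process is gone; safe to start a new one"
```

The point is that "stop" returns only after the old process has
actually disappeared, so a following "start" no longer races with it.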

Please also apply this or something similar in the next jessie point
release so that such havoc does not happen again.

Regards,

    Christoph

-- System Information:
Debian Release: 8.2
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable')
Architecture: amd64 (x86_64)