Sockfile check retries too short for a busy system boot
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
libvirt (Ubuntu) | Fix Released | High | Unassigned |
Precise | Won't Fix | High | Unassigned |
Trusty | Fix Released | High | Unassigned |
Wily | Won't Fix | High | Unassigned |
Xenial | Fix Released | High | Unassigned |
Zesty | Fix Released | Undecided | Unassigned |
Artful | Fix Released | High | Unassigned |
Bug Description
[Impact]
* The libvirt service reports that it is ready before it has spawned the
libvirt socket, so dependent services fail. An earlier SRU (#1455608) was
meant to fix that, but it has many deficiencies (not considering the config,
giving up after 10 seconds, using an unconditional sleep 2, adding up
to 2 seconds to a service stop while in post-start).
* This is the backport and improvement of a change that was already brought
to Yakkety, but there it doesn't matter too much due to systemd.
[Test Case]
* There are two very different ways to "test" this due to the overload
based scenario where this really becomes important.
* Version #1 - being lame
One can just modify the upstart script and exchange the check for the
socket with /bin/true.
That way it waits forever which allows you to check the log entries,
the abort responsiveness and similar.
* Version #2 - recreating the case
- This mostly means the system has to be very slow and overloaded.
You can either just slow down the system (e.g. run a qemu with nice
MAX) or stress your host with other things burning CPU/memory/disk.
- We tried adding autostart guests (see comment #35), but that
actually takes place after the socket is created. The reported case
had a RAID rebuilding.
- TL;DR: get your system slow enough that libvirt takes more than 10 seconds
to start properly (the old limit is 5*2 seconds).
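One way to stress the host for Version #2 is sketched below. It is only an illustration: `burn_cpu` is a hypothetical helper, not part of libvirt or the patch, and CPU load alone may not be enough to reproduce the reported case, which involved disk I/O from a RAID rebuild.

```shell
# Hypothetical helper: start N busy-loop workers for a number of seconds.
# Usage: burn_cpu <workers> <seconds>
burn_cpu() {
    workers=$1
    secs=$2
    i=0
    while [ "$i" -lt "$workers" ]; do
        (
            # Each worker spins until its deadline has passed.
            end=$(( $(date +%s) + secs ))
            while [ "$(date +%s)" -lt "$end" ]; do :; done
        ) &
        i=$((i + 1))
    done
    # Block until all workers have finished.
    wait
}
```

Running something like `burn_cpu 8 120` in the background before restarting the libvirt service keeps the CPUs busy while the post-start check runs.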
[Regression Potential]
* I'd think that there might exist (super rare) cases where the post-start
now spins forever. But by the definition
http://
started (yes) but not yet ready. Yet this might appear as a regression
to some.
* Other than that clearly this should fix more issues than it (hopefully
not) causes.
[Other Info]
* n/a
--- END SRU Template ---
[ problem description ]
sockfile_
#1455608 - https:/
[ step to reproduce ]
Set up a clean install (Ubuntu Server 14.04.4 LTS) with the OS disk assembled as RAID-1, boot some guest instances (count > 10, start-at-boot), force a host shutdown by pressing the power button for 3s ~ 5s or via an IPMI command, then power on again. The sockfile sometimes fails to become ready within the "post-start" script, which leaves an error line in /var/log/syslog:
==> kernel: [ 313.059830] init: libvirt-bin post-start process (2430) terminated with status 1 <==
Since multiple VMs were reading and writing before the non-graceful shutdown, the RAID devices need to re-sync after boot, which makes the system slow to respond. The start-up script for libvirt-bin only waits 5 cycles of 2 seconds each, so it times out after 10s and exits with "1".
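The timeout behaviour described above can be sketched as a bounded polling loop. This is not the actual libvirt-bin upstart job; `wait_for_sock` and its arguments are hypothetical names used only to illustrate how 5 cycles of 2 seconds add up to the 10 second limit.

```shell
# Hypothetical sketch of a bounded socket check (not the real upstart script).
# Usage: wait_for_sock <sockfile> <retries> <interval-seconds>
wait_for_sock() {
    sockfile=$1
    retries=$2
    interval=$3
    while [ "$retries" -gt 0 ]; do
        # -S is true only if the path exists and is a socket.
        [ -S "$sockfile" ] && return 0
        sleep "$interval"
        retries=$((retries - 1))
    done
    # Socket never appeared within retries * interval seconds.
    return 1
}
```

Called as `wait_for_sock "$sockfile" 5 2`, the function gives up with status 1 after about 10 seconds, which matches the "terminated with status 1" line in syslog.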
[ possible solution ]
extend the retry times for sockfile waiting, and make it possible to change via editing `/etc/default/
<please see the patch file as attachment>
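For illustration, a configurable retry count could be read along the following lines. This is a sketch and not the attached patch: the variable name `sockfile_check_retries` is assumed from the attachment's file name, `load_retries` is a hypothetical helper, and the defaults file path is passed in as an argument rather than assumed.

```shell
# Hypothetical sketch: take the retry count from a defaults file if present.
# Usage: load_retries <defaults-file>
load_retries() {
    defaults_file=$1
    # Fall back to the old hard-coded value of 5 cycles.
    sockfile_check_retries=5
    if [ -r "$defaults_file" ]; then
        # The defaults file is plain shell, e.g. sockfile_check_retries=20
        . "$defaults_file"
    fi
    echo "$sockfile_check_retries"
}
```

With a defaults file containing `sockfile_check_retries=20` the function prints 20; when the file is missing or unreadable, it prints the fallback of 5.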
[ sysinfo ]
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.4 LTS
Release: 14.04
Codename: trusty
$ uname -a
Linux host2 4.2.0-35-generic #40~14.04.1-Ubuntu SMP Fri Mar 18 16:37:35 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[ related issue ]
#1386465 - https:/
The attachment "allow-change-sockfile_check_retries.diff" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.
[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]