Bug #1571209 “Sockfile check retries too short for a busy system...” : Bugs : libvirt package : Ubuntu

Revision history for this message

guessi (guessi) wrote on 2016-04-16:

#1

allow-change-sockfile_check_retries.diff Edit (2.1 KiB, text/plain)

Revision history for this message

Ubuntu Foundations Team Bug Bot (crichton) wrote on 2016-04-16:

#2

The attachment "allow-change-sockfile_check_retries.diff" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags:

added: patch

guessi (guessi) on 2016-04-16

tags:

added: trusty xenial

guessi (guessi) on 2016-04-16

description:

updated

Revision history for this message

Serge Hallyn (serge-hallyn) wrote on 2016-04-18:

#3

This is the very definition of racy.

It appears to me that the point of the post-start is to make sure libvirt is ready, no matter how long it takes. If libvirt actually fails to start, upstart should catch that and mark it failed and abort post-start, is that right? If not, then we should complicate the loop in the post-start to make that logic happen.

But my point is that having a timeout there at all is wrong - we are waiting for the socket to be ready, not trying to catch failure to start. So we should effectively wait forever for either libvirt to be ready, or to have failed to start. No guessing at timeouts here.

Changed in libvirt (Ubuntu):
assignee:	nobody → Edward Hope-Morley (hopem)
importance:	Undecided → High
status:	New → Confirmed

Revision history for this message

guessi (guessi) wrote on 2016-04-19:

#4

trusty patch Edit (1.4 KiB, text/plain)

@serge, @hopem,

I'd like to propose a new approach, wait infinitely until the sockfile is ready,
and allow it to exit if there's an interrupt of stop/restart/force-stop event.
* excluding systemd solution, sorry I'm not familiar with systemd.

it is inspired by the following PR from Docker project,

- https://github.com/docker/docker/pull/9287/files
- https://github.com/docker/docker/pull/11945/files

should be smoothly apply on libvirt-bin as well,
please give help to review it again, thanks !

Revision history for this message

Serge Hallyn (serge-hallyn) wrote on 2016-04-19:

#5

Thanks, @guessi, that patch looks great. The only thing i wonder is whether sleep 0.5s is a bit too short. I assume you had a rationale for shortening the sleep time?

Revision history for this message

guessi (guessi) wrote on 2016-04-19:

#6

@serge,

no, I don't,
it's simply shortened the waiting time for service back or down.

Revision history for this message

Serge Hallyn (serge-hallyn) wrote on 2016-04-20: Re: [Bug 1571209] Re: Sockfile check retries too short for a busy system boot

#7

Quoting guessi (<email address hidden>):
> @serge,
>
> no, I don't,
> it's simply shortened the waiting time for service back or down.

Ok - the shorter time itself is probably fine, but printing a
message every 0.2 seconds seemed a bit much. On the other
hand I don't want to complicate the logic to only print every
5 iterations either...

Revision history for this message

guessi (guessi) wrote on 2016-04-21:

#8

trusty patch Edit (1.4 KiB, text/plain)

this is the patch file from #4, but keep sleep wait time "2"

https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1571209/comments/4

Revision history for this message

guessi (guessi) wrote on 2016-04-21:

#9

libvirt-bin-precise.diff Edit (823 bytes, text/plain)

Revision history for this message

guessi (guessi) wrote on 2016-04-21:

#10

libvirt-bin-vivid.diff Edit (1.4 KiB, text/plain)

Revision history for this message

guessi (guessi) wrote on 2016-04-21:

#11

libvirt-bin-wily.diff Edit (1.5 KiB, text/plain)

Revision history for this message

guessi (guessi) wrote on 2016-04-21:

#12

libvirt-bin-xenial.diff Edit (1.5 KiB, text/plain)

Revision history for this message

guessi (guessi) wrote on 2016-04-21:

#13

@serge,
to avoid emit too much log, I've change the delay time back to "2".

@serge, @hopem,
patch files above for precise/trusty/vivid/wily/xenial have been uploaded,
should work as expected, but still, please have time to review them.

thanks !!!

tags:

added: precise vivid wily

Revision history for this message

Serge Hallyn (serge-hallyn) wrote on 2016-04-25:

#14

Heh.

So the importance of this patch is underscored by the fact that this has
basically turned every 'start libvirt-bin' into a sleep 2s.

So I guess we should drop the number. Or perhaps we should increment the
sleep time on every iteration? start at 0.5 second and go up to a 2s sleep
at the end?

or just go back to 0.5 or 1 second sleep on every loop.

Your patch does work (at least with upstart). Let me know what you think
we should do for the sleep time, and i'll either change the number, or if
you want to do incremental please post a new set of patches.

Revision history for this message

guessi (guessi) wrote on 2016-04-25:

#15

@serge,

sorry for the late reply,
we live in different timezone :p

I would prefer decrease the delay time down to 1s, but keep it wait infinitely.

according my original post (boot up with degraded RAID, 5s -> 10s), it takes up to 10s in my case, so I don't think increasing delay from 0.5s to 2s is enough for even worse cases, that's why I leave an option in default/libvirt-bin.

do I need to post the patch set again if we just change to number? or maybe I should create a merge-request?

Revision history for this message

Serge Hallyn (serge-hallyn) wrote on 2016-04-26:

#16

Hi,

I'll just drop it to 1s, thanks.

Revision history for this message

guessi (guessi) wrote on 2016-04-26:

#17

thanks, I'll keep track on this issue.

Revision history for this message

guessi (guessi) wrote on 2016-05-22:

#18

@serge, is there any milestone to fix this rare case problem?

Revision history for this message

Serge Hallyn (serge-hallyn) wrote on 2016-05-23:

#19

Sorry, working on it for yakkety now.

Changed in libvirt (Ubuntu):
assignee:	Edward Hope-Morley (hopem) → Serge Hallyn (serge-hallyn)
status:	Confirmed → In Progress
Changed in libvirt (Ubuntu Precise):
importance:	Undecided → High
Changed in libvirt (Ubuntu Trusty):
importance:	Undecided → High
Changed in libvirt (Ubuntu Wily):
importance:	Undecided → High
Changed in libvirt (Ubuntu Xenial):
importance:	Undecided → High

Revision history for this message

Launchpad Janitor (janitor) wrote on 2016-05-23:

#20

This bug was fixed in the package libvirt - 1.3.4-1ubuntu4

---------------
libvirt (1.3.4-1ubuntu4) yakkety; urgency=medium

  * Re-enable the upstart job by renaming the file.
  * Include patchby @guessi to continally wait for libvirtd to start when
    using sysvinit or upstart. (LP: #1571209)

-- Serge Hallyn <email address hidden> Mon, 23 May 2016 13:50:22 -0500

Changed in libvirt (Ubuntu):
status:	In Progress → Fix Released

Revision history for this message

guessi (guessi) wrote on 2016-05-26:

#21

@serge, great thanks! looking for LTS (trusty/xenial).

Revision history for this message

Serge Hallyn (serge-hallyn) wrote on 2016-05-26:

#22

Yes, I suppose xenial we could do right now, but I'm waiting on
another package to clear trusty-proposed.

Revision history for this message

Jon Grimm (jgrimm) wrote on 2016-09-30:

#23

IRC chat with hallyn is that he'd like to hand this off to someone else to SRU, but is a willing sponsor if needed.

Changed in libvirt (Ubuntu Wily):
status:	New → Won't Fix

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-09-06:

#24

This bug was dead for quite some time, but the issue still exists for Trusty (were we could take the supposed change) and we have to check the systemd cases in >=Xenial.
Adapting tasks and will come back once I tested on >=Xenial.

Changed in libvirt (Ubuntu Artful):
status:	Fix Released → Incomplete
Changed in libvirt (Ubuntu Zesty):
status:	New → Incomplete
Changed in libvirt (Ubuntu Xenial):
status:	New → Incomplete
Changed in libvirt (Ubuntu Trusty):
status:	New → Triaged
Changed in libvirt (Ubuntu Precise):
status:	New → Won't Fix
Changed in libvirt (Ubuntu Artful):
assignee:	Serge Hallyn (serge-hallyn) → nobody

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-09-06:

#25

Found while testing:
BTW - the base path of the socket is configurable, so the fixes would need to consider "unix_sock_dir" from /etc/libvirt/libvirtd.conf to be correct.
If anyone changes that in the config the service will become "unstartable" as it always would consider it not up and break it.
That should be wrapped up in the fix for Trusty where sysv/ustart is used.
Continuing on systemd tests (while there is no explicit libvirtd.target to wait on I want to know how it behaves).

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-09-06:

#26

Download full text (8.9 KiB)

To check on the systemd case I picked a rather slow system and spawned 30 guests.
I set all those to autostart and then shut them down.

Starting them sequentially as well as concurrently takes about ~20 seconds.

/var/run/libvirt/libvirt-sock is gone after the service is stopped.

With the following script this can be tested:
$ cat test-restart.sh
#!/bin/bash
set -uxe
date
sudo systemctl start libvirtd
date
# right after start (1-2 seconds usually) check status of sockets and guests
while /bin/true; do
    ls -laF /var/run/libvirt/libvirt-sock
    virsh list
    sudo systemctl status libvirtd --no-pager
    date
    sleep 2s
done

You will see that the "virsh list" is kind of blocking until other internal tasks are done.
That pretty much ends up after the ~20 seconds that I counted until it goes on.
All guests are started then and things are good.
But the socket is available right after start, yet if you do a request you are enqueued and have to wait - which is fine.

Output:
$ ./test-restart.sh
+ date
Mi 6. Sep 04:40:34 EDT 2017
+ sudo systemctl start libvirtd
+ date
Mi 6. Sep 04:40:34 EDT 2017
+ /bin/true
+ ls -laF /var/run/libvirt/libvirt-sock
srwxrwx--- 1 root libvirtd 0 Sep 6 04:40 /var/run/libvirt/libvirt-sock=
+ virsh list
Id Name State
----------------------------------------------------
1 artful-test30 running
2 artful-test9 running
3 artful-test8 running
4 artful-test23 running
5 artful-test13 running
6 artful-test19 running
7 artful-test11 running
8 artful-test15 running
9 artful-test10 running
10 artful-test17 running
11 artful-test5 running
12 artful-test24 running
13 artful-test26 running
14 artful-test20 running
15 artful-test2 running
16 artful-test29 running
17 artful-test22 running
18 artful-test27 running
19 artful-test14 running
20 artful-test6 running
21 artful-test25 running
22 artful-test12 running
23 artful-test3 running
24 artful-test21 running
25 artful-test7 running
26 artful-test16 running
27 artful-test28 running
28 artful-test18 running
29 artful-test1 running
30 artful-test4 running

+ sudo systemctl status libvirtd --no-pager
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2017-09-06 04:40:34 EDT; 23s ago
     Docs: man:libvirtd(8)
           http://libvirt.org
Main PID: 171137 (libvirtd)
    Tasks: 303 (limit: 32768)
   Memory: 93.7M
      CPU: 15.383s
   CGroup: /system.slice/libvirtd.service
           ├─ 3848 /usr/...

To check on the systemd case I picked a rather slow system and spawned 30 guests.
I set all those to autostart and then shut them down.

Starting them sequentially as well as concurrently takes about ~20 seconds.

/var/run/libvirt/libvirt-sock is gone after the service is stopped.

With the following script this can be tested:
$ cat test-restart.sh 
#!/bin/bash
set -uxe
date
sudo systemctl start libvirtd
date
# right after start (1-2 seconds usually) check status of sockets and guests
while /bin/true; do
    ls -laF /var/run/libvirt/libvirt-sock
    virsh list
    sudo systemctl status libvirtd --no-pager
    date
    sleep 2s
done

You will see that the "virsh list" is kind of blocking until other internal tasks are done.
That pretty much ends up after the ~20 seconds that I counted until it goes on.
All guests are started then and things are good.
But the socket is available right after start, yet if you do a request you are enqueued and have to wait - which is fine.

Output:
$ ./test-restart.sh  
+ date
Mi 6. Sep 04:40:34 EDT 2017
+ sudo systemctl start libvirtd
+ date
Mi 6. Sep 04:40:34 EDT 2017
+ /bin/true
+ ls -laF /var/run/libvirt/libvirt-sock
srwxrwx--- 1 root libvirtd 0 Sep  6 04:40 /var/run/libvirt/libvirt-sock=
+ virsh list
 Id    Name                           State
----------------------------------------------------
 1     artful-test30                  running
 2     artful-test9                   running
 3     artful-test8                   running
 4     artful-test23                  running
 5     artful-test13                  running
 6     artful-test19                  running
 7     artful-test11                  running
 8     artful-test15                  running
 9     artful-test10                  running
 10    artful-test17                  running
 11    artful-test5                   running
 12    artful-test24                  running
 13    artful-test26                  running
 14    artful-test20                  running
 15    artful-test2                   running
 16    artful-test29                  running
 17    artful-test22                  running
 18    artful-test27                  running
 19    artful-test14                  running
 20    artful-test6                   running
 21    artful-test25                  running
 22    artful-test12                  running
 23    artful-test3                   running
 24    artful-test21                  running
 25    artful-test7                   running
 26    artful-test16                  running
 27    artful-test28                  running
 28    artful-test18                  running
 29    artful-test1                   running
 30    artful-test4                   running

+ sudo systemctl status libvirtd --no-pager
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2017-09-06 04:40:34 EDT; 23s ago
     Docs: man:libvirtd(8)
           http://libvirt.org
 Main PID: 171137 (libvirtd)
    Tasks: 303 (limit: 32768)
   Memory: 93.7M
      CPU: 15.383s
   CGroup: /system.slice/libvirtd.service
           ├─  3848 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/lib/libvirt/libvirt_…er
           ├─  3849 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/lib/libvirt/libvirt_…er
           ├─171137 /usr/sbin/libvirtd
           ├─171530 qemu-system-s390x -enable-kvm -name guest=artful-test30,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─171640 qemu-system-s390x -enable-kvm -name guest=artful-test9,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v…on
           ├─171789 qemu-system-s390x -enable-kvm -name guest=artful-test8,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v…on
           ├─171897 qemu-system-s390x -enable-kvm -name guest=artful-test23,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─171970 qemu-system-s390x -enable-kvm -name guest=artful-test13,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─172041 qemu-system-s390x -enable-kvm -name guest=artful-test19,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─172116 qemu-system-s390x -enable-kvm -name guest=artful-test11,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─172196 qemu-system-s390x -enable-kvm -name guest=artful-test15,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─172273 qemu-system-s390x -enable-kvm -name guest=artful-test10,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─172348 qemu-system-s390x -enable-kvm -name guest=artful-test17,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─172429 qemu-system-s390x -enable-kvm -name guest=artful-test5,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v…on
           ├─172506 qemu-system-s390x -enable-kvm -name guest=artful-test24,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─172588 qemu-system-s390x -enable-kvm -name guest=artful-test26,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─172671 qemu-system-s390x -enable-kvm -name guest=artful-test20,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─172748 qemu-system-s390x -enable-kvm -name guest=artful-test2,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v…on
           ├─172827 qemu-system-s390x -enable-kvm -name guest=artful-test29,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─172899 qemu-system-s390x -enable-kvm -name guest=artful-test22,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─172977 qemu-system-s390x -enable-kvm -name guest=artful-test27,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─173175 qemu-system-s390x -enable-kvm -name guest=artful-test14,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─173318 qemu-system-s390x -enable-kvm -name guest=artful-test6,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v…on
           ├─173404 qemu-system-s390x -enable-kvm -name guest=artful-test25,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─173477 qemu-system-s390x -enable-kvm -name guest=artful-test12,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─173554 qemu-system-s390x -enable-kvm -name guest=artful-test3,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v…on
           ├─173629 qemu-system-s390x -enable-kvm -name guest=artful-test21,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─173706 qemu-system-s390x -enable-kvm -name guest=artful-test7,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v…on
           ├─173784 qemu-system-s390x -enable-kvm -name guest=artful-test16,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─173878 qemu-system-s390x -enable-kvm -name guest=artful-test28,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─173954 qemu-system-s390x -enable-kvm -name guest=artful-test18,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/…on
           ├─174190 qemu-system-s390x -enable-kvm -name guest=artful-test1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v…on
           └─174303 qemu-system-s390x -enable-kvm -name guest=artful-test4,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v…on

Sep 06 04:40:55 s1lp5 dnsmasq-dhcp[3848]: DHCPDISCOVER(virbr0) 192.168.122.48 52:54:00:52:27:0d
Sep 06 04:40:55 s1lp5 dnsmasq-dhcp[3848]: DHCPOFFER(virbr0) 192.168.122.48 52:54:00:52:27:0d
Sep 06 04:40:55 s1lp5 dnsmasq-dhcp[3848]: DHCPREQUEST(virbr0) 192.168.122.225 52:54:00:bf:24:14
Sep 06 04:40:55 s1lp5 dnsmasq-dhcp[3848]: DHCPACK(virbr0) 192.168.122.225 52:54:00:bf:24:14 artful-test17
Sep 06 04:40:55 s1lp5 dnsmasq-dhcp[3848]: DHCPREQUEST(virbr0) 192.168.122.115 52:54:00:d7:97:d4
Sep 06 04:40:55 s1lp5 dnsmasq-dhcp[3848]: DHCPACK(virbr0) 192.168.122.115 52:54:00:d7:97:d4 artful-test24
Sep 06 04:40:55 s1lp5 dnsmasq-dhcp[3848]: DHCPREQUEST(virbr0) 192.168.122.18 52:54:00:75:b1:b2
Sep 06 04:40:55 s1lp5 dnsmasq-dhcp[3848]: DHCPACK(virbr0) 192.168.122.18 52:54:00:75:b1:b2 artful-test26
Sep 06 04:40:55 s1lp5 dnsmasq-dhcp[3848]: DHCPREQUEST(virbr0) 192.168.122.172 52:54:00:99:82:92
Sep 06 04:40:55 s1lp5 dnsmasq-dhcp[3848]: DHCPACK(virbr0) 192.168.122.172 52:54:00:99:82:92 artful-test15
+ date
Mi 6. Sep 04:40:57 EDT 2017
+ sleep 2s

That said artful is good, I have to check former versions if that was an upstream change in the daemons behavior to only return once the socket is avail.

Changed in libvirt (Ubuntu Artful):
status:	Incomplete → Fix Released

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-09-06:

#27

Changed the order of status and list to get it even better while testing on Xenial.
Also the system was even slower which is good I think (for the test).

So things are good on Xenial onward (with systemd).

$ sudo ./test-restart.sh
+ date
Mi 6. Sep 09:03:35 UTC 2017
+ sudo systemctl start libvirtd
+ date
Mi 6. Sep 09:03:36 UTC 2017
+ /bin/true
+ ls -laF /var/run/libvirt/libvirt-sock
srwxrwx--- 1 root libvirtd 0 Sep 6 09:03 /var/run/libvirt/libvirt-sock=
+ sudo systemctl status libvirtd --no-pager
* libvirt-bin.service - Virtualization daemon
   Loaded: loaded (/lib/systemd/system/libvirt-bin.service; enabled; vendor preset: enabled)
   Active: active (running) since Mi 2017-09-06 09:03:36 UTC; 48ms ago
     Docs: man:libvirtd(8)
           http://libvirt.org
Main PID: 261895 (libvirtd)
    Tasks: 20
   Memory: 30.5M
      CPU: 75ms
   CGroup: /system.slice/libvirt-bin.service
           |- 3932 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/lib/libvirt/libvirt_...
           |- 3933 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/lib/libvirt/libvirt_...
           |-261895 /usr/sbin/libvirtd
           `-261913 /usr/sbin/libvirtd

Sep 06 09:03:35 s1lp04 systemd[1]: Starting Virtualization daemon...
Sep 06 09:03:36 s1lp04 systemd[1]: Started Virtualization daemon.
+ virsh list
Id Name State
----------------------------------------------------
1 artful-test2 running
2 artful-test13 running
3 artful-test27 running
4 artful-test28 running
5 artful-test17 running
6 artful-test26 running
7 artful-test25 running
8 artful-test20 running
9 artful-test21 running
10 artful-test11 running
11 artful-test23 running
12 artful-test16 running
13 artful-test10 running
14 artful-test30 running
15 artful-test8 running
16 artful-test18 running
17 artful-test5 running
18 artful-test3 running
19 artful-test4 running
20 artful-test12 running
21 artful-test24 running
22 artful-test9 running
23 artful-test14 running
24 artful-test19 running
25 artful-test1 running
26 artful-test15 running
27 artful-test29 running
28 artful-test6 running
29 artful-test7 running
30 artful-test22 running

+ date
Mi 6. Sep 09:04:55 UTC 2017
+ sleep 2s

Changed the order of status and list to get it even better while testing on Xenial.
Also the system was even slower which is good I think (for the test).

So things are good on Xenial onward (with systemd).

$ sudo ./test-restart.sh 
+ date
Mi 6. Sep 09:03:35 UTC 2017
+ sudo systemctl start libvirtd
+ date
Mi 6. Sep 09:03:36 UTC 2017
+ /bin/true
+ ls -laF /var/run/libvirt/libvirt-sock
srwxrwx--- 1 root libvirtd 0 Sep   6 09:03 /var/run/libvirt/libvirt-sock=
+ sudo systemctl status libvirtd --no-pager
* libvirt-bin.service - Virtualization daemon
   Loaded: loaded (/lib/systemd/system/libvirt-bin.service; enabled; vendor preset: enabled)
   Active: active (running) since Mi 2017-09-06 09:03:36 UTC; 48ms ago
     Docs: man:libvirtd(8)
           http://libvirt.org
 Main PID: 261895 (libvirtd)
    Tasks: 20
   Memory: 30.5M
      CPU: 75ms
   CGroup: /system.slice/libvirt-bin.service
           |-  3932 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/lib/libvirt/libvirt_...
           |-  3933 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/lib/libvirt/libvirt_...
           |-261895 /usr/sbin/libvirtd
           `-261913 /usr/sbin/libvirtd

Sep 06 09:03:35 s1lp04 systemd[1]: Starting Virtualization daemon...
Sep 06 09:03:36 s1lp04 systemd[1]: Started Virtualization daemon.
+ virsh list
 Id    Name                           State
----------------------------------------------------
 1     artful-test2                   running
 2     artful-test13                  running
 3     artful-test27                  running
 4     artful-test28                  running
 5     artful-test17                  running
 6     artful-test26                  running
 7     artful-test25                  running
 8     artful-test20                  running
 9     artful-test21                  running
 10    artful-test11                  running
 11    artful-test23                  running
 12    artful-test16                  running
 13    artful-test10                  running
 14    artful-test30                  running
 15    artful-test8                   running
 16    artful-test18                  running
 17    artful-test5                   running
 18    artful-test3                   running
 19    artful-test4                   running
 20    artful-test12                  running
 21    artful-test24                  running
 22    artful-test9                   running
 23    artful-test14                  running
 24    artful-test19                  running
 25    artful-test1                   running
 26    artful-test15                  running
 27    artful-test29                  running
 28    artful-test6                   running
 29    artful-test7                   running
 30    artful-test22                  running

+ date
Mi 6. Sep 09:04:55 UTC 2017
+ sleep 2s

Changed in libvirt (Ubuntu Xenial):
status:	Incomplete → Fix Released
Changed in libvirt (Ubuntu Zesty):
status:	Incomplete → Fix Released

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-09-06:

#28

That leaves to fix Trusty:
- do verifications on Trusty before working on SRU (we need working tests steps for SRU anyway)
If confirmed:
- backport supposed fix from guessi (thanks for all your work btw and sorry this fell out of sight)
- adapt to not print messages too often (adaptive sleep)
- has to consider the configuration chance to the base path (that needs to be done no matter what the startup tests show btw).

Changed in libvirt (Ubuntu Trusty):
status:	Triaged → Incomplete

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-09-06:

#29

My Trusty system wan't as slow, so I used 100 guests on autostart.

But that seemed to work as well (test adapted for upstart):

./test-restart.sh
+ date
Wed Sep 6 09:34:58 UTC 2017
+ sudo service libvirt-bin start
libvirt-bin start/running, process 19805
+ date
Wed Sep 6 09:35:00 UTC 2017
+ /bin/true
+ ls -laF /var/run/libvirt/libvirt-sock
srwxrwx--- 1 root libvirtd 0 Sep 6 09:34 /var/run/libvirt/libvirt-sock=
+ sudo service libvirt-bin status
libvirt-bin start/running, process 19805
+ ls -laF /var/run/libvirt/libvirt-sock
srwxrwx--- 1 root libvirtd 0 Sep 6 09:34 /var/run/libvirt/libvirt-sock=
+ virsh list

So the socket is immediately available even with all sorts of slow guests starting.
Unfortunately the initial bug 1455608 wasn't too detailed on an actual case.
Actually I'd have refused that as a SRU team member :-/

Well live with what we have.
For now I'll try to slow more things down and recheck.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-09-06:

#30

Slowed down further (lmited to 2 cpus), startup of the guests takes minutes now.
Still no issue to be seen.
The socket is around and working, while in background the guests start slowly but surely.

The test is now:
# cat test-restart.sh
#!/bin/bash
set -uxe
#starts off
date
ls -laF /var/run/libvirt/libvirt-sock
sudo service libvirt-bin status
date
# start service
sudo service libvirt-bin start
date
# check if socket is available immediately
ls -laF /var/run/libvirt/libvirt-sock
sudo service libvirt-bin status
ls -laF /var/run/libvirt/libvirt-sock
# this might take a while as the dameon queues requests
virsh list
date

With 100 Guests shut down, but set to autostart on a 2 cpu system.
That takes about 5 minutes to start the guests, but the service is up and ready on the socket before doing all that.
Did you face something else?

@Ed / @Guessi
I know it is a long time, but I'd need to get something reproducible to go on fixing this for artful. If you can help me with examples that do please let me know.

@Guessi:
I tried with the degraded RAID but it worked for me as well, maybe it is just not the most reliable repro.

@Ed:
the original case lacks some details what actually failed e.g. what dependent services and how it was triggered in your case.

Keeping incomplete status on Trusty for now :-/

Log of the last test:
# ./test-restart.sh
+ date
Wed Sep 6 09:56:01 UTC 2017
+ ls -laF /var/run/libvirt/libvirt-sock
ls: cannot access /var/run/libvirt/libvirt-sock: No such file or directory
+ /bin/true
+ sudo service libvirt-bin status
libvirt-bin stop/waiting
+ date
Wed Sep 6 09:56:01 UTC 2017
+ sudo service libvirt-bin start
libvirt-bin start/running, process 9730
+ date
Wed Sep 6 09:56:03 UTC 2017
+ ls -laF /var/run/libvirt/libvirt-sock
srwxrwx--- 1 root libvirtd 0 Sep 6 09:56 /var/run/libvirt/libvirt-sock=
+ sudo service libvirt-bin status
libvirt-bin start/running, process 9730
+ ls -laF /var/run/libvirt/libvirt-sock
srwxrwx--- 1 root libvirtd 0 Sep 6 09:56 /var/run/libvirt/libvirt-sock=
+ virsh list
Id Name State
----------------------------------------------------
2 artful-test32 running
[...]
101 artful-test45 running
+ date
Wed Sep 6 10:01:41 UTC 2017

Slowed down further (lmited to 2 cpus), startup of the guests takes minutes now.
Still no issue to be seen.
The socket is around and working, while in background the guests start slowly but surely.

The test is now:
# cat test-restart.sh 
#!/bin/bash
set -uxe
#starts off
date
ls -laF /var/run/libvirt/libvirt-sock
sudo service libvirt-bin status
date
# start service
sudo service libvirt-bin start
date
# check if socket is available immediately
ls -laF /var/run/libvirt/libvirt-sock
sudo service libvirt-bin status
ls -laF /var/run/libvirt/libvirt-sock
# this might take a while as the dameon queues requests
virsh list
date

With 100 Guests shut down, but set to autostart on a 2 cpu system.
That takes about 5 minutes to start the guests, but the service is up and ready on the socket before doing all that.
Did you face something else?

@Ed / @Guessi
I know it is a long time, but I'd need to get something reproducible to go on fixing this for artful. If you can help me with examples that do please let me know.

@Guessi:
I tried with the degraded RAID but it worked for me as well, maybe it is just not the most reliable repro.

@Ed:
the original case lacks some details what actually failed e.g. what dependent services and how it was triggered in your case.

Keeping incomplete status on Trusty for now :-/

Log of the last test:
# ./test-restart.sh   
+ date
Wed Sep  6 09:56:01 UTC 2017
+ ls -laF /var/run/libvirt/libvirt-sock
ls: cannot access /var/run/libvirt/libvirt-sock: No such file or directory
+ /bin/true
+ sudo service libvirt-bin status
libvirt-bin stop/waiting
+ date
Wed Sep  6 09:56:01 UTC 2017
+ sudo service libvirt-bin start
libvirt-bin start/running, process 9730
+ date
Wed Sep  6 09:56:03 UTC 2017
+ ls -laF /var/run/libvirt/libvirt-sock
srwxrwx--- 1 root libvirtd 0 Sep  6 09:56 /var/run/libvirt/libvirt-sock=
+ sudo service libvirt-bin status
libvirt-bin start/running, process 9730
+ ls -laF /var/run/libvirt/libvirt-sock
srwxrwx--- 1 root libvirtd 0 Sep  6 09:56 /var/run/libvirt/libvirt-sock=
+ virsh list
 Id    Name                           State
----------------------------------------------------
 2     artful-test32                  running
[...]
 101   artful-test45                  running
+ date
Wed Sep  6 10:01:41 UTC 2017

Revision history for this message

guessi (guessi) wrote on 2017-09-06:

#31

@paelzer,

glad to see people start re-work on this issue, but it's been 16+ months old,

I can only tell the scenario I remember,
I think it would not happened if it is already booted,
it is only happened *at boot time*, as I describe in originally message,

key points are,
* it is running with >10 virtual instances
* it is running under RAID-1 system (traditional disk, *NOT* SSD)
* it is ungracefully shut down (unplug power cord with rude)

under RAID-1 mode, it would try to recovery data between disks, due to ungracefully shut down,
it would finally lead to a slow start up state *during boot*

the latest merge was made by @searge, see comment #19 above,
maybe he/she could help to add more detail on his test/patch?

Revision history for this message

guessi (guessi) wrote on 2017-09-06:

#32

@paelzer,

by the way, the final patch that merged into Yakkety is "sleep 0.5"

see "diff from 1.3.4-1ubuntu3 to 1.3.4-1ubuntu4 (1.6 KiB)"
- https://launchpad.net/ubuntu/+source/libvirt/1.3.4-1ubuntu4

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-09-06:

#33

Yeah, nobody minds yakkety anymore (yes - it is that much time that passed :-/) and on the others there is no change yet (and systemd by default).

I'll go into another repro session closer to the case you had adding:
- reboots
- unclean shutdown
- maybe a raid or something similar to stress this even more

I'll start with Trusty where it was surely seen by you and once I have a case that works move forward.

Changed in libvirt (Ubuntu Trusty):
status:	Incomplete → Triaged

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-09-06:

#34

Acitivty-Ball moved to my side of the field, setting triaged to make it clear that we do not wait for the reporter - even though we havn't reached "confirmed" yet in a strict sense.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-09-07:

#35

Thank you Guessi for your summary once more.
I tried to get closer to your case.

# Install a Trusty system with Raid-1, that has space for many guests (a lot of memory) but is slow (1 cpu)
$ wget http://releases.ubuntu.com/14.04/ubuntu-14.04.5-server-amd64.iso
$ qemu-img create -f qcow2 trusty-disk1.qcow2 10G
$ qemu-img create -f qcow2 trusty-disk2.qcow2 10G
$ sudo qemu-system-x86_64 -enable-kvm -hda trusty-disk1.qcow2 -hdb trusty-disk2.qcow2 -cdrom ubuntu-14.04.5-server-amd64.iso -m 16384 -smp 1
# On this install SW Raid1 according to [1]
# I set up 20 guests with autostart as before
# system seems slow enough, but in addition break the secondary raid device

# I also took a copy and blasted a long 256M whole of zeroes after its partition table
$ dd if=/dev/zero bs=1024 of=trusty-disk2.qcow2 seek=4096 count=$((4096*64))
# Start Trusty system again, but that blow was too much for it to recover at all
# given that the startup took so long (I made the qemu guest rather slow in general) I don't think I need that part of it to slow it down more.
# So further on I killed it hard to cause minor FS issues, but that matches your power loss case better anyway

On reboot then the service still was good and not timing out.
Guests were still starting a minute later, but nothing stopped libvirt from not having it's socket ready in time.

Message in /var/log/upstart/libvirt-bin.log is just ".. ready", no failure to be seen.

Instead of making the system even slower (unusable) I decided to shorten the retries and instead of 5*2sec I did only 2*1 sec, with that it finally triggered *yay*.

Checked another reboot and yes, still showing up.

Error is:
...
Giving up waiting for /var/tun/libvirt/libvirt-sock
libvirt-bin stop/post-start, process 1217
post-start process 1218
/usr/sbin/libvirtd: error: unable to determine if daemon is running: No such file or directory

So in >=Xenial as I mentioned all of this doesn't matter unless you explicitly opt into sysv/upstart. But in Trusty we now finally have a case to work with.

Trying a fix combining your suggestion with:
- incremental sleep times
- considering the potential path being changes in libvirtd.conf

[1]: https://help.ubuntu.com/community/Installation/SoftwareRAID

Thank you Guessi for your summary once more.
I tried to get closer to your case.

# Install a Trusty system with Raid-1, that has space for many guests (a lot of memory) but is slow (1 cpu)
$ wget http://releases.ubuntu.com/14.04/ubuntu-14.04.5-server-amd64.iso
$ qemu-img create -f qcow2 trusty-disk1.qcow2 10G
$ qemu-img create -f qcow2 trusty-disk2.qcow2 10G
$ sudo qemu-system-x86_64 -enable-kvm -hda trusty-disk1.qcow2 -hdb trusty-disk2.qcow2 -cdrom ubuntu-14.04.5-server-amd64.iso -m 16384 -smp 1
# On this install SW Raid1 according to [1]
# I set up 20 guests with autostart as before
# system seems slow enough, but in addition break the secondary raid device

# I also took a copy and blasted a long 256M whole of zeroes after its partition table
$ dd if=/dev/zero bs=1024 of=trusty-disk2.qcow2 seek=4096 count=$((4096*64))
# Start Trusty system again, but that blow was too much for it to recover at all
# given that the startup took so long (I made the qemu guest rather slow in general) I don't think I need that part of it to slow it down more.
# So further on I killed it hard to cause minor FS issues, but that matches your power loss case better anyway

On reboot then the service still was good and not timing out.
Guests were still starting a minute later, but nothing stopped libvirt from not having it's socket ready in time.

Message in /var/log/upstart/libvirt-bin.log is just ".. ready", no failure to be seen.

Instead of making the system even slower (unusable) I decided to shorten the retries and instead of 5*2sec I did only 2*1 sec, with that it finally triggered *yay*.

Checked another reboot and yes, still showing up.

Error is:
...
Giving up waiting for /var/tun/libvirt/libvirt-sock
libvirt-bin stop/post-start, process 1217
    post-start process 1218
/usr/sbin/libvirtd: error: unable to determine if daemon is running: No such file or directory

So in >=Xenial as I mentioned all of this doesn't matter unless you explicitly opt into sysv/upstart. But in Trusty we now finally have a case to work with.

Trying a fix combining your suggestion with:
- incremental sleep times
- considering the potential path being changes in libvirtd.conf

[1]: https://help.ubuntu.com/community/Installation/SoftwareRAID

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-09-07:

#36

Backport of what was applied to zesty plus the fixes discussed here is building in ppa [1]

  * d/libvirt-bin.init, d/libvirt-bin.upstart: fix waiting for the libvirt
    socket (LP: #1571209)
    - avoid timing out on slow systems (only stop when service is stopped)
    - fix whitespace damage formerly added to d/libvirt-bin.init
    - no more sleep without announcing why
    - use incremental sleep to be fast in most fast path cases
    - use incremental sleep to avoid too much repetive messages
    - fix check if unix_sock_dir path is set in /etc/libvirt/libvirtd.conf

I'll test that (so far untested) now and if successful open a merge proposal for team review.
More tests / verifications are welcome.

[1]: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/2950

Revision history for this message

guessi (guessi) wrote on 2017-09-07:

#37

@paelzer,

glad to see you reproduce the problem successfully,
and simulate by creating nested KVM is better the my test, thanks

here I would like to add more history for this bug report,
* it is running under KVM with multiple instance
* it is ungracefully shut-down (could be simulated by virsh if test under nested KVM, or kill)
* disk for physical host is running under RAID-1
* disk for physical host is formatted with 2-partition, / and /boot
* one of the KVM instance running with 2 vHDD, partition: / and /data
* * the /data is mounted as data volume for DB (yeah, it is reading/writing all the time)

as for the patch in the PPA above,
let's focus on libvirt-related section only,
I would prefer to keep sleep interval short (1s),
see discussion #14 ~ #16 above,

consider the following scenario,
if under a slow environment, waiting for socket ready required 65s,
but due to the loop with incremental condition, it will required 91s, there's 26s wasted
(1s + 2s + 4s + 8s + 16s + 30s + 30s = 91s)

and the worst case,
it will lead to 29s wasted. it is slowing down boot up speed,
I think it's not a good user experience, although it is no harm for system.

BTW,
I saw you marked Xenial/Zesty/Artful as Fixed Released in #26, #27
does it mean test#35 all passed for Xenial/Zesty/Artful?

Revision history for this message

guessi (guessi) wrote on 2017-09-07:

#38

forget to mention,
the RAID system I mentioned above, it is talking about software-raid system,
there's no raid card on server

hope these info would help :)

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-09-07:

#39

Fix seems to work, I see it waiting for 1+2+4+8 second overall before declaring the job ready by detecting the socket.

Also the extended test adapting unix_sock_path in /etc/libvirt/libvirtd.conf e.g. to "/var/run" worked (which would have failed before). Libvirt comes up and is considered up correctly even using a non default path.

As long as it is in the post phase it is reporting eventually every 30 seconds to the log.
Status is like ""respawn/post-start, (post-start) process 1177

Prepping an MP to review and a SRU Template.
But heads up - there is another SRU ongoing that has to complete before that.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-09-07:

#40

MP open and linked here

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-09-13:

#41

The MP review seems to get close to approval, added an SRU Template here.

description:

updated

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-09-13:

#42

Uploaded to trusty-unapproved for SRU Team review.

Please note: there is another upload [1] currently passing through SRU processing.
We want [1] to completely pass before the new upload [2] is accepted.

The former one is all complete and just needs a few days more in proposed.
So if all goes right on ~next Monday [1] is released and [2] can enter trusty-proposed.

[1]: https://launchpad.net/ubuntu/+source/libvirt/1.2.2-0ubuntu13.1.22
[2]: https://launchpadlibrarian.net/336851479/libvirt_1.2.2-0ubuntu13.1.23_source.changes

Revision history for this message

Brian Murray (brian-murray) wrote on 2017-09-21: Please test proposed package

#43

Hello guessi, or anyone else affected,

Accepted libvirt into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libvirt/1.2.2-0ubuntu13.1.23 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-trusty to verification-done-trusty. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-trusty. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in libvirt (Ubuntu Trusty):
status:	Triaged → Fix Committed
tags:	added: verification-needed verification-needed-trusty

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-09-28:

#44

I tested again the same way I did with the ppa, and it works for me with the fix that is in proposed.
So the behavior should be more reliable now for those "real" cases affected by slow systems that open up the window for the former issue to happen.

tags:

added: verification-done verification-done-trusty
removed: verification-needed verification-needed-trusty

Revision history for this message

Launchpad Janitor (janitor) wrote on 2017-10-05:

#45

This bug was fixed in the package libvirt - 1.2.2-0ubuntu13.1.23

---------------
libvirt (1.2.2-0ubuntu13.1.23) trusty; urgency=medium

  * d/libvirt-bin.init, d/libvirt-bin.upstart: fix waiting for the libvirt
    socket (LP: #1571209)
    - avoid timing out on slow systems (only stop when service is stopped)
    - fix whitespace damage formerly added to d/libvirt-bin.init
    - no more long sleep without announcing to log
    - check socket and service status more often for lower latency on changes
    - fix check if unix_sock_dir path is set in /etc/libvirt/libvirtd.conf
    - fix the upstart service name that is checked

-- Christian Ehrhardt <email address hidden> Thu, 07 Sep 2017 14:22:45 +0200

Changed in libvirt (Ubuntu Trusty):
status:	Fix Committed → Fix Released

Revision history for this message

Brian Murray (brian-murray) wrote on 2017-10-05: Update Released

#46

The verification of the Stable Release Update for libvirt has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Ubuntu
libvirt package

Sockfile check retries too short for a busy system boot

Bug Description

Other bug subscribers

Patches

Remote bug watches

	Status	Importance	Assigned to
libvirt (Ubuntu)	Fix Released	High	Unassigned
Precise	Won't Fix	High	Unassigned
Trusty	Fix Released	High	Unassigned
Wily	Won't Fix	High	Unassigned
Xenial	Fix Released	High	Unassigned
Zesty	Fix Released	Undecided	Unassigned
Artful	Fix Released	High	Unassigned

Ubuntulibvirt package

Sockfile check retries too short for a busy system boot

Bug Description

Other bug subscribers

Patches

Remote bug watches

Ubuntu
libvirt package