xenstore crashes with SIGBUS in domain_can_read()

Bug #1538049 reported by Stefan Bader
This bug affects 1 person
Affects        Status         Importance   Assigned to    Milestone
xen (Ubuntu)   Fix Released   Critical     Stefan Bader
  Xenial       Fix Released   Critical     Stefan Bader

Bug Description

The problem was introduced with xen-4.6 in Xenial. Older versions are unaffected.

Testcase:
 - Start one additional domU (PV or HVM does not matter)
 - Repeatedly execute "xenstore-ls"

Eventually the xenstore-ls call locks up and xenstored crashes when it tries to access memory in

tools/xenstore/xenstored_domain.c:255 domain_can_read(struct connection *conn)

Debugging shows that conn->domain->interface at the time of the crash returns a completely different address than before (there should only be one interface mapping per running domain, so two in this test: dom0 and the one domU).
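The testcase can be scripted roughly as follows. This is only a sketch: the function name run_until_hang, the run limit and the 10 second timeout are assumptions for illustration and not part of the original report. It has to run in dom0 with at least one domU started, and it treats either a timeout (the lock-up) or a non-zero exit status as a hit.

```python
import subprocess
import sys


def run_until_hang(cmd=("xenstore-ls",), max_runs=1000, timeout_s=10.0):
    """Run cmd repeatedly.

    Returns the 1-based iteration at which the command timed out or
    exited with a non-zero status, or None if all runs completed.
    The defaults (1000 runs, 10 s timeout) are arbitrary assumptions.
    """
    for i in range(1, max_runs + 1):
        try:
            result = subprocess.run(
                list(cmd),
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,  # the child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            print(f"run {i}: no answer within {timeout_s}s", file=sys.stderr)
            return i
        if result.returncode != 0:
            print(f"run {i}: exit status {result.returncode}", file=sys.stderr)
            return i
    return None
```

Calling run_until_hang() with its defaults returns the iteration at which xenstore-ls stopped answering; at that point it is worth checking whether xenstored is still running.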

Stefan Bader (smb) wrote :

Hm, when repeating with a xenstored that prints additional trace messages about domain->interface values, I now got a case where the SIGBUS seems to have happened while the interface pointer looks valid.

(gdb) where
#0 domain_can_read (conn=conn@entry=0x8eb890) at xenstored_domain.c:261
#1 0x0000000000402718 in main (argc=<optimized out>, argv=<optimized out>)
    at xenstored_core.c:2145
(gdb) p *((struct connection *) 0x8eb890)
$1 = {list = {next = 0x8eca60, prev = 0x8ecdf0}, fd = -1, pollfd_idx = -1,
  id = 1, can_write = true, in = 0x8ef290, out_list = {next = 0x8eb8b8,
    prev = 0x8eb8b8}, transaction = 0x0, transaction_list = {next = 0x8eb8d0,
    prev = 0x8eb8d0}, next_transaction_id = 10, transaction_started = 0,
  domain = 0x8eced0, target = 0x0, watches = {next = 0x8edd30,
    prev = 0x8ee9a0}, write = 0x406140 <writechn>, read = 0x406240 <readchn>}
(gdb) p *((struct domain *) 0x8eced0)
$2 = {list = {next = 0x8e81b0, prev = 0x6145a0 <domains>}, domid = 1,
  port = 48, remote_port = 1, mfn = 2173329,
  path = 0x8ec460 "/local/domain/1", interface = 0x7fc8acb0f000,
  conn = 0x8eb890, shutdown = 0, nbentry = 44, nbwatch = 9}

Stefan Bader (smb) wrote :

In the recent crashes the pointer always seemed consistent with previous mappings. I wonder whether the mapping might be invalidated from the hypervisor side. For additional information, when I start the domU, the grant table debug output shows the following at the beginning:

(XEN) grant-table for remote domain: 1 (v1)
(XEN) [ 0] 0 0x212992 0x00000002 0 0x212992 0x19
(XEN) [ 1] 0 0x212991 0x00000002 0 0x212991 0x19
(XEN) [ 8] 0 0x1d26b7 0x00000001 0 0x1d26b7 0x19

When xenstored crashes, the middle entry goes away. But it's impossible to say whether it goes away before the crash or just as a result of it.

Stefan Bader (smb) wrote :

Not sure whether this means anything or is just a red herring, but comparing the grant-table debug output with Wily (Xen-4.5), it looks mostly similar except for entry #8, which there has the same pin count as the other two...

(XEN) -------- active -------- -------- shared --------
(XEN) [ref] localdom mfn pin localdom gmfn flags
(XEN) grant-table for remote domain: 1 (v1)
(XEN) [ 0] 0 0x8091ff 0x00000002 0 0x8091ff 0x19
(XEN) [ 1] 0 0x809200 0x00000002 0 0x809200 0x19
(XEN) [ 8] 0 0x5353ba 0x00000002 0 0x5353ba 0x19

Reading the code, #0 is related to the console and #1 to xenstore. Not sure what #8 is.
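To make the comparison of the two dumps less error-prone, something like the following could be used. It is a rough sketch only: the names GrantEntry, parse_grant_dump and pin_differences are made up for this example, the column meaning is taken from the header line printed by the hypervisor above, and treating the localdom columns as decimal is an assumption.

```python
import re
from typing import Dict, NamedTuple, Tuple


class GrantEntry(NamedTuple):
    ref: int
    active_localdom: int
    mfn: int
    pin: int
    shared_localdom: int
    gmfn: int
    flags: int


# Matches entry lines such as "(XEN) [ 8] 0 0x1d26b7 0x00000001 0 0x1d26b7 0x19".
# The header line "(XEN) [ref] localdom ..." does not match and is skipped.
_ENTRY_RE = re.compile(
    r"\(XEN\)\s*\[\s*(\d+)\]\s+(\d+)\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)"
    r"\s+(\d+)\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)"
)


def parse_grant_dump(text: str) -> Dict[int, GrantEntry]:
    """Turn the console dump into a dict keyed by grant reference."""
    entries = {}
    for line in text.splitlines():
        m = _ENTRY_RE.search(line)
        if m is None:
            continue
        ref, adom, mfn, pin, sdom, gmfn, flags = m.groups()
        entries[int(ref)] = GrantEntry(
            int(ref), int(adom), int(mfn, 16), int(pin, 16),
            int(sdom), int(gmfn, 16), int(flags, 16),
        )
    return entries


def pin_differences(a, b) -> Dict[int, Tuple[int, int]]:
    """Return {ref: (pin_in_a, pin_in_b)} for refs present in both dumps."""
    common = sorted(a.keys() & b.keys())
    return {r: (a[r].pin, b[r].pin) for r in common if a[r].pin != b[r].pin}


# Dumps copied from this report: Xenial (Xen-4.6) and Wily (Xen-4.5).
XENIAL_DUMP = """\
(XEN) grant-table for remote domain: 1 (v1)
(XEN) [ 0] 0 0x212992 0x00000002 0 0x212992 0x19
(XEN) [ 1] 0 0x212991 0x00000002 0 0x212991 0x19
(XEN) [ 8] 0 0x1d26b7 0x00000001 0 0x1d26b7 0x19
"""
WILY_DUMP = """\
(XEN) [ref] localdom mfn pin localdom gmfn flags
(XEN) grant-table for remote domain: 1 (v1)
(XEN) [ 0] 0 0x8091ff 0x00000002 0 0x8091ff 0x19
(XEN) [ 1] 0 0x809200 0x00000002 0 0x809200 0x19
(XEN) [ 8] 0 0x5353ba 0x00000002 0 0x5353ba 0x19
"""

print(pin_differences(parse_grant_dump(XENIAL_DUMP), parse_grant_dump(WILY_DUMP)))
```

For the two dumps quoted here this prints {8: (1, 2)}, i.e. only entry #8 differs in its pin value. The mfn/gmfn values naturally differ between the two systems and are not compared.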

Stefan Bader (smb) wrote :

The crash also happens if dom0 is restricted to a single, pinned vCPU. So I guess it isn't an inconsistency. The only vague explanation I can think of is that the hypervisor for some reason invalidates the mapping without dom0 (and xenstored running there) realizing.

Stefan Bader (smb) wrote :

Installed a system with Debian testing and the crash does not happen there. That left two candidates: the additional security fixes we carry, or the kernel. And it was not the security fixes... Should have tried a different kernel sooner. Neither an old 4.2-based kernel nor the pending 4.4 one causes this, only the current 4.3-based one. Maybe some Xen-related stable patches were mis-applied or not applied at all. Who knows. Will concentrate on the coming 4.4 and, if I am happy, close this report.

Stefan Bader (smb) wrote :

Hint from upstream. Sounds plausible that this might be the upstream fix:

9c17d96500f78 "xen/gntdev: Grant maps should not be subject to NUMA balancing"

I guess 4.3 became more aggressive there, since I never saw this before 4.3, and with 4.3 it happens even on a non-NUMA system. But having a grant-table mapping taken away by the dom0 kernel without xenstored being involved would explain why xenstored now trips over that bad pointer.

Stefan Bader (smb) wrote :

The 4.4 kernel is now available and shows no issues.

Changed in xen (Ubuntu Xenial):
status: In Progress → Fix Released