Comment 9 for bug 1534345

Stefan Bader (smb) wrote :

Note that the following is not a final statement; I am sharing some thoughts with whoever else is looking at this report (and for me to remember). While I found nothing that really looked odd in the xen-netfront code, I saw that there was a change to the generic timer code:

commit 1dabbcec2c0a36fe43509d06499b9e512e70a028
  timer: Use hlist for the timer wheel hash buckets

That change was part of 4.2, but if it were the cause I would expect problems not only on AWS instances. But then it might just be that bare-metal servers with similarly high traffic tend to be upgraded much less often... anyway... Part of the change above seems to be a swap in the special meaning of the list pointer values. Not sure I grasp the implications yet. Before, with doubly linked lists, the pointer to the next element seemed to serve as the pending indicator and the pointer to the previous element was invalidated with a LIST_POISON2 value. Now it's the other way round: pprev serves as the pending indicator and next is set to LIST_POISON2. This refers to the detach_timer function, which is called from __run_timers via detach_expired_timer.
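To make the swap of meanings concrete, here is a minimal user-space sketch of the two conventions as I read them. It is a model, not the kernel source: the struct layouts mirror list_head and hlist_node, the poison constant is the value seen in the disassembly of the crashing kernel, and the helper names (old_mark_detached, new_mark_detached, old_pending, new_pending, check_indicators) are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* User-space model only, not kernel source. The poison value is the one
 * loaded by the movabs in the crashing kernel's disassembly. */
#define LIST_POISON2 ((void *)(uintptr_t)0xdead000000200200ULL)

struct list_head  { struct list_head *next, *prev; };
struct hlist_node { struct hlist_node *next, **pprev; };

/* Before the change: next == NULL means "not pending", prev is poisoned. */
static void old_mark_detached(struct list_head *e)
{
    e->next = NULL;
    e->prev = LIST_POISON2;
}

static bool old_pending(const struct list_head *e)
{
    return e->next != NULL;
}

/* After the change: pprev == NULL means "not pending", next is poisoned. */
static void new_mark_detached(struct hlist_node *e)
{
    e->pprev = NULL;
    e->next = LIST_POISON2;
}

static bool new_pending(const struct hlist_node *e)
{
    return e->pprev != NULL;
}

/* Hypothetical helper: checks both conventions on freshly detached entries. */
static void check_indicators(void)
{
    struct list_head o = { &o, &o };
    struct hlist_node n = { NULL, &n.next };

    old_mark_detached(&o);
    new_mark_detached(&n);
    assert(!old_pending(&o) && o.prev == LIST_POISON2);
    assert(!new_pending(&n) && n.next == LIST_POISON2);
}
```

The point of the sketch is only that a detached entry now carries the poison value in next, which is the field an hlist delete reads first.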

The crash happens at offset 0x116 in run_timer_softirq (that's 278 decimal). The disassembly of that function around there is:

   0xffffffff810e5c1e <+254>: mov %r15,0x8(%rbx)
   0xffffffff810e5c22 <+258>: nopl 0x0(%rax,%rax,1)
   // Guess this is __hlist_del(struct hlist_node *n)
   // rax = n->next
   0xffffffff810e5c27 <+263>: mov (%r15),%rax
   // rdx = n->pprev
   0xffffffff810e5c2a <+266>: mov 0x8(%r15),%rdx
   0xffffffff810e5c2e <+270>: test %rax,%rax
   // *(n->pprev) = n->next
   0xffffffff810e5c31 <+273>: mov %rax,(%rdx)
   // if (n->next == NULL) jump
   0xffffffff810e5c34 <+276>: je 0xffffffff810e5c3a <run_timer_softirq+282>
   // (n->next)->pprev = n->pprev (but n->next is LIST_POISON2 / invalid ptr)
   0xffffffff810e5c36 <+278>: mov %rdx,0x8(%rax)
   0xffffffff810e5c3a <+282>: testb $0x10,0x2a(%r15)
   // here we seem to be back in the inlined detach_timer, with clear_pending assumed true
   // entry->next = LIST_POISON2 and entry->pprev = NULL
   0xffffffff810e5c3f <+287>: movabs $0xdead000000200200,%rax
   0xffffffff810e5c49 <+297>: movq $0x0,0x8(%r15)
   0xffffffff810e5c51 <+305>: mov %rax,(%r15)
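
For reference, the annotated instructions can be written back as C. This is a user-space sketch under my reading of the disassembly, not a copy of the kernel source: the model_* functions and store_target are hypothetical names, and the asserts stand in for the two stores that would fault on real hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space model; poison value taken from the movabs at +287. */
#define LIST_POISON2 ((struct hlist_node *)(uintptr_t)0xdead000000200200ULL)

struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

/* Insert n at the front of the bucket h. */
static void model_add_head(struct hlist_node *n, struct hlist_head *h)
{
    n->next = h->first;
    if (h->first)
        h->first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Mirrors the instructions at +263 .. +278. */
static void model_hlist_del(struct hlist_node *n)
{
    struct hlist_node *next = n->next;     /* mov (%r15),%rax    */
    struct hlist_node **pprev = n->pprev;  /* mov 0x8(%r15),%rdx */

    /* Guard the model instead of faulting on an invalid store. */
    assert(pprev != NULL);
    assert(next != LIST_POISON2);

    *pprev = next;                         /* mov %rax,(%rdx)    */
    if (next)                              /* test / je          */
        next->pprev = pprev;               /* mov %rdx,0x8(%rax) */
}

/* Mirrors +287 .. +305: detach_timer with clear_pending true. */
static void model_detach(struct hlist_node *n)
{
    model_hlist_del(n);
    n->pprev = NULL;
    n->next = LIST_POISON2;
}

/* Address the store at +278 would write to for a given n->next. */
static uintptr_t store_target(const struct hlist_node *next)
{
    return (uintptr_t)next + offsetof(struct hlist_node, pprev);
}
```

In this model, detaching an entry once leaves the bucket consistent. Running the delete again on an entry whose next already holds LIST_POISON2 would make the store at +278 go to the poison value plus the offset of pprev, which is why the second assert is there.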