Comment 12 for bug 1570936

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

So, I checked the core dump of nova-cert process and it looks interesting:

(gdb) info locals
_save = 0x1ef70a0
dtimeout = 7500.1718587875366
timeout = 7500171
maxevents = 1023
nfds = <optimized out>
i = <optimized out>
elist = 0x0
etuple = 0x0
evs = 0x41ef400
kwlist = {0x5d99e1 "timeout", 0x5d9b76 "maxevents", 0x0}
kwlist = {0x5d99e1 "timeout", 0x5d9b76 "maxevents", 0x0}
(gdb) frame
#1 0x0000000000435ef8 in pyepoll_poll (self=0x7f4167656eb8, args=<optimized out>, kwds=<optimized out>) at ../Modules/selectmodule.c:1034
1034 in ../Modules/selectmodule.c

the timeout value is actually ~7500 seconds and epoll_wait() rightfully blocks as there are no events on the sockets being watched. Attaching with gdb or strace sends a SIGSTOP to the process and causes epoll_wait() to be interrupted - then nova-cert tries to send something via mysql socket and finally figures out it's broken.

AFAIU, the "snapshotting-reverting" done by dos.py somehow causes the issue with timeout - maybe due to the fact we try to sync time on revert?

Looks like epoll_wait() wrapper in Python 2.x does not use monotonic clock - https://github.com/python/cpython/blob/2.7/Modules/selectmodule.c#L986-L1058 , while Python 3.x does - https://github.com/python/cpython/blob/master/Modules/selectmodule.c#L1485-L1592 - this must be the reason the timeout value is unexpectedly big.

The bottom line is, this must only be reproduced on the virtual environments in our tests.