the timeout value is actually ~7500 seconds and epoll_wait() rightfully blocks as there are no events on the sockets being watched. Attaching with gdb or strace sends a SIGSTOP to the process and causes epoll_wait() to be interrupted - then nova-cert tries to send something via mysql socket and finally figures out it's broken.
AFAIU, the "snapshotting-reverting" done by dos.py somehow causes the issue with timeout - maybe due to the fact we try to sync time on revert?
So, I checked the core dump of nova-cert process and it looks interesting:
(gdb) info locals 56eb8, args=<optimized out>, kwds=<optimized out>) at ../Modules/ selectmodule. c:1034 selectmodule. c
_save = 0x1ef70a0
dtimeout = 7500.1718587875366
timeout = 7500171
maxevents = 1023
nfds = <optimized out>
i = <optimized out>
elist = 0x0
etuple = 0x0
evs = 0x41ef400
kwlist = {0x5d99e1 "timeout", 0x5d9b76 "maxevents", 0x0}
kwlist = {0x5d99e1 "timeout", 0x5d9b76 "maxevents", 0x0}
(gdb) frame
#1 0x0000000000435ef8 in pyepoll_poll (self=0x7f41676
1034 in ../Modules/
the timeout value is actually ~7500 seconds and epoll_wait() rightfully blocks as there are no events on the sockets being watched. Attaching with gdb or strace sends a SIGSTOP to the process and causes epoll_wait() to be interrupted - then nova-cert tries to send something via mysql socket and finally figures out it's broken.
AFAIU, the "snapshotting- reverting" done by dos.py somehow causes the issue with timeout - maybe due to the fact we try to sync time on revert?
Looks like epoll_wait() wrapper in Python 2.x does not use monotonic clock - https:/ /github. com/python/ cpython/ blob/2. 7/Modules/ selectmodule. c#L986- L1058 , while Python 3.x does - https:/ /github. com/python/ cpython/ blob/master/ Modules/ selectmodule. c#L1485- L1592 - this must be the reason the timeout value is unexpectedly big.
The bottom line is, this must only be reproduced on the virtual environments in our tests.