OpenStack Compute (nova)

Bug #1357578
Comment #26

Comment 26 for bug 1357578

Revision history for this message

Sean Dague (sdague) wrote on 2014-09-23:

#26

So... my current theory is the following:

    def _get_workers(self):
        # NOTE(hartsocks): use of ps checks the process table for child pid
        # entries these processes may be ended but not reaped so ps may
        # show processes that are still being cleaned out of the table.
        f = os.popen('ps ax -o pid,ppid,command')
        # Skip ps header
        f.readline()

        processes = [tuple(int(p) for p in l.strip().split()[:2])
                     for l in f.readlines()]
        workers = [p for p, pp in processes if pp == self.pid]
        LOG.info('in _get_workers(), workers: %r for PPID %d',
                 workers, self.pid)
        return workers

Collects the subprocesses via ps.

However, when we loop on them, we assume that they were all sent the correct SIGTERM previously. And we os.kill(pid, 0), which doesn't actually send a kill signal.

So, if some earlier test failed on us, and it spawned a worker that we didn't expect, we could very easily stall polling the status of that process. And we'll never even try the rest of them based on the way the for loop works.

Proposed fix is to send a SIGTERM on every loop iteration.