Comment 5 for bug 1884560

Bart Wensley (bartwensley) wrote:

The subcloud audit has stalled completely. For bug 1884560, I see the last logs from the subcloud audit here:
2020-06-21 20:39:44.917 126591 INFO dcmanager.audit.subcloud_audit_manager [-] Triggered subcloud audit.
2020-06-21 20:39:44.924 126591 INFO dcmanager.audit.subcloud_audit_manager [-] Waiting for subcloud audits to complete.

Immediately before this, the subcloud was unmanaged:
2020-06-21 20:39:25.642 126274 INFO dcmanager.manager.service [req-b2be186a-ace3-4e0f-a7a5-54a85b841008 6f8a527c496342c9a3ab6f3fbc5b12ee - - default default] Handling update_subcloud request for: 6
2020-06-21 20:39:25.642 126274 INFO dcmanager.manager.subcloud_manager [req-b2be186a-ace3-4e0f-a7a5-54a85b841008 6f8a527c496342c9a3ab6f3fbc5b12ee - - default default] Updating subcloud 6.
2020-06-21 20:39:25.668 126274 INFO dcmanager.manager.subcloud_manager [req-b2be186a-ace3-4e0f-a7a5-54a85b841008 6f8a527c496342c9a3ab6f3fbc5b12ee - - default default] Notifying dcorch, subcloud:subcloud2 management: unmanaged, availability:online
2020-06-21 20:39:25.680 126274 INFO dcmanager.manager.subcloud_manager [req-b2be186a-ace3-4e0f-a7a5-54a85b841008 6f8a527c496342c9a3ab6f3fbc5b12ee - - default default] Updating all subclouds, endpoint: None sync: unknown

The logs indicate the subcloud audit is in the following loop waiting for all audit greenthreads to complete:
        # Wait for all greenthreads to complete
        LOG.info('Waiting for subcloud audits to complete.')
        for thread in self.subcloud_workers.values():
            thread.wait()
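
For reference, the spawn/wait pattern in play looks roughly like this; this is a sketch with illustrative names, not the actual audit manager code:

    import eventlet

    # The audit process is eventlet-based, so blocked sockets yield to
    # other greenthreads.
    eventlet.monkey_patch()

    def audit_one_subcloud(subcloud_name):
        # ... contact the subcloud's keystone/sysinv/patching APIs ...
        pass

    # One worker greenthread per subcloud (illustrative names only).
    subcloud_workers = {}
    for name in ('subcloud1', 'subcloud2'):
        subcloud_workers[name] = eventlet.spawn(audit_one_subcloud, name)

    # wait() blocks until that greenthread returns or raises. If one worker
    # is stuck in a socket read with no timeout, this loop never finishes
    # and the next audit cycle is never triggered.
    for thread in subcloud_workers.values():
        thread.wait()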

This means one of the audit threads has not returned (or raised an exception). I did some digging and I think the problem is the sysinv API – we are using the cgtsclient Client class and creating it without specifying a timeout. This class creates an HTTPClient (from httplib2) without a timeout, and the httplib2 code indicates that this results in the default socket timeout being used, which is “no timeout”.
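
To illustrate the underlying behaviour (a sketch of the plain socket/httplib2 layer, not the cgtsclient wrapper itself):

    import socket
    import httplib2

    # Python's global default socket timeout is None ("no timeout") unless
    # something explicitly changes it, so a request to a powered-off
    # subcloud can block in recv() indefinitely.
    print(socket.getdefaulttimeout())   # None

    # An httplib2 connection created without a timeout inherits that
    # default; passing one bounds every socket operation instead.
    http_no_timeout = httplib2.Http()
    http_bounded = httplib2.Http(timeout=30)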

So what happens is this:
- Subcloud audit is happily running and has cached the subcloud keystone/sysinv clients.
- Subcloud is unmanaged and immediately powered down.
- Subcloud audit runs again and uses the cached clients (this is key because there are timeouts when creating a new keystone client; see the sketch after this list).
- Subcloud audit uses cgtsclient to attempt to get the service groups from the subcloud. This hangs forever.
- Subcloud audit never runs again.
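
A simplified sketch of the caching pattern described above (hypothetical names; the real code keeps per-subcloud OpenStack clients around between audit cycles):

    # Hypothetical illustration of the client caching described above.
    class SubcloudClientCache(object):
        def __init__(self):
            self._sysinv_clients = {}

        def get_sysinv_client(self, subcloud_name, create_client):
            # Creating a new client goes through keystone and does time out,
            # but a cached client is handed back as-is, so its next REST
            # call is the first thing to touch the (now powered-off)
            # subcloud.
            if subcloud_name not in self._sysinv_clients:
                self._sysinv_clients[subcloud_name] = create_client()
            return self._sysinv_clients[subcloud_name]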

This means the subcloud will never go offline (as reported in the LP). I think this problem only started when the subcloud audit was restructured and moved into a separate process – I don’t think we were caching the subcloud clients before then, so we would have always timed out when trying to create the client.

I think the solution here is to supply a “reasonable” timeout when creating the cgtsclient – this needs to be pretty short (maybe 30s or less) because it blocks the next subcloud audit cycle. We also need to look at all the other clients the subcloud audit is using (e.g. patching) and ensure they all have a timeout configured.
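
As a rough sketch only – the exact cgtsclient entry point and keyword names below are assumptions about its interface and need to be confirmed against the client code:

    # Assumed import path and kwargs -- verify against cgtsclient.
    from cgtsclient import client as cgts_client

    SYSINV_CLIENT_TIMEOUT = 30  # seconds; keep well under the audit interval

    def make_sysinv_client(sysinv_url, token):
        # The intent is simply that every client the audit builds for a
        # subcloud carries a finite timeout down to its HTTP/socket layer.
        return cgts_client.get_client('1',
                                      sysinv_url=sysinv_url,
                                      os_auth_token=token,
                                      timeout=SYSINV_CLIENT_TIMEOUT)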

Finally – it would be good to protect that wait loop above from hanging forever if one of the audit greenthreads gets stuck. Reading through the eventlet docs I came across a Timeout class that seems to do what we want (https://eventlet.net/doc/modules/timeout.html). If we insert the following statement at the top of SubcloudAuditManager._audit_subcloud() we can cause each audit to abort with an exception after 60 seconds:
    timeout = Timeout(60, some_exception)
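
Fleshed out a bit (a sketch; the exception class and method body are placeholders), the timeout also needs to be cancelled when the audit completes normally so it cannot fire during a later cycle:

    from eventlet.timeout import Timeout

    class SubcloudAuditTimeout(Exception):
        """Placeholder exception raised when a single subcloud audit stalls."""
        pass

    def _audit_subcloud(self, subcloud):
        # Raise SubcloudAuditTimeout in this greenthread if the audit runs
        # longer than 60 seconds.
        timeout = Timeout(60, SubcloudAuditTimeout(subcloud.name))
        try:
            # ... existing audit logic (keystone/sysinv/patching checks) ...
            pass
        except SubcloudAuditTimeout:
            LOG.warning('Audit of subcloud %s timed out', subcloud.name)
        finally:
            # Cancel the timer so it cannot trigger after the audit is done.
            timeout.cancel()

Note that an eventlet Timeout only fires when the greenthread yields (e.g. blocks on a monkey-patched socket), which should be exactly the situation here, so it would cover the hung sysinv call.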

If we don’t have time to do both, I think we should add the client timeouts first and then do the overall timeout in the next release.