Comment 17 for bug 1970645

Thales Elero Cervi (tcervi) wrote :

Hi, thanks for sharing those logs.
They were indeed a great help in getting a fuller picture. Apparently the majority of the tests passed during that execution, but some key tests did not. In fact, the first test to fail was "test_lock_unlock_standby_controller", which tried to lock and unlock controller-1, but stx-openstack could not reach the applied status before the timeout:

[2022-05-22 07:24:44,669] 290 WARNING MainThread container_helper.wait_for_apps_status:: ['stx-openstack'] did not reach status applied within 360s

The reapply itself did not fail within that timeout, though; the last status update I can see is from [2022-05-22 07:24:34,605] and it was:
 applying | processing chart: osh-openstack-nginx-ports-control, overall completion: 10.0%
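
For reference, this kind of status check can be reproduced by hand. Below is a minimal Python sketch (not the actual container_helper code) that polls the application status via the "system application-list" CLI until it reaches "applied" or the timeout expires; the table parsing and the polling period are assumptions of this sketch.

import subprocess
import time

APP = "stx-openstack"
TIMEOUT = 360        # seconds, matching the 360s limit in the warning above
POLL_INTERVAL = 15   # arbitrary polling period for this sketch


def get_app_status(app_name):
    """Return the status column for app_name from `system application-list`.

    Assumes the default table output, where the status is the fifth
    pipe-separated column of the application's row.
    """
    out = subprocess.run(["system", "application-list"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if app_name in line:
            cols = [c.strip() for c in line.split("|") if c.strip()]
            return cols[4] if len(cols) > 4 else None
    return None


def wait_for_applied(app_name, timeout=TIMEOUT):
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_app_status(app_name)
        print(f"{app_name} status: {status}")
        if status == "applied":
            return True
        if status and "fail" in status:  # e.g. the "apply-failure" state seen later
            return False
        time.sleep(POLL_INTERVAL)
    return False  # mirrors the "did not reach status applied within 360s" warning


if __name__ == "__main__":
    if not wait_for_applied(APP):
        print(f"{APP} did not reach 'applied' within {TIMEOUT}s")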

And around that time controller-1 was not stable:
sysinv 2022-05-22 07:22:47.882 835363 INFO sysinv.conductor.manager [-] Node(s) are in an unstable state. Defer audit.
sysinv 2022-05-22 07:23:29.696 835363 INFO sysinv.conductor.manager [-] Updating platform data for host: 4ddfe5d8-92ed-4df0-9912-2136257f3a81 with: {u'first_report': True}

Right before the reapply error, this is what I could find in the armada log:
2022-05-22 07:23:35.853 344 WARNING urllib3.connectionpool [-] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))': /apis/armada.process/v1/namespaces/kube-system/locks/locks.armada.process.lock

I am not sure what caused that unstable state on controller-1, but the logs offer a few leads:

[2022-05-22 07:18:31,993] 4808 WARNING MainThread host_helper.wait_for_tasks_affined:: /etc/platform/.task_affining_incomplete did not clear on controller-1

2022-05-22T07:22:02.753737 | log | 200.022 | controller-1 is now 'offline' host=controller-1.status=offline
2022-05-22T07:22:02.752373 | set | 400.005 | Communication failure detected with peer over port eno1 on host controller-0 | host=controller-0.network=oam
2022-05-22T07:22:02.751492 | log | 200.022 | controller-1 is now 'disabled' host=controller-1.state=disabled
2022-05-22T07:22:02.749402 | set | 200.004 | controller-1 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful host=controller-1
2022-05-22T07:22:02.748747 | log | 401.005 | Communication failure detected with peer over port eno1 on host controller-0 | host=controller-0.network=oam
2022-05-22T07:22:02.748030 | set | 400.005 | Communication failure detected with peer over port eno2 on host controller-0 | host=controller-0.network=cluster
2022-05-22T07:22:02.684878 | set | 200.005 | controller-1 experienced a persistent critical 'Management Network' communication failure.| host=controller-1.network=Management

Around the same time I could also see three openstack pods evicted on controller-0 due to low ephemeral-storage resources: fm-rest-api-5dcd9d9484-jnsbx, ingress-84c5f4749f-ntvnq and nova-api-proxy-667468b59d-hbthb.
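
If it helps to confirm that on a live system, evicted pods can be listed together with the eviction message, which normally names the exhausted resource. A small Python sketch using kubectl's JSON output (not part of the test framework):

import json
import subprocess

# Pods in all namespaces that ended up in phase Failed; evicted pods
# land there with status.reason == "Evicted".
out = subprocess.run(
    ["kubectl", "get", "pods", "--all-namespaces",
     "--field-selector=status.phase=Failed", "-o", "json"],
    capture_output=True, text=True, check=True).stdout

for pod in json.loads(out).get("items", []):
    status = pod.get("status", {})
    if status.get("reason") == "Evicted":
        name = pod["metadata"]["name"]
        ns = pod["metadata"]["namespace"]
        # The message usually states which resource was low, e.g.
        # "The node was low on resource: ephemeral-storage."
        print(f"{ns}/{name}: {status.get('message', '<no message>')}")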

So that reapply failed and left stx-openstack in the "apply-failure" status, although the pods apparently came up later, since the vast majority of the instance tests passed. The standby controller went down during the reapply, so I guess a failure here would be expected, but from the logs I could not determine the exact reason for the unstable state on controller-1.

The last reapply was triggered around 2022-05-22 08:45:41.481 for "test_openstack_pod_healthy.py::test_openstack_pods_healthy [20220522 08:44:38]", which also failed with the nova-api-proxy-667468b59d-vzzsb pod in an Evicted status, although for this one there are no additional describe or pod logs to gather more information from.
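
For future runs it might be worth archiving the describe output and logs as soon as an eviction is seen, so that information is not lost. A rough sketch of such a collection step (the output directory and the "openstack" namespace are assumptions):

import pathlib
import subprocess

# Hypothetical output directory for the collected artifacts.
ARTIFACT_DIR = pathlib.Path("/tmp/evicted-pod-artifacts")
ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)


def collect(namespace, pod):
    """Save `kubectl describe` and the container logs for an evicted pod."""
    describe = subprocess.run(
        ["kubectl", "describe", "pod", "-n", namespace, pod],
        capture_output=True, text=True).stdout
    (ARTIFACT_DIR / f"{namespace}_{pod}_describe.txt").write_text(describe)

    # Container logs may already be gone after an eviction, so this call
    # can legitimately return little or nothing; keep whatever comes back.
    logs = subprocess.run(
        ["kubectl", "logs", "-n", namespace, pod, "--all-containers=true"],
        capture_output=True, text=True).stdout
    (ARTIFACT_DIR / f"{namespace}_{pod}_logs.txt").write_text(logs)


# Example: the pod mentioned above, assuming it ran in the openstack namespace.
collect("openstack", "nova-api-proxy-667468b59d-vzzsb")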

This is a very problematic and strange behavior that definitely needs to be looked into. Both the unstable controller-1 and the pod evictions are major concerns for achieving a stable sanity execution.

Do you think you could share access to the servers on which the tests are running? I will contact you directly via email to pursue this option further.