nova and neutron service didn't recover after force unlocking the host

Bug #1839378 reported by Ming Lei on 2019-08-07
Affects: StarlingX | Status: Triaged | Importance: Medium | Assigned to: Jim Gauld

Bug Description

Brief Description
-----------------
After force rebooting a host, the neutron and nova services remain in Init status and do not recover.

Severity
--------
Critical

Steps to Reproduce
------------------
1. With the host unlocked and available, run "sudo reboot -f" on the host, e.g. compute-0 (exact commands shown below).
2. Wait long enough for the host to recover, then run "kubectl get pod" to check the pod status.
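
For reference, these are the exact commands used in the test logs below; the field selector in the second command filters the output down to pods that are not yet healthy:

  sudo reboot -f
  kubectl get pod --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded -o=wide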

Expected Behavior
------------------
All pods are running or completed
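
A minimal way to verify this (a sketch, not taken from the test framework) is to poll with the same field selector until nothing is left in a non-Running, non-Succeeded phase:

  # exits once every pod is Running or Succeeded
  while kubectl get pod --all-namespaces \
          --field-selector=status.phase!=Running,status.phase!=Succeeded \
          --no-headers 2>/dev/null | grep -q .; do
      sleep 30
  done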

Actual Behavior
----------------
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack libvirt-libvirt-default-sdpz2 0/1 Init:0/3 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-dhcp-agent-compute-0-5621f953-jgq5b 0/1 Init:0/1 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-l3-agent-compute-0-5621f953-fgcsl 0/1 Init:0/1 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-metadata-agent-compute-0-5621f953-j62ts 0/1 Init:0/2 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-ovs-agent-compute-0-5621f953-mvwck 0/1 Init:0/3 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-sriov-agent-compute-0-5621f953-rbfs8 0/1 Init:0/2 1 90m 192.168.204.174 compute-0 <none> <none>
openstack nova-compute-compute-0-5621f953-6rpfx 0/2 Init:0/6 1 90m 192.168.204.174 compute-0 <none> <none>

Reproducibility
---------------
100% Reproducible

System Configuration
--------------------
2+2 system (two controllers + two workers) or two-node system

Branch/Pull Time/Commit
-----------------------
stx master as of: 20190720T013000Z

Last Pass
---------
20190720T013000Z

Timestamp/Logs
--------------
[2019-08-06 02:38:58,214] 165 INFO MainThread host_helper.reboot_hosts:: Rebooting compute-0
[2019-08-06 02:38:58,214] 301 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'
[2019-08-06 02:38:58,328] 423 DEBUG MainThread ssh.expect :: Output:
Password:
[2019-08-06 02:38:58,329] 301 DEBUG MainThread ssh.send :: Send 'Li69nux*'
[2019-08-06 02:39:08,488] 423 DEBUG MainThread ssh.expect :: Output:
Rebooting.
packet_write_wait: Connection to 192.168.204.174 port 22: Broken pipe
controller-1:~$
[2019-08-06 02:39:38,507] 3619 INFO MainThread system_helper.wait_for_hosts_states:: Waiting for ['compute-0'] to reach state(s): {'availability': ['offline', 'failed']}...
[2019-08-06 02:39:38,508] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-08-06 02:39:38,508] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-08-06 02:39:40,047] 423 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | disabled | offline |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[2019-08-06 02:49:45,734] 301 DEBUG MainThread ssh.send :: Send 'kubectl get pod --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded -o=wide'
[2019-08-06 02:49:46,009] 423 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack libvirt-libvirt-default-sdpz2 0/1 Init:0/3 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-dhcp-agent-compute-0-5621f953-jgq5b 0/1 Init:0/1 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-l3-agent-compute-0-5621f953-fgcsl 0/1 Init:0/1 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-metadata-agent-compute-0-5621f953-j62ts 0/1 Init:0/2 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-ovs-agent-compute-0-5621f953-mvwck 0/1 Init:0/3 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-sriov-agent-compute-0-5621f953-rbfs8 0/1 Init:0/2 1 90m 192.168.204.174 compute-0 <none> <none>
openstack nova-compute-compute-0-5621f953-6rpfx 0/2 Init:0/6 1 90m 192.168.204.174 compute-0 <none> <none>

[2019-08-06 03:02:08,744] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-08-06 03:02:10,193] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+----------+----------------------------+
| 0192e25d-def0-4134-ad62-a64aaf495695 | 200.006 | compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=compute-0.process=pci-irq-affinity-agent | major | 2019-08-06T02:43:27.705369 |
| 4cb4a0ee-f493-420b-a218-20759a112258 | 250.001 | compute-0 Configuration is out-of-date. | host=compute-0 | major | 2019-08-06T02:41:40.375879 |
| 9ac05c3b-a79e-4544-877f-720c8056ef5f | 270.001 | Host compute-1 compute services failure, failed to disable nova services | host=compute-1.services=compute | critical | 2019-08-06T02:39:52.177900 |
| a2e2ec3c-9490-42fc-9099-bd4427daf5af | 270.001 | Host compute-0 compute services failure, failed to disable nova services | host=compute-0.services=compute | critical | 2019-08-06T02:39:04.766953 |
| 2409cab2-28e3-45ca-b0fe-0712c3134366 | 750.002 | Application Apply Failure | k8s_application=stx-openstack | major | 2019-08-03T17:28:23.838877 |
+--------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+----------+----------------------------+
controller-1:~$
[2019-08-06 03:02:10,194] 301 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-08-06 03:02:10,297] 423 DEBUG MainThread ssh.expect :: Output:
0
controller-1:~$
[2019-08-06 03:02:10,297] 1534 DEBUG MainThread ssh.get_active_controller:: Getting active controller client for wcp_63_66
[2019-08-06 03:02:10,297] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-08-06 03:02:10,297] 301 DEBUG MainThread ssh.send :: Send 'kubectl get pod --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded -o=wide'
[2019-08-06 03:02:10,528] 423 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack libvirt-libvirt-default-sdpz2 0/1 Init:0/3 1 102m 192.168.204.174 compute-0 <none> <none>
openstack neutron-dhcp-agent-compute-0-5621f953-jgq5b 0/1 Init:0/1 1 102m 192.168.204.174 compute-0 <none> <none>
openstack neutron-l3-agent-compute-0-5621f953-fgcsl 0/1 Init:0/1 1 102m 192.168.204.174 compute-0 <none> <none>
openstack neutron-metadata-agent-compute-0-5621f953-j62ts 0/1 Init:0/2 1 102m 192.168.204.174 compute-0 <none> <none>
openstack neutron-ovs-agent-compute-0-5621f953-mvwck 0/1 Init:0/3 1 102m 192.168.204.174 compute-0 <none> <none>
openstack neutron-sriov-agent-compute-0-5621f953-rbfs8 0/1 Init:0/2 1 102m 192.168.204.174 compute-0 <none> <none>
openstack nova-compute-compute-0-5621f953-6rpfx 0/2 Init:0/6 1 102m 192.168.204.174 compute-0 <none> <none>
openstack nova-service-cleaner-1565060400-kkg26 0/1 Init:0/1 0 2m4s 172.16.166.255 controller-1 <none> <none>
controller-1:~$
[2019-08-06 03:02:10,528] 301 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-08-06 03:02:10,631] 423 DEBUG MainThread ssh.expect :: Output:
0
controller-1:~$
[2019-08-06 03:02:10,632] 1534 DEBUG MainThread ssh.get_active_controller:: Getting active controller client for wcp_63_66
[2019-08-06 03:02:10,632] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-08-06 03:02:10,632] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2019-08-06 03:02:12,153] 423 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------+-------------------------------+--------------------+--------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+--------------------+--------------+------------------------------------------+
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-17-centos-stable-versioned | armada-manifest | stx-openstack.yaml | apply-failed | operation aborted, check logs for detail |
+---------------------+--------------------------------+-------------------------------+--------------------+--------------+------------------------------------------+
controller-1:~$
[2019-08-06 03:02:12,153] 301 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-08-06 03:02:12,256] 423 DEBUG MainThread ssh.expect :: Output:
0
controller-1:~$
[2019-08-06 03:02:12,258] 266 DEBUG MainThread conftest.testcase_log::
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Test steps started for: testcases/functional/mtc/test_multi_node_failure_avoidance.py::test_multi_node_failure_avoidance[300-5]
[2019-08-06 03:02:12,258] 1534 DEBUG MainThread ssh.get_active_controller:: Getting active controller client for wcp_63_66
[2019-08-06 03:02:12,259] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-08-06 03:02:12,259] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-08-06 03:02:13,793] 423 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | enabled | degraded |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

Test Activity
-------------
MTC Regression Testing

Ming Lei (mlei) wrote :
summary: - nova and neutron service didn't recover after force unlocking the
- compute host
+ nova and neutron service didn't recover after force unlocking the host
Brent Rowsell (brent-rowsell) wrote :

Did you wait for the node to return to unlocked/enabled?

| 2 | compute-0 | worker | unlocked | disabled | offline |

Ming Lei (mlei) wrote :

It is unlocked and enabled, but in a degraded state.

[2019-08-06 03:02:13,793] 423 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | enabled | degraded |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

description: updated
Frank Miller (sensfan22) wrote :

This issue looks similar to https://bugs.launchpad.net/starlingx/+bug/1839160. Assigning to Jim Gauld to review logs and confirm it is a duplicate.

tags: added: stx.2.0 stx.containers
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Jim Gauld (jgauld)
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Yang Liu (yliu12) on 2019-09-09
tags: added: stx.retestneeded