nova and neutron service didn't recover after force unlocking the host

Bug #1839378 reported by Ming Lei
This bug affects 1 person
Affects: StarlingX
Status: Triaged
Importance: Medium
Assigned to: Jim Gauld

Bug Description

Brief Description
-----------------
After force rebooting a host, the neutron and nova services remain in Init status and do not recover.

Severity
--------
Critical

Steps to Reproduce
------------------
1. While the host (e.g. compute-0) is unlocked and available, run "sudo reboot -f" on it to force a reboot.
2. Wait long enough for the host to recover, then run "kubectl get pod" to check the pod statuses (a command sketch follows these steps).
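
A minimal command sequence for these steps (a sketch; it assumes compute-0 is the target worker and that the checks are run from the active controller):

  # On compute-0: force an ungraceful reboot
  sudo reboot -f

  # From the active controller: confirm the host goes offline and later
  # returns to unlocked/enabled
  system host-list

  # List pods that are not Running/Succeeded (e.g. stuck in Init)
  kubectl get pod --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded -o=wide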

Expected Behavior
------------------
All pods are in Running or Completed status.

Actual Behavior
----------------
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack libvirt-libvirt-default-sdpz2 0/1 Init:0/3 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-dhcp-agent-compute-0-5621f953-jgq5b 0/1 Init:0/1 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-l3-agent-compute-0-5621f953-fgcsl 0/1 Init:0/1 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-metadata-agent-compute-0-5621f953-j62ts 0/1 Init:0/2 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-ovs-agent-compute-0-5621f953-mvwck 0/1 Init:0/3 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-sriov-agent-compute-0-5621f953-rbfs8 0/1 Init:0/2 1 90m 192.168.204.174 compute-0 <none> <none>
openstack nova-compute-compute-0-5621f953-6rpfx 0/2 Init:0/6 1 90m 192.168.204.174 compute-0 <none> <none>
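
Because the pods are stuck in their init containers, the blocking init container can usually be identified with kubectl (a diagnostic sketch; the pod name is taken from the output above, and <init-container> is a placeholder for the name reported by describe):

  # Show init-container states and recent events for one of the stuck pods
  kubectl describe pod nova-compute-compute-0-5621f953-6rpfx -n openstack

  # Inspect the logs of the init container that is not completing
  kubectl logs nova-compute-compute-0-5621f953-6rpfx -n openstack -c <init-container>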

Reproducibility
---------------
100% Reproducible

System Configuration
--------------------
2+2 system (two controllers and two worker nodes) or a two-node system

Branch/Pull Time/Commit
-----------------------
stx master as of: 20190720T013000Z

Last Pass
---------
20190720T013000Z

Timestamp/Logs
--------------
[2019-08-06 02:38:58,214] 165 INFO MainThread host_helper.reboot_hosts:: Rebooting compute-0
[2019-08-06 02:38:58,214] 301 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'
[2019-08-06 02:38:58,328] 423 DEBUG MainThread ssh.expect :: Output:
Password:
[2019-08-06 02:38:58,329] 301 DEBUG MainThread ssh.send :: Send 'Li69nux*'
[2019-08-06 02:39:08,488] 423 DEBUG MainThread ssh.expect :: Output:
Rebooting.
packet_write_wait: Connection to 192.168.204.174 port 22: Broken pipe
controller-1:~$
[2019-08-06 02:39:38,507] 3619 INFO MainThread system_helper.wait_for_hosts_states:: Waiting for ['compute-0'] to reach state(s): {'availability': ['offline', 'failed']}...
[2019-08-06 02:39:38,508] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-08-06 02:39:38,508] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-08-06 02:39:40,047] 423 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | disabled | offline |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[2019-08-06 02:49:45,734] 301 DEBUG MainThread ssh.send :: Send 'kubectl get pod --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded -o=wide'
[2019-08-06 02:49:46,009] 423 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack libvirt-libvirt-default-sdpz2 0/1 Init:0/3 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-dhcp-agent-compute-0-5621f953-jgq5b 0/1 Init:0/1 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-l3-agent-compute-0-5621f953-fgcsl 0/1 Init:0/1 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-metadata-agent-compute-0-5621f953-j62ts 0/1 Init:0/2 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-ovs-agent-compute-0-5621f953-mvwck 0/1 Init:0/3 1 90m 192.168.204.174 compute-0 <none> <none>
openstack neutron-sriov-agent-compute-0-5621f953-rbfs8 0/1 Init:0/2 1 90m 192.168.204.174 compute-0 <none> <none>
openstack nova-compute-compute-0-5621f953-6rpfx 0/2 Init:0/6 1 90m 192.168.204.174 compute-0 <none> <none>

[2019-08-06 03:02:08,744] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-08-06 03:02:10,193] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+----------+----------------------------+
| 0192e25d-def0-4134-ad62-a64aaf495695 | 200.006 | compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=compute-0.process=pci-irq-affinity-agent | major | 2019-08-06T02:43:27.705369 |
| 4cb4a0ee-f493-420b-a218-20759a112258 | 250.001 | compute-0 Configuration is out-of-date. | host=compute-0 | major | 2019-08-06T02:41:40.375879 |
| 9ac05c3b-a79e-4544-877f-720c8056ef5f | 270.001 | Host compute-1 compute services failure, failed to disable nova services | host=compute-1.services=compute | critical | 2019-08-06T02:39:52.177900 |
| a2e2ec3c-9490-42fc-9099-bd4427daf5af | 270.001 | Host compute-0 compute services failure, failed to disable nova services | host=compute-0.services=compute | critical | 2019-08-06T02:39:04.766953 |
| 2409cab2-28e3-45ca-b0fe-0712c3134366 | 750.002 | Application Apply Failure | k8s_application=stx-openstack | major | 2019-08-03T17:28:23.838877 |
+--------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+----------+----------------------------+
controller-1:~$
[2019-08-06 03:02:10,194] 301 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-08-06 03:02:10,297] 423 DEBUG MainThread ssh.expect :: Output:
0
controller-1:~$
[2019-08-06 03:02:10,297] 1534 DEBUG MainThread ssh.get_active_controller:: Getting active controller client for wcp_63_66
[2019-08-06 03:02:10,297] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-08-06 03:02:10,297] 301 DEBUG MainThread ssh.send :: Send 'kubectl get pod --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded -o=wide'
[2019-08-06 03:02:10,528] 423 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack libvirt-libvirt-default-sdpz2 0/1 Init:0/3 1 102m 192.168.204.174 compute-0 <none> <none>
openstack neutron-dhcp-agent-compute-0-5621f953-jgq5b 0/1 Init:0/1 1 102m 192.168.204.174 compute-0 <none> <none>
openstack neutron-l3-agent-compute-0-5621f953-fgcsl 0/1 Init:0/1 1 102m 192.168.204.174 compute-0 <none> <none>
openstack neutron-metadata-agent-compute-0-5621f953-j62ts 0/1 Init:0/2 1 102m 192.168.204.174 compute-0 <none> <none>
openstack neutron-ovs-agent-compute-0-5621f953-mvwck 0/1 Init:0/3 1 102m 192.168.204.174 compute-0 <none> <none>
openstack neutron-sriov-agent-compute-0-5621f953-rbfs8 0/1 Init:0/2 1 102m 192.168.204.174 compute-0 <none> <none>
openstack nova-compute-compute-0-5621f953-6rpfx 0/2 Init:0/6 1 102m 192.168.204.174 compute-0 <none> <none>
openstack nova-service-cleaner-1565060400-kkg26 0/1 Init:0/1 0 2m4s 172.16.166.255 controller-1 <none> <none>
controller-1:~$
[2019-08-06 03:02:10,528] 301 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-08-06 03:02:10,631] 423 DEBUG MainThread ssh.expect :: Output:
0
controller-1:~$
[2019-08-06 03:02:10,632] 1534 DEBUG MainThread ssh.get_active_controller:: Getting active controller client for wcp_63_66
[2019-08-06 03:02:10,632] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-08-06 03:02:10,632] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2019-08-06 03:02:12,153] 423 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------+-------------------------------+--------------------+--------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+--------------------+--------------+------------------------------------------+
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-17-centos-stable-versioned | armada-manifest | stx-openstack.yaml | apply-failed | operation aborted, check logs for detail |
+---------------------+--------------------------------+-------------------------------+--------------------+--------------+------------------------------------------+
controller-1:~$
[2019-08-06 03:02:12,153] 301 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-08-06 03:02:12,256] 423 DEBUG MainThread ssh.expect :: Output:
0
controller-1:~$
[2019-08-06 03:02:12,258] 266 DEBUG MainThread conftest.testcase_log::
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Test steps started for: testcases/functional/mtc/test_multi_node_failure_avoidance.py::test_multi_node_failure_avoidance[300-5]
[2019-08-06 03:02:12,258] 1534 DEBUG MainThread ssh.get_active_controller:: Getting active controller client for wcp_63_66
[2019-08-06 03:02:12,259] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-08-06 03:02:12,259] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-08-06 03:02:13,793] 423 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | enabled | degraded |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

Test Activity
-------------
MTC Regression Testing

Revision history for this message
Ming Lei (mlei) wrote :
summary: - nova and neutron service didn't recover after force unlocking the
- compute host
+ nova and neutron service didn't recover after force unlocking the host
Revision history for this message
Brent Rowsell (brent-rowsell) wrote :

Did you wait for the node to return to unlocked/enabled?

| 2 | compute-0 | worker | unlocked | disabled | offline |

Revision history for this message
Ming Lei (mlei) wrote :

It is unlocked and enabled, but in a degraded state.

[2019-08-06 03:02:13,793] 423 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | enabled | degraded |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

description: updated
Revision history for this message
Frank Miller (sensfan22) wrote :

This issue looks similar to https://bugs.launchpad.net/starlingx/+bug/1839160. Assigning to Jim Gauld to review logs and confirm it is a duplicate.

tags: added: stx.2.0 stx.containers
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Jim Gauld (jgauld)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Frank Miller (sensfan22) wrote :

Moving this issue to stx.4.0. The action for this LP is to check whether the issue reproduces on the Train release. If not, mark this LP as Won't Fix since it is fixed in the Train release.

tags: added: stx.4.0
removed: stx.3.0
Frank Miller (sensfan22)
tags: added: stx.distro.openstack
removed: stx.containers
Revision history for this message
Frank Miller (sensfan22) wrote :

Marking this issue as a duplicate of https://bugs.launchpad.net/starlingx/+bug/1839160 -- that issue was addressed in 2019.
