AIO-DX: "Hypervisor list" cmd not working after force reboot standby controller

Bug #1829933 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Bart Wensley

Bug Description

Brief Description
-----------------
After "sudo reboot -f" on the standby controller, the "hypervisor list" command fails with the error: Internal Server Error (HTTP 500)

Severity
--------
Major

Steps to Reproduce
------------------

1. On a two-node (AIO-DX) system, verify "openstack hypervisor list" works from the active controller.
2. Force reboot the standby controller with "sudo reboot -f".
3. Run "openstack hypervisor list" again from the active controller; see the sketch after the TC name.

TC-name: mtc/test_evacuate.py::TestTisGuest::()::test_evacuate_vms
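
A minimal reproduction sketch, assuming admin credentials are already sourced (the full auth arguments used in the logs below are omitted for brevity):

  # Baseline: should print the hypervisor table.
  openstack hypervisor list

  # On the standby controller (via its console), force a reboot.
  sudo reboot -f

  # Back on the active controller; intermittently returns HTTP 500.
  openstack hypervisor list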

Expected Behavior
------------------
"openstack hypervisor list" continues to work from the active controller while the standby controller is rebooting.

Actual Behavior
----------------
The command intermittently returns "Internal Server Error (HTTP 500)" until the system recovers.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Two node system

Lab-name: WP_1-2

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-05-18_06-36-50

Last Pass
---------
20190503T013000Z

Timestamp/Logs
--------------

[2019-05-21 15:01:38,300] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-21 15:01:40,570] 387 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP       | State |
+----+---------------------+-----------------+---------------+-------+
| 3  | controller-1        | QEMU            | 192.168.206.4 | up    |
| 7  | controller-0        | QEMU            | 192.168.206.3 | up    |
+----+---------------------+-----------------+---------------+-------+

[2019-05-21 15:01:52,500] 126 INFO MainThread host_helper.reboot_hosts:: Rebooting controller-1
[2019-05-21 15:01:52,500] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-05-21 15:02:51,042] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-21 15:02:52,360] 387 DEBUG MainThread ssh.expect :: Output:
Internal Server Error (HTTP 500)
[wrsroot@controller-0 ~(keystone_admin)]$

[2019-05-21 15:03:10,620] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-21 15:03:11,663] 387 DEBUG MainThread ssh.expect :: Output:
Internal Server Error (HTTP 500)

Test Activity
-------------
Sanity

Revision history for this message
Yang Liu (yliu12) wrote :

Note this is not limited to the 'openstack hypervisor list' command. After a force reboot of a controller on an AIO-duplex system, the containerized nova/cinder commands stop working even though the active controller is up. They may recover after the rebooted controller comes back up.
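
A quick way to see whether the containerized services themselves are unhealthy, or only their access to the database, is to check the pod states (a sketch; the "openstack" namespace and grep pattern follow the usual openstack-helm layout):

  kubectl -n openstack get pods | grep -E 'nova|cinder|mariadb'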

Ghada Khalil (gkhalil)
tags: added: stx.sanity
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

How long does it take until the commands are functional again?

tags: added: stx.containers
Changed in starlingx:
status: New → Incomplete
assignee: nobody → Peng Peng (ppeng)
Revision history for this message
Peng Peng (ppeng) wrote :

In our test suite, the next "hypervisor list" call was one hour later and it passed, so we do not know exactly how long recovery took. A polling sketch to measure this on a retest follows the log excerpt.

[2019-05-21 16:01:56,822] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-21 16:01:59,198] 387 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP       | State |
+----+---------------------+-----------------+---------------+-------+
| 3  | controller-1        | QEMU            | 192.168.206.4 | up    |
| 7  | controller-0        | QEMU            | 192.168.206.3 | up    |
+----+---------------------+-----------------+---------------+-------+
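
To bound the recovery time on a retest, the command could be polled in a loop; a minimal sketch, assuming admin credentials are already sourced:

  # Report how long the CLI stays broken after the standby reboot.
  start=$(date +%s)
  until openstack hypervisor list >/dev/null 2>&1; do
      sleep 10
  done
  echo "recovered after $(( $(date +%s) - start )) seconds"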

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; requires further investigation

This sounds similar to https://bugs.launchpad.net/starlingx/+bug/1830650
If the logs have the same signature, please mark this one as a duplicate.

Changed in starlingx:
importance: Undecided → Medium
status: Incomplete → Triaged
assignee: Peng Peng (ppeng) → Bart Wensley (bartwensley)
tags: added: stx.2.0
Ghada Khalil (gkhalil)
summary: - "Hypervisor list" cmd not working after force reboot standby controller
+ AIO: "Hypervisor list" cmd not working after force reboot standby
+ controller
summary: - AIO: "Hypervisor list" cmd not working after force reboot standby
+ AIO-DX: "Hypervisor list" cmd not working after force reboot standby
controller
Revision history for this message
Bart Wensley (bartwensley) wrote :

When controller-1 is rebooted at 15:01:52, the nova-api on controller-0 loses access to mariadb, which prevents any of the nova-related CLIs (e.g. "server list") from running. For example:
{"log":"2019-05-21 15:02:18.491 1 ERROR oslo_db.sqlalchemy.exc_filters [req-2c25a5f7-22c7-4736-8c06-5d8c10e5c159 53347a9c6a7340f5be77da8cd6f2b3d8 efe5dd8fd94c428fa1eac07d31dc4184 - default default] DBAPIError exception wrapped from (pymysql.err.InternalError) (1047, u'WSREP has not yet prepared node for application use') [SQL: u'SELECT instance_mappings.created_at AS instance_mappings_created_at, instance_mappings.updated_at AS instance_mappings_updated_at, instance_mappings.id AS instance_mappings_id, instance_mappings.instance_uuid AS instance_mappings_instance_uuid, instance_mappings.cell_id AS instance_mappings_cell_id, instance_mappings.project_id AS instance_mappings_project_id, instance_mappings.user_id AS instance_mappings_user_id, instance_mappings.queued_for_delete AS instance_mappings_queued_for_delete, cell_mappings_1.created_at AS cell_mappings_1_created_at, cell_mappings_1.updated_at AS cell_mappings_1_updated_at, cell_mappings_1.id AS cell_mappings_1_id, cell_mappings_1.uuid AS cell_mappings_1_uuid, cell_mappings_1.name AS cell_mappings_1_name, cell_mappings_1.transport_url AS cell_mappings_1_transport_url, cell_mappings_1.database_connection AS cell_mappings_1_database_connection, cell_mappings_1.disabled AS cell_mappings_1_disabled \\nFROM instance_mappings LEFT OUTER JOIN cell_mappings AS cell_mappings_1 ON instance_mappings.cell_id = cell_mappings_1.id \\nWHERE instance_mappings.instance_uuid = %(instance_uuid_1)s \\n LIMIT %(param_1)s'] [parameters: {u'param_1': 1, u'instance_uuid_1': u'6e8a1aea-b316-4c5d-af48-5d32ffe0a443'}] (Background on this error at: http://sqlalche.me/e/2j85): InternalError: (1047, u'WSREP has not yet prepared node for application use')\n","stream":"stdout","time":"2019-05-21T15:02:18.492940446Z"}

mariadb is running as a single node in the non-Primary state, so it is not functional:
{"log":"2019-05-21 15:01:58,543 - OpenStack-Helm Mariadb - INFO - 2019-05-21 15:01:58 139633481164544 [Note] WSREP: New cluster view: global state: 09649517-7ba1-11e9-8a1c-6e7cd8f18acb:42492, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3\n","stream":"stderr","time":"2019-05-21T15:01:58.543232866Z"}

There are other issues after that, but the trigger was the mariadb problem. I can see this log constantly coming out on both controllers:
2019-05-21T15:00:13.530 controller-0 OCF_dbmon(dbmon)[2903462]: info INFO: Unable to get mariadb password. Exiting.

This is due to a bug in the dbmon OCF script:
https://bugs.launchpad.net/starlingx/+bug/1826891
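
For reference, the generic Galera recovery for a lone surviving node stuck in non-Primary is to force a new primary component; this is a sketch of the manual workaround only, not the StarlingX fix (which went into the dbmon OCF script):

  # Force this node to bootstrap a new primary component (standard Galera procedure).
  mysql -u root -p"$MARIADB_ROOT_PASSWORD" \
      -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=YES'"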

I am marking this LP as a duplicate of the dbmon issue (LP1826891). Once that issue is fixed, please re-test.

Peng Peng (ppeng)
tags: removed: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Duplicate bug was fixed on 2019-06-06
https://review.opendev.org/663318

Marking as Fix Released

Changed in starlingx:
status: Triaged → Fix Released