AIO-DX: "Hypervisor list" cmd not working after force reboot standby controller

Bug #1829933 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Bart Wensley

Bug Description

Brief Description
-----------------
After a forced reboot of the standby controller ("sudo reboot -f"), the "hypervisor list" cmd stops working and returns the error msg: Internal Server Error (HTTP 500)

Severity
--------
Major

Steps to Reproduce
------------------

1. On an AIO-DX (two node) system, verify "openstack hypervisor list" works from the active controller.
2. Force reboot the standby controller ("sudo reboot -f").
3. Run "openstack hypervisor list" again from the active controller.

TC-name: mtc/test_evacuate.py::TestTisGuest::()::test_evacuate_vms

Expected Behavior
------------------
"hypervisor list" continues to work from the active controller while the standby controller reboots.

Actual Behavior
----------------
"hypervisor list" returns "Internal Server Error (HTTP 500)" until the rebooted controller recovers.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Two node system

Lab-name: WP_1-2

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-05-18_06-36-50

Last Pass
---------
20190503T013000Z

Timestamp/Logs
--------------

[2019-05-21 15:01:38,300] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-21 15:01:40,570] 387 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP       | State |
+----+---------------------+-----------------+---------------+-------+
| 3  | controller-1        | QEMU            | 192.168.206.4 | up    |
| 7  | controller-0        | QEMU            | 192.168.206.3 | up    |
+----+---------------------+-----------------+---------------+-------+

[2019-05-21 15:01:52,500] 126 INFO MainThread host_helper.reboot_hosts:: Rebooting controller-1
[2019-05-21 15:01:52,500] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-05-21 15:02:51,042] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-21 15:02:52,360] 387 DEBUG MainThread ssh.expect :: Output:
Internal Server Error (HTTP 500)
[wrsroot@controller-0 ~(keystone_admin)]$

[2019-05-21 15:03:10,620] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-21 15:03:11,663] 387 DEBUG MainThread ssh.expect :: Output:
Internal Server Error (HTTP 500)
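
At this point a quick triage step (not part of the original report) is to check the containerized services on the surviving controller. A minimal sketch; the "openstack" namespace and the pod name patterns are assumptions based on a typical stx-openstack deployment:

  # Are the mariadb and nova-api pods still up on the surviving controller?
  kubectl -n openstack get pods -o wide | grep -E 'mariadb|nova-api'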

Test Activity
-------------
Sanity

Revision history for this message
Yang Liu (yliu12) wrote :

Note this is not only for the 'openstack hypervisor list' cmd. Basically, after a force reboot of a controller on an AIO-duplex system, the containerized nova/cinder cmds stop working even though the active controller is up. They may recover after the rebooted controller comes back up.

Ghada Khalil (gkhalil)
tags: added: stx.sanity
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

How long does it take until the commands are functional again?

tags: added: stx.containers
Changed in starlingx:
status: New → Incomplete
assignee: nobody → Peng Peng (ppeng)
Revision history for this message
Peng Peng (ppeng) wrote :

In our test suite, the next "hypervisor list" cmd call came an hour later and succeeded. We are not sure exactly how long the recovery took.

[2019-05-21 16:01:56,822] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-21 16:01:59,198] 387 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP       | State |
+----+---------------------+-----------------+---------------+-------+
| 3  | controller-1        | QEMU            | 192.168.206.4 | up    |
| 7  | controller-0        | QEMU            | 192.168.206.3 | up    |
+----+---------------------+-----------------+---------------+-------+
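
To bound the recovery time rather than waiting for the next scheduled call, a simple polling loop could be used. A minimal sketch, assuming the admin credentials are already sourced (e.g. the keystone_admin rc file), as in the sessions above:

  # Poll until "hypervisor list" succeeds again, then report how long it took.
  start=$(date +%s)
  until openstack hypervisor list >/dev/null 2>&1; do
      sleep 30
  done
  echo "recovered after $(( $(date +%s) - start )) seconds"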

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; requires further investigation.

This sounds similar to https://bugs.launchpad.net/starlingx/+bug/1830650
If the logs have the same signature, please mark this one as a duplicate.

Changed in starlingx:
importance: Undecided → Medium
status: Incomplete → Triaged
assignee: Peng Peng (ppeng) → Bart Wensley (bartwensley)
tags: added: stx.2.0
Ghada Khalil (gkhalil)
summary: - "Hypervisor list" cmd not working after force reboot standby controller
+ AIO: "Hypervisor list" cmd not working after force reboot standby
+ controller
summary: - AIO: "Hypervisor list" cmd not working after force reboot standby
- controller
+ AIO-DX: "Hypervisor list" cmd not working after force reboot standby
+ controller
Revision history for this message
Bart Wensley (bartwensley) wrote :

When controller-1 is rebooted at 15:01:52, the nova-api on controller-0 loses access to mariadb, which prevents any of the nova-related CLIs (e.g. server list) from running. For example:
{"log":"2019-05-21 15:02:18.491 1 ERROR oslo_db.sqlalchemy.exc_filters [req-2c25a5f7-22c7-4736-8c06-5d8c10e5c159 53347a9c6a7340f5be77da8cd6f2b3d8 efe5dd8fd94c428fa1eac07d31dc4184 - default default] DBAPIError exception wrapped from (pymysql.err.InternalError) (1047, u'WSREP has not yet prepared node for application use') [SQL: u'SELECT instance_mappings.created_at AS instance_mappings_created_at, instance_mappings.updated_at AS instance_mappings_updated_at, instance_mappings.id AS instance_mappings_id, instance_mappings.instance_uuid AS instance_mappings_instance_uuid, instance_mappings.cell_id AS instance_mappings_cell_id, instance_mappings.project_id AS instance_mappings_project_id, instance_mappings.user_id AS instance_mappings_user_id, instance_mappings.queued_for_delete AS instance_mappings_queued_for_delete, cell_mappings_1.created_at AS cell_mappings_1_created_at, cell_mappings_1.updated_at AS cell_mappings_1_updated_at, cell_mappings_1.id AS cell_mappings_1_id, cell_mappings_1.uuid AS cell_mappings_1_uuid, cell_mappings_1.name AS cell_mappings_1_name, cell_mappings_1.transport_url AS cell_mappings_1_transport_url, cell_mappings_1.database_connection AS cell_mappings_1_database_connection, cell_mappings_1.disabled AS cell_mappings_1_disabled \\nFROM instance_mappings LEFT OUTER JOIN cell_mappings AS cell_mappings_1 ON instance_mappings.cell_id = cell_mappings_1.id \\nWHERE instance_mappings.instance_uuid = %(instance_uuid_1)s \\n LIMIT %(param_1)s'] [parameters: {u'param_1': 1, u'instance_uuid_1': u'6e8a1aea-b316-4c5d-af48-5d32ffe0a443'}] (Background on this error at: http://sqlalche.me/e/2j85): InternalError: (1047, u'WSREP has not yet prepared node for application use')\n","stream":"stdout","time":"2019-05-21T15:02:18.492940446Z"}

mariadb is then running as a single node in the non-Primary state, so it is not functional:
{"log":"2019-05-21 15:01:58,543 - OpenStack-Helm Mariadb - INFO - 2019-05-21 15:01:58 139633481164544 [Note] WSREP: New cluster view: global state: 09649517-7ba1-11e9-8a1c-6e7cd8f18acb:42492, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3\n","stream":"stderr","time":"2019-05-21T15:01:58.543232866Z"}

There are other issues after that, but the trigger was the mariadb problem. I can see this log constantly repeating on both controllers:
2019-05-21T15:00:13.530 controller-0 OCF_dbmon(dbmon)[2903462]: info INFO: Unable to get mariadb password. Exiting.

This is due to a bug in the dbmon ocf script:
https://bugs.launchpad.net/starlingx/+bug/1826891

I am marking this LP as a duplicate of the dbmon issue (LP1826891). Once that issue is fixed, please re-test.

Peng Peng (ppeng)
tags: removed: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

The duplicate bug (LP1826891) was fixed on 2019-06-06:
https://review.opendev.org/663318

Marking as Fix Released

Changed in starlingx:
status: Triaged → Fix Released