Nova API call to get instance details returned <class 'oslo_db.exception.DBConnectionError'> (HTTP 500)

Bug #1587027 reported by Andrey Sledzinskiy
Affects              Status    Importance   Assigned to        Milestone
Fuel for OpenStack   Invalid   High         Bogdan Dobrelya
Mitaka               Invalid   High         Bogdan Dobrelya

Bug Description

fuel version - 9.0-mos-416

Steps:
1. Deploy the following cluster: Neutron VLAN, Ceph for volumes and images, Ceph replication factor 2; nodes: 2 controllers, 1 controller+ceph, 1 compute+ceph, 1 compute
2. Run the ha and sanity OSTF suites
3. Create an instance
4. Get the instance details from Nova (a reproduction sketch follows the results below)

Expected result - instance details are returned
Actual result - ClientException: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_db.exception.DBConnectionError'> (HTTP 500) (Request-ID: req-089cfd42-f094-4f8d-a77e-5fe4b9fffac5)
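
For reference, step 4 boils down to a single servers.get() call; a minimal reproduction sketch with python-novaclient is below (the auth URL, credentials, and instance UUID are placeholders, not values taken from this env):

    from keystoneauth1.identity import v3
    from keystoneauth1 import session
    from novaclient import client
    from novaclient import exceptions as nova_exc

    # Placeholder credentials; substitute the real keystone endpoint,
    # project and user for the env before running.
    auth = v3.Password(auth_url='http://<management-vip>:5000/v3',
                       username='admin', password='admin',
                       project_name='admin',
                       user_domain_id='default', project_domain_id='default')
    nova = client.Client('2.1', session=session.Session(auth=auth))

    try:
        server = nova.servers.get('<instance-uuid>')  # step 4
        print(server.status)
    except nova_exc.ClientException as exc:
        # With the DB connection broken this surfaces as the
        # "Unexpected API Error ... DBConnectionError (HTTP 500)" above.
        print(exc)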

rabbitmq.log shows:
2016-05-29T23:50:32.748925+00:00 notice: =ERROR REPORT==== 29-May-2016::23:50:26 ===
2016-05-29T23:50:32.748925+00:00 notice: Mnesia('rabbit@messaging-node-1'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@messaging-node-2'}
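
The inconsistent_database event is what Mnesia logs when a RabbitMQ node discovers it has been running on one side of a network partition. One way to check whether any node still considers the cluster partitioned is the management HTTP API; a sketch below, assuming the management plugin is enabled on the default port 15672 with default guest credentials (adjust for this env):

    import requests

    # Assumed: management plugin on 15672, guest/guest credentials.
    resp = requests.get('http://messaging-node-1:15672/api/nodes',
                        auth=('guest', 'guest'), timeout=10)
    resp.raise_for_status()
    for node in resp.json():
        # A non-empty 'partitions' list means this node still sees
        # the cluster as partitioned.
        if node.get('partitions'):
            print('%s sees partitions: %s'
                  % (node['name'], node['partitions']))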

Ilya Kutukov (ikutukov)
Changed in fuel:
status: New → Confirmed
Dmitry Pyzhov (dpyzhov)
tags: added: area-library
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Georgy Kibardin (gkibardin)
Changed in fuel:
assignee: Georgy Kibardin (gkibardin) → Bogdan Dobrelya (bogdando)
Bogdan Dobrelya (bogdando) wrote:

There are multiple "Read error" messages from all nodes right around the "Get instance details from nova" operation; see the sshd logs. This points to networking issues specific to the env. Those caused a temporary split-brain of the management VIP, which healed later:

2016-05-29T23:50:28.148935+00:00 node-1 pengine err: error: Resource vip__management (ocf::ns_IPaddr2) is active on 2 nodes attempting recovery

This makes me think the bug is invalid. Here is the event flow right after that anyway: http://pastebin.com/MPvgG85x. The network disruption disturbed the Galera nodes (line 30), some DB ops failed for Neutron and Nova (as reported in the bug), and then a reply_q was lost after the MQ cluster failover, which made neutron-server retry forever (why not recreate it?)... A typical failover, recovered later. I see no other issues.
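
As a side note on why the failed DB ops surfaced as HTTP 500: oslo.db transparently retries transient disconnects only where the DB-facing function is wrapped with wrap_db_retry; a code path without such a wrapper (or one that exhausts its retries while Galera is disturbed) propagates DBConnectionError straight up to the API layer. A minimal sketch, with a hypothetical accessor rather than Nova's actual code path:

    from oslo_db import api as oslo_db_api
    from oslo_db import exception as db_exc

    @oslo_db_api.wrap_db_retry(max_retries=5, retry_on_disconnect=True,
                               inc_retry_interval=True)
    def get_instance_details(context, instance_uuid):
        # Hypothetical accessor, not Nova's real code path; imagine a
        # SQLAlchemy query here. A transient DBConnectionError raised
        # inside is retried with growing intervals instead of bubbling
        # up to the REST layer as an HTTP 500.
        raise db_exc.DBConnectionError()  # simulate the broken connection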

Changed in fuel:
status: Confirmed → Invalid
tags: added: swarm-fail