Nova API call to get instance details returned <class 'oslo_db.exception.DBConnectionError'> (HTTP 500)

Bug #1587027 reported by Andrey Sledzinskiy
Affects              Status    Importance   Assigned to        Milestone
Fuel for OpenStack   Invalid   High         Bogdan Dobrelya
Mitaka               Invalid   High         Bogdan Dobrelya

Bug Description

fuel version - 9.0-mos-416

Steps:
1. Deploy the following cluster: Neutron VLAN, Ceph for volumes and images, Ceph replication factor 2; nodes: 2 controllers, 1 controller+ceph, 1 compute+ceph, 1 compute
2. Run the ha and sanity OSTF suites
3. Create an instance
4. Get the instance details from Nova (a reproduction sketch follows the results below)

Expected result - instance details are returned
Actual result - ClientException: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_db.exception.DBConnectionError'> (HTTP 500) (Request-ID: req-089cfd42-f094-4f8d-a77e-5fe4b9fffac5)
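
For reference, step 4 boils down to a single servers.get() call; a minimal reproduction sketch with python-novaclient is below (the auth URL, credentials, and instance UUID are placeholders, not values taken from this env):

    from keystoneauth1.identity import v3
    from keystoneauth1 import session
    from novaclient import client
    from novaclient import exceptions as nova_exc

    # Placeholder credentials; substitute the real keystone endpoint,
    # project and user for the env before running.
    auth = v3.Password(auth_url='http://<management-vip>:5000/v3',
                       username='admin', password='admin',
                       project_name='admin',
                       user_domain_id='default', project_domain_id='default')
    nova = client.Client('2.1', session=session.Session(auth=auth))

    try:
        server = nova.servers.get('<instance-uuid>')  # step 4
        print(server.status)
    except nova_exc.ClientException as exc:
        # With the DB connection broken this surfaces as the
        # "Unexpected API Error ... DBConnectionError (HTTP 500)" above.
        print(exc)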

rabbitmq.log shows:
2016-05-29T23:50:32.748925+00:00 notice: =ERROR REPORT==== 29-May-2016::23:50:26 ===
2016-05-29T23:50:32.748925+00:00 notice: Mnesia('rabbit@messaging-node-1'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@messaging-node-2'}
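
The inconsistent_database event is what Mnesia logs when a RabbitMQ node discovers it has been running on one side of a network partition. One way to check whether any node still considers the cluster partitioned is the management HTTP API; a sketch below, assuming the management plugin is enabled on the default port 15672 with default guest credentials (adjust for this env):

    import requests

    # Assumed: management plugin on 15672, guest/guest credentials.
    resp = requests.get('http://messaging-node-1:15672/api/nodes',
                        auth=('guest', 'guest'), timeout=10)
    resp.raise_for_status()
    for node in resp.json():
        # A non-empty 'partitions' list means this node still sees
        # the cluster as partitioned.
        if node.get('partitions'):
            print('%s sees partitions: %s'
                  % (node['name'], node['partitions']))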

Ilya Kutukov (ikutukov)
Changed in fuel:
status: New → Confirmed
Dmitry Pyzhov (dpyzhov)
tags: added: area-library
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Georgy Kibardin (gkibardin)
Changed in fuel:
assignee: Georgy Kibardin (gkibardin) → Bogdan Dobrelya (bogdando)
Bogdan Dobrelya (bogdando) wrote:

There are multiple "Read error" messages from all nodes right around the "Get instance details from nova" operation; see the sshd logs. This points to networking issues specific to the env. Those caused a temporary split-brain of the management VIP, which healed later:

2016-05-29T23:50:28.148935+00:00 node-1 pengine err: error: Resource vip__management (ocf::ns_IPaddr2) is active on 2 nodes attempting recovery

This makes me think the bug is invalid. Here is the event flow right after that anyway: http://pastebin.com/MPvgG85x. The network disruption disturbed the Galera nodes (line 30), some DB ops failed for Neutron and Nova (as reported in the bug), and then a reply_q was lost after the MQ cluster failover, which made neutron-server retry forever (why not recreate it?)... A typical failover, recovered later. I see no other issues.
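
As a side note on why the failed DB ops surfaced as HTTP 500: oslo.db transparently retries transient disconnects only where the DB-facing function is wrapped with wrap_db_retry; a code path without such a wrapper (or one that exhausts its retries while Galera is disturbed) propagates DBConnectionError straight up to the API layer. A minimal sketch, with a hypothetical accessor rather than Nova's actual code path:

    from oslo_db import api as oslo_db_api
    from oslo_db import exception as db_exc

    @oslo_db_api.wrap_db_retry(max_retries=5, retry_on_disconnect=True,
                               inc_retry_interval=True)
    def get_instance_details(context, instance_uuid):
        # Hypothetical accessor, not Nova's real code path; imagine a
        # SQLAlchemy query here. A transient DBConnectionError raised
        # inside is retried with growing intervals instead of bubbling
        # up to the REST layer as an HTTP 500.
        raise db_exc.DBConnectionError()  # simulate the broken connection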

Changed in fuel:
status: Confirmed → Invalid
tags: added: swarm-fail