Cinder volumes management fails even though Cinder nodes are available if amqp is restarted
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Cinder | Fix Released | High | Ivan Kolodyazhny |
oslo.messaging | Fix Released | Undecided | Unassigned |
tripleo | Fix Released | High | Giulio Fidente |
Bug Description
Volume operations, e.g. create/delete, may remain stuck in the 'scheduling' state even though other Cinder nodes are up and running.
The problem is that cinder-scheduler gets disconnected from the rabbit cluster without noticing and, as a result, is unable to receive updates from the API.
The disconnection may happen, for example, after a rabbit node is reconfigured, when the VIP moves to a different node while rabbit is load balanced, or even _during_ tripleo overcloud deployment because of rabbit cluster configuration changes.
This was observed using Kombu 3.0.33 as well as 2.5.
Using aggressive (low) kernel keepalive probe intervals seems to improve reliability, but a more appropriate fix seems to be heartbeat support in oslo.messaging.
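For illustration only, the keepalive workaround can be sketched at the socket level in Python. This is not the change that fixed the bug, and not code from Cinder, Kombu, or oslo.messaging: the helper name `enable_aggressive_keepalive` and the timing values are hypothetical. The system-wide kernel keepalive settings are only defaults for sockets that have `SO_KEEPALIVE` enabled; the per-socket options used here are the Linux equivalents and may be missing on other platforms, so each one is guarded.

```python
import socket


def enable_aggressive_keepalive(sock, idle=5, interval=1, count=5):
    """Enable TCP keepalive on `sock` with short probe timings.

    Hypothetical helper; the default values are illustrative, not
    recommendations taken from this bug report.
    """
    # Without SO_KEEPALIVE the kernel never sends probes on this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {"SO_KEEPALIVE": 1}

    # Linux option names; skipped silently where the platform lacks them.
    timings = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before the connection is dropped
    )
    for name, value in timings:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied[name] = value
    return applied


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(enable_aggressive_keepalive(s))
    s.close()
```

With timings like these a dead peer would be noticed after roughly `idle + interval * count` seconds, but only at the TCP level. A broker that is reachable yet has lost the consumer's state would go undetected, which is why an application-level heartbeat is the more appropriate fix.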
Changed in tripleo: | |
importance: | Undecided → High |
description: | updated |
description: | updated |
description: | updated |
Changed in tripleo: | |
status: | New → Triaged |
Changed in cinder: | |
status: | New → Triaged |
importance: | Undecided → High |
Changed in tripleo: | |
status: | Fix Committed → Fix Released |
Changed in cinder: | |
assignee: | nobody → Ivan Kolodyazhny (e0ne) |
Changed in oslo.messaging: | |
status: | Incomplete → Fix Released |
Related to https://bugs.launchpad.net/oslo.messaging/+bug/856764