Cinder volumes management fails even though Cinder nodes are available if amqp is restarted

Bug #1385240 reported by Giulio Fidente
This bug affects 2 people
Affects         Status         Importance  Assigned to       Milestone
Cinder          Fix Released   High        Ivan Kolodyazhny
oslo.messaging  Fix Released   Undecided   Unassigned
tripleo         Fix Released   High        Giulio Fidente

Bug Description

Volume operations, e.g. create/delete, may remain stuck in the 'scheduling' state even though other Cinder nodes are up and running.

The problem is caused by cinder-scheduler being disconnected from the RabbitMQ cluster without noticing; as a result, it is unable to receive updates from the API.

The disconnection may happen, for example, after a RabbitMQ node is reconfigured, when the VIP moves to a different node in a load-balanced RabbitMQ setup, or even _during_ TripleO overcloud deployment because of RabbitMQ cluster configuration changes.

This was observed using Kombu 3.0.33 as well as 2.5.

Using more aggressive (lower) kernel keepalive probe intervals seems to improve reliability, but a more appropriate fix seems to be heartbeat support in oslo.messaging.
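To illustrate why lower keepalive timings help, here is a minimal sketch of the arithmetic. On Linux, an idle connection with keepalive enabled is probed after `tcp_keepalive_time` seconds, then every `tcp_keepalive_intvl` seconds, and is dropped after `tcp_keepalive_probes` unanswered probes. The function name and the "aggressive" values are hypothetical and chosen only for illustration; they are not taken from this report.

```python
def keepalive_detection_seconds(idle, interval, probes):
    """Approximate seconds before the kernel gives up on a silent, idle peer.

    idle     -> net.ipv4.tcp_keepalive_time
    interval -> net.ipv4.tcp_keepalive_intvl
    probes   -> net.ipv4.tcp_keepalive_probes
    """
    return idle + interval * probes


# Stock Linux defaults: 7200 s idle, 75 s between probes, 9 probes.
default_delay = keepalive_detection_seconds(7200, 75, 9)

# Hypothetical aggressive values, for illustration only.
aggressive_delay = keepalive_detection_seconds(5, 1, 5)

print(default_delay)     # 7875 seconds, i.e. more than two hours
print(aggressive_delay)  # 10 seconds
```

These timings only apply to sockets that have `SO_KEEPALIVE` enabled, and only while the connection is idle, which is part of why an application-level heartbeat is the more robust fix.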

Changed in tripleo:
importance: Undecided → High
description: updated
Ben Nemec (bnemec)
Changed in tripleo:
status: New → Triaged
Mike Perez (thingee)
Changed in cinder:
status: New → Triaged
importance: Undecided → High
Mehdi Abaakouk (sileht) wrote :

Hi,

Can you test with oslo.messaging 1.5.0 to see whether the issue still exists? Timeout and reconnection handling have received many fixes, and I think this one should not occur anymore.

Cheers

Changed in oslo.messaging:
status: New → Incomplete
Giulio Fidente (gfidente) wrote :

Unfortunately this seems to persist; it looks like there are patches still under review for a fix.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-incubator (master)

Fix proposed to branch: master
Review: https://review.openstack.org/142524

Changed in tripleo:
assignee: nobody → Giulio Fidente (gfidente)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/142527

Mike Perez (thingee) wrote :

Since there is a fix being done in oslo.messaging, should we not target Cinder on this?

Giulio Fidente (gfidente) wrote :

Hi Mike, agreed, Cinder can't do much about this; my purpose was more to track progress.

this is essentially just one of the consequences of https://bugs.launchpad.net/oslo.messaging/+bug/856764 and the problem indeed disappears when using the workarounds suggested there

I'll let you decide what is best to do with the bug

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-incubator (master)

Reviewed: https://review.openstack.org/142524
Committed: https://git.openstack.org/cgit/openstack/tripleo-incubator/commit/?id=bdfde7186ba77c66bfa4b492fe7847d54d008a29
Submitter: Jenkins
Branch: master

commit bdfde7186ba77c66bfa4b492fe7847d54d008a29
Author: Giulio Fidente <email address hidden>
Date: Wed Dec 17 18:55:54 2014 +0100

    Add sysctl to all images

    This is needed to customize the default kernel keepalive timings, see
    https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/19

    Change-Id: I9a8ace4712c8a3b71f63b0bafe381e3bc6c707da
    Partial-Bug: 1301431
    Partial-Bug: 1385240
    Partial-Bug: 1385234

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/142527
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=2f7f4ed50c53e25041bf29d317c5f3358e46e706
Submitter: Jenkins
Branch: master

commit 2f7f4ed50c53e25041bf29d317c5f3358e46e706
Author: Giulio Fidente <email address hidden>
Date: Wed Dec 17 19:06:28 2014 +0100

    Set more aggressive keepalive timings

    We want to customize the default kernel keepalive timings and
    make them more aggressive to work around lack of heartbeat support
    in the Oslo RabbitMQ client, see:

    https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/19
    and
    https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/70

    Change-Id: Ieac08f595086acb8dd336e33efc705ee0b8a3a87
    Closes-Bug: 1301431
    Closes-Bug: 1385240
    Closes-Bug: 1385234
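The commit above tunes the kernel-wide keepalive defaults through sysctl. For comparison only, the same three timings can also be set on an individual socket. The sketch below is not what the TripleO change does, and nothing here is taken from oslo.messaging or Kombu; the helper name `enable_keepalive` and the timing values are hypothetical.

```python
import socket


def enable_keepalive(sock, idle=5, interval=1, probes=5):
    """Turn on TCP keepalive for one socket and, where supported, tune it.

    The default values are hypothetical and chosen only for illustration.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning constants are platform-dependent, so each one
    # is applied only if this Python build exposes it.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print(keepalive_on)
```

A per-socket setting requires cooperation from the client library, whereas the sysctl approach changes the defaults for every keepalive-enabled socket on the host without touching application code.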

Changed in tripleo:
status: In Progress → Fix Committed
Derek Higgins (derekh)
Changed in tripleo:
status: Fix Committed → Fix Released
Ivan Kolodyazhny (e0ne)
Changed in cinder:
assignee: nobody → Ivan Kolodyazhny (e0ne)
Michal Dulko (michal-dulko-f) wrote :

This should be fixed in Kilo by https://bugs.launchpad.net/cinder/+bug/1409012.

Changed in cinder:
status: Triaged → Fix Released
Ben Nemec (bnemec)
Changed in oslo.messaging:
status: Incomplete → Fix Released