Cinder volumes management fails even though Cinder nodes are available if amqp is restarted

Bug #1385240 reported by Giulio Fidente on 2014-10-24

This bug affects 2 people

Affects	Status	Importance	Assigned to
Cinder	Fix Released	High	Ivan Kolodyazhny
oslo.messaging	Fix Released	Undecided	Unassigned
tripleo	Fix Released	High	Giulio Fidente

Bug Description

Volume operations, e.g. create/delete, may remain stuck in the 'scheduling' state even though other Cinder nodes are up and running.

The problem is that cinder-scheduler gets disconnected from the rabbit cluster without noticing and, as a result, is unable to receive updates from the API.

The disconnection may happen following, for example, a reconfiguration of a rabbit node, the VIP moving to a different node when rabbit is load balanced, or even _during_ a tripleo overcloud deployment due to rabbit cluster configuration changes.

This was observed using Kombu 3.0.33 as well as 2.5.

Using more aggressive (lower) kernel keepalive probe intervals seems to improve reliability, but a more appropriate fix seems to be heartbeat support in oslo.messaging.
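
To illustrate why lower keepalive timings help, here is a rough sketch of the arithmetic. It assumes the connection has SO_KEEPALIVE enabled and is otherwise idle; the function name and the "aggressive" values are made up for illustration and are not taken from any of the patches.

```python
def keepalive_detection_seconds(idle: int, interval: int, probes: int) -> int:
    """Approximate worst-case seconds before the kernel drops an idle, dead TCP peer.

    idle     -> net.ipv4.tcp_keepalive_time   (idle time before the first probe)
    interval -> net.ipv4.tcp_keepalive_intvl  (gap between unanswered probes)
    probes   -> net.ipv4.tcp_keepalive_probes (unanswered probes before giving up)
    """
    return idle + interval * probes


# Stock Linux defaults: 7200 s idle, 75 s between probes, 9 probes.
LINUX_DEFAULTS = keepalive_detection_seconds(7200, 75, 9)

# Hypothetical "aggressive" values, chosen only for illustration.
AGGRESSIVE = keepalive_detection_seconds(5, 1, 5)

print(LINUX_DEFAULTS)  # 7875
print(AGGRESSIVE)      # 10
```

With the defaults, a silently dropped connection can go unnoticed for over two hours, which would match a scheduler that keeps waiting on a dead socket.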

Giulio Fidente (gfidente) on 2014-10-24

Changed in tripleo:
importance:	Undecided → High
description:	updated

Giulio Fidente (gfidente) wrote on 2014-10-24:

related to https://bugs.launchpad.net/oslo.messaging/+bug/856764

Ben Nemec (bnemec) on 2014-10-24

Changed in tripleo:
status:	New → Triaged

Mike Perez (thingee) on 2014-10-26

Changed in cinder:
status:	New → Triaged
importance:	Undecided → High

Mehdi Abaakouk (sileht) wrote on 2014-12-03:

Hi,

Can you test with oslo.messaging 1.5.0 to see if the issue still exists? Timeout and reconnection handling have received many fixes, and I think this one should not occur anymore.

Cheers

Changed in oslo.messaging:
status:	New → Incomplete

Giulio Fidente (gfidente) wrote on 2014-12-05:

Unfortunately this seems to persist; it looks like there are patches still in review for a fix.

Giulio Fidente (gfidente) wrote on 2014-12-05:

see https://review.openstack.org/#/c/126330/

OpenStack Infra (hudson-openstack) wrote on 2014-12-17: Fix proposed to tripleo-incubator (master)

Fix proposed to branch: master
Review: https://review.openstack.org/142524

Changed in tripleo:
assignee:	nobody → Giulio Fidente (gfidente)
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2014-12-17: Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/142527

Mike Perez (thingee) wrote on 2014-12-17:

Since there is a fix being done in oslo.messaging... should we not target Cinder on this?

Giulio Fidente (gfidente) wrote on 2014-12-18:

Hi Mike, agreed, Cinder can't do much about this; my purpose was more to track progress.

this is essentially just one of the consequences of https://bugs.launchpad.net/oslo.messaging/+bug/856764 and the problem indeed disappears when using the workarounds suggested there

I'll let you decide what is best to do with the bug

OpenStack Infra (hudson-openstack) wrote on 2014-12-18: Fix merged to tripleo-incubator (master)

Reviewed: https://review.openstack.org/142524
Committed: https://git.openstack.org/cgit/openstack/tripleo-incubator/commit/?id=bdfde7186ba77c66bfa4b492fe7847d54d008a29
Submitter: Jenkins
Branch: master

commit bdfde7186ba77c66bfa4b492fe7847d54d008a29
Author: Giulio Fidente <email address hidden>
Date: Wed Dec 17 18:55:54 2014 +0100

    Add sysctl to all images

    This is needed to customize the default kernel keepalive timings, see
    https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/19

    Change-Id: I9a8ace4712c8a3b71f63b0bafe381e3bc6c707da
    Partial-Bug: 1301431
    Partial-Bug: 1385240
    Partial-Bug: 1385234
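
To check which keepalive timings a node is actually running with, something like the following could be used. This is only a sketch: it assumes a Linux host exposing the standard `/proc/sys/net/ipv4/tcp_keepalive_*` files, and the helper name `read_keepalive_settings` is hypothetical, not part of the tripleo change.

```python
from pathlib import Path

# Standard Linux keepalive sysctls under net.ipv4.
KEEPALIVE_KEYS = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")


def read_keepalive_settings(root="/proc/sys/net/ipv4"):
    """Return the kernel keepalive settings found under `root` as a dict of ints.

    Keys whose file is missing (for example on a non-Linux host) are skipped.
    """
    settings = {}
    for key in KEEPALIVE_KEYS:
        path = Path(root) / key
        if path.is_file():
            settings[key] = int(path.read_text().strip())
    return settings


if __name__ == "__main__":
    for key, value in read_keepalive_settings().items():
        print(f"net.ipv4.{key} = {value}")
```

The `root` parameter exists only so the function can be pointed at a test directory instead of the live `/proc` tree.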

OpenStack Infra (hudson-openstack) wrote on 2014-12-18: Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/142527
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=2f7f4ed50c53e25041bf29d317c5f3358e46e706
Submitter: Jenkins
Branch: master

commit 2f7f4ed50c53e25041bf29d317c5f3358e46e706
Author: Giulio Fidente <email address hidden>
Date: Wed Dec 17 19:06:28 2014 +0100

    Set more aggressive keepalive timings

    We want to customize the default kernel keepalive timings and
    make them more aggressive to workaround lack of hearbeat support
    in the Oslo RabbitMQ client, see:

    https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/19
    and
    https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/70

    Change-Id: Ieac08f595086acb8dd336e33efc705ee0b8a3a87
    Closes-Bug: 1301431
    Closes-Bug: 1385240
    Closes-Bug: 1385234
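
For comparison with the kernel-level workaround, application-level heartbeat detection can be sketched roughly as below. This is not the oslo.messaging implementation. The class name and the rule "treat the peer as dead after two silent intervals" are assumptions for illustration, modelled on the usual AMQP heartbeat convention.

```python
import time


class HeartbeatMonitor:
    """Tracks the last time any traffic was seen from the peer."""

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval   # heartbeat interval in seconds
        self._clock = clock        # injectable so tests can fake time
        self._last_seen = clock()

    def traffic_received(self):
        """Call on every inbound frame, heartbeat or otherwise."""
        self._last_seen = self._clock()

    def peer_is_dead(self):
        """True once two full intervals have passed with no inbound traffic."""
        return self._clock() - self._last_seen >= 2 * self.interval


if __name__ == "__main__":
    fake_now = [0.0]
    monitor = HeartbeatMonitor(interval=30, clock=lambda: fake_now[0])
    fake_now[0] = 61.0  # simulate 61 s of silence from the broker
    if monitor.peer_is_dead():
        print("no traffic for two intervals: reconnect")
```

Unlike kernel keepalives, a check like this works per connection and does not depend on host-wide sysctl tuning.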

Changed in tripleo:
status:	In Progress → Fix Committed

Derek Higgins (derekh) on 2014-12-24

Changed in tripleo:
status:	Fix Committed → Fix Released

Ivan Kolodyazhny (e0ne) on 2015-03-26

Changed in cinder:
assignee:	nobody → Ivan Kolodyazhny (e0ne)

Michal Dulko (michal-dulko-f) wrote on 2015-06-18:

This should be fixed in Kilo by https://bugs.launchpad.net/cinder/+bug/1409012.

Changed in cinder:
status:	Triaged → Fix Released

Ben Nemec (bnemec) on 2018-12-04

Changed in oslo.messaging:
status:	Incomplete → Fix Released
