Cells: Broadcast call messages fail if a child cell goes down

Bug #1312468 reported by Sam Morrison
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

If a child cell stops functioning, we still include it when we send down broadcast messages that require a response.
This causes calls such as listing hosts and hypervisor-stats to fail if one of your compute cells is down.

We know when a cell is mute, so we shouldn't send messages to it and expect replies while it's in that state.

Tags: cells
Sam Morrison (sorrison) wrote :
Changed in nova:
assignee: nobody → Sam Morrison (sorrison)
status: New → In Progress
Chris Behrens (cbehrens) wrote :

But I think you want it to fail. You really should not return a partial response from the API. It should be all or nothing. (Or, there needs to be some way to say "Hey, here's what I know, but there's some data missing because cell 'x' is mute".)
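A minimal sketch of the parenthetical option above (return what is known and name the cells whose data is missing). All names here (`BroadcastResult`, `broadcast_call`, `is_mute`, `call`) are hypothetical and are not taken from the nova cells code.

```python
from dataclasses import dataclass, field


@dataclass
class BroadcastResult:
    """Replies gathered from child cells, plus the cells that were skipped."""
    responses: dict = field(default_factory=dict)
    missing_cells: list = field(default_factory=list)

    @property
    def complete(self):
        # True only when every cell contributed a reply.
        return not self.missing_cells


def broadcast_call(cells, is_mute, call):
    """Send `call` to every non-mute cell and record the mute ones.

    cells   -- iterable of cell names (hypothetical)
    is_mute -- callable(cell_name) -> bool (hypothetical)
    call    -- callable(cell_name) -> reply (hypothetical)
    """
    result = BroadcastResult()
    for name in cells:
        if is_mute(name):
            # Do not wait on a cell that is known to be mute; report it instead.
            result.missing_cells.append(name)
        else:
            result.responses[name] = call(name)
    return result
```

A caller that needs all-or-nothing behaviour can check `result.complete` and fail; a caller that tolerates gaps can pass `result.missing_cells` on to the user.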

Sam Morrison (sorrison) wrote :

OK, I guess we could raise a CellTimeout straight away if a cell is mute instead of waiting for it to time out (default 60s).

I'm still a bit torn. I agree with you, but it would be nice if one cell going down didn't cause things to fail when information from that cell might not be wanted.

Being able to take a cell down for maintenance would be great; currently it's not possible.
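A rough sketch of that fail-fast idea, assuming the set of mute cells can be checked before anything is sent. The `CellTimeout` class below is a local stand-in for the exception mentioned above, and `broadcast_call_or_fail`, `is_mute`, and `call` are hypothetical names, not nova APIs.

```python
class CellTimeout(Exception):
    """Local stand-in for the timeout exception; defined here for illustration."""


def broadcast_call_or_fail(cells, is_mute, call):
    """Raise immediately if any cell is mute, otherwise collect every reply.

    cells   -- iterable of cell names (hypothetical)
    is_mute -- callable(cell_name) -> bool (hypothetical)
    call    -- callable(cell_name) -> reply (hypothetical)
    """
    cells = list(cells)
    mute = [name for name in cells if is_mute(name)]
    if mute:
        # Fail up front rather than waiting out the reply timeout.
        raise CellTimeout("no reply expected from mute cell(s): " + ", ".join(mute))
    return {name: call(name) for name in cells}
```

The response stays all-or-nothing; only the wait is removed.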

Stephen Gordon (sgordon)
tags: added: cells
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Sam Morrison (<email address hidden>) on branch: master
Review: https://review.openstack.org/90589
Reason: Not the right way to do this. Need another approach

Sean Dague (sdague) wrote :

Low priority as cells remains experimental in the codebase

Changed in nova:
assignee: Sam Morrison (sorrison) → nobody
status: In Progress → Confirmed
importance: Undecided → Low
Markus Zoeller (markus_z) (mzoeller) wrote : Cleanup EOL bug report

This is an automated cleanup. This bug report has been closed because it
is older than 18 months and there is no open code change to fix this.
After this time it is unlikely that the circumstances which led to
the observed issue can be reproduced.

If you can reproduce the bug, please:
* reopen the bug report (set to status "New")
* AND add the detailed steps to reproduce the issue (if applicable)
* AND leave a comment "CONFIRMED FOR: <RELEASE_NAME>"
  Only still supported release names are valid (LIBERTY, MITAKA, OCATA, NEWTON).
  Valid example: CONFIRMED FOR: LIBERTY

Changed in nova:
importance: Low → Undecided
status: Confirmed → Expired