Nova waits indefinitely on ceph client hangs due to network problems
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Compute (nova) |
Fix Released
|
Low
|
Lee Yarwood |
Bug Description
Description
===========
Requested to be filed by sean-k-mooney as "not a ceph problem".
During what looks like the update_
This freezes up nova's queue and enough sequential failures will eventually shows up with a symptom of "too many missed heartbeats" rabbitmq error, which interrupts and restarts the cycle over again.
As suggested by Sean, it might be best to put a configurable timeout on ceph calls during this process to ensure nova doesnt lock up/flap, and ceph backend network issues are reported for debug.
Steps to reproduce
==================
1. introduce a silent failure of ceph client, oneway packet loss via mismatched LACP MTU across switches, bad triangular routing, flapping links, etc.
2. observe symptom of nova hanging long enough to miss 60 seconds of rabbitmq heartbeats, debug hanging on update_
Expected result
===============
nova alerting of ceph connection timeout
Actual result
=============
nova hangs for 60 seconds, while being in "up" state, flapping for a couple seconds every 60 seconds as it hits the rabbitmq error and reconnects, but is in non-functional state and ignores all instructions on the messagebus.
Environment
===========
nova==18.1.0
rocky
Logs & Configs
==============
No direct logs other than rabbitmq's complaints of timeouts as a symptom.
tags: | added: ceph libvirt resource-tracker |
Changed in nova: | |
status: | New → Confirmed |
tags: | added: serviceability |
Changed in nova: | |
importance: | Undecided → Low |
Fix proposed to branch: master /review. opendev. org/667421
Review: https:/