Ephemeral storage removal fails with message rbd remove failed

Bug #1856845 reported by Sasha Andonov
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Sasha Andonov

Bug Description

Description
===========
After destroying instances, ephemeral storage removal intermittently fails with the following message:

2019-10-17 11:21:08.122 398018 INFO nova.virt.libvirt.driver [-] [instance: 87096add-348e-4c94-8f31-066346e32eef] Instance destroyed successfully.
2019-10-17 11:21:14.619 398018 WARNING nova.virt.libvirt.storage.rbd_utils [-] rbd remove 87096add-348e-4c94-8f31-066346e32eef_disk in pool rbd_pool failed

Ceph logs report a lossy connection error:
2019-10-17 11:21:06.181233 7fbbdf2f4700 0 -- 10.248.83.92:6808/20526 submit_message osd_op_reply(192922 rbd_data.77c63845d27cdd.0000000000004728 [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 1273856~262144] v1504399'62984460 uv62984460 ack = 0) v7 remote, 10.248.54.216:0/2391175308, failed lossy con, dropping message 0x56545f021e40

The dropped connection can leave a stale watcher registered on the image; Ceph only expires watchers after its 30-second watcher timeout, and "rbd remove" fails while a watcher is still present.

Steps to reproduce
==================
- Deploy Nova with Ceph RBD as the ephemeral storage backend (a minimal example config follows this list)
- Create an instance
- Destroy the instance
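
For reference, a minimal nova.conf [libvirt] section enabling the RBD ephemeral backend might look like the following sketch; the pool name rbd_pool is taken from the log in this report, while the user and secret UUID are illustrative placeholders:

    [libvirt]
    images_type = rbd
    images_rbd_pool = rbd_pool
    images_rbd_ceph_conf = /etc/ceph/ceph.conf
    rbd_user = nova
    rbd_secret_uuid = 00000000-0000-0000-0000-000000000000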

Expected result
===============
The instance is destroyed and its ephemeral storage is always removed from the Ceph pool.

Actual result
=============
The instance is destroyed, but its ephemeral storage sometimes remains in the Ceph pool.
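
Orphaned disks can be confirmed directly in the pool. The following is a minimal sketch using the python-rados and python-rbd bindings; the pool name rbd_pool is taken from the log above, and the _disk suffix matches Nova's naming for ephemeral disks:

    import rados
    import rbd

    # Connect with the cluster's own config; adjust conffile as needed.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('rbd_pool')
        try:
            # Any <instance uuid>_disk image whose instance was already
            # destroyed is a leftover ephemeral volume.
            print([n for n in rbd.RBD().list(ioctx) if n.endswith('_disk')])
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()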

Tags: ceph
Revision history for this message
Matt Riedemann (mriedem) wrote :

Which releases of nova and ceph are being used?

Revision history for this message
Sasha Andonov (sandonov) wrote :

Reproduced on openstack-nova Newton and Ceph 10.2.11 (Jewel).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/705764

Changed in nova:
assignee: nobody → Sasha Andonov (sandonov)
status: New → In Progress

melanie witt (melwitt)
tags: added: ceph

melanie witt (melwitt)
Changed in nova:
importance: Undecided → Medium
Changed in nova:
assignee: Sasha Andonov (sandonov) → melanie witt (melwitt)

melanie witt (melwitt)
Changed in nova:
assignee: melanie witt (melwitt) → Sasha Andonov (sandonov)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/705764
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6458c3dba53b9a9fb903bdb6e5e08af14ad015d6
Submitter: Zuul
Branch: master

commit 6458c3dba53b9a9fb903bdb6e5e08af14ad015d6
Author: Sasha Andonov <email address hidden>
Date: Tue Feb 4 16:59:14 2020 +0100

    rbd_utils: increase _destroy_volume timeout

    If the RBD backend is used for Nova ephemeral storage, Nova tries to
    remove the ephemeral storage volume from Ceph in a retry loop: 10
    attempts at 1-second intervals, 10 seconds in total. Because Ceph only
    expires a stale watcher after its 30-second watcher timeout, this
    window is too short and can result in intermittent volume removal
    failures on the Ceph side.

    This patch adds the params rbd_destroy_volume_retries, defaulting to
    12, and rbd_destroy_volume_retry_interval, defaulting to 5, which
    multiplied give Ceph a reasonable amount of time (60 seconds) to
    complete the operation successfully.

    Closes-Bug: #1856845
    Change-Id: Icfd55617f0126f79d9610f8a2fc6b4c817d1a2bd
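
The merged change makes the retry window configurable. As a rough illustration of the resulting behavior, here is a minimal, self-contained sketch of such a retry loop; it only approximates nova's _destroy_volume logic (nova drives the loop through oslo looping-call machinery) and assumes the python-rbd bindings:

    import time

    import rbd  # python-rbd bindings

    def destroy_volume(ioctx, name, retries=12, interval=5):
        """Remove an RBD image, retrying while Ceph may still hold a watcher.

        retries * interval (12 * 5 = 60 s) comfortably exceeds the
        30-second Ceph watcher timeout, so a stale watcher left by a
        lossy connection expires before the attempts run out.
        """
        for attempt in range(retries + 1):
            try:
                rbd.RBD().remove(ioctx, name)
                return
            except rbd.ImageBusy:
                # ImageBusy usually means a watcher is still registered
                # on the image; wait for Ceph to expire it and retry.
                if attempt == retries:
                    raise
                time.sleep(interval)

In deployments with the fix, the equivalent knobs are the new [libvirt] options rbd_destroy_volume_retries and rbd_destroy_volume_retry_interval.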

Changed in nova:
status: In Progress → Fix Released