Activity log for bug #1888895

Date Who What changed Old value New value Message
2020-07-24 19:40:49 melanie witt bug added bug
2020-07-24 19:41:29 melanie witt bug task added devstack-plugin-ceph
2020-07-24 19:42:58 melanie witt devstack-plugin-ceph: status New In Progress
2020-07-24 19:59:43 melanie witt description updated: added the 14.2.10 "resized block file to 100 GiB" log excerpt after reference [5] (the full old and new texts are otherwise duplicates of the final description below)
2020-07-24 20:00:12 melanie witt description updated: added the 14.2.2 "resized block file to 10 GiB" log excerpt after reference [7]; the resulting final description is:

The nova-ceph-multistore job is a relatively new job in nova; it is a version of the devstack-plugin-ceph-tempest-py3 job with some tweaks to make it run with multiple glance stores in ceph [1].

The job has recently started failing with 'No valid host was found. There are not enough hosts available.' errors.

We discussed this today in the #openstack-nova channel [2] and found that we're getting NoValidHost because nova-compute is reporting only 10G of space available in ceph, even though our ceph volume was created with a size of 24G. As a result, we get no allocation candidates from placement.

We traced the source of the 10G limit to the bluestore ceph backend. When backed by a file, ceph will create the file for the OSD if it doesn't already exist, and it will create that file with a default size. Example of it resizing to 10G [3] today:

  2020-07-24 03:51:44.470 7f36d4689f00 1 bluestore(/var/lib/ceph/osd/ceph-0) _setup_block_symlink_or_file resized block file to 10 GiB

When the job first began running, we were pulling ceph version tag 14.2.10 [4]:

  2020-07-23 16:10:50.781 7f1132261c00 0 ceph version 14.2.10-138-g1dfef83eeb (1dfef83eeb53147a5da8484f54fbcf46693b748f) nautilus (stable), process ceph-osd, pid 9309

which uses a default block file size of 100G [5]:

  2020-07-23 16:10:50.793 7f1132261c00 1 bluestore(/var/lib/ceph/osd/ceph-0) _setup_block_symlink_or_file resized block file to 100 GiB

However, today, we're pulling ceph version tag 14.2.2 [6]:

  2020-07-24 03:51:44.462 7f36d4689f00 0 ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable), process ceph-osd, pid 9317

which uses a default block file size of 10G [7]:

  2020-07-24 03:51:44.470 7f36d4689f00 1 bluestore(/var/lib/ceph/osd/ceph-0) _setup_block_symlink_or_file resized block file to 10 GiB

So with the reduced file size we're seeing a lot of NoValidHost failures for lack of space. We don't yet know what caused the change in the ceph version tag we're pulling in CI.

To address the issue, we're trying out a patch to devstack-plugin-ceph that sets the global bluestore_block_size config option to a more reasonable value instead of relying on the default:

  https://review.opendev.org/742961

Setting this bug as Critical as the failure rate looks to be about 80% over the most recent job runs and this job is voting:

  https://zuul.openstack.org/builds?job_name=nova-ceph-multistore

[1] https://review.opendev.org/734184
[2] http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2020-07-24.log.html#t2020-07-24T14:30:12
[3] https://zuul.openstack.org/build/fad88249c7e548d3b946c6d5792fd8fe/log/controller/logs/ceph/ceph-osd.0_log.txt#10
[4] https://zuul.openstack.org/build/ee7cedae2c6e43908a79d89a68554649/log/controller/logs/ceph/ceph-osd.0_log.txt#2
[5] https://github.com/ceph/ceph/blob/v14.2.10/src/common/options.cc#L4445-L4448
[6] https://zuul.openstack.org/build/fad88249c7e548d3b946c6d5792fd8fe/log/controller/logs/ceph/ceph-osd.0_log.txt#2
[7] https://github.com/ceph/ceph/blob/v14.2.2/src/common/options.cc#L4338-L4341
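For illustration, a minimal sketch of what pinning the option in ceph.conf could look like; this is an assumption for clarity, not the actual content of https://review.opendev.org/742961, and the 24 GiB value is chosen only to match the 24G volume the job creates. bluestore_block_size is expressed in bytes.

  # ceph.conf (hypothetical excerpt): size the file-backed bluestore OSD
  # explicitly instead of relying on the version-dependent default
  # (10 GiB in ceph 14.2.2, 100 GiB in 14.2.10).
  [global]
  bluestore_block_size = 25769803776  # 24 GiB = 24 * 1024^3 bytes

After restarting the OSD with this setting, the effective value should be checkable on the running daemon (assuming a single OSD named osd.0) with "ceph daemon osd.0 config get bluestore_block_size".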
2020-07-24 20:02:48 melanie witt nova: status New In Progress
2020-07-24 20:02:48 melanie witt nova: assignee Dan Smith (danms)
2020-07-27 14:32:23 OpenStack Infra devstack-plugin-ceph: status In Progress Fix Released
2020-07-27 21:12:12 melanie witt nova: status In Progress Fix Released