Activity log for bug #1843085

Date Who What changed Old value New value Message
2019-09-06 18:47:54 Kellen Renshaw bug added bug
2019-09-06 18:51:44 Kellen Renshaw bug watch added http://tracker.ceph.com/issues/38714
2019-09-06 18:51:44 Kellen Renshaw bug watch added http://tracker.ceph.com/issues/23223
2019-09-17 00:25:33 Dan Hill ceph (Ubuntu): assignee Dan Hill (hillpd)
2019-09-17 00:33:54 Dan Hill nominated for series Ubuntu Bionic
2019-09-17 00:33:54 Dan Hill bug task added ceph (Ubuntu Bionic)
2019-09-17 00:34:02 Dan Hill ceph (Ubuntu): assignee Dan Hill (hillpd)
2019-09-17 00:34:08 Dan Hill ceph (Ubuntu Bionic): assignee Dan Hill (hillpd)
2019-09-17 16:28:39 Dan Hill description This issue in the Ceph tracker has been encountered repeatedly with significant adverse effects on Ceph 12.2.11/12 in Bionic: https://tracker.ceph.com/issues/38454 This PR is the likely candidate for backporting to correct the issue: https://github.com/ceph/ceph/pull/26601 [Impact] Cancelling large S3/Swift object puts may result in garbage collection entries with zero-length chains. Rados gateway garbage collection does not efficiently process and clean up these zero-length chains. A large number of zero-length chains will result in rgw processes quickly spinning through the garbage collection lists doing very little work. This can result in abnormally high cpu utilization and op workloads. [Test Case] Disable garbage collection: `juju config ceph-radosgw config-flags='{"rgw": {"rgw enable gc threads": "false"}}'` Repeatedly kill 256MB object put requests for randomized object names. `for i in {0..1000}; do f=$(mktemp); fallocate -l 256M $f; s3cmd put $f s3://test_bucket & pid=$!; sleep $((RANDOM % 3)); kill $pid; rm $f; done` Capture omap detail. Verify zero-length chains were created: `for i in $(seq 0 ${RGW_GC_MAX_OBJS:-32}); do rados -p default.rgw.log --namespace gc listomapvals gc.$i; done` Raise radosgw debug levels, and enable garbage collection: `juju config ceph-radosgw config-flags='{"rgw": {"rgw enable gc threads": "true"}}' loglevel=20` Verify zero-length chains are processed correctly by inspecting radosgw logs. [Regression Potential] {Pending} Back-port still needs to be accepted upstream. Need complete fix to assess regression potential. [Other Information] This issue has been reported upstream [0] and was fixed in Nautilus alongside a number of other garbage collection issues/enhancements in pr#26601 [1]: * adds additional logging to make future debugging easier. * resolves bug where the truncated flag was not always set correctly in gc_iterate_entries * resolves bug where marker in RGWGC::process was not advanced * resolves bug in which gc entries with a zero-length chain were not trimmed * resolves bug where same gc entry tag was added to list for deletion multiple times These fixes were slated for back-port into Luminous and Mimic, but the Luminous work was not completed because of a required dependency: AIO GC [2]. This dependency has been resolved upstream, and is pending SRU verification in Ubuntu packages [3]. [0] https://tracker.ceph.com/issues/38454 [1] https://github.com/ceph/ceph/pull/26601 [2] https://tracker.ceph.com/issues/23223 [3] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1838858
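A minimal sketch of the interrupted-upload loop from the test case above, assuming s3cmd is already configured against the radosgw endpoint and that s3://test_bucket already exists:
```
#!/bin/bash
# Sketch of the reproduction loop: start a 256MB put, interrupt it at a
# random point, and repeat. Assumes s3cmd is configured and the bucket exists.
for i in {0..1000}; do
    f=$(mktemp)
    fallocate -l 256M "$f"
    s3cmd put "$f" s3://test_bucket &
    pid=$!
    sleep $((RANDOM % 3))      # let the upload make partial progress
    kill "$pid" 2>/dev/null    # cancel the put mid-flight
    rm -f "$f"
done
```
Each cancelled put should leave a garbage collection entry behind; without the fix, many of these end up as zero-length chains.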
2019-09-17 16:30:19 Dan Hill bug task added cloud-archive
2019-09-17 16:30:41 Dan Hill summary Need backport of 0-length gc chain fixes to Luminous Backport of zero-length gc chain fixes to Luminous
2019-09-17 17:00:10 Billy Olsen nominated for series cloud-archive/queens
2019-09-17 17:00:10 Billy Olsen bug task added cloud-archive/queens
2019-09-17 17:00:10 Billy Olsen nominated for series cloud-archive/rocky
2019-09-17 17:00:10 Billy Olsen bug task added cloud-archive/rocky
2019-09-17 17:00:20 Billy Olsen cloud-archive/rocky: status New Fix Released
2019-09-17 17:01:25 Billy Olsen cloud-archive/queens: assignee Dan Hill (hillpd)
2019-09-17 17:05:13 Dan Hill ceph (Ubuntu Bionic): status New In Progress
2019-09-19 06:56:22 James Page cloud-archive: status New Invalid
2019-09-19 06:56:24 James Page ceph (Ubuntu): status New Invalid
2019-09-19 06:56:28 James Page cloud-archive/queens: status New Triaged
2019-09-19 06:56:30 James Page ceph (Ubuntu Bionic): importance Undecided High
2019-09-19 06:56:32 James Page cloud-archive/rocky: importance Undecided High
2019-09-19 06:56:34 James Page cloud-archive/queens: importance Undecided High
2019-09-19 07:06:08 James Page bug added subscriber MIR approval team
2019-09-22 17:41:14 Eric Desrochers bug added subscriber Eric Desrochers
2019-10-14 13:47:59 Edward Hope-Morley tags sts-sru-needed
2019-11-26 15:15:02 James Page description [Impact] Cancelling large S3/Swift object puts may result in garbage collection entries with zero-length chains. Rados gateway garbage collection does not efficiently process and clean up these zero-length chains. A large number of zero-length chains will result in rgw processes quickly spinning through the garbage collection lists doing very little work. This can result in abnormally high cpu utilization and op workloads. [Test Case] Disable garbage collection: `juju config ceph-radosgw config-flags='{"rgw": {"rgw enable gc threads": "false"}}'` Repeatedly kill 256MB object put requests for randomized object names. `for i in {0..1000}; do f=$(mktemp); fallocate -l 256M $f; s3cmd put $f s3://test_bucket & pid=$!; sleep $((RANDOM % 3)); kill $pid; rm $f; done` Capture omap detail. Verify zero-length chains were created: `for i in $(seq 0 ${RGW_GC_MAX_OBJS:-32}); do rados -p default.rgw.log --namespace gc listomapvals gc.$i; done` Raise radosgw debug levels, and enable garbage collection: `juju config ceph-radosgw config-flags='{"rgw": {"rgw enable gc threads": "true"}}' loglevel=20` Verify zero-length chains are processed correctly by inspecting radosgw logs. [Regression Potential] {Pending} Back-port still needs to be accepted upstream. Need complete fix to assess regression potential. [Other Information] This issue has been reported upstream [0] and was fixed in Nautilus alongside a number of other garbage collection issues/enhancements in pr#26601 [1]: * adds additional logging to make future debugging easier. * resolves bug where the truncated flag was not always set correctly in gc_iterate_entries * resolves bug where marker in RGWGC::process was not advanced * resolves bug in which gc entries with a zero-length chain were not trimmed * resolves bug where same gc entry tag was added to list for deletion multiple times These fixes were slated for back-port into Luminous and Mimic, but the Luminous work was not completed because of a required dependency: AIO GC [2]. This dependency has been resolved upstream, and is pending SRU verification in Ubuntu packages [3]. [0] https://tracker.ceph.com/issues/38454 [1] https://github.com/ceph/ceph/pull/26601 [2] https://tracker.ceph.com/issues/23223 [3] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1838858 [Impact] Cancelling large S3/Swift object puts may result in garbage collection entries with zero-length chains. Rados gateway garbage collection does not efficiently process and clean up these zero-length chains. A large number of zero-length chains will result in rgw processes quickly spinning through the garbage collection lists doing very little work. This can result in abnormally high cpu utilization and op workloads. [Test Case] Disable garbage collection: `juju config ceph-radosgw config-flags='{"rgw": {"rgw enable gc threads": "false"}}'` Repeatedly kill 256MB object put requests for randomized object names. `for i in {0..1000}; do f=$(mktemp); fallocate -l 256M $f; s3cmd put $f s3://test_bucket & pid=$!; sleep $((RANDOM % 3)); kill $pid; rm $f; done` Capture omap detail. Verify zero-length chains were created: `for i in $(seq 0 ${RGW_GC_MAX_OBJS:-32}); do rados -p default.rgw.log --namespace gc listomapvals gc.$i; done` Raise radosgw debug levels, and enable garbage collection: `juju config ceph-radosgw config-flags='{"rgw": {"rgw enable gc threads": "true"}}' loglevel=20` Verify zero-length chains are processed correctly by inspecting radosgw logs. 
[Regression Potential] Backport has been accepted into the Luminous release stable branch upstream. [Other Information] This issue has been reported upstream [0] and was fixed in Nautilus alongside a number of other garbage collection issues/enhancements in pr#26601 [1]: * adds additional logging to make future debugging easier. * resolves bug where the truncated flag was not always set correctly in gc_iterate_entries * resolves bug where marker in RGWGC::process was not advanced * resolves bug in which gc entries with a zero-length chain were not trimmed * resolves bug where same gc entry tag was added to list for deletion multiple times These fixes were slated for back-port into Luminous and Mimic, but the Luminous work was not completed because of a required dependency: AIO GC [2]. This dependency has been resolved upstream, and is pending SRU verification in Ubuntu packages [3]. [0] https://tracker.ceph.com/issues/38454 [1] https://github.com/ceph/ceph/pull/26601 [2] https://tracker.ceph.com/issues/23223 [3] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1838858
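A small sketch for inspecting how many entries each gc shard holds, assuming the default rgw_gc_max_objs of 32 (which yields shard objects gc.0 through gc.31) and the default zone pool names used in the test case:
```
#!/bin/bash
# Sketch: print the omap entry count per gc shard. A large number of entries
# that never shrinks after garbage collection runs points at the
# zero-length-chain problem described above.
for i in $(seq 0 31); do
    n=$(rados -p default.rgw.log --namespace gc listomapkeys "gc.$i" | wc -l)
    echo "gc.$i: $n entries"
done
```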
2019-11-29 12:19:20 Timo Aaltonen ceph (Ubuntu Bionic): status In Progress Fix Committed
2019-11-29 12:19:23 Timo Aaltonen bug added subscriber Ubuntu Stable Release Updates Team
2019-11-29 12:19:26 Timo Aaltonen bug added subscriber SRU Verification
2019-11-29 12:19:30 Timo Aaltonen tags sts-sru-needed sts-sru-needed verification-needed verification-needed-bionic
2019-12-02 14:42:46 James Page cloud-archive/queens: status Triaged Fix Committed
2019-12-02 14:42:48 James Page tags sts-sru-needed verification-needed verification-needed-bionic sts-sru-needed verification-needed verification-needed-bionic verification-queens-needed
2020-01-14 07:32:28 Dan Hill description [Impact] Cancelling large S3/Swift object puts may result in garbage collection entries with zero-length chains. Rados gateway garbage collection does not efficiently process and clean up these zero-length chains. A large number of zero-length chains will result in rgw processes quickly spinning through the garbage collection lists doing very little work. This can result in abnormally high cpu utilization and op workloads. [Test Case] Disable garbage collection: `juju config ceph-radosgw config-flags='{"rgw": {"rgw enable gc threads": "false"}}'` Repeatedly kill 256MB object put requests for randomized object names. `for i in {0..1000}; do f=$(mktemp); fallocate -l 256M $f; s3cmd put $f s3://test_bucket & pid=$!; sleep $((RANDOM % 3)); kill $pid; rm $f; done` Capture omap detail. Verify zero-length chains were created: `for i in $(seq 0 ${RGW_GC_MAX_OBJS:-32}); do rados -p default.rgw.log --namespace gc listomapvals gc.$i; done` Raise radosgw debug levels, and enable garbage collection: `juju config ceph-radosgw config-flags='{"rgw": {"rgw enable gc threads": "true"}}' loglevel=20` Verify zero-length chains are processed correctly by inspecting radosgw logs. [Regression Potential] Backport has been accepted into the Luminous release stable branch upstream. [Other Information] This issue has been reported upstream [0] and was fixed in Nautilus alongside a number of other garbage collection issues/enhancements in pr#26601 [1]: * adds additional logging to make future debugging easier. * resolves bug where the truncated flag was not always set correctly in gc_iterate_entries * resolves bug where marker in RGWGC::process was not advanced * resolves bug in which gc entries with a zero-length chain were not trimmed * resolves bug where same gc entry tag was added to list for deletion multiple times These fixes were slated for back-port into Luminous and Mimic, but the Luminous work was not completed because of a required dependency: AIO GC [2]. This dependency has been resolved upstream, and is pending SRU verification in Ubuntu packages [3]. [0] https://tracker.ceph.com/issues/38454 [1] https://github.com/ceph/ceph/pull/26601 [2] https://tracker.ceph.com/issues/23223 [3] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1838858 [Impact] Cancelling large S3/Swift object puts may result in garbage collection entries with zero-length chains. Rados gateway garbage collection does not efficiently process and clean up these zero-length chains. A large number of zero-length chains will result in rgw processes quickly spinning through the garbage collection lists doing very little work. This can result in abnormally high cpu utilization and op workloads. [Test Case] Modify garbage collection parameters by editing ceph.conf on the target rgw: ``` [client.rgw.juju-29f238-sf00242079-4] rgw enable gc threads = false rgw gc obj min wait = 60 rgw gc processor period = 60 ``` Restart the ceph-radosgw service to apply the new configuration: `sudo systemctl restart ceph-radosgw@rgw.juju-29f238-sf00242079-4` Repeatedly interrupt 512MB object put requests for randomized object names: ``` for i in {0..1000}; do f=$(mktemp); fallocate -l 512M $f s3cmd put $f s3://test_bucket.juju-29f238-sf00242079-4 --disable-multipart & pid=$! 
sleep $((RANDOM % 7 + 3)); kill $pid rm $f done ``` Delete all objects in the bucket index: ``` for f in $(s3cmd ls s3://test_bucket.juju-29f238-sf00242079-4 | awk '{print $4}'); do s3cmd del $f done ``` By default rgw_gc_max_objs splits the garbage collection list into 32 shards. Capture omap detail and verify zero-length chains were left over: ``` for i in {0..31}; do sudo rados -p default.rgw.log --namespace gc listomapvals gc.$i done ``` Confirm the garbage collection list contains expired objects by listing expiration timestamps: `sudo radosgw-admin gc list | grep time; date` Raise the debug level and process the garbage collection list: `CEPH_ARGS="--debug-rgw=20 --err-to-stderr" sudo -E radosgw-admin gc process` Use the logs to verify the garbage collection process iterates through all remaining omap entry tags. Then confirm all rados objects have been cleaned up: `sudo rados -p default.rgw.buckets.data ls` [Regression Potential] Backport has been accepted into the Luminous release stable branch upstream. [Other Information] This issue has been reported upstream [0] and was fixed in Nautilus alongside a number of other garbage collection issues/enhancements in pr#26601 [1]: * adds additional logging to make future debugging easier. * resolves bug where the truncated flag was not always set correctly in gc_iterate_entries * resolves bug where marker in RGWGC::process was not advanced * resolves bug in which gc entries with a zero-length chain were not trimmed * resolves bug where same gc entry tag was added to list for deletion multiple times These fixes were slated for back-port into Luminous and Mimic, but the Luminous work was not completed because of a required dependency: AIO GC [2]. This dependency has been resolved upstream, and is pending SRU verification in Ubuntu packages [3]. [0] https://tracker.ceph.com/issues/38454 [1] https://github.com/ceph/ceph/pull/26601 [2] https://tracker.ceph.com/issues/23223 [3] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1838858
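A sketch wrapping the manual gc run from the test case above so the trimming of zero-length chains shows up as a drop in entry counts; it assumes the default 32 gc shards, the default zone pool names, and the 60-second "rgw gc obj min wait" set in the ceph.conf snippet (the log file path is arbitrary):
```
#!/bin/bash
# Sketch: total gc omap entries before and after a manual, debug-logged gc run.
count_gc_entries() {
    local total=0 n
    for i in $(seq 0 31); do
        n=$(sudo rados -p default.rgw.log --namespace gc listomapkeys "gc.$i" | wc -l)
        total=$((total + n))
    done
    echo "$total"
}

echo "gc entries before: $(count_gc_entries)"
sleep 60   # give entries time to pass the 60s "rgw gc obj min wait" configured above
CEPH_ARGS="--debug-rgw=20 --err-to-stderr" sudo -E radosgw-admin gc process 2> /tmp/gc-process.log
echo "gc entries after:  $(count_gc_entries)"
```
With the fix applied, the second count should drop to zero and the debug log should show the process iterating through all remaining tags.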
2020-01-14 07:33:50 Dan Hill tags sts-sru-needed verification-needed verification-needed-bionic verification-queens-needed sts-sru-needed verification-done-bionic verification-needed verification-queens-needed
2020-01-14 09:15:56 Dan Hill description [Impact] Cancelling large S3/Swift object puts may result in garbage collection entries with zero-length chains. Rados gateway garbage collection does not efficiently process and clean up these zero-length chains. A large number of zero-length chains will result in rgw processes quickly spinning through the garbage collection lists doing very little work. This can result in abnormally high cpu utilization and op workloads. [Test Case] Modify garbage collection parameters by editing ceph.conf on the target rgw: ``` [client.rgw.juju-29f238-sf00242079-4] rgw enable gc threads = false rgw gc obj min wait = 60 rgw gc processor period = 60 ``` Restart the ceph-radosgw service to apply the new configuration: `sudo systemctl restart ceph-radosgw@rgw.juju-29f238-sf00242079-4` Repeatedly interrupt 512MB object put requests for randomized object names: ``` for i in {0..1000}; do f=$(mktemp); fallocate -l 512M $f s3cmd put $f s3://test_bucket.juju-29f238-sf00242079-4 --disable-multipart & pid=$! sleep $((RANDOM % 7 + 3)); kill $pid rm $f done ``` Delete all objects in the bucket index: ``` for f in $(s3cmd ls s3://test_bucket.juju-29f238-sf00242079-4 | awk '{print $4}'); do s3cmd del $f done ``` By default rgw_gc_max_objs splits the garbage collection list into 32 shards. Capture omap detail and verify zero-length chains were left over: ``` for i in {0..31}; do sudo rados -p default.rgw.log --namespace gc listomapvals gc.$i done ``` Confirm the garbage collection list contains expired objects by listing expiration timestamps: `sudo radosgw-admin gc list | grep time; date` Raise the debug level and process the garbage collection list: `CEPH_ARGS="--debug-rgw=20 --err-to-stderr" sudo -E radosgw-admin gc process` Use the logs to verify the garbage collection process iterates through all remaining omap entry tags. Then confirm all rados objects have been cleaned up: `sudo rados -p default.rgw.buckets.data ls` [Regression Potential] Backport has been accepted into the Luminous release stable branch upstream. [Other Information] This issue has been reported upstream [0] and was fixed in Nautilus alongside a number of other garbage collection issues/enhancements in pr#26601 [1]: * adds additional logging to make future debugging easier. * resolves bug where the truncated flag was not always set correctly in gc_iterate_entries * resolves bug where marker in RGWGC::process was not advanced * resolves bug in which gc entries with a zero-length chain were not trimmed * resolves bug where same gc entry tag was added to list for deletion multiple times These fixes were slated for back-port into Luminous and Mimic, but the Luminous work was not completed because of a required dependency: AIO GC [2]. This dependency has been resolved upstream, and is pending SRU verification in Ubuntu packages [3]. [0] https://tracker.ceph.com/issues/38454 [1] https://github.com/ceph/ceph/pull/26601 [2] https://tracker.ceph.com/issues/23223 [3] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1838858 [Impact] Cancelling large S3/Swift object puts may result in garbage collection entries with zero-length chains. Rados gateway garbage collection does not efficiently process and clean up these zero-length chains. A large number of zero-length chains will result in rgw processes quickly spinning through the garbage collection lists doing very little work. This can result in abnormally high cpu utilization and op workloads. 
[Test Case] Modify garbage collection parameters by editing ceph.conf on the target rgw: ``` rgw enable gc threads = false rgw gc obj min wait = 60 rgw gc processor period = 60 ``` Restart the ceph-radosgw service to apply the new configuration: `sudo systemctl restart ceph-radosgw@rgw.$HOSTNAME` Repeatedly interrupt 512MB object put requests for randomized object names: ``` for i in {0..1000}; do f=$(mktemp); fallocate -l 512M $f s3cmd put $f s3://test_bucket --disable-multipart & pid=$! sleep $((RANDOM % 7 + 3)); kill $pid rm $f done ``` Delete all objects in the bucket index: ``` for f in $(s3cmd ls s3://test_bucket | awk '{print $4}'); do s3cmd del $f done ``` By default rgw_gc_max_objs splits the garbage collection list into 32 shards. Capture omap detail and verify zero-length chains were left over: ``` export CEPH_ARGS="--id=rgw.$HOSTNAME" for i in {0..31}; do sudo -E rados -p default.rgw.log --namespace gc listomapvals gc.$i done ``` Confirm the garbage collection list contains expired objects by listing expiration timestamps: `sudo -E radosgw-admin gc list | grep time; date` Raise the debug level and process the garbage collection list: `sudo -E radosgw-admin --debug-rgw=20 --err-to-stderr gc process` Use the logs to verify the garbage collection process iterates through all remaining omap entry tags. Then confirm all rados objects have been cleaned up: `sudo -E rados -p default.rgw.buckets.data ls` [Regression Potential] Backport has been accepted into the Luminous release stable branch upstream. [Other Information] This issue has been reported upstream [0] and was fixed in Nautilus alongside a number of other garbage collection issues/enhancements in pr#26601 [1]: * adds additional logging to make future debugging easier. * resolves bug where the truncated flag was not always set correctly in gc_iterate_entries * resolves bug where marker in RGWGC::process was not advanced * resolves bug in which gc entries with a zero-length chain were not trimmed * resolves bug where same gc entry tag was added to list for deletion multiple times These fixes were slated for back-port into Luminous and Mimic, but the Luminous work was not completed because of a required dependency: AIO GC [2]. This dependency has been resolved upstream, and is pending SRU verification in Ubuntu packages [3]. [0] https://tracker.ceph.com/issues/38454 [1] https://github.com/ceph/ceph/pull/26601 [2] https://tracker.ceph.com/issues/23223 [3] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1838858
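A short sketch of the closing verification from the test case above, reusing its CEPH_ARGS convention for authentication; both numbers should reach zero once `gc process` has trimmed the zero-length chains (pool name assumes the default zone):
```
#!/bin/bash
# Sketch: confirm nothing is left behind after the gc run.
export CEPH_ARGS="--id=rgw.$HOSTNAME"
echo "objects left in default.rgw.buckets.data: $(sudo -E rados -p default.rgw.buckets.data ls | wc -l)"
echo "gc entries still listing an expiration time: $(sudo -E radosgw-admin gc list | grep -c time || true)"
```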
2020-01-14 09:16:46 Dan Hill tags sts-sru-needed verification-done-bionic verification-needed verification-queens-needed sts-sru-needed verification-done-bionic verification-needed verification-queens-done
2020-01-15 18:08:25 Dan Hill tags sts-sru-needed verification-done-bionic verification-needed verification-queens-done sts-sru-needed verification-done verification-done-bionic verification-queens-done
2020-01-16 11:31:36 Łukasz Zemczak tags sts-sru-needed verification-done verification-done-bionic verification-queens-done sts-sru-needed verification-needed verification-needed-bionic verification-queens-done
2020-01-16 17:46:40 Dan Hill tags sts-sru-needed verification-needed verification-needed-bionic verification-queens-done sts-sru-needed verification-done verification-done-bionic verification-queens-done
2020-01-20 16:51:44 Launchpad Janitor ceph (Ubuntu Bionic): status Fix Committed Fix Released
2020-01-20 16:52:16 Łukasz Zemczak removed subscriber Ubuntu Stable Release Updates Team
2020-01-21 13:31:11 James Page cloud-archive/queens: status Fix Committed Fix Released