2019-09-06 18:47:54 |
Kellen Renshaw |
bug |
|
|
added bug |
2019-09-06 18:51:44 |
Kellen Renshaw |
bug watch added |
|
http://tracker.ceph.com/issues/38714 |
|
2019-09-06 18:51:44 |
Kellen Renshaw |
bug watch added |
|
http://tracker.ceph.com/issues/23223 |
|
2019-09-17 00:25:33 |
Dan Hill |
ceph (Ubuntu): assignee |
|
Dan Hill (hillpd) |
|
2019-09-17 00:33:54 |
Dan Hill |
nominated for series |
|
Ubuntu Bionic |
|
2019-09-17 00:33:54 |
Dan Hill |
bug task added |
|
ceph (Ubuntu Bionic) |
|
2019-09-17 00:34:02 |
Dan Hill |
ceph (Ubuntu): assignee |
Dan Hill (hillpd) |
|
|
2019-09-17 00:34:08 |
Dan Hill |
ceph (Ubuntu Bionic): assignee |
|
Dan Hill (hillpd) |
|
2019-09-17 16:28:39 |
Dan Hill |
description |
This Ceph tracker issue has been encountered repeatedly, with significant adverse effects, on Ceph 12.2.11/12.2.12 in Bionic:
https://tracker.ceph.com/issues/38454
This PR is the likely candidate for backporting to correct the issue:
https://github.com/ceph/ceph/pull/26601 |
[Impact]
Cancelling large S3/Swift object puts may result in garbage collection entries with zero-length chains. Rados gateway garbage collection does not efficiently process and clean up these zero-length chains.
A large number of zero-length chains will result in rgw processes quickly spinning through the garbage collection lists doing very little work. This can result in abnormally high cpu utilization and op workloads.
[Test Case]
Disable garbage collection:
`juju config ceph-radosgw config-flags='{"rgw": {"rgw enable gc threads": "false"}}'`
Repeatedly kill 256MB object put requests for randomized object names:
`for i in {0..1000}; do f=$(mktemp); fallocate -l 256M $f; s3cmd put $f s3://test_bucket & pid=$!; sleep $((RANDOM % 3)); kill $pid; rm $f; done`
Capture omap detail. Verify zero-length chains were created:
`for i in $(seq 0 $(( ${RGW_GC_MAX_OBJS:-32} - 1 ))); do rados -p default.rgw.log --namespace gc listomapvals gc.$i; done`
Raise radosgw debug levels and enable garbage collection:
`juju config ceph-radosgw config-flags='{"rgw": {"rgw enable gc threads": "true"}}' loglevel=20`
Verify zero-length chains are processed correctly by inspecting radosgw logs.
[Regression Potential]
{Pending} The backport still needs to be accepted upstream; the complete fix is needed to assess regression potential.
[Other Information]
This issue has been reported upstream [0] and was fixed in Nautilus alongside a number of other garbage collection issues/enhancements in pr#26601 [1]:
* adds additional logging to make future debugging easier.
* resolves bug where the truncated flag was not always set correctly in gc_iterate_entries
* resolves bug where marker in RGWGC::process was not advanced
* resolves bug in which gc entries with a zero-length chain were not trimmed
* resolves bug where same gc entry tag was added to list for deletion multiple times
These fixes were slated for back-port into Luminous and Mimic, but the Luminous work was not completed because of a required dependency: AIO GC [2]. This dependency has been resolved upstream, and is pending SRU verification in Ubuntu packages [3].
[0] https://tracker.ceph.com/issues/38454
[1] https://github.com/ceph/ceph/pull/26601
[2] https://tracker.ceph.com/issues/23223
[3] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1838858 |
|
2019-09-17 16:30:19 |
Dan Hill |
bug task added |
|
cloud-archive |
|
2019-09-17 16:30:41 |
Dan Hill |
summary |
Need backport of 0-length gc chain fixes to Luminous |
Backport of zero-length gc chain fixes to Luminous |
|
2019-09-17 17:00:10 |
Billy Olsen |
nominated for series |
|
cloud-archive/queens |
|
2019-09-17 17:00:10 |
Billy Olsen |
bug task added |
|
cloud-archive/queens |
|
2019-09-17 17:00:10 |
Billy Olsen |
nominated for series |
|
cloud-archive/rocky |
|
2019-09-17 17:00:10 |
Billy Olsen |
bug task added |
|
cloud-archive/rocky |
|
2019-09-17 17:00:20 |
Billy Olsen |
cloud-archive/rocky: status |
New |
Fix Released |
|
2019-09-17 17:01:25 |
Billy Olsen |
cloud-archive/queens: assignee |
|
Dan Hill (hillpd) |
|
2019-09-17 17:05:13 |
Dan Hill |
ceph (Ubuntu Bionic): status |
New |
In Progress |
|
2019-09-19 06:56:22 |
James Page |
cloud-archive: status |
New |
Invalid |
|
2019-09-19 06:56:24 |
James Page |
ceph (Ubuntu): status |
New |
Invalid |
|
2019-09-19 06:56:28 |
James Page |
cloud-archive/queens: status |
New |
Triaged |
|
2019-09-19 06:56:30 |
James Page |
ceph (Ubuntu Bionic): importance |
Undecided |
High |
|
2019-09-19 06:56:32 |
James Page |
cloud-archive/rocky: importance |
Undecided |
High |
|
2019-09-19 06:56:34 |
James Page |
cloud-archive/queens: importance |
Undecided |
High |
|
2019-09-19 07:06:08 |
James Page |
bug |
|
|
added subscriber MIR approval team |
2019-09-22 17:41:14 |
Eric Desrochers |
bug |
|
|
added subscriber Eric Desrochers |
2019-10-14 13:47:59 |
Edward Hope-Morley |
tags |
|
sts-sru-needed |
|
2019-11-26 15:15:02 |
James Page |
description |
[Impact]
Cancelling large S3/Swift object puts may result in garbage collection entries with zero-length chains. Rados gateway garbage collection does not efficiently process and clean up these zero-length chains.
A large number of zero-length chains will result in rgw processes quickly spinning through the garbage collection lists doing very little work. This can result in abnormally high cpu utilization and op workloads.
[Test Case]
Disable garbage collection:
`juju config ceph-radosgw config-flags='{"rgw": {"rgw enable gc threads": "false"}}'`
Repeatedly kill 256MB object put requests for randomized object names:
`for i in {0..1000}; do f=$(mktemp); fallocate -l 256M $f; s3cmd put $f s3://test_bucket & pid=$!; sleep $((RANDOM % 3)); kill $pid; rm $f; done`
Capture omap detail. Verify zero-length chains were created:
`for i in $(seq 0 $(( ${RGW_GC_MAX_OBJS:-32} - 1 ))); do rados -p default.rgw.log --namespace gc listomapvals gc.$i; done`
Raise radosgw debug levels and enable garbage collection:
`juju config ceph-radosgw config-flags='{"rgw": {"rgw enable gc threads": "true"}}' loglevel=20`
Verify zero-length chains are processed correctly by inspecting radosgw logs.
[Regression Potential]
{Pending} The backport still needs to be accepted upstream; the complete fix is needed to assess regression potential.
[Other Information]
This issue has been reported upstream [0] and was fixed in Nautilus alongside a number of other garbage collection issues/enhancements in pr#26601 [1]:
* adds additional logging to make future debugging easier.
* resolves bug where the truncated flag was not always set correctly in gc_iterate_entries
* resolves bug where marker in RGWGC::process was not advanced
* resolves bug in which gc entries with a zero-length chain were not trimmed
* resolves bug where same gc entry tag was added to list for deletion multiple times
These fixes were slated for back-port into Luminous and Mimic, but the Luminous work was not completed because of a required dependency: AIO GC [2]. This dependency has been resolved upstream, and is pending SRU verification in Ubuntu packages [3].
[0] https://tracker.ceph.com/issues/38454
[1] https://github.com/ceph/ceph/pull/26601
[2] https://tracker.ceph.com/issues/23223
[3] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1838858 |
[Impact]
Cancelling large S3/Swift object puts may result in garbage collection entries with zero-length chains. Rados gateway garbage collection does not efficiently process and clean up these zero-length chains.
A large number of zero-length chains will result in rgw processes quickly spinning through the garbage collection lists doing very little work. This can result in abnormally high cpu utilization and op workloads.
[Test Case]
Disable garbage collection:
`juju config ceph-radosgw config-flags='{"rgw": {"rgw enable gc threads": "false"}}'`
Repeatedly kill 256MB object put requests for randomized object names:
`for i in {0..1000}; do f=$(mktemp); fallocate -l 256M $f; s3cmd put $f s3://test_bucket & pid=$!; sleep $((RANDOM % 3)); kill $pid; rm $f; done`
Capture omap detail. Verify zero-length chains were created:
`for i in $(seq 0 $(( ${RGW_GC_MAX_OBJS:-32} - 1 ))); do rados -p default.rgw.log --namespace gc listomapvals gc.$i; done`
Raise radosgw debug levels and enable garbage collection:
`juju config ceph-radosgw config-flags='{"rgw": {"rgw enable gc threads": "true"}}' loglevel=20`
Verify zero-length chains are processed correctly by inspecting radosgw logs.
[Regression Potential]
Backport has been accepted into the Luminous release stable branch upstream.
[Other Information]
This issue has been reported upstream [0] and was fixed in Nautilus alongside a number of other garbage collection issues/enhancements in pr#26601 [1]:
* adds additional logging to make future debugging easier.
* resolves bug where the truncated flag was not always set correctly in gc_iterate_entries
* resolves bug where marker in RGWGC::process was not advanced
* resolves bug in which gc entries with a zero-length chain were not trimmed
* resolves bug where same gc entry tag was added to list for deletion multiple times
These fixes were slated for back-port into Luminous and Mimic, but the Luminous work was not completed because of a required dependency: AIO GC [2]. This dependency has been resolved upstream, and is pending SRU verification in Ubuntu packages [3].
[0] https://tracker.ceph.com/issues/38454
[1] https://github.com/ceph/ceph/pull/26601
[2] https://tracker.ceph.com/issues/23223
[3] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1838858 |
|
2019-11-29 12:19:20 |
Timo Aaltonen |
ceph (Ubuntu Bionic): status |
In Progress |
Fix Committed |
|
2019-11-29 12:19:23 |
Timo Aaltonen |
bug |
|
|
added subscriber Ubuntu Stable Release Updates Team |
2019-11-29 12:19:26 |
Timo Aaltonen |
bug |
|
|
added subscriber SRU Verification |
2019-11-29 12:19:30 |
Timo Aaltonen |
tags |
sts-sru-needed |
sts-sru-needed verification-needed verification-needed-bionic |
|
2019-12-02 14:42:46 |
James Page |
cloud-archive/queens: status |
Triaged |
Fix Committed |
|
2019-12-02 14:42:48 |
James Page |
tags |
sts-sru-needed verification-needed verification-needed-bionic |
sts-sru-needed verification-needed verification-needed-bionic verification-queens-needed |
|
2020-01-14 07:32:28 |
Dan Hill |
description |
[Impact]
Cancelling large S3/Swift object puts may result in garbage collection entries with zero-length chains. Rados gateway garbage collection does not efficiently process and clean up these zero-length chains.
A large number of zero-length chains will result in rgw processes quickly spinning through the garbage collection lists doing very little work. This can result in abnormally high cpu utilization and op workloads.
[Test Case]
Disable garbage collection:
`juju config ceph-radosgw config-flags='{"rgw": {"rgw enable gc threads": "false"}}'`
Repeatedly kill 256MB object put requests for randomized object names:
`for i in {0..1000}; do f=$(mktemp); fallocate -l 256M $f; s3cmd put $f s3://test_bucket & pid=$!; sleep $((RANDOM % 3)); kill $pid; rm $f; done`
Capture omap detail. Verify zero-length chains were created:
`for i in $(seq 0 $(( ${RGW_GC_MAX_OBJS:-32} - 1 ))); do rados -p default.rgw.log --namespace gc listomapvals gc.$i; done`
Raise radosgw debug levels and enable garbage collection:
`juju config ceph-radosgw config-flags='{"rgw": {"rgw enable gc threads": "true"}}' loglevel=20`
Verify zero-length chains are processed correctly by inspecting radosgw logs.
[Regression Potential]
Backport has been accepted into the Luminous release stable branch upstream.
[Other Information]
This issue has been reported upstream [0] and was fixed in Nautilus alongside a number of other garbage collection issues/enhancements in pr#26601 [1]:
* adds additional logging to make future debugging easier.
* resolves bug where the truncated flag was not always set correctly in gc_iterate_entries
* resolves bug where marker in RGWGC::process was not advanced
* resolves bug in which gc entries with a zero-length chain were not trimmed
* resolves bug where same gc entry tag was added to list for deletion multiple times
These fixes were slated for back-port into Luminous and Mimic, but the Luminous work was not completed because of a required dependency: AIO GC [2]. This dependency has been resolved upstream, and is pending SRU verification in Ubuntu packages [3].
[0] https://tracker.ceph.com/issues/38454
[1] https://github.com/ceph/ceph/pull/26601
[2] https://tracker.ceph.com/issues/23223
[3] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1838858 |
[Impact]
Cancelling large S3/Swift object puts may result in garbage collection entries with zero-length chains. Rados gateway garbage collection does not efficiently process and clean up these zero-length chains.
A large number of zero-length chains will result in rgw processes quickly spinning through the garbage collection lists doing very little work. This can result in abnormally high cpu utilization and op workloads.
[Test Case]
Modify garbage collection parameters by editing ceph.conf on the target rgw:
```
[client.rgw.juju-29f238-sf00242079-4]
rgw enable gc threads = false
rgw gc obj min wait = 60
rgw gc processor period = 60
```
Restart the ceph-radosgw service to apply the new configuration:
`sudo systemctl restart ceph-radosgw@rgw.juju-29f238-sf00242079-4`
Repeatedly interrupt 512MB object put requests for randomized object names:
```
for i in {0..1000}; do
f=$(mktemp); fallocate -l 512M $f
s3cmd put $f s3://test_bucket.juju-29f238-sf00242079-4 --disable-multipart &
pid=$!
sleep $((RANDOM % 7 + 3)); kill $pid
rm $f
done
```
Delete all objects in the bucket index:
```
for f in $(s3cmd ls s3://test_bucket.juju-29f238-sf00242079-4 | awk '{print $4}'); do
s3cmd del $f
done
```
By default, rgw_gc_max_objs splits the garbage collection list into 32 shards.
Capture omap detail and verify zero-length chains were left over:
```
for i in {0..31}; do
sudo rados -p default.rgw.log --namespace gc listomapvals gc.$i
done
```
Confirm the garbage collection list contains expired objects by listing expiration timestamps:
`sudo radosgw-admin gc list | grep time; date`
Raise the debug level and process the garbage collection list:
`CEPH_ARGS="--debug-rgw=20 --err-to-stderr" sudo -E radosgw-admin gc process`
Use the logs to verify the garbage collection process iterates through all remaining omap entry tags. Then confirm all rados objects have been cleaned up:
`sudo rados -p default.rgw.buckets.data ls`
[Regression Potential]
Backport has been accepted into the Luminous release stable branch upstream.
[Other Information]
This issue has been reported upstream [0] and was fixed in Nautilus alongside a number of other garbage collection issues/enhancements in pr#26601 [1]:
* adds additional logging to make future debugging easier.
* resolves bug where the truncated flag was not always set correctly in gc_iterate_entries
* resolves bug where marker in RGWGC::process was not advanced
* resolves bug in which gc entries with a zero-length chain were not trimmed
* resolves bug where same gc entry tag was added to list for deletion multiple times
These fixes were slated for back-port into Luminous and Mimic, but the Luminous work was not completed because of a required dependency: AIO GC [2]. This dependency has been resolved upstream, and is pending SRU verification in Ubuntu packages [3].
[0] https://tracker.ceph.com/issues/38454
[1] https://github.com/ceph/ceph/pull/26601
[2] https://tracker.ceph.com/issues/23223
[3] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1838858 |
|
2020-01-14 07:33:50 |
Dan Hill |
tags |
sts-sru-needed verification-needed verification-needed-bionic verification-queens-needed |
sts-sru-needed verification-done-bionic verification-needed verification-queens-needed |
|
2020-01-14 09:15:56 |
Dan Hill |
description |
[Impact]
Cancelling large S3/Swift object puts may result in garbage collection entries with zero-length chains. Rados gateway garbage collection does not efficiently process and clean up these zero-length chains.
A large number of zero-length chains will result in rgw processes quickly spinning through the garbage collection lists doing very little work. This can result in abnormally high cpu utilization and op workloads.
[Test Case]
Modify garbage collection parameters by editing ceph.conf on the target rgw:
```
[client.rgw.juju-29f238-sf00242079-4]
rgw enable gc threads = false
rgw gc obj min wait = 60
rgw gc processor period = 60
```
Restart the ceph-radosgw service to apply the new configuration:
`sudo systemctl restart ceph-radosgw@rgw.juju-29f238-sf00242079-4`
Repeatedly interrupt 512MB object put requests for randomized object names:
```
for i in {0..1000}; do
f=$(mktemp); fallocate -l 512M $f
s3cmd put $f s3://test_bucket.juju-29f238-sf00242079-4 --disable-multipart &
pid=$!
sleep $((RANDOM % 7 + 3)); kill $pid
rm $f
done
```
Delete all objects in the bucket index:
```
for f in $(s3cmd ls s3://test_bucket.juju-29f238-sf00242079-4 | awk '{print $4}'); do
s3cmd del $f
done
```
By default, rgw_gc_max_objs splits the garbage collection list into 32 shards.
Capture omap detail and verify zero-length chains were left over:
```
for i in {0..31}; do
sudo rados -p default.rgw.log --namespace gc listomapvals gc.$i
done
```
Confirm the garbage collection list contains expired objects by listing expiration timestamps:
`sudo radosgw-admin gc list | grep time; date`
Raise the debug level and process the garbage collection list:
`CEPH_ARGS="--debug-rgw=20 --err-to-stderr" sudo -E radosgw-admin gc process`
Use the logs to verify the garbage collection process iterates through all remaining omap entry tags. Then confirm all rados objects have been cleaned up:
`sudo rados -p default.rgw.buckets.data ls`
[Regression Potential]
Backport has been accepted into the Luminous release stable branch upstream.
[Other Information]
This issue has been reported upstream [0] and was fixed in Nautilus alongside a number of other garbage collection issues/enhancements in pr#26601 [1]:
* adds additional logging to make future debugging easier.
* resolves bug where the truncated flag was not always set correctly in gc_iterate_entries
* resolves bug where marker in RGWGC::process was not advanced
* resolves bug in which gc entries with a zero-length chain were not trimmed
* resolves bug where same gc entry tag was added to list for deletion multiple times
These fixes were slated for back-port into Luminous and Mimic, but the Luminous work was not completed because of a required dependency: AIO GC [2]. This dependency has been resolved upstream, and is pending SRU verification in Ubuntu packages [3].
[0] https://tracker.ceph.com/issues/38454
[1] https://github.com/ceph/ceph/pull/26601
[2] https://tracker.ceph.com/issues/23223
[3] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1838858 |
[Impact]
Cancelling large S3/Swift object puts may result in garbage collection entries with zero-length chains. Rados gateway garbage collection does not efficiently process and clean up these zero-length chains.
A large number of zero-length chains will result in rgw processes quickly spinning through the garbage collection lists doing very little work. This can result in abnormally high cpu utilization and op workloads.
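One quick way to gauge whether a cluster is affected is to count gc entries whose chains are empty. The sketch below is illustrative only: it assumes jq is installed and that each entry in the `radosgw-admin gc list` JSON exposes its chain under an objs array, which can vary between releases.
```
# Count gc entries with zero-length chains. The "objs" field name is an
# assumption; check it against actual gc list output on your cluster.
sudo radosgw-admin gc list --include-all \
  | jq '[.[] | select((.objs // []) | length == 0)] | length'
```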
[Test Case]
Modify garbage collection parameters by editing ceph.conf on the target rgw:
```
rgw enable gc threads = false
rgw gc obj min wait = 60
rgw gc processor period = 60
```
Restart the ceph-radosgw service to apply the new configuration:
`sudo systemctl restart ceph-radosgw@rgw.$HOSTNAME`
Repeatedly interrupt 512MB object put requests for randomized object names:
```
for i in {0..1000}; do
f=$(mktemp); fallocate -l 512M $f
s3cmd put $f s3://test_bucket --disable-multipart &
pid=$!
sleep $((RANDOM % 7 + 3)); kill $pid
rm $f
done
```
Delete all objects in the bucket index:
```
for f in $(s3cmd ls s3://test_bucket | awk '{print $4}'); do
s3cmd del $f
done
```
By default, rgw_gc_max_objs splits the garbage collection list into 32 shards.
Capture omap detail and verify zero-length chains were left over:
```
export CEPH_ARGS="--id=rgw.$HOSTNAME"
for i in {0..31}; do
sudo -E rados -p default.rgw.log --namespace gc listomapvals gc.$i
done
```
Confirm the garbage collection list contains expired objects by listing expiration timestamps:
`sudo -E radosgw-admin gc list | grep time; date`
Raise the debug level and process the garbage collection list:
`sudo -E radosgw-admin --debug-rgw=20 --err-to-stderr gc process`
Use the logs to verify the garbage collection process iterates through all remaining omap entry tags. Then confirm all rados objects have been cleaned up:
`sudo -E rados -p default.rgw.buckets.data ls`
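As a final cross-check, the per-shard omap scan from earlier can be repeated; once the zero-length chains have been trimmed, each gc.$i object should report no remaining omap entries:
```
# Re-run the shard scan; every gc shard should now come back empty.
for i in {0..31}; do
  sudo -E rados -p default.rgw.log --namespace gc listomapvals gc.$i
done
```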
[Regression Potential]
Backport has been accepted into the Luminous release stable branch upstream.
[Other Information]
This issue has been reported upstream [0] and was fixed in Nautilus alongside a number of other garbage collection issues/enhancements in pr#26601 [1]:
* adds additional logging to make future debugging easier.
* resolves bug where the truncated flag was not always set correctly in gc_iterate_entries
* resolves bug where marker in RGWGC::process was not advanced
* resolves bug in which gc entries with a zero-length chain were not trimmed
* resolves bug where same gc entry tag was added to list for deletion multiple times
These fixes were slated for back-port into Luminous and Mimic, but the Luminous work was not completed because of a required dependency: AIO GC [2]. This dependency has been resolved upstream, and is pending SRU verification in Ubuntu packages [3].
[0] https://tracker.ceph.com/issues/38454
[1] https://github.com/ceph/ceph/pull/26601
[2] https://tracker.ceph.com/issues/23223
[3] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1838858 |
|
2020-01-14 09:16:46 |
Dan Hill |
tags |
sts-sru-needed verification-done-bionic verification-needed verification-queens-needed |
sts-sru-needed verification-done-bionic verification-needed verification-queens-done |
|
2020-01-15 18:08:25 |
Dan Hill |
tags |
sts-sru-needed verification-done-bionic verification-needed verification-queens-done |
sts-sru-needed verification-done verification-done-bionic verification-queens-done |
|
2020-01-16 11:31:36 |
Łukasz Zemczak |
tags |
sts-sru-needed verification-done verification-done-bionic verification-queens-done |
sts-sru-needed verification-needed verification-needed-bionic verification-queens-done |
|
2020-01-16 17:46:40 |
Dan Hill |
tags |
sts-sru-needed verification-needed verification-needed-bionic verification-queens-done |
sts-sru-needed verification-done verification-done-bionic verification-queens-done |
|
2020-01-20 16:51:44 |
Launchpad Janitor |
ceph (Ubuntu Bionic): status |
Fix Committed |
Fix Released |
|
2020-01-20 16:52:16 |
Łukasz Zemczak |
bug |
|
|
removed subscriber Ubuntu Stable Release Updates Team
2020-01-21 13:31:11 |
James Page |
cloud-archive/queens: status |
Fix Committed |
Fix Released |
|