2020-12-02 10:12:17 |
dongdong tao |
bug |
|
|
added bug |
2020-12-02 10:15:00 |
dongdong tao |
description |
Upstream implemented a new feature [1] that checks and reports long network ping times between OSDs, but it introduced an issue: ceph-mgr can become very slow because it needs to dump all of the new OSD network ping stats [2] for some tasks. This is especially bad when the cluster has a large number of OSDs.
Since these OSD network ping stats don't need to be exposed to the Python mgr modules, the dump only makes the mgr do extra work it doesn't need to. The fix is to disable the ping time dump for the mgr Python modules.
The major fix from upstream is here [3], and I also found an improvement commit [4] that was submitted later in another PR.
We need to backport them to Bionic Luminous and Mimic (Stein); Nautilus and Octopus already have the fix.
[1] https://github.com/ceph/ceph/pull/28755
[2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
[3] https://github.com/ceph/ceph/pull/32406
[4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f |
Upstream implemented a new feature [1] that checks and reports long network ping times between OSDs, but it introduced an issue: ceph-mgr can become very slow because it needs to dump all of the new OSD network ping stats [2] for some tasks. This is especially bad when the cluster has a large number of OSDs.
Since these OSD network ping stats don't need to be exposed to the Python mgr modules, the dump only makes the mgr do more work than it needs to; it can make the mgr slow or even hang, and can keep the CPU usage of the mgr process constantly high. The fix is to disable the ping time dump for the mgr Python modules.
The major fix from upstream is here [3], and I also found an improvement commit [4] that was submitted later in another PR.
We need to backport them to Bionic Luminous and Mimic (Stein); Nautilus and Octopus already have the fix.
[1] https://github.com/ceph/ceph/pull/28755
[2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
[3] https://github.com/ceph/ceph/pull/32406
[4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f |
|
2020-12-02 10:18:29 |
dongdong tao |
summary |
mgr can be very slow within a large ceph cluster |
mgr can be very slow in a large ceph cluster |
|
2020-12-02 19:42:09 |
Ponnuvel Palaniyappan |
ceph (Ubuntu): assignee |
|
Ponnuvel Palaniyappan (pponnuvel) |
|
2020-12-03 09:05:00 |
Ponnuvel Palaniyappan |
tags |
|
sts |
|
2020-12-05 22:08:15 |
Ponnuvel Palaniyappan |
ceph (Ubuntu): status |
New |
Incomplete |
|
2020-12-05 22:08:20 |
Ponnuvel Palaniyappan |
ceph (Ubuntu): status |
Incomplete |
New |
|
2020-12-05 22:08:27 |
Ponnuvel Palaniyappan |
ceph (Ubuntu): status |
New |
Confirmed |
|
2020-12-07 20:55:45 |
Ponnuvel Palaniyappan |
attachment added |
|
Bionic-Ceph-12.2.13-debdiff https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1906496/+attachment/5441739/+files/debdiff |
|
2020-12-07 20:56:55 |
Ponnuvel Palaniyappan |
attachment added |
|
bug1906496.patch-bionic-12.2.13 https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1906496/+attachment/5441740/+files/bug1906496.patch-bionic-12.2.13 |
|
2020-12-09 07:08:02 |
Ponnuvel Palaniyappan |
ceph (Ubuntu): status |
Confirmed |
Won't Fix |
|
2020-12-09 07:08:05 |
Ponnuvel Palaniyappan |
ceph (Ubuntu): status |
Won't Fix |
In Progress |
|
2020-12-09 19:22:32 |
Ponnuvel Palaniyappan |
attachment added |
|
debdiff-ceph-13.2.9 https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1906496/+attachment/5442308/+files/debdiff-ceph-13.2.9 |
|
2020-12-09 19:22:53 |
Ponnuvel Palaniyappan |
attachment added |
|
bug1906496.patch-13.2.9 https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1906496/+attachment/5442309/+files/bug1906496.patch-13.2.9 |
|
2020-12-10 15:32:08 |
Ponnuvel Palaniyappan |
bug |
|
|
added subscriber Ubuntu Sponsors Team |
2020-12-10 18:44:24 |
Ponnuvel Palaniyappan |
summary |
mgr can be very slow in a large ceph cluster |
[SRU] mgr can be very slow in a large ceph cluster |
|
2020-12-10 18:45:10 |
Ponnuvel Palaniyappan |
description |
Upstream implemented a new feature [1] that checks and reports long network ping times between OSDs, but it introduced an issue: ceph-mgr can become very slow because it needs to dump all of the new OSD network ping stats [2] for some tasks. This is especially bad when the cluster has a large number of OSDs.
Since these OSD network ping stats don't need to be exposed to the Python mgr modules, the dump only makes the mgr do more work than it needs to; it can make the mgr slow or even hang, and can keep the CPU usage of the mgr process constantly high. The fix is to disable the ping time dump for the mgr Python modules.
The major fix from upstream is here [3], and I also found an improvement commit [4] that was submitted later in another PR.
We need to backport them to Bionic Luminous and Mimic (Stein); Nautilus and Octopus already have the fix.
[1] https://github.com/ceph/ceph/pull/28755
[2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
[3] https://github.com/ceph/ceph/pull/32406
[4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f |
[Impact]
Ceph upstream implemented a new feature [1] that checks and reports long network ping times between OSDs, but it introduced an issue: ceph-mgr can become very slow because it needs to dump all of the new OSD network ping stats [2] for some tasks. This is especially bad when the cluster has a large number of OSDs.
Since these OSD network ping stats don't need to be exposed to the Python mgr modules, the dump only makes the mgr do more work than it needs to; it can make the mgr slow or even hang, and can keep the CPU usage of the mgr process constantly high. The fix is to disable the ping time dump for the mgr Python modules.
This resulted in ceph-mgr not responding to commands and/or hanging (and having to be restarted) in clusters with a large number of OSDs.
[0] is the upstream bug. The fix was backported to Nautilus but rejected for Luminous and Mimic because they have reached EOL upstream, so I want to backport it to these two releases in Ubuntu/UCA.
The major fix from upstream is here [3], and I also found an improvement commit [4] that was submitted later in another PR.
[Test Case]
Deploy a Ceph cluster (Luminous 12.2.13 or Mimic 13.2.9) with a large number of Ceph OSDs (600+). During normal operation of the cluster, as the ceph-mgr dumps the network ping stats regularly, the problem will manifest. This is relatively hard to reproduce, as the ceph-mgr may not always get overloaded and thus may not hang.
[Regression Potential]
The fix has been accepted upstream (the changes here are in sync with upstream, to the extent these old releases match the latest source code) and has been confirmed to work, so the risk is minimal.
At worst, this could affect modules that consume the stats from ceph-mgr (such as prometheus or other monitoring scripts/tools), making them less useful. But it still shouldn't cause any problems for the operation of the cluster itself.
[Other Info]
- In addition to the main fix [3], another commit [4] is also cherry-picked and backported here; this was also accepted upstream.
- Since the ceph-mgr hangs when affected, this also impacts sosreport collection: commands time out because the mgr doesn't respond, so information gets truncated or not collected. This fix should help avoid that problem in sosreports.
[0] https://tracker.ceph.com/issues/43364
[1] https://github.com/ceph/ceph/pull/28755
[2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
[3] https://github.com/ceph/ceph/pull/32406
[4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f |
|
2020-12-11 05:12:01 |
Mathew Hodson |
nominated for series |
|
Ubuntu Bionic |
|
2020-12-11 05:12:01 |
Mathew Hodson |
bug task added |
|
ceph (Ubuntu Bionic) |
|
2020-12-11 05:12:08 |
Mathew Hodson |
ceph (Ubuntu): importance |
Undecided |
High |
|
2020-12-11 05:12:12 |
Mathew Hodson |
ceph (Ubuntu Bionic): importance |
Undecided |
Medium |
|
2020-12-11 05:14:07 |
Mathew Hodson |
bug task added |
|
cloud-archive |
|
2020-12-11 05:14:34 |
Mathew Hodson |
ceph (Ubuntu): status |
In Progress |
Fix Released |
|
2020-12-11 11:32:38 |
Dr. Jens Harbott |
bug |
|
|
added subscriber Dr. Jens Harbott |
2020-12-11 17:56:51 |
Corey Bryant |
nominated for series |
|
Ubuntu Groovy |
|
2020-12-11 17:56:51 |
Corey Bryant |
bug task added |
|
ceph (Ubuntu Groovy) |
|
2020-12-11 17:56:51 |
Corey Bryant |
nominated for series |
|
Ubuntu Hirsute |
|
2020-12-11 17:56:51 |
Corey Bryant |
bug task added |
|
ceph (Ubuntu Hirsute) |
|
2020-12-11 17:56:51 |
Corey Bryant |
nominated for series |
|
Ubuntu Focal |
|
2020-12-11 17:56:51 |
Corey Bryant |
bug task added |
|
ceph (Ubuntu Focal) |
|
2020-12-11 17:57:00 |
Corey Bryant |
ceph (Ubuntu Groovy): status |
New |
Fix Released |
|
2020-12-11 17:57:03 |
Corey Bryant |
ceph (Ubuntu Focal): status |
New |
Fix Released |
|
2020-12-11 17:57:31 |
Corey Bryant |
nominated for series |
|
cloud-archive/stein |
|
2020-12-11 17:57:31 |
Corey Bryant |
bug task added |
|
cloud-archive/stein |
|
2020-12-11 17:57:31 |
Corey Bryant |
nominated for series |
|
cloud-archive/victoria |
|
2020-12-11 17:57:31 |
Corey Bryant |
bug task added |
|
cloud-archive/victoria |
|
2020-12-11 17:57:31 |
Corey Bryant |
nominated for series |
|
cloud-archive/ussuri |
|
2020-12-11 17:57:31 |
Corey Bryant |
bug task added |
|
cloud-archive/ussuri |
|
2020-12-11 17:57:31 |
Corey Bryant |
nominated for series |
|
cloud-archive/queens |
|
2020-12-11 17:57:31 |
Corey Bryant |
bug task added |
|
cloud-archive/queens |
|
2020-12-11 17:57:31 |
Corey Bryant |
nominated for series |
|
cloud-archive/train |
|
2020-12-11 17:57:31 |
Corey Bryant |
bug task added |
|
cloud-archive/train |
|
2020-12-11 17:57:59 |
Corey Bryant |
bug task deleted |
cloud-archive/victoria |
|
|
2020-12-11 17:58:08 |
Corey Bryant |
cloud-archive/ussuri: status |
New |
Fix Released |
|
2020-12-11 17:58:19 |
Corey Bryant |
cloud-archive/train: status |
New |
Fix Released |
|
2020-12-11 17:58:30 |
Corey Bryant |
cloud-archive/stein: importance |
Undecided |
High |
|
2020-12-11 17:58:30 |
Corey Bryant |
cloud-archive/stein: status |
New |
Triaged |
|
2020-12-11 17:58:41 |
Corey Bryant |
cloud-archive/queens: importance |
Undecided |
High |
|
2020-12-11 17:58:41 |
Corey Bryant |
cloud-archive/queens: status |
New |
Triaged |
|
2020-12-11 17:58:52 |
Corey Bryant |
cloud-archive: status |
New |
Fix Released |
|
2020-12-11 17:59:05 |
Corey Bryant |
ceph (Ubuntu Bionic): importance |
Medium |
High |
|
2020-12-11 17:59:05 |
Corey Bryant |
ceph (Ubuntu Bionic): status |
New |
Triaged |
|
2020-12-11 21:08:50 |
Corey Bryant |
bug |
|
|
added subscriber Ubuntu Stable Release Updates Team |
2020-12-13 20:21:39 |
Ponnuvel Palaniyappan |
description |
[Impact]
Ceph upstream implemented a new feature [1] that checks and reports long network ping times between OSDs, but it introduced an issue: ceph-mgr can become very slow because it needs to dump all of the new OSD network ping stats [2] for some tasks. This is especially bad when the cluster has a large number of OSDs.
Since these OSD network ping stats don't need to be exposed to the Python mgr modules, the dump only makes the mgr do more work than it needs to; it can make the mgr slow or even hang, and can keep the CPU usage of the mgr process constantly high. The fix is to disable the ping time dump for the mgr Python modules.
This resulted in ceph-mgr not responding to commands and/or hanging (and having to be restarted) in clusters with a large number of OSDs.
[0] is the upstream bug. The fix was backported to Nautilus but rejected for Luminous and Mimic because they have reached EOL upstream, so I want to backport it to these two releases in Ubuntu/UCA.
The major fix from upstream is here [3], and I also found an improvement commit [4] that was submitted later in another PR.
[Test Case]
Deploy a Ceph cluster (Luminous 12.2.13 or Mimic 13.2.9) with a large number of Ceph OSDs (600+). During normal operation of the cluster, as the ceph-mgr dumps the network ping stats regularly, the problem will manifest. This is relatively hard to reproduce, as the ceph-mgr may not always get overloaded and thus may not hang.
[Regression Potential]
The fix has been accepted upstream (the changes here are in sync with upstream, to the extent these old releases match the latest source code) and has been confirmed to work, so the risk is minimal.
At worst, this could affect modules that consume the stats from ceph-mgr (such as prometheus or other monitoring scripts/tools), making them less useful. But it still shouldn't cause any problems for the operation of the cluster itself.
[Other Info]
- In addition to the main fix [3], another commit [4] is also cherry-picked and backported here; this was also accepted upstream.
- Since the ceph-mgr hangs when affected, this also impacts sosreport collection: commands time out because the mgr doesn't respond, so information gets truncated or not collected. This fix should help avoid that problem in sosreports.
[0] https://tracker.ceph.com/issues/43364
[1] https://github.com/ceph/ceph/pull/28755
[2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
[3] https://github.com/ceph/ceph/pull/32406
[4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f |
[Impact]
Ceph upstream implemented a new feature [1] that checks and reports long network ping times between OSDs, but it introduced an issue: ceph-mgr can become very slow because it needs to dump all of the new OSD network ping stats [2] for some tasks. This is especially bad when the cluster has a large number of OSDs.
Since these OSD network ping stats don't need to be exposed to the Python mgr modules, the dump only makes the mgr do more work than it needs to; it can make the mgr slow or even hang, and can keep the CPU usage of the mgr process constantly high. The fix is to disable the ping time dump for the mgr Python modules.
This resulted in ceph-mgr not responding to commands and/or hanging (and having to be restarted) in clusters with a large number of OSDs.
[0] is the upstream bug. The fix was backported to Nautilus but rejected for Luminous and Mimic because they have reached EOL upstream, so I want to backport it to these two releases in Ubuntu/UCA.
The major fix from upstream is here [3], and I also found an improvement commit [4] that was submitted later in another PR.
[Test Case]
Deploy a Ceph cluster (Luminous 12.2.13 or Mimic 13.2.9) with a large number of Ceph OSDs (600+). During normal operation of the cluster, as the ceph-mgr dumps the network ping stats regularly, the problem will manifest. This is relatively hard to reproduce, as the ceph-mgr may not always get overloaded and thus may not hang.
A simpler version could be to deploy a Ceph cluster with as many OSDs as the hardware/system setup allows and drive I/O on the cluster for some time. Then send various queries to the manager to verify that it responds and doesn't get stuck.
[Regression Potential]
The fix has been accepted upstream (the changes here are in sync with upstream, to the extent these old releases match the latest source code) and has been confirmed to work, so the risk is minimal.
At worst, this could affect modules that consume the stats from ceph-mgr (such as prometheus or other monitoring scripts/tools), making them less useful. But it still shouldn't cause any problems for the operation of the cluster itself.
[Other Info]
- In addition to the main fix [3], another commit [4] is also cherry-picked and backported here; this was also accepted upstream.
- Since the ceph-mgr hangs when affected, this also impacts sosreport collection: commands time out because the mgr doesn't respond, so information gets truncated or not collected. This fix should help avoid that problem in sosreports.
[0] https://tracker.ceph.com/issues/43364
[1] https://github.com/ceph/ceph/pull/28755
[2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
[3] https://github.com/ceph/ceph/pull/32406
[4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f |
|
2020-12-13 20:23:53 |
Ponnuvel Palaniyappan |
description |
[Impact]
Ceph upstream implemented a new feature [1] that checks and reports long network ping times between OSDs, but it introduced an issue: ceph-mgr can become very slow because it needs to dump all of the new OSD network ping stats [2] for some tasks. This is especially bad when the cluster has a large number of OSDs.
Since these OSD network ping stats don't need to be exposed to the Python mgr modules, the dump only makes the mgr do more work than it needs to; it can make the mgr slow or even hang, and can keep the CPU usage of the mgr process constantly high. The fix is to disable the ping time dump for the mgr Python modules.
This resulted in ceph-mgr not responding to commands and/or hanging (and having to be restarted) in clusters with a large number of OSDs.
[0] is the upstream bug. The fix was backported to Nautilus but rejected for Luminous and Mimic because they have reached EOL upstream, so I want to backport it to these two releases in Ubuntu/UCA.
The major fix from upstream is here [3], and I also found an improvement commit [4] that was submitted later in another PR.
[Test Case]
Deploy a Ceph cluster (Luminous 12.2.13 or Mimic 13.2.9) with a large number of Ceph OSDs (600+). During normal operation of the cluster, as the ceph-mgr dumps the network ping stats regularly, the problem will manifest. This is relatively hard to reproduce, as the ceph-mgr may not always get overloaded and thus may not hang.
A simpler version could be to deploy a Ceph cluster with as many OSDs as the hardware/system setup allows and drive I/O on the cluster for some time. Then send various queries to the manager to verify that it responds and doesn't get stuck.
[Regression Potential]
The fix has been accepted upstream (the changes here are in sync with upstream, to the extent these old releases match the latest source code) and has been confirmed to work, so the risk is minimal.
At worst, this could affect modules that consume the stats from ceph-mgr (such as prometheus or other monitoring scripts/tools), making them less useful. But it still shouldn't cause any problems for the operation of the cluster itself.
[Other Info]
- In addition to the main fix [3], another commit [4] is also cherry-picked and backported here; this was also accepted upstream.
- Since the ceph-mgr hangs when affected, this also impacts sosreport collection: commands time out because the mgr doesn't respond, so information gets truncated or not collected. This fix should help avoid that problem in sosreports.
[0] https://tracker.ceph.com/issues/43364
[1] https://github.com/ceph/ceph/pull/28755
[2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
[3] https://github.com/ceph/ceph/pull/32406
[4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f |
[Impact]
Ceph upstream implemented a new feature [1] that checks and reports long network ping times between OSDs, but it introduced an issue: ceph-mgr can become very slow because it needs to dump all of the new OSD network ping stats [2] for some tasks. This is especially bad when the cluster has a large number of OSDs.
Since these OSD network ping stats don't need to be exposed to the Python mgr modules, the dump only makes the mgr do more work than it needs to; it can make the mgr slow or even hang, and can keep the CPU usage of the mgr process constantly high. The fix is to disable the ping time dump for the mgr Python modules.
This resulted in ceph-mgr not responding to commands and/or hanging (and having to be restarted) in clusters with a large number of OSDs.
[0] is the upstream bug. The fix was backported to Nautilus but rejected for Luminous and Mimic because they have reached EOL upstream, so I want to backport it to these two releases in Ubuntu/UCA.
The major fix from upstream is here [3], and I also found an improvement commit [4] that was submitted later in another PR.
[Test Case]
Deploy a Ceph cluster (Luminous 12.2.13 or Mimic 13.2.9) with a large number of Ceph OSDs (600+). During normal operation of the cluster, as the ceph-mgr dumps the network ping stats regularly, the problem will manifest. This is relatively hard to reproduce, as the ceph-mgr may not always get overloaded and thus may not hang.
A simpler version could be to deploy a Ceph cluster with as many OSDs as the hardware/system setup allows (not necessarily 600+) and drive I/O on the cluster for some time. Then send various queries to the manager to verify that it responds and doesn't get stuck.
[Regression Potential]
The fix has been accepted upstream (the changes here are in sync with upstream, to the extent these old releases match the latest source code) and has been confirmed to work, so the risk is minimal.
At worst, this could affect modules that consume the stats from ceph-mgr (such as prometheus or other monitoring scripts/tools), making them less useful. But it still shouldn't cause any problems for the operation of the cluster itself.
[Other Info]
- In addition to the main fix [3], another commit [4] is also cherry-picked and backported here; this was also accepted upstream.
- Since the ceph-mgr hangs when affected, this also impacts sosreport collection: commands time out because the mgr doesn't respond, so information gets truncated or not collected. This fix should help avoid that problem in sosreports.
[0] https://tracker.ceph.com/issues/43364
[1] https://github.com/ceph/ceph/pull/28755
[2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
[3] https://github.com/ceph/ceph/pull/32406
[4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f |
|
2020-12-14 09:39:52 |
Ponnuvel Palaniyappan |
description |
[Impact]
Ceph upstream implemented a new feature [1] that checks and reports long network ping times between OSDs, but it introduced an issue: ceph-mgr can become very slow because it needs to dump all of the new OSD network ping stats [2] for some tasks. This is especially bad when the cluster has a large number of OSDs.
Since these OSD network ping stats don't need to be exposed to the Python mgr modules, the dump only makes the mgr do more work than it needs to; it can make the mgr slow or even hang, and can keep the CPU usage of the mgr process constantly high. The fix is to disable the ping time dump for the mgr Python modules.
This resulted in ceph-mgr not responding to commands and/or hanging (and having to be restarted) in clusters with a large number of OSDs.
[0] is the upstream bug. The fix was backported to Nautilus but rejected for Luminous and Mimic because they have reached EOL upstream, so I want to backport it to these two releases in Ubuntu/UCA.
The major fix from upstream is here [3], and I also found an improvement commit [4] that was submitted later in another PR.
[Test Case]
Deploy a Ceph cluster (Luminous 12.2.13 or Mimic 13.2.9) with a large number of Ceph OSDs (600+). During normal operation of the cluster, as the ceph-mgr dumps the network ping stats regularly, the problem will manifest. This is relatively hard to reproduce, as the ceph-mgr may not always get overloaded and thus may not hang.
A simpler version could be to deploy a Ceph cluster with as many OSDs as the hardware/system setup allows (not necessarily 600+) and drive I/O on the cluster for some time. Then send various queries to the manager to verify that it responds and doesn't get stuck.
[Regression Potential]
The fix has been accepted upstream (the changes here are in sync with upstream, to the extent these old releases match the latest source code) and has been confirmed to work, so the risk is minimal.
At worst, this could affect modules that consume the stats from ceph-mgr (such as prometheus or other monitoring scripts/tools), making them less useful. But it still shouldn't cause any problems for the operation of the cluster itself.
[Other Info]
- In addition to the main fix [3], another commit [4] is also cherry-picked and backported here; this was also accepted upstream.
- Since the ceph-mgr hangs when affected, this also impacts sosreport collection: commands time out because the mgr doesn't respond, so information gets truncated or not collected. This fix should help avoid that problem in sosreports.
[0] https://tracker.ceph.com/issues/43364
[1] https://github.com/ceph/ceph/pull/28755
[2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
[3] https://github.com/ceph/ceph/pull/32406
[4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f |
[Impact]
Ceph upstream implemented a new feature [1] that checks and reports long network ping times between OSDs, but it introduced an issue: ceph-mgr can become very slow because it needs to dump all of the new OSD network ping stats [2] for some tasks. This is especially bad when the cluster has a large number of OSDs.
Since these OSD network ping stats don't need to be exposed to the Python mgr modules, the dump only makes the mgr do more work than it needs to; it can make the mgr slow or even hang, and can keep the CPU usage of the mgr process constantly high. The fix is to disable the ping time dump for the mgr Python modules.
This resulted in ceph-mgr not responding to commands and/or hanging (and having to be restarted) in clusters with a large number of OSDs.
[0] is the upstream bug. The fix was backported to Nautilus but rejected for Luminous and Mimic because they have reached EOL upstream, so I want to backport it to these two releases in Ubuntu/UCA.
The major fix from upstream is here [3], and I also found an improvement commit [4] that was submitted later in another PR.
[Test Case]
Deploy a Ceph cluster (Luminous 12.2.13 or Mimic 13.2.9) with a large number of Ceph OSDs (600+). During normal operation of the cluster, as the ceph-mgr dumps the network ping stats regularly, the problem will manifest. This is relatively hard to reproduce, as the ceph-mgr may not always get overloaded and thus may not hang.
A simpler version could be to deploy a Ceph cluster with as many OSDs as the hardware/system setup allows (not necessarily 600+) and drive I/O on the cluster for some time (say, 60 minutes). Then send various queries to the manager to verify that it responds and doesn't get stuck.
[Regression Potential]
The fix has been accepted upstream (the changes here are in sync with upstream, to the extent these old releases match the latest source code) and has been confirmed to work, so the risk is minimal.
At worst, this could affect modules that consume the stats from ceph-mgr (such as prometheus or other monitoring scripts/tools), making them less useful. But it still shouldn't cause any problems for the operation of the cluster itself.
[Other Info]
- In addition to the main fix [3], another commit [4] is also cherry-picked and backported here; this was also accepted upstream.
- Since the ceph-mgr hangs when affected, this also impacts sosreport collection: commands time out because the mgr doesn't respond, so information gets truncated or not collected. This fix should help avoid that problem in sosreports.
[0] https://tracker.ceph.com/issues/43364
[1] https://github.com/ceph/ceph/pull/28755
[2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
[3] https://github.com/ceph/ceph/pull/32406
[4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f |
|
2021-01-04 15:30:38 |
Corey Bryant |
cloud-archive/stein: status |
Triaged |
Fix Committed |
|
2021-01-04 15:30:40 |
Corey Bryant |
tags |
sts |
sts verification-stein-needed |
|
2021-01-06 10:31:54 |
Robie Basak |
ceph (Ubuntu Bionic): status |
Triaged |
Fix Committed |
|
2021-01-06 10:31:56 |
Robie Basak |
bug |
|
|
added subscriber SRU Verification |
2021-01-06 10:31:58 |
Robie Basak |
tags |
sts verification-stein-needed |
sts verification-needed verification-needed-bionic verification-stein-needed |
|
2021-01-07 09:40:05 |
Ponnuvel Palaniyappan |
attachment added |
|
stein.sru https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1906496/+attachment/5450096/+files/stein.sru |
|
2021-01-07 09:40:32 |
Ponnuvel Palaniyappan |
attachment added |
|
bionic.sru https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1906496/+attachment/5450097/+files/bionic.sru |
|
2021-01-07 09:40:58 |
Ponnuvel Palaniyappan |
tags |
sts verification-needed verification-needed-bionic verification-stein-needed |
sts verification-needed verification-needed-done verification-stein-done |
|
2021-01-07 13:18:02 |
Corey Bryant |
cloud-archive/queens: status |
Triaged |
Fix Committed |
|
2021-01-07 13:18:04 |
Corey Bryant |
tags |
sts verification-needed verification-needed-done verification-stein-done |
sts verification-needed verification-needed-done verification-queens-needed verification-stein-done |
|
2021-01-07 13:58:08 |
Ponnuvel Palaniyappan |
ceph (Ubuntu Bionic): assignee |
|
Ponnuvel Palaniyappan (pponnuvel) |
|
2021-01-07 13:58:18 |
Ponnuvel Palaniyappan |
cloud-archive/stein: assignee |
|
Ponnuvel Palaniyappan (pponnuvel) |
|
2021-01-07 13:58:27 |
Ponnuvel Palaniyappan |
cloud-archive/queens: assignee |
|
Ponnuvel Palaniyappan (pponnuvel) |
|
2021-01-07 15:16:56 |
Ponnuvel Palaniyappan |
description |
[Impact]
Ceph upstream implemented a new feature [1] that checks and reports long network ping times between OSDs, but it introduced an issue: ceph-mgr can become very slow because it needs to dump all of the new OSD network ping stats [2] for some tasks. This is especially bad when the cluster has a large number of OSDs.
Since these OSD network ping stats don't need to be exposed to the Python mgr modules, the dump only makes the mgr do more work than it needs to; it can make the mgr slow or even hang, and can keep the CPU usage of the mgr process constantly high. The fix is to disable the ping time dump for the mgr Python modules.
This resulted in ceph-mgr not responding to commands and/or hanging (and having to be restarted) in clusters with a large number of OSDs.
[0] is the upstream bug. The fix was backported to Nautilus but rejected for Luminous and Mimic because they have reached EOL upstream, so I want to backport it to these two releases in Ubuntu/UCA.
The major fix from upstream is here [3], and I also found an improvement commit [4] that was submitted later in another PR.
[Test Case]
Deploy a Ceph cluster (Luminous 12.2.13 or Mimic 13.2.9) with a large number of Ceph OSDs (600+). During normal operation of the cluster, as the ceph-mgr dumps the network ping stats regularly, the problem will manifest. This is relatively hard to reproduce, as the ceph-mgr may not always get overloaded and thus may not hang.
A simpler version could be to deploy a Ceph cluster with as many OSDs as the hardware/system setup allows (not necessarily 600+) and drive I/O on the cluster for some time (say, 60 minutes). Then send various queries to the manager to verify that it responds and doesn't get stuck.
[Regression Potential]
The fix has been accepted upstream (the changes here are in sync with upstream, to the extent these old releases match the latest source code) and has been confirmed to work, so the risk is minimal.
At worst, this could affect modules that consume the stats from ceph-mgr (such as prometheus or other monitoring scripts/tools), making them less useful. But it still shouldn't cause any problems for the operation of the cluster itself.
[Other Info]
- In addition to the main fix [3], another commit [4] is also cherry-picked and backported here; this was also accepted upstream.
- Since the ceph-mgr hangs when affected, this also impacts sosreport collection: commands time out because the mgr doesn't respond, so information gets truncated or not collected. This fix should help avoid that problem in sosreports.
[0] https://tracker.ceph.com/issues/43364
[1] https://github.com/ceph/ceph/pull/28755
[2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
[3] https://github.com/ceph/ceph/pull/32406
[4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f |
[Impact]
Ceph upstream implemented a new feature [1] that checks and reports long network ping times between OSDs, but it introduced an issue: ceph-mgr can become very slow because it needs to dump all of the new OSD network ping stats [2] for some tasks. This is especially bad when the cluster has a large number of OSDs.
Since these OSD network ping stats don't need to be exposed to the Python mgr modules, the dump only makes the mgr do more work than it needs to; it can make the mgr slow or even hang, and can keep the CPU usage of the mgr process constantly high. The fix is to disable the ping time dump for the mgr Python modules.
This resulted in ceph-mgr not responding to commands and/or hanging (and having to be restarted) in clusters with many OSDs.
[0] is the upstream bug. The fix was backported to Nautilus but rejected for Luminous and Mimic because they have reached EOL upstream, so I want to backport it to these two releases in Ubuntu/UCA.
The major fix from upstream is here [3], and I also found an improvement commit [4] that was submitted later in another PR.
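For illustration only, here is a hypothetical sketch of the idea behind the fix; it is not the actual code change from [3]. The expensive per-OSD ping-time section of the stats dump is gated behind a flag and skipped for consumers (the mgr Python modules) that never read it:

    # Hypothetical sketch of the idea only -- the real change is the C++ patch in [3].
    # The function name, flag name, and dict keys below are illustrative assumptions.
    def dump_osd_stats(osds, include_ping_times=False):
        out = []
        for osd in osds:
            entry = {"osd": osd["id"], "kb_used": osd["kb_used"]}
            if include_ping_times:
                # This section grows with the number of peer OSDs and is what
                # made ceph-mgr slow on large clusters; the mgr Python modules
                # never consume it, so it can be skipped for them.
                entry["network_ping_times"] = osd["ping_times"]
            out.append(entry)
        return out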
[Test Case]
Deploy a Ceph cluster (Luminous 12.2.13 or Mimic 13.2.9) with a large number of Ceph OSDs (600+). During normal operation of the cluster, as the ceph-mgr dumps the network ping stats regularly, the problem will manifest. This is relatively hard to reproduce, as the ceph-mgr may not always get overloaded and thus may not hang.
A simpler version could be to deploy a Ceph cluster with as many OSDs as the hardware/system setup allows (not necessarily 600+) and drive I/O on the cluster for some time (say, 60 minutes). Then send various queries to the manager to verify that it responds and doesn't get stuck.
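As a rough aid for that verification step (an illustration only, not part of the SRU itself; it assumes a test cluster where the "ceph" CLI works with the default cluster and admin keyring), the following sketch times a few mgr-backed queries. On an affected build with many OSDs they can stall or time out, while with the fix applied they should return promptly:

    # Rough verification sketch: time a few queries that are served via the mgr.
    # Assumes the "ceph" CLI is available with default cluster/keyring settings.
    import subprocess
    import time

    QUERIES = [
        ["ceph", "-s"],
        ["ceph", "osd", "perf"],
        ["ceph", "pg", "dump", "summary"],
    ]

    def time_query(cmd, timeout=60):
        start = time.monotonic()
        try:
            subprocess.run(cmd, capture_output=True, timeout=timeout)
            return time.monotonic() - start
        except subprocess.TimeoutExpired:
            return None  # the mgr did not respond within the timeout

    for cmd in QUERIES:
        elapsed = time_query(cmd)
        print(" ".join(cmd), "->", "no response" if elapsed is None else "%.1fs" % elapsed)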
[Regression Potential]
The fix has been accepted upstream (the changes here are in sync with upstream, to the extent these old releases match the latest source code) and has been confirmed to work, so the risk is minimal.
At worst, this could affect modules that consume the stats from ceph-mgr (such as prometheus or other monitoring scripts/tools), making them less useful. But it still shouldn't cause any problems for the operation of the cluster itself.
[Other Info]
- In addition to the main fix [3], another commit [4] is also cherry-picked and backported here; this was also accepted upstream.
- Since the ceph-mgr hangs when affected, this also impacts sosreport collection: commands time out because the mgr doesn't respond, so information gets truncated or not collected. This fix should help avoid that problem in sosreports.
[0] https://tracker.ceph.com/issues/43364
[1] https://github.com/ceph/ceph/pull/28755
[2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
[3] https://github.com/ceph/ceph/pull/32406
[4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f |
|
2021-01-08 11:20:34 |
Ponnuvel Palaniyappan |
attachment added |
|
queens.sru https://bugs.launchpad.net/ubuntu/bionic/+source/ceph/+bug/1906496/+attachment/5450803/+files/queens.sru |
|
2021-01-08 11:21:04 |
Ponnuvel Palaniyappan |
tags |
sts verification-needed verification-needed-done verification-queens-needed verification-stein-done |
sts verification-done verification-needed-done verification-queens-done verification-stein-done |
|
2021-01-16 23:18:15 |
Mathew Hodson |
tags |
sts verification-done verification-needed-done verification-queens-done verification-stein-done |
sts verification-bionic-done verification-done verification-queens-done verification-stein-done |
|
2021-01-16 23:19:17 |
Mathew Hodson |
tags |
sts verification-bionic-done verification-done verification-queens-done verification-stein-done |
sts verification-done verification-done-bionic verification-queens-done verification-stein-done |
|
2021-01-18 10:43:47 |
Łukasz Zemczak |
removed subscriber Ubuntu Stable Release Updates Team |
|
|
|
2021-01-18 10:53:50 |
Launchpad Janitor |
ceph (Ubuntu Bionic): status |
Fix Committed |
Fix Released |
|
2021-01-19 14:49:52 |
Corey Bryant |
cloud-archive/stein: status |
Fix Committed |
Fix Released |
|
2021-01-19 14:49:56 |
Corey Bryant |
cloud-archive/queens: status |
Fix Committed |
Fix Released |
|