Activity log for bug #1914911

Date Who What changed Old value New value Message
2021-02-07 08:48:48 dongdong tao bug added bug
2021-02-07 16:04:53 Ponnuvel Palaniyappan bug added subscriber Ponnuvel Palaniyappan
2021-02-07 21:38:20 Dominique Poulain bug added subscriber Dominique Poulain
2021-02-08 10:34:42 Ponnuvel Palaniyappan ceph (Ubuntu): assignee Ponnuvel Palaniyappan (pponnuvel)
2021-02-10 19:30:48 Ponnuvel Palaniyappan nominated for series Ubuntu Bionic
2021-02-10 19:30:48 Ponnuvel Palaniyappan bug task added ceph (Ubuntu Bionic)
2021-02-10 19:38:55 Ponnuvel Palaniyappan bug task added cloud-archive
2021-02-10 19:39:17 Ponnuvel Palaniyappan nominated for series cloud-archive/queens
2021-02-10 19:39:17 Ponnuvel Palaniyappan bug task added cloud-archive/queens
2021-02-10 22:14:16 Ponnuvel Palaniyappan ceph (Ubuntu Bionic): status New In Progress
2021-02-10 23:09:15 Ponnuvel Palaniyappan ceph (Ubuntu Bionic): assignee Ponnuvel Palaniyappan (pponnuvel)
2021-02-11 08:46:38 Ponnuvel Palaniyappan attachment added bionic.debdiff https://bugs.launchpad.net/cloud-archive/+bug/1914911/+attachment/5462589/+files/bionic.debdiff
2021-02-11 08:47:55 Ponnuvel Palaniyappan tags sts-sru-needed
2021-02-11 09:15:17 Ponnuvel Palaniyappan summary bluefs doesn't compact log file [SRU] bluefs doesn't compact log file
2021-02-11 15:12:41 Ponnuvel Palaniyappan description For a certain type of workload, bluefs might never compact its log file, which causes the bluefs log file to slowly grow to a huge size (sometimes bigger than 1 TB on a 1.5 TB device). This bug can eventually cause an OSD to crash and fail to restart, as it cannot get through the bluefs replay phase during boot. We might see the following log when trying to restart the OSD: bluefs mount failed to replay log: (5) Input/output error There are more details in the bluefs perf counters when this issue happens, e.g. "bluefs": { "gift_bytes": 811748818944, "reclaim_bytes": 0, "db_total_bytes": 888564350976, "db_used_bytes": 867311747072, "wal_total_bytes": 0, "wal_used_bytes": 0, "slow_total_bytes": 0, "slow_used_bytes": 0, "num_files": 11, "log_bytes": 866545131520, "log_compactions": 0, "logged_bytes": 866542977024, "files_written_wal": 2, "files_written_sst": 3, "bytes_written_wal": 32424281934, "bytes_written_sst": 25382201 } As we can see, log_compactions is 0, which means the log has never been compacted, and the log file size (log_bytes) is already 800+ GB. After compaction, the log file size would be reduced to around 1 GB. Here is the PR [1] that addressed this bug; we need to backport it to Ubuntu's 12.2.13. [1] https://github.com/ceph/ceph/pull/17354 [Impact] For a certain type of workload, bluefs might never compact its log file, which causes the bluefs log file to slowly grow to a huge size (sometimes bigger than 1 TB on a 1.5 TB device). There are more details in the bluefs perf counters when this issue happens, e.g. "bluefs": { "gift_bytes": 811748818944, "reclaim_bytes": 0, "db_total_bytes": 888564350976, "db_used_bytes": 867311747072, "wal_total_bytes": 0, "wal_used_bytes": 0, "slow_total_bytes": 0, "slow_used_bytes": 0, "num_files": 11, "log_bytes": 866545131520, "log_compactions": 0, "logged_bytes": 866542977024, "files_written_wal": 2, "files_written_sst": 3, "bytes_written_wal": 32424281934, "bytes_written_sst": 25382201 } This bug can eventually cause an OSD to crash and fail to restart, as it cannot get through the bluefs replay phase during boot. We might see the following log when trying to restart the OSD: bluefs mount failed to replay log: (5) Input/output error As we can see, log_compactions is 0, which means the log has never been compacted, and the log file size (log_bytes) is already 800+ GB. After compaction, the log file size would be reduced to around 1 GB. [Test Case] Deploy a test Ceph cluster (Luminous 12.2.13, which has the bug) and drive I/O. Compaction doesn't get triggered often when most of the I/O is reads, so fill up the cluster initially with lots of writes and then switch to heavy reads (no writes); the problem should then occur. Smaller OSDs are fine, as we're only interested in filling up the OSD and growing the bluefs log. [Where problems could occur] This fix has been part of all upstream releases since Mimic, so it has had a good amount of "runtime". The changes ensure that compaction happens more often, which is not expected to cause any problems. [Other Info] - It's only needed for Luminous (Bionic); all newer releases already have this. - Upstream PR: https://github.com/ceph/ceph/pull/17354
2021-02-11 15:43:46 Ponnuvel Palaniyappan attachment added ceph12.2.13-lp1914911.gitdiff https://bugs.launchpad.net/cloud-archive/+bug/1914911/+attachment/5462713/+files/ceph12.2.13-lp1914911.gitdiff
2021-02-21 17:12:37 Ponnuvel Palaniyappan tags sts-sru-needed seg sts-sru-needed
2021-03-16 18:30:27 Ponnuvel Palaniyappan description [Impact] For a certain type of workload, bluefs might never compact its log file, which causes the bluefs log file to slowly grow to a huge size (sometimes bigger than 1 TB on a 1.5 TB device). There are more details in the bluefs perf counters when this issue happens, e.g. "bluefs": { "gift_bytes": 811748818944, "reclaim_bytes": 0, "db_total_bytes": 888564350976, "db_used_bytes": 867311747072, "wal_total_bytes": 0, "wal_used_bytes": 0, "slow_total_bytes": 0, "slow_used_bytes": 0, "num_files": 11, "log_bytes": 866545131520, "log_compactions": 0, "logged_bytes": 866542977024, "files_written_wal": 2, "files_written_sst": 3, "bytes_written_wal": 32424281934, "bytes_written_sst": 25382201 } This bug can eventually cause an OSD to crash and fail to restart, as it cannot get through the bluefs replay phase during boot. We might see the following log when trying to restart the OSD: bluefs mount failed to replay log: (5) Input/output error As we can see, log_compactions is 0, which means the log has never been compacted, and the log file size (log_bytes) is already 800+ GB. After compaction, the log file size would be reduced to around 1 GB. [Test Case] Deploy a test Ceph cluster (Luminous 12.2.13, which has the bug) and drive I/O. Compaction doesn't get triggered often when most of the I/O is reads, so fill up the cluster initially with lots of writes and then switch to heavy reads (no writes); the problem should then occur. Smaller OSDs are fine, as we're only interested in filling up the OSD and growing the bluefs log. [Where problems could occur] This fix has been part of all upstream releases since Mimic, so it has had a good amount of "runtime". The changes ensure that compaction happens more often, which is not expected to cause any problems. [Other Info] - It's only needed for Luminous (Bionic); all newer releases already have this. - Upstream PR: https://github.com/ceph/ceph/pull/17354 [Impact] For a certain type of workload, bluefs might never compact its log file, which causes the bluefs log file to slowly grow to a huge size (sometimes bigger than 1 TB on a 1.5 TB device). There are more details in the bluefs perf counters when this issue happens, e.g. "bluefs": { "gift_bytes": 811748818944, "reclaim_bytes": 0, "db_total_bytes": 888564350976, "db_used_bytes": 867311747072, "wal_total_bytes": 0, "wal_used_bytes": 0, "slow_total_bytes": 0, "slow_used_bytes": 0, "num_files": 11, "log_bytes": 866545131520, "log_compactions": 0, "logged_bytes": 866542977024, "files_written_wal": 2, "files_written_sst": 3, "bytes_written_wal": 32424281934, "bytes_written_sst": 25382201 } This bug can eventually cause an OSD to crash and fail to restart, as it cannot get through the bluefs replay phase during boot. We might see the following log when trying to restart the OSD: bluefs mount failed to replay log: (5) Input/output error As we can see, log_compactions is 0, which means the log has never been compacted, and the log file size (log_bytes) is already 800+ GB. After compaction, the log file size would be reduced to around 1 GB. [Test Case] Deploy a test Ceph cluster (Luminous 12.2.13, which has the bug) and drive I/O. Compaction doesn't get triggered often when most of the I/O is reads, so fill up the cluster initially with lots of writes and then switch to heavy reads (no writes); the problem should then occur. Smaller OSDs are fine, as we're only interested in filling up the OSD and growing the bluefs log. [Where problems could occur] This fix has been part of all upstream releases since Mimic, so it has had a good amount of "runtime". The changes ensure that compaction happens more often, which is not expected to cause any problems. [Other Info] - It's only needed for Luminous (Bionic); all newer releases already have this. - Upstream master PR: https://github.com/ceph/ceph/pull/17354 - Upstream Luminous PR: https://github.com/ceph/ceph/pull/34876/files
2021-03-16 18:37:01 Ponnuvel Palaniyappan attachment added bionic.debdiff.fixed https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1914911/+attachment/5477289/+files/bionic.debdiff.fixed
2021-03-25 08:44:33 James Page bug added subscriber Ubuntu Stable Release Updates Team
2021-03-25 08:44:40 James Page ceph (Ubuntu Bionic): importance Undecided High
2021-03-25 08:44:43 James Page ceph (Ubuntu): status New Invalid
2021-03-25 08:44:46 James Page cloud-archive: status New Invalid
2021-03-25 08:44:48 James Page cloud-archive/queens: status New Triaged
2021-03-25 08:44:50 James Page cloud-archive/queens: importance Undecided High
2021-03-25 10:45:55 Ponnuvel Palaniyappan attachment added bluefs-bionic.debdiff https://bugs.launchpad.net/cloud-archive/+bug/1914911/+attachment/5480862/+files/bluefs-bionic.debdiff
2021-04-26 15:03:53 Timo Aaltonen ceph (Ubuntu Bionic): status In Progress Fix Committed
2021-04-26 15:03:55 Timo Aaltonen bug added subscriber SRU Verification
2021-04-26 15:03:59 Timo Aaltonen tags seg sts-sru-needed seg sts-sru-needed verification-needed verification-needed-bionic
2021-04-29 05:34:46 dongdong tao tags seg sts-sru-needed verification-needed verification-needed-bionic seg sts-sru-needed verification-done verification-done-bionic
2021-05-05 12:00:27 Corey Bryant cloud-archive/queens: status Triaged Fix Committed
2021-05-05 12:00:28 Corey Bryant tags seg sts-sru-needed verification-done verification-done-bionic seg sts-sru-needed verification-done verification-done-bionic verification-queens-needed
2021-05-06 09:00:38 Łukasz Zemczak removed subscriber Ubuntu Stable Release Updates Team
2021-05-06 09:06:01 Launchpad Janitor ceph (Ubuntu Bionic): status Fix Committed Fix Released
2021-05-28 10:22:53 Ponnuvel Palaniyappan attachment added queens_sru_1914911 https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1914911/+attachment/5500895/+files/queens_sru_1914911
2021-05-28 10:23:23 Ponnuvel Palaniyappan tags seg sts-sru-needed verification-done verification-done-bionic verification-queens-needed seg sts-sru-needed verification-done verification-done-bionic verification-queens-done
2021-07-12 13:38:08 James Page cloud-archive/queens: status Fix Committed Fix Released
2021-09-30 18:47:07 Mathew Hodson ceph (Ubuntu): status Invalid Fix Released
2021-09-30 18:47:29 Mathew Hodson cloud-archive: status Invalid Fix Released
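
The SRU descriptions recorded above identify an affected OSD by its bluefs perf counters: log_compactions stuck at 0 while log_bytes keeps growing. As a rough illustration only (not part of the original bug record), the Python sketch below polls those counters through the OSD admin socket with 'ceph daemon osd.<id> perf dump'; the OSD id list and the 100 GiB threshold are arbitrary assumptions made for the example, and it must run on the host that owns the admin sockets.

#!/usr/bin/env python3
# Minimal sketch: flag OSDs whose bluefs log has never been compacted,
# using the perf counters quoted in the SRU description above.
# Assumptions (not from the bug report): the script runs on the OSD host,
# the OSD ids below exist locally, and 100 GiB is a reasonable alarm threshold.
import json
import subprocess

GiB = 1024 ** 3
LOG_BYTES_THRESHOLD = 100 * GiB  # arbitrary "worryingly large" cut-off

def bluefs_counters(osd_id: int) -> dict:
    """Return the 'bluefs' section of an OSD's perf dump."""
    out = subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"])
    return json.loads(out).get("bluefs", {})

def check_osd(osd_id: int) -> None:
    c = bluefs_counters(osd_id)
    log_bytes = c.get("log_bytes", 0)
    compactions = c.get("log_compactions", 0)
    if compactions == 0 and log_bytes > LOG_BYTES_THRESHOLD:
        print(f"osd.{osd_id}: bluefs log is {log_bytes / GiB:.1f} GiB "
              f"and has never been compacted (possible LP#1914911)")
    else:
        print(f"osd.{osd_id}: log_bytes={log_bytes}, "
              f"log_compactions={compactions} (looks OK)")

if __name__ == "__main__":
    for osd_id in (0, 1, 2):  # hypothetical local OSD ids
        check_osd(osd_id)

For driving the I/O pattern described in the [Test Case] (fill with writes, then sustained reads), one option the bug does not prescribe but that fits the description is to fill a pool with 'rados bench -p <pool> <seconds> write --no-cleanup' and then run long 'rados bench -p <pool> <seconds> rand' read passes while watching the counters above.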