Bcache bypass writeback on caching device with fragmentation
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Undecided | dongdong tao |
Bionic | Fix Released | High | dongdong tao |
Focal | Fix Released | High | dongdong tao |
Groovy | Fix Released | High | dongdong tao |
Hirsute | Fix Released | Undecided | dongdong tao |
Bug Description
SRU Justification:
[Impact]
This bcache bug degrades I/O performance on all versions of the kernel [correct versions affected]. The impact is especially severe when Ceph runs on top of bcache.
When the issue hits, write I/O latency suddenly jumps from around 10 ms to around 1 second and can easily stay there for hours or even days. This is especially harmful for a Ceph-on-bcache architecture: Ceph becomes extremely slow and the entire cloud becomes almost unusable.
The root cause is that the dirty buckets reach the 70 percent threshold, which causes all writes to go directly to the backing HDD device. That might be acceptable if the cache actually held a lot of dirty data, but because of high memory fragmentation it happens when dirty data has not even reached 10 percent. To make matters worse, the writeback rate may still be at its minimum value (8) because the writeback percent has not been reached, so it takes a very long time for bcache to reclaim enough dirty buckets to get itself out of this situation.
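To make the mismatch concrete, here is a minimal Python sketch of how a small amount of dirty data can still cross the 70 percent dirty-bucket threshold. It assumes a simplified model in which dirty data is spread evenly over the dirty buckets. The name `fragment` is used here as a hypothetical ratio of occupied bucket space to the dirty data actually stored in it, and none of the function names come from bcache itself.

```python
# Simplified model (assumption): every dirty bucket holds the same
# amount of dirty data, so fragment = bucket space / dirty data in it.
CUTOFF_PERCENT = 70.0  # dirty-bucket threshold quoted in the description


def dirty_bucket_percent(dirty_data_percent: float, fragment: float) -> float:
    """Percent of buckets that hold dirty data, capped at 100."""
    return min(100.0, float(dirty_data_percent) * float(fragment))


def writes_bypass_cache(dirty_data_percent: float, fragment: float) -> bool:
    """True when the dirty buckets have reached the cutoff threshold."""
    return dirty_bucket_percent(dirty_data_percent, fragment) >= CUTOFF_PERCENT


if __name__ == "__main__":
    # 9% dirty data, each dirty bucket only one eighth full of dirty data.
    print(dirty_bucket_percent(9, 8))
    print(writes_bypass_cache(9, 8))
    # Same dirty data with no fragmentation stays far below the cutoff.
    print(writes_bypass_cache(9, 1))
```

Under this model, 9 percent dirty data with a fragment of 8 already occupies 72 percent of the buckets, which matches the situation described above where the cutoff is hit with less than 10 percent dirty data.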
[Fix]
* 71dda2a5625f31b
The current way to calculate the writeback rate only considered the dirty sectors.
This usually works fine when memory fragmentation is not high, but it will give us an unreasonably low writeback rate when we are in the situation that a few dirty sectors have consumed a lot of dirty buckets. In some cases, the dirty buckets reached CUTOFF_
We accelerate the rate in 3 stages with different aggressiveness:
the first stage starts when dirty buckets percent reach above BCH_WRITEBACK_
the second is BCH_WRITEBACK_
the third is BCH_WRITEBACK_
By default the first stage tries to writeback the amount of dirty data
in one bucket (on average) in (1 / (dirty_
the second stage tries to writeback the amount of dirty data in one bucket
in (1 / (dirty_
stage tries to writeback the amount of dirty data in one bucket in
(1 / (dirty_
The initial rate at each stage can be controlled by 3 configurable
parameters:
writeback_
Their defaults are 1, 10 and 1000, chosen based on testing and production data, as detailed below.
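The stage selection can be sketched in Python as follows. This is only an illustration of the three-stage idea: the threshold values 50, 57 and 64 used in the usage lines are assumptions picked to sit below the 70 percent cutoff (the actual constants are truncated in this text), and `pick_stage` is a hypothetical helper, not a bcache function. Only the default terms 1, 10 and 1000 come from the description.

```python
from typing import Optional, Tuple

DEFAULT_TERMS = (1, 10, 1000)  # low, mid, high defaults from the description


def pick_stage(
    dirty_bucket_percent: float,
    low: float,
    mid: float,
    high: float,
    terms: Tuple[int, int, int] = DEFAULT_TERMS,
) -> Optional[Tuple[str, int]]:
    """Return (stage name, term) or None below the first threshold.

    low/mid/high are the percentages at which each stage starts;
    they are caller-supplied because the real values are not shown here.
    """
    if dirty_bucket_percent <= low:
        return None  # no acceleration yet
    if dirty_bucket_percent <= mid:
        return ("low", terms[0])
    if dirty_bucket_percent <= high:
        return ("mid", terms[1])
    return ("high", terms[2])


if __name__ == "__main__":
    # Hypothetical thresholds, for illustration only.
    for percent in (40, 51, 60, 66):
        print(percent, pick_stage(percent, low=50, mid=57, high=64))
```

Passing the thresholds as arguments keeps the sketch independent of the real constants, so it can be rerun with whatever values the kernel in question uses.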
A. The low stage is still far from the 70%
threshold, so we only want to give it a small push by setting the
term to 1. This means the initial rate will be 170 if the fragment is 6;
it is calculated by bucket_
but still much more reasonable than the minimum 8.
For a production bcache with a light workload and a cache device
larger than 1 TB, it may take hours to consume 1% of the buckets,
so it is quite possible to reclaim enough dirty buckets in this stage
and thus avoid entering the next stage.
B. If the dirty bucket ratio does not turn around during the first stage,
we enter the mid stage. The mid stage needs to be more aggressive
than the low stage, so its initial rate is chosen to be 10 times
that of the low stage, which means 1700 as the initial
rate if the fragment is 6. This is a typical rate
we see for a normal workload when writeback happens
because of writeback_percent.
C. If the dirty bucket ratio still has not turned around during the low
and mid stages, we enter the third stage. This is the last chance to
turn around and avoid the horrible cutoff writeback sync issue, so we
choose a rate 100 times more aggressive than the mid stage, which
means 170000 as the initial rate if the fragment is 6. This is also
inferred from a production bcache: I have one week of writeback rate
data from a production bcache with quite heavy workloads where,
again, the writeback is triggered by the writeback percent, and
the highest rates are around 100000 to 240000. I therefore believe this
level of aggressiveness is reasonable for production at this stage.
It should also be mostly sufficient, because the hint tries to reclaim
1000 buckets per second, while that heavy production environment
consumed 50 buckets per second on average over the week of data.
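The initial rates quoted in A, B and C can be reproduced with a short Python calculation. It assumes a bucket size of 1024 sectors, which is not stated in this text but is the value for which integer division by a fragment of 6 gives the quoted 170; `initial_rate` is a hypothetical helper written for this illustration.

```python
BUCKET_SIZE_SECTORS = 1024  # assumption: not stated in the description
MIN_RATE = 8                # minimum writeback rate quoted in the description
TERMS = {"low": 1, "mid": 10, "high": 1000}  # default terms from the description


def initial_rate(fragment: int, stage: str,
                 bucket_size: int = BUCKET_SIZE_SECTORS) -> int:
    """Initial writeback rate at the start of a stage.

    The average dirty data per bucket (bucket size divided by the
    fragment, using integer division) is scaled by the stage's term.
    """
    avg_dirty_per_bucket = bucket_size // fragment
    return avg_dirty_per_bucket * TERMS[stage]


if __name__ == "__main__":
    for stage in ("low", "mid", "high"):
        rate = initial_rate(6, stage)
        print(stage, rate, "times the minimum:", rate // MIN_RATE)
```

With these assumptions the three stages start at 170, 1700 and 170000, so even the low stage starts about 21 times above the minimum rate of 8, and the high stage falls inside the 100000 to 240000 range observed in production.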
Option writeback_
this feature to be on or off, it's on by default.
[Test Case]
All my testing results are in the Google document below; they clearly show a significant performance improvement.
https:/
In addition, we built a test kernel based on bionic 4.15.0-99.100 plus the patch and deployed it in a production environment, an OpenStack environment with Ceph on bcache as the storage. It has run for more than one month without showing any issue.
[Regression Potential]
The patch only updates the writeback rate, so it has no impact on data safety. The only potential regression I can think of is that the backing device might be a bit busier after the dirty buckets reach BCH_WRITEBACK_
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
description: updated
Changed in linux (Ubuntu):
assignee: nobody → dongdong tao (taodd)
Changed in linux (Ubuntu Bionic):
assignee: nobody → dongdong tao (taodd)
Changed in linux (Ubuntu Focal):
assignee: nobody → dongdong tao (taodd)
description: updated
Changed in linux (Ubuntu Groovy):
assignee: nobody → dongdong tao (taodd)
importance: Undecided → Medium
status: New → In Progress
Changed in linux (Ubuntu Focal):
importance: Undecided → Medium
status: New → In Progress
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
status: New → In Progress
Changed in linux (Ubuntu Bionic):
importance: Medium → High
Changed in linux (Ubuntu Focal):
importance: Medium → High
Changed in linux (Ubuntu Groovy):
importance: Medium → High
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Focal):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Groovy):
status: In Progress → Fix Committed
tags: added: verification-done-bionic
tags: removed: verification-done-bionic
tags: added: verification-done-bionic
summary:
- Bcache bypasse writeback on caching device with fragmentation
+ Bcache bypass writeback on caching device with fragmentation
This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:
apport-collect 1900438
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.