Improve TSC refinement (and calibration) reliability
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
linux (Ubuntu) | Fix Released | High | Guilherme G. Piccoli | |
Xenial | Fix Released | High | Guilherme G. Piccoli | |
Bionic | Fix Released | High | Guilherme G. Piccoli | |
Bug Description
[Impact]
* We recently received a report of a missing TSC refinement across multiple reboots of a server with an Intel Skylake-based processor. It was reproducible only on Bionic kernels older than 5.0.
* After reviewing kernel commits, we identified 2 commits that greatly improve the situation: a786ef152cdc ("x86/tsc: Make calibration refinement more robust") [git.kernel.
* The first commit improves some comments and adjusts an offset to match more recent (faster) machines, but its important part is a retry mechanism in the TSC refinement, used when the refinement fails due to some disturbance of the TSC read (such as NMIs/SMIs).
* The second commit is an improvement in TSC calibration for Skylake (and some other models), by checking a register instead of relying on table-based hardcoded values.
* A note for Xenial (kernel 4.4): the second patch would require the inclusion of more commits, so given the "maturity" of this release (and the fact that kernel 4.15 is an HWE kernel for Xenial), I've kept it out of Xenial, backporting only the first and more important patch to 4.4.
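As a rough illustration of the retry idea described above, here is a minimal Python sketch. It is not the kernel code: the function name, the `read_sample` callback, the attempt count and the 1% tolerance are all hypothetical choices made for the example. A disturbed read is modelled as `None`, and a sample is accepted only if it stays close to the coarse boot-time calibration.

```python
def refine_with_retry(read_sample, coarse_khz, max_attempts=3, tolerance=0.01):
    """Return a refined frequency in kHz, or None if every attempt failed.

    read_sample: hypothetical callback returning a measured frequency in kHz,
                 or None when the read was disturbed (e.g. by an NMI/SMI).
    coarse_khz:  the coarse calibration the refined value is checked against.
    """
    for _ in range(max_attempts):
        sample_khz = read_sample()
        if sample_khz is None:
            # Disturbed read: try again instead of giving up on refinement.
            continue
        if abs(sample_khz - coarse_khz) <= coarse_khz * tolerance:
            return sample_khz
        # Implausible sample (too far from the coarse value): retry as well.
    return None


# Illustrative readings: one disturbed read, one outlier, then a good sample.
readings = iter([None, 5_000_000, 2_399_998])
result = refine_with_retry(lambda: next(readings), coarse_khz=2_400_000)
print(result)
```

Without the retry, the first disturbed read would leave the system running on the coarse calibration only, which matches the "missing refinement" symptom in the report.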
[Test case]
* Unfortunately there is no easy way to test the effectiveness of the commits, especially the refinement improvement.
* The user who reported the missing refinements to us ran 300 reboots with a regular Bionic kernel (and reproduced the issue at least once), whereas with a Bionic kernel plus both commits proposed here, the problem didn't happen.
* Regarding the calibration commit, it was well tested by the community on multiple machines, comparing the TSC calibration read against the tables available at instlatx64.atw.hu.
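For a reboot test like the one above, a small script can flag boots where the refinement did not happen. This is only a sketch: it assumes the kernel log contains a line of the form "tsc: Refined TSC clocksource calibration: <freq> MHz" when refinement succeeds (check the exact wording on your kernel), and the sample logs and helper names below are made up for illustration.

```python
import re

# Assumed log line format; verify it against the kernel under test.
REFINED_RE = re.compile(r"tsc: Refined TSC clocksource calibration: (\d+\.\d+) MHz")


def refined_mhz(boot_log):
    """Return the refined TSC frequency in MHz found in one boot log, or None."""
    match = REFINED_RE.search(boot_log)
    return float(match.group(1)) if match else None


def boots_missing_refinement(boot_logs):
    """Return the indices of the boots whose log has no refinement line."""
    return [i for i, log in enumerate(boot_logs) if refined_mhz(log) is None]


# Illustrative (fabricated) log excerpts for three boots; the second one
# lacks the refinement message.
sample_logs = [
    "[    1.9] tsc: Refined TSC clocksource calibration: 2399.999 MHz\n",
    "[    0.0] tsc: Detected 2400.000 MHz processor\n",
    "[    2.1] tsc: Refined TSC clocksource calibration: 2400.001 MHz\n",
]
missing = boots_missing_refinement(sample_logs)
print(missing)
```

In practice each entry of `sample_logs` would be the saved `dmesg` output of one boot.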
[Regression potential]
* We consider the regression potential low, especially given the nature of the patches: the first is basically a retry mechanism (plus an adjustment to an offset to reflect more recent machines), and the 2nd is an improvement to TSC calibration on some platforms (whose values are currently hardcoded in a table in the kernel). Also, the patches have been upstream for a while and I couldn't find any fixes for them.
* A hypothetical regression from the 2nd patch could be in the TSC precision calculation, which the refinement itself may well correct. For the first patch, the only hypothetical regression I can think of is a plain bug in the code.
description: | updated |
Changed in linux (Ubuntu Xenial): | |
status: | New → In Progress |
importance: | Undecided → High |
Changed in linux (Ubuntu Bionic): | |
status: | New → In Progress |
importance: | Undecided → High |
assignee: | nobody → Guilherme G. Piccoli (gpiccoli) |
Changed in linux (Ubuntu Xenial): | |
assignee: | nobody → Guilherme G. Piccoli (gpiccoli) |
Changed in linux (Ubuntu Xenial): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Bionic): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu): | |
status: | In Progress → Fix Released |
SRU just submitted to the kernel-team mailing list: https://lists.ubuntu.com/archives/kernel-team/2020-May/109698.html
Cheers,
Guilherme