possible data corruption using ceph rbd with caching enabled
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Mirantis OpenStack | Fix Released | High | MOS Ceph |
7.0.x | Won't Fix | High | Alexey Stupnikov |
8.0.x | Fix Released | High | Alexey Stupnikov |
9.x | Fix Released | High | MOS Ceph |
Bug Description
Detailed bug description: spurious page corruptions occur in SQL Server running on Windows 2012R2 instances. The instances use ceph rbd storage with caching enabled.
The issue is not reproducible on LVM/file based storage.
Steps to reproduce: run SQL Server on a Windows 2012R2 instance, or run SQLioSim (a stress test utility that emulates SQL Server I/O)
Expected results: no errors
Actual result:
Expected FileId: 0x0
Received FileId: 0x0
Expected PageId: 0xCB19C
Received PageId: 0xCB19A (does not match expected)
Received CheckSum: 0x9F444071
Calculated CheckSum: 0x89603EC9 (does not match expected)
Received Buffer Length: 0x2000
Reproducibility: consistently reproducible with SQLioSim
Was reproduced in
MOS 6.0
MOS 7.0
MOS 8.0
Workaround: disable the rbd cache completely.
This is not acceptable, however, because of the significant performance degradation:
SQL Server cannot keep up with the required transaction rate.
tags: | added: customer-found |
Changed in mos: | |
importance: | Undecided → Critical |
assignee: | nobody → MOS Ceph (mos-ceph) |
tags: | added: support |
description: | updated |
tags: | added: area-ceph |
tags: | added: on-verification |
The problem looks like an application bug (a forgotten fsync(), or whatever the equivalent is on Windows).
librbd/ceph is also responsible for writing the filesystem metadata, and there are no (guest) filesystem metadata inconsistencies (no kernel panics/BSODs).
Please run a filesystem stress test, preferably a metadata-heavy one (for instance, writing a lot of small files in a single directory).
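
A minimal sketch of the kind of metadata-heavy test requested above, assuming Python is available inside the guest. The names `write_small_files` and `verify_small_files`, the file naming scheme, and the default sizes are hypothetical and chosen only for illustration; this is not the test that was actually run for this bug. Each file is flushed and fsync()ed after writing, so an inconsistency found later cannot be attributed to a missing flush in the test itself.

```python
import hashlib
import os
import tempfile


def write_small_files(directory, count, size=512):
    """Create `count` small files in one directory; return {file name: SHA-256}."""
    digests = {}
    for i in range(count):
        payload = os.urandom(size)
        name = f"f{i:06d}.bin"
        with open(os.path.join(directory, name), "wb") as fh:
            fh.write(payload)
            fh.flush()             # push Python's userspace buffer to the OS
            os.fsync(fh.fileno())  # ask the OS to flush the file to the (virtual) disk
        digests[name] = hashlib.sha256(payload).hexdigest()
    return digests


def verify_small_files(directory, digests):
    """Return the sorted names of files that are missing, unexpected, or altered."""
    present = set(os.listdir(directory))
    bad = []
    for name, expected in digests.items():
        if name not in present:
            bad.append(name)  # directory entry was lost
            continue
        with open(os.path.join(directory, name), "rb") as fh:
            if hashlib.sha256(fh.read()).hexdigest() != expected:
                bad.append(name)  # content changed
    bad.extend(present - set(digests))  # entries that were never written
    return sorted(bad)


if __name__ == "__main__":
    # Assumption: replace the temporary directory with a path on the
    # rbd-backed volume under test.
    with tempfile.TemporaryDirectory() as target:
        expected = write_small_files(target, count=1000)
        print("mismatches:", verify_small_files(target, expected))
```

Reading the files back immediately is likely to be served from the guest's own page cache, so a more meaningful run would probably save the digests, reboot the instance (or otherwise drop the guest caches), and only then call `verify_small_files`.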