Description:
Customer reports:
—
We tried the ndctl error injection, and the injection now succeeds, but we have a couple of questions about the poisoned block.
Here are the tests/steps that I ran:
1. I mmap the first 10GB of /dev/dax13.0 into the virtual address space and inject an error at offset 1GB (the block offset should be 1GB / 512 bytes = 2097152), which seems fine:
[root@scaqae09celadm13 wning]# ndctl inject-error --uninject --block=2097152 --count=1 namespace13.0
Warning: Un-injecting previously injected errors here will
not cause the kernel to 'forget' its badblock entries. Those
have to be cleared through the normal process of writing
the affected blocks
{
  "dev":"namespace13.0",
  "mode":"dax",
  "size":518967525376,
  "uuid":"0738c8bd-3b3f-4989-9d0e-0e9c6006c810",
  "chardev":"dax13.0",
  "numa_node":0,
  "badblock_count":1,
  "badblocks":[
    { "offset":2097152, "length":1, "dimms":[ "nmem1" ] }
  ]
}
2. My test program simply reads every address in the first 10GB. The first time I read offset 1GB, I got SIGBUS, but in the siginfo struct passed to the signal handler the failed address (si_addr) is NULL and the signal code (si_code) is 128, which seems incorrect. When we run the test again, it gets stuck here:
rt_sigaction(SIGBUS, {0x400dd2, [], SA_RESTORER|SA_SIGINFO, 0x7fb5cf839270}, NULL, 8) = 0
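The block offset used in step 1 comes from dividing the byte offset by the block size. A minimal sketch of that arithmetic follows; the helper name and the macro are hypothetical, and the 512-byte block size is taken from the calculation in step 1 rather than from any ndctl header:

```c
#include <stdint.h>

/* Assumption: --block counts 512-byte blocks, as the 1GB/512 calculation in step 1 implies. */
#define INJECT_BLOCK_SIZE 512ULL

/* Hypothetical helper: convert a byte offset within the namespace to a block number. */
static uint64_t byte_offset_to_block(uint64_t byte_offset)
{
    return byte_offset / INJECT_BLOCK_SIZE;
}
```

With a byte offset of 1GB (1ULL << 30), this gives the block number passed to --block above.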
And here is the kernel log output:
Apr 3 14:39:23 scaqae09celadm13 kernel: mce: Uncorrected hardware memory error in user-access at 952b200000
Apr 3 14:51:11 scaqae09celadm13 kernel: mce: Uncorrected hardware memory error in user-access at 94eb300000
Apr 3 14:51:11 scaqae09celadm13 kernel: mce_notify_irq: 48 callbacks suppressed
Apr 3 14:51:11 scaqae09celadm13 kernel: mce: [Hardware Error]: Machine check events logged
Apr 3 14:51:11 scaqae09celadm13 kernel: Memory failure: 0x94eb300: reserved kernel page still referenced by 1 users
Apr 3 14:51:11 scaqae09celadm13 kernel: Memory failure: 0x94eb300: recovery action for reserved kernel page: Failed
Apr 3 14:51:11 scaqae09celadm13 kernel: mce: Memory error not recovered
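For reading these messages: the value in the "Memory failure" lines appears to be the page frame number of the physical address in the preceding mce line. A small sketch of that relation, assuming 4 KiB pages (the helper and macro names are hypothetical):

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, i.e. a page shift of 12, as is usual on x86-64. */
#define PAGE_SHIFT_4K 12

/* Hypothetical helper: physical address -> page frame number. */
static uint64_t phys_addr_to_pfn(uint64_t phys_addr)
{
    return phys_addr >> PAGE_SHIFT_4K;
}
```

Applied to the address 94eb300000 from the 14:51:11 mce message, this yields 0x94eb300, which matches the value in the "Memory failure" lines.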
The program I use is below (it simply does a memcpy and then reads directly from the target address):
for (i = 0; i < DAX_MAPPING_SIZE; i++) // DAX_MAPPING_SIZE is 10G
    total += peek(buf + i);

char peek(void *addr)
{
    char temp[128];
    memcpy(temp, addr, 1);   // first read: copy one byte from the target address
    return *(char *)addr;    // second read: load the same byte directly
}
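The actual signal handler is not shown in this report, so the following is only a sketch of how the read loop could be paired with a SIGBUS handler that records si_code and si_addr and then skips the faulting read. All names here (sigbus_handler, install_sigbus_handler, safe_peek) are hypothetical. For reference, sigaction(2) documents BUS_MCEERR_AR as the si_code for a hardware memory error consumed by the process, with si_addr set to the faulting address, while 128 is the value of SI_KERNEL:

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf sigbus_env;               /* jump target used to skip a faulting read */
static volatile sig_atomic_t last_si_code;  /* si_code of the last SIGBUS */
static void *volatile last_si_addr;         /* si_addr of the last SIGBUS */

/* Record the siginfo fields, then jump back to the sigsetjmp() point. */
static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    last_si_code = info->si_code;
    last_si_addr = info->si_addr;
    siglongjmp(sigbus_env, 1);
}

/* Install the handler with SA_SIGINFO; returns 0 on success, -1 on error. */
static int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

/* Read one byte; returns 0 and stores it in *out, or -1 if SIGBUS was raised. */
static int safe_peek(const volatile char *addr, char *out)
{
    /* savemask = 1 so that SIGBUS is unblocked again after siglongjmp() */
    if (sigsetjmp(sigbus_env, 1) != 0)
        return -1;
    *out = *addr;
    return 0;
}
```

In the loop above, peek(buf + i) could be replaced by safe_peek(buf + i, &c), logging last_si_code and last_si_addr whenever it returns -1. Whether the kernel fills in si_addr for a device-dax mapping on this kernel version is exactly the open question in this report, so the sketch does not assume it.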
May I ask whether we missed any steps in triggering the SIGBUS error?
Target Kernel: 4.18
Target Release: 18.10