nailgun-agent cronjob is locking the block devices
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Invalid | High | Georgy Kibardin |
7.0.x | Won't Fix | High | Sergii Rizvan |
Mitaka | Fix Released | High | Georgy Kibardin |
Bug Description
Some of our compute nodes have dead multipath devices:
root@mosp-9068:~# multipath -l 360002ac0000000
mpath46 (360002ac000000
size=28G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=-1 status=enabled
|- #:#:#:# - #:# failed undef running
|- #:#:#:# - #:# failed undef running
|- #:#:#:# - #:# failed undef running
`- #:#:#:# - #:# failed undef running
These dead devices sometimes appear on compute nodes after a completed live migration. Cinder/OS cannot detach the multipath device from the system because another process is locking the block devices.
After some research, we are fairly sure that the root cause of the problem is the nailgun-agent cron job "/etc/cron.
We then see a lot of hanging blkid commands on these machines, and we can no longer run any block subsystem command on them.
root@mosp-9068:~# pstree -p
init(1)
├─blkid(1582)
├─blkid(26847)
├─blkid(32706)
├─cinder-
├─cinder-
├─cron(
│ ├─ruby(
│ │ ├─ruby(2644)
│ │ ├─ruby(2647)
│ │ ├─ruby(2678)
│ │ ├─ruby(2681)
│ │ ├─ruby(2693)
│ │ ├─ruby(2699)
│ │ ├─ruby(2706)
│ │ ├─{ruby}(2613)
│ │ └─{ruby}(2716)
│ └─tee(2585)
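To confirm which blkid processes are actually stuck, rather than reading them off pstree, something like the following could be used. It is only a sketch: it assumes the hung processes are in uninterruptible sleep (state `D`), which this report does not state explicitly, and the helper names `parse_stat` and `blocked_processes` are made up for illustration.

```python
import os


def parse_stat(text):
    """Split the start of a /proc/<pid>/stat line into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so the last ')' is used as the delimiter.
    """
    lpar = text.index("(")
    rpar = text.rindex(")")
    pid = int(text[:lpar])
    comm = text[lpar + 1:rpar]
    state = text[rpar + 1:].split()[0]
    return pid, comm, state


def blocked_processes(name="blkid", proc_root="/proc"):
    """Return sorted PIDs of processes called `name` that are in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            # The process exited while we were looking, or the line was odd.
            continue
        if comm == name and state == "D":
            found.append(pid)
    return sorted(found)


if __name__ == "__main__":
    print(blocked_processes())
```

An empty list would mean the assumption about state `D` does not hold on that node, and the blkid processes are waiting on something else.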
Per this bug https:/
Changed in mos: | |
assignee: | nobody → Fuel Sustaining (fuel-sustaining-team) |
tags: | added: customer-found support |
Changed in mos: | |
importance: | Undecided → High |
milestone: | none → 10.0 |
Changed in fuel: | |
milestone: | none → 10.0 |
assignee: | nobody → Fuel Sustaining (fuel-sustaining-team) |
no longer affects: | mos/9.x |
Changed in fuel: | |
importance: | Undecided → High |
no longer affects: | mos |
Changed in fuel: | |
assignee: | Fuel Sustaining (fuel-sustaining-team) → Georgy Kibardin (gkibardin) |
status: | New → In Progress |
Changed in fuel: | |
assignee: | Georgy Kibardin (gkibardin) → MOS Linux (mos-linux) |
status: | In Progress → Confirmed |
tags: | added: ct2 |
tags: | added: hard-to-verify |
tags: | added: on-verification |
It looks like the ohai package calls blkid from filesystem.rb.
I think it would be reasonable to fix blkid to perform reads with a timeout. I suspect we cannot come up with a better fix, since even the kernel doesn't know that some devices are offline and blocks reads indefinitely.
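Until blkid itself has a read timeout, a caller can at least avoid hanging along with it. The sketch below is not the ohai or nailgun-agent code (ohai is Ruby); it only illustrates the idea of a caller-side deadline, and `run_with_deadline` is a hypothetical helper. Note that a process blocked in an uninterruptible read will not exit on SIGKILL until the I/O returns, so this protects the caller but does not release the device. For that reason the helper does not wait for the child after killing it.

```python
import subprocess
import tempfile
import time


def run_with_deadline(argv, timeout_s=10.0, poll_s=0.05):
    """Run argv and return (returncode, stdout_text).

    Returns (None, None) if the command has not finished after timeout_s
    seconds. Output goes to a temporary file so that a chatty child
    cannot stall on a full pipe while we are polling.
    """
    with tempfile.TemporaryFile() as out:
        proc = subprocess.Popen(argv, stdout=out, stderr=subprocess.DEVNULL)
        deadline = time.monotonic() + timeout_s
        while proc.poll() is None:
            if time.monotonic() >= deadline:
                # Best effort only: do not wait(), the child may be stuck in D state.
                proc.kill()
                return None, None
            time.sleep(poll_s)
        out.seek(0)
        return proc.returncode, out.read().decode("utf-8", "replace")


if __name__ == "__main__":
    # The timeout value here is an arbitrary example.
    code, text = run_with_deadline(["blkid"], timeout_s=5.0)
    print("timed out" if code is None else text)
```

This only limits the damage on the caller's side: each timed-out run still leaves one stuck blkid behind, so the cron job would also need to skip the scan while earlier ones are still blocked.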