nailgun-agent cronjob is locking the block devices

Bug #1619734 reported by Rudy McComb
This bug affects 1 person
Affects              Status         Importance   Assigned to       Milestone
Fuel for OpenStack   Invalid        High         Georgy Kibardin
7.0.x                Won't Fix      High         Sergii Rizvan
Mitaka               Fix Released   High         Georgy Kibardin

Bug Description

We have some compute nodes with dead multipath devices:

root@mosp-9068:~# multipath -l 360002ac0000000000000032a00015b33
mpath46 (360002ac0000000000000032a00015b33) dm-9 ,
size=28G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=-1 status=enabled
|- #:#:#:# - #:# failed undef running
|- #:#:#:# - #:# failed undef running
|- #:#:#:# - #:# failed undef running
`- #:#:#:# - #:# failed undef running

These dead devices sometimes appear on compute nodes after a completed live migration. Cinder/OS cannot detach the multipath device from the system because another process is locking the block devices.

After some research, we are fairly sure that the root cause is the nailgun-agent cron job, "/etc/cron.d/nailgun-agent", which runs every minute.

On these machines we then see many hanging blkid commands, and we can no longer run any block-subsystem command:

root@mosp-9068:~# pstree -p
init(1)─┬─atop(29635)
├─blkid(1582)
├─blkid(26847)
├─blkid(32706)
├─cinder-backup(8424)
├─cinder-volume(28456)───cinder-volume(28466)
├─cron(3356)───cron(2574)───sh(2579)───flock(2581)───sh(2583)─┬─logger(2586)
│ ├─ruby(2584)─┬─blkid(2715)
│ │ ├─ruby(2644)
│ │ ├─ruby(2647)
│ │ ├─ruby(2678)
│ │ ├─ruby(2681)
│ │ ├─ruby(2693)
│ │ ├─ruby(2699)
│ │ ├─ruby(2706)
│ │ ├─{ruby}(2613)
│ │ └─{ruby}(2716)
│ └─tee(2585)

Per this bug https://bugs.launchpad.net/nova/+bug/1208799 it looks like we need a backport of os-brick for MOS 7.0.

Changed in mos:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
tags: added: customer-found support
Dmitry Pyzhov (dpyzhov)
Changed in mos:
importance: Undecided → High
milestone: none → 10.0
Changed in fuel:
milestone: none → 10.0
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
no longer affects: mos/9.x
Changed in fuel:
importance: Undecided → High
no longer affects: mos
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Georgy Kibardin (gkibardin)
status: New → In Progress
Georgy Kibardin (gkibardin) wrote :

It looks like the ohai package calls blkid from filesystem.rb.
I think it would be reasonable to fix blkid to perform reads with a timeout. I suspect we cannot come up with a better fix, since even the kernel doesn't know that some devices are offline and blocks reads indefinitely.
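Short of patching blkid itself, a caller-side timeout gives a similar bound on how long the agent waits. The following is only a minimal sketch in Python, not the agent's actual code: the helper name run_with_timeout and the 5-second default are assumptions. Note that a process stuck in uninterruptible I/O may not exit even after being killed, so the sketch deliberately does not wait for it.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=5.0):
    """Run cmd and return its stdout, or None if it did not finish in time.

    Hypothetical helper: it bounds how long the caller waits, but it cannot
    guarantee that a child blocked on a dead device actually goes away.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return out
    except subprocess.TimeoutExpired:
        proc.kill()
        # Do not call proc.wait() here: a child in uninterruptible sleep
        # would make the caller hang as well. Just release our pipe end.
        proc.stdout.close()
        return None


if __name__ == "__main__":
    # Example call; the device path is illustrative only.
    print(run_with_timeout(["blkid", "/dev/dm-9"]))
```

This keeps the cron job from piling up behind one hung read, but any blkid process that is already blocked still holds the device open.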

Changed in fuel:
assignee: Georgy Kibardin (gkibardin) → MOS Linux (mos-linux)
status: In Progress → Confirmed
Dmitry Teselkin (teselkin-d) wrote :

I don't think we should fix blkid just because we are trying to use it in a way it shouldn't be used. Calling it every minute is indeed an example of such bad practice.

We should think about a more appropriate way of gathering that data, for example iterating over the /sys/block/ file system structure without ohai.
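To illustrate the /sys/block/ idea, here is a rough Python sketch (nailgun-agent itself is Ruby, and this is not its implementation). The function names and the chosen attributes are assumptions. The point is that these sysfs attributes are served by the kernel, so collecting them should not require opening the block device itself.

```python
import os

# sysfs reports "size" in 512-byte sectors, independent of the device's block size
SECTOR_BYTES = 512


def read_attr(path, default=""):
    """Return the stripped contents of a sysfs attribute, or default if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return default


def list_block_devices(sys_block="/sys/block"):
    """Collect basic block device facts from sysfs without opening the devices."""
    devices = []
    for name in sorted(os.listdir(sys_block)):
        base = os.path.join(sys_block, name)
        sectors = read_attr(os.path.join(base, "size"), "0")
        devices.append({
            "name": name,
            "size_bytes": int(sectors or 0) * SECTOR_BYTES,
            "removable": read_attr(os.path.join(base, "removable"), "0") == "1",
            # Not every device exposes a model (dm-* devices usually do not)
            "model": read_attr(os.path.join(base, "device", "model")),
        })
    return devices


if __name__ == "__main__":
    for dev in list_block_devices():
        print(dev)
```

Filesystem UUIDs and labels are not available this way, which is consistent with the later observation in this thread that the agent does not actually need block device IDs.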

Changed in fuel:
assignee: MOS Linux (mos-linux) → Georgy Kibardin (gkibardin)
Dmitry Teselkin (teselkin-d) wrote :

This doesn't affect 10 because we don't use ohai anymore: https://review.openstack.org/#/c/314642/

Changed in fuel:
status: Confirmed → Invalid
Georgy Kibardin (gkibardin) wrote :

We cannot get a block device ID without reading the device, and we cannot know whether a block device is dead. I don't see any way to avoid blocking.

Georgy Kibardin (gkibardin) wrote :

Aha, it seems that we just don't need block device IDs, so we just need to backport the patch.

Georgy Kibardin (gkibardin) wrote :

As a workaround, we may turn off the nailgun agent in cron for the duration of the migration.
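One way to script this workaround is to move the cron file out of /etc/cron.d while the migration runs and put it back afterwards. This is a sketch only: the helper name cron_job_disabled, the ".parked" suffix and the /root parking directory are assumptions, and it does not stop an agent run that has already started.

```python
import contextlib
import os
import shutil


@contextlib.contextmanager
def cron_job_disabled(cron_file="/etc/cron.d/nailgun-agent", parking_dir="/root"):
    """Temporarily move a cron.d file aside and restore it on exit.

    Hypothetical helper: the parking location and suffix are illustrative.
    """
    parked = os.path.join(parking_dir, os.path.basename(cron_file) + ".parked")
    shutil.move(cron_file, parked)
    try:
        yield parked
    finally:
        # Restore the job even if the migration step raised an exception
        shutil.move(parked, cron_file)
```

Usage would be along the lines of `with cron_job_disabled(): ...`, with the live migration triggered inside the block.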

tags: added: ct2
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-nailgun-agent (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/367786

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-nailgun-agent (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/370925

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-nailgun-agent (stable/7.0)

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/374290

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-nailgun-agent (stable/mitaka)

Reviewed: https://review.openstack.org/367786
Committed: https://git.openstack.org/cgit/openstack/fuel-nailgun-agent/commit/?id=b5cb0a8e7b986c10d6c92e41958fde501bebd6cb
Submitter: Jenkins
Branch: stable/mitaka

commit b5cb0a8e7b986c10d6c92e41958fde501bebd6cb
Author: Ivan Suzdal <email address hidden>
Date: Tue May 10 18:51:03 2016 +0300

    Let's get rid of ohai.

    Ohai required support additional packages,
    and unfortunately, not all of them are
    opensource friendly (ruby-sigar, for example).

    This changes will let us to rid ruby-mixlib*,
    ruby-sigar and ruby-yajl packages.

    Also, it may sound strange, but ohai[:virtualization]
    makes decision based on /proc/cpuinfo information
    only (this applies only to kvm/qemu, other virt-systems
    determines correctly, AFAICS).

    So, if someone will choose a non-default (qemu)
    processor configuration, ohai will return incomplete
    information about virtualization on a kvm-based virtual host.
    Facter doing it more intelligently.

    Blueprint: get-rid-of-ohai
    Closes-Bug: #1619734

    Change-Id: Ia8021a3ab83bbf973eff548880ae10a540476b1c
    (cherry picked from commit 93e84d1649a9865779f0cf3517f623184d2e9029)

Sergii Rizvan (srizvan) wrote :

We've decided not to backport the fix to 7.0 and 8.0 because the change is very large, and applying it to deployed clusters may cause unexpected behavior in the nailgun agent workflow.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-nailgun-agent (stable/7.0)

Change abandoned by Sergii Rizvan (<email address hidden>) on branch: stable/7.0
Review: https://review.openstack.org/374290

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-nailgun-agent (stable/8.0)

Change abandoned by Sergii Rizvan (<email address hidden>) on branch: stable/8.0
Review: https://review.openstack.org/370925

tags: added: hard-to-verify
tags: added: on-verification
Alexey. Kalashnikov (akalashnikov) wrote :

9.2 snapshot #517

Alexey. Kalashnikov (akalashnikov) wrote :

Verified on 9.2 snapshot #537

tags: removed: on-verification