[nailgun-agent] Agent hangs on hw with huge block-dev count

Bug #1559167 reported by Aleksey Zvyagintsev
Affects: Fuel for OpenStack
Status: Fix Released
Importance: High
Assigned to: Krzysztof Szukiełojć

Bug Description

nailgun-agent gets stuck on a system with a huge number of disks (64 physical disks):

$ lsblk |wc -l
716

After investigation, we found that the root cause of the hang is the lshw call [0][1].
Strace log for 'strace lshw':
http://paste.openstack.org/show/491111/

lshw gets stuck on a random disk each time. The root cause is that it tries to access a partition of a multipath device that may be inaccessible at that moment:
###
lsblk |grep -A 3 -B 3 3600144f0534f392c000056e972890031-part2
sdaw 67:0 0 15G 0 disk
`-3600144f0534f392c000056e972890031 (dm-42) 252:42 0 15G 0 mpath
  |-3600144f0534f392c000056e972890031-part1 (dm-142) 252:142 0 24M 0 part
  |-3600144f0534f392c000056e972890031-part2 (dm-143) 252:143 0 200M 0 part
  `-3600144f0534f392c000056e972890031-part3 (dm-144) 252:144 0 14.4G 0 part
sdbm 68:0 0 15G 0 disk
`-3600144f0534f392c000056e9728b0041 (dm-60) 252:60 0 15G 0 mpath
--
sddi 71:0 0 15G 0 disk
`-3600144f0534f392c000056e972890031 (dm-42) 252:42 0 15G 0 mpath
  |-3600144f0534f392c000056e972890031-part1 (dm-142) 252:142 0 24M 0 part
  |-3600144f0534f392c000056e972890031-part2 (dm-143) 252:143 0 200M 0 part
  `-3600144f0534f392c000056e972890031-part3 (dm-144) 252:144 0 14.4G 0 part
###

dd if=/dev/zero of=/dev/sdaw2
dd: writing to '/dev/sdaw2': No such process
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.052422 s, 0.0 kB/s
The second path to the device, by contrast, is fine.

The device is also fine when accessed through its mapped name:
dd if=/dev/zero of=/dev/mapper/3600144f0534f392c000056e972890031-part2
409601+0 records in
409600+0 records out
209715200 bytes (210 MB) copied, 11.9792 s, 17.5 MB/s
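
The dd checks above write zeros to the device, which overwrites whatever is on the partition. A read-only probe can answer the same "is this path accessible right now?" question without touching the data. The following is a minimal sketch, assuming GNU coreutils `timeout` and `dd` are available; the function name `probe_path` and the 5-second default are hypothetical and not part of nailgun-agent.

```shell
# Hypothetical helper: try to read the first 512-byte sector of a path.
# $1 = device or file path, $2 = deadline in seconds (illustrative default: 5).
# Returns 0 if the read succeeds, non-zero if it fails or the deadline expires.
probe_path() {
    timeout "${2:-5}" dd if="$1" of=/dev/null bs=512 count=1 2>/dev/null
}

# Usage (device names taken from the report above):
#   probe_path /dev/sdaw2 || echo "sdaw2 is not readable"
#   probe_path /dev/mapper/3600144f0534f392c000056e972890031-part2
```

This is best effort only: a read that is blocked in uninterruptible I/O may not react to the signal that `timeout` sends.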

Workaround: run lshw with the '-disable scsi' option.

[0] https://github.com/openstack/fuel-nailgun-agent/blob/master/agent#L334
[1] https://github.com/openstack/fuel-nailgun-agent/blob/master/agent#L904-L920

Changed in fuel:
assignee: nobody → Aleksey Zvyagintsev (azvyagintsev)
assignee: Aleksey Zvyagintsev (azvyagintsev) → Fuel Python Team (fuel-python)
tags: added: team-mixed
Maciej Relewicz (rlu)
Changed in fuel:
status: New → Confirmed
Alexander Gordeev (a-gordeev) wrote :

> Work-around - trigger lshw with '-disable scsi' key.

What if there is a huge number of NVMe disks? IIRC, they don't use the SCSI protocol.

description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-nailgun-agent (master)

Fix proposed to branch: master
Review: https://review.openstack.org/295210

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Krzysztof Szukiełojć (kszukielojc)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-nailgun-agent (master)

Reviewed: https://review.openstack.org/295210
Committed: https://git.openstack.org/cgit/openstack/fuel-nailgun-agent/commit/?id=28dc110ea552c1bcbbd0160c29e1438962cd13f2
Submitter: Jenkins
Branch: master

commit 28dc110ea552c1bcbbd0160c29e1438962cd13f2
Author: Krzysztof Szukiełojć <email address hidden>
Date: Mon Mar 21 12:09:55 2016 +0100

    Setting timeout for calling lshw

    In some cases lshw may take too long like when we have
    lot of partitions > 600. We avoid this problem with
    setting timeout for lshw.

    Change-Id: I67748bc18023f3f6edce0cc20d4f0486877723b2
    Closes-bug: #1559167
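
The commit message above describes bounding the lshw call with a timeout. The merged change lives in the agent itself and is not reproduced here. The following is only a shell sketch of the same idea, assuming GNU coreutils `timeout`; the wrapper name `run_with_deadline`, the 60-second limit, and the empty-JSON fallback are hypothetical.

```shell
# Hypothetical wrapper: run a command, but give up after a deadline.
# $1 = deadline in seconds, remaining arguments = the command to run.
# GNU timeout exits with status 124 when the deadline expires.
run_with_deadline() {
    secs="$1"
    shift
    timeout "$secs" "$@"
}

# Usage: collect hardware info, falling back to an empty result if lshw
# fails or does not finish in time (limit and fallback are assumptions):
#   hw_json=$(run_with_deadline 60 lshw -quiet -json) || hw_json='{}'
```

Unlike the '-disable scsi' workaround, a deadline does not depend on which bus the slow devices sit on, so it also covers the NVMe case raised in the comments. The trade-off is that the caller has to cope with missing lshw data.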

Changed in fuel:
status: In Progress → Fix Committed
Changed in fuel:
status: Fix Committed → Fix Released