[nailgun-agent] Agent hangs on hw with huge block-dev count

Bug #1559167 reported by Aleksey Zvyagintsev on 2016-03-18
This bug affects 1 person
Affects: Fuel for OpenStack
Importance: High
Assigned to: Krzysztof Szukiełojć

Bug Description

nailgun-agent gets stuck on a system with a huge number of disks (64 physical disks):

$ lsblk |wc -l
716

After investigation, we found the root cause of the hang: the lshw call [0][1].
Log from 'strace lshw':
http://paste.openstack.org/show/491111/

lshw gets stuck on a random disk each time. The root cause is that it tries to access a partition of a multipath device, which can be inaccessible at that moment:
###
lsblk |grep -A 3 -B 3 3600144f0534f392c000056e972890031-part2
sdaw 67:0 0 15G 0 disk
`-3600144f0534f392c000056e972890031 (dm-42) 252:42 0 15G 0 mpath
  |-3600144f0534f392c000056e972890031-part1 (dm-142) 252:142 0 24M 0 part
  |-3600144f0534f392c000056e972890031-part2 (dm-143) 252:143 0 200M 0 part
  `-3600144f0534f392c000056e972890031-part3 (dm-144) 252:144 0 14.4G 0 part
sdbm 68:0 0 15G 0 disk
`-3600144f0534f392c000056e9728b0041 (dm-60) 252:60 0 15G 0 mpath
--
sddi 71:0 0 15G 0 disk
`-3600144f0534f392c000056e972890031 (dm-42) 252:42 0 15G 0 mpath
  |-3600144f0534f392c000056e972890031-part1 (dm-142) 252:142 0 24M 0 part
  |-3600144f0534f392c000056e972890031-part2 (dm-143) 252:143 0 200M 0 part
  `-3600144f0534f392c000056e972890031-part3 (dm-144) 252:144 0 14.4G 0 part
###
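
For reference, the grouping visible in the lsblk output above (sdaw and sddi are two paths to the same multipath device) can be extracted programmatically. This is only a minimal Python sketch and not part of nailgun-agent: the function name paths_by_mpath is hypothetical, and it assumes the column layout shown above, with TYPE as the last column and no MOUNTPOINT values.

```python
from collections import defaultdict

# Sample taken from the lsblk output in this report (shortened).
SAMPLE = r"""
sdaw 67:0 0 15G 0 disk
`-3600144f0534f392c000056e972890031 (dm-42) 252:42 0 15G 0 mpath
  |-3600144f0534f392c000056e972890031-part2 (dm-143) 252:143 0 200M 0 part
sdbm 68:0 0 15G 0 disk
`-3600144f0534f392c000056e9728b0041 (dm-60) 252:60 0 15G 0 mpath
--
sddi 71:0 0 15G 0 disk
`-3600144f0534f392c000056e972890031 (dm-42) 252:42 0 15G 0 mpath
"""


def paths_by_mpath(lsblk_text):
    """Map each multipath device name to the disks listed above it.

    Assumption: TYPE is the last column of every line (no mountpoints).
    """
    groups = defaultdict(list)
    current_disk = None
    for line in lsblk_text.splitlines():
        # Drop the tree-drawing prefix (spaces, '|', '`', '-').
        fields = line.lstrip(" |`-").split()
        if len(fields) < 2:
            continue  # blank lines and grep's "--" separators
        name, dev_type = fields[0], fields[-1]
        if dev_type == "disk":
            current_disk = name
        elif dev_type == "mpath" and current_disk is not None:
            if current_disk not in groups[name]:
                groups[name].append(current_disk)
    return dict(groups)


groups = paths_by_mpath(SAMPLE)
for wwid, disks in groups.items():
    print(wwid, disks)
```

On the sample, 3600144f0534f392c000056e972890031 maps to both sdaw and sddi, which matches the two paths discussed in this report.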

dd if=/dev/zero of=/dev/sdaw2
dd: writing to '/dev/sdaw2': No such process
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.052422 s, 0.0 kB/s
By contrast, the second path to the device is fine:

###
The device is also fine when accessed by its mapped name:
dd if=/dev/zero of=/dev/mapper/3600144f0534f392c000056e972890031-part2
409601+0 records in
409600+0 records out
209715200 bytes (210 MB) copied, 11.9792 s, 17.5 MB/s

Workaround: run lshw with the '-disable scsi' option.
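
A rough Python sketch of how such an invocation could be assembled is shown here. It is illustrative only: the helper name lshw_command is hypothetical, the agent itself is not written in Python, and the use of the '-json' output flag is an assumption about how lshw is called.

```python
def lshw_command(disabled_tests=("scsi",)):
    """Build an lshw argument list that skips the given tests.

    Assumption: JSON output is wanted; '-disable TEST' is repeated
    once per test that should be skipped.
    """
    cmd = ["lshw", "-json"]
    for test in disabled_tests:
        cmd += ["-disable", test]
    return cmd


print(" ".join(lshw_command()))
```

Keeping the list of disabled tests as a parameter makes it easy to skip further tests if another one turns out to hang.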

[0] https://github.com/openstack/fuel-nailgun-agent/blob/master/agent#L334
[1] https://github.com/openstack/fuel-nailgun-agent/blob/master/agent#L904-L920

Changed in fuel:
assignee: nobody → Aleksey Zvyagintsev (azvyagintsev)
assignee: Aleksey Zvyagintsev (azvyagintsev) → Fuel Python Team (fuel-python)
tags: added: team-mixed
Maciej Relewicz (rlu) on 2016-03-18
Changed in fuel:
status: New → Confirmed
Alexander Gordeev (a-gordeev) wrote :

> Work-around - trigger lshw with '-disable scsi' key.

What if there is a huge number of NVMe disks? IIRC, they don't use the SCSI protocol.

description: updated

Fix proposed to branch: master
Review: https://review.openstack.org/295210

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Krzysztof Szukiełojć (kszukielojc)
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/295210
Committed: https://git.openstack.org/cgit/openstack/fuel-nailgun-agent/commit/?id=28dc110ea552c1bcbbd0160c29e1438962cd13f2
Submitter: Jenkins
Branch: master

commit 28dc110ea552c1bcbbd0160c29e1438962cd13f2
Author: Krzysztof Szukiełojć <email address hidden>
Date: Mon Mar 21 12:09:55 2016 +0100

    Setting timeout for calling lshw

    In some cases lshw may take too long like when we have
    lot of partitions > 600. We avoid this problem with
    setting timeout for lshw.

    Change-Id: I67748bc18023f3f6edce0cc20d4f0486877723b2
    Closes-bug: #1559167
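
The idea in the commit message above, bounding how long an external command may run, can be sketched as follows. This is not the committed code (the agent is not written in Python): the helper name run_with_timeout and the timeout values are hypothetical, and the sketch only shows the general technique using Python's subprocess module.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its stdout as bytes, or None on timeout.

    subprocess.run kills the child when the timeout expires, so the
    caller can fall back to other data sources instead of hanging.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout


# Stand-in for a hanging lshw: a child process that sleeps too long.
slow = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_with_timeout(slow, 0.5))
```

In real use the command would be the lshw invocation and the timeout a value tuned to the hardware. One caveat: a process blocked in uninterruptible I/O may not exit promptly even after being killed, so the timeout bounds the wait only as far as the kernel allows.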

Changed in fuel:
status: In Progress → Fix Committed
Changed in fuel:
status: Fix Committed → Fix Released