2016-03-18 15:06:37 |
Alexander Gordeev |
description |
nailgun-agent gets stuck on a system with a huge number of disks (64 physical disks):
$ lsblk |wc -l
716
After investigation, we found that the root cause of the hang is the lshw call [0][1].
Strace log for 'strace lshw':
http://paste.openstack.org/show/491111/
lshw gets stuck on a random disk each time. The root cause is that it tries to reach a partition of a multipath device through a path that can be inaccessible at that moment:
###
lsblk |grep -A 3 -B 3 3600144f0534f392c000056e972890031-part2
sdaw 67:0 0 15G 0 disk
`-3600144f0534f392c000056e972890031 (dm-42) 252:42 0 15G 0 mpath
|-3600144f0534f392c000056e972890031-part1 (dm-142) 252:142 0 24M 0 part
|-3600144f0534f392c000056e972890031-part2 (dm-143) 252:143 0 200M 0 part
`-3600144f0534f392c000056e972890031-part3 (dm-144) 252:144 0 14.4G 0 part
sdbm 68:0 0 15G 0 disk
`-3600144f0534f392c000056e9728b0041 (dm-60) 252:60 0 15G 0 mpath
--
sddi 71:0 0 15G 0 disk
`-3600144f0534f392c000056e972890031 (dm-42) 252:42 0 15G 0 mpath
|-3600144f0534f392c000056e972890031-part1 (dm-142) 252:142 0 24M 0 part
|-3600144f0534f392c000056e972890031-part2 (dm-143) 252:143 0 200M 0 part
`-3600144f0534f392c000056e972890031-part3 (dm-144) 252:144 0 14.4G 0 part
###
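The listing above shows the same map (dm-42) under two disks, sdaw and sddi. A minimal sketch of pulling those path names out of lsblk text follows; the function name paths_for_map is hypothetical, and it assumes the TYPE column is the last field on every line (no mount points), as in the output above.

```shell
# Print the disks (paths) that sit directly above a given multipath map.
# Reads lsblk text from stdin in the layout shown in this report.
# Assumption: TYPE is the last field on each line (no mount point column).
paths_for_map() {
    wwid="$1"
    awk -v wwid="$wwid" '
        # Remember the most recent whole-disk line.
        $NF == "disk" { disk = $1 }
        # An mpath line naming the map belongs to that disk.
        $NF == "mpath" && index($0, wwid) && disk != "" { print disk }
    '
}
```

Usage: lsblk | paths_for_map 3600144f0534f392c000056e972890031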
dd if=/dev/zero of=/dev/sdaw2
dd: writing to '/dev/sdaw2': No such process
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.052422 s, 0.0 kB/s
The second path of the device, by contrast, is fine:
###
The device is also fine when accessed by its mapped name:
dd if=/dev/zero of=/dev/mapper/3600144f0534f392c000056e972890031-part2
409601+0 records in
409600+0 records out
209715200 bytes (210 MB) copied, 11.9792 s, 17.5 MB/s
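Note that the dd checks above write zeros to the device and destroy its contents. A read-only probe is a different, non-destructive way to test whether a path responds; the sketch below is one possible form, not what was run for this report. The name probe_read and the 5-second default limit are illustrative assumptions.

```shell
# Read one 4 KiB block from a device (or file) without modifying it.
# Returns 0 if the read succeeds, non-zero on an I/O error or timeout.
# probe_read and the default 5-second limit are illustrative choices.
probe_read() {
    dev="$1"
    limit="${2:-5}"
    # timeout(1) from GNU coreutils stops dd if it runs longer than the limit.
    timeout "$limit" dd if="$dev" of=/dev/null bs=4096 count=1 2>/dev/null
}
```

Usage: probe_read /dev/sdaw2 || echo "path not readable". The limit is best-effort: a process blocked in uninterruptible I/O may not react to the signal sent by timeout.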
Workaround: run lshw with the '-disable scsi' option.
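A minimal sketch of the workaround as a shell wrapper, with a time limit added as an extra guard. The names run_lshw, LSHW_BIN and LSHW_TIMEOUT are hypothetical and are not part of nailgun-agent; the 120-second default is an arbitrary assumption.

```shell
# Run lshw with the SCSI scan disabled, under a time limit.
# run_lshw, LSHW_BIN and LSHW_TIMEOUT are hypothetical names for illustration.
run_lshw() {
    # Any extra arguments are passed through to lshw unchanged.
    timeout "${LSHW_TIMEOUT:-120}" "${LSHW_BIN:-lshw}" -disable scsi "$@"
}
```

Usage: run_lshw -json. If the limit is reached, timeout exits with status 124, which the caller can check to tell a hang from a normal lshw failure.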
[0] https://github.com/openstack/fuel-nailgun-agent/blob/master/agent#L334
[1] https://github.com/openstack/fuel-nailgun-agent/blob/master/agent#L904-L920 |
|