problems in nailgun-agent because rethtool doesn't work with virtio adapters properly

Bug #1562835 reported by Alexey Galkin
This bug affects 2 people
Affects             Status         Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Committed  High        Dmitry Guryanov
Mitaka              Fix Released   High        Dmitry Guryanov

Bug Description

It looks like we use lspci with incorrect options.

On one of the Fuel slave nodes, nailgun-agent uses 99.9% of CPU and is not responding. As a result, the host has the status: offline.

(Note: environment hasn't yet been deployed)

Some useful information (top, ps aux): http://paste.openstack.org/raw/492044/
Logfile nailgun-agent.log: https://goo.gl/F7hJtw
Fuel version (shotgun2 report): http://paste.openstack.org/show/492048/

Steps To Reproduce:
1. Deploy MOS 9.0 environment and start slave nodes

Observed Result:
Slave nodes go offline, and the following warnings appear in the logs:
Can't get data from lspci. Reason: lspci exited with status 1
/usr/bin/lspci: option requires an argument -- 's'

Revision history for this message
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check performed automatically)
Please, make sure that bug description contains the following sections filled in with the appropriate data related to the bug you are describing:

actual result

expected result

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Revision history for this message
Alexander Kislitsky (akislitsky) wrote : Re: nailgun-agent consumes too many resources and is not responding

@Alexey, could you please provide steps to reproduce.

Changed in fuel:
assignee: nobody → Fuel Python Team (fuel-python)
status: New → Incomplete
Dmitry Pyzhov (dpyzhov)
tags: added: area-python
Revision history for this message
Alexey Galkin (agalkin) wrote :

@Alexander, it's an intermittent bug; there are no concrete steps to reproduce.

But most often it is reproduced on the bootstrap images.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote : Re: nailgun-agent consumes too many resources and is not responding: Can't get data from lspci.

Hi Alexander, it looks like we use lspci with incorrect options.
I've updated the description of the issue.

summary: - nailgun-agent consumes too many resources and is not responding
+ nailgun-agent consumes too many resources and is not responding: Can't
+ get data from lspci.
description: updated
Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
slava valyavskiy (slava-val-al) wrote :

It seems that we have 'nil' in ':bus_info' in some cases - https://github.com/openstack/fuel-nailgun-agent/blob/master/agent#L416
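
A minimal sketch of how such a nil could be guarded against, assuming the agent keeps per-interface data in a hash with a :bus_info key. The helper name lspci_command_for is hypothetical and is not taken from the agent's code; only the lspci -s option comes from the log above.

```ruby
# Hypothetical helper, not the agent's actual code: returns the lspci
# argument list for an interface, or nil when no bus address is known.
def lspci_command_for(iface)
  bus_info = iface[:bus_info].to_s.strip
  return nil if bus_info.empty? # nil or blank: nothing to pass to -s

  ['lspci', '-s', bus_info]
end
```

A caller would skip the lspci lookup whenever the helper returns nil, instead of running lspci with an empty -s argument.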

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Dmitry Guryanov (dguryanov)
Revision history for this message
Dmitry Guryanov (dguryanov) wrote :

There is a bug in ohai, if you run this script:

-------------------------------------
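require 'ohai/system'

# Repeatedly create a fresh Ohai instance and run all plugins,
# printing an iteration counter until the failure occurs.
x = 1
loop do
  os = Ohai::System.new
  os.all_plugins

  x += 1
  puts x
end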
--------------------------------------
it will fail after 100-500 iterations with:

http://www.paste.org/80856

Revision history for this message
Dmitry Guryanov (dguryanov) wrote :

The bug exists in ohai 6.14.0-2~u14.04+mos1 and 8.4.0-1 (from the latest Ubuntu). There is no bug in ohai 6.14.0-2.

Dmitry Pyzhov (dpyzhov)
summary: - nailgun-agent consumes too many resources and is not responding: Can't
- get data from lspci.
+ MOS version of ohai fails randomly
Changed in fuel:
assignee: Dmitry Guryanov (dguryanov) → MOS Packaging Team (mos-packaging)
tags: added: area-mos
removed: area-python
Roman Vyalov (r0mikiam)
Changed in fuel:
status: Confirmed → New
Revision history for this message
Igor Yozhikov (iyozhikov) wrote : Re: MOS version of ohai fails randomly

Colleagues, based on package ownership, I am transferring this bug to the mos-linux team.

Changed in fuel:
assignee: MOS Packaging Team (mos-packaging) → MOS Linux (mos-linux)
Dina Belova (dbelova)
Changed in fuel:
status: New → Confirmed
Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

The original bug (before comment #6) was caused by the fact that rethtool doesn't work with 'virtio' interfaces. In that case it fails to execute the 'ioctl' method [0], which leads to an empty value being passed to lspci.
From the original log [1] it can be seen that ruby constantly tries to execute lspci with a wrong (empty) parameter, and does this *before* a segfault occurs. That means that the segfault is not the root cause, and can hardly even be an issue, because to reproduce it one has to run it hundreds of times. As I said, the *real* root cause here is the virtio network interface, which isn't supported.

If anyone thinks that the segfault is important, please file a separate bug for it.

I'm passing the bug back because in any case we can't use virtio network interfaces on slave nodes, so this bug makes no real sense.

[0] https://review.fuel-infra.org/gitweb?p=packages/trusty/rethtool.git;a=blob;f=debian/patches/0001-Added-interface-driver-and-bus_info-support.patch;h=41abd6b0b88439a29f29fad96a6ab3b2a09ffad3;hb=refs/heads/master#l61
[1] https://goo.gl/F7hJtw
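
The failure mode described above can be illustrated with a short sketch. It assumes the command is assembled as a single shell string; the function build_lspci_command is hypothetical and is not the agent's actual code. The point is only that interpolating nil in Ruby yields an empty string, which leaves -s without an argument and matches the "option requires an argument -- 's'" message in the log.

```ruby
# Assumption: the lspci command is built as one shell string.
# Hypothetical function for illustration only.
def build_lspci_command(bus_info)
  "lspci -s #{bus_info}"
end

build_lspci_command('0000:00:03.0') # => "lspci -s 0000:00:03.0"
build_lspci_command(nil)            # => "lspci -s " (nothing follows -s)
```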

Changed in fuel:
assignee: MOS Linux (mos-linux) → Fuel Python Team (fuel-python)
Changed in fuel:
milestone: 9.0 → 10.0
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Dmitry Guryanov (dguryanov)
tags: added: area-python
removed: area-mos
summary: - MOS version of ohai fails randomly
+ problems in nailgun-agent because rethtool doesn't work with virtio
+ adapters properly
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-nailgun-agent (master)

Fix proposed to branch: master
Review: https://review.openstack.org/309447

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Dmitry Guryanov (dguryanov) wrote :

There are two problems in this bug:

1. nailgun agent consumed 100% CPU.
2. lspci called with wrong arguments.

The CPU was not consumed because of problems in nailgun-agent itself: the timestamps of the messages in nailgun-agent.log show that it was called no more than several times per minute. So I hope problem 1 will be fixed in the scope of https://bugs.launchpad.net/fuel/+bug/1572485 by https://review.fuel-infra.org/#/c/20149/

Problem 2 will be fixed by https://review.openstack.org/#/c/309447/

Revision history for this message
Dmitry Guryanov (dguryanov) wrote :

ETA: 06.06.2016

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
status: In Progress → Fix Committed
tags: added: on-verification
Revision history for this message
Andrey Lavrentyev (alavrentyev) wrote :

Wasn't able to reproduce it on 9.0-mos #488.

CPU is OK, as well as lspci invocation, so closing...

[root@nailgun ~]# shotgun2 short-report | head -n8
cat /etc/fuel_build_id:
 488
cat /etc/fuel_build_number:
 488
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0

tags: removed: on-verification