[nailgun agent] untimely lshw command run observation

Bug #1554970 reported by Swann Croiset on 2016-03-09
This bug affects 5 people
Affects             Importance  Assigned to
Fuel for OpenStack  High        Alexey Elagin
8.0.x               High        Sergii Rizvan
Mitaka              High        Sergii Rizvan
Newton              High        Sergii Rizvan

Bug Description

Description
============
I frequently observe the lshw process consuming a full CPU core. On my env it runs every minute for ~20 seconds (real hardware, MOS 8, controller node; I presume the same is true for all nodes).
Why is the command run so frequently?

root 30762 0.0 0.0 4440 652 ? Ss 15:03 0:00 \_ /bin/sh -c flock -w 0 -o /var/lock/nailgun-agent.lock -c "/usr/bin/nailgun-agent 2>&1 | tee -a /var/log/nailgun-agent.log | /usr/bin/logger -t nailgun-agent"
root 30763 0.0 0.0 5896 608 ? S 15:03 0:00 \_ flock -w 0 -o /var/lock/nailgun-agent.lock -c /usr/bin/nailgun-agent 2>&1 | tee -a /var/log/nailgun-agent.log | /usr/bin/logger -t nailgun-agent
root 30764 0.0 0.0 4440 628 ? S 15:03 0:00 \_ /bin/sh -c /usr/bin/nailgun-agent 2>&1 | tee -a /var/log/nailgun-agent.log | /usr/bin/logger -t nailgun-agent
root 30765 0.7 0.0 175408 23296 ? Sl 15:03 0:00 \_ ruby /usr/bin/nailgun-agent
root 33668 96.1 0.1 501008 475844 ? R 15:03 0:17 | \_ /usr/bin/lshw -json
root 30766 0.0 0.0 5916 696 ? S 15:03 0:00 \_ tee -a /var/log/nailgun-agent.log
root 30767 0.0 0.0 5908 704 ? S 15:03 0:00 \_ /usr/bin/logger -t nailgun-agent

Expected result/impact
====================
IMHO, this is definitely overkill: a waste of CPU time, and useless because this kind of information is nearly static (hardware doesn't change that often). The nailgun agent should run it once, or at least much less frequently if some use case requires 'fresh' hardware information. And that is before considering possible interference with the useful workload, which could be impacted as well.

Could somebody please confirm this and fix it if it is accurate?

Steps to reproduce
==================
Deploy an env with Fuel MOS 8.

Swann Croiset (swann-w) wrote :
Changed in fuel:
milestone: none → 9.0
assignee: nobody → Fuel Python Team (fuel-python)
tags: added: area-python
tags: added: module-nailgun-agent
removed: area-python
tags: added: area-python
Changed in fuel:
status: New → Confirmed
Changed in fuel:
importance: Undecided → Medium
Ilya Shakhat (shakhat) wrote :

On scale-lab hardware the average CPU utilization for lshw is 62%. Here are the details collected by atop (10-minute interval):

$ atop -r atop_20160316 -P PRC | grep lshw | head -n 22
PRC node-293 1458086421 2016/03/16 00:00:21 20 28722 (lshw) R 100 1592 46 0 120 0 0 5 0
PRC node-293 1458086441 2016/03/16 00:00:41 20 28722 (lshw) E 100 2492 88 0 0 0 0 0 0
PRC node-293 1458086501 2016/03/16 00:01:41 20 37072 (lshw) R 100 1283 35 0 120 0 0 16 0
PRC node-293 1458086521 2016/03/16 00:02:01 20 37072 (lshw) E 100 2240 58 0 0 0 0 0 0
PRC node-293 1458086541 2016/03/16 00:02:21 20 40720 (lshw) R 100 758 19 0 120 0 0 31 0
PRC node-293 1458086561 2016/03/16 00:02:41 20 40720 (lshw) E 100 2300 62 0 0 0 0 0 0
PRC node-293 1458086601 2016/03/16 00:03:21 20 7711 (lshw) R 100 538 23 0 120 0 0 7 0
PRC node-293 1458086621 2016/03/16 00:03:41 20 7711 (lshw) R 100 1934 64 0 120 0 0 2 0
PRC node-293 1458086641 2016/03/16 00:04:01 20 7711 (lshw) E 100 2529 91 0 0 0 0 0 0
PRC node-293 1458086681 2016/03/16 00:04:41 20 15749 (lshw) R 100 1846 55 0 120 0 0 23 0
PRC node-293 1458086701 2016/03/16 00:05:01 20 15749 (lshw) E 100 676 41 0 0 0 0 0 0
PRC node-293 1458086741 2016/03/16 00:05:41 20 22965 (lshw) R 100 1399 45 0 120 0 0 8 0
PRC node-293 1458086761 2016/03/16 00:06:01 20 22965 (lshw) E 100 2596 103 0 0 0 0 0 0
PRC node-293 1458086781 2016/03/16 00:06:21 20 26855 (lshw) R 100 1176 25 0 120 0 0 5 0
PRC node-293 1458086801 2016/03/16 00:06:41 20 26855 (lshw) E 100 2236 43 0 0 0 0 0 0
PRC node-293 1458086841 2016/03/16 00:07:21 20 34339 (lshw) R 100 373 9 0 120 0 0 7 0
PRC node-293 1458086861 2016/03/16 00:07:41 20 34339 (lshw) R 100 1936 62 0 120 0 0 28 0
PRC node-293 1458086881 2016/03/16 00:08:01 20 34339 (lshw) E 100 2478 86 0 0 0 0 0 0
PRC node-293 1458086921 2016/03/16 00:08:41 20 1665 (lshw) R 100 1034 35 0 120 0 0 30 0
PRC node-293 1458086941 2016/03/16 00:09:01 20 1665 (lshw) E 100 2547 96 0 0 0 0 0 0
PRC node-293 1458086961 2016/03/16 00:09:21 20 5307 (lshw) R 100 1254 29 0 120 0 0 30 0
PRC node-293 1458086981 2016/03/16 00:09:41 20 5307 (lshw) E 100 2215 56 0 0 0 0 0 0
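
The arithmetic behind a figure like this can be sketched in Ruby. This is a minimal sketch, not the method actually used for the measurement: it assumes the column layout visible in the listing above (sample interval in the 6th field, process name in the 8th, clock ticks per second in the 10th, user-mode ticks in the 11th), and the function names are hypothetical.

```ruby
# Sum user-mode CPU time for one process name from `atop -P PRC` lines.
# Field positions follow the listing above (whitespace-separated, 0-based):
#   7 = "(name)", 9 = clock ticks per second,
#   10 = user-mode ticks consumed during the sample interval.
def user_cpu_seconds(prc_lines, name)
  prc_lines.sum do |line|
    fields = line.split
    next 0.0 unless fields[0] == 'PRC' && fields[7] == "(#{name})"

    fields[10].to_f / fields[9].to_f
  end
end

# Share of one CPU core used over a window of `window_seconds`, in percent.
def cpu_percent(prc_lines, name, window_seconds)
  100.0 * user_cpu_seconds(prc_lines, name) / window_seconds
end

# The first two lines of the listing above: one lshw run within one minute.
sample = [
  'PRC node-293 1458086421 2016/03/16 00:00:21 20 28722 (lshw) R 100 1592 46 0 120 0 0 5 0',
  'PRC node-293 1458086441 2016/03/16 00:00:41 20 28722 (lshw) E 100 2492 88 0 0 0 0 0 0'
]
puts cpu_percent(sample, 'lshw', 60).round(1)
```

Applied to all 22 lines of the listing with a 600-second window, the same arithmetic gives roughly 62% (37432 user-mode ticks, i.e. about 374 seconds), which is consistent with the figure quoted above. System-mode ticks (the next field) are ignored in this sketch.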

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Andrey Danin (gcon-monolake)
Changed in fuel:
assignee: Andrey Danin (gcon-monolake) → Alexey Elagin (aelagin)

(This check performed automatically)
Please, make sure that bug description contains the following sections filled in with the appropriate data related to the bug you are describing:

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Dmitry Pyzhov (dpyzhov) wrote :

Permanent high load on all cluster nodes is a high-priority issue.

Changed in fuel:
importance: Medium → High
Changed in fuel:
milestone: 9.0 → 10.0
Alexey Elagin (aelagin) on 2016-05-12
Changed in fuel:
status: Confirmed → Incomplete
Ilya Shakhat (shakhat) wrote :

@aelagin - what kind of info do you need? Please specify before changing the status to Incomplete.

Ilya Shakhat (shakhat) wrote :

The following one-liner can be used to measure the overall CPU consumption of the lshw process on an OpenStack node:

# atop -r /var/log/atop/atop_current -P PRC | grep "\b00:..:" | grep lshw | awk '{s+=$11} END {print s/3600}'

The performance depends significantly on the number of CPU sockets, cores and PCI devices:

 * KVM host (1 core, 6 net ifs, 1 RAM slot):

root@node-1:~# atop -r /var/log/atop/atop_current -P PRC | grep "\b00:..:" | grep lshw | awk '{s+=$11} END {print s/3600}'
5.90667

 * HW host (2 sockets, 48 cores, 8 net ifs, 16 RAM slots):

root@node-77:~# atop -r /var/log/atop/atop_current -P PRC | grep "\b00:..:" | grep lshw | awk '{s+=$11} END {print s/3600}'
55.0083

This means the lshw process is running, on average, about half of the time, which is hard to justify.

Taking the above into account, I am returning the bug to the Confirmed state with High priority.

Changed in fuel:
status: Incomplete → Confirmed
Dmitry Pyzhov (dpyzhov) wrote :

We are in the HCF phase, so unfortunately we have to move the bug to the maintenance update.

tags: added: move-to-mu
Alexey Elagin (aelagin) wrote :
Changed in fuel:
status: Confirmed → In Progress
Alexey Elagin (aelagin) wrote :

So, I changed the lshw parameters: it now gathers only network and storage information, and it is much faster.

Alexey Elagin (aelagin) wrote :

Oops, this fix breaks the JSON output, so I'll make changes later.

Dmitry Pyzhov (dpyzhov) on 2016-08-04
tags: added: 9.1-proposed

@aelagin
It is a known issue that JSON output is broken with -C:
https://github.com/lyonel/lshw/commit/67921217b14b8ae4e16bb63a93587a0106d93624

So can we use this instead:
lshw -xml -sanitize -C network -C storage

and convert the resulting valid XML to JSON using Ruby?
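
A conversion along these lines could look like the following. This is only a sketch under stated assumptions: it uses REXML and the json library rather than anything the agent is known to use, the helper name is hypothetical, and the XML is a simplified hand-written sample in the general shape of lshw output, not captured from a real run.

```ruby
require 'json'
require 'rexml/document'

# Convert each <node> element of lshw-style XML into a flat Hash:
# XML attributes (id, class, ...) plus the text of simple child elements.
def lshw_xml_to_hashes(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//node').map do |node|
    entry = {}
    node.attributes.each { |name, value| entry[name] = value }
    node.elements.each do |child|
      # Keep leaf elements only; nested containers are skipped in this sketch.
      next if child.name == 'node' || child.has_elements?

      entry[child.name] = child.text.to_s.strip
    end
    entry
  end
end

# Hand-written illustrative input, not real lshw output.
SAMPLE_XML = <<~XML
  <list>
    <node id="network" class="network">
      <description>Ethernet interface</description>
      <logicalname>eth0</logicalname>
    </node>
    <node id="storage" class="storage">
      <description>SATA controller</description>
    </node>
  </list>
XML

puts JSON.pretty_generate(lshw_xml_to_hashes(SAMPLE_XML))
```

The result is a flat list of devices, which is a different shape from the nested tree that `lshw -json` prints, so any code consuming the data would need to be adapted accordingly.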

tags: added: customer-found sla1

sla1 for 9.0-updates

A customer reported that on their env the process also consumes RAM:
http://paste.openstack.org/show/570217/

Alexey Elagin (aelagin) wrote :

It's better to use a cache file with the full lshw output and run lshw only if the cache file is absent.
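
The caching idea can be sketched as follows. This is a minimal illustration, not the agent's code: the function name, the cache path and the injectable collector are assumptions made for the example, and the demonstration uses a stand-in instead of the real lshw binary.

```ruby
require 'tmpdir'

# Return hardware data from cache_path, running the (expensive) collector
# only when the cache file does not exist yet.
def cached_hw_info(cache_path, collector = -> { `lshw -json` })
  return File.read(cache_path) if File.exist?(cache_path)

  output = collector.call
  # Write to a temporary name first so a partial file is never read as a cache.
  tmp = "#{cache_path}.tmp"
  File.write(tmp, output)
  File.rename(tmp, cache_path)
  output
end

# Demonstration with a stand-in collector instead of the real lshw binary.
calls = 0
fake_lshw = lambda do
  calls += 1
  '{"id": "example-node"}'
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'lshw.json')
  2.times { cached_hw_info(path, fake_lshw) }
  puts "collector ran #{calls} time(s)"
end
```

The sketch never refreshes the cache, so something else would have to delete the file when the hardware changes.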

@aelagin

Looks like you are right.
This is the output from customer's compute node:
root@compute-0-10:~# time lshw -xml -sanitize -C network -C storage >/dev/null

real 0m26.769s
user 0m25.924s
sys 0m0.804s
root@compute-0-10:~# time lshw -xml -sanitize >/dev/null

real 0m24.949s
user 0m24.172s
sys 0m0.736s
root@compute-0-10:~#

So "-C network -C storage" doesn't improve the run time much.

But how will a cached file handle hardware changes?

To disable the lshw run, avoid calling _get_pci_dev_list.
To do this, change this line
https://github.com/openstack/fuel-nailgun-agent/blob/9.0/agent#L337
(:pci_devices => _get_pci_dev_list,)
to an empty list:
:pci_devices => [],

Ivan Ponomarev (ivanzipfer) wrote :

I think this problem should be solved using a udev script.
Something like:
ACTION=="add|remove", RUN+="/path/some_script.sh"

Fix proposed to branch: master
Review: https://review.openstack.org/383021

@aelagin, so change 383013 actually disables the lshw run on deployed nodes and leaves it only on bootstrap, correct?

Change abandoned by Alexey Elagin (<email address hidden>) on branch: master
Review: https://review.openstack.org/335065

Change abandoned by Alexey Elagin (<email address hidden>) on branch: master
Review: https://review.openstack.org/383013

Change abandoned by Alexey Elagin (<email address hidden>) on branch: master
Review: https://review.openstack.org/383021

Alexey Elagin (aelagin) wrote :

@alexbozhenko yep, it's true. We don't need to collect hw data from deployed nodes.

Reviewed: https://review.openstack.org/383071
Committed: https://git.openstack.org/cgit/openstack/fuel-nailgun-agent/commit/?id=b8a2f95f0f85806de885da4d9e0483296b6c0def
Submitter: Jenkins
Branch: master

commit b8a2f95f0f85806de885da4d9e0483296b6c0def
Author: Alexey Elagin <email address hidden>
Date: Thu Oct 6 19:03:02 2016 +0300

    Change _get_pci_dev_list func

    Add hostname check and run lshw only on bootstrap nodes.
    Add sanitize param to lshw to hide any ip,mac etc

    Change-Id: I7739da68ab059178787ff0fe2418a54717684750
    Closes-Bug: #1554970

Changed in fuel:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/388002
Committed: https://git.openstack.org/cgit/openstack/fuel-nailgun-agent/commit/?id=d7f11710ce4c600c63c4359187b3f72ee34843a2
Submitter: Jenkins
Branch: stable/8.0

commit d7f11710ce4c600c63c4359187b3f72ee34843a2
Author: Alexey Elagin <email address hidden>
Date: Thu Oct 6 19:03:02 2016 +0300

    Change _get_pci_dev_list func

    Add hostname check and run lshw only on bootstrap nodes.
    Add sanitize param to lshw to hide any ip,mac etc

    Change-Id: I7739da68ab059178787ff0fe2418a54717684750
    Closes-Bug: #1554970
    (cherry picked from commit d93f99480c551a6d82a56e49ac661bd081cbddfe)

Reviewed: https://review.openstack.org/391898
Committed: https://git.openstack.org/cgit/openstack/fuel-nailgun-agent/commit/?id=af1e5da094c2975230da29f808b26645231841d4
Submitter: Jenkins
Branch: master

commit af1e5da094c2975230da29f808b26645231841d4
Author: Sergii Rizvan <email address hidden>
Date: Mon Oct 31 17:50:53 2016 +0200

    Add .chomp to a system output operation

    In the _get_pci_dev_list method adding .chomp is needed to the
    `cat /etc/nailgun_systemtype` operation in order to make a correct
    comparison with 'bootstrap' string.

    Change-Id: Id2fdc4c7b7bd7604c43803da594480bf865cf1cb
    Related-Bug: #1554970
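
Why the .chomp matters can be shown with a small Ruby sketch. It is an illustration only: it reads the marker with File.read instead of the `cat` call the commit mentions, assumes the file content ends with a newline (as the commit message implies), and uses a hypothetical helper name and a temporary file in place of /etc/nailgun_systemtype.

```ruby
require 'tempfile'

# Read the system type marker and strip the trailing newline before comparing.
def bootstrap_node?(marker_path)
  return false unless File.exist?(marker_path)

  File.read(marker_path).chomp == 'bootstrap'
end

Tempfile.create('nailgun_systemtype') do |f|
  f.write("bootstrap\n")
  f.flush
  raw = File.read(f.path)
  puts raw == 'bootstrap'       # comparison without chomp
  puts bootstrap_node?(f.path)  # comparison with chomp
end
```

Without chomp the raw content is "bootstrap\n", so the first comparison prints false and the second prints true.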

Reviewed: https://review.openstack.org/387960
Committed: https://git.openstack.org/cgit/openstack/fuel-nailgun-agent/commit/?id=1dd3dd64bff80dd60c53881adc9c378fd6660bf6
Submitter: Jenkins
Branch: stable/mitaka

commit 1dd3dd64bff80dd60c53881adc9c378fd6660bf6
Author: Alexey Elagin <email address hidden>
Date: Thu Oct 6 19:03:02 2016 +0300

    Change _get_pci_dev_list func

    Add hostname check and run lshw only on bootstrap nodes.
    Add sanitize param to lshw to hide any ip,mac etc

    Change-Id: I7739da68ab059178787ff0fe2418a54717684750
    Closes-Bug: #1554970
    (cherry picked from commit b8a2f95f0f85806de885da4d9e0483296b6c0def)

tags: added: on-verification

Verified on 9.2 snapshot #516.

tags: removed: on-verification

Reviewed: https://review.openstack.org/405294
Committed: https://git.openstack.org/cgit/openstack/fuel-nailgun-agent/commit/?id=27ad2910a41c564fc4c4dd1df5b20ef55f89eb9a
Submitter: Jenkins
Branch: stable/newton

commit 27ad2910a41c564fc4c4dd1df5b20ef55f89eb9a
Author: Alexey Elagin <email address hidden>
Date: Thu Oct 6 19:03:02 2016 +0300

    Change _get_pci_dev_list func

    Add hostname check and run lshw only on bootstrap nodes.
    Add sanitize param to lshw to hide any ip,mac etc

    Change-Id: I7739da68ab059178787ff0fe2418a54717684750
    Closes-Bug: #1554970
    (cherry picked from commit 1dd3dd64bff80dd60c53881adc9c378fd6660bf6)

tags: added: on-verification

Verified on MOS 8.0 + MU4 updates.

tags: removed: on-verification

This issue was fixed in the openstack/fuel-nailgun-agent 11.0.0.0rc1 release candidate.

tags: added: on-verification

Verified on 10.0 build #1569.

tags: removed: on-verification
Changed in fuel:
status: Fix Committed → Fix Released