Mirantis OpenStack

lshw 100% CPU usage every minute

Bug #1613985 reported by Stanislav Kolenkin on 2016-08-17

This bug affects 3 people

Affects               Status     Importance  Assigned to
Mirantis OpenStack    Confirmed  High        Stanislav Kolenkin
8.0.x                 Confirmed  High        Stanislav Kolenkin
9.x                   Confirmed  High        Stanislav Kolenkin

Bug Description

I have found that an lshw process is started every minute and uses 100% of a CPU on every cloud-related node.

date && ps axuwww |grep lshw
Wed Aug 17 09:54:07 AST 2016
root 25790 111 0.0 61924 41372 ? R 09:54 0:01 /usr/bin/lshw -json
root 25794 0.0 0.0 10432 928 pts/8 S+ 09:54 0:00 grep --color=auto lshw

date && ps axuwww |grep lshw
Wed Aug 17 09:54:26 AST 2016
root 25819 0.0 0.0 10432 924 pts/8 S+ 09:54 0:00 grep --color=auto lshw

date && ps axuwww |grep lshw
Wed Aug 17 09:55:27 AST 2016
root 26283 32.0 0.0 23084 8792 ? R 09:55 0:00 /usr/bin/lshw -json
root 26287 0.0 0.0 10432 928 pts/8 S+ 09:55 0:00 grep --color=auto lshw

lshw --version
Hardware Lister (lshw) - B.02.16

A screenshot is attached.

Stanislav Kolenkin (skolenkin) wrote on 2016-08-17:

lshw.png (340.6 KiB, image/png)

Stanislav Kolenkin (skolenkin) on 2016-08-17

description: updated

Denis Meltsaykin (dmeltsaykin) wrote on 2016-08-17:

Stanislav, please add more details on the topic. What is the impact? How does it affect workloads? Are there any failures?

tags: added: area-linux

Stanislav Kolenkin (skolenkin) wrote on 2016-08-18:

The described issue can affect the response time of services because it causes CPU usage spikes.

Oleksandr Savatieiev (osavatieiev) wrote on 2016-08-23:

The lshw process runs for 10-15 seconds on one CPU and can affect response time when a service running on the same host is under high load. For example, a networking agent or Contrail's Cassandra DB would definitely be affected.

This is not critical, I agree. Still, on heavily loaded clouds this will become serious.
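To check the 10-15 second figure on a given node, one can time a single lshw run and compare wall-clock time with user and system CPU time. The sketch below assumes bash (it relies on the `time` keyword and the `TIMEFORMAT` variable); the helper name `measure_cpu` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report wall-clock and CPU time for one run of a command.
# Relies on bash's `time` keyword, which reads TIMEFORMAT when printing.
measure_cpu() {
  local TIMEFORMAT='real=%R user=%U sys=%S'
  # The command's own output is discarded; the group's stderr (where `time`
  # writes its report) is redirected to stdout so callers can capture it.
  { time "$@" >/dev/null 2>&1; } 2>&1
}

# Example on an affected node:
# measure_cpu /usr/bin/lshw -json
```

The helper prints a single `real=... user=... sys=...` line. If user plus sys is close to real, the run kept one CPU busy for its whole duration, which matches the behavior reported here.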

description: updated

Stanislav Kolenkin (skolenkin) on 2016-08-23

tags: added: customer-found

Andrii Kolesnikov (akolesnikov) on 2016-08-23

tags: added: support

Maksym Shalamov (mshalamov) wrote on 2016-08-25:

Hello,

The described issue affects only one CPU core, so it can affect the response time of services only on heavily loaded clusters. There is probably a simple workaround for this issue. Could somebody help resolve it?

Maksym Shalamov (mshalamov) wrote on 2016-08-25:

Please note that this issue has the customer-found tag.

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-08-25:

Why is the issue in Incomplete status?

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-08-25:

#10

Hi MOS Linux team, could you please take a look at the issue?

Thank you!

Ivan Suzdal (isuzdal) wrote on 2016-08-25:

#11

Dear all,
High CPU usage is absolutely normal behavior for lshw.
AFAIU, lshw is called from nailgun-agent. So, there are two options:
1) Gather the necessary information using pure Ruby code instead of calling lshw.
2) Lower the nailgun-agent priority with the nice/ionice commands in its cron task.
Yet another option is to run nailgun-agent on a dedicated CPU (taskset can help).
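Option 2 and the taskset suggestion can be combined in one wrapper. This is a minimal sketch, assuming bash and the util-linux `ionice` and `taskset` tools; the function name `run_low_priority` and the `NICE_LEVEL`/`PIN_CPU` variables are hypothetical, and the actual nailgun-agent cron command is not reproduced here.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command at low CPU and I/O priority and,
# optionally, pinned to a single CPU (option 2 plus the taskset idea).
run_low_priority() {
  # Niceness 19 is the lowest CPU scheduling priority.
  local -a cmd=(nice -n "${NICE_LEVEL:-19}")
  # I/O class 3 ("idle") only gets disk time when no other process needs it;
  # -t runs the command anyway if the I/O priority cannot be set.
  if command -v ionice >/dev/null 2>&1; then
    cmd+=(ionice -c 3 -t)
  fi
  # PIN_CPU (hypothetical) confines the command to one CPU, e.g. PIN_CPU=3.
  if [[ -n "${PIN_CPU:-}" ]]; then
    cmd+=(taskset -c "$PIN_CPU")
  fi
  "${cmd[@]}" "$@"
}
```

In a cron entry the same effect comes from prefixing the existing agent command, e.g. `nice -n 19 ionice -c 3 -t taskset -c 3 <agent command>`, where `<agent command>` stands for whatever the current cron task runs and CPU 3 is an arbitrary example. This only lowers priority: lshw still consumes the same CPU time, it just yields to busier processes.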

Dmitriy Stremkovskiy (dstremkouski) wrote on 2017-02-17:

#12

I don't think it is normal for production to spend 100% of one CPU for nothing. Why do we need to scan for hardware changes all the time? This information is needed only for monitoring use cases.
You may ask deployment engineers to run nailgun-agent before each deployment of changes to get up-to-date node data, but there is no need to run this scan every minute.
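The point that the hardware scan need not run every minute could be approximated by caching the inventory and rescanning only when it is stale. This sketch assumes bash and GNU `stat`; `cached_inventory`, `CACHE_FILE`, `MAX_AGE`, and the default cache path are all hypothetical and not part of nailgun-agent.

```shell
#!/usr/bin/env bash
# Sketch: serve a cached hardware inventory and rescan only when the cache is
# older than MAX_AGE seconds. CACHE_FILE, MAX_AGE and the default path are
# hypothetical; nailgun-agent does not work this way as-is.
cached_inventory() {
  local cache="${CACHE_FILE:-/var/cache/lshw.json}"
  local max_age="${MAX_AGE:-3600}"
  local now mtime tmp
  now=$(date +%s)
  # Fresh, non-empty cache: return it without running the scanner.
  if [[ -s "$cache" ]]; then
    mtime=$(stat -c %Y "$cache")   # GNU stat: modification time, epoch seconds
    if (( now - mtime < max_age )); then
      cat "$cache"
      return 0
    fi
  fi
  # No cache or stale cache: run the scanner (default: lshw -json).
  (( $# > 0 )) || set -- /usr/bin/lshw -json
  tmp=$(mktemp "${cache}.XXXXXX") || return 1
  if "$@" >"$tmp"; then
    mv -f "$tmp" "$cache"   # same directory, so the replace is atomic
    cat "$cache"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

With `MAX_AGE=3600`, a per-minute cron task would run lshw at most about once an hour and serve the cached JSON otherwise; a manual run before deployment could set `MAX_AGE=0` to force a fresh scan.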