Execution of lstopo command inside nailgun-agent is hanging when hw:cpu_realtime specs are defined for VM

Bug #1742886 reported by Anton Rodionov on 2018-01-12
This bug affects 1 person
Affects: Fuel for OpenStack
Importance: Critical
Assigned to: Oleksiy Molchanov

Bug Description

The lstopo command in nailgun-agent hangs when VMs booted with realtime scheduling are running on the host.

nailgun-agent is executed on every deployed Compute host every 30 seconds. Among the commands it runs is 'lstopo', which is called from the get_numa_topology function.

lstopo goes one by one through all existing CPUs of the system, including the ones allocated to tenant VMs by the vcpu_pin_set parameter in nova.conf. For each core, it sets its own affinity and binds itself to that core.
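The per-CPU binding described above can be illustrated with a minimal Python sketch. It is not lstopo's code: the function name visit_each_cpu is hypothetical, and it relies on os.sched_setaffinity, which is available on Linux only.

```python
import os


def visit_each_cpu(cpus):
    """Bind the current process to each CPU in turn; return the CPUs visited.

    Hypothetical illustration only, not how lstopo is implemented.
    """
    original = os.sched_getaffinity(0)  # remember the starting affinity mask
    visited = []
    try:
        for cpu in sorted(cpus):
            # After this call the process may only run on `cpu`. If a
            # higher-priority realtime task occupies that CPU, this process
            # may not get scheduled again, which matches the reported hang.
            os.sched_setaffinity(0, {cpu})
            visited.append(cpu)
    finally:
        # Restore the original mask so the caller is not left pinned.
        os.sched_setaffinity(0, original)
    return visited
```

Calling visit_each_cpu(os.sched_getaffinity(0)) walks only the CPUs the process is already allowed to use, which keeps the sketch safe to run on an ordinary host.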

The problem occurs when there are VMs with the following extra_specs defined in the Nova flavor:

hw:cpu_realtime=yes
hw:cpu_realtime_mask=<vCPU number to set emulator pin>

As a result of those extra specs, the following is defined for the VM in the libvirt XML.

Example:

<vcpu placement='static'>8</vcpu>
<cputune>
<vcpupin vcpu='0' cpuset='10'/>
<vcpupin vcpu='1' cpuset='11'/>
<vcpupin vcpu='2' cpuset='12'/>
<vcpupin vcpu='3' cpuset='13'/>
<vcpupin vcpu='4' cpuset='14'/>
<vcpupin vcpu='5' cpuset='15'/>
<vcpupin vcpu='6' cpuset='16'/>
<vcpupin vcpu='7' cpuset='17'/>
<emulatorpin cpuset='10'/>
<vcpusched vcpus='1-7' scheduler='fifo' priority='1'/>
</cputune>
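To see which host CPUs end up running realtime vCPU threads, the example above can be parsed with a short Python sketch. The helper names expand_cpu_list and realtime_host_cpus are hypothetical, and the parser only handles plain lists and ranges such as '1-7' or '1,3'; libvirt's '^' exclusion syntax is not covered.

```python
import xml.etree.ElementTree as ET

# The <cputune> fragment from the example above.
EXAMPLE = """
<cputune>
  <vcpupin vcpu='0' cpuset='10'/>
  <vcpupin vcpu='1' cpuset='11'/>
  <vcpupin vcpu='2' cpuset='12'/>
  <vcpupin vcpu='3' cpuset='13'/>
  <vcpupin vcpu='4' cpuset='14'/>
  <vcpupin vcpu='5' cpuset='15'/>
  <vcpupin vcpu='6' cpuset='16'/>
  <vcpupin vcpu='7' cpuset='17'/>
  <emulatorpin cpuset='10'/>
  <vcpusched vcpus='1-7' scheduler='fifo' priority='1'/>
</cputune>
"""


def expand_cpu_list(spec):
    """Expand a CPU list such as '1-3,5' into a sorted list of ints."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def realtime_host_cpus(cputune_xml):
    """Return the host CPUs that realtime-scheduled vCPUs are pinned to."""
    root = ET.fromstring(cputune_xml)
    # Map each vCPU number to the host CPUs it is pinned to.
    pins = {
        int(pin.get("vcpu")): expand_cpu_list(pin.get("cpuset"))
        for pin in root.iter("vcpupin")
    }
    host_cpus = set()
    for sched in root.iter("vcpusched"):
        # 'fifo' and 'rr' are the realtime scheduler policies.
        if sched.get("scheduler") in ("fifo", "rr"):
            for vcpu in expand_cpu_list(sched.get("vcpus")):
                host_cpus.update(pins.get(vcpu, []))
    return sorted(host_cpus)
```

For the example, realtime_host_cpus(EXAMPLE) yields host CPUs 11 through 17. CPU 10, which hosts vCPU 0 and the emulator threads, is not realtime.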

Because of the vcpusched element in the libvirt XML (vcpus='1-7', scheduler='fifo'), the tasks executed by the VM's guest OS have a higher priority than the lstopo command, which is trying to bind itself to a pCPU belonging to that VM. As a result, the command waits indefinitely. nailgun-agent then hangs on the Compute host as well, and the Compute is shown as Offline in the 'fuel node' printout.

The following is requested:
'lstopo' is executed against an already provisioned/deployed node. This is not needed at that point, because the CPU topology will not change once the Compute is deployed. It should be enough to run it once, during server discovery only. Once the node is provisioned, lstopo is no longer needed. The same applies to other commands inside nailgun-agent, such as 'lshw'.
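The "run it once" request can be sketched as a small file-backed cache. This is a hypothetical Python illustration, not nailgun-agent code; cached_probe, the cache path, and the probe callable are all assumptions made for the example.

```python
import json
import os


def cached_probe(cache_path, probe):
    """Return the cached probe result, running `probe` only if no cache exists.

    `probe` is any zero-argument callable returning JSON-serializable data,
    for example a function that runs a topology command and parses its output.
    """
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            return json.load(fh)
    result = probe()
    # Write to a sibling file first, then rename, so that a crash mid-write
    # never leaves a truncated cache behind.
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump(result, fh)
    os.replace(tmp_path, cache_path)
    return result
```

With such a cache, the expensive probe would run during discovery and later agent runs would only read the stored result. The cache would have to be invalidated if the hardware ever changed.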

MOS 9.2 build 596, November 2017

Changed in fuel:
milestone: none → 9.2-mu-4
assignee: nobody → MOS Maintenance (mos-maintenance)
importance: Undecided → Critical
status: New → Confirmed
Changed in fuel:
assignee: MOS Maintenance (mos-maintenance) → Oleksiy Molchanov (omolchanov)

Fix proposed to branch: 9.0/mitaka
Change author: Oleksiy Molchanov <email address hidden>
Review: https://review.fuel-infra.org/37669

Changed in fuel:
status: Confirmed → In Progress

Reviewed: https://review.fuel-infra.org/37669
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: 2d5d8887daec40e9f6f088d1d9dc81f6a064ae45
Author: Oleksiy Molchanov <email address hidden>
Date: Wed Jan 24 16:11:39 2018

Set timeout for lstopo

Closes-Bug: 1742886
Change-Id: Ic80cc341e16067e33d81847be30147cdc68c5a66
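
The committed change itself is not reproduced in this report. As a rough illustration of the idea in the commit title, a command can be run under a timeout as in the Python sketch below; run_with_timeout and the timeout value are hypothetical.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run `cmd` and return its stdout as text, or None on timeout or failure."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising.
        return None
    except OSError:
        # The command is missing or not executable.
        return None
    if done.returncode != 0:
        return None
    return done.stdout.decode()
```

A caller such as run_with_timeout(["lstopo", "--of", "xml"], 10) would then treat None as "topology unavailable" instead of blocking the whole agent run. Whether a starved child exits promptly after being killed is not something this sketch establishes.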

Changed in fuel:
status: In Progress → Fix Committed
Dmitry (dtsapikov) wrote :

Verified on 9.2+mu4

Changed in fuel:
status: Fix Committed → Fix Released

Fix proposed to branch: 9.0/mitaka
Change author: Oleksiy Molchanov <email address hidden>
Review: https://review.fuel-infra.org/38271

Changed in fuel:
status: Fix Released → Fix Committed
Changed in fuel:
milestone: 9.2-mu-4 → 9.2-mu-6

Reviewed: https://review.fuel-infra.org/38271
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: 8a0cb23ca02f3a27874e1d36cee724884af07207
Author: Oleksiy Molchanov <email address hidden>
Date: Wed Apr 11 09:45:57 2018

Put lstopo output to temp file

Change-Id: I53bd8964759f02ca1c60891525007368c5199661
Closes-Bug: 1742886
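
The second change is likewise not reproduced here. One way to direct a command's output to a temporary file and read it back afterwards is sketched below in Python; run_to_tempfile is a hypothetical helper and may differ from what the commit actually does.

```python
import subprocess
import tempfile


def run_to_tempfile(cmd, timeout_s):
    """Run `cmd` with stdout sent to a temp file; return the text or None."""
    with tempfile.TemporaryFile() as out:
        try:
            # The child writes straight to the file, so the parent does not
            # have to drain a pipe while it waits for the timeout.
            subprocess.run(
                cmd,
                stdout=out,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
                check=True,
            )
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
            return None
        out.seek(0)  # rewind before reading what the child wrote
        return out.read().decode()
```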

Dmitry (dtsapikov) wrote :

Verified on 9.2+mu6

Changed in fuel:
status: Fix Committed → Fix Released