drmgr command fails during the scale-up test on Novalink System (Brazos)

Bug #1696434 reported by bugproxy on 2017-06-07
This bug affects 2 people
Affects                             Importance  Assigned to
The Ubuntu-power-systems project    High        Unassigned
powerpc-utils (Ubuntu)              High        Ubuntu on IBM Power Systems Bug Triage
  Xenial                            High        Steve Langasek
  Yakkety                           High        Steve Langasek
  Zesty                             High        Steve Langasek

Bug Description

[SRU Justification]
drmgr fails intermittently when adding devices to the system.

[Test case]
To be completed by IBM, who have access to the hardware.
1. Run a scale test of launching 1000 VMs on a Novalink system.
2. Observe that some of the deployments fail with the following error:
kernel I/O op failed, rc = 26 len = 26.
3. Install powerpc-utils from -proposed
4. Run the scale test again.
5. Observe that all the deployments succeed.

[Regression potential]
This change, cherry-picked from upstream, corrects faulty handling of a 0 return code from syscalls. Regression potential appears to be minimal.
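For readers unfamiliar with this class of bug, here is a toy Python model of stale errno usage. It is not drmgr code. FakeLibc, op_failed_buggy and op_failed_fixed are hypothetical names, and the "buggy" check is only an assumption about the kind of mistake the upstream patches address. The property it illustrates is real: a successful call leaves errno untouched, so errno is only meaningful right after a call has reported failure.

```python
import errno


class FakeLibc:
    """Toy stand-in for libc: errno is written only when a call fails."""

    def __init__(self):
        self.errno = 0

    def write(self, data, fail_with=None):
        if fail_with is not None:
            self.errno = fail_with   # a failing call records the reason
            return -1
        return len(data)             # success leaves errno untouched


def op_failed_buggy(libc, rc, length):
    # Hypothetical faulty check: trusts errno even though the call succeeded.
    return rc != length or libc.errno != 0


def op_failed_fixed(libc, rc, length):
    # The return code decides; errno would be read only after a failure.
    return rc != length


libc = FakeLibc()
libc.write(b"probe", fail_with=errno.ENODEV)   # an earlier, unrelated failure
name = b"U9119.MHE.1085B07-V1-C1030"
rc = libc.write(name)                          # this write succeeds

print(op_failed_buggy(libc, rc, len(name)))    # stale errno is misread as a failure
print(op_failed_fixed(libc, rc, len(name)))    # return code alone reports success
```

In this model the stale ENODEV left over from the earlier call makes a fully successful write look like a failure, which would also explain why such a failure shows up only intermittently.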

Problem:

During the scale-up test to 1000 VMs, 20 deploys failed because of the following command failure:

Command /usr/sbin/pvmdrmgr drmgr -c slot -s 'U9119.MHE.1085B07-V1-C1030' -a -w 3 returned 19. Additional messages: /usr/sbin/pvmdrmgr drmgr -c slot -s 'U9119.MHE.1085B07-V1-C1030' -a -w 3
Validating I/O DLPAR capability...yes.
kernel I/O op failed, rc = 26 len = 26.

I have been looking through the logs on this system to piece together what happens when the DLPAR add failures occur. From what I am seeing, we try to DLPAR add a virtual network device and get an error when adding the device to the system.

> ########## May 17 05:18:00 2017 ##########
> drmgr: -c slot -s U9119.MHE.1085B07-V1-C1030 -a -w 3
> Validating I/O DLPAR capability...yes.
> Getting node types 0x00000003
> Could not find DRC property group in path: /proc/device-tree/ibm,serial.
> Acquiring drc index 0x30000406
> get-sensor for 30000406: 0, 2
> Setting allocation state to 'alloc usable'
> Setting indicator state to 'unisolate'
> Configuring connector for drc index 30000406
> Adding device-tree node /proc/device-tree/vdevice/l-lan@30000406
> ofdt update: add_node /vdevice/l-lan@30000406 ibm,loc-code 30 U9119.MHE.1085B07-V1-C1030-T1
> Getting node types 0x00000003
> performing kernel op for U9119.MHE.1085B07-V1-C1030, file is /sys/bus/pci/slots/control/add_slot
> kernel I/O op failed, rc = 26 len = 26.
> No such device
> Releasing drc index 0x30000406
> get-sensor for 30000406: 0, 1
> Setting isolation state to 'isolate'
> Setting allocation state to 'alloc unusable'
> get-sensor for 30000406: 0, 2
> drc_index 30000406 sensor-state: 2
> Resource is not available to the partition.
> Removing device-tree node /proc/device-tree/vdevice/l-lan@30000406
> ########## May 17 05:20:11 2017 ##########

From the drmgr log, you can see that we get an ENODEV return code when performing the kernel operation to add the device to the system.

> performing kernel op for U9119.MHE.1085B07-V1-C1030, file is /sys/bus/pci/slots/control/add_slot
> kernel I/O op failed, rc = 26 len = 26.
> No such device
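The numbers in this failure line can be checked against the command itself. The following is a minimal Python sketch, assuming that rc is the return value of the write to add_slot and that len is the number of bytes drmgr tried to write (the log does not state this explicitly). parse_kernel_op_failure is a hypothetical helper name.

```python
import errno
import re

# Failure line as it appears in the drmgr log quoted above.
FAILURE_RE = re.compile(r"kernel I/O op failed, rc = (-?\d+) len = (\d+)")


def parse_kernel_op_failure(line):
    """Return (rc, length) from a drmgr failure line, or None if it does not match."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


slot = "U9119.MHE.1085B07-V1-C1030"
rc, length = parse_kernel_op_failure("> kernel I/O op failed, rc = 26 len = 26.")

# Assumption: len is the size of the slot name written to add_slot.
print(length == len(slot))   # does the logged len match the slot name length?
print(rc == length)          # was anything left unwritten?
print(errno.ENODEV)          # numeric value of ENODEV on this platform
```

The slot name is 26 characters long, the same as the logged len, and rc equals len. If the assumption above holds, the write itself completed in full, which is hard to reconcile with a genuine ENODEV from the kernel. On Linux ENODEV is 19, the same value the pvmdrmgr command returned.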

This indicates that the rpadlpar_io kernel module was unable to find the device in the device tree. This does not seem right, because earlier in the drmgr log we add the device to the device tree. Additionally, the drmgr code validates that the add succeeded by retrieving the newly added device node from the device tree as a sanity check. No failures are reported for this.

> Adding device-tree node /proc/device-tree/vdevice/l-lan@30000406
> ofdt update: add_node /vdevice/l-lan@30000406 ibm,loc-code 30 U9119.MHE.1085B07-V1-C1030-T1
> Getting node types 0x00000003

I started scale-up testing and the deploys are going fine so far. I will post a comment here if I see further drmgr failures.

Patches have been submitted upstream.

https://groups.google.com/forum/#!topic/powerpc-utils-devel/GNEi65WBwkQ

and

https://groups.google.com/forum/#!topic/powerpc-utils-devel/hJfUb5wYPsE

bugproxy (bugproxy) on 2017-06-07
tags: added: architecture-ppc64le bugnameltc-154853 severity-high targetmilestone-inin16043
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → powerpc-ibm-utils (Ubuntu)
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in powerpc-ibm-utils (Ubuntu):
status: New → Confirmed

------- Comment From <email address hidden> 2017-06-08 11:05 EDT-------
Can we get the fix into both 16.04 (Xenial) and 17.04 (Zesty) please? Thanks.

Changed in ubuntu-power-systems:
status: New → Confirmed
Breno Leitão (breno-leitao) wrote :

I created a new version of the package with this fix. It is version 1.3.1-2ubuntu0.3, and you can find it in my PPA:

https://launchpad.net/~breno-leitao/+archive/ubuntu/powerpc-ibm-utils

Please install the package using:

 $ sudo add-apt-repository ppa:breno-leitao/powerpc-ibm-utils
 $ sudo apt-get update
 $ sudo apt-get install powerpc-ibm-utils

If it fixes the problem, I will ask someone to sponsor this package.

Amartey Pearson (apearson) wrote :

Looking at the diff, it appears you brought in:

0005-in-kernel-dlpar.patch
Subject: [PATCH] drmgr: Start using in-kernel DLPAR functionality for
 cpu/memory

I believe you'll need to bring in the following as well:
https://groups.google.com/forum/#!topic/powerpc-utils-devel/LkrB6tIvs_Y
[PATCH] drmgr: Disable use of in-kernel cpu hotplug

It appears that CPU hotplug is not enabled in the stock 4.4 kernel, so without that second commit CPU hotplug will fail with this new package.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-06-12 15:36 EDT-------
*** Bug 155320 has been marked as a duplicate of this bug. ***

Steve Langasek (vorlon) on 2017-06-19
affects: powerpc-ibm-utils (Ubuntu) → powerpc-utils (Ubuntu)
Steve Langasek (vorlon) wrote :

You ask for fixing this in 17.04. Does that mean this is not part of the 1.3.2 release? We need this fixed in later series first before we can SRU it into 16.04.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package powerpc-utils - 1.3.2-1ubuntu2

---------------
powerpc-utils (1.3.2-1ubuntu2) artful; urgency=medium

  * d/p/in-kernel-dlpar.patch: fix FTBFS.

 -- Steve Langasek <email address hidden> Mon, 19 Jun 2017 14:18:01 -0700

Changed in powerpc-utils (Ubuntu):
status: Confirmed → Fix Released
Steve Langasek (vorlon) on 2017-06-19
description: updated
Steve Langasek (vorlon) on 2017-06-20
Changed in powerpc-utils (Ubuntu Xenial):
milestone: none → ubuntu-16.04.3
assignee: nobody → Steve Langasek (vorlon)
Changed in powerpc-utils (Ubuntu Yakkety):
assignee: nobody → Steve Langasek (vorlon)
Changed in powerpc-utils (Ubuntu Zesty):
assignee: nobody → Steve Langasek (vorlon)
Changed in powerpc-utils (Ubuntu Xenial):
status: New → In Progress
Changed in powerpc-utils (Ubuntu Yakkety):
status: New → In Progress
Changed in powerpc-utils (Ubuntu Zesty):
status: New → In Progress

Hello bugproxy, or anyone else affected,

Accepted powerpc-utils into zesty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/powerpc-utils/1.3.2-1ubuntu2~17.04 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in powerpc-utils (Ubuntu Zesty):
status: In Progress → Fix Committed
tags: added: verification-needed
Changed in powerpc-utils (Ubuntu Yakkety):
status: In Progress → Fix Committed
Brian Murray (brian-murray) wrote :

Hello bugproxy, or anyone else affected,

Accepted powerpc-utils into yakkety-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/powerpc-utils/1.3.2-1ubuntu2~16.10 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in powerpc-utils (Ubuntu Xenial):
status: In Progress → Fix Committed
Brian Murray (brian-murray) wrote :

Hello bugproxy, or anyone else affected,

Accepted powerpc-utils into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/powerpc-utils/1.3.1-2ubuntu0.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in ubuntu-power-systems:
status: Confirmed → Fix Committed
Breno Leitão (breno-leitao) wrote :

Amartey, could you please test this package from the -proposed archive? The release of the package to the official archive depends on that verification.

Amartey Pearson (apearson) wrote :

Tested version 1.3.1-2ubuntu0.3 in 16.04.2 successfully.

On Wed, Jun 21, 2017 at 04:03:11PM -0000, Amartey Pearson wrote:
> Tested version 1.3.1-2ubuntu0.3 in 16.04.2 successfully.

Thanks. Are you also able to test this on 16.10 and 17.04 (for both bugs)?
This bug report specifically asked for this to be fixed in 17.04, and we
need to have verification separately for the binary package in each release.

Steve Langasek (vorlon) on 2017-06-21
tags: added: verification-done-xenial

As part of a recent change in the Stable Release Update verification policy, we would like to inform you that, for a bug to be considered verified for a given release, a verification-done-$RELEASE tag needs to be added to the bug, where $RELEASE is the name of the series of the package that was tested (e.g. verification-done-xenial). Please note that the global 'verification-done' tag can no longer be used for this purpose.

Thank you!

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package powerpc-utils - 1.3.1-2ubuntu0.3

---------------
powerpc-utils (1.3.1-2ubuntu0.3) xenial; urgency=medium

  * d/p/Improve-perf-of-drmgr-lsslot-with-large-num-of-virt.patch:
    Fix scaling with large number of virtual adapters. LP: #1692837
  * d/p/drmgr-Stale-errno-usage-corrections.patch,
    d/p/drmgr-Correct-errno-usage-use-in-validate_paltform.patch,
    d/p/drmgr-Correct-errno-usage-in-init_cpu_info.patch:
    Fix failures during scale-up test on Novalink System. LP: #1696434

 -- Breno Leitao <email address hidden> Fri, 09 Jun 2017 10:39:15 -0400

Changed in powerpc-utils (Ubuntu Xenial):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for powerpc-utils has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Changed in powerpc-utils (Ubuntu):
importance: Undecided → High
Changed in powerpc-utils (Ubuntu Xenial):
importance: Undecided → High
Changed in powerpc-utils (Ubuntu Yakkety):
importance: Undecided → High
Changed in powerpc-utils (Ubuntu Zesty):
importance: Undecided → High
Manoj Iyer (manjo) on 2017-07-19
Changed in ubuntu-power-systems:
importance: Undecided → High
Manoj Iyer (manjo) on 2017-07-24
tags: added: triage-g
Amartey Pearson (apearson) wrote :

Tested version 1.3.2-1ubuntu2 in 17.04 successfully.

I did not test in 16.10 as that is now EOL.

tags: added: verification-done-zesty
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package powerpc-utils - 1.3.2-1ubuntu2~17.04

---------------
powerpc-utils (1.3.2-1ubuntu2~17.04) zesty; urgency=medium

  * d/p/Improve-perf-of-drmgr-lsslot-with-large-num-of-virt.patch:
    Fix scaling with large number of virtual adapters. LP: #1692837
  * d/p/drmgr-Stale-errno-usage-corrections.patch,
    d/p/drmgr-Correct-errno-usage-use-in-validate_paltform.patch,
    d/p/drmgr-Correct-errno-usage-in-init_cpu_info.patch:
    Fix failures during scale-up test on Novalink System. LP: #1696434

 -- Steve Langasek <email address hidden> Mon, 19 Jun 2017 14:18:01 -0700

Changed in powerpc-utils (Ubuntu Zesty):
status: Fix Committed → Fix Released
Manoj Iyer (manjo) on 2017-08-25
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released