OpenStack override PCI pass-through failed with Ceph nil pointer

Bug #1950406 reported by Tan Sin Lam
This bug affects 1 person
Affects     Status         Importance   Assigned to            Milestone
StarlingX   Fix Released   Medium       Hugo Nicodemos Brito

Bug Description

Brief Description
-----------------
After successfully installing StarlingX V5.0.1 with OpenStack, I configured OpenStack Nova to allow PCI pass-through. Applying the override failed with a Ceph nil pointer error.

Severity
--------
Critical: Time Sensitive Networking (TSN) is not usable due to the error

Steps to Reproduce
------------------
1/ Follow the StarlingX release 5 bare metal AIO simplex installation guide and install OpenStack with StarlingX successfully.
   https://docs.starlingx.io/deploy_install_guides/r5_release/bare_metal/aio_simplex_install_kubernetes.html

2/ Enable TSN in StarlingX by following the guide https://docs.starlingx.io/operations/tsn.html

3/ Create nova-tsn-pt.yaml file to allow PCI pass-through for i210 adapter, with the following content:
conf:
nova:
  pci:
    alias:
      type: multistring
      values:
      - '{"vendor_id": "8086", "product_id": "1533","device_type":"type-PCI","name": "h210-1"}'
    passthrough_whitelist:
      type: multistring
      values:
      - '{"class_id": "8086", "product_id":"1533"}'
overrides:
  nova_compute:
    hosts:
    - conf:
        nova:
          DEFAULT:
            my_ip: 192.168.206.2
            shared_pcpu_map: '""'
            vcpu_pin_set: '"2-5"'
          libvirt:
            images_type: default
            live_migration_inbound_addr: 192.168.206.2
          pci:
            passthrough_whitelist:
              type: multistring
              values:
              - '{"class_id": "8086", "product_id": "1533"}'
          vnc:
            vncserver_listen: 0.0.0.0
            vncserver_proxyclient_address: 192.168.206.2
      name: controller-0

4/ Set PCI passthrough config
   > system helm-override-update stx-openstack nova openstack --values nova-tsn-pt.yaml

5/ Confirm that the user overrides are listed
   > system helm-override-show stx-openstack nova openstack

6/ Apply the changes
   > system application-apply stx-openstack

7/ After a few minutes, the apply fails with a Ceph nil pointer error (see the attached log)

Expected Behavior
------------------
The OpenStack PCI passthrough override should succeed without error.

Actual Behavior
----------------
The apply failed with a Ceph nil pointer error. Tested a few times with a clean installation of StarlingX.

Reproducibility
---------------
100% reproducible.

System Configuration
--------------------
Single-node system with StarlingX installed natively on an Intel NUC Hades Canyon. OAM is in the 192.168.1.x subnet and the data network is in the 192.168.2.x subnet; both interfaces use a flat network.

Branch/Pull Time/Commit
-----------------------
Used the StarlingX V5.0.1 green build of 27 Oct 2021, downloaded from http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/latest_green_build/

Last Pass
---------
None

Timestamp/Logs
--------------
Log attached. A snippet of the log follows; the unique identifier is "nil pointer evaluating interface {}.ceph".

2021-11-10 01:52:00.286 74 ERROR armada.handlers.tiller [-] [chart=openstack-nova]: Error while updating release osh-openstack-nova: grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
 status = StatusCode.UNKNOWN
 details = "render error in "nova/job-storage-init.yaml": template: nova/job-storage-init.yaml:65:24: executing "nova/job-storage-init.yaml" at <.Values.conf.ceph.enabled>: nil pointer evaluating interface {}.ceph"
 debug_error_string = "{"created":"@1636509120.286631700","description":"Error received from peer ipv4:127.0.0.1:24134","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"render error in "nova/job-storage-init.yaml": template: nova/job-storage-init.yaml:65:24: executing "nova/job-storage-init.yaml" at <.Values.conf.ceph.enabled>: nil pointer evaluating interface {}.ceph","grpc_status":2}"
>
2021-11-10 01:52:00.286 74 ERROR armada.handlers.tiller Traceback (most recent call last):
2021-11-10 01:52:00.286 74 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 421, in update_release
2021-11-10 01:52:00.286 74 ERROR armada.handlers.tiller metadata=self.metadata)
2021-11-10 01:52:00.286 74 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 923, in __call__
2021-11-10 01:52:00.286 74 ERROR armada.handlers.tiller return _end_unary_response_blocking(state, call, False, None)
2021-11-10 01:52:00.286 74 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 826, in _end_unary_response_blocking
2021-11-10 01:52:00.286 74 ERROR armada.handlers.tiller raise _InactiveRpcError(state)
2021-11-10 01:52:00.286 74 ERROR armada.handlers.tiller grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
2021-11-10 01:52:00.286 74 ERROR armada.handlers.tiller status = StatusCode.UNKNOWN
2021-11-10 01:52:00.286 74 ERROR armada.handlers.tiller details = "render error in "nova/job-storage-init.yaml": template: nova/job-storage-init.yaml:65:24: executing "nova/job-storage-init.yaml" at <.Values.conf.ceph.enabled>: nil pointer evaluating interface {}.ceph"
2021-11-10 01:52:00.286 74 ERROR armada.handlers.tiller debug_error_string = "{"created":"@1636509120.286631700","description":"Error received from peer ipv4:127.0.0.1:24134","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"render error in "nova/job-storage-init.yaml": template: nova/job-storage-init.yaml:65:24: executing "nova/job-storage-init.yaml" at <.Values.conf.ceph.enabled>: nil pointer evaluating interface {}.ceph","grpc_status":2}"
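
The error text above can be read as a failed nested lookup: the template evaluates `.Values.conf.ceph.enabled` and stops at `.ceph`, which suggests that `.Values.conf` itself was nil at render time. The following is a minimal Python sketch that only models such a lookup; it is not Helm's implementation, and `lookup`, `chart_defaults`, and `broken_values` are hypothetical names and data chosen for illustration.

```python
# Simplified model of a dotted-path values lookup; this is not Helm's code.

def lookup(values, dotted_path):
    """Walk a dotted path such as 'conf.ceph.enabled' through nested dicts."""
    current = values
    for key in dotted_path.split("."):
        if current is None:
            # Mirrors the situation in the log: the parent of `key` is nil.
            raise LookupError(f"nil parent while evaluating .{key}")
        current = current.get(key)
    return current


# Hypothetical chart defaults: `conf` is a mapping that contains `ceph`.
chart_defaults = {"conf": {"ceph": {"enabled": True}}}

# Hypothetical effective values in which `conf` has become null.
broken_values = {"conf": None}

print(lookup(chart_defaults, "conf.ceph.enabled"))  # True
try:
    lookup(broken_values, "conf.ceph.enabled")
except LookupError as err:
    print(err)  # nil parent while evaluating .ceph
```

In this model the failure is reported at `.ceph` rather than at `.enabled`, which matches the position named in the log message.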

Test Activity
-------------
Developer Testing

Workaround
----------
None. Each time the apply fails, OpenStack no longer works and I have to reinstall StarlingX. Please let me know if there is a way to revert the apply after a failure.

Tan Sin Lam (sinlam) wrote :
Ghada Khalil (gkhalil)
tags: added: stx.5.0 stx.distro.openstack
removed: v5.0.1
Tan Sin Lam (sinlam)
tags: added: stx stx.6.0
Ghada Khalil (gkhalil) wrote :

@Douglas, can someone from the openstack team have a look at this and provide some guidance to the reporter?

tags: removed: stx
Changed in starlingx:
assignee: nobody → Douglas Lopes Pereira (douglaspereira)
Tan Sin Lam (sinlam) wrote :

Thanks for the comments. I tested the 8 Dec 2021 green build for StarlingX V6, and the same issue is still there.

Hugo Nicodemos Brito (hbrito) wrote :

Hey Tan Sin, did you try to update the override using `--reuse-values`?

Tan Sin Lam (sinlam) wrote :

Hi Hugo, thanks for your suggestion. I tried updating the override with '--reuse-values', but unfortunately I got the same error. I have also attached the log 'Openstack override apply log on 17 Dec 2021'.

Tan Sin Lam (sinlam) wrote :
Lucas (lcavalca) wrote :

Hi, can you provide the output of helm-override-show for nova?

For some reason this is being overridden:

https://opendev.org/starlingx/openstack-armada-app/src/branch/master/stx-openstack-helm/stx-openstack-helm/manifests/manifest.yaml#L1201

This is why your install is not working and why Hugo thought it might be the `reset-values` parameter.

Tan Sin Lam (sinlam) wrote :

Hi Lucas, thanks for your suggestion. I ran the following commands:

> system helm-override-update stx-openstack nova openstack --values nova-tsn-pt.yaml --reuse-values
> system helm-override-show stx-openstack nova openstack

The helm-override-show output is attached. I notice that the ceph admin-keyring is null. Is this the reason for the failure?

Hugo Nicodemos Brito (hbrito) wrote :

I think the `conf` parameter is indented incorrectly in your file. Because `nova` and `overrides` sit at the same level as `conf`, nothing is nested under it, so it ends up as:

`conf: null`

The `nova` and `overrides` keys should be inside `conf`:

conf:
  nova:
    PCI:
      alias:
        type: multistring
        values:
        - '{"vendor_id": "8086", "product_id": "1533","device_type":"type-PCI","name": "h210-1"}'
      passthrough_whitelist:
        type: multistring
        values:
        - '{"class_id": "8086", "product_id":"1533"}'
  overrides:
    nova_compute:
      hosts:
      - conf:
          nova:
            DEFAULT:
              my_ip: 192.168.206.2
              shared_pcpu_map: '""'
              vcpu_pin_set: '"2-5"'
            libvirt:
              images_type: default
              live_migration_inbound_addr: 192.168.206.2
            PCI:
              passthrough_whitelist:
                type: multistring
                values:
                - '{"class_id": "8086", "product_id": "1533"}'
            vnc:
              vncserver_listen: 0.0.0.0
              vncserver_proxyclient_address: 192.168.206.2
        name: controller-0
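
The indentation problem described above can be caught before running `helm-override-update`. The following is a minimal sketch that flags top-level keys with no value and no indented children, which YAML reads as null. It is an indentation heuristic only, not a YAML parser, and the function name and the shortened sample documents are hypothetical.

```python
def empty_top_level_keys(yaml_text):
    """Return top-level keys that have no value and no nested children.

    Heuristic only: it inspects indentation and does not parse YAML.
    """
    lines = [ln for ln in yaml_text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    empty = []
    for i, line in enumerate(lines):
        is_top_level = not line[0].isspace()
        if is_top_level and line.rstrip().endswith(":"):
            nxt = lines[i + 1] if i + 1 < len(lines) else ""
            # Children are either indented or a list item starting with "- ".
            has_children = bool(nxt) and (nxt[0].isspace() or nxt.startswith("- "))
            if not has_children:
                empty.append(line.rstrip()[:-1])
    return empty


# Shortened stand-ins for the two layouts discussed in this report.
wrong = "conf:\nnova:\n  pci:\n    alias: {}\noverrides:\n  nova_compute: {}\n"
right = "conf:\n  nova:\n    pci:\n      alias: {}\n  overrides:\n    nova_compute: {}\n"

print(empty_top_level_keys(wrong))  # ['conf']
print(empty_top_level_keys(right))  # []
```

A real YAML parser would give a more reliable answer; the heuristic is used here only to keep the sketch free of third-party dependencies.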

Tan Sin Lam (sinlam) wrote :

Thank you, Hugo, for detecting the error. I made the changes and retried the apply. This time it reached 63% (after more than half an hour) and then aborted with the failure "Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-horizon)). These pods were not ready=['horizon-f49759975-sqgxr']"

The apply log is attached. Hopefully this can be solved soon. Thanks for any suggestion!

Tan Sin Lam (sinlam) wrote :

Hi, I installed a fresh StarlingX V6 based on the 8 Dec 2021 green build and retried the apply. This time the apply was successful (previously I had restored from my backup before applying). Then I followed the steps at https://docs.starlingx.io/operations/tsn.html to enable PCI passthrough in the flavor property.

> openstack flavor set m1.medium --property pci_passthrough:alias=h210-1:1

Subsequently I launched a new VM using the m1.medium flavor, but I got the OpenStack error "No valid host was found. There are not enough hosts available." Does anyone have any idea what caused the problem? With other flavors it works. This was tested on an Intel NUC Hades Canyon, which has an i210 network adapter that is used for the data network.

If this issue is unrelated to the bug, please close this bug report. Thank you, and I hope to hear any suggestions!

Hugo Nicodemos Brito (hbrito) wrote :

Hey. The error "No valid host was found." occurs when nova cannot find an available host with the resources required by the flavor (such as memory, isolated cores, or required devices). The message is deliberately generic so that cloud details are not exposed to the end user. You can try to find more information in the logs, such as the nova-compute log.

I don't know whether anything else is missing from the documentation, but you can take a look at the OpenStack documentation (https://docs.openstack.org/nova/xena/admin/pci-passthrough.html).
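
To illustrate the "required devices" case mentioned above: a flavor that requests a PCI alias can only be scheduled on a host that reports a device matching that alias. The following is a simplified model of that comparison, not nova's scheduler code. The `matching_devices` helper and the host inventory are hypothetical, and the inventory's product ID is a placeholder.

```python
import json


def matching_devices(alias_json, host_devices):
    """Return host devices whose vendor and product IDs match the alias."""
    alias = json.loads(alias_json)
    return [
        dev for dev in host_devices
        if dev["vendor_id"] == alias["vendor_id"]
        and dev["product_id"] == alias["product_id"]
    ]


# Alias string in the same shape as the one used in nova-tsn-pt.yaml.
alias = ('{"vendor_id": "8086", "product_id": "1533", '
         '"device_type": "type-PCI", "name": "h210-1"}')

# Hypothetical host inventory; "abcd" is a placeholder product ID.
host_devices = [
    {"address": "0000:02:00.0", "vendor_id": "8086", "product_id": "abcd"},
]

if not matching_devices(alias, host_devices):
    print("no device matches the alias, so no host can satisfy the flavor")
```

In this model a single mismatched ID is enough to leave the candidate list empty, which from the user's side looks the same as any other resource shortage.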

Tan Sin Lam (sinlam) wrote :

Thanks, Hugo, for the suggestion. I went through the OpenStack documentation and checked that VT-d is enabled in the BIOS and that intel_iommu=on is set in the kernel parameters. The same error still occurs when launching a VM. I checked the nova-compute log, but nothing is printed when the error occurs.

My 'nova-tsn-pt.yaml' is exactly as above, so I am not sure what is going wrong. Let me know if you have any other suggestions.

Merry Christmas!

Tan Sin Lam (sinlam) wrote :

Hi, I managed to solve the "No valid host was found" issue. The "product_id" should be "157b" instead of "1533". I found the product_id with "lspci -nn | grep Net". Now a new VM detects the hardware queues of the i210 adapter for my TSN testing.

I am very happy now; thank you, Hugo and Lucas, for your help. May I know how to close this bug report? I wish you both a Merry Christmas and a prosperous New Year!
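
The `lspci -nn` check mentioned above prints the vendor and product IDs in square brackets at the end of each device line. The following is a minimal sketch for extracting them so they can be compared with the values in the override file. The sample line follows the usual `lspci -nn` layout but is illustrative, not output captured from the reporter's system, and `pci_ids` is a hypothetical helper.

```python
import re

# Matches a "[vendor:product]" pair as printed by `lspci -nn`.
PCI_ID = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]")


def pci_ids(lspci_line):
    """Return (vendor_id, product_id) from one `lspci -nn` line, or None."""
    matches = PCI_ID.findall(lspci_line.lower())
    # The ID pair is the last bracketed group that contains a colon.
    return matches[-1] if matches else None


# Illustrative line in the usual `lspci -nn` layout, not captured output.
sample = ("02:00.0 Ethernet controller [0200]: Intel Corporation "
          "I210 Gigabit Network Connection [8086:157b] (rev 03)")

print(pci_ids(sample))  # ('8086', '157b')
```

The device class in the first bracket (`[0200]`) has no colon, so the pattern skips it and returns only the vendor and product pair.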

Hugo Nicodemos Brito (hbrito) wrote :

Hey, Tan Sin. I'll submit a review updating the documentation for this bug. Thanks for the info and Happy New Year!

OpenStack Infra (hudson-openstack) wrote : Fix proposed to docs (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/docs/+/823090

Changed in starlingx:
status: New → In Progress
Changed in starlingx:
assignee: Douglas Lopes Pereira (douglaspereira) → Hugo Nicodemos Brito (hbrito)
OpenStack Infra (hudson-openstack) wrote : Fix merged to docs (master)

Reviewed: https://review.opendev.org/c/starlingx/docs/+/823090
Committed: https://opendev.org/starlingx/docs/commit/7e85dbdc4b4d41922980c68e01bb18347c8903d8
Submitter: "Zuul (22348)"
Branch: master

commit 7e85dbdc4b4d41922980c68e01bb18347c8903d8
Author: Hugo Brito <email address hidden>
Date: Tue Dec 28 10:34:45 2021 -0300

    Fix Time Sensitive Networking doc

    - Correct indentation of nova-tsn-pt.yaml file
    - Add # to the comment of `openstack flavor` command
    - Add parameter `--reuse-values` to `helm-override-update` command
    - Add command to check the correct `product_id`
    - Minor typos

    Closes-bug: 1950406

    Signed-off-by: Hugo Brito <email address hidden>
    Change-Id: Ic442cf49f9415f8aa3e413f6ef3d730153225b59

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.docs
Ghada Khalil (gkhalil) wrote :

Issue was addressed by a doc update. Given the doc repo has not been branched for r/stx.6.0, a cherry-pick is not required. The change will be available in stx.6.0 once the doc branching is done.
