Intel QAT plugin pod failed to start

Bug #1869236 reported by Mihnea Saracin
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Mingyuan Qi

Bug Description

Brief Description
-----------------
The QAT pod failed to start on a system which has the Intel QAT plugin installed.

Severity
--------
Severe

Steps to Reproduce
------------------
- Install a system with intel-qat-plugin enabled.
  e.g. have the following field in localhost.yml:
  k8s_plugins:
    intel-qat-plugin: intelqat=enabled

- Apply the intelqat label to controller-0
  e.g. system host-label-assign controller-0 intelqat=enabled

Expected Behavior
------------------
The QAT pod should start on controller-0

Actual Behavior
----------------
The Intel QAT pod has failed to start and has the following status: 'CreateContainerError'

When I described the pod, it reported the following error several times:

Warning Failed 172m kubelet, controller-0 Error: failed to create containerd container: error unpacking image: failed to extract layer sha256:3775d222400e29e763fe127f88eb1d73675fd94cd26468517afc640c7858b267: mount callback failed on /var/lib/docker/tmpmounts/containerd-mount417475360: chmod /var/lib/docker/tmpmounts/containerd-mount417475360/usr/bin/b2sum: no such file or directory: unknown

Reproducibility
---------------
100%

System Configuration
--------------------
AIO-DX

Branch/Pull Time/Commit
-----------------------
BUILD_ID="2020-03-18_04-10-00"

Test Activity
-------------
Feature Testing

Frank Miller (sensfan22)
description: updated
Ghada Khalil (gkhalil) wrote :

The same issue was also seen with the FPGA device plugin

Running "crictl pull docker.io/starlingx/intel-fpga-plugin:stx.3.0-v0.11.0-103-g4f28657" gave the following error:
[sysadmin@controller-0 ~(keystone_admin)]$ crictl pull docker.io/starlingx/intel-fpga-plugin:stx.3.0-v0.11.0-103-g4f28657
FATA[0000] pulling image failed: rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/starlingx/intel-fpga-plugin:stx.3.0-v0.11.0-103-g4f28657": unpack: failed to extract layer sha256:5a79e735a347304e1e65f005005b74741f205974ff568068cd6cf3741548b4e1: mount callback failed on /var/lib/docker/tmpmounts/containerd-mount819163650: chmod /var/lib/docker/tmpmounts/containerd-mount819163650/usr/bin/b2sum: no such file or directory: unknown

tags: added: stx.4.0 stx.containers
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
Ghada Khalil (gkhalil) wrote :

stx.4.0 / high priority: this is believed to be related to the introduction of Kata Containers and the containerd runtime in StarlingX.

Changed in starlingx:
assignee: nobody → Mingyuan Qi (myqi)
Mingyuan Qi (myqi) wrote :

The issue was introduced by containerd and was fixed upstream in https://github.com/containerd/containerd/pull/4099
I will build a containerd RPM to verify that the issue is fixed, and discuss the upgrade strategy with Shuicheng.

Lin Shuicheng (shuicheng) wrote :

Hi Mingyuan,
Please try cherry-picking the upstream patch onto the current containerd version to fix it.
We may combine the containerd upgrade with the Kubernetes/Kata upgrade.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/718886

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/718886
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=5694c7221890e67bc1c2706bbaf08aef4c2d9745
Submitter: Zuul
Branch: master

commit 5694c7221890e67bc1c2706bbaf08aef4c2d9745
Author: Mingyuan Qi <email address hidden>
Date: Fri Apr 10 05:29:19 2020 +0000

    Fix QAT plugin image pull failed

    Fix several image pull failures caused by a containerd chmod
    issue that was resolved by upstream commit e2269f2.

    Original commit message:

    handleLChmod() does not properly check that files behind the
    hardlinks exist before calling os.Chmod(). We've seen base images
    where this results in "no such file or directory" error from
    os.Chmod() when unpacking the image.

    To keep the existing logic but fix the problem, this commit simply
    skips IsNotExist error.

    Closes-bug: 1869236

    Change-Id: I2e77adbf89ad5505f2d7127a3f06ccfb805c0f24
    Signed-off-by: Mingyuan Qi <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729834

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (f/centos8)

Reviewed: https://review.opendev.org/729834
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=e4d12decc4c702e1e908d2430c7b4bc524c31c07
Submitter: Zuul
Branch: f/centos8

commit 5bb777d3725a48bc18431daedb6fd67198cd053a
Author: SidneyAn <email address hidden>
Date: Wed May 20 23:22:06 2020 +0800

    Add python-daemon to srpm list

    add python-daemon-2.2.3-7.el8.src.rpm to srpm list
    for pkg python3-daemon building.

    Change-Id: I0ad60d1083222130e72f935e08f97a8608b75880
    Story: 2007106
    Task: 39291
    Signed-off-by: SidneyAn <email address hidden>

commit fc125a7a24c00850aafd4a791a63e8e627b5ee1e
Author: Ran An <email address hidden>
Date: Thu May 14 11:41:50 2020 +0000

    Revert "Add python3-daemon required by logmgmt"

    This reverts commit 97cd7ea5c1037dd22488793ea9271462fedc4c7a.

    Change-Id: I3f09054c1546252493f8eb29dc70806829324a52

commit 97cd7ea5c1037dd22488793ea9271462fedc4c7a
Author: SidneyAn <email address hidden>
Date: Fri Apr 3 15:48:09 2020 +0800

    Add python3-daemon required by logmgmt

    The logmgmt package, upgraded to python3, requires the python3
    module "daemon", and no packages in the official CentOS 7 repo
    provide it.

    This patch refers to the python3-daemon package built by RDO
    for CentOS 8: python-daemon-2.2.3-7.el8.src.rpm

    Disable the rpm check part, which is not required in stx, to
    reduce python3 dependencies that are not supported by CentOS 7.

    Depends-on: https://review.opendev.org/#/c/727657/
    Depends-on: https://review.opendev.org/#/c/727662/
    Change-Id: Ie08ea9c7adf830ad4e8e924fa69352fb2a923a6f
    Story: 2007106
    Task: 39291
    Signed-off-by: SidneyAn <email address hidden>

commit e2dc5c2dd0042788697ade268ac5c24fe9dc2f8c
Author: Steven Webster <email address hidden>
Date: Tue May 12 10:32:21 2020 -0400

    Fix sriov device plugin image build

    Previous commit d204f10ab5 introduced a build script to assist
    in building the SR-IOV device plugin.

    The script utilizes a Makefile to build the plugin binary,
    then the image.

    Building the binary depends on go being present on the host. If it
    is not, the build will fail.

    Building the binary is actually not required, as it will be also
    done in a container as part of the 'make image', rather than copying
    the binary from the host.

    Closes-Bug: #1878224
    Change-Id: I4499ea2bbef4b3da8a154c69a07b415574517500
    Signed-off-by: Steven Webster <email address hidden>

commit d204f10ab53414dd46d5eb51fd99950d3ab70fa8
Author: Steven Webster <email address hidden>
Date: Fri Apr 24 10:59:59 2020 -0400

    Uprev the SR-IOV device plugin to the latest version

    This is intended primarily to pick up support for SR-IOV
    accelerators.

    The builder has been changed to a script model, as the
    device plugin's Dockerfile has been moved to a separate
    directory. The build-stx-images script does find this file,
    but the docker build will fail as the device plugin's
    source directory is no longer where the builder expects
    it to be. Instead, use the existing Makefile to assist
    in building the bi...

tags: added: in-f-centos8