cyborg agent failed to start privsep daemon

Bug #1873715 reported by ya.wang
This bug affects 1 person
Affects         Status         Importance   Assigned to         Milestone
kolla-ansible   Fix Released   High         Radosław Piliszek
Train           Fix Released   High         Unassigned
Ussuri          Fix Released   High         Radosław Piliszek

Bug Description

What happened:

The cyborg agent failed to scan PCI devices with 'lspci' because the privsep daemon failed to start.

The log is below:

2020-03-25 17:06:00.555 6 DEBUG oslo_service.periodic_task [-] Running periodic task AgentManager.update_available_resource run_periodic_tasks /var/lib/kolla/venv/lib/python2.7/site-packages/oslo_service/periodic_task.py:217
2020-03-25 17:06:00.556 6 DEBUG oslo_concurrency.lockutils [-] Lock "agent_resources" acquired by "cyborg.agent.resource_tracker.update_usage" :: waited 0.000s inner /var/lib/kolla/venv/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:327
2020-03-25 17:06:00.557 6 INFO oslo.privsep.daemon [-] Running privsep helper: ['sudo', 'privsep-helper', '--config-file', '/etc/cyborg/cyborg.conf', '--privsep_context', 'cyborg.privsep.sys_admin_pctxt', '--privsep_sock_path', '/tmp/tmpLXVeAf/privsep.sock']
2020-03-25 17:06:01.247 6 WARNING oslo.privsep.daemon [-] privsep log: [Errno 1] Operation not permitted
2020-03-25 17:06:01.285 6 INFO oslo.privsep.daemon [-] Spawned new privsep daemon via rootwrap
2020-03-25 17:06:01.285 6 DEBUG oslo.privsep.daemon [-] Accepted privsep connection to /tmp/tmpLXVeAf/privsep.sock __init__ /var/lib/kolla/venv/lib/python2.7/site-packages/oslo_privsep/daemon.py:335
2020-03-25 17:06:01.239 2631 INFO oslo.privsep.daemon [-] privsep daemon starting
2020-03-25 17:06:01.243 2631 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0
2020-03-25 17:06:01.246 2631 ERROR oslo.privsep.daemon [-] [Errno 1] Operation not permitted
Traceback (most recent call last):
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 557, in helper_main
    Daemon(channel, context).run()
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 367, in run
    self._drop_privs()
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 403, in _drop_privs
    capabilities.drop_all_caps_except(self.caps, self.caps, [])
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_privsep/capabilities.py", line 156, in drop_all_caps_except
    raise OSError(errno, os.strerror(errno))
OSError: [Errno 1] Operation not permitted
2020-03-25 17:06:01.296 6 DEBUG oslo_privsep.comm [-] EOF on privsep read channel _reader_main /var/lib/kolla/venv/lib/python2.7/site-packages/oslo_privsep/comm.py:148
2020-03-25 17:06:01.297 6 ERROR oslo.privsep.daemon [-] Error while sending initial PING to privsep: Premature eof waiting for privileged process: IOError: Premature eof waiting for privileged process
2020-03-25 17:06:01.297 6 ERROR oslo.privsep.daemon Traceback (most recent call last):
2020-03-25 17:06:01.297 6 ERROR oslo.privsep.daemon File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 183, in exchange_ping
2020-03-25 17:06:01.297 6 ERROR oslo.privsep.daemon reply = self.send_recv((Message.PING.value,))
2020-03-25 17:06:01.297 6 ERROR oslo.privsep.daemon File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_privsep/comm.py", line 171, in send_recv
2020-03-25 17:06:01.297 6 ERROR oslo.privsep.daemon reply = future.result()
2020-03-25 17:06:01.297 6 ERROR oslo.privsep.daemon File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_privsep/comm.py", line 110, in result
2020-03-25 17:06:01.297 6 ERROR oslo.privsep.daemon raise self.error
2020-03-25 17:06:01.297 6 ERROR oslo.privsep.daemon IOError: Premature eof waiting for privileged process
2020-03-25 17:06:01.297 6 ERROR oslo.privsep.daemon
2020-03-25 17:06:01.297 6 CRITICAL oslo.privsep.daemon [-] Privsep daemon failed to start: IOError: Premature eof waiting for privileged process
2020-03-25 17:06:01.298 6 DEBUG oslo_concurrency.lockutils [-] Lock "agent_resources" released by "cyborg.agent.resource_tracker.update_usage" :: held 0.742s inner /var/lib/kolla/venv/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:339
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task [-] Error during AgentManager.update_available_resource: FailedToDropPrivileges: Privsep daemon failed to start
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task Traceback (most recent call last):
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_service/periodic_task.py", line 222, in run_periodic_tasks
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task task(self, context)
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python2.7/site-packages/cyborg/agent/manager.py", line 91, in update_available_resource
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task self._rt.update_usage(context)
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 328, in inner
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task return f(*args, **kwargs)
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python2.7/site-packages/cyborg/agent/resource_tracker.py", line 72, in update_usage
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task acc_list.extend(acc_driver.discover())
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python2.7/site-packages/cyborg/accelerator/drivers/gpu/nvidia/driver.py", line 33, in discover
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task return sysinfo.gpu_tree()
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python2.7/site-packages/cyborg/accelerator/drivers/gpu/nvidia/sysinfo.py", line 25, in gpu_tree
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task devs = utils.discover_gpus(VENDOR_ID)
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python2.7/site-packages/cyborg/accelerator/drivers/gpu/utils.py", line 77, in discover_gpus
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task gpus = get_pci_devices(GPU_FLAGS, vendor_id)
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python2.7/site-packages/cyborg/accelerator/drivers/gpu/utils.py", line 55, in get_pci_devices
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task lspci_out = lspci_privileged()[0].split('\n')
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_privsep/priv_context.py", line 244, in _wrap
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task self.start()
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_privsep/priv_context.py", line 255, in start
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task channel = daemon.RootwrapClientChannel(context=self)
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 347, in __init__
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task super(RootwrapClientChannel, self).__init__(sock)
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 178, in __init__
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task self.exchange_ping()
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 191, in exchange_ping
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task raise FailedToDropPrivileges(msg)
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task FailedToDropPrivileges: Privsep daemon failed to start
2020-03-25 17:06:01.298 6 ERROR oslo_service.periodic_task
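
Note on the traceback: the helper runs as uid/gid 0/0 but still gets "Operation not permitted" in drop_all_caps_except. One plausible reading is that the container's capability bounding set does not include a capability the privsep context asks for. The sketch below is a minimal way to check that from inside the container. It assumes the context wants CAP_SYS_ADMIN, as the name sys_admin_pctxt suggests (the report does not list the capabilities); the helper names missing_caps and read_bounding_set are made up for illustration. The bit numbers come from linux/capability.h.

```python
# Capability bit numbers as defined in <linux/capability.h>.
CAP_BITS = {
    "CAP_CHOWN": 0,
    "CAP_DAC_OVERRIDE": 1,
    "CAP_DAC_READ_SEARCH": 2,
    "CAP_FOWNER": 3,
    "CAP_NET_ADMIN": 12,
    "CAP_SYS_ADMIN": 21,
}


def missing_caps(mask_hex, wanted):
    """Return the names in `wanted` whose bit is not set in the hex mask."""
    mask = int(mask_hex, 16)
    return sorted(
        name for name in wanted if not mask & (1 << CAP_BITS[name])
    )


def read_bounding_set(status_path="/proc/self/status"):
    """Read the CapBnd hex mask of the current process (Linux only)."""
    with open(status_path) as status:
        for line in status:
            if line.startswith("CapBnd:"):
                return line.split()[1]
    raise RuntimeError("CapBnd not found in %s" % status_path)
```

Inside the cyborg agent container, `missing_caps(read_bounding_set(), ["CAP_SYS_ADMIN"])` returns a non-empty list when the capability is unavailable. The mask 00000000a80425fb, commonly seen in unprivileged Docker containers, has no CAP_SYS_ADMIN bit.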

What you expected to happen:

The cyborg agent should scan the PCI devices successfully.

How to reproduce it:

1. Configure the cyborg agent's driver:

[agent]
enabled_drivers = nvidia_gpu_driver

2. Restart cyborg agent container.

Environment:
OS: CentOS 7.7.1908
Kernel: 3.10.0-1062.18.1.el7.x86_64
Docker version: 18.09.9
Kolla-Ansible version: 9.0.1
Docker image install type: source
Docker image distribution: CentOS
Are you using official images from Docker Hub or self built: Docker Hub

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/721139

Changed in kolla-ansible:
assignee: nobody → ya.wang (ya.wang)
status: New → In Progress
Changed in kolla-ansible:
importance: Undecided → High
summary: - cyborg agent failed to start privesp daemon
+ cyborg agent failed to start privsep daemon
tags: added: cyborg cyborg-agent privsep
Changed in kolla-ansible:
assignee: ya.wang (ya.wang) → Radosław Piliszek (yoctozepto)
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/721139
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=953edb870ee67d9f8f39307e5cb401ef8bd0348c
Submitter: Zuul
Branch: master

commit 953edb870ee67d9f8f39307e5cb401ef8bd0348c
Author: ya.wang <wang.ya@99cloud.net>
Date: Mon Apr 20 10:36:59 2020 +0800

    Fix that cyborg agent failed to start privsep daemon.

    Add privileged capability to cyborg agent.

    Change-Id: Id237df1acb1b44c4e6442b39838058be1a95fcc6
    Closes-bug: #1873715

Changed in kolla-ansible:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/722195

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/train)

Reviewed: https://review.opendev.org/722195
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=92b81eefe8f7ee5da54aed17407ff9092a980c7e
Submitter: Zuul
Branch: stable/train

commit 92b81eefe8f7ee5da54aed17407ff9092a980c7e
Author: ya.wang <wang.ya@99cloud.net>
Date: Mon Apr 20 10:36:59 2020 +0800

    Fix that cyborg agent failed to start privsep daemon.

    Add privileged capability to cyborg agent.

    Change-Id: Id237df1acb1b44c4e6442b39838058be1a95fcc6
    Closes-bug: #1873715
    (cherry picked from commit 953edb870ee67d9f8f39307e5cb401ef8bd0348c)

tags: added: in-stable-train
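
The merged commits describe the fix as adding the privileged capability to the cyborg agent. A minimal sketch for confirming that on a deployed host follows. It assumes the container is named cyborg_agent (the usual kolla naming, not stated in this report) and that the docker CLI is on the PATH. The helper names are hypothetical.

```python
import subprocess


def parse_privileged(output):
    """Interpret the text printed by docker inspect for HostConfig.Privileged."""
    value = output.strip().lower()
    if value not in ("true", "false"):
        raise ValueError("unexpected docker inspect output: %r" % output)
    return value == "true"


def container_is_privileged(name="cyborg_agent"):
    """Ask Docker whether the named container runs in privileged mode.

    The default container name is an assumption; adjust it to the deployment.
    """
    out = subprocess.check_output(
        ["docker", "inspect", "--format", "{{.HostConfig.Privileged}}", name]
    )
    return parse_privileged(out.decode())
```

If `container_is_privileged()` returns False after redeploying with a fixed kolla-ansible, the container was probably not recreated with the new settings.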