[podman] neutron doesn't launch dhcp-agent

Bug #1799484 reported by Cédric Jeanneret
This bug affects 1 person
Affects   Status         Importance   Assigned to    Milestone
tripleo   Fix Released   High         Alex Schultz

Bug Description

Hello,

# Context:
I'm working on the podman integration, and the issue is located on the undercloud.

I'm using a tweaked t-h-t with the following patches embedded:
- https://review.openstack.org/#/c/611801/
- https://review.openstack.org/#/c/606975/

and a tweaked puppet-tripleo with the following patches:
- https://review.openstack.org/#/c/606095/

SELinux is enforcing.

# Issue:
Apparently, neutron has some issues starting its children, especially a dhcp-agent:

2018-10-23 14:29:57.370 64679 DEBUG oslo.privsep.daemon [-] privsep: request[140651565326160]: (3, 'neutron.privileged.agent.linux.ip_lib.get_link_attributes', (u'tap936f789b-55', u'qdhcp-a0d28768-54d5-4761-8227-e00fd4e61391'), {}) loop /usr/lib/python2.7/site-packages/oslo_privsep/daemon.py:443
2018-10-23 14:29:57.399 64679 DEBUG oslo.privsep.daemon [-] privsep: Exception during request[140651565326160]: Network interface tap936f789b-55 not found in namespace qdhcp-a0d28768-54d5-4761-8227-e00fd4e61391. loop /usr/lib/python2.7/site-packages/oslo_privsep/daemon.py:449
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 445, in loop
    reply = self._process_cmd(*msg)
  File "/usr/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 428, in _process_cmd
    ret = func(*f_args, **f_kwargs)
  File "/usr/lib/python2.7/site-packages/oslo_privsep/priv_context.py", line 209, in _wrap
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/neutron/privileged/agent/linux/ip_lib.py", line 290, in get_link_attributes
    link = _run_iproute_link("get", device, namespace)[0]
  File "/usr/lib/python2.7/site-packages/neutron/privileged/agent/linux/ip_lib.py", line 154, in _run_iproute_link
    idx = _get_link_id(device, namespace)
  File "/usr/lib/python2.7/site-packages/neutron/privileged/agent/linux/ip_lib.py", line 148, in _get_link_id
    raise NetworkInterfaceNotFound(device=device, namespace=namespace)
NetworkInterfaceNotFound: Network interface tap936f789b-55 not found in namespace qdhcp-a0d28768-54d5-4761-8227-e00fd4e61391.
2018-10-23 14:29:57.408 64679 DEBUG oslo.privsep.daemon [-] privsep: reply[140651565326160]: (5, 'neutron.privileged.agent.linux.ip_lib.NetworkInterfaceNotFound', (u'Network interface tap936f789b-55 not found in namespace qdhcp-a0d28768-54d5-4761-8227-e00fd4e61391.',)) loop /usr/lib/python2.7/site-packages/oslo_privsep/daemon.py:456
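
When sifting through many of these DEBUG lines, it can help to pull out which privileged function was called and with which device and namespace. The following is only a minimal sketch: the regular expression and the helper name parse_privsep_request are assumptions made for illustration, based solely on the layout of the request lines quoted above.

```python
import ast
import re

# Matches the "request[...]" payload as it appears in the DEBUG lines above.
REQUEST_RE = re.compile(r"privsep: request\[\d+\]: (\(.*\)) loop ")


def parse_privsep_request(line):
    """Return (function_name, args, kwargs) from a request log line, or None."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None
    # The payload is a Python literal: (msg_type, func_name, args, kwargs).
    _msg_type, func_name, args, kwargs = ast.literal_eval(match.group(1))
    return func_name, args, kwargs


sample = (
    "2018-10-23 14:29:57.370 64679 DEBUG oslo.privsep.daemon [-] privsep: "
    "request[140651565326160]: "
    "(3, 'neutron.privileged.agent.linux.ip_lib.get_link_attributes', "
    "(u'tap936f789b-55', u'qdhcp-a0d28768-54d5-4761-8227-e00fd4e61391'), {}) "
    "loop /usr/lib/python2.7/site-packages/oslo_privsep/daemon.py:443"
)
func_name, args, kwargs = parse_privsep_request(sample)
print(func_name.rsplit(".", 1)[-1], args)
```

Lines that do not contain a request payload simply yield None, so the helper can be mapped over a whole log file.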

This situation leads to another issue:
fb4070404f6a docker.io/tripleomaster/centos-binary-neutron-dhcp-agent:965941f1e62cef16967e7a7cd6d98263e52acb62_0989b280 ip netns exec qdhcp... 49 seconds ago Exited (1) 49 seconds ago neutron-dnsmasq-qdhcp-a0d28768-54d5-4761-8227-e00fd4e61391 false

The container id changes as apparently neutron tries to start a new one on top of it (as it detects it's crashed).
The container log shows this:
setting the network namespace "qdhcp-a0d28768-54d5-4761-8227-e00fd4e61391" failed: Invalid argument

Still related to this problem, I get multiple occurrences of this error message:
dhcp-agent.log:2018-10-23 14:30:53.543 59430 ERROR neutron.agent.linux.external_process [-] dnsmasq for dhcp with uuid a0d28768-54d5-4761-8227-e00fd4e61391 not found. The process should not have died

On the host, I can't find the tap interface:
[root@undercloud neutron]# ip a | grep -c tap
0

which means it doesn't exist for some reason.

Now, I said SELinux is enforcing - but apparently, it doesn't play any role in this issue: there is nothing in the audit.log

Can anyone point me to where I should look to understand which container/service should create the "tap" interface? I'm pretty sure it's the root cause of the issues I'm facing.

Thank you!

Cédric Jeanneret (cjeanner) wrote :

Small update:
in fact the tap interface exists; it's managed by OVS and not shown by the `ip a` command:

[root@undercloud log]# ovs-vsctl list-ports br-int
int-br-ctlplane
tap936f789b-55
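
To spot such ports quickly, the two views can be compared by script. This is only a sketch under assumptions: the function names are made up for illustration, the input strings are expected to be the raw output of `ovs-vsctl list-ports <bridge>` and `ip -o link`, and the `ip -o link` sample below is hypothetical, not taken from this undercloud.

```python
def parse_ovs_ports(list_ports_output):
    """Port names from `ovs-vsctl list-ports <bridge>` output (one per line)."""
    return [line.strip() for line in list_ports_output.splitlines() if line.strip()]


def parse_ip_link_names(ip_link_output):
    """Interface names from `ip -o link` output ("N: name: <FLAGS> ...")."""
    names = set()
    for line in ip_link_output.splitlines():
        parts = line.split(": ", 2)
        if len(parts) >= 2:
            # Drop the "@peer" suffix that some device types carry.
            names.add(parts[1].split("@", 1)[0])
    return names


def ports_missing_from_host(list_ports_output, ip_link_output):
    """OVS ports that are not visible as links in the current namespace."""
    host_links = parse_ip_link_names(ip_link_output)
    return [p for p in parse_ovs_ports(list_ports_output) if p not in host_links]


ovs_output = "int-br-ctlplane\ntap936f789b-55\n"
# Hypothetical `ip -o link` output, for illustration only.
ip_output = (
    "1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN\n"
    "7: int-br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP\n"
)
print(ports_missing_from_host(ovs_output, ip_output))
```

A port reported here is known to OVS but not visible as a link in the namespace where `ip` was run, which matches the observation above.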

I suspect an issue accessing the openvswitch socket - not sure though. Nothing seems to raise an error about that so far.

Cédric Jeanneret (cjeanner) wrote :

Running the ovs-vsctl command from within neutron containers does return the tap device. So not an issue with access to the openvswitch data.

Still digging...

Cédric Jeanneret (cjeanner) wrote :

OK, the "real" issue is apparently here:

2018-10-23 17:07:49.342 58988 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-3c92f030-45b4-4927-a83f-b2141a18877f', 'dnsmasq', '--no-hosts', '--no-resolv', '--except-interface=lo', '--pid-file=/var/lib/neutron/dhcp/3c92f030-45b4-4927-a83f-b2141a18877f/pid', '--dhcp-hostsfile=/var/lib/neutron/dhcp/3c92f030-45b4-4927-a83f-b2141a18877f/host', '--addn-hosts=/var/lib/neutron/dhcp/3c92f030-45b4-4927-a83f-b2141a18877f/addn_hosts', '--dhcp-optsfile=/var/lib/neutron/dhcp/3c92f030-45b4-4927-a83f-b2141a18877f/opts', '--dhcp-leasefile=/var/lib/neutron/dhcp/3c92f030-45b4-4927-a83f-b2141a18877f/leases', '--dhcp-match=set:ipxe,175', '--bind-interfaces', '--interface=tape2d00740-20', '--dhcp-range=set:tag0,192.168.24.0,static,255.255.255.0,86400s', '--dhcp-option-force=option:mtu,1500', '--dhcp-lease-max=256', '--conf-file=', '--domain=localdomain'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103

The dhcp-agent container wants to spawn that command through another container, the very one that exits with a non-0 status.

A "podman inspect <said container>" shows that, apparently, no SELinux labels are added to the volumes, which could be the root cause of the issue, especially for these two:

            {
                "destination": "/run/netns",
                "type": "bind",
                "source": "/run/netns",
                "options": [
                    "shared",
                    "rbind",
                    "rw"
                ]
            },
            {
                "destination": "/var/lib/neutron",
                "type": "bind",
                "source": "/var/lib/neutron",
                "options": [
                    "rbind",
                    "rw",
                    "rprivate"
                ]
            },

Both /var/lib/neutron and /run/netns should be mounted with "shared,z", since these two are shared between different containers, hence between different SELinux contexts/namespaces.
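
A check like this can be run over the mount entries of every wrapper container. The sketch below is only illustrative: missing_mount_options is a hypothetical helper, the `mounts` list simply repeats the two entries quoted above, and it is an assumption that the inspect output would list "z" among the options when the volume is relabelled.

```python
def missing_mount_options(mounts, required):
    """Map destination -> sorted list of required options absent from that mount."""
    report = {}
    for mount in mounts:
        wanted = required.get(mount["destination"])
        if wanted is None:
            continue  # no expectation recorded for this destination
        absent = sorted(set(wanted) - set(mount.get("options", [])))
        if absent:
            report[mount["destination"]] = absent
    return report


# The two mount entries quoted above.
mounts = [
    {"destination": "/run/netns", "type": "bind", "source": "/run/netns",
     "options": ["shared", "rbind", "rw"]},
    {"destination": "/var/lib/neutron", "type": "bind", "source": "/var/lib/neutron",
     "options": ["rbind", "rw", "rprivate"]},
]
# Expected options, as suggested in this comment.
required = {"/run/netns": ["shared", "z"], "/var/lib/neutron": ["shared", "z"]}
print(missing_mount_options(mounts, required))
```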

Cédric Jeanneret (cjeanner) wrote :

Apparently correcting the flags in this change should do the trick:
https://review.openstack.org/#/c/606095/ (see my comments)

Changed in tripleo:
assignee: nobody → Cédric Jeanneret (cjeanner)
Cédric Jeanneret (cjeanner) wrote :

So, SELinux flags aren't involved in the current issue - Brent apparently has the same issue with SELinux in permissive mode.

Digging further, trying to run this:
ip -d netns exec qdhcp-3c92f030-45b4-4927-a83f-b2141a18877f bash

This should allow me to "enter" the network namespace. The result is "clear":
setting the network namespace "qdhcp-3c92f030-45b4-4927-a83f-b2141a18877f" failed: Invalid argument
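
`ip netns exec` opens the entry under /run/netns and calls setns(2) on it, and setns returns EINVAL when the file descriptor does not refer to a usable namespace. One thing worth checking (an assumption, not something this report confirms as the cause) is whether the entry is still a real namespace mount in the mount namespace where the command runs. The sketch below parses the text of /proc/self/mountinfo; the function names and the sample lines are hypothetical, and accepting both "nsfs" and "proc" as filesystem types is an assumption meant to cover newer and older kernels.

```python
def netns_mounts(mountinfo_text, netns_dir="/run/netns"):
    """Map netns name -> filesystem type for mounts found under netns_dir."""
    found = {}
    prefix = netns_dir.rstrip("/") + "/"
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if "-" not in fields:
            continue
        sep = fields.index("-")  # separator before "fstype source options"
        if sep < 6 or sep + 1 >= len(fields):
            continue
        mount_point, fs_type = fields[4], fields[sep + 1]
        if mount_point.startswith(prefix):
            found[mount_point[len(prefix):]] = fs_type
    return found


def is_namespace_mount(mountinfo_text, name, netns_dir="/run/netns"):
    """True if netns_dir/name is mounted from a namespace filesystem."""
    return netns_mounts(mountinfo_text, netns_dir).get(name) in ("nsfs", "proc")


# Hypothetical mountinfo excerpt, for illustration only.
SAMPLE = (
    "24 1 0:22 / /run rw,nosuid,nodev shared:23 - tmpfs tmpfs rw,mode=755\n"
    "250 24 0:3 net:[4026532256] /run/netns/qdhcp-demo rw shared:120 - nsfs nsfs rw\n"
)
print(netns_mounts(SAMPLE))
print(is_namespace_mount(SAMPLE, "qdhcp-missing"))
```

On a real host the text would come from reading /proc/self/mountinfo, both on the host and inside the container that runs the command, since the two may see different mounts.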

strace doesn't show anything more:
strace -f -s 120 -v ip -d netns exec qdhcp-3c92f030-45b4-4927-a83f-b2141a18877f bash
execve("/usr/sbin/ip", ["ip", "-d", "netns", "exec", "qdhcp-3c92f030-45b4-4927-a83f-b2141a18877f", "bash"], ["XDG_SESSION_ID=1", "HOSTNAME=undercloud.lain.internux.ch", "GUESTFISH_INIT=\\e[1;34m", "SELINUX_ROLE_REQUESTED=", "TERM=screen", "SHELL=/bin/bash", "HISTSIZE=1000", "SSH_CLIENT=192.168.122.1 59108 22", "SELINUX_USE_CURRENT_RANGE=", "SSH_TTY=/dev/pts/0", "USER=root", "LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su="..., "TMUX=/tmp/tmux-0/default,12905,0", "GUESTFISH_PS1=\\[\\e[1;32m\\]><fs>\\[\\e[0;31m\\] ", "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/root/bin", "MAIL=/var/spool/mail/root", "PWD=/root", "LANG=en_US.utf8", "GUESTFISH_OUTPUT=\\e[0m", "TMUX_PANE=%0", "SELINUX_LEVEL_REQUESTED=", "HISTCONTROL=ignoredups", "SHLVL=2", "HOME=/root", "LOGNAME=root", "SSH_CONNECTION=192.168.122.1 59108 192.168.122.38 22", "LC_CTYPE=en_US.utf8", "LESSOPEN=||/usr/bin/lesspipe.sh %s", "XDG_RUNTIME_DIR=/run/user/0", "GUESTFISH_RESTORE=\\e[0m", "_=/usr/bin/strace"]) = 0
brk(NULL) = 0x1a57000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7faed6d6a000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_dev=makedev(8, 1), st_ino=528183, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=64, st_size=29858, st_atime=2018/10/23-15:39:14.742653064, st_mtime=2018/10/23-15:39:14.513652790, st_ctime=2018/10/23-15:39:14.513652790}) = 0
mmap(NULL, 29858, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7faed6d62000
close(3) = 0
open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\16\0\0\0\0\0\0@\0\0\0\0\0\0\0\0E\0\0\0\0\0\0\0\0\0\0@\0008\0\7\0@\0!\0 \0\1\0\0\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\330\37\0\0\0\0\0\0\330\37\0\0\0\0\0\0\0\0 \0\0\0\0\0"..., 832) = 832
fstat(3, {st_dev=makedev(8, 1), st_ino=199332, st_mode=S_IFREG|0755, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=40, st_size=19776, st_atime=2018/10/23-15:07:28.628000000, st_mtime=2018/04/10-08:24:52, st_ctime=2018/06/05-14:07:01.395133488}) = 0
mmap(NULL, 2109744, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7faed6946000
mprotect(0x7faed6948000, 2097152, PROT_NONE) = 0
mmap(0x7faed6b48000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7faed6b48000
close(3) = 0
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0...


Cédric Jeanneret (cjeanner) wrote :

So.... Some more stuff:

- deleting the qdhcp-xxx and letting neutron_dhcp re-create it doesn't make it better.
- creating the qdhcp-xxx by hand and letting neutron_dhcp do its stuff doesn't help either

But deleting the qdhcp-xxx and letting neutron_dhcp re-create it at least shows which commands are run, more or less.

In order:
(3, 'neutron.privileged.agent.linux.ip_lib.create_netns', (u'qdhcp-3c92f030-45b4-4927-a83f-b2141a18877f',), {})

shell is: ip netns add qdhcp-xxx

['ip', 'netns', 'exec', 'qdhcp-3c92f030-45b4-4927-a83f-b2141a18877f', 'sysctl', '-w', 'net.ipv4.conf.all.promote_secondaries=1']

shell is the same, without quotes, of course
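
Converting the logged argument lists into something that can be pasted into a shell can be done mechanically. A minimal sketch, assuming Python 3.8 or later for shlex.join; the variable names are only illustrative:

```python
import shlex

# Argument list exactly as logged by the agent.
argv = ['ip', 'netns', 'exec', 'qdhcp-3c92f030-45b4-4927-a83f-b2141a18877f',
        'sysctl', '-w', 'net.ipv4.conf.all.promote_secondaries=1']

# shlex.join only adds quotes where the shell would need them.
shell_command = shlex.join(argv)
print(shell_command)
```

Arguments containing spaces or shell metacharacters are quoted automatically, which matters for longer commands such as the dnsmasq invocation.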

(3, 'neutron.privileged.agent.linux.ip_lib.set_link_attribute', ('lo', u'qdhcp-3c92f030-45b4-4927-a83f-b2141a18877f'), {'state': 'up'})

shell is: ip link set lo up netns qdhcp-3c92f030-45b4-4927-a83f-b2141a18877f

And there, we have a first failure:
RTNETLINK answers: Invalid argument

Also reported from the log itself:
(4, [{'header': {'pid': 24464, 'length': 36, 'flags': 0, 'error': None, 'type': 2, 'sequence_number': 255}, 'event': 'NLMSG_ERROR'}])

['ip', 'netns', 'exec', 'qdhcp-3c92f030-45b4-4927-a83f-b2141a18877f', 'sysctl', '-w', 'net.ipv6.conf.default.accept_ra=0']

shell is the same, without quotes.

(3, 'neutron.privileged.agent.linux.ip_lib.get_link_attributes', (u'tape2d00740-20', u'qdhcp-3c92f030-45b4-4927-a83f-b2141a18877f')

shell is: (I don't know, there is no "ip link get <device> netns qdhcp-xxx")

We get: Network interface tape2d00740-20 not found in namespace qdhcp-3c92f030-45b4-4927-a83f-b2141a18877f.

(3, 'neutron.privileged.agent.linux.ip_lib.set_link_attribute', (u'tape2d00740-20', None), {'address': u'fa:16:3e:07:55:2c'})

shell is: ip link set tape2d00740-20 address fa:16:3e:07:55:2c

there again, failure:
(4, [{'header': {'pid': 4294961794, 'length': 36, 'flags': 0, 'error': None, 'type': 2, 'sequence_number': 255}, 'event': 'NLMSG_ERROR'}])

and from the shell: Cannot find device "tape2d00740-20"

(3, 'neutron.privileged.agent.linux.ip_lib.set_link_attribute', (u'tape2d00740-20', None), {'net_ns_fd': u'qdhcp-3c92f030-45b4-4927-a83f-b2141a18877f'})

shell is: ??

failure: (4, [{'header': {'pid': 4294961791, 'length': 36, 'flags': 0, 'error': None, 'type': 2, 'sequence_number': 255}, 'event': 'NLMSG_ERROR'}])

and so on.

So apparently there's an issue linking OVS managed interfaces with netns.
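
A caveat when reading the replies quoted in this comment: in the excerpts, tuples starting with 3 are calls, the one starting with 5 carries an exception, and the ones starting with 4 carry a return value. In the netlink protocol an NLMSG_ERROR message with error code 0 is an acknowledgement, so, assuming 'error': None in the header stands for that zero code, the (4, [...]) replies may be acknowledgements rather than failures, and the errors seen from the shell would need to be explained separately. The sketch below encodes that reading; the constant names and classify_reply are hypothetical.

```python
# Message type codes as they appear in the log excerpts
# (3 = call, 4 = return value, 5 = exception); the names are illustrative.
CALL, RET, ERR = 3, 4, 5


def classify_reply(reply):
    """Classify a logged reply tuple as 'exception', 'netlink-error' or 'ok'."""
    msg_type, payload = reply[0], reply[1]
    if msg_type == ERR:
        return "exception"
    if msg_type == RET and isinstance(payload, list):
        for msg in payload:
            header = msg.get("header", {}) if isinstance(msg, dict) else {}
            # Assumption: a non-None 'error' marks a real netlink error.
            if header.get("error") is not None:
                return "netlink-error"
    return "ok"


ack_like = (4, [{'header': {'pid': 24464, 'length': 36, 'flags': 0, 'error': None,
                            'type': 2, 'sequence_number': 255},
                 'event': 'NLMSG_ERROR'}])
raised = (5, 'neutron.privileged.agent.linux.ip_lib.NetworkInterfaceNotFound',
          (u'Network interface tap936f789b-55 not found in namespace '
           u'qdhcp-a0d28768-54d5-4761-8227-e00fd4e61391.',))
print(classify_reply(ack_like), classify_reply(raised))
```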

Cédric Jeanneret (cjeanner) wrote :

Work is done in this review:
https://review.openstack.org/606095

Changed in tripleo:
status: Triaged → In Progress
Changed in tripleo:
assignee: Cédric Jeanneret (cjeanner) → Emilien Macchi (emilienm)
Changed in tripleo:
assignee: Emilien Macchi (emilienm) → Bogdan Dobrelya (bogdando)
Changed in tripleo:
assignee: Bogdan Dobrelya (bogdando) → Emilien Macchi (emilienm)
Changed in tripleo:
assignee: Emilien Macchi (emilienm) → Bogdan Dobrelya (bogdando)
Changed in tripleo:
assignee: Bogdan Dobrelya (bogdando) → Cédric Jeanneret (cjeanner)
Changed in tripleo:
assignee: Cédric Jeanneret (cjeanner) → Alex Schultz (alex-schultz)
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/606095
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=6117cae693a182c29af0c93f877046ffc7250ae2
Submitter: Zuul
Branch: master

commit 6117cae693a182c29af0c93f877046ffc7250ae2
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Sep 28 16:02:04 2018 +0200

    Fix wrapper containers for podman w/o sockets

    Adapt wrapper containers for podman, which has no a socket available.

    Add container_cli parameter for base neutron class, default to docker.
    Possible values: podman/docker (default). It is used by the wrappers
    tooling to issue CLI commands to the host containers system.
    Deprecate bind_socket so it does nothing for podman CLI.

    Additionally, add debug triggers for the wrapper scripts messages to
    become captured to the wrapper containers' stdout.

    Do not stop and remove the existing container before launching a new
    one. Allow the neutron parent process to control the process life
    cycle. Although make the wraper containers cleaning up any exited
    containers after its main process terminated by the neutron parent
    process. Additionally, If a name is already taken by a container,
    give it an unique name and assume all the smooth transitioning work
    to be done by the parent neutron process and that clean up logic
    in the wrapper.

    Closes-Bug: #1799484
    Change-Id: Ib3c41a8bee349856d21f360595e41a9eafd79323
    Signed-off-by: Bogdan Dobrelya <email address hidden>
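
The naming rule described in the commit message (keep the requested name if it is free, otherwise pick a unique one) can be illustrated roughly as follows. This is not the wrapper's actual code: pick_container_name and the numeric suffix scheme are assumptions made purely for illustration, and the container names are invented.

```python
def pick_container_name(base, existing):
    """Return base if unused, else base with the first free numeric suffix."""
    if base not in existing:
        return base
    suffix = 1
    while "{}-{}".format(base, suffix) in existing:
        suffix += 1
    return "{}-{}".format(base, suffix)


# Hypothetical set of names already taken on the host.
taken = {"neutron-dnsmasq-qdhcp-demo"}
print(pick_container_name("neutron-dnsmasq-qdhcp-demo", taken))
```

The design point from the commit message is that the wrapper no longer stops or removes the existing container itself; it only avoids the name clash and leaves life-cycle decisions to the neutron parent process.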

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 10.2.0

This issue was fixed in the openstack/puppet-tripleo 10.2.0 release.
