Wallaby: OVB jobs are failing to provision overcloud nodes: "No such file or directory: 'dnsmasq-kill'"

Bug #1922767 reported by Ronelle Landy
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

Wallaby OVB integration jobs are failing at the start of the overcloud deploy step, while provisioning the overcloud nodes:

https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby/38b3e1e/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

2021-04-05 17:41:25 | 2021-04-05 17:41:25.557055 | fa163e6d-fff4-92ba-566b-000000000017 | FATAL | Provision instances | localhost | error={"changed": false, "logging": "Created port overcloud-controller-1-ctlplane (UUID 6edcc815-baab-4e3b-92f2-79874e439774) for node baremetal-38175-1 (UUID 6e2ec2a3-2831-45ba-ae49-f759e8408c48) with {'network_id': '35b710f1-779c-4151-b21c-6ab715d699e9', 'name': 'overcloud-controller-1-ctlplane'}\nCreated port overcloud-controller-0-ctlplane (UUID 4d432dba-680a-42cf-b5d7-02a777ff9bab) for node baremetal-38175-2 (UUID 290cb0c4-5951-4e61-8d86-a6d8515e79b7) with {'network_id': '35b710f1-779c-4151-b21c-6ab715d699e9', 'name': 'overcloud-controller-0-ctlplane'}\nCreated port overcloud-controller-2-ctlplane (UUID abb4451d-c657-49e7-af24-0a7d26785806) for node baremetal-38175-0 (UUID 766c3a25-b321-436b-92cb-6168dbab336f) with {'network_id': '35b710f1-779c-4151-b21c-6ab715d699e9', 'name': 'overcloud-controller-2-ctlplane'}\nCreated port overcloud-novacompute-0-ctlplane (UUID 7784f4ed-71ca-4c1f-a42c-e9d25d857fe8) for node baremetal-38175-3 (UUID 64c46afc-272e-419e-a95b-7df1476a94ca) with {'network_id': '35b710f1-779c-4151-b21c-6ab715d699e9', 'name': 'overcloud-novacompute-0-ctlplane'}\nAttached port overcloud-controller-1-ctlplane (UUID 6edcc815-baab-4e3b-92f2-79874e439774) to node baremetal-38175-1 (UUID 6e2ec2a3-2831-45ba-ae49-f759e8408c48)\nAttached port overcloud-controller-0-ctlplane (UUID 4d432dba-680a-42cf-b5d7-02a777ff9bab) to node baremetal-38175-2 (UUID 290cb0c4-5951-4e61-8d86-a6d8515e79b7)\nAttached port overcloud-controller-2-ctlplane (UUID abb4451d-c657-49e7-af24-0a7d26785806) to node baremetal-38175-0 (UUID 766c3a25-b321-436b-92cb-6168dbab336f)\nAttached port overcloud-novacompute-0-ctlplane (UUID 7784f4ed-71ca-4c1f-a42c-e9d25d857fe8) to node baremetal-38175-3 (UUID 64c46afc-272e-419e-a95b-7df1476a94ca)\nProvisioning started on node baremetal-38175-1 (UUID 6e2ec2a3-2831-45ba-ae49-f759e8408c48)\nProvisioning started on node baremetal-38175-2 (UUID 290cb0c4-5951-4e61-8d86-a6d8515e79b7)\nProvisioning started on node baremetal-38175-0 (UUID 766c3a25-b321-436b-92cb-6168dbab336f)\nProvisioning started on node baremetal-38175-3 (UUID 64c46afc-272e-419e-a95b-7df1476a94ca)\n", "msg": "Node 64c46afc-272e-419e-a95b-7df1476a94ca reached failure state \"deploy failed\"; the last error is Timeout reached while waiting for callback for node 64c46afc-272e-419e-a95b-7df1476a94ca"}
2021-04-05 17:41:25 | 2021-04-05 17:41:25.564217 | fa163e6d-fff4-92ba-566b-000000000017 | TIMING | Provision instances | localhost | 0:32:08.705231 | 1913.66s

The following error (No such file or directory: 'dnsmasq-kill': 'dnsmasq-kill') appears in var/log/containers/neutron/dhcp-agent.log on the undercloud:

https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby/38b3e1e/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz

2021-04-05 17:02:56.130 92449 DEBUG neutron.agent.linux.dhcp [req-f1d899db-ff61-4711-8a29-6555ed535708 - - - - -] Done building host file /var/lib/neutron/dhcp/35b710f1-779c-4151-b21c-6ab715d699e9/host _output_hosts_file /usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py:887
2021-04-05 17:02:56.142 97916 DEBUG oslo.privsep.daemon [-] privsep: Exception during request[140416409565488]: [Errno 2] No such file or directory: 'dnsmasq-kill': 'dnsmasq-kill' _process_cmd /usr/lib/python3.6/site-packages/oslo_privsep/daemon.py:490
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 485, in _process_cmd
    ret = func(*f_args, **f_kwargs)
  File "/usr/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 249, in _wrap
    return func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/neutron/privileged/agent/linux/utils.py", line 56, in execute_process
    obj, cmd = _create_process(cmd, addl_env=addl_env)
  File "/usr/lib/python3.6/site-packages/neutron/privileged/agent/linux/utils.py", line 83, in _create_process
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  File "/usr/lib/python3.6/site-packages/eventlet/green/subprocess.py", line 58, in __init__
    subprocess_orig.Popen.__init__(self, args, 0, *argss, **kwds)
  File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'dnsmasq-kill': 'dnsmasq-kill'
2021-04-05 17:02:56.164 97916 DEBUG oslo.privsep.daemon [-] privsep: reply[140416409565488]: (5, 'builtins.FileNotFoundError', (2, "No such file or directory: 'dnsmasq-kill'")) _call_back /usr/lib/python3.6/site-packages/oslo_privsep/daemon.py:511
2021-04-05 17:02:56.165 92449 DEBUG neutron.agent.dhcp.agent [-] Resync event has been scheduled _periodic_resync_helper /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:356
2021-04-05 17:02:56.165 92449 DEBUG neutron.common.utils [-] Calling throttled function clear wrapper /usr/lib/python3.6/site-packages/neutron/common/utils.py:115
2021-04-05 17:02:56.165 92449 DEBUG neutron.agent.dhcp.agent [-] resync (35b710f1-779c-4151-b21c-6ab715d699e9): [FileNotFoundError(2, "No such file or directory: 'dnsmasq-kill'")] _periodic_resync_helper /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:373
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent [req-f1d899db-ff61-4711-8a29-6555ed535708 - - - - -] Unable to reload_allocations dhcp for 35b710f1-779c-4151-b21c-6ab715d699e9.: FileNotFoundError: [Errno 2] No such file or directory: 'dnsmasq-kill'
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 227, in call_driver
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent rv = getattr(driver, action)(**action_kwargs)
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 593, in reload_allocations
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent self._spawn_or_reload_process(reload_with_HUP=True)
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 525, in _spawn_or_reload_process
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent pm.enable(reload_cfg=reload_with_HUP, ensure_active=True)
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 91, in enable
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent self.reload_cfg()
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 99, in reload_cfg
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent self.disable('HUP')
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 114, in disable
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent privsep_exec=True)
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py", line 132, in execute
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent cmd, _process_input, addl_env)
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 247, in _wrap
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent return self.channel.remote_call(name, args, kwargs)
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 224, in remote_call
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent raise exc_type(*result[2])
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent FileNotFoundError: [Errno 2] No such file or directory: 'dnsmasq-kill'
2021-04-05 17:02:56.166 92449 ERROR neutron.agent.dhcp.agent
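
The traceback above ends in subprocess.Popen raising FileNotFoundError for the bare name 'dnsmasq-kill'. A command given without a directory component is looked up through PATH, so a script can exist on disk and still fail this way if its directory is not on PATH. The sketch below only illustrates that general behaviour; it is not taken from neutron, and the script name "demo-dnsmasq-kill" and the temporary directory are hypothetical stand-ins.

```python
import os
import subprocess
import tempfile


def run(cmd):
    """Run cmd; return ("exit", code), or ("errno", n) if it cannot be started."""
    try:
        proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        return ("exit", proc.returncode)
    except FileNotFoundError as exc:
        return ("errno", exc.errno)


with tempfile.TemporaryDirectory() as scripts_dir:
    # Hypothetical stand-in for a kill script; the directory is NOT on PATH.
    script = os.path.join(scripts_dir, "demo-dnsmasq-kill")
    with open(script, "w") as f:
        f.write("#!/bin/sh\nexit 0\n")
    os.chmod(script, 0o755)

    # Bare name: resolved via PATH, so the existing script is not found.
    by_name = run(["demo-dnsmasq-kill", "HUP", "12345"])
    # Full path: no PATH lookup is involved, so the script starts.
    by_path = run([script, "HUP", "12345"])

print("bare name:", by_name)
print("full path:", by_path)
```

If the same applies inside the DHCP agent container, the presence of the kill script on disk would not by itself be enough for the call to succeed; that is an assumption to verify, not a confirmed diagnosis.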

fs035 shows the same error:

https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-wallaby/d94bf90/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz

Ronelle Landy (rlandy)
tags: added: ci promotion-blocker
Changed in tripleo:
milestone: none → wallaby-rc1
importance: Undecided → Critical
status: New → Triaged
Ronelle Landy (rlandy) wrote :
Slawek Kaplonski (slaweq) wrote :

From what I see in the logs, the kill_scripts directory exists on the node: https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-wallaby/d94bf90/logs/undercloud/var/lib/neutron/kill_scripts/

And it is configured as a bind mount in the DHCP agent container's config: https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-wallaby/d94bf90/logs/undercloud/var/log/extra/podman/containers/neutron_dhcp/podman_info.log.txt.gz

            {
                "Type": "bind",
                "Source": "/var/lib/neutron/kill_scripts",
                "Destination": "/etc/neutron/kill_scripts",
                "Driver": "",
                "Mode": "",
                "Options": [
                    "rbind"
                ],
                "RW": true,
                "Propagation": "shared"
            },

So I don't understand why it's not available from the container for the dhcp agent.
I will investigate that issue and will also ask Brent for help with this.
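
One way to follow up on the mount information is to check where the host directory lands inside the container and whether that destination is on the command search path. The sketch below parses the mount entry quoted above; the helper names and the PATH value are assumptions made for illustration, since the container's real PATH is not part of this report.

```python
import json
import posixpath

# Mount entry copied from the podman_info.log excerpt in this report.
MOUNTS_JSON = """
[
  {
    "Type": "bind",
    "Source": "/var/lib/neutron/kill_scripts",
    "Destination": "/etc/neutron/kill_scripts",
    "Driver": "",
    "Mode": "",
    "Options": ["rbind"],
    "RW": true,
    "Propagation": "shared"
  }
]
"""


def mount_destinations(mounts, source):
    """Return the container paths at which `source` is bind-mounted."""
    return [m["Destination"] for m in mounts
            if m.get("Type") == "bind" and m.get("Source") == source]


def on_search_path(directory, path_value):
    """True if `directory` is one of the entries of a PATH-style string."""
    entries = [posixpath.normpath(p) for p in path_value.split(":") if p]
    return posixpath.normpath(directory) in entries


mounts = json.loads(MOUNTS_JSON)
dests = mount_destinations(mounts, "/var/lib/neutron/kill_scripts")

# ASSUMPTION: an illustrative PATH, not read from the actual container.
assumed_path = "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"
for dest in dests:
    print(dest, "on PATH:", on_search_path(dest, assumed_path))
```

With the assumed PATH, the mount destination is not a search-path entry, which would be consistent with a bare command name failing to resolve even though the directory is mounted.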

Brent Eagles (beagles) wrote :

It probably doesn't matter at all, but I'm not sure why we mount that path as shared or relabel it. I wonder also if we are running into a problem where that path is included as a subdirectory of another mount. It's weird that it's worked in previous versions and other environments without issue.
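
The question of whether the path sits below another mount can be checked mechanically from the list of mount destinations. This is a minimal sketch of such a check; only "/etc/neutron/kill_scripts" comes from the report, and the other destinations are hypothetical examples of what an overlapping mount might look like.

```python
from pathlib import PurePosixPath


def nested_mounts(destinations):
    """Return (inner, outer) pairs where `inner` lies strictly below `outer`."""
    paths = [PurePosixPath(d) for d in destinations]
    pairs = []
    for inner in paths:
        for outer in paths:
            if inner != outer and outer in inner.parents:
                pairs.append((str(inner), str(outer)))
    return pairs


destinations = [
    "/etc/neutron/kill_scripts",  # from the podman_info.log excerpt
    "/etc/neutron",               # HYPOTHETICAL overlapping mount
    "/var/log/neutron",           # HYPOTHETICAL unrelated mount
]
print(nested_mounts(destinations))
```

Running this against the full destination list from the container's inspect output would show whether any such overlap actually exists.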

Ronelle Landy (rlandy) wrote :
Slawek Kaplonski (slaweq) wrote :
wes hayutin (weshayutin)
Changed in tripleo:
status: Triaged → Fix Released