cs9 master fs001, fs002, fs020, fs035, fs039 and fs064 are failing to provision overcloud nodes (the last error is Timeout reached while waiting for callback for node)

Bug #1958955 reported by Bhagyashri Shewale
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

Error:

2022-01-24 21:42:45.515721 | fa163e2f-fc1d-b7c0-460c-000000000018 | FATAL | Provision instances | localhost | error={"changed": false, "logging": "Created port overcloud-controller-1-ctlplane (UUID 9380f102-6b7d-4b75-9388-f86e4cb1b60b) for node baremetal-66459-1 (UUID 8d945669-2f67-4054-9b4f-f2da1b97ad66) with {'network_id': '5c1ba3cf-ed75-4484-8a8d-f953c6baba8f', 'name': 'overcloud-controller-1-ctlplane'}\nCreated port overcloud-novacompute-0-ctlplane (UUID 761cbe0b-3c9a-4b98-bc06-fb4879b6facf) for node baremetal-66459-0 (UUID d59773e8-9e73-4333-a3cb-464b3bc3b74e) with {'network_id': '5c1ba3cf-ed75-4484-8a8d-f953c6baba8f', 'name': 'overcloud-novacompute-0-ctlplane'}\nCreated port overcloud-controller-2-ctlplane (UUID d516f1b8-4128-4fde-a8fa-6ecaba25f626) for node baremetal-66459-3 (UUID cd1c8614-a15c-4eb9-a41b-5120951d5321) with {'network_id': '5c1ba3cf-ed75-4484-8a8d-f953c6baba8f', 'name': 'overcloud-controller-2-ctlplane'}\nCreated port overcloud-controller-0-ctlplane (UUID 52dbf92d-f02d-4a09-9761-5eeb3f8c3628) for node baremetal-66459-2 (UUID cccb43b0-3786-420c-8ee0-365f9e3b934f) with {'network_id': '5c1ba3cf-ed75-4484-8a8d-f953c6baba8f', 'name': 'overcloud-controller-0-ctlplane'}\nAttached port overcloud-novacompute-0-ctlplane (UUID 761cbe0b-3c9a-4b98-bc06-fb4879b6facf) to node baremetal-66459-0 (UUID d59773e8-9e73-4333-a3cb-464b3bc3b74e)\nAttached port overcloud-controller-0-ctlplane (UUID 52dbf92d-f02d-4a09-9761-5eeb3f8c3628) to node baremetal-66459-2 (UUID cccb43b0-3786-420c-8ee0-365f9e3b934f)\nAttached port overcloud-controller-1-ctlplane (UUID 9380f102-6b7d-4b75-9388-f86e4cb1b60b) to node baremetal-66459-1 (UUID 8d945669-2f67-4054-9b4f-f2da1b97ad66)\nAttached port overcloud-controller-2-ctlplane (UUID d516f1b8-4128-4fde-a8fa-6ecaba25f626) to node baremetal-66459-3 (UUID cd1c8614-a15c-4eb9-a41b-5120951d5321)\nProvisioning started on node baremetal-66459-1 (UUID 8d945669-2f67-4054-9b4f-f2da1b97ad66)\nProvisioning started on node baremetal-66459-2 (UUID cccb43b0-3786-420c-8ee0-365f9e3b934f)\nProvisioning started on node baremetal-66459-3 (UUID cd1c8614-a15c-4eb9-a41b-5120951d5321)\nProvisioning started on node baremetal-66459-0 (UUID d59773e8-9e73-4333-a3cb-464b3bc3b74e)\n", "msg": "Node cd1c8614-a15c-4eb9-a41b-5120951d5321 reached failure state \"deploy failed\"; the last error is Timeout reached while waiting for callback for node cd1c8614-a15c-4eb9-a41b-5120951d5321"}
2022-01-24 21:42:45.519530 | fa163e2f-fc1d-b7c0-460c-000000000018 | TIMING | Provision instances | localhost | 0:33:34.027825 | 2001.25s

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
localhost : ok=9 changed=2 unreachable=0 failed=1 skipped=5 rescued=0 ignored=0

Affected jobs:
- periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset064-master
- periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset039-master
- periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset035-master
- periodic-tripleo-ci-centos-9-ovb-1ctlr_2comp-featureset020-master
- periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master
- periodic-tripleo-ci-centos-9-ovb-1ctlr_1comp-featureset002-master

[1]: https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master&pipeline=openstack-periodic-integration-main
[2]: https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/c2688ab/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz
[3]: https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/c2688ab/job-output.txt

Revision history for this message
Ronelle Landy (rlandy) wrote :

Looks pretty consistent now:

https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-1ctlr_1comp-featureset002-master/37ba009/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz

955c6022-dcd8-417c-b3d5-1faadd0872a7) to node baremetal-7503-1 (UUID 1f07d4dd-955e-41b9-86d5-5f468770ff4f)\nProvisioning started on node baremetal-7503-1 (UUID 1f07d4dd-955e-41b9-86d5-5f468770ff4f)\nProvisioning started on node baremetal-7503-0 (UUID d1548719-3a75-44e3-a43a-6a6c7baa0a84)\n", "msg": "Node 1f07d4dd-955e-41b9-86d5-5f468770ff4f reached failure state \"deploy failed\"; the last error is Timeout reached while waiting for callback for node 1f07d4dd-955e-41b9-86d5-5f468770ff4f"}

Revision history for this message
Ronelle Landy (rlandy) wrote :

https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset035-master/e719c1a/logs/undercloud/var/log/extra/big-errors.txt.gz

shows:

2022-01-25 13:02:31.070 ERROR /var/log/containers/neutron/server.log: 18 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2022-01-25 13:02:31.070 ERROR /var/log/containers/neutron/server.log: 18 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2022-01-25 13:02:31.070 ERROR /var/log/containers/neutron/server.log: 18 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2022-01-25 13:02:31.070 ERROR /var/log/containers/neutron/server.log: 18 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2022-01-25 13:02:31.070 ERROR /var/log/containers/neutron/server.log: 18 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2022-01-25 13:02:31.070 ERROR /var/log/containers/neutron/server.log: 18 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2022-01-25 13:02:31.070 ERROR /var/log/containers/neutron/server.log: 18 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2022-01-25 13:02:31.070 ERROR /var/log/containers/neutron/server.log: 18 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py", line 169, in get_active_networks_info
2022-01-25 13:02:31.070 ERROR /var/log/containers/neutron/server.log: 18 ERROR oslo_messaging.rpc.server ret.append(self.get_network_info(
2022-01-25 13:02:31.070 ERROR /var/log/containers/neutron/server.log: 18 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py", line 216, in get_network_info
2022-01-25 13:02:31.070 ERROR /var/log/containers/neutron/server.log: 18 ERROR oslo_messaging.rpc.server segment_ids = {ns.id for ns in network.segments if
2022-01-25 13:02:31.070 ERROR /var/log/containers/neutron/server.log: 18 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py", line 217, in <setcomp>
2022-01-25 13:02:31.070 ERROR /var/log/containers/neutron/server.log: 18 ERROR oslo_messaging.rpc.server host in ns.hosts}
2022-01-25 13:02:31.070 ERROR /var/log/containers/neutron/server.log: 18 ERROR oslo_messaging.rpc.server NameError: free variable 'host' referenced before assignment in enclosing scope
2022-01-25 13:02:31.070 ERROR /var/log/containers/neutron/server.log: 18 ERROR oslo_messaging.rpc.server
2022-01-25 13:02:31.542 ERROR /var/log/containers/neutron/server.log: 16 ERROR oslo_messaging.rpc.server [req-864c5d14-c683-401e-9db3-123993105513 - - - - -] Exception during message handling: NameError: free variable 'host' referenced before assignment in enclosing scope
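
For context, this NameError comes from Python's scoping rules for comprehensions: a set comprehension runs in its own scope, and a name that the enclosing function assigns only on some code paths becomes a free variable there; on Python 3.9, referencing it when the assignment never ran produces exactly this message. A minimal, self-contained illustration of the error class (hypothetical names, not the actual Neutron code path):

def get_segment_ids(segments, lookup_host=None):
    # 'host' is assigned only inside this branch; when the branch is
    # skipped, the name is never bound in the enclosing scope.
    if lookup_host is not None:
        host = lookup_host
    # The set comprehension is compiled as a nested scope, so 'host' is a
    # free variable here. On Python 3.9 an unbound free variable raises:
    #   NameError: free variable 'host' referenced before assignment
    #   in enclosing scope
    return {s['id'] for s in segments if host in s['hosts']}

get_segment_ids([{'id': 'seg-1', 'hosts': ['compute-0']}])  # raises the NameError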

Revision history for this message
Douglas Viroel (dviroel) wrote :

Looks like the nodes are failing to receive an IP address [1]; the expected behavior would be a DHCPOFFER in response [2].

Jan 25 13:04:08 localhost dhclient[430]: DHCPDISCOVER on ens4 to 255.255.255.255 port 67 interval 13 (xid=0x58a0c400)
Jan 25 13:04:13 localhost dhclient[437]: DHCPDISCOVER on ens6 to 255.255.255.255 port 67 interval 14 (xid=0x4ab16715)
Jan 25 13:04:16 localhost dhclient[433]: DHCPDISCOVER on ens5 to 255.255.255.255 port 67 interval 15 (xid=0x5a9b9170)
Jan 25 13:04:17 localhost dhclient[440]: DHCPDISCOVER on ens7 to 255.255.255.255 port 67 interval 14 (xid=0x762b1715)
Jan 25 13:04:21 localhost dhclient[443]: DHCPDISCOVER on ens8 to 255.255.255.255 port 67 interval 13 (xid=0xc84f9c4f)
Jan 25 13:04:21 localhost dhclient[430]: DHCPDISCOVER on ens4 to 255.255.255.255 port 67 interval 17 (xid=0x58a0c400)
Jan 25 13:04:27 localhost dhclient[437]: DHCPDISCOVER on ens6 to 255.255.255.255 port 67 interval 8 (xid=0x4ab16715)
Jan 25 13:04:31 localhost dhclient[440]: DHCPDISCOVER on ens7 to 255.255.255.255 port 67 interval 8 (xid=0x762b1715)
Jan 25 13:04:32 localhost dhclient[433]: DHCPDISCOVER on ens5 to 255.255.255.255 port 67 interval 7 (xid=0x5a9b9170)
Jan 25 13:04:34 localhost dhclient[443]: DHCPDISCOVER on ens8 to 255.255.255.255 port 67 interval 5 (xid=0xc84f9c4f)
Jan 25 13:04:35 localhost dhclient[437]: DHCPDISCOVER on ens6 to 255.255.255.255 port 67 interval 4 (xid=0x4ab16715)
Jan 25 13:04:37 localhost systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Jan 25 13:04:37 localhost systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset035-master/e719c1a/logs/undercloud/var/log/containers/ironic-inspector/ramdisk/
[2] https://logserver.rdoproject.org/57/38357/1/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset035-master/37a7ebc/logs/undercloud/var/log/containers/ironic/deploy/

Revision history for this message
Ronelle Landy (rlandy) wrote :

<dviroel> slaweq: rlandy|ruck: nodes that are failing are still having hard times to get an IP - iiuc
<slaweq> dviroel: rlandy|ruck that error with Failed to bind port are because of the cloud config I guess, see:
<slaweq> https://paste.opendev.org/show/812372/
<slaweq> that's neutron server log about it
<slaweq> it has enabled 2 mech_drivers: "openvswitch" and "baremetal"
<slaweq> openvswitch can't bind the port as there is no neutron-ovs-agent on the node
<slaweq> and "baremetal" can't bind a port with vnic_type=normal

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I took another look at the results of https://logserver.rdoproject.org/82/38582/2/check/periodic-tripleo-ci-centos-9-ovb-1ctlr_1comp-featureset002-master/3506780/job-output.txt

Here are my findings:

1. "Failed to bind port on host..." in neutron server logs are red herrings - that is the same also in the Centos 8 jobs and later that port is bound properly with baremetal mech_driver,

2. In the log https://logserver.rdoproject.org/82/38582/2/check/periodic-tripleo-ci-centos-9-ovb-1ctlr_1comp-featureset002-master/3506780/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz I found this error:

Node 635a9c6c-f912-4d71-a1ac-bfd2999595d7 reached failure state \"deploy failed\"; the last error is Timeout reached while waiting for callback for node 635a9c6c-f912-4d71-a1ac-bfd2999595d7

3. So I checked the neutron server log for that ID and found the port which was bound on that node:

2022-01-26 11:21:27.169 15 DEBUG neutron.plugins.ml2.managers [req-eb2fe8f4-f2ab-4b7c-ae5b-0fd0e4729711 - fake_project_id - - -] Bound port: 1baac803-390e-4479-833f-a03906167b93, host: 635a9c6c-f912-4d71-a1ac-bfd2999595d7, vif_type: other, vif_details: {"connectivity": "legacy"}, binding_levels: [{'bound_driver': 'baremetal', 'bound_segment': {'id': '40635e99-cbc6-444f-a2ba-a94b3f2ba0f2', 'network_type': 'flat', 'physical_network': 'ctlplane', 'segmentation_id': None, 'network_id': '55fd99bc-2c12-483d-9b61-a45ae5714baf'}}] _bind_port_level /usr/lib/python3.9/site-packages/neutron/plugins/ml2/managers.py:932

4. So from the Neutron perspective everything seems OK to me here. I took a quick look at the ironic-conductor log, and for that node I saw:

2022-01-26 11:22:56.909 2 INFO ironic.conductor.utils [req-07fc14b2-3dcd-4bce-83fe-047c74a90e11 admin - - - -] Successfully set node 635a9c6c-f912-4d71-a1ac-bfd2999595d7 power state to power on by rebooting.
2022-01-26 11:22:56.920 2 INFO ironic.conductor.deployments [req-07fc14b2-3dcd-4bce-83fe-047c74a90e11 admin - - - -] Deploy step {'step': 'deploy', 'priority': 100, 'argsinfo': None, 'interface': 'deploy'} on node 635a9c6c-f912-4d71-a1ac-bfd2999595d7 being executed asynchronously, waiting for driver.

which is around the time when the port was bound in neutron-server. Later there are only logs like:

2022-01-26 11:23:42.787 2 DEBUG ironic.conductor.task_manager [req-e518b3a7-dca1-43b1-8017-58cf8a497fcc - - - - -] Attempting to get shared lock on node 635a9c6c-f912-4d71-a1ac-bfd2999595d7 (for power state sync) __init__ /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:235
2022-01-26 11:23:42.788 2 DEBUG ironic.conductor.task_manager [req-e518b3a7-dca1-43b1-8017-58cf8a497fcc - - - - -] Successfully released shared lock for power state sync on node 635a9c6c-f912-4d71-a1ac-bfd2999595d7 (lock was held 0.00 sec) release_resources /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:448

And finally the node is switched to "deploy failed".

I don't know how Ironic works or how Neutron is supposed to interact with Ironic, so maybe I'm missing something, but currently it looks to me like someone from the Ironic team should take a look at this issue.

Revision history for this message
Harald Jensås (harald-jensas) wrote :

I don't see a dnsmasq process for the neutron dhcp-agent running:
https://logserver.rdoproject.org/82/38582/2/check/periodic-tripleo-ci-centos-9-ovb-1ctlr_1comp-featureset002-master/3506780/logs/undercloud/var/log/extra/ps.txt.gz

Ironic Inspector's dnsmasq is there:

 0.0 472 3092 10004 149215 \_ /sbin/dnsmasq --conf-file=/etc/ironic-inspector/dnsmasq.conf -k --log-facility=/var/log/ironic-inspector/dnsmasq.log

But no dnsmasq for neutron.

Comparing to a CentOS 8 job:
https://logserver.rdoproject.org/25/823925/3/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001/8b5c715/logs/undercloud/var/log/extra/ps.txt.gz

On CentOS 8 we can see:
 0.0 176 816 4236 131256 | \_ dumb-init --single-child -- /usr/sbin/dnsmasq -k --no-hosts --no-resolv --pid-file=/var/lib/neutron/dhcp/3d657407-c811-4c00-b360-64a49c742bfb/pid --dhcp-hostsfile=/var/lib/neutron/dhcp/3d657407-c811-4c00-b360-64a49c742bfb/host --addn-hosts=/var/lib/neutron/dhcp/3d657407-c811-4c00-b360-64a49c742bfb/addn_hosts --dhcp-optsfile=/var/lib/neutron/dhcp/3d657407-c811-4c00-b360-64a49c742bfb/opts --dhcp-leasefile=/var/lib/neutron/dhcp/3d657407-c811-4c00-b360-64a49c742bfb/leases --dhcp-match=set:ipxe,175 --dhcp-userclass=set:ipxe6,iPXE --local-service --bind-dynamic --dhcp-range=set:subnet-bb0010fc-e929-4129-9c6c-41acebddbe4d,192.168.24.0,static,255.255.255.0,86400s --dhcp-option-force=option:mtu,1350 --dhcp-lease-max=256 --conf-file=/dev/null --domain=localdomain

The fact that neutron-dhcp-agent's dnsmasq is not running would explain why Ironic eventually times out.

Revision history for this message
Harald Jensås (harald-jensas) wrote :

In the job log:
https://logserver.rdoproject.org/82/38582/2/check/periodic-tripleo-ci-centos-9-ovb-1ctlr_1comp-featureset002-master/3506780/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz

2022-01-26 10:58:36.489 147155 INFO neutron.agent.dhcp.agent
  [req-e043493e-0fe5-4956-b2dc-f6523891bdae - - - - -]
    All active networks have been fetched through RPC.

2022-01-26 10:58:36.490 147155 INFO neutron.agent.dhcp.agent
  [-] Starting network 55fd99bc-2c12-483d-9b61-a45ae5714baf dhcp configuration

2022-01-26 10:58:36.491 147155 DEBUG oslo_concurrency.lockutils
  [-] Lock "resize_greenpool" acquired by "neutron.agent.dhcp.agent.DhcpAgent._resize_process_pool" :: waited 0.000s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:386

2022-01-26 10:58:36.491 147155 DEBUG oslo_concurrency.lockutils
  [-] Lock "resize_greenpool" "released" by "neutron.agent.dhcp.agent.DhcpAgent._resize_process_pool" :: held 0.000s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:400

2022-01-26 10:58:36.492 147155 INFO neutron.agent.dhcp.agent
  [-] Finished network 55fd99bc-2c12-483d-9b61-a45ae5714baf dhcp configuration

I attempted to reproduce the issue locally, but was not able to.

In my local environment, _A LOT_ more is going on between these log entries:
  All active networks have been fetched through RPC.
  <- and ->
  Finished network <UUID> dhcp configuration

https://opendev.org/openstack/neutron/commit/1562f9141ba9c90c69c1c851dd9a50c8108ab15f

Revision history for this message
Harald Jensås (harald-jensas) wrote :

Oops, hit submit too soon. Following up from the previous post:

I believe change [1] is the culprit.

-        if len(network.subnets) and self.call_driver('enable', network):
+        if (any(s for s in network.subnets if s.enable_dhcp) and
+                self.call_driver('enable', network)):

In CI we never see "Calling driver for network <UUID>" logged, which is emitted by call_driver(), see [2]. I do see it logged in my local env, where dnsmasq is properly spawned.

There must be something, a race perhaps, causing the new condition on `enable_dhcp` for the subnets to evaluate as False. So the call_driver() method and this entire conditional block [3] do not run:

        if (any(s for s in network.subnets if s.enable_dhcp) and
                self.call_driver('enable', network)):
            self.update_isolated_metadata_proxy(network)
            self.cache.put(network)
            # After enabling dhcp for network, mark all existing
            # ports as ready. So that the status of ports which are
            # created before enabling dhcp can be updated.
            self.dhcp_ready_ports |= {p.id for p in network.ports}

[1] https://opendev.org/openstack/neutron/commit/1562f9141ba9c90c69c1c851dd9a50c8108ab15f
[2] 2022-01-28 17:49:06.766 35283 DEBUG neutron.agent.dhcp.agent [-] Calling driver for network: 58a36bab-8a1b-43a4-bd83-cb153f4d2232 action: enable call_driver /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:203
[3] https://opendev.org/openstack/neutron/src/branch/master/neutron/agent/dhcp/agent.py#L403-L410
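
To make the suspected failure mode concrete, here is a minimal runnable sketch (stand-in classes, not the real Neutron agent objects): because `and` short-circuits, if no subnet reports enable_dhcp=True at the moment the agent processes the network, call_driver('enable', ...) is never invoked, so dnsmasq is never spawned and no "Calling driver" line is ever logged.

class Subnet:
    def __init__(self, enable_dhcp):
        self.enable_dhcp = enable_dhcp

class Network:
    def __init__(self, subnets):
        self.subnets = subnets

def call_driver(action, network):
    # Stand-in for DhcpAgent.call_driver(); the real method logs
    # "Calling driver for network: <UUID> action: enable" and spawns dnsmasq.
    print("Calling driver, action:", action)
    return True

def configure_dhcp_for_network(network):
    # Old condition: len(network.subnets) and call_driver('enable', network)
    # New condition: short-circuits before call_driver() when no subnet has
    # enable_dhcp=True, e.g. if a race makes the flag read False during sync.
    if (any(s for s in network.subnets if s.enable_dhcp) and
            call_driver('enable', network)):
        print("dnsmasq enabled, ports marked ready")
    else:
        print("driver never called: no dnsmasq, no 'Calling driver' log line")

configure_dhcp_for_network(Network([Subnet(enable_dhcp=False)]))  # driver skipped
configure_dhcp_for_network(Network([Subnet(enable_dhcp=True)]))   # driver called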

Ronelle Landy (rlandy)
Changed in tripleo:
milestone: xena-3 → yoga-1
Ronelle Landy (rlandy)
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 20.0.0.0rc1

This issue was fixed in the openstack/neutron 20.0.0.0rc1 release candidate.
