Hello,
It seems the "provide" command is hitting a timeout on my env[1]:
2021-08-23 13:33:46.122226 | 2442014a-9670-7de9-f208-00000000000f | FATAL | Make nodes available | localhost | error={"changed": false, "msg": "Timeout waiting for node 1b82598d-5c0d-447e-b0e8-05c5cc8b2b63 to have bridge_mappings set in the ironic-neutron-agent entry"}
The task that actually hits this timeout is this one:
https://opendev.org/openstack/tripleo-ansible/src/branch/master/tripleo_ansible/ansible_plugins/modules/os_baremetal_provide_node.py#L379-L389
I tried to make that timeout a bit bigger (120 instead of 60), but nope, it's still failing.
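For illustration, what that module does is essentially a bounded polling loop: keep fetching the agent entry until its configuration contains bridge_mappings, or give up after the timeout. Here is a minimal, self-contained sketch of that pattern (the `wait_for_bridge_mappings` name and the `get_agent` callable are hypothetical stand-ins, not the module's actual code):

```python
import time

def wait_for_bridge_mappings(get_agent, timeout=60, interval=5):
    """Poll until the agent entry reports non-empty bridge_mappings.

    get_agent is a caller-supplied callable returning the agent dict
    (or None while the agent is not listed yet) -- a hypothetical
    stand-in for the Neutron API lookup the real module performs.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        agent = get_agent()
        if agent and agent.get("configuration", {}).get("bridge_mappings"):
            return agent
        time.sleep(interval)
    raise TimeoutError("Timeout waiting for bridge_mappings")

# Example: the agent only shows up (with mappings) on the third poll.
calls = iter([None,
              {"configuration": {}},
              {"configuration": {"bridge_mappings": {"ctlplane": "yes"}}}])
agent = wait_for_bridge_mappings(lambda: next(calls), timeout=30, interval=0)
print(agent["configuration"]["bridge_mappings"])  # {'ctlplane': 'yes'}
```

So bumping `timeout` only helps if the agent eventually reports in time; if it never does within the window, the loop fails exactly as in the log above.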
After checking the Neutron agents, it seems the one we need simply takes too long to appear in the list.
Before the timeout, there's this list:
+--------------------------------------+--------------------+-------------------------+-------------------+-------+-------+---------------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+--------------------+-------------------------+-------------------+-------+-------+---------------------------+
| 5f390380-74ee-44f3-b3e5-b1cbf18fe142 | Open vSwitch agent | undercloud.mydomain.tld | None | :-) | UP | neutron-openvswitch-agent |
| 7be250d6-b9ae-4a35-8985-5e0c4375d848 | L3 agent | undercloud.mydomain.tld | nova | :-) | UP | neutron-l3-agent |
| ecee547a-d0e1-4605-8dd4-c21a98ebfc4f | DHCP agent | undercloud.mydomain.tld | nova | :-) | UP | neutron-dhcp-agent |
+--------------------------------------+--------------------+-------------------------+-------------------+-------+-------+---------------------------+
We can see we're missing the ironic-neutron-agent one. It appears a bit later though:
+--------------------------------------+--------------------+--------------------------------------+-------------------+-------+-------+---------------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+--------------------+--------------------------------------+-------------------+-------+-------+---------------------------+
| 5f390380-74ee-44f3-b3e5-b1cbf18fe142 | Open vSwitch agent | undercloud.mydomain.tld | None | :-) | UP | neutron-openvswitch-agent |
| 7be250d6-b9ae-4a35-8985-5e0c4375d848 | L3 agent | undercloud.mydomain.tld | nova | :-) | UP | neutron-l3-agent |
| ad27978d-a635-4054-b798-c45d6d92da3f | Baremetal Node | 1b82598d-5c0d-447e-b0e8-05c5cc8b2b63 | None | :-) | UP | ironic-neutron-agent |
| ecee547a-d0e1-4605-8dd4-c21a98ebfc4f | DHCP agent | undercloud.mydomain.tld | nova | :-) | UP | neutron-dhcp-agent |
+--------------------------------------+--------------------+--------------------------------------+-------------------+-------+-------+---------------------------+
So 120s isn't long enough either; but if we have to raise that timeout by that much, it probably means "something" in the whole chain is really slow.
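As a side note, the check I did by hand above could be scripted against the parsed output of `openstack network agent list -f json`; a minimal sketch, assuming entries are dicts keyed like the default table columns ('Binary', 'Alive'):

```python
def ironic_agent_ready(agents):
    """Return True when an ironic-neutron-agent entry is listed and alive.

    `agents` is assumed to be the parsed JSON output of
    `openstack network agent list -f json`: a list of dicts whose
    'Binary' and 'Alive' keys match the table columns shown above.
    """
    return any(a["Binary"] == "ironic-neutron-agent" and a["Alive"]
               for a in agents)

# Sample data mirroring the two listings above.
before = [{"Binary": "neutron-openvswitch-agent", "Alive": True},
          {"Binary": "neutron-l3-agent", "Alive": True},
          {"Binary": "neutron-dhcp-agent", "Alive": True}]
after = before + [{"Binary": "ironic-neutron-agent", "Alive": True}]
print(ironic_agent_ready(before), ironic_agent_ready(after))  # False True
```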
It was working fine on the very same hardware last week.
The UC node is a virtual machine with 12 cores and 27G of RAM.
The OC is just 1 controller.
TripleO related packages versions:
openstack-tripleo-heat-templates-15.0.1-0.20210823104807.d12028c.el8.noarch
tripleo-ansible-4.0.1-0.20210823130700.75cb8b5.el8.noarch
python3-tripleoclient-17.0.1-0.20210821033245.72cba87.el8.noarch
ansible-tripleo-ipsec-11.0.1-0.20210304160420.b5559c8.el8.noarch
openstack-tripleo-validations-15.0.1-0.20210819182907.d00a7f9.el8.noarch
openstack-tripleo-common-containers-16.1.1-0.20210823075805.1ef4c88.el8.noarch
puppet-tripleo-15.0.1-0.20210823102254.3296269.el8.noarch
ansible-tripleo-ipa-0.2.2-0.20210422191945.9159108.el8.noarch
openstack-tripleo-common-16.1.1-0.20210823075805.1ef4c88.el8.noarch
python3-tripleo-repos-0.1.1-0.20210817121802.041c6d5.el8.noarch
ansible-role-tripleo-modify-image-1.2.3-0.20210820065807.0b9fdcc.el8.noarch
python3-tripleo-common-16.1.1-0.20210823075805.1ef4c88.el8.noarch
The containers were built on Aug-18, and ironic_neutron_agent container has the following neutron-related packages:
python3-neutron-18.1.0-0.20210816102305.7988ab5.el8.noarch
openstack-neutron-ml2-18.1.0-0.20210816102305.7988ab5.el8.noarch
python3-ironic-neutron-agent-4.1.0-0.20210802103610.ae68d69.el8.noarch
python3-neutronclient-7.5.0-0.20210610134047.ae39756.el8.noarch
openstack-neutron-common-18.1.0-0.20210816102305.7988ab5.el8.noarch
openstack-neutron-18.1.0-0.20210816102305.7988ab5.el8.noarch
openstack-neutron-rpc-server-18.1.0-0.20210816102305.7988ab5.el8.noarch
python3-neutron-lib-2.13.0-0.20210805084212.2b46c1f.el8.noarch
Not really sure what other data I can provide. A timeline, maybe?
We can see in /var/log/containers/neutron/ironic-neutron-agent.log the following lines:
2021-08-23 13:25:15.383 79921 INFO networking_baremetal.agent.ironic_neutron_agent [-] Starting agent networking-baremetal.
2021-08-23 13:25:15.527 79921 INFO networking_baremetal.agent.ironic_neutron_agent [-] Adding member id 56bb8051-e0f1-4e2a-9b1d-6c4ef3bdce05 on host undercloud.mydomain.tld to hashring.
2021-08-23 13:35:15.569 79921 INFO networking_baremetal.agent.ironic_neutron_agent [-] Reporting state for host agent 1b82598d-5c0d-447e-b0e8-05c5cc8b2b63 with new configuration: {'bridge_mappings': {'ctlplane': 'yes'}, 'log_agent_heartbeats': False}
2021-08-23 13:35:15.570 79921 DEBUG networking_baremetal.agent.ironic_neutron_agent [-] Reporting state for host: 1b82598d-5c0d-447e-b0e8-05c5cc8b2b63 with configuration: {'bridge_mappings': {'ctlplane': 'yes'}, 'log_agent_heartbeats': False} _report_state /usr/lib/python3.6/site-packages/networking_baremetal/agent/ironic_neutron_agent.py:241
2021-08-23 13:40:15.595 79921 DEBUG networking_baremetal.agent.ironic_neutron_agent [-] Reporting state for host: 1b82598d-5c0d-447e-b0e8-05c5cc8b2b63 with configuration: {'bridge_mappings': {'ctlplane': 'yes'}, 'log_agent_heartbeats': False} _report_state /usr/lib/python3.6/site-packages/networking_baremetal/agent/ironic_neutron_agent.py:241
It seems the relevant line ("Reporting state for host agent 1b82598d-5c0d-447e-b0e8-05c5cc8b2b63 with new configuration") comes about a minute and a half after the timeout error reported by `openstack overcloud node provision`...
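Just to put numbers on it, computing the gaps from the timestamps quoted in this mail:

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"
# Timestamps copied from the log excerpts above.
failure     = datetime.strptime("2021-08-23 13:33:46", fmt)  # provide timeout
agent_start = datetime.strptime("2021-08-23 13:25:15", fmt)  # "Starting agent"
first_state = datetime.strptime("2021-08-23 13:35:15", fmt)  # first state report

print((first_state - failure).total_seconds())      # 89.0  s after the failure
print((first_state - agent_start).total_seconds())  # 600.0 s after agent start
```

Interestingly, the agent's first state report lands a full 10 minutes after it started, which is probably the real anomaly here rather than the 89 seconds past the provision timeout.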
Maybe a hint?
Cheers,
C.
[1] https://github.com/cjeanner/tripleo-lab
Oh, another thing:
(undercloud) [CentOS-8 - stack@undercloud ~]$ openstack network agent show ad27978d-a635-4054-b798-c45d6d92da3f
+-------------------+-------------------------------------------------------------------------+
| Field             | Value                                                                   |
+-------------------+-------------------------------------------------------------------------+
| admin_state_up    | UP                                                                      |
| agent_type        | Baremetal Node                                                          |
| alive             | :-)                                                                     |
| availability_zone | None                                                                    |
| binary            | ironic-neutron-agent                                                    |
| configuration     | {'bridge_mappings': {'ctlplane': 'yes'}, 'log_agent_heartbeats': False} |
| created_at        | 2021-08-23 13:35:15                                                     |
| description       | None                                                                    |
| ha_state          | None                                                                    |
| host              | 1b82598d-5c0d-447e-b0e8-05c5cc8b2b63                                    |
| id                | ad27978d-a635-4054-b798-c45d6d92da3f                                    |
| last_heartbeat_at | 2021-08-23 13:45:15                                                     |
| name              | None                                                                    |
| resources_synced  | None                                                                    |
| started_at        | 2021-08-23 13:35:15                                                     |
| topic             | N/A                                                                     |
+-------------------+-------------------------------------------------------------------------+
We can see it did indeed start pretty late compared to the failure timeline :(.