ironic-python-agent doesn't wait for IPv4 DHCP before trying to determine API versions

Bug #1945503 reported by Drew Freiberger
This bug affects 1 person
Affects: OpenStack Ironic Conductor Charm
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

When cleaning ironic baremetal nodes, if IPv6 is enabled, network-online.target is reached immediately due to IPv6 addresses being bound at boot time.

The ironic-python-agent then fails its DNS lookups when trying to reach the ironic-api units:

Sep 28 14:29:39 ubuntu ironic-python-agent[1053]: 2021-09-28 14:29:39.986 1053 WARNING ironic_python_agent.ironic_api_client [-] Error detected while attempting to perform lookup with https://ironic-api-int.mysite.com:6385, retrying. Error: HTTPSConnectionPool(host='ironic-api-int.mysite.com', port=6385): Max retries exceeded with url: /v1/lookup?addresses=0c%3A42%3Aa1%3Acb%3A75%3A90%2C0c%3A42%3Aa1%3Acb%3A75%3A91 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f28482e9a58>: Failed to establish a new connection: [Errno -2] Name or service not known',)): requests.exceptions.ConnectionError: HTTPSConnectionPool(host='ironic-api-int.mysite.com', port=6385): Max retries exceeded with url: /v1/lookup?addresses=0c%3A42%3Aa1%3Acb%3A75%3A90%2C0c%3A42%3Aa1%3Acb%3A75%3A91 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f28482e9a58>: Failed to establish a new connection: [Errno -2] Name or service not known',))#033[00m

systemd shows network-online.target was reached at 14:29:05, but system logs show dhclient for eno1 wasn't started until 14:29:34 and didn't receive an address until 14:29:46.

I think either ironic-python-agent needs to wait for DHCP addresses to be plumbed on all links that have carrier, or the charm should investigate disabling IPv6 for deployment/cleaning boots as a workaround for this issue.
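As a rough illustration of the kind of wait suggested above, here is a minimal sketch that blocks until the API hostname resolves over IPv4 before any lookup is attempted. This is not code from ironic-python-agent; the function name `wait_for_ipv4_resolution` and its parameters are made up for this example, and treating successful IPv4 name resolution as a proxy for "DHCP has finished" is an assumption based on the error in the log above.

```python
import socket
import time


def wait_for_ipv4_resolution(host, port, timeout=120.0, interval=2.0,
                             resolver=socket.getaddrinfo, sleep=time.sleep):
    """Block until `host` resolves to at least one IPv4 address.

    Hypothetical helper, not part of ironic-python-agent. Returns the
    sorted list of resolved IPv4 addresses, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Restrict the query to AF_INET so an early IPv6-only network
            # state does not count as "online".
            infos = resolver(host, port, socket.AF_INET, socket.SOCK_STREAM)
            addresses = sorted({info[4][0] for info in infos})
            if addresses:
                return addresses
        except socket.gaierror:
            # Same failure class as "[Errno -2] Name or service not known".
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "%s did not resolve over IPv4 within %s seconds"
                % (host, timeout))
        sleep(interval)
```

With the hostname and port from the log, a caller could run `wait_for_ipv4_resolution("ironic-api-int.mysite.com", 6385)` before the first lookup request. The `resolver` and `sleep` parameters exist only so the retry loop can be exercised without a real network.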

Drew Freiberger (afreiberger) wrote :

This turned out to be a bit of a red herring that came up during deployment of new baremetal nodes. The nodes were stuck in 'clean wait' and were heartbeating properly to ironic-conductor. However, because a prior 'clean failed' had caused the ironic service to put the nodes into "maintenance=True" mode, they would not clean on 'openstack baremetal node provide <uuid>' until 'openstack baremetal node maintenance unset <uuid>' had been run first.

Ultimately, I think this is an upstream bug: "openstack baremetal node provide <uuid>" should clear maintenance=True if it was set automatically by a prior failed clean action.
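Until that changes, the order of operations in the workaround can be sketched as below. This is only an illustration: `provide_commands` is a hypothetical helper, and the shape of the `node` dict (with `uuid` and `maintenance` keys) is an assumption modelled on the JSON output of 'openstack baremetal node show <uuid> -f json'. The maintenance command is written as 'openstack baremetal node maintenance unset', which is how the current client spells it. The function only builds the command lines; it does not run them.

```python
def provide_commands(node):
    """Return the CLI invocations to run, in order, to provide `node`.

    Hypothetical helper. `node` is assumed to be a dict with at least
    'uuid' and 'maintenance' keys.
    """
    uuid = node["uuid"]
    commands = []
    if node.get("maintenance"):
        # Maintenance must be cleared first, otherwise the node will not
        # proceed with cleaning when it is provided.
        commands.append(
            ["openstack", "baremetal", "node", "maintenance", "unset", uuid])
    commands.append(["openstack", "baremetal", "node", "provide", uuid])
    return commands
```

Each returned list could then be passed to `subprocess.run(cmd, check=True)` in order.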
