Victoria legacy run-os-net-config.sh deployment issue

Bug #1904449 reported by Luke Short
This bug affects 1 person
Affects: tripleo
Status: Invalid
Importance: Critical
Assigned to: Luke Short
Milestone: (none)

Bug Description

Description
===========
An Overcloud deployment fails early on (before even executing the config-download playbooks). It actually fails to get the config-download content.

Steps to reproduce
==================
- Deploy an Overcloud using the Victoria release and pre-deployed nodes.
    - Here is my deploy command: http://paste.openstack.org/show/800066/
    - I am using legacy network configuration via the use of the TripleO Heat template parameter: `NetworkConfigWithAnsible: false`

Expected result
===============
The config is downloaded and then executed to start the installation of OpenStack.

Actual result
=============
The config fails to download.

Environment
===========
- OpenStack Victoria (current-tripleo-rdo from 2020-11-10)
- CentOS 8.2
- Ansible 2.9.13
- x1 Controller and x1 Compute

Logs & Configs
==============
2020-11-16 17:36:06.560330 | 52540001-e2d1-5ae8-eb91-00000000000a | TASK | Download config
2020-11-16 17:36:17.905133 | 52540001-e2d1-5ae8-eb91-00000000000a | FATAL | Download config | localhost | error={"changed": false, "error": "'NoneType' object has no attribute 'items'", "msg": "Error downloading config for overcloud: 'NoneType' object has no attribute 'items'", "success": false}
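
For readers unfamiliar with this class of error, here is a minimal sketch of how a message like `'NoneType' object has no attribute 'items'` typically arises in Python, and how a guard could turn it into something actionable. This is not the actual TripleO config-download code; the function names and the `config` key are hypothetical and exist only for illustration.

```python
# Hypothetical illustration only; this is NOT the real config-download code.


def collect_settings(role_data):
    """Return the role's config entries as sorted (key, value) pairs."""
    # dict.get() returns None when the key is missing, so the .items()
    # call below fails with the same message seen in the log above.
    config = role_data.get("config")
    return sorted(config.items())


def collect_settings_checked(role_data):
    """Same as collect_settings(), but fail with a descriptive error."""
    config = role_data.get("config")
    if config is None:
        raise ValueError(
            "role data has no 'config' mapping; "
            "check that the network config templates match this release"
        )
    return sorted(config.items())
```

The second variant does not fix the underlying problem, but it would point at the missing data rather than at a bare attribute error.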

Luke Short (ekultails)
Changed in tripleo:
status: New → Triaged
importance: Undecided → Critical
wes hayutin (weshayutin) wrote :

Luke, let's sync up to make sure upstream CI is hitting this so it doesn't happen again.

Luke Short (ekultails)
Changed in tripleo:
status: Triaged → Confirmed
Luke Short (ekultails) wrote :

Kevin Carter (cloudnull) and I were able to narrow this down to this change:

https://review.opendev.org/#/c/751720/

The switch from legacy network configurations managed by Heat (really a Bash script) to Ansible configurations was supposed to be backwards-compatible via the Heat parameter `NetworkConfigWithAnsible: false`. However, it is not. The legacy path requires the 'network/scripts/run-os-net-config.sh' script from tripleo-heat-templates, but that script no longer exists in Victoria. TripleO also seems to parse information in a way that is biased towards the new Ansible approach, which leads to the deployment error originally reported.

It would be great to (A) address this to be backwards compatible, (B) get more regression coverage in CI for this, (C) improve the error handling, and (D) add documentation on how to make the new style of network templates.

Regarding B, a simple CI job using a Standalone deployment with the parameter `NetworkConfigWithAnsible: false` and some form of network isolation or custom network configuration should replicate the issue.

summary: - Victoria cannot download config
+ Victoria legacy run-os-net-config.sh deployment issue
Luke Short (ekultails) wrote :

A related error I ran into before the reported issue:

2020-11-16 17:29:10.359 55061 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud heatclient.exc.CommandError: Could not fetch contents for file:///tmp/tripleoclient-tjoz0wgn/tripleo-heat-templates/network/scripts/run-os-net-config.sh

Partial fix to get past that error (grab run-os-net-config.sh from Ussuri):

$ cd ~/templates && mkdir -p network/scripts && curl -o network/scripts/run-os-net-config.sh https://raw.githubusercontent.com/openstack/tripleo-heat-templates/stable/ussuri/network/scripts/run-os-net-config.sh

Rabi Mishra (rabi) wrote :

Any reason you're using NetworkConfigWithAnsible: false? You should not be using it as you're using in-tree environment 'deployed-server-environment.yaml'. If you want to use heat network config, you should change that environment to an old version where OS::TripleO::{{role.name}}::Net::SoftwareConfig is appropriately mapped.

OS::TripleO::{{role.name}}::Net::SoftwareConfig: ../net-config-static-bridge.yaml

This is not a bug.

Kevin Carter (kevin-carter) wrote :

Rabi, it seems like the issue here is that there's no backwards compatibility with existing templates. This is a functional network template used in an OSP16.1 deployment https://gist.github.com/cloudnull/7bf2b47b5c467f23dd4ad88c04be8c71

Sadly because of [0] and [1] this isn't working when deploying with master.

[0] https://gist.github.com/cloudnull/7bf2b47b5c467f23dd4ad88c04be8c71#file-gistfile1-yaml-L188-L189
[1] https://gist.github.com/cloudnull/7bf2b47b5c467f23dd4ad88c04be8c71#file-gistfile1-yaml-L284-L287

While the fix is to simply update the templates to use the new scheme, I can see this becoming an issue for updates/upgrades.

Additionally, should a user provide the old-style network templates, the failure is extremely hard to track down. You have to use both extremely verbose output and Ansible in DEBUG mode to see where the failure is, then go into the module with a text editor to determine why it failed.

If there's no way to support backwards compatibility, maybe we can make the client detect and fail early in the deployment process?
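
As a rough sketch of what such an early check could look like: the snippet below scans the text of a NIC config template for markers of the legacy format and raises before any deployment work starts. The marker strings are an assumption on my part (`OS::Heat::SoftwareConfig` and a reference to `run-os-net-config.sh`), and the function and exception names are hypothetical; none of this is existing tripleoclient code.

```python
# Hypothetical pre-deployment check; names and markers are assumptions,
# not part of tripleoclient.

LEGACY_MARKERS = (
    "OS::Heat::SoftwareConfig",  # assumed resource type of legacy templates
    "run-os-net-config.sh",      # script referenced by legacy templates
)


class LegacyNicConfigError(Exception):
    """Raised when a NIC config template appears to use the legacy format."""


def find_legacy_markers(template_text):
    """Return the legacy markers that occur in the template text."""
    return [marker for marker in LEGACY_MARKERS if marker in template_text]


def check_nic_template(name, template_text):
    """Fail early, with a readable message, if the template looks legacy."""
    found = find_legacy_markers(template_text)
    if found:
        raise LegacyNicConfigError(
            f"{name}: legacy network config detected ({', '.join(found)}); "
            "convert the template to the Ansible-based format before deploying"
        )
```

A plain substring scan is deliberately crude. It avoids a YAML parser dependency, at the cost of also matching the markers inside comments.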

Rabi Mishra (rabi) wrote :

Oh, your config uses a SoftwareConfig resource and the run-os-net-config.sh script. You need to convert it with tools/convert_nic_config.py [1] before using it.

This will be documented for upgrades, as was done for the earlier conversions needed for Heat NIC config templates [2].

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/tools/convert_nic_config.py
[2] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/fast_forward_upgrades/assembly-preparing_for_overcloud_upgrade#converting-network-interface-templates-to-the-new-structure

Rabi Mishra (rabi) wrote :

There was a small bug in the conversion script I mentioned above, which has now been fixed with https://review.opendev.org/763125. I'm working on updating the docs to include the use of the script along with the details of the new Ansible NIC configs.

Rabi Mishra (rabi) wrote :

I just proposed an upstream docs update: https://review.opendev.org/#/c/763147/. Unless there are any more concerns, I'll mark this invalid.

Changed in tripleo:
status: Confirmed → Invalid
Luke Short (ekultails) wrote :

Rabi, thank you for all of the helpful insight.

I wonder if in the coming weeks we can spend some cycles together to give the upgrade process a better UX. More specifically, it seems like we already detect if old templates are used and could probably extend that logic to run the conversion script automatically. Or it could be some pre-upgrade validation, similar to what Kevin was suggesting.

Changed in tripleo:
status: Invalid → Confirmed
status: Confirmed → Triaged
Kevin Carter (kevin-carter) wrote :

Hey Rabi/Luke, when deploying a new environment from master I'm running into the following error using the following network template and net-data.

Error - http://paste.openstack.org/show/800272/

Network template for the controller - http://paste.openstack.org/show/800275/
Network data file - http://paste.openstack.org/show/800276/

The error is a little compressed given it was executed via Ansible; however, this is the output from executing os-net-config on the target host.

# os-net-config --config-file /etc/os-net-config/config.json --debug --detailed-exit-codes

Error - http://paste.openstack.org/show/800273/

It looks like the net config JSON file is generating a subnet of "None".

JSON file - http://paste.openstack.org/show/800274/

Based on the network templates I'm using, I'm not sure why the subnet would be set to None or how to correctly update this within my deployment. I'm assuming our tooling isn't pulling the defaults?

Kevin Carter (kevin-carter) wrote :

So I was able to get through the deployment. In the end I believe I got it working by having the templates in the right order. I think we should take a look at the mechanism that generates the network configs (YAML or JSON) and add assertions in code to ensure we're not running into a None value when we're expecting a CIDR. Maybe we could get it to default None to a /32?

Regardless, I've re-marked this LP as invalid and will file a new bug should future issues be found.
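
To make the assertion idea concrete, here is a minimal sketch that validates address entries in a generated config before it is handed to os-net-config. It assumes the config is a dict with a top-level `network_config` list whose entries may carry `addresses` items with an `ip_netmask` string; the function names are hypothetical and this is not existing TripleO or os-net-config code. It only walks top-level entries, not nested members.

```python
import ipaddress


def validate_ip_netmask(value):
    """Return a parsed interface, or raise ValueError for a bad CIDR."""
    # Require an explicit prefix so a missing subnet is never silently
    # treated as a host route.
    if not isinstance(value, str) or "/" not in value:
        raise ValueError(f"expected 'address/prefix', got {value!r}")
    try:
        return ipaddress.ip_interface(value)
    except ValueError as exc:
        # Covers values such as '192.0.2.10/None'.
        raise ValueError(f"invalid ip_netmask {value!r}: {exc}") from exc


def validate_network_config(config):
    """Check every ip_netmask on the top-level network_config entries."""
    for entry in config.get("network_config", []):
        for address in entry.get("addresses", []):
            validate_ip_netmask(address.get("ip_netmask"))
```

This sketch rejects a missing prefix instead of defaulting it to /32, on the assumption that a loud failure is easier to debug than a silently narrowed subnet. Defaulting would be a one-line change if that behaviour is preferred.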

Changed in tripleo:
status: Triaged → Invalid