ovn-chassis in error state if enable-auto-restarts is false

Bug #1943970 reported by Giuseppe Petralia
This bug affects 2 people
Affects            Status     Importance  Assigned to  Milestone
charm-ovn-chassis  Confirmed  High        Liam Young

Bug Description

When installing ovn-chassis with enable-auto-restarts set to false, units go into an error state because the install hook is skipped and nova-compute-relation-joined then fails with:

unit-ovn-chassis-sriov-84: 12:01:19 ERROR unit.ovn-chassis-sriov/84.juju-log nova-compute:154: Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-ovn-chassis-sriov-84/.venv/lib/python3.6/site-packages/charms_openstack/charm/core.py", line 975, in render_configs
    _render(os.path.basename(conf))
  File "/var/lib/juju/agents/unit-ovn-chassis-sriov-84/.venv/lib/python3.6/site-packages/charms_openstack/charm/core.py", line 972, in _render
    perms=self.permission_override_map.get(conf) or 0o640,
  File "/var/lib/juju/agents/unit-ovn-chassis-sriov-84/.venv/lib/python3.6/site-packages/charmhelpers/core/templating.py", line 92, in render
    host.write_file(target, content.encode(encoding), owner, group, perms)
  File "/var/lib/juju/agents/unit-ovn-chassis-sriov-84/.venv/lib/python3.6/site-packages/charmhelpers/core/host.py", line 547, in write_file
    gid = grp.getgrnam(group).gr_gid
KeyError: "getgrnam(): name not found: 'neutron'"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-ovn-chassis-sriov-84/.venv/lib/python3.6/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-ovn-chassis-sriov-84/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-ovn-chassis-sriov-84/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-ovn-chassis-sriov-84/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-ovn-chassis-sriov-84/charm/reactive/ovn_chassis_charm_handlers.py", line 112, in configure_ovs
    'amqp.connected'))
  File "/var/lib/juju/agents/unit-ovn-chassis-sriov-84/.venv/lib/python3.6/site-packages/charms_openstack/charm/core.py", line 997, in render_with_interfaces
    charm_instance=self))
  File "/var/lib/juju/agents/unit-ovn-chassis-sriov-84/.venv/lib/python3.6/site-packages/charms_openstack/charm/core.py", line 981, in render_configs
    _render('_'.join(conf.split(os.path.sep))[1:])
  File "/var/lib/juju/agents/unit-ovn-chassis-sriov-84/.venv/lib/python3.6/site-packages/charms_openstack/charm/core.py", line 972, in _render
    perms=self.permission_override_map.get(conf) or 0o640,
  File "/var/lib/juju/agents/unit-ovn-chassis-sriov-84/.venv/lib/python3.6/site-packages/charmhelpers/core/templating.py", line 84, in render
    raise e
  File "/var/lib/juju/agents/unit-ovn-chassis-sriov-84/.venv/lib/python3.6/site-packages/charmhelpers/core/templating.py", line 79, in render
    template = template_env.get_template(source)
  File "/var/lib/juju/agents/unit-ovn-chassis-sriov-84/.venv/lib/python3.6/site-packages/jinja2/environment.py", line 830, in get_template
    return self._load_template(name, self.make_globals(globals))
  File "/var/lib/juju/agents/unit-ovn-chassis-sriov-84/.venv/lib/python3.6/site-packages/jinja2/environment.py", line 804, in _load_template
    template = self.loader.load(self, name, globals)
  File "/var/lib/juju/agents/unit-ovn-chassis-sriov-84/.venv/lib/python3.6/site-packages/jinja2/loaders.py", line 408, in load
    raise TemplateNotFound(name)
jinja2.exceptions.TemplateNotFound: etc_openvswitch_system-id.conf
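
The first traceback ends in a group lookup that fails because the 'neutron' group does not exist on the unit. The following is a minimal sketch of that failure mode and of a pre-check, using only the Python standard library. The helper names group_exists and missing_groups are hypothetical and are not part of charmhelpers or the charm.

```python
import grp


def group_exists(name: str) -> bool:
    """Return True if the local group database has an entry for `name`."""
    try:
        grp.getgrnam(name)
    except KeyError:
        # Same exception type as in the traceback: the group is unknown.
        return False
    return True


def missing_groups(required):
    """List the groups from `required` that do not exist on this host."""
    return [name for name in required if not group_exists(name)]
```

For example, missing_groups(["neutron"]) would return ["neutron"] on a host where the package that creates that group has not been installed yet.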

The workaround is to run the deferred hooks via the action and then resolve the units in error.
This is ovn-chassis-sriov rev 14.

Liam Young (gnuoy) wrote :

I had certainly envisaged that enable-auto-restarts would be set to True during deployment and then set to False once a deployment completed. However, the charm should do the right thing, so I agree this is a bug.

Changed in charm-ovn-chassis:
status: New → Incomplete
status: Incomplete → Confirmed
importance: Undecided → High
Giuseppe Petralia (peppepetra) wrote :

There is a corner case when we do expansions. While we would like to keep enable-auto-restarts set to False for safety, because older units may have restarts queued from earlier hook executions, we need to cope with new units being added to the cloud.

Liam Young (gnuoy)
Changed in charm-ovn-chassis:
assignee: nobody → Liam Young (gnuoy)
Liam Young (gnuoy) wrote :

I think it makes sense to add the ability for enable-auto-restarts to be set at the unit level via an action. A new flag, internal to the charm, called unit-enable-auto-restarts could be used to manage this state. It would be set/unset via unit charm actions.

If unit-enable-auto-restarts is set to True or False then it overrides the charm config option enable-auto-restarts. If it is unset, the behaviour reverts to whatever the charm config enable-auto-restarts is set to. If unit-enable-auto-restarts is set then this should be clear in the unit's workload status message.
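
The override rule described above could be sketched roughly as follows. This is only an illustration of the proposal, not the charm's code: the function names are hypothetical, None stands for "unset", and the wording of the status suffix is an assumption.

```python
from typing import Optional


def effective_auto_restarts(unit_override: Optional[bool], app_config: bool) -> bool:
    """Resolve whether restarts are allowed on a unit.

    unit_override is the proposed unit-level flag (None means unset);
    app_config is the application-level enable-auto-restarts option.
    """
    if unit_override is None:
        # Unset: fall back to the charm config value.
        return app_config
    # Set to True or False: the unit-level flag wins.
    return unit_override


def status_suffix(unit_override: Optional[bool]) -> str:
    """Hypothetical text appended to the workload status when the flag is set."""
    if unit_override is None:
        return ""
    return " (unit-enable-auto-restarts={})".format(unit_override)
```

With this rule, a unit whose flag is unset simply follows the application config, and a unit with the flag set makes that visible in its status message.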

In the scenario where an application has the enable-auto-restarts charm config set to False and a single unit needs to be rebooted, the operator can set unit-enable-auto-restarts=True for the unit that requires maintenance. The unit can then be rebooted, and when the maintenance is complete the operator sets unit-enable-auto-restarts=False.

The scenario where an application needs to be expanded while existing units must be prevented from performing service-interrupting events is slightly more complicated:

- Set unit-enable-auto-restarts=False via action for the existing units.
- Set the charm config option enable-auto-restarts=True.
- Expand the application.
- Either set unit-enable-auto-restarts=False for the new unit, or...
- Set enable-auto-restarts=False and unset unit-enable-auto-restarts on all units.
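
As a rough walk-through of the expansion steps, the sketch below models the flags as plain Python values and takes the second alternative at the end. The unit names are made up, and the dictionary-based model is an assumption for illustration only; it does not reflect how the charm stores state.

```python
def effective(app_config, overrides, unit):
    """Unit-level override wins when present; otherwise use the app config."""
    override = overrides.get(unit)
    return app_config if override is None else override


app_config = False                      # enable-auto-restarts at application level
overrides = {}                          # unit name -> unit-enable-auto-restarts
existing = ["ovn-chassis/0", "ovn-chassis/1"]   # hypothetical unit names

# Step 1: pin the existing units to False via the (proposed) action.
for unit in existing:
    overrides[unit] = False

# Step 2: allow restarts at the application level.
app_config = True

# Step 3: expand; the new unit has no override, so it follows the app config.
new_unit = "ovn-chassis/2"
all_units = existing + [new_unit]
during_expansion = {u: effective(app_config, overrides, u) for u in all_units}

# Final step (second alternative): set the app config back and unset all overrides.
app_config = False
overrides.clear()
after_expansion = {u: effective(app_config, overrides, u) for u in all_units}
```

During the expansion only the new unit is allowed to restart services; afterwards every unit is back to following the application-level False.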

I'm increasingly of the opinion that in the longer term the application level config option enable-auto-restarts would be removed in favour of the unit level setting.

Giuseppe Petralia (peppepetra) wrote :

From an operations perspective, unit-enable-auto-restarts seems to introduce more overhead when managing big clouds, because the step "Set unit-enable-auto-restarts=False via action for existing units" has to be performed through actions.

Also, treating enable-auto-restarts as a unit-level setting may produce mixed environments where units that allow restarts coexist with units that do not. If the operator forgets to set unit-enable-auto-restarts=False for a unit, this may cause outages impacting customer workloads.

Ideally, deferred events should support the following use cases:

- The first installation is not broken by deferred events.

- Deferred events do not prevent a unit from being rebooted, and they allow the unit to run all the hooks and restarts it needs at boot to get into a working state.

- If a queued deferred event needs to run for a unit to be fully working, the unit goes into a blocked state until the deferred event is executed via action.
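
The last use case could look roughly like the sketch below. It is an assumption-laden illustration: DeferredEvent, its required field, and the status messages are all hypothetical and do not describe the charm's actual data model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DeferredEvent:
    description: str
    required: bool  # True if the unit cannot fully work until this event runs


def workload_status(queued: List[DeferredEvent]) -> Tuple[str, str]:
    """Return a (state, message) pair for the unit's workload status."""
    blocking = [event.description for event in queued if event.required]
    if blocking:
        # The unit stays blocked until the operator runs the events via action.
        return ("blocked", "Deferred events must be run via action: " + ", ".join(blocking))
    if queued:
        return ("active", "Unit is ready ({} deferred event(s) queued)".format(len(queued)))
    return ("active", "Unit is ready")
```

A unit with only optional deferred restarts would stay active, while one that still needs, say, its initial package installation would report blocked.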

Paul Goins (vultaire) wrote :

This was merged on June 6th and is included in the stable/22.03 branch. There is a stable/22.03 release in Charmhub as well, so this is technically Fix Released at this point, unless I'm mistaken.

Paul Goins (vultaire) wrote :

Also, if this is encountered in the context of an SRIOV-enabled node, it may be possible to work around this by installing the neutron-sriov-agent package when this issue is encountered, and then running "juju resolved $UNIT" to retry the failed hook.
