meaning of option "router_auto_schedule" is ambiguous

Bug #1973656 reported by norman shen
This bug affects 2 people
| Affects | Status | Importance | Assigned to | Milestone |
| --- | --- | --- | --- | --- |
| neutron | New | Low | Unassigned | |

Bug Description

I find the meaning of the option "router_auto_schedule" hard to follow. A quick code review shows it is only used in one place (tests excluded):

```python
    def get_router_ids(self, context, host):
        """Returns IDs of routers scheduled to l3 agent on <host>

        This will autoschedule unhosted routers to l3 agent on <host> and then
        return all ids of routers scheduled to it.
        """
        if extensions.is_extension_supported(
                self.l3plugin, constants.L3_AGENT_SCHEDULER_EXT_ALIAS):
            if cfg.CONF.router_auto_schedule:
                self.l3plugin.auto_schedule_routers(context, host)
        return self.l3plugin.list_router_ids_on_host(context, host)
```

which seems to fix routers that have no agents associated with them. Even if I turn this option off, routers can still be properly scheduled to agents, because

```python
    @registry.receives(resources.ROUTER, [events.AFTER_CREATE],
                       priority_group.PRIORITY_ROUTER_EXTENDED_ATTRIBUTE)
    def _after_router_create(self, resource, event, trigger, context,
                             router_id, router, router_db, **kwargs):
        if not router['ha']:
            return
        try:
            self.schedule_router(context, router_id)
            router['ha_vr_id'] = router_db.extra_attributes.ha_vr_id
            self._notify_router_updated(context, router_id)
        except Exception as e:
            with excutils.save_and_reraise_exception() as ctx:
                if isinstance(e, l3ha_exc.NoVRIDAvailable):
                    ctx.reraise = False
                    LOG.warning("No more VRIDs for router: %s", e)
                else:
                    LOG.exception("Failed to schedule HA router %s.",
                                  router_id)
                router['status'] = self._update_router_db(
                    context, router_id,
                    {'status': constants.ERROR})['status']
```

does not seem to respect this option.

So IMO `router_auto_schedule` would be better renamed to something like `fix_dangling_routers`, and it could be turned off if the user wants to fix broken routers manually. The reason is that counting routers by agent is pretty expensive for a relatively large deployment with around 10,000 routers.

Tags: l3-ipam-dhcp
norman shen (jshen28)
description: updated
Slawek Kaplonski (slaweq) wrote :

The description of that config option is "Allow auto scheduling of routers to L3 agent." Based on that, I'm not sure whether it should be responsible only for scheduling broken routers or for scheduling routers to L3 agents in general.
IMO we should first talk about this in the team meeting before we decide what to do next with it.

tags: added: l3-ipam-dhcp
Changed in neutron:
importance: Undecided → Low
Brian Haley (brian-haley) wrote :

So some of the context of this option is gone, but I can tell you it was originally introduced in the Quantum days in 2013 - https://review.opendev.org/c/openstack/neutron/+/21175

There were other changes after that moving code around the tree, etc.

I *think* the intention was for an operator to be able to control auto-scheduling of routers during upgrades, for example when they manually move routers off an agent in order to bring the controller node down. But I don't remember exactly as it's been a looong time, even for me. I just don't believe the intent was to ever fix a 'dangling' router, it was to just leave things where they were for a period of time. That doesn't mean it's not confusing.

Oleg might remember, as he was around in those days as well and he fixed the last bug :)

Oleg Bondarev (obondarev) wrote :

I agree with Brian. This option controls whether an unscheduled router (e.g. one created when all l3 agents were down) should be scheduled to the first l3 agent that comes online or restarts. This might be useful for an operator who wants more precise control over router scheduling.
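To make the two code paths in this thread concrete, here is a minimal toy model. It is a sketch, not neutron code: `ToyScheduler`, `simulate`, and the host name `node-1` are hypothetical, and the creation path is assumed to schedule regardless of the option, as the bug description reports. Only the agent sync path is gated by `router_auto_schedule`.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Hypothetical stand-in for the L3 plugin; not neutron code."""
    router_auto_schedule: bool = True
    alive_agents: set = field(default_factory=set)
    bindings: dict = field(default_factory=dict)  # router_id -> host or None

    def create_router(self, router_id):
        # Assumption: creation-time scheduling never consults the option;
        # it only needs a live agent to exist.
        host = min(self.alive_agents) if self.alive_agents else None
        self.bindings[router_id] = host

    def get_router_ids(self, host):
        # Called when the agent on <host> syncs, e.g. after a (re)start.
        self.alive_agents.add(host)
        if self.router_auto_schedule:
            # Pick up routers that nobody hosts yet.
            for router_id, bound in self.bindings.items():
                if bound is None:
                    self.bindings[router_id] = host
        return sorted(r for r, h in self.bindings.items() if h == host)


def simulate(auto_schedule):
    sched = ToyScheduler(router_auto_schedule=auto_schedule)
    sched.create_router("r1")  # all agents down, so r1 stays unscheduled
    first_sync = sched.get_router_ids("node-1")
    sched.create_router("r2")  # an agent is up, so r2 is scheduled on create
    return first_sync, sched.get_router_ids("node-1")
```

In this model, `simulate(True)` lets the first agent pick up the orphaned `r1`, while `simulate(False)` leaves `r1` unscheduled until an operator places it by hand. `r2` is scheduled in both cases.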

norman shen (jshen28) wrote :

Indeed, fixing unscheduled routers is useful, but IMO it does not quite fit the
option name "router_auto_schedule". Going by its name, turning it off sounds more like users/operators choosing to
schedule routers to agents on their own.

I am bringing this option up because the logic to find unscheduled routers is rather expensive and could put a heavy burden on the database each time an l3 agent restarts.

IMO creating new routers while all l3 agents are down is a pretty rare situation, so the cost of
this option might not really pay off...

Brian Haley (brian-haley) wrote :

As a former operator I remember this type of operation being quite useful. We would "drain" a controller before a reboot and this option would help with that; otherwise we were seeing routers come back unexpectedly. That was a while ago...

Lajos Katona (lajos-katona) wrote :

We discussed this topic during the drivers meeting, see the logs:
https://meetings.opendev.org/meetings/neutron_drivers/2022/neutron_drivers.2022-05-27-14.00.log.html#l-94

The agreement was to first improve the documentation so that it clearly explains how this cfg option works, and after that to start the discussion on whether we need to change this behavior to be something like network_auto_schedule for dhcp-agents.

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/842141
Committed: https://opendev.org/openstack/neutron/commit/d4654e3011d22b6d789215fcaa1f15f2d7b9a99a
Submitter: "Zuul (22348)"
Branch: master

commit d4654e3011d22b6d789215fcaa1f15f2d7b9a99a
Author: ushen <email address hidden>
Date: Tue May 17 18:50:44 2022 +0800

    Filter out unsatisfied routers in SQL

    We saw auto_schedule_routers took over 40 seconds
    for a DVR enabled environment with option
    auto_schedule_routers enabled.

    Adding new arguments to get_router_agents_count and
    dealing with routers separately depending on whether
    it is a regular router or HA. The benefits are
    we do not need to loop over every router available in
    environment. Another reason for doing this is that
    get_router_agents_count is used solely to heal
    routers with less than required agents so number of
    routers with less agents is small for most of the times.

    Related-Bug: #1973656

    Change-Id: Ic29275815a8c32cee7a6470509687a18fa594514
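The idea in the commit message, letting the database return only routers with fewer agents than required instead of looping over every router in Python, can be sketched with the standard `sqlite3` module. This is an illustration only: the `routers` and `bindings` tables, the function names, and the `required_agents` parameter are assumptions and do not reflect neutron's schema or the actual signature of `get_router_agents_count`.

```python
import sqlite3


def under_scheduled_routers(conn, required_agents):
    """Return ids of routers bound to fewer than required_agents agents.

    Table and column names are illustrative, not neutron's schema.
    """
    rows = conn.execute(
        """
        SELECT r.id
        FROM routers AS r
        LEFT JOIN bindings AS b ON b.router_id = r.id
        GROUP BY r.id
        HAVING COUNT(b.agent_id) < ?
        ORDER BY r.id
        """,
        (required_agents,),
    )
    return [row[0] for row in rows]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE routers (id TEXT PRIMARY KEY);
        CREATE TABLE bindings (router_id TEXT, agent_id TEXT);
        INSERT INTO routers VALUES ('r1'), ('r2'), ('r3');
        INSERT INTO bindings VALUES ('r1', 'agent-a'), ('r3', 'agent-a'),
                                    ('r3', 'agent-b');
        """
    )
    # Threshold 1 stands in for a regular router, 2 for a hypothetical
    # HA router that wants two agents.
    return under_scheduled_routers(conn, 1), under_scheduled_routers(conn, 2)
```

Because the `HAVING` clause filters on the database side, only the usually small set of under-scheduled routers is returned to the caller.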
