Router with l3_agent state unknown

Bug #2028338 reported by Mohankumar
This bug affects 1 person
Affects: neutron
Status: Incomplete
Importance: Undecided
Assigned to: Mohankumar

Bug Description

❯ openstack network agent list --router r1 --long

+--------------------------------------+------------+---------------------------------------------+-------------------+-------+-------+----------------------+----------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary | HA State |
+--------------------------------------+------------+---------------------------------------------+-------------------+-------+-------+----------------------+----------+
| 6faf003f-ef57-46a7-a3d4-eb63a8561e09 | L3 agent | req-multi-devstack-stein-85-4568-minion2-mn | nova | XXX | UP | neutron-ovh-l3-agent | unknown |
| bf90fed6-a31d-46a8-8fe9-a9ce587f241d | L3 agent | req-multi-devstack-stein-85-4568-minion1-mn | nova | XXX | UP | neutron-ovh-l3-agent | standby |
+--------------------------------------+------------+---------------------------------------------+-------------------+-------+-------+----------------------+----------+

❯ openstack network agent list --router r1 --long

+--------------------------------------+------------+---------------------------------------------+-------------------+-------+-------+----------------------+----------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary | HA State |
+--------------------------------------+------------+---------------------------------------------+-------------------+-------+-------+----------------------+----------+
| 6faf003f-ef57-46a7-a3d4-eb63a8561e09 | L3 agent | req-multi-devstack-stein-85-4568-minion2-mn | nova | :-) | UP | neutron-ovh-l3-agent | unknown |
| bf90fed6-a31d-46a8-8fe9-a9ce587f241d | L3 agent | req-multi-devstack-stein-85-4568-minion1-mn | nova | :-) | UP | neutron-ovh-l3-agent | standby |
+--------------------------------------+------------+---------------------------------------------+-------------------+-------+-------+----------------------+----------+

When neutron-server notices that the l3_agent is not active (the agent health check fails), it sets the router's HA state on that agent to "unknown". But when the l3_agent goes down, for example in process force-stop or crash scenarios, and is later recovered, the l3_agent HA state is not reverted back to its original value.

When some update happens on the router, the actual state is restored, but it is not restored after the agent itself is restarted / recovered.
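
As an illustration of that behaviour (the router name r1 is from the report; which attribute is changed is an assumption, any update on the router object is reported to restore the state):

    # Trigger an update on the router object, then re-check the HA state.
    openstack router set r1 --description "refresh HA state"
    openstack network agent list --router r1 --long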

description: updated
Changed in neutron:
assignee: nobody → Mohankumar (mohankumar-n)
Revision history for this message
Bence Romsics (bence-romsics) wrote :

Hi,

Thank you for the report! It sounds like there's a bug here, but could you please give us a bit more detail on how to reproduce this, like exact commands that lead to the error?

Changed in neutron:
status: New → Incomplete
Revision history for this message
Mohankumar (mohankumar-n) wrote (last edit):

Reproduce steps:

>> Create a DVR router with two l3 agents in HA mode (master/standby).

>> Stop the master l3_agent process (or both l3_agents) on the SNAT node and wait for neutron-server to mark it "unknown": https://github.com/openstack/neutron/blob/master/neutron/db/l3_hamode_db.py#L609

>> Start the master l3_agent process again. In "openstack network agent list --router r1 --long" you can see the status is still marked as "unknown", not restoring the actual state ("active", i.e. master, in this case). Other traffic is fine; the VRRP / underlying keepalived keep working as usual. (Exact commands are sketched below.)

Whenever we trigger some update on the router object, it is able to restore the actual state.
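
A minimal command sketch of the steps above (router name, host and systemd unit are illustrative; the devstack-style unit devstack@q-l3 is an assumption, adjust for your deployment):

    # 1. Create a DVR+HA router; it will be scheduled to two L3 agents.
    openstack router create --distributed --ha r1

    # 2. On the node hosting the master instance, stop the L3 agent and wait
    #    until neutron-server marks the HA state "unknown".
    sudo systemctl stop devstack@q-l3
    openstack network agent list --router r1 --long

    # 3. Start the L3 agent again; the HA state stays "unknown" even though the
    #    agent is alive and keepalived still holds the VRRP master role.
    sudo systemctl start devstack@q-l3
    openstack network agent list --router r1 --long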

description: updated
description: updated
Changed in neutron:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Mohankumar:

Please check the L3 agent logs and search for any of the messages present in [1]. This method runs in a parallel thread and is executed periodically to report the status of the agent.

In the Neutron API logs (with DEBUG enabled) you should also be able to find the "report_state" RPC call from the agents. This RPC call updates the agent status.

To start debugging this issue, please provide the logs of the Neutron API and the L3 agent, as well as the version you are using.

Regards.

[1] https://github.com/openstack/neutron/blob/dbe4ba910b3236ff3ac42e33dcb4cc067b1f9177/neutron/agent/l3/agent.py#L1025-L1065
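
For example, a quick way to look for those messages (log location and unit name are illustrative and depend on the deployment):

    # Search the L3 agent log for the periodic state-report messages.
    sudo journalctl -u devstack@q-l3 | grep -iE 'report.?state'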

Revision history for this message
Mohankumar (mohankumar-n) wrote :

You can see that after multiple restarts the L3 agent is stuck in the "unknown" state.

stack@req-generic-73-3947-mn:~$ openstack network agent list --router r1 --long
+--------------------------------------+------------+------------------------+-------------------+-------+-------+------------------+----------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary | HA State |
+--------------------------------------+------------+------------------------+-------------------+-------+-------+------------------+----------+
| 16e220ee-da1b-4dff-a7e9-2dc737ecf7eb | L3 agent | req-generic-73-3947-mn | nova | :-) | UP | neutron-l3-agent | unknown |
| 60864b8e-f52a-4f45-9dc0-70457658043e | L3 agent | req-generic-72-ea2a-mn | nova | :-) | UP | neutron-l3-agent | active |
+--------------------------------------+------------+------------------------+-------------------+-------+-------+------------------+----------+
stack@req-generic-73-3947-mn:~$

neutron-server.log: https://pastebin.ubuntu.com/p/TjcRc5ZMkB/

l3-agent.log: https://pastebin.ubuntu.com/p/KJN4TKqfjV/

In the neutron server (q-svc) logs I couldn't find the report_state loop; maybe journalctl is not collecting the API logs.
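
One way to double-check (assumptions: the devstack-style devstack@q-svc unit, and debug=True set in neutron.conf so the RPC call is logged):

    # Look for the agents' report_state RPC calls in the neutron-server log.
    sudo journalctl -u devstack@q-svc | grep -i report_state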

Revision history for this message
Mohankumar (mohankumar-n) wrote (last edit):

What works to get rid of this issue: add a check for routers whose agent has been revived in the "fetch_and_sync_all_routers" method in neutron/agent/l3/agent.py:

    if r.get('_ha_state') == 'unknown':
        LOG.debug("Processing router with l3_agent in "
                  "unknown state")
        self._process_added_router(r)

and the entire method then looks like this:

    def fetch_and_sync_all_routers(self, context, ns_manager):
        prev_router_ids = set(self.router_info)
        curr_router_ids = set()
        timestamp = timeutils.utcnow()
        router_ids = []
        chunk = []
        is_snat_agent = (self.conf.agent_mode ==
                         lib_const.L3_AGENT_MODE_DVR_SNAT)
        try:
            router_ids = self.plugin_rpc.get_router_ids(context)
            LOG.debug("Router IDs from Neutron API: %s", router_ids)
            self._remove_orphan_routers_config(
                ha_confs_path=self.ha_confs_path,
                router_ids=set(router_ids),
            )
            # fetch routers by chunks to reduce the load on server and to
            # start router processing earlier
            for i in range(0, len(router_ids), self.sync_routers_chunk_size):
                chunk = router_ids[i:i + self.sync_routers_chunk_size]
                routers = self.plugin_rpc.get_routers(context, chunk)
                LOG.debug('Processing :%r', routers)
                for r in routers:
                    curr_router_ids.add(r['id'])
                    ns_manager.keep_router(r['id'])
                    if r.get('distributed'):
                        # need to keep fip namespaces as well
                        ext_net_id = (r['external_gateway_info'] or {}).get(
                            'network_id')
                        if ext_net_id:
                            ns_manager.keep_ext_net(ext_net_id)
                        elif is_snat_agent and not r.get('ha'):
                            ns_manager.ensure_snat_cleanup(r['id'])
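                        # Re-process routers whose HA state is still reported
                        # as 'unknown' so the revived agent refreshes the
                        # actual state on the server side.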
                        if r.get('_ha_state') == 'unknown':
                            LOG.debug("Processing router with l3_agent in "
                                      "unknown state")
                            self._process_added_router(r)
                    update = queue.ResourceUpdate(
                        r['id'],
                        PRIORITY_SYNC_ROUTERS_TASK,
                        resource=r,
                        action=ADD_UPDATE_ROUTER,
                        timestamp=timestamp)
                    self._queue.add(update)
        except oslo_messaging.MessagingTimeout:
            if self.sync_routers_chunk_size > SYNC_ROUTERS_MIN_CHUNK_SIZE:
                self.sync_routers_chunk_size = max(
                    self.sync_routers_chunk_size // 2,
                    SYNC_ROUTERS_MIN_CHUNK_SIZE)
                LOG.error('Server failed to return info for routers in '
                          'required time, decreasing chunk size to: %s',
                          self.sync_routers_chunk_size)
            else:
                LOG.error('Server failed to return info for routers in '
                          'required time even with min chunk size: %s. '
                     ...

