During Newton->Ocata upgrade compute nodes lose network connectivity

Bug #1664670 reported by Marius Cornea
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Brent Eagles
Milestone: ocata-rc2

Bug Description

During the Newton->Ocata upgrade, compute nodes lose network connectivity. As a result the upgrade process gets stuck: nova-compute is not able to start because it cannot reach the rabbitmq servers running on the controller nodes.
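
A quick way to confirm the symptom on the compute node is to check that nova-compute cannot reach the message bus and that the rabbitmq port on the internal API network is unreachable (a sketch; the controller address is a placeholder and 5672 is rabbitmq's default AMQP port):

[root@overcloud-novacompute-1 ~]# systemctl status openstack-nova-compute
[root@overcloud-novacompute-1 ~]# journalctl -u openstack-nova-compute | grep -i 'AMQP server'
[root@overcloud-novacompute-1 ~]# nc -zv <controller-internal-api-ip> 5672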

This is the compute node upgrade output:
http://paste.openstack.org/show/598878/

From what I can tell, the issue appears to be related to openvswitch:

[root@overcloud-novacompute-1 ~]# tail -f /var/log/openvswitch/ovs-vswitchd.log
2017-02-14T19:00:59.068Z|05074|rconn|WARN|br-tun<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:00:59.068Z|05075|rconn|WARN|br-infra<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:07.067Z|05076|rconn|WARN|br-ex<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:07.067Z|05077|rconn|WARN|br-int<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:07.067Z|05078|rconn|WARN|br-tun<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:07.067Z|05079|rconn|WARN|br-infra<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:15.067Z|05080|rconn|WARN|br-ex<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:15.067Z|05081|rconn|WARN|br-int<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:15.067Z|05082|rconn|WARN|br-tun<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:15.067Z|05083|rconn|WARN|br-infra<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:23.067Z|05084|rconn|WARN|br-ex<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:23.067Z|05085|rconn|WARN|br-int<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:23.067Z|05086|rconn|WARN|br-tun<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:23.067Z|05087|rconn|WARN|br-infra<->tcp:127.0.0.1:6633: connection failed (Connection refused)

The interface used for reaching the rabbitmq servers (vlan200) is part of the br-infra bridge:

[root@overcloud-novacompute-1 ~]# ovs-vsctl list-ports br-infra
eth1
phy-br-infra
vlan200
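
To double-check that vlan200 is indeed the interface carrying the internal API address on this bridge (a sketch):

[root@overcloud-novacompute-1 ~]# ip -o -4 addr show dev vlan200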

neutron-openvswitch-agent is stopped:

[root@overcloud-novacompute-1 ~]# systemctl status neutron-openvswitch-agent
● neutron-openvswitch-agent.service - OpenStack Neutron Open vSwitch Agent
   Loaded: loaded (/usr/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Tue 2017-02-14 16:15:08 UTC; 2h 48min ago
 Main PID: 44934 (code=exited, status=0/SUCCESS)

Feb 13 09:25:37 overcloud-novacompute-1 systemd[1]: Started OpenStack Neutron Open vSwitch Agent.
Feb 13 09:25:38 overcloud-novacompute-1 neutron-openvswitch-agent[44934]: Guru meditation now registers SIGUSR1 and SIGUSR2 by default for backward compatibility. SIGUSR1 will no longer be registered in a future release, s...erate reports.
Feb 13 09:25:39 overcloud-novacompute-1 neutron-openvswitch-agent[44934]: Option "verbose" from group "DEFAULT" is deprecated for removal. Its value may be silently ignored in the future.
Feb 13 09:25:39 overcloud-novacompute-1 neutron-openvswitch-agent[44934]: Option "rpc_backend" from group "DEFAULT" is deprecated for removal. Its value may be silently ignored in the future.
Feb 13 09:25:41 overcloud-novacompute-1 neutron-openvswitch-agent[44934]: Option "notification_driver" from group "DEFAULT" is deprecated. Use option "driver" from group "oslo_messaging_notifications".
Feb 13 09:25:41 overcloud-novacompute-1 sudo[45004]: neutron : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
Feb 13 09:25:41 overcloud-novacompute-1 ovs-vsctl[45011]: ovs|00001|vsctl|INFO|Called as /bin/ovs-vsctl --timeout=10 --oneline --format=json -- --id=@manager create Manager "target=\"ptcp:6640:127.0.0.1\"" -- add Open_vS...options @manager
Feb 13 09:25:47 overcloud-novacompute-1 sudo[45195]: neutron : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ovsdb-client monitor Interface name,ofport,external_ids --format=json
Feb 14 16:15:07 overcloud-novacompute-1 systemd[1]: Stopping OpenStack Neutron Open vSwitch Agent...
Feb 14 16:15:08 overcloud-novacompute-1 systemd[1]: Stopped OpenStack Neutron Open vSwitch Agent.
Hint: Some lines were ellipsized, use -l to show in full.

Tags: upgrade
Revision history for this message
Brent Eagles (beagles) wrote :

Any chance on getting hold of more of the logs from the compute node?

Revision history for this message
Brent Eagles (beagles) wrote :

I was able to access the system (thanks Marius!) and here are some details:

- the bridge through which all traffic flows (br-infra) is indeed configured as a bridge mapping with the neutron ovs agent. This means the agent will set fail_mode to secure on it, so it is essential that the neutron ovs agent is running; otherwise traffic will stop flowing as soon as ovs notices the agent (its OpenFlow controller) is gone
- the agent itself looks to have been brought down relatively cleanly but was not brought back up
- the yum update itself seems to have hung from about the time the neutron ovs agent was updated and stopped; a quick grep for the tripleo_upgrade_node.sh process on that node indicates that it too is hung
- and now here is the pretty part: the hang seems to be on a try-restart of openstack-nova-compute, which is stuck because it cannot reach the rabbit server. The rabbit server listens on the internal API network, which is not functional because the bridge that traffic flows over is non-functional, because the neutron ovs agent is not running.

I'm not sure what stops the neutron ovs agent without starting it again, but it appears to be linked to a check on ovs that ends up stopping both ovs and the neutron agent; while it restarts ovs, it does not do the same for the neutron agent.

Feb 14 16:14:56 localhost systemd: Started Session 38 of user heat-admin.
Feb 14 16:14:56 localhost systemd-logind: New session 38 of user heat-admin.
Feb 14 16:14:56 localhost systemd: Starting Session 38 of user heat-admin.
Feb 14 16:14:59 localhost os-collect-config: /var/lib/os-collect-config/local-data not found. Skipping
Feb 14 16:14:59 localhost os-collect-config: No local metadata found (['/var/lib/os-collect-config/local-data'])
Feb 14 16:15:07 localhost systemd: Reloading.
Feb 14 16:15:07 localhost systemd: [/usr/lib/systemd/system/microcode.service:10] Trailing garbage, ignoring.
Feb 14 16:15:07 localhost systemd: microcode.service lacks both ExecStart= and ExecStop= setting. Refusing.
Feb 14 16:15:07 localhost systemd: Stopping Open vSwitch Internal Unit...
Feb 14 16:15:07 localhost systemd: Stopping OpenStack Neutron Open vSwitch Agent...
Feb 14 16:15:07 localhost ovs-ctl: ovs-vswitchd is not running.
Feb 14 16:15:07 localhost ovs-ctl: ovsdb-server is not running.
Feb 14 16:15:07 localhost systemd: Stopped Open vSwitch Internal Unit.
Feb 14 16:15:08 localhost systemd: Stopped OpenStack Neutron Open vSwitch Agent.
Feb 14 16:15:08 localhost systemd: Stopping Open vSwitch...
Feb 14 16:15:08 localhost systemd: Stopped Open vSwitch.
Feb 14 16:15:08 localhost systemd: Cannot add dependency job for unit microcode.service, ignoring: Unit is not loaded properly: Invalid argument.
Feb 14 16:15:08 localhost systemd: Starting Open vSwitch Database Unit...
Feb 14 16:15:08 localhost ovs-ctl: Backing up database to /etc/openvswitch/conf.db.backup7.12.1-2211824403 [ OK ]
Feb 14 16:15:08 localhost ovs-ctl: Compacting database [ OK ]
Feb 14 16:15:08 localhost ovs-ctl: Converting database schema [ OK ]
Feb 14 16:15:08 localhost ovs-ctl: Starting ovsdb-server [ OK ]
Feb 14 16:15:08 localhost ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait -- init -- set Open_vSwitch . db-version=7.14.0
Feb 14 16...
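
For reference, the fail_mode mentioned above can be inspected directly on the node, and starting the agent again (it acts as the bridge's OpenFlow controller) re-installs the flows and should restore connectivity. A sketch using the bridge name from this report, not a confirmed workaround from this bug:

[root@overcloud-novacompute-1 ~]# ovs-vsctl get-fail-mode br-infra
secure
[root@overcloud-novacompute-1 ~]# systemctl start neutron-openvswitch-agent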


Revision history for this message
Marius Cornea (mcornea) wrote :

I think this[1] might be related. From what I can see, this function is run on the compute node; you can see that the /home/heat-admin/OVS_UPGRADE/ directory has been created.

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/tasks/pacemaker_common_functions.sh#L301-L322
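
For context, that special case roughly boils down to pre-downloading the openvswitch package into that directory and reinstalling it with the scriptlets that would restart (and thereby drop) openvswitch suppressed. A rough sketch of the idea, not the exact template code:

mkdir -p /home/heat-admin/OVS_UPGRADE
yumdownloader --destdir /home/heat-admin/OVS_UPGRADE --resolve openvswitch
rpm -U --replacepkgs --notriggerun --nopostun /home/heat-admin/OVS_UPGRADE/*.rpm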

Revision history for this message
Brent Eagles (beagles) wrote :

TBH, I think there are two bugs here:

1. we have a situation where service dependencies drop a service that is required and, even if that behavior is appropriate, we may need a mitigation for it
2. the yum update is hanging because (FWICT) nova doesn't like the try-restart without access to rabbitmq. If yum were not hung, the puppet-apply that we would eventually get to would resolve the issue of the neutron-openvswitch-agent not running. AFAICT, since we still have connectivity on the control plane and also to external networks, we would get "unstuck" if this didn't happen. Even so, we should look at (1.) just in case.

Additionally, I think we need to add some cautionary notes on network configuration to our docs. This is mainly happening because the neutron-openvswitch-agent in the overcloud and the internal_api network are linked by way of bridge mappings. If they were not, this particular scenario would not happen.
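
A quick way to check whether a deployment is in this situation is to compare the agent's bridge mappings with the bridge carrying the internal API vlan (a sketch; the mapping value shown is only an example):

[root@overcloud-novacompute-1 ~]# grep bridge_mappings /etc/neutron/plugins/ml2/openvswitch_agent.ini
bridge_mappings = datacentre:br-infra
[root@overcloud-novacompute-1 ~]# ovs-vsctl port-to-br vlan200
br-infra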

Revision history for this message
Marius Cornea (mcornea) wrote :

I'm attaching the logs captured from the compute node.

Brent Eagles (beagles)
Changed in tripleo:
status: Triaged → Confirmed
Brent Eagles (beagles)
Changed in tripleo:
assignee: nobody → Brent Eagles (beagles)
Changed in tripleo:
milestone: ocata-rc1 → ocata-rc2
Changed in tripleo:
status: Confirmed → Triaged
Revision history for this message
Brent Eagles (beagles) wrote :

While this might not affect that many "production deployments", since we expect that they are probably not running the internal API, external, control plane, etc. networks over a bridge that is configured in the neutron-ovs-agent, test environments (like the one featured in this BZ) very possibly would be affected. People evaluating POC deployments that are affected would also get a poor impression.

To work this out I think we need to:
1 - reproduce the nova try-restart hang in the absence of the API network and get the nova team looking at it. I *think* this is the broken link as to why this no longer works, so if we can resolve this ...
2 - show that a puppet apply will bring the ovs agent back and resolve the network connection issues if a service is stuck (as in 1.)
3 - investigate whether it's possible to mitigate the situation where the neutron ovs agent is stopped as a side effect and not restarted; nova might not be the only service this happens with (a rough sketch of one option follows below)
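
One possible (and only partial) mitigation for (3.) would be a systemd drop-in that propagates restarts of openvswitch to the agent. This is purely an illustrative sketch, not the fix that eventually closed this bug, and it would not cover the stop-then-start sequence seen in the log above, since PartOf= only propagates stop and restart jobs:

# hypothetical drop-in: restart the agent whenever openvswitch.service is restarted
mkdir -p /etc/systemd/system/neutron-openvswitch-agent.service.d
printf '[Unit]\nPartOf=openvswitch.service\n' \
  > /etc/systemd/system/neutron-openvswitch-agent.service.d/ovs.conf
systemctl daemon-reload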

Revision history for this message
Marius Cornea (mcornea) wrote :

There is a common production scenario that comes to mind which I think could be affected by this issue: systems with 2 nics grouped in a bond which is part of an OVS bridge. The OVS bridge contains the vlan interfaces created by tripleo for segregating the isolated networks, but it is also used for vlan provider networks, hence it needs to be part of a bridge mapping.
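
For illustration, on such a node the provider bridge would show both the bond and the tripleo-created vlan interfaces as ports (hypothetical names; the actual bridge, bond and vlan names depend on the deployment):

[root@overcloud-compute-0 ~]# ovs-vsctl list-ports br-ex
bond1
vlan201
vlan202
phy-br-ex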

Revision history for this message
Brent Eagles (beagles) wrote :

Just a side-note: I just noticed this bug and it looks like it could be related:

https://bugs.launchpad.net/tripleo/+bug/1665717

FWICT, if nova-compute had not hung the yum update process, we would eventually get to the puppet-apply, which should take care of restarting the neutron-ovs-agent.

Revision history for this message
Marios Andreou (marios-b) wrote :

Guys, I think it may be related to and fixed by https://review.openstack.org/#/c/436990/ ("Remove the openvswitch special case in tripleo_upgrade_node.sh").

Revision history for this message
Brent Eagles (beagles) wrote :

Thanks Marios. Marius, can you confirm?

Revision history for this message
Marius Cornea (mcornea) wrote :

Yes, I can confirm it fixed the issues I reported in this bug.

Revision history for this message
Brent Eagles (beagles) wrote :

Nice! Thanks everyone!

Changed in tripleo:
status: Triaged → Fix Released