firewall group stuck in PENDING_UPDATE

Bug #1862200 reported by Jason Hobbs
This bug affects 2 people
Affects: OpenStack Neutron Gateway Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

On a 6-node OpenStack deployment, the default firewall group is stuck in PENDING_UPDATE and I'm unable to modify it. I'm also unable to connect to any instances as a result.

firewall group:
http://paste.ubuntu.com/p/MsMvdhf4HT/

crashdump: http://people.canonical.com/~jhobbs/juju-crashdump-openstack-2020-02-06-15.43.03.tar

bundle: http://paste.ubuntu.com/p/b86x458zMz/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

sub'd to field critical as it's blocking solutions QA test runs and we have no workaround.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

I believe l2-population should be set to False in l3ha/vrrp scenarios, but would like for someone to confirm that as I've not been hands-on with the scenario for a minute.
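
For reference, if that turns out to be the case, the setting lives on the neutron-api charm and can be flipped without redeploying; a sketch, assuming the application is named neutron-api as in the bundle:

    juju config neutron-api l2-population=false

The charm should then re-render the neutron config and restart the affected services on the related units.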

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

irc notes:

22:12 < beisner> jhobbs: Upon going cross-eyed a bit, the best I've got is to reconsider your neutron-api setting of l2-population: True; with VRRP (l3ha) we test with
                 False on l2pop.
22:13 < thedac> beisner: Yeah, that is an interesting theory. Since that is how we test. I am jumping in now with more focus.
22:14 < beisner> aha, there is also this: https://wiki.ubuntu.com/OpenStack/OpenStackCharms/ReleaseNotes1504

from the release notes there:
22:20 < jhobbs> More than 1 neutron gateway node
22:20 < jhobbs> l2-population is disabled
22:20 < jhobbs> interesting, ok

Revision history for this message
David Ames (thedac) wrote :

Log spelunking I see a couple of possible problems in the neutron-api logs:

https://paste.ubuntu.com/p/8pzZnrPvbp/

A deadlock in the DB:
2020-02-05 22:40:50.834 246978 ERROR oslo_db.sqlalchemy.exc_filters oslo_db.exception.DBDeadlock: (pymysql.err.InternalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') [SQL: 'DELETE FROM provisioningblocks WHERE provisioningblocks.standard_attr_id = %(standard_attr_id)s AND provisioningblocks.entity = %(entity)s'] [parameters: {'standard_attr_id': 3172, 'entity': 'DHCP'}] (Background on this error at: http://sqlalche.me/e/2j85)
2020-02-05 22:40:50.834 246978 ERROR oslo_db.sqlalchemy.exc_filters

And then what is probably just the symptom, the firewall group stuck in a pending state:
2020-02-05 22:46:46.212 246972 ERROR neutron_lib.callbacks.manager neutron_lib.exceptions.firewall_v2.FirewallGroupInPendingState: Operation cannot be performed since associated firewall group d47b5e21-47f0-4ccb-a880-a1967ee39517 is in PENDING_UPDATE.
2020-02-05 22:46:46.212 246972 ERROR neutron_lib.callbacks.manager

Note for future log spelunkers: the model seems to have settled near 20:56.
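
If it helps anyone following along, the state can be re-checked directly against the API; a sketch, assuming the FWaaS v2 OSC plugin is installed and using the group UUID from the error above:

    openstack firewall group show d47b5e21-47f0-4ccb-a880-a1967ee39517

and the neutron-api units can be grepped for the same exception:

    grep FirewallGroupInPendingState /var/log/neutron/neutron-server.log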

Revision history for this message
David Ames (thedac) wrote :

A deploy of a (simpler) bionic-stein cloud with firewall-driver=openvswitch does not present the problem.

The DB deadlock appears to be the primary culprit so far. I am asking the wider team for input.
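
For anyone reproducing the comparison, and assuming the option in question is the neutron-openvswitch charm's firewall-driver (which selects the security-group driver used by the OVS agent), the same change can be approximated on an existing model with something like:

    juju config neutron-openvswitch firewall-driver=openvswitch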

Revision history for this message
Liam Young (gnuoy) wrote :

The deadlock message looks like the one that has been intermittently affecting our test runs: https://bugs.launchpad.net/charm-neutron-api/+bug/1849125

Revision history for this message
David Ames (thedac) wrote :

Jason,

Did we get a result from the latest test run? If you have any other juju crash dumps I would like to compare them.

If we see the pending firewall group without the DB deadlock, we can drop that line of investigation. If, however, it is present in each case, we need to focus more there.
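
If another dump does show up, both signatures are quick to check for; a sketch, assuming the per-unit logs inside the dump are unpacked to plain files:

    tar xf juju-crashdump-openstack-2020-02-06-15.43.03.tar
    grep -r "DBDeadlock" .
    grep -r "FirewallGroupInPendingState" .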

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1862200] Re: firewall group stuck in PENDING_UPDATE

Hi David,

Yes, we got a test run from that. It's still up. The firewall group is not stuck in pending this time, but we're still not getting access to floating IPs, so the firewall group state may be a red herring.

The environment is still up. I can ping the external network's router, but not the floating IPs. I don't even get an ARP response for them. That's where I'm at now.


Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

The lack of ARP is because we're not getting the proper setup done in the qrouter netns; it isn't getting the floating IP bound, or the iptables rules for NAT set up.

20:10 < jhobbs> thedac: so, the arp comes by binding the floating ip to the external interface in the qrouter netns
20:10 < jhobbs> that gets the packet to that interface where iptables rules pick up and nat it to the internal ip
20:11 < jhobbs> by adding those iptables rules and binding the ip to the interface manually, i am able to hit the floating ip
20:11 < jhobbs> so whatever is responsible for setting that up automatically isn't happening
20:11 < jhobbs> https://www.softwareab.net/wordpress/openstack-fix-missing-external-ips-neutron/ <--- i got that from following the steps here
20:12 < jhobbs> ip a add 10.244.32.17/32 broadcast 10.244.32.17 scope global dev qg-186df4c6-c1; iptables -t nat -A neutron-l3-agent-PREROUTING -d 10.244.32.17/32 -j DNAT
                --to-destination 172.16.0.212; iptables -t nat -A neutron-l3-agent-OUTPUT -d 10.244.32.17/32 -j DNAT --to-destination 172.16.0.212; iptables -t nat -A
                neutron-l3-agent-float-snat -s 10.244.32.17/32 -j SNAT --to-source 172.16.0.212
20:12 < jhobbs> that's what i did to make it work
20:13 < jhobbs> the page says in his case the root cause was a failed notification via amqp to set those rules up
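
For readability, the manual workaround from the IRC log above, as a sketch (the floating IP 10.244.32.17, fixed IP 172.16.0.212, and qg- interface name are specific to this environment, and the commands have to run inside the qrouter namespace on the gateway unit, e.g. via ip netns exec qrouter-<router-uuid>):

    ip a add 10.244.32.17/32 broadcast 10.244.32.17 scope global dev qg-186df4c6-c1
    iptables -t nat -A neutron-l3-agent-PREROUTING -d 10.244.32.17/32 -j DNAT --to-destination 172.16.0.212
    iptables -t nat -A neutron-l3-agent-OUTPUT -d 10.244.32.17/32 -j DNAT --to-destination 172.16.0.212
    iptables -t nat -A neutron-l3-agent-float-snat -s 10.244.32.17/32 -j SNAT --to-source 172.16.0.212

This only papers over the missing l3-agent processing; it confirms the data path is fine once the NAT rules and the address binding exist.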

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

22:27 < thedac> jhobbs: Oh, wait, you might have found the problem.
22:28 < thedac> jhobbs: https://pastebin.canonical.com/p/kkFQBNr2Kq/ is related to https://bugs.launchpad.net/charm-neutron-gateway/+bug/1861457
22:28 < mup> Bug #1861457: pyroute2 0.5.2 doesn't support neutron-common 14.0.4 <sts> <OpenStack neutron-gateway charm:Invalid> <neutron (Ubuntu):Incomplete by james-page>
             <https://launchpad.net/bugs/1861457>
22:29 < thedac> jhobbs: I think that is it ^^^
22:29 < thedac> coreycb: around? Where are we with that bug ^^^
22:30 < thedac> coreycb: nevermind, looks like jamespage is on that one
22:31 < jhobbs> thedac: ok
22:33 < jhobbs> so maybe we just got unlucky on the timing, switching from 9 nodes to 6 nodes, and it would have failed with 9 anyhow
22:33 < jhobbs> looks like i can try to verify that's the case by upgrading that package and recreating my routers
22:33 < thedac> Yeah, that might be worth a test. And yes I think this is a timing problem with the package update
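
If the pyroute2 theory is right, verifying it on a gateway unit would look something like this (a sketch; package and service names are assumed to match the bionic/stein archive):

    dpkg -l python3-pyroute2    # compare the installed version against bug 1861457
    sudo apt install --only-upgrade python3-pyroute2
    sudo systemctl restart neutron-l3-agent

followed by recreating the affected routers, per the IRC notes above, so the l3-agent reprocesses them.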
