Neutron multinode grenade sometimes fails at resource phase create

Bug #1527675 reported by Sean M. Collins on 2015-12-18
This bug affects 2 people
Affects: neutron | Status: Fix Released | Importance: High | Assigned to: Sean M. Collins (scollins)

Bug Description

Mailing list thread:

http://lists.openstack.org/pipermail/openstack-dev/2015-November/080503.html

http://logs.openstack.org/35/187235/11/experimental/gate-grenade-dsvm-neutron-multinode/a5af283/logs/grenade.sh.txt.gz

Pinging the VM works; however, when SSH'ing in to verify further, the connection is repeatedly closed.

http://logs.openstack.org/35/187235/11/experimental/gate-grenade-dsvm-neutron-multinode/a5af283/logs/grenade.sh.txt.gz#_2015-11-30_20_25_18_391

2015-11-30 20:20:18.283 | + ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i /opt/stack/save/cinder_key.pem cirros@172.24.5.53 'echo '\''I am a teapot'\'' > verify.txt'
2015-11-30 20:25:18.391 | Connection closed by 172.24.5.53

Changed in neutron:
assignee: nobody → Sean M. Collins (scollins)
Sean M. Collins (scollins) wrote :

So - connection being closed by the remote - I think that means that the iptables rules are set correctly and the packets are going into the guest, which is then sending an RST? Either that or the iptables rules are not getting set correctly and that's what is sending the RST. I forget which.

Sean M. Collins (scollins) wrote :
Changed in neutron:
importance: Undecided → High

In the provided logs, the last time the L2 agent triggered iptables-save was:

http://logs.openstack.org/35/187235/11/experimental/gate-grenade-dsvm-neutron-multinode/a5af283/logs/old/screen-q-agt.txt.gz#_2015-11-30_19_29_00_526

While the instance port and its security group were created considerably later (~25 seconds):

http://logs.openstack.org/35/187235/11/experimental/gate-grenade-dsvm-neutron-multinode/a5af283/logs/grenade.sh.txt.gz#_2015-11-30_19_29_25_182

Does this suggest that the L2 agent did not update the firewall to reflect the new security group and the instance port?

If we look into iptables output:

http://logs.openstack.org/35/187235/11/experimental/gate-grenade-dsvm-neutron-multinode/a5af283/logs/iptables.txt.gz

we can see that only one of the iptables tables contains a rule allowing port 22 connections, while it's clear from the grenade log that multiple security groups and ports were created to allow port 22 traffic. This also suggests that at least some security group updates were not applied on the L2 agent side.
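A quick way to confirm which chains actually carry a port 22 accept rule is to grep the dump (a minimal sketch; the chain name below is made up for illustration, and on a live node you would pipe `sudo iptables-save` instead of a saved string):

```shell
# Sketch: count rules matching TCP dport 22 in an iptables-save style dump.
# The chain name neutron-openvswi-i1234abcd is hypothetical.
iptables_dump='*filter
-A neutron-openvswi-i1234abcd -p tcp -m tcp --dport 22 -j RETURN
-A neutron-openvswi-i1234abcd -p icmp -j RETURN
COMMIT'
printf '%s\n' "$iptables_dump" | grep -c -- '--dport 22'
```

If the count here is lower than the number of port-22 security group rules created in the grenade run, the agent missed updates.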

It would be interesting to understand why we run the SSH check for cinder_server1 only, and not for nova_server1. Is this a discrepancy we should resolve across all long-standing resources?

In a successful run, I see the dropbear SSH daemon started in the console log of the cinder_server1 instance: "Starting dropbear sshd: generating rsa key... generating dsa key... OK": http://logs.openstack.org/69/143169/60/experimental/gate-grenade-dsvm-neutron-multinode/7c05ff0/logs/worlddump-2015-11-25-142414-cinder_resources_created.txt.gz

But the console output is missing entirely in the failing run: http://logs.openstack.org/35/187235/11/experimental/gate-grenade-dsvm-neutron-multinode/a5af283/logs/worlddump-2015-11-30-204140.txt.gz (there is a console log, but it belongs to nova_server1, not cinder_server1). I don't currently understand why the successful worlddump does not contain two console logs, one for nova_server1 and another for cinder_server1, though.

Changed in neutron:
status: New → Confirmed
Sean M. Collins (scollins) wrote :

So - the difference is that the Nova resource phase only boots an instance and pings it.

https://github.com/openstack-dev/grenade/blob/master/projects/60_nova/resources.sh#L96

While the Cinder resource phase boots an instance, pings it, then tries to SSH in.

https://github.com/openstack-dev/grenade/blob/master/projects/70_cinder/resources.sh#L140

Sean M. Collins (scollins) wrote :

The reason there is a worlddump for the Nova resource but not the Cinder one appears to be that the worlddump is called after the resource is successfully created. The Cinder resource fails due to the SSH issue, so I guess it never does a worlddump.

http://logs.openstack.org/69/143169/60/experimental/gate-grenade-dsvm-neutron-multinode/bf6bae1/logs/grenade.sh.txt.gz#_2015-11-23_18_52_59_628

Sean M. Collins (scollins) wrote :

Sean Dague noticed something:

2015-11-30 19:40:09.340 | + ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i /opt/stack/save/cinder_key.pem cirros@172.24.5.53 'echo '\''I am a teapot'\'' > verify.txt'
2015-11-30 19:45:09.468 | Connection closed by 172.24.5.53

There's a 5-minute gap between the SSH command and the connection being closed. This chews up a ton of wall-clock time, and eventually the Jenkins job is terminated.

http://logs.openstack.org/35/187235/11/experimental/gate-grenade-dsvm-neutron-multinode/a5af283/console.html.gz#_2015-11-30_20_41_40_515

Nice. Do we have a patch for grenade to fail more gracefully, ideally without wasting so many cycles trying to SSH?
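One way grenade could fail faster (a sketch, not grenade's actual helper; `retry_ssh` is a hypothetical name) is to bound each attempt with `-o ConnectTimeout` and retry a few times, so an unreachable guest costs seconds per attempt instead of one five-minute hang:

```shell
# Hypothetical helper: retry a command a bounded number of times,
# sleeping briefly between attempts.
retry_ssh() {
    local tries=$1; shift
    local i
    for i in $(seq 1 "$tries"); do
        "$@" && return 0
        sleep 1
    done
    return 1
}
# Illustrative use, with a 10-second cap on each connect attempt:
#   retry_ssh 10 ssh -o ConnectTimeout=10 -o BatchMode=yes \
#       -i /opt/stack/save/cinder_key.pem cirros@172.24.5.53 true
```

Ten attempts at 10 seconds each bounds the worst case at under two minutes rather than whatever the TCP stack's default gives us.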

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: master
Review: https://review.openstack.org/265759

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: stable/liberty
Review: https://review.openstack.org/265858
Reason: Instead will use devstack patch.

Sean M. Collins (scollins) wrote :

Just for the viewers at home - Ihar identified it as an MTU issue: the GRE tunnel overhead between the node and subnode is not accounted for, which is why we see dropbear connect and then everything hang.
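The arithmetic behind the hang (my summary of the standard encapsulation overheads, assuming an IPv4 underlay) is that tunnel headers eat into the 1500-byte physical MTU, so large packets silently black-hole unless the tenant MTU is reduced:

```shell
# Encapsulation overhead on an IPv4 underlay:
#   VXLAN: 20 (outer IPv4) + 8 (UDP) + 8 (VXLAN) + 14 (inner Ethernet) = 50
#   GRE:   20 (outer IPv4) + 4 (base GRE header; 8 with a key) + 14    = 38 (42 with key)
path_mtu=1500
vxlan_overhead=$((20 + 8 + 8 + 14))
echo "VXLAN tenant MTU: $((path_mtu - vxlan_overhead))"
```

The VXLAN result matches the 1450 later observed on the guest's interface; small packets (like the initial dropbear handshake frames) fit either way, which is why the connection opens and then stalls once a full-sized segment is sent.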

Reviewed: https://review.openstack.org/263486
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7e7d5dc8613d14aba50d74343deb841bc1faf5e7
Submitter: Jenkins
Branch: master

commit 7e7d5dc8613d14aba50d74343deb841bc1faf5e7
Author: Sean M. Collins <email address hidden>
Date: Mon Jan 4 16:14:39 2016 -0800

    Make Neutron attempt to advertise MTUs by default

    It seems like a useful option, and it could be a more sane default,
    rather than pushing the responsibility to enable this onto deployers and
    provisioning tools[1]. In fact, it may be worth doing some more work to
    take away some of the hassle of configuring all these things correctly,
    and see if it can be automatically determined.

    [1]: https://specs.openstack.org/openstack/fuel-specs/specs/7.0/jumbo-frames-between-instances.html

    Related-Bug: #1527675
    Change-Id: I5cbbc4660f8c4e15e59f8f5ce0419501bdd27348

As reported at https://bugs.launchpad.net/devstack/+bug/1532924/comments/2, merely getting the correct MTU calculated is not enough; we also have a problem getting the calculated value set on the network interface inside the VM.

Sean M. Collins (scollins) wrote :

Which is what advertise_mtu and path_mtu do.

As reported at https://bugs.launchpad.net/devstack/+bug/1532924/comments/2, `advertise_mtu=True` was set in $NEUTRON_CONF. I did not do anything about `path_mtu` in my `local.conf`, because I was following the steps at https://specs.openstack.org/openstack/fuel-specs/specs/7.0/jumbo-frames-between-instances.html#proposed-change for VXLAN and these did not mention `path_mtu`. Indeed, now that I look for it I see that /etc/neutron/plugins/ml2/ml2_conf.ini sets `path_mtu=1500`.

I am happy to try again. I think you are telling me I can delete the parts of local.conf that set `network_device_mtu=1450` in $NEUTRON_CONF and `physical_network_mtus = public:1450` in /$Q_PLUGIN_CONF_FILE provided that I set `Q_ML2_PLUGIN_PATH_MTU=1500` in the localrc section of local.conf (because https://review.openstack.org/#/c/267604/ has merged). Also, since https://review.openstack.org/#/c/263486/ has merged, I can remove the part of my local.conf that sets `network_device_mtu=1450` in $NEUTRON_CONF. Should I also remove the part of my local.conf that sets `network_device_mtu=1450` in $NOVA_CONF ?

Sorry, there was a typo; I meant to say I expect that https://review.openstack.org/#/c/263486/ means I can delete the part of local.conf that says to set `advertise_mtu=True` in $NEUTRON_CONF .

I tried devstack again, with the latest (commit ffb96b8 - Merge "always default to floating ips for validation"). I removed all the neutron and nova config file sections from my local.conf, instead adding `Q_ML2_PLUGIN_PATH_MTU=1500` to the localrc part. So far I have just done an all-in-one devstack like this. It produced a system where the network interface inside the VM has an MTU of 1450 (which is correct, since this is a VXLAN configuration). All the other MTUs are 1500; these include br-int, br-ex, br-tun, qbr$id, qvo$id, qvb$id, and tap$id on the host, and qr-$id and qg-$id in the Neutron router's network namespace.

I wonder whether those are entirely correct. The traffic through the Neutron router gets a VXLAN header added to it, right?

Experimentally I have seen no end-to-end problem yet. I have made an SSH connection from the host to the VM, used `scp` (in the VM) to copy a 400 KB file from the VM to an external server, and then used `scp` (in the VM) to copy the file back into the VM. This would not have succeeded in my earlier configurations.
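Another end-to-end check worth running (a sketch; the guest IP is the one from the logs above, and 28 bytes is the IPv4 + ICMP header overhead) is to probe the path with the don't-fragment bit set:

```shell
# Largest ICMP payload that fits a 1450-byte tenant MTU:
# MTU - 20 (IPv4 header) - 8 (ICMP header).
tenant_mtu=1450
payload=$((tenant_mtu - 28))
echo "probe payload: $payload bytes"
# On a live node (illustrative): ping -c 3 -M do -s "$payload" 172.24.5.53
```

If a ping at this payload succeeds but one byte more fails with "Frag needed", the advertised MTU matches the real path.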

I went on to http://docs.openstack.org/developer/devstack/guides/neutron.html#adding-additional-compute-nodes, adding only 'Q_ML2_PLUGIN_PATH_MTU=1500' to the exhibited local.conf contents. And it worked!
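For the record, the relevant local.conf fragment as I understand it from the comments above (a sketch of this working setup, not an official devstack recipe):

```ini
[[local|localrc]]
# Physical/path MTU for the ML2 plugin; Neutron subtracts the tunnel
# overhead from this when advertising the tenant network MTU (1450 for VXLAN).
Q_ML2_PLUGIN_PATH_MTU=1500
```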

Sean M. Collins (scollins) wrote :

I'm marking this as fix committed since a couple patches to DevStack, Neutron, and devstack-gate have gotten us past the resource creation phase for the multinode run. We still have test failures that we need to address, but I think that more specific bugs can be opened to track down the root causes.

Changed in neutron:
status: Confirmed → Fix Committed

Changed in neutron:
status: Fix Committed → Fix Released