VM isn't connected to tenant network

Bug #1652829 reported by Leontii Istomin
Affects: fuel-ccp
Status: Triaged
Importance: High
Assigned to: Fuel CCP Bug Team

Bug Description

We spawned 72 VMs on top of 3 compute nodes using heat (24 VMs per node).
Some instances weren't reachable via the tenant network. For example:
| 93b95c73-f849-4ffb-9108-63cf262d3a9f | cassandra_vm_0 | ACTIVE | slice0-node162-net=11.62.0.8, 10.144.1.35 | ubuntu-software-config-last |

root@node1:~# ssh -i .ssh/slace ubuntu@10.144.1.35
Connection closed by 10.144.1.35 port 22

It's also unreachable from within the tenant network, for example from instance b1946719-b401-447d-8103-cc43b03b1481, which was spawned by the same heat stack on the same compute node (node162): http://paste.openstack.org/show/593486/

Environment description:
k8s deployed by Kargo on top of 200 hardware nodes.

Logs from OpenStack pods and the output of several commands were collected by this script: http://paste.openstack.org/show/593487/
and attached here: http://mos-scale-share.mirantis.com/dump-2016-12-27-19-05-39.tar.xz

Tags: scale
description: updated
Revision history for this message
Leontii Istomin (listomin) wrote :

Tried to reboot the VM. It seems the VM reached the metadata service (see screenshot), but it is still unreachable from the tenant network.

Revision history for this message
Leontii Istomin (listomin) wrote :

I have found the following in the neutron-server logs: http://paste.openstack.org/show/593483/
It is probably the root cause of the issue, but I'm not sure.
We had reduced the number of neutron servers from the default of 3 to 1 in this case.

Revision history for this message
Leontii Istomin (listomin) wrote :

http://paste.openstack.org/show/593483/ isn't the root cause. We saw this error when a VM was accessible and did not see it when a VM was inaccessible.

Revision history for this message
Elena Ezhova (eezhova) wrote :

Looks like the root cause was VMs failing to connect to a metadata agent on boot and thus not getting an SSH key:

2016-12-27 16:10:45,281 - util.py[WARNING]: Failed fetching metadata from url http://169.254.169.254/2009-04-04/meta-data/

Revision history for this message
Elena Ezhova (eezhova) wrote :

As only one router was used in the testing process, all metadata requests from instances were handled by a single metadata agent, which as a result could not process some of them due to the high load.

So the recommendation here would be to create several routers to balance the load among all metadata-agents.
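As a sketch, spreading tenant subnets across several routers could look like the following. This is a dry run that only prints the CLI calls; the router and subnet names, the count, and the external network name (`public`) are illustrative, and it assumes one subnet per tenant network:

```shell
# Dry run: prints the openstack CLI calls instead of executing them.
# Remove the leading "echo" on each line to actually apply the changes.
NUM_ROUTERS=4
for i in $(seq 1 "$NUM_ROUTERS"); do
    # Each router is served by its own metadata proxy, so requests from
    # the subnets behind it no longer pile up on a single agent.
    echo openstack router create "meta-router-$i"
    echo openstack router set "meta-router-$i" --external-gateway public
    echo openstack router add subnet "meta-router-$i" "tenant-subnet-$i"
done
```

The balancing is only as good as the spread of instances across subnets, so heat stacks should distribute their VMs over the subnets attached to different routers.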

Revision history for this message
Artem Yasakov (ayasakov) wrote :

Thanks for the recommendation.
After increasing the number of routers, this bug has not been reproduced.

Changed in fuel-ccp:
status: New → Invalid
Revision history for this message
Artem Yasakov (ayasakov) wrote :

It has been reproduced again.

This time we spawned 288 VMs on top of 12 compute nodes, with 48 networks and 12 routers.
After that, we spawned new VMs using heat, and they are unreachable from the tenant network:

root@node1:~/ayasakov$ ssh -i key.pem ubuntu@10.144.1.115
Connection closed by 10.144.1.115 port 22

Changed in fuel-ccp:
status: Invalid → New
Changed in fuel-ccp:
status: New → Triaged
importance: Undecided → High
Changed in fuel-ccp:
assignee: nobody → Elena Ezhova (eezhova)
Revision history for this message
Elena Ezhova (eezhova) wrote :

This issue can be explained by the way images used for VMs were configured:

1. Cloud-init was configured to use only Ec2 as its datasource, which means that at instance boot all metadata was fetched from the metadata agent, even when a config drive was attached.
   ubuntu@wordpress-vm-0:~$ cat /etc/cloud/cloud.cfg.d/91-dib-cloud-init-datasources.cfg
   datasource_list: [ Ec2, None ]
2. At the same time, all instances were running an os-collect-config daemon that constantly polls the Neutron metadata agent and heat-api-cfn.
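For point 1, a datasource list that lets cloud-init use the config drive could look like this (the file path is taken from the output above; the exact list is an assumption, with ConfigDrive placed first so the local drive is tried before the network metadata service):

```yaml
# /etc/cloud/cloud.cfg.d/91-dib-cloud-init-datasources.cfg
# Try the attached config drive before falling back to the EC2-style
# metadata service served by the Neutron metadata agent:
datasource_list: [ ConfigDrive, Ec2, None ]
```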

With several hundred simultaneous requests to handle, the metadata agent shows poor performance and can fail to serve requests within a reasonable time (often taking over 10 seconds). These performance issues cause some instances to fail to get their public keys during boot.

Note: Benchmarking of the Neutron metadata agent from a VM on a multinode DevStack and on CCP (both in default configuration) has shown a significant performance drop on CCP:
DevStack http://paste.openstack.org/show/594838/
CCP http://paste.openstack.org/show/594839/
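A crude way to reproduce such a measurement from inside a guest is to time requests against the standard EC2-style metadata endpoint. This is a dry run that only prints the commands (remove the leading `echo` and run inside a VM; the request count is arbitrary):

```shell
# Dry run: prints 10 curl invocations, each of which reports the total
# request time against the metadata endpoint when run inside a guest.
for i in $(seq 1 10); do
    echo curl -s -o /dev/null -w '%{time_total}\n' \
        http://169.254.169.254/2009-04-04/meta-data/instance-id
done
```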

With images built to include ConfigDrive in the list of datasources, this issue wasn't reproduced. Additionally, it is possible to refrain from polling the Neutron metadata agent entirely by using os-collect-config>=6.0.0b2.
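For the config drive to be available to cloud-init, it also has to be attached at boot. A sketch reusing names from this report (the flavor and key name are illustrative; only the image and network names come from the report):

```shell
# Build the boot command as a string rather than executing it; the
# "--config-drive True" flag attaches a config drive that the
# ConfigDrive datasource can read without touching the metadata agent.
cmd="openstack server create --image ubuntu-software-config-last \
  --flavor m1.small --key-name slice-key \
  --network slice0-node162-net --config-drive True cassandra_vm_0"
echo "$cmd"
```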

I suggest closing this bug and opening a new one about metadata-agent performance.

Revision history for this message
Sergey Kraynev (skraynev) wrote :

Per the last comment, let's move this bug to Won't Fix.

Changed in fuel-ccp:
status: Triaged → Won't Fix
Revision history for this message
Sergey Reshetnyak (sreshetniak) wrote :

Reproduced on my environment: not all instances have access to the metadata service.

Changed in fuel-ccp:
status: Won't Fix → Triaged
assignee: Elena Ezhova (eezhova) → Fuel CCP Bug Team (fuel-ccp-bugs)