VM isn't connected to tenant network

Bug #1652829 reported by Leontii Istomin
Affects: fuel-ccp
Status: Triaged
Importance: High
Assigned to: Fuel CCP Bug Team

Bug Description

We spawned 72 VMs on top of 3 compute nodes using heat (24 VMs per node).
Some instances weren't reachable via the tenant network. For example:
| 93b95c73-f849-4ffb-9108-63cf262d3a9f | cassandra_vm_0 | ACTIVE | slice0-node162-net=11.62.0.8, 10.144.1.35 | ubuntu-software-config-last |

root@node1:~# ssh -i .ssh/slace ubuntu@10.144.1.35
Connection closed by 10.144.1.35 port 22

It's also unreachable from within the tenant network, for example from instance b1946719-b401-447d-8103-cc43b03b1481, which was spawned by the same heat stack on the same compute node (node162): http://paste.openstack.org/show/593486/

Environment description:
k8s deployed by Kargo on top of 200 hardware nodes.

Logs from OpenStack pods and the output of several commands were collected by this script: http://paste.openstack.org/show/593487/
and attached here: http://mos-scale-share.mirantis.com/dump-2016-12-27-19-05-39.tar.xz

Tags: scale
description: updated
Revision history for this message
Leontii Istomin (listomin) wrote :

Tried to reboot the VM. It seems the VM reached the metadata service (see screenshot), but it is still unreachable from the tenant network.

Revision history for this message
Leontii Istomin (listomin) wrote :

I have found the following in the neutron-server logs: http://paste.openstack.org/show/593483/
It is probably the root cause of the issue, but I'm not sure.
We had reduced the number of neutron servers from the default of 3 to 1 in this case.

Revision history for this message
Leontii Istomin (listomin) wrote :

http://paste.openstack.org/show/593483/ isn't the root cause. We saw this error when a VM was accessible and did not see it when a VM was inaccessible.

Revision history for this message
Elena Ezhova (eezhova) wrote :

Looks like the root cause was VMs failing to connect to a metadata agent on boot and thus not getting an SSH key:

2016-12-27 16:10:45,281 - util.py[WARNING]: Failed fetching metadata from url http://169.254.169.254/2009-04-04/meta-data/

Revision history for this message
Elena Ezhova (eezhova) wrote :

As only one router was used in the testing process, all metadata requests from instances were handled by a single metadata agent, which as a result could not process some of them due to the high load.

So the recommendation here would be to create several routers to balance the load among all metadata-agents.
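As a sketch, spreading tenant subnets across several routers could look like the following. This is a dry run that only prints the CLI calls; the router and subnet names, the count, and the external network name (`public`) are illustrative, and it assumes one subnet per tenant network:

```shell
# Dry run: prints the openstack CLI calls instead of executing them.
# Remove the leading "echo" on each line to actually apply the changes.
NUM_ROUTERS=4
for i in $(seq 1 "$NUM_ROUTERS"); do
    # Each router is served by its own metadata proxy, so requests from
    # the subnets behind it no longer pile up on a single agent.
    echo openstack router create "meta-router-$i"
    echo openstack router set "meta-router-$i" --external-gateway public
    echo openstack router add subnet "meta-router-$i" "tenant-subnet-$i"
done
```

The balancing is only as good as the spread of instances across subnets, so heat stacks should distribute their VMs over the subnets attached to different routers.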

Revision history for this message
Artem Yasakov (ayasakov) wrote :

Thanks for the recommendation.
After increasing the number of routers, this bug has not been reproduced.

Changed in fuel-ccp:
status: New → Invalid
Revision history for this message
Artem Yasakov (ayasakov) wrote :

It has been reproduced again.

This time we spawned 288 VMs on top of 12 compute nodes, with 48 networks and 12 routers.
After that, we spawned new VMs using heat, and they are unreachable from the tenant network:

root@node1:~/ayasakov$ ssh -i key.pem ubuntu@10.144.1.115
Connection closed by 10.144.1.115 port 22

Changed in fuel-ccp:
status: Invalid → New
Changed in fuel-ccp:
status: New → Triaged
importance: Undecided → High
Changed in fuel-ccp:
assignee: nobody → Elena Ezhova (eezhova)
Revision history for this message
Elena Ezhova (eezhova) wrote :

This issue can be explained by the way images used for VMs were configured:

1. Cloud-init was configured to use only Ec2 as its datasource, which means that at instance boot all metadata was fetched from the metadata agent, even when a config drive was attached.
   ubuntu@wordpress-vm-0:~$ cat /etc/cloud/cloud.cfg.d/91-dib-cloud-init-datasources.cfg
   datasource_list: [ Ec2, None ]
2. At the same time, all instances were running an os-collect-config daemon that constantly polls the Neutron metadata agent and heat-api-cfn.
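For point 1, a datasource list that lets cloud-init use the config drive could look like this (the file path is taken from the output above; the exact list is an assumption, with ConfigDrive placed first so the local drive is tried before the network metadata service):

```yaml
# /etc/cloud/cloud.cfg.d/91-dib-cloud-init-datasources.cfg
# Try the attached config drive before falling back to the EC2-style
# metadata service served by the Neutron metadata agent:
datasource_list: [ ConfigDrive, Ec2, None ]
```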

With several hundred simultaneous requests to handle, the metadata agent shows poor performance and can fail to serve requests within a reasonable time (often taking over 10 seconds). These performance issues cause some instances to fail to get their public keys during boot.

Note: Benchmarking of the Neutron metadata agent from a VM on a multinode DevStack and on CCP (both in default configuration) has shown a significant performance drop on CCP:
DevStack http://paste.openstack.org/show/594838/
CCP http://paste.openstack.org/show/594839/
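A crude way to reproduce such a measurement from inside a guest is to time requests against the standard EC2-style metadata endpoint. This is a dry run that only prints the commands (remove the leading `echo` and run inside a VM; the request count is arbitrary):

```shell
# Dry run: prints 10 curl invocations, each of which reports the total
# request time against the metadata endpoint when run inside a guest.
for i in $(seq 1 10); do
    echo curl -s -o /dev/null -w '%{time_total}\n' \
        http://169.254.169.254/2009-04-04/meta-data/instance-id
done
```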

With images built to include ConfigDrive in the list of datasources, this issue wasn't reproduced. Additionally, it is possible to refrain from polling the Neutron metadata agent entirely by using os-collect-config>=6.0.0b2.
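For the config drive to be available to cloud-init, it also has to be attached at boot. A sketch reusing names from this report (the flavor and key name are illustrative; only the image and network names come from the report):

```shell
# Build the boot command as a string rather than executing it; the
# "--config-drive True" flag attaches a config drive that the
# ConfigDrive datasource can read without touching the metadata agent.
cmd="openstack server create --image ubuntu-software-config-last \
  --flavor m1.small --key-name slice-key \
  --network slice0-node162-net --config-drive True cassandra_vm_0"
echo "$cmd"
```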

I suggest closing this bug and opening a new one about metadata-agent performance.

Revision history for this message
Sergey Kraynev (skraynev) wrote :

Per the last comment, let's move this bug to Won't Fix.

Changed in fuel-ccp:
status: Triaged → Won't Fix
Revision history for this message
Sergey Reshetnyak (sreshetniak) wrote :

Reproduced on my environment: not all instances have access to the metadata service.

Changed in fuel-ccp:
status: Won't Fix → Triaged
assignee: Elena Ezhova (eezhova) → Fuel CCP Bug Team (fuel-ccp-bugs)