Sometimes cannot SSH to instances created via a Heat stack

Bug #1545043 reported by Evgeny Sikachev
This bug affects 2 people
Affects: Mirantis OpenStack (status tracked in 10.0.x)

    Series    Status     Importance    Assigned to        Milestone
    10.0.x    Invalid    High          Evgeny Sikachev
    8.0.x     Invalid    High          Evgeny Sikachev
    9.x       Invalid    High          Evgeny Sikachev

Bug Description

ENVIRONMENT: MOS 8.0 ISO 529(RC1)

Sometimes we cannot connect to the instances via SSH. When we create 500 VMs via a Heat stack, we sometimes cannot connect to 3-5 of them. See the attachment.

Image from: http://sahara-files.mirantis.com/mos80/sahara-liberty-vanilla-2.7.1-ubuntu-14.04.qcow2

The problem also exists if batching is enabled in Heat.

Tags: area-nova
Revision history for this message
Evgeny Sikachev (esikachev) wrote :
Changed in mos:
assignee: MOS QA Team (mos-qa) → MOS Nova (mos-nova)
Changed in mos:
milestone: 8.0-mu-1 → 8.0-updates
Changed in mos:
importance: Undecided → High
description: updated
description: updated
tags: added: area-nova
Revision history for this message
Evgeny Sikachev (esikachev) wrote :
Revision history for this message
Evgeny Sikachev (esikachev) wrote :

Reproduces on SWARM jobs.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Hi Nova team, do you have any ETA for the fix?
It is a blocker for the QA team because even simple functional tests fail.

tags: added: blocker-for-qa
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I gave it a try on a new 9.0 environment and didn't manage to reproduce the issue.

I booted 30 VMs (as many as I can fit on two virtualized compute nodes):

nova boot --flavor m1.micro --image TestVM --key-name test --nic net-id=0c46b343-2063-4828-bf01-d4a2909c7ca5 --min-count 30 demo-meta

and checked console output of each instance:

for inst in $(nova list | grep demo | awk '{print $4}'); do nova console-log $inst; done

Each and every instance successfully retrieved the metadata. There were also no errors in the nova-api logs (the metadata API is one of the nova-api services).

Versions of nova and neutron packages:

root@node-1:~# dpkg -l | grep nova
ii nova-api 2:13.0.0~rc2-3~u14.04+mos4 all OpenStack Compute - compute API frontend
ii nova-cert 2:13.0.0~rc2-3~u14.04+mos4 all OpenStack Compute - certificate manager
ii nova-common 2:13.0.0~rc2-3~u14.04+mos4 all OpenStack Compute - common files
ii nova-conductor 2:13.0.0~rc2-3~u14.04+mos4 all OpenStack Compute - conductor service
ii nova-consoleauth 2:13.0.0~rc2-3~u14.04+mos4 all OpenStack Compute - Console Authenticator
ii nova-consoleproxy 2:13.0.0~rc2-3~u14.04+mos4 all OpenStack Compute - NoVNC proxy
ii nova-scheduler 2:13.0.0~rc2-3~u14.04+mos4 all OpenStack Compute - virtual machine scheduler
ii python-nova 2:13.0.0~rc2-3~u14.04+mos4 all OpenStack Compute - libraries
ii python-novaclient 2:3.3.0-3~u14.04+mos16 all client library for OpenStack Compute API - Python 2.7

root@node-1:~# dpkg -l | grep neutron
ii neutron-common 2:8.0.0~rc2-1~u14.04+mos4 all OpenStack virtual network service - common files
ii neutron-dhcp-agent 2:8.0.0~rc2-1~u14.04+mos4 all OpenStack virtual network service - DHCP agent
ii neutron-l3-agent 2:8.0.0~rc2-1~u14.04+mos4 all OpenStack virtual network service - l3 agent
ii neutron-metadata-agent 2:8.0.0~rc2-1~u14.04+mos4 all OpenStack virtual network service - metadata agent
ii neutron-openvswitch-agent 2:8.0.0~rc2-1~u14.04+mos4 all OpenStack virtual network service - Open vSwitch agent
ii neutron-server 2:8.0.0~rc2-1~u14.04+mos4 all OpenStack virtual network service - server
ii python-neutron 2:8.0.0~rc2-1~u14.04+mos4 all OpenStack virtual network service - Python library
ii python-neutron-lib 0.0.2-3+mos6 all Neutron shared routines and utilities - Python 2.7
ii python-neutronclient 1:4.1.1-1~u14.04+mos3 all client API library for Neutron - Python 2.7

The image is different, but I don't think th...


Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I asked Timur and Evgeny for assistance - the QA team is going to try to reproduce this again and ping us on IRC, so that we can jump in and debug a live environment with a repro.

Revision history for this message
Evgeny Sikachev (esikachev) wrote :

Pay attention: do all VMs have an ssh_key in the instance console logs, and is this key applied correctly? This is the key point.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I took a look at the new repro on CI - the only related problem in the logs is:

2016-04-13 23:51:47.501 19558 INFO nova.metadata.wsgi.server [req-6dcbf363-7715-48f8-8c96-0d287a959a99 - - - - -] 10.109.34.4,10.109.31.3 "GET /2009-04-04/meta-data/public-keys HTTP/1.1" status: 404 len: 195 time: 0.3889701
2016-04-13 23:51:47.676 19557 INFO nova.metadata.wsgi.server [-] 10.109.34.4,10.109.31.3 "GET /2009-04-04/user-data HTTP/1.1" status: 404 len: 195 time: 0.0012429

So far it looks like a rare race condition in Nova, which sometimes leads to 404 errors when cloud-init tries to fetch SSH public keys.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Please disregard my previous comment - I confused the instance UUID and grep'ed for the wrong data. For both Sahara VMs, GET /2009-04-04/meta-data/public-keys succeeded with 200.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

When I try to run this OSTF test locally, it fails with a timeout error, just like the latest build on CI:

 https://product-ci.infra.mirantis.net/view/9.0_swarm/job/9.0.system_test.ubuntu.services_ha_one_controller/78/console

At the same time, in the console log I can see that provisioning started successfully (which effectively means the SSH key was injected).

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I'm trying to reproduce the issue by booting 30 VMs at once with an SSH keypair and testing that I can log in via SSH. Something like:

while true; do
    heat stack-create -f cluster.yaml --poll 10 demo
    for inst in $(nova list | grep demo | python -c "import re; import sys; print('\n'.join(re.search('(10.109.3.\d{1,3})', line).groups()[0] for line in sys.stdin.readlines()))"); do
        ssh -l cirros $inst whoami || (echo "FAILED" && break)
    done
    heat stack-delete -y demo && sleep 120
done

and two HOT templates:

root@node-2:~# cat cluster.yaml
heat_template_version: 2015-04-30

parameters:
    size:
        type: number
        label: cluster size
        default: 30

resources:
    cluster:
        type: OS::Heat::ResourceGroup
        properties:
            count: { get_param: size }
            resource_def: {type: server.yaml}
root@node-2:~# cat server.yaml
heat_template_version: 2015-04-30

resources:
    port:
        type: OS::Neutron::Port
        properties:
            network: 0c46b343-2063-4828-bf01-d4a2909c7ca5

    ip:
        type: OS::Neutron::FloatingIP
        properties:
            floating_network: d2d21b2f-f213-49a1-9b9b-5cb798dd2c1d
            port_id: {get_resource: port}

    demo:
        type: OS::Nova::Server
        properties:
            image: 4ec670ee-18ed-49c4-9381-e7bd3c14a477
            flavor: m1.micro
            key_name: demo
            networks:
                - port: {get_resource: port}

So far, no luck: SSH keys are always injected properly.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Ok, the script above actually can reproduce the problem:

http://xsnippet.org/361631/raw/

^ both the nova-metadata and neutron-metadata-proxy logs confirm that the GET to /2009-04-04/meta-data/public-keys/0/openssh-key was successful; still, somehow, CirrOS fails to fetch the URL:

failed to get http://169.254.169.254/2009-04-04/meta-data/public-keys/0/openssh-key

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

So this looks very much like a problem on the client side:

- as you can see from the attached logs both Nova and Neutron return 200 on requests to 2009-04-04/meta-data/public-keys*

- at the same time GET /2009-04-04/meta-data/public-keys/0/openssh-key fails within a VM

When looking at logs I noticed that sometimes it takes (much) longer to return a response for this request:

/var/log/neutron/neutron-ns-metadata-proxy-1f151b99-6349-4d14-8f1c-a12525af70bf.log:2016-04-18 09:55:23.248 29976 INFO neutron.wsgi [-] 192.168.111.44 - - [18/Apr/2016 09:55:23] "GET /2009-04-04/meta-data/public-keys/0/openssh-key HTTP/1.1" 200 530 10.291870

I changed the Nova code on my environment to wait for 10s before returning a response for GET /2009-04-04/meta-data/public-keys/0/openssh-key, and now the issue is reproduced 100% of the time: the instance console log shows the very same error, as if the VM failed to fetch the URL, while both Nova and Neutron served it successfully.

So far I believe this is caused by the read timeout value in cloud-init / cirros-init / etc.
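
For what it's worth, the kind of delay described above could also be emulated with a small WSGI middleware wrapped around the metadata application. This is only a rough sketch of the idea, not the actual change made on the test environment; the class name and wiring are hypothetical:

# Hypothetical sketch: delay responses for the openssh-key metadata path
# to emulate a slow (e.g. swapping) controller. Not the actual patch.
import time

class SlowOpensshKeyMiddleware(object):
    def __init__(self, app, delay=10):
        self.app = app        # the wrapped metadata WSGI application
        self.delay = delay    # seconds to sleep before answering

    def __call__(self, environ, start_response):
        if environ.get('PATH_INFO', '').endswith(
                '/meta-data/public-keys/0/openssh-key'):
            time.sleep(self.delay)   # pretend the backend is busy
        return self.app(environ, start_response)

With every request to the openssh-key path taking at least 10 seconds, the in-guest client behaves exactly as in the failed runs, even though the server still returns 200.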

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

The handling of the request in Nova is pretty straightforward:

- try to fetch the info from cache, if the corresponding key exists

- fetch the info from the database

- cache the result in memcache

- render data according to the cloud meta-data format

On the environments I've seen, we often run into swap. Given that the request goes through multiple different services (neutron-metadata-agent, neutron-ns-metadata-proxy, nova-metadata-api), such temporary lags in response time are sort of expected.
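
A minimal runnable sketch of that flow (all names, and the in-process dict standing in for memcache, are hypothetical illustrations, not the actual Nova code):

# Hedged sketch of the lookup flow described above.
import json
import time

CACHE = {}   # stands in for memcache in this sketch

def fetch_from_db(instance_uuid):
    # placeholder for the database lookup; on a loaded or swapping
    # controller this is the step that can take many seconds
    time.sleep(0.1)
    return {'public-keys': {'0': {'openssh-key': 'ssh-rsa AAAA... demo'}}}

def render_ec2_metadata(metadata):
    # placeholder for rendering the EC2-style meta-data tree
    return json.dumps(metadata)

def get_metadata(instance_uuid):
    key = 'metadata-%s' % instance_uuid
    metadata = CACHE.get(key)                    # 1. try the cache
    if metadata is None:
        metadata = fetch_from_db(instance_uuid)  # 2. fall back to the DB
        CACHE[key] = metadata                    # 3. cache the result
    return render_ec2_metadata(metadata)         # 4. render the response

If the slow step is the database lookup, only cache misses would be affected, which would be consistent with the intermittent nature of the failures.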

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Looks like the Ubuntu Trusty cloud image has a read timeout of 5s for EC2 metadata and 10s for OpenStack metadata:

http://xsnippet.org/361632/raw/

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Checked the original image mentioned by Evgeny in the description:

2016-04-18 15:59:18.138 22739 INFO nova.metadata.wsgi.server [-] 192.168.111.40,192.168.0.2 "GET /2009-04-04/meta-data/public-keys/ HTTP/1.1" status: 200 len: 141 time: 10.0009279
2016-04-18 15:59:24.463 22738 INFO nova.metadata.wsgi.server [req-8c0295a3-a3e3-496f-a1f3-5525aec464f1 - - - - -] 192.168.111.40,192.168.0.2 "GET /2009-04-04/meta-data/public-keys/ HTTP/1.1" status: 200 len: 141 time: 10.2354801
2016-04-18 15:59:30.380 22739 INFO nova.metadata.wsgi.server [req-0810a1bb-0c37-40d7-9293-9f29bb3b0d1c - - - - -] 192.168.111.40,192.168.0.2 "GET /2009-04-04/meta-data/public-keys/ HTTP/1.1" status: 200 len: 141 time: 10.1707349
2016-04-18 15:59:36.231 22739 INFO nova.metadata.wsgi.server [-] 192.168.111.40,192.168.0.2 "GET /2009-04-04/meta-data/public-keys/ HTTP/1.1" status: 200 len: 141 time: 10.0007911
2016-04-18 15:59:42.484 22738 INFO nova.metadata.wsgi.server [req-61d09265-3588-40ae-bba5-422f82fbdbdb - - - - -] 192.168.111.40,192.168.0.2 "GET /2009-04-04/meta-data/public-keys/ HTTP/1.1" status: 200 len: 141 time: 10.2356319
2016-04-18 15:59:48.231 22738 INFO nova.metadata.wsgi.server [-] 192.168.111.40,192.168.0.2 "GET /2009-04-04/meta-data/public-keys/ HTTP/1.1" status: 200 len: 141 time: 10.0024791

^ it gives up after 5 retries. Unlike the Ubuntu Trusty cloud image, this one does not try to get the OpenStack metadata, only the EC2 one.

Experiments show that the read timeout value in the image is set to 4-5 seconds; if a request takes any longer, SSH keys won't be put into ~/.ssh/authorized_keys.
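
As a rough illustration of the failure mode (not the actual CirrOS or cloud-init implementation): a client that retries 5 times with a ~5 s read timeout never sees a response that takes ~10 s to produce, even though the server eventually logs a 200:

# Rough sketch of the inferred in-guest behaviour: ~5 s read timeout,
# 5 attempts, then give up. Not the actual CirrOS/cloud-init code.
import socket
import urllib.error
import urllib.request

URL = ('http://169.254.169.254'
       '/2009-04-04/meta-data/public-keys/0/openssh-key')

def fetch_openssh_key(retries=5, timeout=5):
    for _ in range(retries):
        try:
            with urllib.request.urlopen(URL, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, socket.timeout):
            continue   # response took longer than the read timeout
    # after all retries fail, the key never reaches
    # ~/.ssh/authorized_keys and key-based SSH login fails
    return None

Each attempt is cut off after the read timeout, so a server that consistently needs ~10 seconds (as in the log above) produces exactly the "failed to get ..." message seen in the console, while the server side records nothing but 200s.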

Revision history for this message
Evgeny Sikachev (esikachev) wrote :

I have upgraded my controller to 14 GB of RAM. This is not a swap problem.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Evgeny and I gave this issue another try on his environment:

- we didn't manage to reproduce the problem with unavailable metadata - every single time the public SSH key was injected properly

- the only way we could reproduce the problem was to make the controller node "busy" (i.e. what is described in the comments above)

We found out that it was actually another issue that prevented Sahara clusters from booting correctly - https://bugs.launchpad.net/mos/+bug/1572078

Evgeny is going to give this issue another try when https://bugs.launchpad.net/mos/+bug/1572078 is fixed. Moving to Incomplete for now.

tags: removed: blocker-for-qa scale
Revision history for this message
Evgeny Sikachev (esikachev) wrote :

We need to check on a scale lab. Don't forget to look at swap usage on the controller.

tags: added: scale
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Hi Performance QA team, we need your help with testing this issue on a scale lab. It looks like it is possible to reproduce the issue with Nova metadata under high load, which is why this bug is in the Incomplete state now. Could you please check it on MOS 9.0 on a 20+ node lab and either verify that the issue can't be reproduced (then move it to Invalid), or confirm that the issue can easily be reproduced under load, move it to the Confirmed state, and add some additional comments here.

Thank you!

Revision history for this message
Evgeny Sikachev (esikachev) wrote :

Reproduced on a local lab. The problem is related to swap usage on the controller. Sorry for the incorrect comment above.

tags: removed: scale
Revision history for this message
Timofey Durakov (tdurakov) wrote :

@esikachev, could you confirm that this issue depends on the controller load level and is reproduced only on a loaded environment (memory swapping)?

Revision history for this message
Evgeny Sikachev (esikachev) wrote :

This problem was reproduced on a controller without swap usage. RAM usage was 12/20 GB.

summary: - Metadata sometimes does not reach the VM
+ Sometimes cannot SSH to instances created via a Heat stack
description: updated
Revision history for this message
Evgeny Sikachev (esikachev) wrote :

I guess the problem is with our local lab, because on another lab everything works fine.
