Sometimes cannot SSH to instances created via a Heat stack

Bug #1545043 reported by Evgeny Sikachev
This bug affects 2 people
Affects: Mirantis OpenStack (status tracked in 10.0.x)

    Series    Status     Importance    Assigned to        Milestone
    10.0.x    Invalid    High          Evgeny Sikachev
    8.0.x     Invalid    High          Evgeny Sikachev
    9.x       Invalid    High          Evgeny Sikachev

Bug Description

ENVIRONMENT: MOS 8.0 ISO 529(RC1)

Sometimes we cannot connect to the instances via SSH. When we create 500 VMs via a Heat stack, we sometimes cannot connect to 3-5 of them. See the attachment.

Image from: http://sahara-files.mirantis.com/mos80/sahara-liberty-vanilla-2.7.1-ubuntu-14.04.qcow2

The problem also exists if batching is enabled in Heat.

Tags: area-nova
Revision history for this message
Evgeny Sikachev (esikachev) wrote :
Changed in mos:
assignee: MOS QA Team (mos-qa) → MOS Nova (mos-nova)
Changed in mos:
milestone: 8.0-mu-1 → 8.0-updates
Changed in mos:
importance: Undecided → High
description: updated
description: updated
tags: added: area-nova
Revision history for this message
Evgeny Sikachev (esikachev) wrote :
Revision history for this message
Evgeny Sikachev (esikachev) wrote :

Reproduces on SWARM jobs.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Hi Nova team, do you have any ETA for the fix?
It is a blocker for the QA team because even simple functional tests fail.

tags: added: blocker-for-qa
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I gave it a try on a new 9.0 environment and didn't manage to reproduce the issue.

I booted 30 VMs (as many as I can fit on two virtualized compute nodes):

nova boot --flavor m1.micro --image TestVM --key-name test --nic net-id=0c46b343-2063-4828-bf01-d4a2909c7ca5 --min-count 30 demo-meta

and checked console output of each instance:

for inst in $(nova list | grep demo | awk '{print $4}'); do nova console-log $inst; done

Each and every instance successfully retrieved the metadata. There were also no errors in the nova-api logs (the metadata API is one of the nova-api services).

Versions of nova and neutron packages:

root@node-1:~# dpkg -l | grep nova
ii nova-api 2:13.0.0~rc2-3~u14.04+mos4 all OpenStack Compute - compute API frontend
ii nova-cert 2:13.0.0~rc2-3~u14.04+mos4 all OpenStack Compute - certificate manager
ii nova-common 2:13.0.0~rc2-3~u14.04+mos4 all OpenStack Compute - common files
ii nova-conductor 2:13.0.0~rc2-3~u14.04+mos4 all OpenStack Compute - conductor service
ii nova-consoleauth 2:13.0.0~rc2-3~u14.04+mos4 all OpenStack Compute - Console Authenticator
ii nova-consoleproxy 2:13.0.0~rc2-3~u14.04+mos4 all OpenStack Compute - NoVNC proxy
ii nova-scheduler 2:13.0.0~rc2-3~u14.04+mos4 all OpenStack Compute - virtual machine scheduler
ii python-nova 2:13.0.0~rc2-3~u14.04+mos4 all OpenStack Compute - libraries
ii python-novaclient 2:3.3.0-3~u14.04+mos16 all client library for OpenStack Compute API - Python 2.7

root@node-1:~# dpkg -l | grep neutron
ii neutron-common 2:8.0.0~rc2-1~u14.04+mos4 all OpenStack virtual network service - common files
ii neutron-dhcp-agent 2:8.0.0~rc2-1~u14.04+mos4 all OpenStack virtual network service - DHCP agent
ii neutron-l3-agent 2:8.0.0~rc2-1~u14.04+mos4 all OpenStack virtual network service - l3 agent
ii neutron-metadata-agent 2:8.0.0~rc2-1~u14.04+mos4 all OpenStack virtual network service - metadata agent
ii neutron-openvswitch-agent 2:8.0.0~rc2-1~u14.04+mos4 all OpenStack virtual network service - Open vSwitch agent
ii neutron-server 2:8.0.0~rc2-1~u14.04+mos4 all OpenStack virtual network service - server
ii python-neutron 2:8.0.0~rc2-1~u14.04+mos4 all OpenStack virtual network service - Python library
ii python-neutron-lib 0.0.2-3+mos6 all Neutron shared routines and utilities - Python 2.7
ii python-neutronclient 1:4.1.1-1~u14.04+mos3 all client API library for Neutron - Python 2.7

The image is different, but I don't think th...


Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I asked Timur and Evgeny for assistance - the QA team is going to try to reproduce this again and ping us on IRC, so that we can jump in and debug a live environment with a repro.

Revision history for this message
Evgeny Sikachev (esikachev) wrote :

Pay attention: do all VMs have an ssh_key in the instance console logs, and is this key applied correctly? This is the key point.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I took a look at the new repro on CI - the only related problem in the logs is:

2016-04-13 23:51:47.501 19558 INFO nova.metadata.wsgi.server [req-6dcbf363-7715-48f8-8c96-0d287a959a99 - - - - -] 10.109.34.4,10.109.31.3 "GET /2009-04-04/meta-data/public-keys HTTP/1.1" status: 404 len: 195 time: 0.3889701
2016-04-13 23:51:47.676 19557 INFO nova.metadata.wsgi.server [-] 10.109.34.4,10.109.31.3 "GET /2009-04-04/user-data HTTP/1.1" status: 404 len: 195 time: 0.0012429

So far it looks like a rare race condition in Nova, which sometimes leads to 404 errors when cloud-init tries to fetch SSH public keys.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Please disregard my previous comment - I confused the instance UUID and grep'ed for the wrong data. For both Sahara VMs, GET /2009-04-04/meta-data/public-keys succeeded with 200.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

When I try to run this OSTF test locally, it fails with a timeout error, just like the latest build on CI:

 https://product-ci.infra.mirantis.net/view/9.0_swarm/job/9.0.system_test.ubuntu.services_ha_one_controller/78/console

At the same time, in the console log I can see that provisioning started successfully (which effectively means the SSH key was injected).

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I'm trying to reproduce the issue by booting 30 VMs at once with an SSH keypair and testing that I can log in via SSH. Something like:

while true; do
    heat stack-create -f cluster.yaml --poll 10 demo
    for inst in $(nova list | grep demo | python -c "import re; import sys; print('\n'.join(re.search('(10.109.3.\d{1,3})', line).groups()[0] for line in sys.stdin.readlines()))"); do
        ssh -l cirros $inst whoami || (echo "FAILED" && break)
    done
    heat stack-delete -y demo && sleep 120
done

and two HOT templates:

root@node-2:~# cat cluster.yaml
heat_template_version: 2015-04-30

parameters:
    size:
        type: number
        label: cluster size
        default: 30

resources:
    cluster:
        type: OS::Heat::ResourceGroup
        properties:
            count: { get_param: size }
            resource_def: {type: server.yaml}
root@node-2:~# cat server.yaml
heat_template_version: 2015-04-30

resources:
    port:
        type: OS::Neutron::Port
        properties:
            network: 0c46b343-2063-4828-bf01-d4a2909c7ca5

    ip:
        type: OS::Neutron::FloatingIP
        properties:
            floating_network: d2d21b2f-f213-49a1-9b9b-5cb798dd2c1d
            port_id: {get_resource: port}

    demo:
        type: OS::Nova::Server
        properties:
            image: 4ec670ee-18ed-49c4-9381-e7bd3c14a477
            flavor: m1.micro
            key_name: demo
            networks:
                - port: {get_resource: port}

So far, no luck: SSH keys are always injected properly.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Ok, the script above actually can reproduce the problem:

http://xsnippet.org/361631/raw/

^ both the nova-metadata and neutron-metadata-proxy logs confirm that the GET to /2009-04-04/meta-data/public-keys/0/openssh-key was successful; still, somehow, CirrOS fails to fetch the URL:

failed to get http://169.254.169.254/2009-04-04/meta-data/public-keys/0/openssh-key

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

So this looks very much like a problem on the client side:

- as you can see from the attached logs both Nova and Neutron return 200 on requests to 2009-04-04/meta-data/public-keys*

- at the same time GET /2009-04-04/meta-data/public-keys/0/openssh-key fails within a VM

When looking at logs I noticed that sometimes it takes (much) longer to return a response for this request:

/var/log/neutron/neutron-ns-metadata-proxy-1f151b99-6349-4d14-8f1c-a12525af70bf.log:2016-04-18 09:55:23.248 29976 INFO neutron.wsgi [-] 192.168.111.44 - - [18/Apr/2016 09:55:23] "GET /2009-04-04/meta-data/public-keys/0/openssh-key HTTP/1.1" 200 530 10.291870

I changed the Nova code on my environment to wait for 10s before returning a response for GET /2009-04-04/meta-data/public-keys/0/openssh-key, and now the issue is reproduced 100% of the time: the instance console log shows the very same error, as if the VM failed to fetch the URL, while both Nova and Neutron served it successfully.

So far I believe this is caused by the read timeout value in cloud-init / cirros-init / etc.
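
For what it's worth, the kind of delay described above could also be emulated with a small WSGI middleware wrapped around the metadata application. This is only a rough sketch of the idea, not the actual change made on the test environment; the class name and wiring are hypothetical:

# Hypothetical sketch: delay responses for the openssh-key metadata path
# to emulate a slow (e.g. swapping) controller. Not the actual patch.
import time

class SlowOpensshKeyMiddleware(object):
    def __init__(self, app, delay=10):
        self.app = app        # the wrapped metadata WSGI application
        self.delay = delay    # seconds to sleep before answering

    def __call__(self, environ, start_response):
        if environ.get('PATH_INFO', '').endswith(
                '/meta-data/public-keys/0/openssh-key'):
            time.sleep(self.delay)   # pretend the backend is busy
        return self.app(environ, start_response)

With every request to the openssh-key path taking at least 10 seconds, the in-guest client behaves exactly as in the failed runs, even though the server still returns 200.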

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

The handling of the request in Nova is pretty straightforward:

- try to fetch the info from cache, if the corresponding key exists

- fetch the info from the database

- cache the result in memcache

- render data according to the cloud meta-data format

On the environments I've seen, we often run into swap. Given that the request goes through multiple different services (neutron-metadata-agent, neutron-ns-metadata-proxy, nova-metadata-api), such temporary lags in response time are sort of expected.
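
A minimal runnable sketch of that flow (all names, and the in-process dict standing in for memcache, are hypothetical illustrations, not the actual Nova code):

# Hedged sketch of the lookup flow described above.
import json
import time

CACHE = {}   # stands in for memcache in this sketch

def fetch_from_db(instance_uuid):
    # placeholder for the database lookup; on a loaded or swapping
    # controller this is the step that can take many seconds
    time.sleep(0.1)
    return {'public-keys': {'0': {'openssh-key': 'ssh-rsa AAAA... demo'}}}

def render_ec2_metadata(metadata):
    # placeholder for rendering the EC2-style meta-data tree
    return json.dumps(metadata)

def get_metadata(instance_uuid):
    key = 'metadata-%s' % instance_uuid
    metadata = CACHE.get(key)                    # 1. try the cache
    if metadata is None:
        metadata = fetch_from_db(instance_uuid)  # 2. fall back to the DB
        CACHE[key] = metadata                    # 3. cache the result
    return render_ec2_metadata(metadata)         # 4. render the response

If the slow step is the database lookup, only cache misses would be affected, which would be consistent with the intermittent nature of the failures.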

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Looks like the Ubuntu Trusty cloud image has a read timeout of 5s for EC2 metadata and 10s for OpenStack metadata:

http://xsnippet.org/361632/raw/

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Checked the original image mentioned by Evgeny in the description:

2016-04-18 15:59:18.138 22739 INFO nova.metadata.wsgi.server [-] 192.168.111.40,192.168.0.2 "GET /2009-04-04/meta-data/public-keys/ HTTP/1.1" status: 200 len: 141 time: 10.0009279
2016-04-18 15:59:24.463 22738 INFO nova.metadata.wsgi.server [req-8c0295a3-a3e3-496f-a1f3-5525aec464f1 - - - - -] 192.168.111.40,192.168.0.2 "GET /2009-04-04/meta-data/public-keys/ HTTP/1.1" status: 200 len: 141 time: 10.2354801
2016-04-18 15:59:30.380 22739 INFO nova.metadata.wsgi.server [req-0810a1bb-0c37-40d7-9293-9f29bb3b0d1c - - - - -] 192.168.111.40,192.168.0.2 "GET /2009-04-04/meta-data/public-keys/ HTTP/1.1" status: 200 len: 141 time: 10.1707349
2016-04-18 15:59:36.231 22739 INFO nova.metadata.wsgi.server [-] 192.168.111.40,192.168.0.2 "GET /2009-04-04/meta-data/public-keys/ HTTP/1.1" status: 200 len: 141 time: 10.0007911
2016-04-18 15:59:42.484 22738 INFO nova.metadata.wsgi.server [req-61d09265-3588-40ae-bba5-422f82fbdbdb - - - - -] 192.168.111.40,192.168.0.2 "GET /2009-04-04/meta-data/public-keys/ HTTP/1.1" status: 200 len: 141 time: 10.2356319
2016-04-18 15:59:48.231 22738 INFO nova.metadata.wsgi.server [-] 192.168.111.40,192.168.0.2 "GET /2009-04-04/meta-data/public-keys/ HTTP/1.1" status: 200 len: 141 time: 10.0024791

^ it gives up after 5 retries. Unlike the Ubuntu Trusty cloud image, this one does not try to get the OpenStack metadata, only the EC2 one.

Experiments show that the read timeout value in the image is set to 4-5 seconds; if a request takes any longer, SSH keys won't be put into ~/.ssh/authorized_keys.
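
As a rough illustration of the failure mode (not the actual CirrOS or cloud-init implementation): a client that retries 5 times with a ~5 s read timeout never sees a response that takes ~10 s to produce, even though the server eventually logs a 200:

# Rough sketch of the inferred in-guest behaviour: ~5 s read timeout,
# 5 attempts, then give up. Not the actual CirrOS/cloud-init code.
import socket
import urllib.error
import urllib.request

URL = ('http://169.254.169.254'
       '/2009-04-04/meta-data/public-keys/0/openssh-key')

def fetch_openssh_key(retries=5, timeout=5):
    for _ in range(retries):
        try:
            with urllib.request.urlopen(URL, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, socket.timeout):
            continue   # response took longer than the read timeout
    # after all retries fail, the key never reaches
    # ~/.ssh/authorized_keys and key-based SSH login fails
    return None

Each attempt is cut off after the read timeout, so a server that consistently needs ~10 seconds (as in the log above) produces exactly the "failed to get ..." message seen in the console, while the server side records nothing but 200s.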

Revision history for this message
Evgeny Sikachev (esikachev) wrote :

I have upgraded my controller to 14 GB of RAM. This is not a swap problem.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Evgeny and I gave this issue another try on his environment:

- we didn't manage to reproduce the problem with unavailable metadata - every single time the public SSH key was injected properly

- the only way we could reproduce the problem was to make the controller node "busy" (i.e. what is described in the comments above)

We found out that it was actually another issue that prevented Sahara clusters from booting correctly - https://bugs.launchpad.net/mos/+bug/1572078

Evgeny is going to give this issue another try when https://bugs.launchpad.net/mos/+bug/1572078 is fixed. Moving to Incomplete for now.

tags: removed: blocker-for-qa scale
Revision history for this message
Evgeny Sikachev (esikachev) wrote :

We need to check on a scale lab. Don't forget to look at swap usage on the controller.

tags: added: scale
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Hi Performance QA team, we need your help with testing this issue on a scale lab. It looks like it is possible to reproduce the issue with Nova metadata under high load, which is why this bug is in the Incomplete state now. Could you please check it on MOS 9.0 on a 20+ node lab and either verify that the issue can't be reproduced (then move it to Invalid), or confirm that the issue can easily be reproduced under load, move it to the Confirmed state, and add some additional comments here.

Thank you!

Revision history for this message
Evgeny Sikachev (esikachev) wrote :

Reproduced on a local lab. The problem is related to swap usage on the controller. Sorry for the incorrect comment above.

tags: removed: scale
Revision history for this message
Timofey Durakov (tdurakov) wrote :

@esikachev, could you confirm that this issue depends on the controller load level and is reproduced only on a loaded environment (memory swapping)?

Revision history for this message
Evgeny Sikachev (esikachev) wrote :

This problem was reproduced on a controller without swap usage. RAM usage was 12/20 GB.

summary: - Metadata sometimes does not reach the VM
+ Sometimes cannot SSH to instances created via a Heat stack
description: updated
Revision history for this message
Evgeny Sikachev (esikachev) wrote :

I guess the problem is with our local lab, because on another lab everything works fine.
