boot_runcommand_delete failed on ssh step

Bug #1418655 reported by Leontii Istomin
This bug affects 3 people
Affects              Status     Importance  Assigned to    Milestone
Mirantis OpenStack   Invalid    High        MOS Scale
6.0.x                Won't Fix  High        Dina Belova
6.1.x                Won't Fix  High        Ilya Shakhat
7.0.x                Invalid    High        MOS Scale

Bug Description

api: '1.0'
astute_sha: f7cda2171b0b677dfaeb59693d980a2d3ee4c3e0
auth_required: true
build_id: 2015-02-04_09-06-18
build_number: '67'
feature_groups:
- mirantis
fuellib_sha: f41f594850fb8ced89a2a29649cd478492d2b29b
fuelmain_sha: c799e3a6d88289e58db764a6be7910aab7da3149
nailgun_sha: ec399d1d636be8c501336359041b13a78c139dd5
ostf_sha: 3b57985d4d2155510894a1f6d03b478b201f7780
production: docker
release: 6.0.1
release_versions:
  2014.2-6.0.1:
    VERSION:
      api: '1.0'
      astute_sha: f7cda2171b0b677dfaeb59693d980a2d3ee4c3e0
      build_id: 2015-02-04_09-06-18
      build_number: '67'
      feature_groups:
      - mirantis
      fuellib_sha: f41f594850fb8ced89a2a29649cd478492d2b29b
      fuelmain_sha: c799e3a6d88289e58db764a6be7910aab7da3149
      nailgun_sha: ec399d1d636be8c501336359041b13a78c139dd5
      ostf_sha: 3b57985d4d2155510894a1f6d03b478b201f7780
      production: docker
      release: 6.0.1

I executed the rally light test:
http://mos-scale.vm.mirantis.net:8080/view/ENV-13%20(25%20nodes%20(up%20to%20100))/job/13_env_run_rally_light/14/console

The rally-scenarios.vm.boot_runcommand_delete test failed:
16:08:04 stderr-0000: {{{
16:08:04 SSHTimeout
16:08:04 Timeout waiting for '172.16.47.221'
16:08:04 Traceback (most recent call last):
16:08:04 File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/runners/base.py", line 77, in _run_scenario_once
16:08:04 method_name)(**kwargs) or scenario_output
16:08:04 File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/vm/vmtasks.py", line 108, in boot_runcommand_delete
16:08:04 password, interpreter, script)
16:08:04 File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/vm/utils.py", line 68, in run_command
16:08:04 self.wait_for_ssh(ssh)
16:08:04 File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/base.py", line 254, in func_atomic_actions
16:08:04 f = func(self, *args, **kwargs)
16:08:04 File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/vm/utils.py", line 44, in wait_for_ssh
16:08:04 ssh.wait()
16:08:04 File "/opt/stack/.venv/lib/python2.7/site-packages/rally/common/sshutils.py", line 258, in wait
16:08:04 raise SSHTimeout(_("Timeout waiting for '%s'") % self.host)
16:08:04 SSHTimeout: Timeout waiting for '172.16.47.221'
16:08:04 }}}

As you can see, only the first iteration failed, and it failed during the SSH connection:
16:08:04 +-----------------------------+--------+--------+---------+----------+-------+--------+
16:08:04 | Action | min | avg | max | failures | total | result |
16:08:04 +-----------------------------+--------+--------+---------+----------+-------+--------+
16:08:04 | nova.associate_floating_ip | 1.150 | 5.647 | 7.225 | 0 | 5 | :) |
16:08:04 | nova.boot_server | 15.768 | 29.921 | 49.319 | 0 | 5 | :) |
16:08:04 | nova.delete_server | 2.502 | 4.810 | 7.059 | 0 | 5 | :) |
16:08:04 | nova.dissociate_floating_ip | 1.679 | 3.831 | 6.696 | 0 | 5 | :) |
16:08:04 | vm.run_command | 3.570 | 5.858 | 7.911 | 1 | 5 | :( |
16:08:04 | vm.wait_for_ping | 0.014 | 10.428 | 48.485 | 0 | 5 | :) |
16:08:04 | vm.wait_for_ssh | 0.233 | 8.343 | 24.568 | 1 | 5 | :( |
16:08:04 | ~total | 72.769 | 98.537 | 167.055 | 1 | 5 | :( |
16:08:04 +-----------------------------+--------+--------+---------+----------+-------+--------+
I have not found any log messages about Neutron issues. Probably the security groups had not yet been applied at the time we tried to reach the instance via SSH.
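
A quick way to confirm or rule out the security-group hypothesis is to probe the SSH port at the TCP level while the scenario runs. The helper below is a minimal stdlib sketch (hypothetical, not part of Rally or the scenario): if the TCP handshake never succeeds, the problem is connectivity (e.g. security group rules not yet applied); if it succeeds but authentication later fails, the problem is inside the guest.

    import socket
    import time

    def wait_for_port(host, port=22, timeout=120, interval=3):
        """Poll a TCP port; True means security groups/routing let traffic through."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                sock = socket.create_connection((host, port), timeout=interval)
                sock.close()
                return True   # TCP reachable: security groups and floating IP are fine
            except (socket.timeout, socket.error):
                time.sleep(interval)
        return False          # never reachable: likely a security group / L3 issue

    print(wait_for_port("172.16.47.221"))  # floating IP from the failed iteration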

The rally scenario logs are attached. You can find the diagnostic snapshot here: https://drive.google.com/a/mirantis.com/file/d/0Bx4ptZV1Jt7hYkVtSHNxRHE1M2s/view?usp=sharing

Tags: neutron scale
Revision history for this message
Leontii Istomin (listomin) wrote :
tags: added: neutron
Revision history for this message
Leontii Istomin (listomin) wrote :

It was reproduced again.

_StringException: config:
{{{
{
    "kw": {
        "runner": {
            "type": "constant",
            "concurrency": 5,
            "times": 5
        },
        "args": {
            "username": "cirros",
            "floating_network": "net04_ext",
            "script": "/opt/stack/rally-scenarios/vm/instance_metadata.sh",
            "image": {
                "name": "TestVM"
            },
            "interpreter": "/bin/sh",
            "flavor": {
                "name": "m1.nano"
            }
        },
        "sla": {
            "failure_rate": {
                "max": 0
            }
        },
        "context": {
            "network": {
                "networks_per_tenant": 1,
                "start_cidr": "10.2.0.0/24"
            },
            "users": {
                "project_domain": "default",
                "users_per_tenant": 2,
                "user_domain": "default",
                "tenants": 2,
                "resource_management_workers": 30
            }
        }
    },
    "name": "VMTasks.boot_runcommand_delete",
    "pos": 0
}
}}}

results:
{{{
+-----------------------------+--------+---------+---------+----------+-------+--------+
| Action                      | min    | avg     | max     | failures | total | result |
+-----------------------------+--------+---------+---------+----------+-------+--------+
| nova.associate_floating_ip  | 1.615  | 4.494   | 9.010   | 0        | 5     | :)     |
| nova.boot_server            | 14.614 | 21.468  | 25.680  | 0        | 5     | :)     |
| nova.delete_server          | 2.820  | 7.264   | 12.891  | 0        | 5     | :)     |
| nova.dissociate_floating_ip | 1.888  | 5.748   | 10.159  | 0        | 5     | :)     |
| vm.run_command              | 6.362  | 7.555   | 9.677   | 1        | 5     | :(     |
| vm.wait_for_ping            | 0.471  | 33.186  | 58.845  | 0        | 5     | :)     |
| vm.wait_for_ssh             | 0.552  | 4.652   | 11.345  | 1        | 5     | :(     |
| ~total                      | 63.252 | 114.565 | 183.871 | 1        | 5     | :(     |
+-----------------------------+--------+---------+---------+----------+-------+--------+
}}}

stderr-0003:
{{{
SSHTimeout
Timeout waiting for '172.16.46.180'
Traceback (most recent call last):
  File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/runners/base.py", line 77, in _run_scenario_once
    method_name)(**kwargs) or scenario_output
  File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/vm/vmtasks.py", line 108, in boot_runcommand_delete
    password, interpreter, script)
  File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/vm/utils.py", line 68, in run_command
    self.wait_for_ssh(ssh)
  File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/base.py", line 254, in func_atomic_actions
    f = func(self, *args,
...

Dina Belova (dbelova)
Changed in mos:
importance: Undecided → High
status: New → Confirmed
Changed in mos:
milestone: none → 6.1
Revision history for this message
Ilya Shakhat (shakhat) wrote :

The issue can be reproduced without Rally, just by running a script that boots an instance, assigns a floating IP, and connects via SSH. The reason is a failure to fetch metadata; the instance log contains "2015-02-19 12:07:14,760 - util.py[WARNING]: Failed fetching metadata from url http://169.254.169.254/2009-04-04/meta-data". As a result, the instance is missing its keys and hostname.
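
For reference, roughly what such a standalone script could look like. This is only a sketch: it assumes the Juno-era python-novaclient floating IP calls (nova.floating_ips / add_floating_ip), and the credentials, endpoint, network and key names are placeholders, not values from this environment.

    import time

    import paramiko
    from novaclient import client as nova_client

    # Placeholder credentials and endpoint -- adjust for the environment under test.
    nova = nova_client.Client("2", "admin", "PASSWORD", "admin",
                              "http://KEYSTONE_HOST:5000/v2.0")

    server = nova.servers.create("metadata-check",
                                 nova.images.find(name="TestVM"),
                                 nova.flavors.find(name="m1.nano"),
                                 key_name="rally-key")       # key pair must already exist
    while nova.servers.get(server.id).status != "ACTIVE":
        time.sleep(2)                                        # no error handling, sketch only

    fip = nova.floating_ips.create(pool="net04_ext")         # pre-Liberty novaclient API
    server.add_floating_ip(fip.ip)

    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(fip.ip, username="cirros", key_filename="rally-key.pem",
                timeout=60, look_for_keys=False, allow_agent=False)
    print(ssh.exec_command("hostname")[1].read())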

Revision history for this message
Ilya Shakhat (shakhat) wrote :

Increased the load parameters in the scenario; waiting for a reproduction at scale.

Revision history for this message
Alexander Ignatov (aignatov) wrote :
Revision history for this message
Alexander Ignatov (aignatov) wrote :

Waiting for a reproduction at scale; a live environment is preferable.

Revision history for this message
Dina Belova (dbelova) wrote :

The issue was reproduced today. Sadly, the test results were not properly analyzed, and the environment had already been redeployed. Access to the lab will be provided after the next reproduction.

Revision history for this message
Ilya Shakhat (shakhat) wrote :

The issue was observed in one of the dev labs on Fuel version 5.2.5. The reason is that the key is not injected into the instance.

Instance console log:
~~~~~~~~~~~~~~~~
checking http://169.254.169.254/2009-04-04/instance-id
successful after 1/20 tries: up 0.97. iid=i-00000ea8
failed to get http://169.254.169.254/2009-04-04/meta-data/public-keys
warning: no ec2 metadata for public-keys
failed to get http://169.254.169.254/2009-04-04/user-data
warning: no ec2 metadata for user-data

Metadata proxy:
~~~~~~~~~~~~~
2015-04-07 15:33:32.551 6179 INFO neutron.wsgi [-] (6179) accepted ('100.0.1.144', 48951)
2015-04-07 15:33:32.845 6179 INFO neutron.wsgi [-] 100.0.1.144 - - [07/Apr/2015 15:33:32] "GET /2009-04-04/meta-data/instance-id HTTP/1.1" 200 126 0.292055
2015-04-07 15:33:32.851 6179 INFO neutron.wsgi [-] (6179) accepted ('100.0.1.144', 48952)
2015-04-07 15:33:42.858 6179 INFO neutron.wsgi [-] (6179) accepted ('100.0.1.144', 48953)
2015-04-07 15:33:42.925 6179 INFO neutron.wsgi [-] 100.0.1.144 - - [07/Apr/2015 15:33:42] "GET /2009-04-04/meta-data/instance-id HTTP/1.1" 200 126 0.066979

Metadata proxy when key injected successfully:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2015-04-07 15:04:29.773 6179 INFO neutron.wsgi [-] (6179) accepted ('100.0.1.144', 60757)
2015-04-07 15:04:30.030 6179 INFO neutron.wsgi [-] 100.0.1.144 - - [07/Apr/2015 15:04:30] "GET /2009-04-04/meta-data/instance-id HTTP/1.1" 200 126 0.256710
2015-04-07 15:04:30.037 6179 INFO neutron.wsgi [-] (6179) accepted ('100.0.1.144', 60758)
2015-04-07 15:04:30.096 6179 INFO neutron.wsgi [-] 100.0.1.144 - - [07/Apr/2015 15:04:30] "GET /2009-04-04/meta-data/public-keys HTTP/1.1" 200 131 0.058433
2015-04-07 15:04:30.100 6179 INFO neutron.wsgi [-] (6179) accepted ('100.0.1.144', 60759)
2015-04-07 15:04:30.156 6179 INFO neutron.wsgi [-] 100.0.1.144 - - [07/Apr/2015 15:04:30] "GET /2009-04-04/meta-data/public-keys/0/openssh-key HTTP/1.1" 200 516 0.054804
2015-04-07 15:04:30.161 6179 INFO neutron.wsgi [-] (6179) accepted ('100.0.1.144', 60760)
2015-04-07 15:04:30.226 6179 INFO neutron.wsgi [-] 100.0.1.144 - - [07/Apr/2015 15:04:30] "GET /2009-04-04/meta-data/instance-id HTTP/1.1" 200 126 0.063837

Metadata agent log:
~~~~~~~~~~~~~~~~
2015-04-07 15:33:32.843 6643 INFO eventlet.wsgi.server [req-2ae3f896-d8f4-4235-9102-dcda80db3cde None] 100.0.1.144,<local> - - [07/Apr/2015 15:33:32] "GET /2009-04-04/meta-data/instance-id HTTP/1.1" 200 126 0.289115
2015-04-07 15:33:42.924 6636 INFO eventlet.wsgi.server [req-2ae3f896-d8f4-4235-9102-dcda80db3cde None] 100.0.1.144,<local> - - [07/Apr/2015 15:33:42] "GET /2009-04-04/meta-data/instance-id HTTP/1.1" 200 126 0.063073

It appears that the instance was able to connect to the metadata proxy, but the traffic then failed: the request was not forwarded to the agent, so the failure occurred during the HTTP request.
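
To back the "served by the proxy but never forwarded to the agent" conclusion, the two logs can be correlated by request path. Below is a rough stdlib sketch, under the assumption that both logs use the request-line format shown in the excerpts above; the log file names are placeholders.

    import re
    from collections import Counter

    REQUEST = re.compile(r'"GET (?P<path>\S+) HTTP/1\.[01]"')

    def count_requests(log_path):
        """Count metadata GET requests per URL path in a wsgi-style log."""
        counts = Counter()
        with open(log_path) as log:
            for line in log:
                match = REQUEST.search(line)
                if match:
                    counts[match.group("path")] += 1
        return counts

    proxy = count_requests("neutron-ns-metadata-proxy.log")  # placeholder file name
    agent = count_requests("neutron-metadata-agent.log")     # placeholder file name

    for path, seen_by_proxy in sorted(proxy.items()):
        seen_by_agent = agent.get(path, 0)
        if seen_by_agent < seen_by_proxy:
            print("%s: proxy=%d, agent=%d (not fully forwarded)"
                  % (path, seen_by_proxy, seen_by_agent))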

Revision history for this message
Dina Belova (dbelova) wrote :

No reproduction yet. Will move to Confirmed if we face it again.

Revision history for this message
Leontii Istomin (listomin) wrote :

reproduced the issue with the following configuration:

api: '1.0'
astute_sha: 055b2d82fe8499b27c7047295e2e36a7a2c5d430
auth_required: true
build_id: 2015-04-16_21-30-10
build_number: '317'
feature_groups:
- mirantis
fuellib_sha: db5f39e96e7ab9f79691202755e547bf8242661f
fuelmain_sha: 0de2d2039e76839d339f977df45b111bef7200d6
nailgun_sha: 52d92c86e68602fb5dd2f3b8870a773d20a387ef
openstack_version: 2014.2-6.1
ostf_sha: b0991dbad159f53d335efa5d31cb94016ad5312e
production: docker
python-fuelclient_sha: 279ffc358e40dbdc162cfe76898cbd0874529f1f
release: '6.1'

Baremetal, CentOS, IBP, HA, Neutron-gre, Ceph-all, Nova-debug, Nova-quotas, 6.1_317
Controllers: 3, Computes: 17

We changed the rally scenario so that it does not delete instances and floating IPs after finishing.

rally exception:
Traceback (most recent call last):
  File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/runners/base.py", line 79, in _run_scenario_once
    method_name)(**kwargs) or scenario_output
  File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/vm/vmtasks.py", line 88, in boot_runcommand_delete
    password, interpreter, script)
  File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/vm/utils.py", line 158, in _run_command
    self._wait_for_ssh(ssh)
  File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/base.py", line 261, in func_atomic_actions
    f = func(self, *args, **kwargs)
  File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/vm/utils.py", line 127, in _wait_for_ssh
    ssh.wait()
  File "/opt/stack/.venv/lib/python2.7/site-packages/rally/common/sshutils.py", line 258, in wait
    raise SSHTimeout(_("Timeout waiting for '%s'") % self.host)
SSHTimeout: Timeout waiting for '172.16.47.71'

We have found that the instance couldn't get a private IP address via DHCP.

Then the agent was restarted by something:
2015-05-08T12:30:43.065867+00:00 info: 2015-05-08 12:30:43.066 1939 INFO neutron.openstack.common.service [req-cb2867aa-47f8-4ffe-8a8c-8fe7b7d6ffd9 None] Caught SIGTERM, exiting
And when we ran the "udhcp eth0" command manually, the instance got an address.
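
To check how often this happens, one can list every restart of the agent recorded in its log and compare the timestamps with the window in which the instance was waiting for a DHCP lease. A trivial sketch, assuming log lines shaped like the SIGTERM message above and a placeholder log file name:

    import sys

    def restart_times(log_path, marker="Caught SIGTERM"):
        """Return the leading timestamp of every line that records an agent restart."""
        with open(log_path) as log:
            return [line.split(" ", 1)[0] for line in log if marker in line]

    log_file = sys.argv[1] if len(sys.argv) > 1 else "neutron-dhcp-agent.log"  # placeholder
    for timestamp in restart_times(log_file):
        print(timestamp)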

Snapshot is coming

Revision history for this message
Leontii Istomin (listomin) wrote :
Revision history for this message
Ilya Shakhat (shakhat) wrote :

According to the snapshot, the failure in the rally execution happened at 2015-05-08 10:32:18.070. The message says "Timeout waiting for '172.16.47.71'".

From the logs, the following is observed:

Instance is created and Rally tries to ping it for the first time:
2015-05-08 10:29:32.906 25753 RALLYDEBUG rally.benchmark.scenarios.vm.utils [-] Host 172.16.47.71 is ICMP down

The instance boots up and its kernel starts responding to ping:
2015-05-08 10:30:16.930 25753 RALLYDEBUG rally.benchmark.scenarios.vm.utils [-] Host 172.16.47.71 is ICMP up

Rally was able to connect over the SSH protocol (port 22 replied and the instance returned its host key):
2015-05-08 10:31:02.401 25753 DEBUG paramiko.transport [-] Adding ssh-rsa host key for 172.16.47.71: f0ebb35a00cf6510b209da440cce27a4 _log /opt/stack/.venv/lib/python2.7/site-packages/paramiko/transport.py:1430

The same attempts continue until 2015-05-08 10:32:15.113:
2015-05-08 10:32:15.113 25753 DEBUG paramiko.transport [-] Adding ssh-rsa host key for 172.16.47.71: f0ebb35a00cf6510b209da440cce27a4 _log /opt/stack/.venv/lib/python2.7/site-packages/paramiko/transport.py:1430

Rally tries to connect via provided key, but fails:
2015-05-08 10:32:15.114 25753 DEBUG paramiko.transport [-] Trying SSH key 91f726981b7dd36c2219f2b156db1a24 _log /opt/stack/.venv/lib/python2.7/site-packages/paramiko/transport.py:1430
2015-05-08 10:32:16.870 25753 INFO paramiko.transport [-] Authentication (publickey) failed.

The library falls back to system keys, but fails too (as expected):
2015-05-08 10:32:16.872 25753 DEBUG paramiko.transport [-] Trying discovered key 59cf4c74be4942acb96428799da2c80e in /home/developer/.ssh/id_rsa _log /opt/stack/.venv/lib/python2.7/site-packages/paramiko/transport.py:1430
2015-05-08 10:32:16.873 25753 DEBUG paramiko.transport [-] userauth is OK _log /opt/stack/.venv/lib/python2.7/site-packages/paramiko/transport.py:1430
2015-05-08 10:32:16.948 25753 INFO paramiko.transport [-] Authentication (publickey) failed.

After that the connection is terminated and Rally raises timeout exception:
2015-05-08 10:32:17.069 25753 RALLYDEBUG rally.common.sshutils [-] Ssh is still unavailable: SSHError(u"Exception <class 'paramiko.ssh_exception.AuthenticationException'> was raised during connect. Exception value is: AuthenticationException('Authentication failed.',)",)
2015-05-08 10:32:18.070 25753 ERROR rally.benchmark.runners.base [-] Timeout waiting for '172.16.47.71'

From the above it is clear that networking was fully functional. Chances are that the root cause is lost metadata; however, the metadata proxy logs are missing from the snapshot, so further analysis is not possible (bug https://bugs.launchpad.net/fuel/+bug/1455426 has been filed).
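
The same distinction (transport and host-key exchange succeed, public-key authentication fails) can be demonstrated outside Rally with a few lines of paramiko. A minimal sketch; the username comes from the scenario args above, while the key path is a placeholder:

    import socket

    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect("172.16.47.71", username="cirros",
                       key_filename="rally-key.pem",          # placeholder key path
                       timeout=30, look_for_keys=False, allow_agent=False)
        print("SSH OK: the key is present in the instance")
    except paramiko.AuthenticationException:
        # The SSH banner and host-key exchange worked, so networking and security
        # groups are fine; the public key simply never reached the instance.
        print("Authentication failed: key missing in the guest (metadata problem)")
    except (socket.error, paramiko.SSHException) as exc:
        # Could not even reach the SSH service: a networking problem, not metadata.
        print("Connection failed: %s" % exc)
    finally:
        client.close()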

For further investigation it is recommended to preserve a live environment that exhibits the issue.

Revision history for this message
Alexander Ignatov (aignatov) wrote :

Taking into account Ilya's comment above, the lack of time before 6.1 HCF, and the uncertainty of the RCA, I'm moving this bug to 7.0. We will proceed with the investigation in the 7.0 release time frame.

Revision history for this message
Alexander Ignatov (aignatov) wrote :

Requesting the scale team to verify whether or not this issue exists at scale. I remember that this issue occurred only in CentOS environments; please re-verify it on Ubuntu. Moved the bug to Incomplete and assigned it to the scale team.

Revision history for this message
Leontii Istomin (listomin) wrote :

Hasn't been reproduced with 7.0

Revision history for this message
Leontii Istomin (listomin) wrote :

Probably reproduced the issue with the 7.0-98 build, but the logs were lost. Will try to reproduce again.

Revision history for this message
Dina Belova (dbelova) wrote :

Was reproduced on 7.0-98.

Revision history for this message
Leontii Istomin (listomin) wrote :

The root cause of the issue in 7.0 is https://bugs.launchpad.net/fuel/+bug/1479815.
