tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_port_security_disable_security_group fails due to instances failing to retrieve public key via metadata service

Bug #1765545 reported by Daniel Alvarez
This bug affects 3 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Incomplete
Undecided
Unassigned
networking-ovn
New
Undecided
Unassigned

Bug Description

Running tempest test tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_port_security_disable_security_group fails sometimes when trying to authenticate via public key to the access point instance [0].

After some debugging, I managed to connect to the instance via virsh console and confirm that the instance did not have the SSH key installed:

  ____               ____  ____
 / __/ __ ____ ____ / __ \/ __/
/ /__ / // __// __// /_/ /\ \
\___//_//_/ /_/ \____/___/
   http://cirros-cloud.net
login as 'cirros' user. default password: 'cubswin:)'. use 'sudo' for root.
tempest-server-tempest-testsecuritygroupsbasicops-679466304-acc login: cirros
Password:
$ cat .ssh/authorized_keys
$
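
For completeness, the console access was along these lines (the domain name below is just a placeholder; "virsh list" on the compute node shows the real one):

$ sudo virsh list --all                  # find the libvirt domain backing the tempest instance
$ sudo virsh console instance-0000006e   # placeholder domain name; exit the console with Ctrl+]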

Looking at the console log, I can see the following:
cirros-ds 'net' up at 6.37
checking http://169.254.169.254/2009-04-04/instance-id
successful after 1/20 tries: up 6.54. iid=i-0000006e
failed to get http://169.254.169.254/2009-04-04/meta-data/public-keys/0/openssh-key
warning: no ec2 metadata for public-keys
failed to get http://169.254.169.254/2009-04-04/user-data
warning: no ec2 metadata for user-data
found datasource (ec2, net)

So it looks like it is able to fetch the instance-id but not the public key.
When I try to fetch the key manually from inside the instance, it is retrieved successfully:

$ curl 169.254.169.254/2009-04-04/meta-data/public-keys/0/openssh-key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDICvVroPErVzHbx+a1lhI4RU33f0Nb4DT2FiNbKhaI1ZBl4/zRbqFY5a4lMipV810dCzJSViGJVw0VzNgDOf/zCt6Joosem5qC8hKwRgX5tcEXQ0UnVCiXddP1bydbRVt4BofTCTUPb4SZ3Z4zl0+L4WWB1CY58KYl19Lr7H4zqMXPqa6Mw+k1dpo0YBk3ZZR4pIxGtN916w6x6vtSIy2oDg4zaxUuewGaQNp9wENEuP3+TOseTymBxpbdys2RpUKXM2vhWWDDbrzG0+juOFxn111SgFYom05sjONDM310xHX5KBm6QuJO6ObCkSIKre9wvU60i19YW7pxBtyfztIJ Generated-by-Nova
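
To rule out a broken index listing, the metadata tree can also be walked by hand with the standard EC2 paths (purely illustrative):

$ curl http://169.254.169.254/2009-04-04/meta-data/
$ curl http://169.254.169.254/2009-04-04/meta-data/public-keys/
$ curl http://169.254.169.254/2009-04-04/meta-data/public-keys/0/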

Also, running the following command by hand doesn't install the key either:

$ sudo cirros-apply net -v
$ cat .ssh/authorized_keys
$

If, instead, I run the following command (which reboots the instance afterwards), the key gets properly installed:
$ sudo cirros-per boot cirros-apply-net cirros-apply net && reboot
...
$ cat .ssh/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDICvVroPErVzHbx+a1lhI4RU33f0Nb4DT2FiNbKhaI1ZBl4/zRbqFY5a4lMipV810dCzJSViGJVw0VzNgDOf/zCt6Joosem5qC8hKwRgX5tcEXQ0UnVCiXddP1bydbRVt4BofTCTUPb4SZ3Z4zl0+L4WWB1CY58KYl19Lr7H4zqMXPqa6Mw+k1dpo0YBk3ZZR4pIxGtN916w6x6vtSIy2oDg4zaxUuewGaQNp9wENEuP3+TOseTymBxpbdys2RpUKXM2vhWWDDbrzG0+juOFxn111SgFYom05sjONDM310xHX5KBm6QuJO6ObCkSIKre9wvU60i19YW7pxBtyfztIJ Generated-by-Nova

After checking the OVN metadata proxy log and the nova-api-metadata log, I can see the requests and their 200 OK responses:

2018-04-19 22:31:37.383 24 INFO eventlet.wsgi.server [-] 10.100.0.7,<local> "GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 200 len: 146 time: 4.0820560
2018-04-19 22:31:38.800 24 INFO eventlet.wsgi.server [-] 10.100.0.7,<local> "GET /2009-04-04/meta-data/public-keys HTTP/1.1" status: 200 len: 183 time: 1.1210902
2018-04-19 22:31:49.148 24 INFO eventlet.wsgi.server [-] 10.100.0.7,<local> "GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 200 len: 146 time: 0.0230849
2018-04-19 22:31:49.387 24 INFO eventlet.wsgi.server [-] 10.100.0.7,<local> "GET /2009-04-04/meta-data/ami-launch-index HTTP/1.1" status: 200 len: 136 time: 0.0262089
2018-04-19 22:31:50.225 24 INFO eventlet.wsgi.server [-] 10.100.0.7,<local> "GET /2009-04-04/meta-data/instance-type HTTP/1.1" status: 200 len: 142 time: 0.7244408
2018-04-19 22:31:50.482 24 INFO eventlet.wsgi.server [-] 10.100.0.7,<local> "GET /2009-04-04/meta-data/local-ipv4 HTTP/1.1" status: 200 len: 146 time: 0.0143349
2018-04-19 22:31:50.612 24 INFO eventlet.wsgi.server [-] 10.100.0.7,<local> "GET /2009-04-04/meta-data/public-ipv4 HTTP/1.1" status: 200 len: 144 time: 0.0130348
2018-04-19 22:31:50.793 24 INFO eventlet.wsgi.server [-] 10.100.0.7,<local> "GET /2009-04-04/meta-data/hostname HTTP/1.1" status: 200 len: 199 time: 0.0100901
2018-04-19 22:31:51.039 24 INFO eventlet.wsgi.server [-] 10.100.0.7,<local> "GET /2009-04-04/meta-data/local-hostname HTTP/1.1" status: 200 len: 199 time: 0.0094490
2018-04-19 22:31:51.197 24 INFO eventlet.wsgi.server [-] 10.100.0.7,<local> "GET /2009-04-04/user-data HTTP/1.1" status: 404 len: 297 time: 0.0226381
2018-04-19 22:31:51.475 24 INFO eventlet.wsgi.server [-] 10.100.0.7,<local> "GET /2009-04-04/meta-data/block-device-mapping HTTP/1.1" status: 200 len: 143 time: 0.0118120
2018-04-19 22:31:51.579 24 INFO eventlet.wsgi.server [-] 10.100.0.7,<local> "GET /2009-04-04/meta-data/block-device-mapping/ami HTTP/1.1" status: 200 len: 138 time: 0.0084291
2018-04-19 22:31:51.672 24 INFO eventlet.wsgi.server [-] 10.100.0.7,<local> "GET /2009-04-04/meta-data/public-keys/0/openssh-key HTTP/1.1" status: 200 len: 535 time: 12.7038500
2018-04-19 22:31:51.735 24 INFO eventlet.wsgi.server [-] 10.100.0.7,<local> "GET /2009-04-04/meta-data/block-device-mapping/root HTTP/1.1" status: 200 len: 143 time: 0.0147779
2018-04-19 22:31:51.930 24 INFO eventlet.wsgi.server [-] 10.100.0.7,<local> "GET /2009-04-04/meta-data/public-hostname HTTP/1.1" status: 200 len: 199 time: 0.0083301

nova-api-metadata.log:

2018-04-19 22:31:38.852 5 INFO nova.metadata.wsgi.server [req-454724c8-a0cb-4c16-8c85-dd5940c751e2 - - - - -] 10.100.0.7,172.16.2.18 "GET /2009-04-04/meta-data/public-keys HTTP/1.1" status: 200 len: 164 time: 1.1147470
2018-04-19 22:31:51.474 5 INFO nova.metadata.wsgi.server [req-f1ef4033-3378-4074-baaf-6dfae80b81cf - - - - -] 10.100.0.7,172.16.2.18 "GET /2009-04-04/meta-data/public-keys/0/openssh-key HTTP/1.1" status: 200 len: 516 time: 12.6919320
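
Note that in both logs the openssh-key request took ~12.7 seconds to complete, while most other requests were answered in well under a second. If the CirrOS init script gives up on the request before the reply arrives, that would explain the "failed to get ... openssh-key" message on the guest even though the server eventually returns 200. As a purely illustrative check, the requests can be timed from inside the guest (the 10-second cut-off below is an arbitrary choice, not the actual timeout cirros-init uses):

$ for url in http://169.254.169.254/2009-04-04/meta-data/instance-id \
>            http://169.254.169.254/2009-04-04/meta-data/public-keys/0/openssh-key; do
>   # -s silences progress output, -m 10 makes curl give up after 10 seconds
>   time curl -s -m 10 -o /dev/null "$url"
> done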

This is a TripleO (tripleo-ci-master) setup running 3 controllers and 1 compute.
Please note that the test doesn't always fail: I ran it 10 times in a row and it failed with this error in 3 of them.

Even though it is an OVN environment, I suspect that it might have to do either with Nova or with the cirros image itself.
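
For reproducing the failure rate, a simple loop around tempest is enough (the exact invocation may differ per deployment; this is only a sketch):

$ for i in $(seq 10); do
>   tempest run --regex 'test_port_security_disable_security_group' || echo "run $i failed"
> done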

+--------------------------------------+----------------------------------+--------+
| ID | Name | Status |
+--------------------------------------+----------------------------------+--------+
| 671c2d8e-5909-4f61-9cd3-d81dbdb892d4 | cirros-0.3.5-x86_64-disk.img | active |
| e4d8721c-743a-4cbb-b026-b0ec1b5c84bf | cirros-0.3.5-x86_64-disk.img_alt | active |
+--------------------------------------+----------------------------------+--------+

The test would use the image with id 671c2d8e-5909-4f61-9cd3-d81dbdb892d4.
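
(Which of the two images the test boots is driven by image_ref in tempest.conf, e.g.:)

$ grep image_ref tempest.conf   # path to tempest.conf depends on the deployment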

I can't see any error traces in ovn-metadata-agent, ovn-controller, or neutron-server.

Thanks,
Daniel

[0] https://github.com/openstack/tempest/blob/c961a656ccdc0f1242b4ff3237a16d4a7cdf4e07/tempest/scenario/test_security_groups_basic_ops.py#L609

Tags: metadata
affects: networking-ovn → nova
affects: nova → networking-ovn
summary: tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_port_security_disable_security_group
- fails due to metadata error
+ fails due to instances failing to retrieve public key via metadata
+ service
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

@dalvarez, can you try running:

$ sudo bash -x cirros-apply net -v

I believe it is a bash script, so maybe you'll see what's going on there.
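
Something along these lines should capture the whole trace to a file (the $(which ...) part is just a guess at where the script lives):

$ sudo bash -x "$(which cirros-apply)" net -v 2>&1 | tee /tmp/cirros-apply.trace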

very weird.

tags: added: metadata
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

This is an old bug with an unanswered question from Miguel. I'm marking it as Incomplete. If the problem is still visible, please set this back to New and provide the requested data.

Changed in nova:
status: New → Incomplete