multiple nova server related test failure due to Host key verification failed on compute node

Bug #1861296 reported by chandan kumar
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: chandan kumar
Milestone: (none)

Bug Description

Multiple tempest.api.compute server-related Tempest tests failed on fs020. They were skipped in this review: https://review.opendev.org/#/c/701403/

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_2comp-featureset021-master/9cad320/logs/undercloud/var/log/tempest/tempest_run.log

For example this one:
{2} tempest.api.compute.admin.test_migrations.MigrationsAdminTest.test_cold_migration [314.255702s] ... FAILED

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "/usr/lib/python2.7/site-packages/tempest/api/compute/admin/test_migrations.py", line 140, in test_cold_migration
        self._test_cold_migrate_server(revert=False)
      File "/usr/lib/python2.7/site-packages/tempest/api/compute/admin/test_migrations.py", line 122, in _test_cold_migrate_server
        server['id'], 'VERIFY_RESIZE')
      File "/usr/lib/python2.7/site-packages/tempest/common/waiters.py", line 96, in wait_for_server_status
        raise lib_exc.TimeoutException(message)
    tempest.lib.exceptions.TimeoutException: Request timed out
    Details: (MigrationsAdminTest:test_cold_migration) Server d7250976-d701-43e8-9d03-991f2802ca81 failed to reach VERIFY_RESIZE status and task state "None" within the required time (300 s). Current status: ACTIVE. Current task state: None.

While looking at the nova-compute logs:
http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_2comp-featureset021-master/9cad320/logs/overcloud-novacompute-0/var/log/containers/nova/nova-compute.log

we found this:

2020-01-27 11:31:59.023 8 ERROR oslo_messaging.rpc.server [req-144faf8f-9465-4559-9a29-b4d838738639 c246cde822844b1a94c9e666d20ba0d4 54f7514fa0e549ce8e2eee91cb9317d6 - default default] Exception during message handling: ResizeError: Resize error: not able to execute ssh command: Unexpected error while running command.
Command: ssh -o BatchMode=yes 172.17.0.64 mkdir -p /var/lib/nova/instances/d7250976-d701-43e8-9d03-991f2802ca81
Exit code: 255
Stdout: u''

172.17.0.64 is an address on the internal API network. We looked into http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_2comp-featureset021-master/9cad320/logs/overcloud-novacompute-0/etc/hosts, which contains:

172.17.0.64 overcloud-novacompute-1.internalapi.localdomain overcloud-novacompute-1.internalapi

Then we looked into http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_2comp-featureset021-master/9cad320/logs/overcloud-novacompute-0/etc/ssh/ssh_known_hosts, which contains:

[192.168.24.19]*,[overcloud-novacompute-0.localdomain]*,[overcloud-novacompute-0]*,

The internal API address is missing from ssh_known_hosts.

This is breaking SSH communication between the compute nodes, so host key verification fails.
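The check described above (is a given address listed in ssh_known_hosts?) can be scripted. The following is a minimal Python sketch, assuming the `[host]*,...` host-list layout seen in the excerpt. The function name `known_host_names` and the placeholder key text are hypothetical, and hashed or marker-prefixed entries are not handled.

```python
def known_host_names(known_hosts_text):
    """Return the set of host names/addresses listed in known_hosts text."""
    names = set()
    for raw in known_hosts_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # The first whitespace-separated field is the comma-separated host list.
        host_field = line.split()[0]
        for pattern in host_field.split(","):
            # Entries in the excerpt look like "[host]*": drop the trailing
            # wildcard and the surrounding brackets.
            name = pattern.rstrip("*").strip("[]")
            if name:
                names.add(name)
    return names


# Host list copied from the excerpt; the key part is a placeholder, not real data.
SAMPLE = (
    "[192.168.24.19]*,[overcloud-novacompute-0.localdomain]*,"
    "[overcloud-novacompute-0]*, ssh-rsa PLACEHOLDER_KEY\n"
)

names = known_host_names(SAMPLE)
for target in ("192.168.24.19", "172.17.0.64"):
    print(target, "listed" if target in names else "MISSING")
```

Run against the real file on a compute node, any address reported as missing is one that `ssh -o BatchMode=yes` would refuse to connect to.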

Michal Pryc (mpryc) wrote :

The /etc/ssh/ssh_known_hosts file in the TripleO containers is a mounted copy of the /etc/ssh/ssh_known_hosts file on the overcloud nodes.

It is generated by this task:
  https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_ssh_known_hosts/tasks/main.yml#L46

For some reason role_networks does not appear to be defined, so the file is generated incorrectly.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ansible (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/704880

Changed in tripleo:
assignee: nobody → Oliver Walsh (owalsh)
status: Confirmed → In Progress
Oliver Walsh (owalsh) wrote :

The problem was this use of Jinja2:

{% set line = "foo" %}
{% for something in something %}
{% set line = line ~ 'bar' %}
{% endfor %}

The second set, inside the for loop, does not alter line in the outer scope, because Jinja2 scopes assignments made inside a loop to that loop.
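The scoping behaviour can be reproduced outside of Ansible. The sketch below is only an illustration in Python using the jinja2 package: the template strings and the `render` helper are made up for the demonstration, and the `namespace()` variant (which assumes Jinja2 2.10 or later) is one possible workaround, not necessarily what the eventual fix used.

```python
from jinja2 import Template

# Same shape as the problematic template: the inner `set` creates a
# loop-local `line`, so the outer `line` is still "foo" after the loop.
BROKEN = (
    '{% set line = "foo" %}'
    '{% for item in items %}{% set line = line ~ "bar" %}{% endfor %}'
    '{{ line }}'
)

# One possible workaround (assumes Jinja2 >= 2.10): attributes of a
# namespace() object can be assigned from inside the loop.
WITH_NAMESPACE = (
    '{% set ns = namespace(line="foo") %}'
    '{% for item in items %}{% set ns.line = ns.line ~ "bar" %}{% endfor %}'
    '{{ ns.line }}'
)


def render(source, items):
    """Render a template string with the given list bound to `items`."""
    return Template(source).render(items=items)


broken_result = render(BROKEN, ["a", "b"])
fixed_result = render(WITH_NAMESPACE, ["a", "b"])
print(repr(broken_result))  # 'foo'
print(repr(fixed_result))   # 'foobarbar'
```

With two loop items, the first template still renders only the initial value, which matches the truncated ssh_known_hosts entry described in this bug.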

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/704919

tags: added: train-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-quickstart-extras (master)

Fix proposed to branch: master
Review: https://review.opendev.org/704962

Changed in tripleo:
assignee: Oliver Walsh (owalsh) → chandan kumar (chkumar246)
Changed in tripleo:
assignee: chandan kumar (chkumar246) → Oliver Walsh (owalsh)
Changed in tripleo:
assignee: Oliver Walsh (owalsh) → wes hayutin (weshayutin)
Changed in tripleo:
assignee: wes hayutin (weshayutin) → chandan kumar (chkumar246)
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/704919
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=ff3e716b1a081e139943f2dedf0aec1df22e4f37
Submitter: Zuul
Branch: master

commit ff3e716b1a081e139943f2dedf0aec1df22e4f37
Author: Oliver Walsh <email address hidden>
Date: Thu Jan 30 03:08:34 2020 +0000

    Add tripleo_role_networks to inventory role group vars

    Both inventory host vars and deployment global vars set enabled_networks.
    Host var is snake_case network, global var is CamelCase network.
    Global var takes precedence as we explicitly include it in the deployment
    playbook.

    Create tripleo_role_networks to avoid conflicts and drop the enabled_networks
    host vars as it is not currently used (AFAICT).

    Depends-On: https://review.opendev.org/#/c/705757/
    Depends-On: https://review.opendev.org/#/c/705767/
    Change-Id: I6108aebf49a4bbc98394987f56dce6bcbe521b3a
    Related-bug: #1861296

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/704880
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=07946456d0380b441f75aeb3f7d5bec229f33bf1
Submitter: Zuul
Branch: master

commit 07946456d0380b441f75aeb3f7d5bec229f33bf1
Author: Oliver Walsh <email address hidden>
Date: Wed Jan 29 20:22:42 2020 +0000

    Simplify ssh_known_hosts role

    This is only required for compute nodes running nova_migration_target so
    we can simplify the logic significantly.

    The host.network entries are now omitted as cold/live migration only uses
    either the fqdn, short hostname, or IP. This should help a little with scaling
    too as ssh_host_keys can get gigantic with a large number of computes.

    We can assume the remaining vars for networks and fqdn/ip are all set as host
    or role group vars in the inventory.
    Just in case, fall back to a basic entry when the host vars are missing.

    This should also make it easier for operators to run the role in isolation,
    e.g. to quickly fix up the ssh keys on any compute hosts omitted from a
    scale-up.

    Also fixes bug #1861296 which was caused by attempting to use set to override
    a jinja2 var from an outer scope.

    Change-Id: I5c91122b6cbd731d369b19b13fd011114dd48175
    Depends-On: https://review.opendev.org/#/c/704919/
    Closes-bug: #1861296

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.opendev.org/704962
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=a49edd82c483fa01777803f5a1f4c1d9a5e7bec3
Submitter: Zuul
Branch: master

commit a49edd82c483fa01777803f5a1f4c1d9a5e7bec3
Author: Chandan Kumar (raukadah) <email address hidden>
Date: Thu Jan 30 11:15:47 2020 +0000

    Revert "update master skip list, nova issues and timeout"

    # cinder-fix on tht
    Depends-On: https://review.opendev.org/#/c/704805/

    # ssh known host fix on tripleo-ansible
    Depends-On: https://review.opendev.org/#/c/704880/

    # Adding tripleo_role_networks on tripleo-common
    Depends-On: https://review.opendev.org/#/c/704919/

    Closes-Bug: #1861393
    Closes-Bug: #1861296

    This reverts commit 4a1526f8d48d5da7f812a70e7d2d407f7650380b.

    Change-Id: I888431e5d60591acb725a3e18ac133fa0cba496d

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-ansible 1.2.0

This issue was fixed in the openstack/tripleo-ansible 1.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/718555

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (stable/train)

Reviewed: https://review.opendev.org/718555
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=1c00370074d1a00c8928125ccb841fab944acd82
Submitter: Zuul
Branch: stable/train

commit 1c00370074d1a00c8928125ccb841fab944acd82
Author: Oliver Walsh <email address hidden>
Date: Thu Jan 30 03:08:34 2020 +0000

    Add tripleo_role_networks to inventory role group vars

    Both inventory host vars and deployment global vars set enabled_networks.
    Host var is snake_case network, global var is CamelCase network.
    Global var takes precedence as we explicitly include it in the deployment
    playbook.

    Create tripleo_role_networks to avoid conflicts and drop the enabled_networks
    host vars as it is not currently used (AFAICT).

    Change-Id: I6108aebf49a4bbc98394987f56dce6bcbe521b3a
    Related-bug: #1861296
    (cherry picked from commit ff3e716b1a081e139943f2dedf0aec1df22e4f37)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/736229

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (stable/train)

Reviewed: https://review.opendev.org/736229
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=5dd61d3b0acdb1fc9e4287fd194529724aca0342
Submitter: Zuul
Branch: stable/train

commit 5dd61d3b0acdb1fc9e4287fd194529724aca0342
Author: Oliver Walsh <email address hidden>
Date: Wed Jan 29 20:22:42 2020 +0000

    Simplify ssh_known_hosts role

    This is only required for compute nodes running nova_migration_target so
    we can simplify the logic significantly.

    The host.network entries are now omitted as cold/live migration only uses
    either the fqdn, short hostname, or IP. This should help a little with scaling
    too as ssh_host_keys can get gigantic with a large number of computes.

    We can assume the remaining vars for networks and fqdn/ip are all set as host
    or role group vars in the inventory.
    Just in case, fall back to a basic entry when the host vars are missing.

    This should also make it easier for operators to run the role in isolation,
    e.g. to quickly fix up the ssh keys on any compute hosts omitted from a
    scale-up.

    Also fixes bug #1861296 which was caused by attempting to use set to override
    a jinja2 var from an outer scope.

    Change-Id: I5c91122b6cbd731d369b19b13fd011114dd48175
    Depends-On: https://review.opendev.org/#/c/704919/
    Closes-bug: #1861296
    (cherry picked from commit 07946456d0380b441f75aeb3f7d5bec229f33bf1)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-ansible 0.6.0

This issue was fixed in the openstack/tripleo-ansible 0.6.0 release.
