TestBootVolumePattern is failing in multinode jobs

Bug #1742936 reported by Arx Cruz
This bug affects 2 people
Affects: tripleo
Status: Fix Committed
Importance: High
Assigned to: Arx Cruz

Bug Description

This seems to be similar to, or the same problem as, https://bugs.launchpad.net/tripleo/+bug/1731063
However, the ovsdbapp package in use is the version that should already contain the fix.

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
    return f(*func_args, **func_kwargs)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/test_volume_boot_pattern.py", line 112, in test_volume_boot_pattern
    private_key=keypair['private_key'])
  File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 609, in create_timestamp
    private_key=private_key)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 356, in get_remote_client
    linux_client.validate_authentication()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 57, in wrapper
    six.reraise(*original_exception)
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 30, in wrapper
    return function(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 113, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 207, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 121, in _get_ssh_connection
    password=self.password)
tempest.lib.exceptions.SSHTimeout: Connection to the 192.168.24.108 via SSH timed out.
User: cirros, Password: None
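
A quick way to narrow down an SSHTimeout like the one above is to check whether TCP port 22 on the instance address accepts connections at all. The following is a minimal Python 3 sketch, not part of Tempest; the function name probe_tcp and its result labels are made up for illustration. A result of "unreachable" suggests that packets never reach the guest or that it has not finished booting, while "refused" means the address answers but nothing is listening on the port.

```python
import socket


def probe_tcp(host, port=22, timeout=5.0):
    """Try one TCP connect and classify the outcome.

    Returns "open", "refused", or "unreachable" (hypothetical labels).
    """
    try:
        # create_connection resolves the host and applies the timeout
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but no listener
        return "refused"
    except OSError:
        # Covers timeouts, "no route to host", and similar failures
        return "unreachable"
```

For example, probe_tcp("192.168.24.108") could be run from the node that executes Tempest while the test is waiting for SSH.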

LOGS:
https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset016-master/b09f8f0/tempest.html.gz

Also note that this is only failing in multinode jobs; the ovb jobs using the same hash are not affected:

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/33d447c/tempest.html.gz

Tags: ci tech
Arx Cruz (arxcruz)
Changed in tripleo:
assignee: nobody → Arx Cruz (arxcruz)
wes hayutin (weshayutin)
tags: added: alert
Brian Haley (brian-haley) wrote :

I looked at this with Ihar for a while today and didn't find any smoking gun in the log files.

There are a few things we'd need to do to debug this further from the neutron side, assuming it's a networking issue:

1. Stop the system under test when this fails so we can log in and look around.

2. Make sure we have the console log of the instance(s). If this isn't doable via some setting, we need to look at something like https://stackoverflow.com/questions/12290336/how-to-execute-code-only-on-test-failures-with-python-unittest2 to dump the console on failure.

3. Log the requests served by haproxy and the metadata agent. The haproxy conf file shows it should log to /dev/log, which is exposed in the container, but I couldn't find any evidence of logging anywhere. The metadata agent does not override the debug level like the other agents do, so we should set "debug = True" in metadata_agent.ini.

In the meantime it might be worth skipping the test until we can gather additional information if it is critical.

wes hayutin (weshayutin) wrote :

Re: #1, upstream policy will not allow CI jobs to be stopped for investigation.
However, you can use the recreate script to produce the exact same environment in RDO-Cloud:

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset016-master/b09f8f0/reproducer-quickstart.sh

Note: There are some open bugs on RDO-Cloud atm that may impact your ability to provision an environment successfully [1][2]

[1] https://bugs.launchpad.net/nova/+bug/1742827
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1533196

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/pike)

Reviewed: https://review.openstack.org/533380
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=fa02a8f8633b682ba2042ac7fb26d76453cfe65a
Submitter: Zuul
Branch: stable/pike

commit fa02a8f8633b682ba2042ac7fb26d76453cfe65a
Author: John Fulton <email address hidden>
Date: Fri Jan 5 15:26:22 2018 -0500

    Align stars to fix stable/pike gate on scenario001

    1) Fix path for iscsi config file

    We changed the bind mount to be /etc/iscsi in
    I838427ccae06cfe1be72939c4bcc2978f7dc36a8, we need to copy the files to
    /etc/iscsi so that they do not end up at '/' in the container.

    Change-Id: Id5c1f16d08ffd36a35a6669d64460a7b2240d401
    Closes-Bug: #1741850
    (cherry picked from commit 8eb351d588539c20caf768c2633832a924f40690)

    2) Fix puppet config volume for iscsid in containers

    Bind mount the /etc/iscsi host path for iscsi container puppet config.
    Use the real host path /etc/iscsi for containers depending on it.

    Closes-bug: #1735425

    Change-Id: I838427ccae06cfe1be72939c4bcc2978f7dc36a8
    Co-authored-by: Alan Bishop <email address hidden>
    Co-authored-by: Martin André <email address hidden>
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 82f128f15b1b1eb7bf6ac7df0c6d01e5619309eb)

    3) Allow to override manage polling param

    Without this, we cannot override the polling yaml metrics
    from puppet template.

    Change-Id: I509dd4932402c458e222c52b5d7d5e370a5466c0
    (cherry picked from commit e870783b2c8f3b7b13459693b17425f5bf0fe53d)

    4) Disable voting on scenario001 - now timeouting to ssh the VM created
       by Tempest.

    Related-Bug: 1742936

    5) Update Ceph container CPU/memory limits in Ceph scenarios

    Ceph containers are started with `docker run --memory`
    and `docker run --cpus` to limit their memory and CPU
    resources. The defaults for OSD and MDS containers were
    recently increased [1] to values better for production
    but this change keeps them at lower values just for
    CI.

    [1] https://github.com/ceph/ceph-ansible/pull/2304

    Change-Id: I5b5cf5cc52907af092bea5e162d4b577ee05c23a
    Related-Bug: 1741499
    (cherry picked from commit d68619a26ec7cbd6176f4bb0d352d2aa91439f5c)

tags: added: in-stable-pike
Gabriele Cerami (gcerami) wrote :

This bug first appeared on the 10th of January. I am trying to get a diff of the changes since the builds on the 9th.
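
One way to produce such a diff is to compare the installed package lists of a passing and a failing build. The sketch below assumes each build has a text listing with one name-version-release entry per line (as printed by rpm -qa); the helper names parse_nvr and diff_packages are hypothetical, and nothing here is tied to the actual CI log layout.

```python
def parse_nvr(lines):
    """Map package name -> "version-release" for name-version-release lines."""
    packages = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The last two dash-separated fields are version and release
        parts = line.rsplit("-", 2)
        if len(parts) != 3:
            continue  # not in the assumed format, skip it
        name, version, release = parts
        packages[name] = version + "-" + release
    return packages


def diff_packages(old_lines, new_lines):
    """Return (changed, added, removed) between two package listings."""
    old, new = parse_nvr(old_lines), parse_nvr(new_lines)
    common = old.keys() & new.keys()
    changed = {n: (old[n], new[n]) for n in common if old[n] != new[n]}
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    return changed, added, removed
```

Feeding it the package lists from a build on the 9th and one on the 10th would give a short list of candidates to inspect first.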

tags: removed: in-stable-pike
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I posted https://review.openstack.org/535488 and https://review.openstack.org/535491 to collect more info on failure. We should hook them into tripleo gate and see if they reveal anything interesting. How do we achieve that?

John Fulton (jfulton-org) wrote :

apetrich mentioned a ceph issue [1] might be connected to this today in IRC. If that is the issue, the following from [2] might help. If it's not a ceph-mon issue, then disregard.

parameter_defaults:
  CephAnsibleExtraConfig:
    mon_use_fqdn: true

[1] http://paste.openstack.org/show/646902
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1507888#c5

Ihar Hrachyshka (ihar-hrachyshka) wrote :

Never mind my latest comment here; it was for a similar issue in the barbican plugin.

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart (master)

Reviewed: https://review.openstack.org/536395
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart/commit/?id=5c900a50a4e3087e8e95b8fcd3e9f1acaaa02172
Submitter: Zuul
Branch: master

commit 5c900a50a4e3087e8e95b8fcd3e9f1acaaa02172
Author: Arx Cruz <email address hidden>
Date: Mon Jan 22 16:42:47 2018 +0100

    Reducing tempest_workers to 1 for fs016

    Usually, the concurrency is set to number of cpus / 2, however we are
    seeing parallelism issues with this particular featureset when tests are
    running in parallel. So, let's test running the jobs without
    parallelism for now.

    Change-Id: I78c8b0b47595eea5d8fd417548e1f5bfeecd0889
    Related-Bug: #1742936
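
The worker count described in the commit message above can be sketched as follows. This is an illustration only, not the tripleo-quickstart implementation: the function names are hypothetical, and rounding down with a floor of one worker is an assumption about how "number of cpus / 2" is applied.

```python
import os


def default_tempest_workers(cpu_count=None):
    """Half the CPUs, rounded down, but never fewer than one worker (assumed)."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, cpus // 2)


def effective_workers(override=None, cpu_count=None):
    """An explicit override, such as 1 for fs016, wins over the CPU-based default."""
    if override is not None:
        return override
    return default_tempest_workers(cpu_count)
```

With an override of 1 the tests run serially regardless of how many CPUs the node has, which is what the change does for this featureset.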

Arx Cruz (arxcruz)
Changed in tripleo:
status: Triaged → Fix Released
tags: added: tech
removed: alert promotion-blocker
Changed in tripleo:
status: Fix Released → Triaged
importance: Critical → High
Changed in tripleo:
milestone: queens-3 → queens-rc1
Changed in tripleo:
milestone: queens-rc1 → rocky-1
Changed in tripleo:
milestone: rocky-1 → rocky-2
Changed in tripleo:
milestone: rocky-2 → rocky-3
Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2
Changed in tripleo:
milestone: stein-2 → stein-3
Changed in tripleo:
milestone: stein-3 → stein-rc1
Changed in tripleo:
milestone: stein-rc1 → train-1
Changed in tripleo:
milestone: train-1 → train-2
Changed in tripleo:
milestone: train-2 → train-3
Changed in tripleo:
milestone: train-3 → ussuri-1
Changed in tripleo:
milestone: ussuri-1 → ussuri-2
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-2 → ussuri-3
wes hayutin (weshayutin)
Changed in tripleo:
status: Triaged → Incomplete
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-3 → ussuri-rc3
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-rc3 → victoria-1
Changed in tripleo:
milestone: victoria-1 → victoria-3
Changed in tripleo:
status: Incomplete → Fix Committed