TestBootVolumePattern is failing in multinode jobs

Bug #1742936 reported by Arx Cruz on 2018-01-12

This bug affects 2 people

Affects   Status          Importance   Assigned to   Milestone
tripleo   Fix Committed   High         Arx Cruz      victoria-3

Bug Description

This seems to be similar to, or the same problem as, https://bugs.launchpad.net/tripleo/+bug/1731063
However, the ovsdbapp package in use is the version that should already contain the fix.

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
    return f(*func_args, **func_kwargs)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/test_volume_boot_pattern.py", line 112, in test_volume_boot_pattern
    private_key=keypair['private_key'])
  File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 609, in create_timestamp
    private_key=private_key)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 356, in get_remote_client
    linux_client.validate_authentication()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 57, in wrapper
    six.reraise(*original_exception)
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 30, in wrapper
    return function(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 113, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 207, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 121, in _get_ssh_connection
    password=self.password)
tempest.lib.exceptions.SSHTimeout: Connection to the 192.168.24.108 via SSH timed out.
User: cirros, Password: None

LOGS:
https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset016-master/b09f8f0/tempest.html.gz

Also note that this is failing only in multinode jobs; the OVB jobs using the same hash are not affected:

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/33d447c/tempest.html.gz

Arx Cruz (arxcruz) on 2018-01-12

Changed in tripleo:
assignee:	nobody → Arx Cruz (arxcruz)

wes hayutin (weshayutin) on 2018-01-12

tags:

added: alert

Brian Haley (brian-haley) wrote on 2018-01-12:

I looked at this with Ihar for a while today and didn't find any smoking gun in the log files.

There are a few things we'd need to do to debug this further from the neutron side, assuming it's a networking issue:

1. Stop the system under test when this fails so we can login and look around.

2. Make sure we have the console log of the instance(s). If this isn't doable via some setting, we need to look at something like https://stackoverflow.com/questions/12290336/how-to-execute-code-only-on-test-failures-with-python-unittest2 to dump the console on failure.

3. Log the requests served by haproxy and the metadata agent. The haproxy conf file shows it should log to /dev/log, which is exposed in the container, but I couldn't find any evidence of logging anywhere. The metadata agent does not override the debug level like other agents do, so we should set "debug = True" in metadata_agent.ini.

In the meantime it might be worth skipping the test until we can gather additional information if it is critical.
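
A minimal sketch of item 2, assuming a plain Python unittest base class rather than tempest's own test classes. The names DumpOnFailureTestCase, get_console_output and demo are hypothetical and are not part of tempest; a real scenario test would fetch the instance console log from the compute API where the placeholder string is returned.

```python
import unittest


class DumpOnFailureTestCase(unittest.TestCase):
    """Hypothetical base class that calls a hook only when a test fails."""

    # Collected diagnostics, keyed by test id (illustrative storage only).
    dumps = {}

    def get_console_output(self):
        # Assumption: a real scenario test would ask the compute API for the
        # instance console log here; this sketch returns a fixed string.
        return "console log placeholder"

    def run(self, result=None):
        if result is None:
            result = self.defaultTestResult()
        problems_before = len(result.failures) + len(result.errors)
        super().run(result)
        problems_after = len(result.failures) + len(result.errors)
        if problems_after > problems_before:
            # Only reached when this test added a failure or an error.
            DumpOnFailureTestCase.dumps[self.id()] = self.get_console_output()
        return result


def demo():
    """Run one passing and one failing test and return the TestResult."""

    class Example(DumpOnFailureTestCase):
        def test_passes(self):
            self.assertTrue(True)

        def test_fails(self):
            self.fail("simulated SSH timeout")

    result = unittest.TestResult()
    unittest.defaultTestLoader.loadTestsFromTestCase(Example).run(result)
    return result
```

Calling demo() leaves one entry in DumpOnFailureTestCase.dumps, for the failing test only. The hook compares the failure and error counts before and after the test, so it does not depend on private unittest attributes.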

wes hayutin (weshayutin) wrote on 2018-01-12:

Re: #1, upstream policy will not allow CI jobs to be stopped for investigation.
However, you can use the recreate script to produce the exact same environment in RDO-Cloud:

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset016-master/b09f8f0/reproducer-quickstart.sh

Note: There are some open bugs on RDO-Cloud atm that may impact your ability to provision an environment successfully [1][2]

[1] https://bugs.launchpad.net/nova/+bug/1742827
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1533196

wes hayutin (weshayutin) wrote on 2018-01-12:

Tried https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset016-master/b09f8f0/reproducer-quickstart.sh

No issues w/ RDO-Cloud.
Environment is coming up

Emilien Macchi (emilienm) wrote on 2018-01-16:

Note that we hit it in the pike promoted repo as well: http://logs.openstack.org/80/533380/10/check/tripleo-ci-centos-7-scenario001-multinode-oooq-container/4c7b30c/

OpenStack Infra (hudson-openstack) wrote on 2018-01-16: Related fix merged to tripleo-heat-templates (stable/pike)

Reviewed: https://review.openstack.org/533380
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=fa02a8f8633b682ba2042ac7fb26d76453cfe65a
Submitter: Zuul
Branch: stable/pike

commit fa02a8f8633b682ba2042ac7fb26d76453cfe65a
Author: John Fulton <email address hidden>
Date: Fri Jan 5 15:26:22 2018 -0500

    Align stars to fix stable/pike gate on scenario001

    1) Fix path for iscsi config file

    We changed the bind mount to be /etc/iscsi in
    I838427ccae06cfe1be72939c4bcc2978f7dc36a8, we need to copy the files to
    /etc/iscsi so that they do not end up at '/' in the container.

    Change-Id: Id5c1f16d08ffd36a35a6669d64460a7b2240d401
    Closes-Bug: #1741850
    (cherry picked from commit 8eb351d588539c20caf768c2633832a924f40690)

    2) Fix puppet config volume for iscsid in containers

    Bind mount the /etc/iscsi host path for iscsi container puppet config.
    Use the real host path /etc/iscsi for containers depending on it.

    Closes-bug: #1735425

    Change-Id: I838427ccae06cfe1be72939c4bcc2978f7dc36a8
    Co-authored-by: Alan Bishop <email address hidden>
    Co-authored-by: Martin André <email address hidden>
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 82f128f15b1b1eb7bf6ac7df0c6d01e5619309eb)

    3) Allow to override manage polling param

    Without this, we cannot override the polling yaml metrics
    from puppet template.

    Change-Id: I509dd4932402c458e222c52b5d7d5e370a5466c0
    (cherry picked from commit e870783b2c8f3b7b13459693b17425f5bf0fe53d)

    4) Disable voting on scenario001 - now timeouting to ssh the VM created
       by Tempest.

    Related-Bug: 1742936

    5) Update Ceph container CPU/memory limits in Ceph scenarios

    Ceph containers are started with `docker run --memory`
    and `docker run --cpus` to limit their memory and CPU
    resources. The defaults for OSD and MDS containers were
    recently increased [1] to values better for production
    but this change keeps them at lower values just for
    CI.

    [1] https://github.com/ceph/ceph-ansible/pull/2304

    Change-Id: I5b5cf5cc52907af092bea5e162d4b577ee05c23a
    Related-Bug: 1741499
    (cherry picked from commit d68619a26ec7cbd6176f4bb0d352d2aa91439f5c)

tags:

added: in-stable-pike

Gabriele Cerami (gcerami) wrote on 2018-01-17:

This bug first appeared on the 10th of January. I'm trying to get a diff of the changes since the builds on the 9th.

Emilien Macchi (emilienm) on 2018-01-18

tags:

removed: in-stable-pike

Ihar Hrachyshka (ihar-hrachyshka) wrote on 2018-01-18:

I posted https://review.openstack.org/535488 and https://review.openstack.org/535491 to collect more info on failure. We should hook them into tripleo gate and see if they reveal anything interesting. How do we achieve that?

John Fulton (jfulton-org) wrote on 2018-01-18:

apetrich mentioned in IRC today that a ceph issue [1] might be connected to this. If that is the issue, the following from [2] might help. If it's not a ceph-mon issue, then disregard.

parameter_defaults:
  CephAnsibleExtraConfig:
    mon_use_fqdn: true

[1] http://paste.openstack.org/show/646902
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1507888#c5

Ihar Hrachyshka (ihar-hrachyshka) wrote on 2018-01-19:

Never mind my latest comment here; it was for another similar issue in the barbican plugin.

OpenStack Infra (hudson-openstack) wrote on 2018-01-22: Related fix merged to tripleo-quickstart (master)

Reviewed: https://review.openstack.org/536395
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart/commit/?id=5c900a50a4e3087e8e95b8fcd3e9f1acaaa02172
Submitter: Zuul
Branch: master

commit 5c900a50a4e3087e8e95b8fcd3e9f1acaaa02172
Author: Arx Cruz <email address hidden>
Date: Mon Jan 22 16:42:47 2018 +0100

    Reducing tempest_workers to 1 for fs016

    Usually, the concurrency is set to number of cpus / 2, however we are
    seeing parallelism issues with this particular featureset when tests are
    running in parallel. So, let's test running the jobs without
    parallelism for now.

    Change-Id: I78c8b0b47595eea5d8fd417548e1f5bfeecd0889
    Related-Bug: #1742936
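
The worker-count rule described in the commit message above can be sketched as follows. This is an illustration only, not the tripleo-quickstart implementation: the function names and the serial_featuresets parameter are hypothetical, and integer division with a floor of one worker is an assumption about how "cpus / 2" is rounded.

```python
import os


def default_tempest_workers(cpu_count=None):
    """Return half the available CPUs, but never fewer than one worker."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, cpus // 2)


def workers_for_featureset(featureset, cpu_count=None, serial_featuresets=("016",)):
    """Force a single worker for featuresets known to break under parallelism."""
    if featureset in serial_featuresets:
        # Mirrors the intent of the change: run fs016 without parallelism.
        return 1
    return default_tempest_workers(cpu_count)
```

With eight CPUs, workers_for_featureset("020", 8) gives the usual half-of-CPUs value, while workers_for_featureset("016", 8) is pinned to a single worker.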

Arx Cruz (arxcruz) on 2018-01-23

Changed in tripleo:
status:	Triaged → Fix Released

Emilien Macchi (emilienm) on 2018-01-23

tags:	added: tech removed: alert promotion-blocker
Changed in tripleo:
status:	Fix Released → Triaged
importance:	Critical → High

Emilien Macchi (emilienm) on 2018-01-26

Changed in tripleo:
milestone:	queens-3 → queens-rc1

Alex Schultz (alex-schultz) on 2018-03-02

Changed in tripleo:
milestone:	queens-rc1 → rocky-1

Alex Schultz (alex-schultz) on 2018-04-20

Changed in tripleo:
milestone:	rocky-1 → rocky-2

Emilien Macchi (emilienm) on 2018-06-05

Changed in tripleo:
milestone:	rocky-2 → rocky-3

Emilien Macchi (emilienm) on 2018-07-26

Changed in tripleo:
milestone:	rocky-3 → rocky-rc1

Alex Schultz (alex-schultz) on 2018-08-14

Changed in tripleo:
milestone:	rocky-rc1 → stein-1

Juan Antonio Osorio Robles (juan-osorio-robles) on 2018-10-30

Changed in tripleo:
milestone:	stein-1 → stein-2

Emilien Macchi (emilienm) on 2019-01-13

Changed in tripleo:
milestone:	stein-2 → stein-3

Alex Schultz (alex-schultz) on 2019-03-14

Changed in tripleo:
milestone:	stein-3 → stein-rc1

Alex Schultz (alex-schultz) on 2019-04-15

Changed in tripleo:
milestone:	stein-rc1 → train-1

Alex Schultz (alex-schultz) on 2019-06-07

Changed in tripleo:
milestone:	train-1 → train-2

Alex Schultz (alex-schultz) on 2019-07-29

Changed in tripleo:
milestone:	train-2 → train-3

Alex Schultz (alex-schultz) on 2019-09-11

Changed in tripleo:
milestone:	train-3 → ussuri-1

Emilien Macchi (emilienm) on 2019-12-19

Changed in tripleo:
milestone:	ussuri-1 → ussuri-2

wes hayutin (weshayutin) on 2020-02-10

Changed in tripleo:
milestone:	ussuri-2 → ussuri-3

wes hayutin (weshayutin) on 2020-04-07

Changed in tripleo:
status:	Triaged → Incomplete

wes hayutin (weshayutin) on 2020-04-13

Changed in tripleo:
milestone:	ussuri-3 → ussuri-rc3

wes hayutin (weshayutin) on 2020-05-26

Changed in tripleo:
milestone:	ussuri-rc3 → victoria-1

Emilien Macchi (emilienm) on 2020-07-28

Changed in tripleo:
milestone:	victoria-1 → victoria-3

Kevin Carter (kevin-carter) on 2020-09-10

Changed in tripleo:
status:	Incomplete → Fix Committed

Duplicates of this bug

Bug #1743753

Remote bug watches

redhat-bugs #1507888 [CLOSED ERRATA]
redhat-bugs #1533196 [CLOSED UPSTREAM]
