tripleo-ci-centos-7-scenario003-standalone fails tempest.scenario.test_minimum_basic.TestMinimumBasicScenario

Bug #1811004 reported by Ronelle Landy on 2019-01-08
This bug affects 2 people
Affects: tripleo
Importance: Critical
Assigned to: Marios Andreou

Bug Description

https://review.openstack.org/#/c/629246/ makes these (2/3) scenarios non-voting for now but they are still failing:

http://logs.openstack.org/77/594077/5/check/tripleo-ci-centos-7-scenario003-standalone/ed641af/logs/tempest.html.gz

http://logs.openstack.org/77/594077/5/check/tripleo-ci-centos-7-scenario003-standalone/ed641af/logs/undercloud/home/zuul/tempest.log.txt.gz

with trace:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
    return f(*func_args, **func_kwargs)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_basic_ops.py", line 409, in test_network_basic_ops
    self._setup_network_and_servers()
  File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_basic_ops.py", line 119, in _setup_network_and_servers
    server = self._create_server(self.network, port_id)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_basic_ops.py", line 171, in _create_server
    security_groups=security_groups)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 214, in create_server
    image_id=image_id, **kwargs)
  File "/usr/lib/python2.7/site-packages/tempest/common/compute.py", line 256, in create_test_server
    server['id'])
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/site-packages/tempest/common/compute.py", line 227, in create_test_server
    clients.servers_client, server['id'], wait_until)
  File "/usr/lib/python2.7/site-packages/tempest/common/waiters.py", line 96, in wait_for_server_status
    raise lib_exc.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: (TestNetworkBasicOps:test_network_basic_ops) Server 703d8b3d-dbdb-4e9a-875a-934767522684 failed to reach ACTIVE status and task state "None" within the required time (500 s). Current status: BUILD. Current task state: scheduling.

Ronelle Landy (rlandy) wrote :

marios, rfolco and team, pls comment

wes hayutin (weshayutin) on 2019-01-08
Changed in tripleo:
importance: Undecided → Critical
milestone: none → stein-2
status: New → Triaged
Marios Andreou (marios-b) wrote :

the trace from errors @ http://logs.openstack.org/98/604298/165/check/tripleo-ci-centos-7-scenario003-standalone/4dd67b6/logs/undercloud/var/log/extra/errors.txt.gz is like

  2019-01-09 09:23:33.834 ERROR /var/log/containers/mistral/mistral-db-manage.log: 12 ERROR mistral.actions.openstack.action_generator.base [-] Failed to create action: nova.certs_convert_into_with_meta: AttributeError: 'Client' object has no attribute 'certs'
  2019-01-09 09:23:33.834 ERROR /var/log/containers/mistral/mistral-db-manage.log: 12 ERROR mistral.actions.openstack.action_generator.base Traceback (most recent call last):
  2019-01-09 09:23:33.834 ERROR /var/log/containers/mistral/mistral-db-manage.log: 12 ERROR mistral.actions.openstack.action_generator.base File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/action_generator/base.py", line 143, in create_actions
  2019-01-09 09:23:33.834 ERROR /var/log/containers/mistral/mistral-db-manage.log: 12 ERROR mistral.actions.openstack.action_generator.base client_method = class_.get_fake_client_method()
  2019-01-09 09:23:33.834 ERROR /var/log/containers/mistral/mistral-db-manage.log: 12 ERROR mistral.actions.openstack.action_generator.base File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/base.py", line 75, in get_fake_client_method
  2019-01-09 09:23:33.834 ERROR /var/log/containers/mistral/mistral-db-manage.log: 12 ERROR mistral.actions.openstack.action_generator.base return cls._get_client_method(cls._get_fake_client())
  2019-01-09 09:23:33.834 ERROR /var/log/containers/mistral/mistral-db-manage.log: 12 ERROR mistral.actions.openstack.action_generator.base File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/base.py", line 59, in _get_client_method
  2019-01-09 09:23:33.834 ERROR /var/log/containers/mistral/mistral-db-manage.log: 12 ERROR mistral.actions.openstack.action_generator.base attribute = getattr(attribute, attr)
  2019-01-09 09:23:33.834 ERROR /var/log/containers/mistral/mistral-db-manage.log: 12 ERROR mistral.actions.openstack.action_generator.base AttributeError: 'Client' object has no attribute 'certs'

Maybe someone from nova or mistral might be able to help at this point (I haven't checked the nova logs yet).

description: updated
Marios Andreou (marios-b) wrote :

Spent some more time poking at logs. The main error I see, in almost all container services, is a socket error like "Socket error exception [Errno 11] Resource temporarily unavailable read_socket_input /usr/lib/python2.7/site-packages/pyngus/sockets.py:53", e.g. in [1][2][3].
I initially went looking at nova because the failed tempest run [4] shows "Server 5e36afa0-127c-4675-8fd3-01e808327656 failed to reach ACTIVE status and task state "None" within the required time", but the socket error is seen in all containers.

The same tempest test is passing for scenario004 at [5], so this doesn't seem to be a general standalone issue. Furthermore, we didn't change anything in scenario003 standalone when this issue began to appear (~2 days ago).

[1] http://logs.openstack.org/98/604298/165/check/tripleo-ci-centos-7-scenario003-standalone/4dd67b6/logs/undercloud/var/log/containers/nova/nova-scheduler.log.txt.gz
[2] http://logs.openstack.org/98/604298/165/check/tripleo-ci-centos-7-scenario003-standalone/4dd67b6/logs/undercloud/var/log/containers/mistral/executor.log.txt.gz
[3] http://logs.openstack.org/98/604298/165/check/tripleo-ci-centos-7-scenario003-standalone/4dd67b6/logs/undercloud/var/log/containers/neutron/server.log.txt.gz
[4] http://logs.openstack.org/98/604298/165/check/tripleo-ci-centos-7-scenario003-standalone/4dd67b6/logs/undercloud/home/zuul/tempest.log.txt.gz#_2019-01-09_09_39_50
[5] http://logs.openstack.org/98/604298/167/check/tripleo-ci-centos-7-scenario004-standalone/0af5ca9/logs/undercloud/home/zuul/tempest.log.txt.gz#_2019-01-10_01_26_33

Marios Andreou (marios-b) wrote :

I went looking at the mysql logs too [1], but there is nothing of interest there. There are many "[Warning] Aborted connection 363 to db: 'nova_api' user: 'nova_api' host: '192.168.24.1' (Got an error reading communication packets)" entries, but apparently that is 'normal' because we see the same in a good job [2].

[1] http://logs.openstack.org/98/604298/165/check/tripleo-ci-centos-7-scenario003-standalone/4dd67b6/logs/undercloud/var/log/containers/mysql/mysqld.log.txt.gz
[2] http://logs.openstack.org/98/604298/167/check/tripleo-ci-centos-7-scenario002-standalone/251c26c/logs/undercloud/var/log/containers/mysql/mysqld.log.txt.gz

Marios Andreou (marios-b) wrote :

Finally, some more errors from the undercloud journal, just looking for anything of interest. A fuller trace is attached; all from http://logs.rdoproject.org/56/18156/2/check/periodic-tripleo-ci-centos-7-scenario003-standalone/e1659d8/logs/undercloud/var/log/journal.txt.gz

---
"UnicodeDecodeError: 'ascii' codec can't decode byte 0xa9 in position 33: ordinal not in range(128)":

---
"libcontainerd: containerd health check returned error: rpc error: code = 14 desc = grpc: the connection is unavailable"
   Jan 09 21:02:05 upstream-centos-7-rdo-cloud-0000387684 dockerd-current[15126]: time="2019-01-09T21:02:05.338878911Z" level=debug msg="libcontainerd: containerd health check returned error: rpc error: code = 14 desc = grpc: the connection is unavailable"

----
"error msg="Attempting next endpoint for push after error: Get https://192.168.24.1:8787/v2/: http: server gave HTTP response to HTTPS client"

---
"ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Manager\" table to have identical values (\"ptcp:6640:127.0.0.1\") for index on column \"target\". First row, with"

---
"error: No such container: designate_central" && others sahara_engine etc etc

Changed in tripleo:
milestone: stein-2 → stein-3
Marios Andreou (marios-b) wrote :

Had one last go today before I ping other folks to have a look. The most significant thing I found today was a rabbitmq error like

 =ERROR REPORT==== 14-Jan-2019::09:08:48 ===
 Error on AMQP connection <0.1233.0> (192.168.24.1:52856 -> 192.168.24.1:5672, state: starting):
 AMQPLAIN login refused: user 'guest' - invalid credentials

buried in the rabbitmq-bundle-docker-0 logs [1]. I looked there after I noticed that the rabbit container was using a high 82% CPU [2].

Also attaching some of the false-positive errors, i.e. errors that appear in the scenario003 logs but are also present in scenario004, since scenario004 runs the same tempest test and passes.

[1] http://logs.openstack.org/98/604298/176/check/tripleo-ci-centos-7-scenario003-standalone/b2d7fd7/logs/undercloud/var/log/extra<email address hidden>
[2] http://logs.openstack.org/98/604298/176/check/tripleo-ci-centos-7-scenario003-standalone/b2d7fd7/logs/undercloud/var/log/extra/docker/docker_allinfo.log.txt.gz (rabbit is 82dec4629773)

wes hayutin (weshayutin) on 2019-01-14
tags: added: promotion-blocker
Michele Baldessari (michele) wrote :

So the error I am looking at from comment 9 is the following:
 =ERROR REPORT==== 14-Jan-2019::09:08:48 ===
 Error on AMQP connection <0.1233.0> (192.168.24.1:52856 -> 192.168.24.1:5672, state: starting):
 AMQPLAIN login refused: user 'guest' - invalid credentials

From http://logs.openstack.org/98/604298/176/check/tripleo-ci-centos-7-scenario003-standalone/b2d7fd7/logs/undercloud/var/log/config-data/nova/etc/nova/nova.conf.txt.gz we see:
transport_url=rabbit://guest:<email address hidden>:5672/?ssl=0

The password for the guest user used by nova is 'kSBh78C58WIEt0esxvEfkcRpM'. On the rabbit side of things we have the following in http://logs.openstack.org/98/604298/176/check/tripleo-ci-centos-7-scenario003-standalone/b2d7fd7/logs/undercloud/var/log/config-data/rabbitmq/etc/rabbitmq/rabbitmq.config.gz:
    {default_user, <<"guest">>},
    {default_pass, <<"kSBh78C58WIEt0esxvEfkcRpM">>}

So the password on both sides *seems* correct.

Ah so I think the issue is this in nova.conf:
[oslo_messaging_notifications]
driver=noop
transport_url=rabbit://guest:<email address hidden>:5672/?ssl=0

[DEFAULT]
...
transport_url=amqp://guest:<email address hidden>:31459/?ssl=0

I don't think we should be using 'amqp' here (it's late, and this needs double-checking). We should change the line at https://github.com/openstack/tripleo-heat-templates/blob/master/ci/environments/scenario003-standalone.yaml#L15 to the rabbit counterpart.
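The suspected mismatch above (an 'amqp' scheme under [DEFAULT] versus 'rabbit' under [oslo_messaging_notifications]) can be spotted mechanically. A minimal sketch, not part of tempest or tripleo tooling; the URLs below are redacted stand-ins for the nova.conf values quoted above, with 'secret' in place of the hidden password:

```python
from urllib.parse import urlparse

def transport_scheme(transport_url):
    """Return the scheme ('rabbit', 'amqp', ...) of an oslo.messaging-style transport URL."""
    return urlparse(transport_url).scheme

# Stand-ins modeled on the nova.conf excerpts above (passwords redacted).
default_url = "amqp://guest:secret@192.168.24.1:31459/?ssl=0"          # [DEFAULT]
notifications_url = "rabbit://guest:secret@192.168.24.1:5672/?ssl=0"   # [oslo_messaging_notifications]

if transport_scheme(default_url) != transport_scheme(notifications_url):
    print("scheme mismatch: [DEFAULT] uses %r, [oslo_messaging_notifications] uses %r"
          % (transport_scheme(default_url), transport_scheme(notifications_url)))
```

With the values above this flags the mismatch; after the proposed fix both sections would use the 'rabbit' scheme and the check would stay silent.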

Fix proposed to branch: master
Review: https://review.openstack.org/630751

Changed in tripleo:
assignee: nobody → Michele Baldessari (michele)
status: Triaged → In Progress
wes hayutin (weshayutin) wrote :

Thank you Michele!!!

wes hayutin (weshayutin) wrote :

FYI, seeing a similar issue in the multinode version of scenario003:
https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset018-master/d9a876d/logs/undercloud/home/zuul/tempest.log.txt.gz#_2019-01-14_12_41_10

however, the nova and rabbit transport_url have the same password as far as I can tell.

https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset018-master/d9a876d/logs/undercloud/var/log/config-data/nova/etc/nova/nova.conf.txt.gz

% Template Path: rabbitmq/templates/rabbitmq.config
[
  {rabbit, [
    {loopback_users, [<<"guest">>]},
    {tcp_listen_options, [
         {keepalive, true},
         {backlog, 128},
         {nodelay, true},
         {linger, {true, 0}},
         {exit_on_close, false}
    ]},
    {tcp_listeners, [{"192.168.24.1", 5672}]},
    {cluster_partition_handling, ignore},
    {loopback_users, []},
    {queue_master_locator, <<"min-masters">>},
    {default_user, <<"guest">>},
    {default_pass, <<"YVaA9Ic6QsNqBfcoIBbO0Xnea">>}

# For full details on the fields in the URL see the documentation of
# oslo_messaging.TransportURL at
# https://docs.openstack.org/oslo.messaging/latest/reference/transport.html
# (string value)
#transport_url=rabbit://
transport_url=rabbit://guest:<email address hidden>:5672/?ssl=0

The template for the multinode job could also be changed in the same way, though. Thoughts?

https://github.com/openstack/tripleo-heat-templates/blob/master/ci/environments/scenario003-multinode-containers.yaml#L10

Michele Baldessari (michele) wrote :

So after the third iteration of https://review.openstack.org/630751 I no longer see any rabbit errors. Yet tempest still seems to be having issues on the neutron side: http://logs.openstack.org/51/630751/3/check/tripleo-ci-centos-7-scenario003-standalone/56cf09b/logs/undercloud/home/zuul/tempest.log.txt.gz#_2019-01-15_11_53_31

Marios Andreou (marios-b) wrote :

Thanks very much; I had missed the dhcp-agent error thus far, and indeed I see it in yesterday's logs too: http://logs.openstack.org/98/604298/176/check/tripleo-ci-centos-7-scenario003-standalone/b2d7fd7/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-01-14_09_08_26_528

Digging a bit more, I am wondering if this is related: https://github.com/openstack/neutron/commit/051b6b40f3921b9db4f152a54f402c402cbf138c. So I just posted https://review.openstack.org/631023 on top of bandini's patch, which removes the neutron plugins (including port security)... plus that's the only other neutron config we set in the environment file. Let's see ;)

yatin (yatinkarel) wrote :

I think qdrouterd is only tested in the scenario003 standalone and multinode jobs; I don't know the history, but there must be a reason to test it. Also, since this started happening recently, I'm not sure switching to rabbit is a good idea, as I think that avoids the issue rather than fixing it.

It looks like the issue is caused by the hardcoding of 'rabbit' in https://review.openstack.org/#/c/626177/3/docker/services/nova-api.yaml. I am trying to check this locally.

Marios Andreou (marios-b) wrote :

I abandoned https://review.openstack.org/631023 (it was late in the day and I wanted to post something), but adding a note since the premise behind my comment #16 was wrong. It could not have been an 'external' non-tripleo thing like neutron, because we are looking for something from the 8th, and we haven't had a promotion since the 1st. What ykarel points at in comment #18 is much more promising; hoping it's that! Thanks!

Fix proposed to branch: master
Review: https://review.openstack.org/631227

Changed in tripleo:
assignee: Michele Baldessari (michele) → Marios Andreou (marios-b)
Changed in tripleo:
assignee: Marios Andreou (marios-b) → yatin (yatinkarel)
Changed in tripleo:
assignee: yatin (yatinkarel) → Marios Andreou (marios-b)

Reviewed: https://review.openstack.org/631227
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=632a184a94fd560e36cf4da4363c1885fe433dc4
Submitter: Zuul
Branch: master

commit 632a184a94fd560e36cf4da4363c1885fe433dc4
Author: Marios Andreou <email address hidden>
Date: Wed Jan 16 16:12:13 2019 +0200

    Fetch scheme/port from hiera instead of hard coding it

    Looks like nit/forgotten in https://review.openstack.org/626177
    Depends-On is so we can see the job run here missing nova in layout

    Co-Authored-By: Yatin Karel <email address hidden>
    Depends-On: https://review.openstack.org/631228
    Change-Id: I4dbebcd3f3f530f21d3afc822084278136e58b4c
    Closes-Bug: #1811004

Changed in tripleo:
status: In Progress → Fix Released

Change abandoned by Michele Baldessari (<email address hidden>) on branch: master
Review: https://review.openstack.org/630751

This issue was fixed in the openstack/tripleo-heat-templates 10.4.0 release.

Reviewed: https://review.openstack.org/644827
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=f441b25be2adb56906579180ee9e72a51cc9e191
Submitter: Zuul
Branch: master

commit f441b25be2adb56906579180ee9e72a51cc9e191
Author: Martin Schuppert <email address hidden>
Date: Wed Mar 20 11:55:52 2019 +0100

    Change scheme/port to template instead of getting from hiera

    In https://review.openstack.org/631227 we had to fetch scheme/port
    from hiera. Since https://bugs.launchpad.net/nova/+bug/1812196 in
    nova is now fixed we could revert back to template those.

    Change-Id: Ifebcd154b46dd78139c05d793d5593d87300c11b
    Related-Bug: #1811004

