Unable to connect to AMQP server on undercloud.internalapi.localdomain:5672 after None tries: (0, 0): (403) ACCESS_REFUSED

Bug #1777939 reported by wes hayutin
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
Critical
wes hayutin

Bug Description

2018-06-19 19:50:01 | + openstack overcloud node import instackenv.json
2018-06-19 19:50:07 | MessageDeliveryFailure: Unable to connect to AMQP server on undercloud.internalapi.localdomain:5672 after None tries: (0, 0): (403) ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN. For details see the broker logfile.

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/9c3c747/undercloud/home/jenkins/overcloud_prep_images.log.txt.gz

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/8cc80f2/undercloud/home/jenkins/overcloud_prep_images.log.txt.gz

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/7a7f03a/undercloud/home/jenkins/overcloud_prep_images.log.txt.gz

Odd this is the only job out of a number of jobs that use introspection and only this job is consistently failing

Tags: idempotency
Revision history for this message
wes hayutin (weshayutin) wrote :
Download full text (4.8 KiB)

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/9c3c747/undercloud/var/log/containers/nova/nova-compute.log.txt.gz

2018-06-20 15:00:51.277 8 ERROR oslo.messaging._drivers.impl_rabbit [req-6ffa11a4-1b55-47a3-9842-921f41477733 - - - - -] Unable to connect to AMQP server on undercloud.internalapi.localdomain:5672 after None tries: (0, 0): (403) ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN. For details see the broker logfile.: AccessRefused: (0, 0): (403) ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN. For details see the broker logfile.
2018-06-20 15:00:51.277 8 CRITICAL nova [req-6ffa11a4-1b55-47a3-9842-921f41477733 - - - - -] Unhandled error: MessageDeliveryFailure: Unable to connect to AMQP server on undercloud.internalapi.localdomain:5672 after None tries: (0, 0): (403) ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN. For details see the broker logfile.
2018-06-20 15:00:51.277 8 ERROR nova Traceback (most recent call last):
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/bin/nova-compute", line 10, in <module>
2018-06-20 15:00:51.277 8 ERROR nova sys.exit(main())
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/nova/cmd/compute.py", line 57, in main
2018-06-20 15:00:51.277 8 ERROR nova topic=compute_rpcapi.RPC_TOPIC)
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/nova/service.py", line 259, in create
2018-06-20 15:00:51.277 8 ERROR nova periodic_interval_max=periodic_interval_max)
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/nova/service.py", line 138, in __init__
2018-06-20 15:00:51.277 8 ERROR nova conductor_api.wait_until_ready(context.get_admin_context())
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/nova/conductor/api.py", line 67, in wait_until_ready
2018-06-20 15:00:51.277 8 ERROR nova timeout=timeout)
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/nova/baserpc.py", line 58, in ping
2018-06-20 15:00:51.277 8 ERROR nova return cctxt.call(context, 'ping', arg=arg_p)
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2018-06-20 15:00:51.277 8 ERROR nova retry=self.retry)
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2018-06-20 15:00:51.277 8 ERROR nova retry=retry)
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2018-06-20 15:00:51.277 8 ERROR nova call_monitor_timeout, retry=retry)
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 526, in _send
2018-06-20 15:00:51.277 8 ERROR nova msg.update({'_reply_q': self._get_reply_q()})
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 507, in _get_reply_q
2018-06...

Read more...

Revision history for this message
wes hayutin (weshayutin) wrote :
Changed in tripleo:
assignee: nobody → wes hayutin (weshayutin)
Changed in tripleo:
milestone: none → rocky-3
Revision history for this message
Matt Young (halcyondude) wrote :

https://review.rdoproject.org/r/#/c/14391 also being used to diagnose afaict

Revision history for this message
Matt Young (halcyondude) wrote :
tags: added: alert
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Reproduced it, on my env the list of the affected containers is https://pastebin.com/1XFxKqss

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I can see on my env that the rabbit connection tester [0] fails to connect with the guest:<generated_password> from the config's transport_url (it is set from RabbitPassword/RpcPassword in tripleo-undercloud-passwords.yaml).

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

It seems that even the services initially able to connect the rabbitmq, somehow degraded and lost the connectivity, for example here is a sippet for ironic-conductor:

2018-06-22 14:43:33.867 6 INFO oslo.messaging._drivers.impl_rabbit [-] [d4784c4c-0857-4659-aed9-61a5fdca1cf2] Reconnected to AMQP server on undercloud.internalapi.localdomain:5672 via [amqp] client with port 54148.
...
2018-06-22 14:49:40.124 6 ERROR oslo.messaging._drivers.impl_rabbit [req-2aa9f436-87a2-4e16-aaf3-cd3529292b26 - - - - -] Unable to connect to AMQP server on undercloud.internalapi.localdomain:5672 after None tries: (0, 0): (403) ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN. For details see the broker logfile.: AccessRefused: (0, 0): (403) ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN. For details see the broker logfile.

Revision history for this message
Quique Llorente (quiquell) wrote :

Comparing good and bug rabbitmq logs

good: https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002<email address hidden>

bad: https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master<email address hidden>

Looks like there is a rabbitmq restart and feels like config is different after.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

So the bad state is related to the idempotency check enabled for fs002

tags: added: idempotency
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Rabbitmq didn't update the changes password. Although we have it in the rabbit and openstack services config and the generated passwords files all-set, the guest user seem not updated as needed. We need to inspect the containerized service template for rabbit in t-h-t. As that service template is also used for overcloud rabbitmq, I think we have the same idempotency issue for overcloud?

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The user passwords configuration step http://git.openstack.org/cgit/openstack/tripleo-heat-templates/tree/docker/services/rabbitmq.yaml#n209 binds mount/var/lib/rabbitmq in RO, I think this prevents the puppet changes to come in.

Changed in tripleo:
assignee: wes hayutin (weshayutin) → Bogdan Dobrelya (bogdando)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/577785

Revision history for this message
Alex Schultz (alex-schultz) wrote :

Password idempotency is covered by https://review.openstack.org/#/c/576990/

Revision history for this message
Quique Llorente (quiquell) wrote :

To maintain password in the second install

https://review.openstack.org/#/c/576990

Changed in tripleo:
assignee: Bogdan Dobrelya (bogdando) → Alex Schultz (alex-schultz)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/577809

Changed in tripleo:
assignee: Alex Schultz (alex-schultz) → Bogdan Dobrelya (bogdando)
Changed in tripleo:
assignee: Bogdan Dobrelya (bogdando) → wes hayutin (weshayutin)
Revision history for this message
wes hayutin (weshayutin) wrote :

Validated this patch fixes the issue https://review.openstack.org/#/c/576990/ thanks Alex!

Revision history for this message
wes hayutin (weshayutin) wrote :

BAH, spoke too soon.

Now I see the following error w/ patch https://review.openstack.org/#/c/576990/

2018-06-26 00:09:48 | + openstack overcloud node introspect --all-manageable
2018-06-26 00:09:51 | Waiting for messages on queue 'tripleo' with no timeout.
2018-06-26 01:34:11 | Waiting for introspection to finish...
2018-06-26 01:34:11 | Introspection of node 13988d4d-85e9-4ef6-b701-16760e3ebe9f timed out.
2018-06-26 01:34:11 | Introspection of node d99bc23e-fe59-46e2-b7c6-d0f9a1925691 timed out.
2018-06-26 01:34:11 | Retrying 2 nodes that failed introspection. Attempt 1 of 3
2018-06-26 01:34:11 | Introspection of node 13988d4d-85e9-4ef6-b701-16760e3ebe9f timed out.
2018-06-26 01:34:11 | Introspection of node d99bc23e-fe59-46e2-b7c6-d0f9a1925691 timed out.
2018-06-26 01:34:11 | Retrying 2 nodes that failed introspection. Attempt 2 of 3
2018-06-26 01:34:11 | Introspection of node 13988d4d-85e9-4ef6-b701-16760e3ebe9f timed out.
2018-06-26 01:34:11 | Introspection of node d99bc23e-fe59-46e2-b7c6-d0f9a1925691 timed out.
2018-06-26 01:34:11 | Retrying 2 nodes that failed introspection. Attempt 3 of 3
2018-06-26 01:34:11 | Introspection of node 13988d4d-85e9-4ef6-b701-16760e3ebe9f timed out.
2018-06-26 01:34:11 | Introspection of node d99bc23e-fe59-46e2-b7c6-d0f9a1925691 timed out.
2018-06-26 01:34:11 | Retry limit reached with 2 nodes still failing introspection
2018-06-26 01:34:11 | Unhandled workflow error
2018-06-26 01:34:11 | Unhandled workflow error
2018-06-26 01:34:11 | Unhandled workflow error

Then the job times out

Revision history for this message
wes hayutin (weshayutin) wrote :

hrm.. seems unrelated to the patch that we thought fixes the issues..

2018-06-25 23:24:15.986 7 ERROR oslo_service.service DriverLoadError: Driver, hardware type or interface pxe_ipmitool could not be loaded. Reason: [Errno 13] Permission denied: '/var/lib/ironic/httpboot/boot.ipxe'.

Going to retry

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart (master)

Reviewed: https://review.openstack.org/577809
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart/commit/?id=62947ece560c3d0d6ea2b1702022f5185094a929
Submitter: Zuul
Branch: master

commit 62947ece560c3d0d6ea2b1702022f5185094a929
Author: Wes Hayutin <email address hidden>
Date: Mon Jun 25 08:57:11 2018 -0400

    turn off undercloud idempotency check on fs002

    Turns off undercloud idempotency check on fs002
    Turns on undercloud idempotency check in fs20 or
    periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master

    Related-Bug: #1777939
    Related-Bug: #1777941
    Change-Id: Ib83aed5fa55a9deb6ac8a90817ad05d2f281dcb6

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

So was it fixed with https://review.openstack.org/577809? Let's close/remove alert please?

wes hayutin (weshayutin)
Changed in tripleo:
status: In Progress → Fix Released
tags: removed: alert promotion-blocker
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/577785
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=f14369c08515de9e969ea3b72be1da29772f6124
Submitter: Zuul
Branch: master

commit f14369c08515de9e969ea3b72be1da29772f6124
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Jun 25 13:52:40 2018 +0300

    Fix non-HA rabbitmq user password updates

    Fix the invocation of puppet's rabbitmq_user by allowing the
    produced changes written into /var/lib/rabbitmq state.

    Move the docker_puppet_tasks for non-HA rabbit into the step_2
    as puppet-tripleo expects it for the rabbitmq_user call.

    Related-bug: #1777939
    Related-bug: #1778505

    Depends-on: I55dc83acb2ed5ee07b4cf57e25135e6201589ac4
    Change-Id: I7c7b2bd6d202268fdf4bc08f951ec4c9c4065c08
    Signed-off-by: Bogdan Dobrelya <email address hidden>

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.