Unable to connect to AMQP server on undercloud.internalapi.localdomain:5672 after None tries: (0, 0): (403) ACCESS_REFUSED

Bug #1777939 reported by wes hayutin on 2018-06-20
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: wes hayutin (weshayutin)
Milestone: rocky-3

Bug Description

2018-06-19 19:50:01 | + openstack overcloud node import instackenv.json
2018-06-19 19:50:07 | MessageDeliveryFailure: Unable to connect to AMQP server on undercloud.internalapi.localdomain:5672 after None tries: (0, 0): (403) ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN. For details see the broker logfile.

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/9c3c747/undercloud/home/jenkins/overcloud_prep_images.log.txt.gz

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/8cc80f2/undercloud/home/jenkins/overcloud_prep_images.log.txt.gz

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/7a7f03a/undercloud/home/jenkins/overcloud_prep_images.log.txt.gz

Odd: out of a number of jobs that use introspection, only this one is consistently failing.

wes hayutin (weshayutin) wrote :

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/9c3c747/undercloud/var/log/containers/nova/nova-compute.log.txt.gz

2018-06-20 15:00:51.277 8 ERROR oslo.messaging._drivers.impl_rabbit [req-6ffa11a4-1b55-47a3-9842-921f41477733 - - - - -] Unable to connect to AMQP server on undercloud.internalapi.localdomain:5672 after None tries: (0, 0): (403) ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN. For details see the broker logfile.: AccessRefused: (0, 0): (403) ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN. For details see the broker logfile.
2018-06-20 15:00:51.277 8 CRITICAL nova [req-6ffa11a4-1b55-47a3-9842-921f41477733 - - - - -] Unhandled error: MessageDeliveryFailure: Unable to connect to AMQP server on undercloud.internalapi.localdomain:5672 after None tries: (0, 0): (403) ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN. For details see the broker logfile.
2018-06-20 15:00:51.277 8 ERROR nova Traceback (most recent call last):
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/bin/nova-compute", line 10, in <module>
2018-06-20 15:00:51.277 8 ERROR nova sys.exit(main())
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/nova/cmd/compute.py", line 57, in main
2018-06-20 15:00:51.277 8 ERROR nova topic=compute_rpcapi.RPC_TOPIC)
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/nova/service.py", line 259, in create
2018-06-20 15:00:51.277 8 ERROR nova periodic_interval_max=periodic_interval_max)
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/nova/service.py", line 138, in __init__
2018-06-20 15:00:51.277 8 ERROR nova conductor_api.wait_until_ready(context.get_admin_context())
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/nova/conductor/api.py", line 67, in wait_until_ready
2018-06-20 15:00:51.277 8 ERROR nova timeout=timeout)
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/nova/baserpc.py", line 58, in ping
2018-06-20 15:00:51.277 8 ERROR nova return cctxt.call(context, 'ping', arg=arg_p)
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2018-06-20 15:00:51.277 8 ERROR nova retry=self.retry)
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2018-06-20 15:00:51.277 8 ERROR nova retry=retry)
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2018-06-20 15:00:51.277 8 ERROR nova call_monitor_timeout, retry=retry)
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 526, in _send
2018-06-20 15:00:51.277 8 ERROR nova msg.update({'_reply_q': self._get_reply_q()})
2018-06-20 15:00:51.277 8 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 507, in _get_reply_q
2018-06...


wes hayutin (weshayutin) wrote :
Changed in tripleo:
assignee: nobody → wes hayutin (weshayutin)
Changed in tripleo:
milestone: none → rocky-3
Matt Young (halcyondude) wrote :

https://review.rdoproject.org/r/#/c/14391 is also being used to diagnose this, afaict

tags: added: alert
Bogdan Dobrelya (bogdando) wrote :

Reproduced it; on my env the list of affected containers is https://pastebin.com/1XFxKqss

Bogdan Dobrelya (bogdando) wrote :

I can see on my env that the rabbit connection tester [0] fails to connect as guest:<generated_password> taken from the config's transport_url (it is set from RabbitPassword/RpcPassword in tripleo-undercloud-passwords.yaml).
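
For illustration, a minimal sketch of such a connectivity check (not the exact tester referenced in [0]), assuming kombu is available and GENERATED_PASSWORD is a placeholder for the value taken from the transport_url:

    from kombu import Connection

    # Placeholder credentials: substitute the generated password from the
    # undercloud's transport_url (set from RabbitPassword/RpcPassword).
    url = "amqp://guest:GENERATED_PASSWORD@undercloud.internalapi.localdomain:5672//"

    try:
        # connect() performs the AMQP handshake; a wrong password shows up
        # as an access-refused error from the underlying amqp library.
        with Connection(url, connect_timeout=5) as conn:
            conn.connect()
        print("AMQP login succeeded")
    except Exception as exc:
        print("AMQP login failed: %s" % exc)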

Bogdan Dobrelya (bogdando) wrote :

It seems that even services that were initially able to connect to rabbitmq somehow degraded and lost connectivity; for example, here is a snippet from ironic-conductor:

2018-06-22 14:43:33.867 6 INFO oslo.messaging._drivers.impl_rabbit [-] [d4784c4c-0857-4659-aed9-61a5fdca1cf2] Reconnected to AMQP server on undercloud.internalapi.localdomain:5672 via [amqp] client with port 54148.
...
2018-06-22 14:49:40.124 6 ERROR oslo.messaging._drivers.impl_rabbit [req-2aa9f436-87a2-4e16-aaf3-cd3529292b26 - - - - -] Unable to connect to AMQP server on undercloud.internalapi.localdomain:5672 after None tries: (0, 0): (403) ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN. For details see the broker logfile.: AccessRefused: (0, 0): (403) ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN. For details see the broker logfile.

Quique Llorente (quiquell) wrote :

Comparing good and bad rabbitmq logs:

good: https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002<email address hidden>

bad: https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master<email address hidden>

Looks like there is a rabbitmq restart, and the config appears to be different afterwards.

Bogdan Dobrelya (bogdando) wrote :

So the bad state is related to the idempotency check enabled for fs002

tags: added: idempotency
Bogdan Dobrelya (bogdando) wrote :

RabbitMQ didn't pick up the changed password. Although the new password is all set in the rabbit and OpenStack services configs and in the generated passwords files, the guest user does not seem to have been updated as needed. We need to inspect the containerized service template for rabbit in t-h-t. Since that service template is also used for the overcloud rabbitmq, I think we have the same idempotency issue for the overcloud.
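
One way to confirm what the broker itself has stored, independent of the config files and generated passwords, is to ask rabbitmq to authenticate the user directly. A minimal sketch, assuming the broker runs in a container named "rabbitmq" (the name is an assumption) and using rabbitmqctl's authenticate_user subcommand:

    import subprocess

    # GENERATED_PASSWORD is a placeholder for the RabbitPassword value from
    # tripleo-undercloud-passwords.yaml.
    cmd = ["docker", "exec", "rabbitmq",
           "rabbitmqctl", "authenticate_user", "guest", "GENERATED_PASSWORD"]

    # Exit code 0 means the broker accepts the password; non-zero means the
    # credential stored in the broker differs from the generated one.
    rc = subprocess.call(cmd)
    print("broker accepts the password" if rc == 0 else "broker rejects the password")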

Bogdan Dobrelya (bogdando) wrote :

The user password configuration step http://git.openstack.org/cgit/openstack/tripleo-heat-templates/tree/docker/services/rabbitmq.yaml#n209 bind-mounts /var/lib/rabbitmq read-only; I think this prevents the puppet changes from taking effect.

Changed in tripleo:
assignee: wes hayutin (weshayutin) → Bogdan Dobrelya (bogdando)
status: Triaged → In Progress
Alex Schultz (alex-schultz) wrote :

Password idempotency is covered by https://review.openstack.org/#/c/576990/

Quique Llorente (quiquell) wrote :

To maintain the password across the second install:

https://review.openstack.org/#/c/576990

Changed in tripleo:
assignee: Bogdan Dobrelya (bogdando) → Alex Schultz (alex-schultz)
Changed in tripleo:
assignee: Alex Schultz (alex-schultz) → Bogdan Dobrelya (bogdando)
Changed in tripleo:
assignee: Bogdan Dobrelya (bogdando) → wes hayutin (weshayutin)
wes hayutin (weshayutin) wrote :

Validated that this patch fixes the issue: https://review.openstack.org/#/c/576990/ Thanks Alex!

wes hayutin (weshayutin) wrote :

BAH, spoke too soon.

Now I see the following error w/ patch https://review.openstack.org/#/c/576990/

2018-06-26 00:09:48 | + openstack overcloud node introspect --all-manageable
2018-06-26 00:09:51 | Waiting for messages on queue 'tripleo' with no timeout.
2018-06-26 01:34:11 | Waiting for introspection to finish...
2018-06-26 01:34:11 | Introspection of node 13988d4d-85e9-4ef6-b701-16760e3ebe9f timed out.
2018-06-26 01:34:11 | Introspection of node d99bc23e-fe59-46e2-b7c6-d0f9a1925691 timed out.
2018-06-26 01:34:11 | Retrying 2 nodes that failed introspection. Attempt 1 of 3
2018-06-26 01:34:11 | Introspection of node 13988d4d-85e9-4ef6-b701-16760e3ebe9f timed out.
2018-06-26 01:34:11 | Introspection of node d99bc23e-fe59-46e2-b7c6-d0f9a1925691 timed out.
2018-06-26 01:34:11 | Retrying 2 nodes that failed introspection. Attempt 2 of 3
2018-06-26 01:34:11 | Introspection of node 13988d4d-85e9-4ef6-b701-16760e3ebe9f timed out.
2018-06-26 01:34:11 | Introspection of node d99bc23e-fe59-46e2-b7c6-d0f9a1925691 timed out.
2018-06-26 01:34:11 | Retrying 2 nodes that failed introspection. Attempt 3 of 3
2018-06-26 01:34:11 | Introspection of node 13988d4d-85e9-4ef6-b701-16760e3ebe9f timed out.
2018-06-26 01:34:11 | Introspection of node d99bc23e-fe59-46e2-b7c6-d0f9a1925691 timed out.
2018-06-26 01:34:11 | Retry limit reached with 2 nodes still failing introspection
2018-06-26 01:34:11 | Unhandled workflow error
2018-06-26 01:34:11 | Unhandled workflow error
2018-06-26 01:34:11 | Unhandled workflow error

Then the job times out

wes hayutin (weshayutin) wrote :

hrm.. this seems unrelated to the patch that we thought fixed the issue..

2018-06-25 23:24:15.986 7 ERROR oslo_service.service DriverLoadError: Driver, hardware type or interface pxe_ipmitool could not be loaded. Reason: [Errno 13] Permission denied: '/var/lib/ironic/httpboot/boot.ipxe'.

Going to retry

Reviewed: https://review.openstack.org/577809
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart/commit/?id=62947ece560c3d0d6ea2b1702022f5185094a929
Submitter: Zuul
Branch: master

commit 62947ece560c3d0d6ea2b1702022f5185094a929
Author: Wes Hayutin <email address hidden>
Date: Mon Jun 25 08:57:11 2018 -0400

    turn off undercloud idempotency check on fs002

    Turns off undercloud idempotency check on fs002
    Turns on undercloud idempotency check in fs20 or
    periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master

    Related-Bug: #1777939
    Related-Bug: #1777941
    Change-Id: Ib83aed5fa55a9deb6ac8a90817ad05d2f281dcb6

Bogdan Dobrelya (bogdando) wrote :

So was it fixed with https://review.openstack.org/577809? Let's close this and remove the alert tag, please.

wes hayutin (weshayutin) on 2018-06-28
Changed in tripleo:
status: In Progress → Fix Released
tags: removed: alert promotion-blocker

Reviewed: https://review.openstack.org/577785
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=f14369c08515de9e969ea3b72be1da29772f6124
Submitter: Zuul
Branch: master

commit f14369c08515de9e969ea3b72be1da29772f6124
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Jun 25 13:52:40 2018 +0300

    Fix non-HA rabbitmq user password updates

    Fix the invocation of puppet's rabbitmq_user by allowing the
    produced changes written into /var/lib/rabbitmq state.

    Move the docker_puppet_tasks for non-HA rabbit into the step_2
    as puppet-tripleo expects it for the rabbitmq_user call.

    Related-bug: #1777939
    Related-bug: #1778505

    Depends-on: I55dc83acb2ed5ee07b4cf57e25135e6201589ac4
    Change-Id: I7c7b2bd6d202268fdf4bc08f951ec4c9c4065c08
    Signed-off-by: Bogdan Dobrelya <email address hidden>
