Failed openstack-zaqar service after undercloud upgrade

Bug #1661227 reported by Marius Cornea on 2017-02-02
This bug affects 3 people
Affects: tripleo
Importance: High
Assigned to: Unassigned

Bug Description

After an undercloud upgrade from Newton to Ocata, the Zaqar service shows as failed:

[root@undercloud-0 stack]# systemctl status openstack-zaqar.service
● openstack-zaqar.service - OpenStack Message Queuing Service (code-named Zaqar) Server
   Loaded: loaded (/usr/lib/systemd/system/openstack-zaqar.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2017-02-02 11:21:48 UTC; 50min ago
 Main PID: 23023 (code=exited, status=1/FAILURE)

Feb 02 11:03:05 undercloud-0.redhat.local systemd[1]: Started OpenStack Message Queuing Service (code-named Zaqar) Server.
Feb 02 11:03:05 undercloud-0.redhat.local systemd[1]: Starting OpenStack Message Queuing Service (code-named Zaqar) Server...
Feb 02 11:21:48 undercloud-0.redhat.local systemd[1]: openstack-zaqar.service: main process exited, code=exited, status=1/FAILURE
Feb 02 11:21:48 undercloud-0.redhat.local systemd[1]: Unit openstack-zaqar.service entered failed state.
Feb 02 11:21:48 undercloud-0.redhat.local systemd[1]: openstack-zaqar.service failed.

zaqar.log:

2017-02-02 11:21:48.137 23023 WARNING keystonemiddleware.auth_token [-] Using the in-process token cache is deprecated as of the 4.2.0 release and may be removed in the 5.0.0 release or the 'O' development cycle. The in-process cache causes inconsistent results and high memory usage. When the feature is removed the auth_token middleware will not cache tokens by default which may result in performance issues. It is recommended to use memcache for the auth_token token cache by setting the memcached_servers option.
2017-02-02 11:21:48.610 23023 CRITICAL zaqar [(None,) - - - - -] [project_id:48f0d1803e48484087e4a5868c279e6e] IOError: [Errno 32] Broken pipe
2017-02-02 11:21:48.610 23023 ERROR zaqar Traceback (most recent call last):
2017-02-02 11:21:48.610 23023 ERROR zaqar File "/usr/bin/zaqar-server", line 10, in <module>
2017-02-02 11:21:48.610 23023 ERROR zaqar sys.exit(run())
2017-02-02 11:21:48.610 23023 ERROR zaqar File "/usr/lib/python2.7/site-packages/zaqar/common/cli.py", line 58, in _wrapper
2017-02-02 11:21:48.610 23023 ERROR zaqar _fail(1, ex)
2017-02-02 11:21:48.610 23023 ERROR zaqar File "/usr/lib/python2.7/site-packages/zaqar/common/cli.py", line 36, in _fail
2017-02-02 11:21:48.610 23023 ERROR zaqar print(ex, file=sys.stderr)
2017-02-02 11:21:48.610 23023 ERROR zaqar IOError: [Errno 32] Broken pipe
2017-02-02 11:21:48.610 23023 ERROR zaqar

Workaround: systemctl restart openstack-zaqar.service

Emilien Macchi (emilienm) wrote :

It's weird that we don't see it in TripleO CI (we have an undercloud upgrade job and it works).

Changed in tripleo:
status: New → Triaged
importance: Undecided → High
milestone: none → ocata-rc1
tags: added: upgrade
Thomas Herve (therve) wrote :

OK, I tracked it down and I think I'm getting somewhere: I'm able to reproduce it when journald is restarted. I thought of that because of the stderr error, and because it seems to happen mostly on upgrades.

https://bugs.freedesktop.org/show_bug.cgi?id=84923 ought to be the culprit, though I wasn't able to verify if the environments have the fix or not.

Ideally, I would track down stdout/stderr usage in Zaqar to remove that issue. It's possible that it doesn't use loggers correctly, and so hits that issue more often than other services.

In the meantime, I proposed https://review.rdoproject.org/r/#/c/4941/ which ought to work around the issue by restarting zaqar when it fails.
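
For illustration only, here is a rough sketch of what guarding the stderr write from the traceback (the `print(ex, file=sys.stderr)` call in `_fail`) could look like. This is not the actual Zaqar code or the proposed fix: `safe_stderr_write` and `fail` are hypothetical names, and the assumption is simply that a write to a stderr pipe whose reader went away raises IOError/OSError, as in the log above.

```python
import sys


def safe_stderr_write(message, stream=None):
    """Write message plus a newline to stream (default: sys.stderr).

    Returns True on success and False if the write fails with an
    IOError/OSError such as [Errno 32] Broken pipe, instead of raising.
    """
    stream = sys.stderr if stream is None else stream
    try:
        stream.write("{}\n".format(message))
        stream.flush()
        return True
    except (IOError, OSError):
        # The reader on the other end of stderr is gone (for example,
        # journald was restarted). Swallow the error so the caller can
        # continue with its own exit path.
        return False


def fail(returncode, ex, stream=None):
    """Hypothetical stand-in for an exit helper like _fail(1, ex)."""
    safe_stderr_write(ex, stream)
    sys.exit(returncode)
```

With something like this, a broken stderr would no longer replace the original exception with an IOError in the log. It only covers this one call site; any other direct stdout/stderr usage would need the same treatment or a move to loggers.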

Thomas Herve (therve) wrote :

I closed bug #1640600 as a duplicate.

Changed in tripleo:
milestone: ocata-rc1 → ocata-rc2
Changed in tripleo:
milestone: ocata-rc2 → pike-1
Changed in tripleo:
milestone: pike-1 → pike-2
Changed in tripleo:
milestone: pike-2 → pike-3
Changed in tripleo:
milestone: pike-3 → pike-rc1
Ben Nemec (bnemec) wrote :

I'm going to close this since from a TripleO perspective https://review.rdoproject.org/r/#/c/4941/ fixes the failure.

Changed in tripleo:
status: Triaged → Fix Released