Multiple jujud services stuck with "uniter" manifold worker returned unexpected error: failed to initialize uniter cannot create storage hook source

Bug #1827838 reported by Pedro Guimarães on 2019-05-06
This bug affects 2 people
Affects   Status   Importance   Assigned to        Milestone
juju               High         Joseph Phillips
2.5                High         Joseph Phillips
2.6                High         Joseph Phillips

Bug Description

Juju 2.5.4
MAAS 2.5.2

OpenStack Bionic/Queens

Multiple jujud services seem to be getting stuck with the following error message:
ERROR juju.worker.dependency engine.go:636 "uniter" manifold worker returned unexpected error: failed to initialize uniter for "UNIT_NAME": cannot create storage hook source: getting unit attachments: runtime error: invalid memory address or nil pointer dereference

jujud keeps pushing the same information to the logs, for example:
https://pastebin.ubuntu.com/p/5QN6HmtHxz/
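
To gauge how many units are affected, the repeated error can be counted per unit from the agent logs. The following is only a minimal sketch: it assumes the log lines contain the exact error text quoted above, and the function name count_uniter_failures is made up for illustration, not part of Juju.

```python
import re
from collections import Counter

# Matches the error signature quoted in this report and captures the unit name.
UNITER_ERROR = re.compile(
    r'failed to initialize uniter for "([^"]+)": cannot create storage hook source'
)


def count_uniter_failures(log_lines):
    """Return a Counter mapping unit name to the number of matching log lines.

    Hypothetical helper: it only looks for the error text from this report.
    """
    counts = Counter()
    for line in log_lines:
        match = UNITER_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It can be fed any iterable of lines, for instance an open log file. A unit with a high count is repeatedly failing to start its uniter.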

The workaround at the moment is to monitor the deployment and restart any services that seem to be stuck.
I am doing so with a script that restarts any service that keeps the same status for too long.
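
The script itself is not attached to this report. As a rough sketch of the idea only, the "same status for too long" check could look like the following. The class name StuckTracker, the max_seconds threshold and the settled tuple are all assumptions made for illustration. How statuses are collected and how a service is restarted are left to the caller.

```python
import time


class StuckTracker:
    """Track how long each unit has reported the same status (hypothetical helper)."""

    def __init__(self, max_seconds, settled=("active",)):
        self.max_seconds = max_seconds
        self.settled = settled  # statuses that are never treated as stuck
        self._seen = {}         # unit -> (status, time the status was first seen)

    def observe(self, statuses, now=None):
        """Record a {unit: status} snapshot; return the sorted list of stuck units."""
        now = time.time() if now is None else now
        stuck = []
        for unit, status in statuses.items():
            if status in self.settled:
                self._seen.pop(unit, None)
                continue
            previous = self._seen.get(unit)
            if previous is None or previous[0] != status:
                # New unit or status changed: restart the clock.
                self._seen[unit] = (status, now)
            elif now - previous[1] >= self.max_seconds:
                stuck.append(unit)
        return sorted(stuck)

    def reset(self, unit):
        """Forget a unit, for example right after its service was restarted."""
        self._seen.pop(unit, None)
```

A monitoring loop would call observe() with a fresh status snapshot at each poll, restart the services it returns, and call reset() for each restarted unit so it gets a full grace period before being flagged again.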

The concern is that I do not know what consequences so many restarts may have for the overall deployment, or whether any services will be disrupted. It worked on my last trial.

tags: added: cpe-onsite
Changed in juju:
status: New → Triaged
importance: Undecided → Critical
Changed in juju:
assignee: nobody → Canonical Juju QA Bot (juju-qa-bot)
Changed in juju:
assignee: Canonical Juju QA Bot (juju-qa-bot) → Joseph Phillips (manadart)
Joseph Phillips (manadart) wrote :

Proposed a patch against the 2.5 branch that fixes a missing error check on this code path:
https://github.com/juju/juju/pull/10142

I got pulled off on to other things today, but I will dig into the crash dump first thing and see if there is more that can be gleaned there.

Pedro Guimarães (pguimaraes) wrote :

Hi, just tested your repo on my environment and it failed during juju deploy:
https://pastebin.ubuntu.com/p/FdgVjbr6n2/

Also, juju enable-ha returns empty: https://pastebin.ubuntu.com/p/YyvwsXsCKC/

Joseph Phillips (manadart) wrote :

Thanks for giving it a test.

This does indeed look like the error that was being obfuscated, so now we can set about addressing it.

Pedro Guimarães (pguimaraes) wrote :

Hi, an update: I was running the controller model behind a proxy, which explains the issues I faced in #4. Once the proxy configs were set correctly, the controller deployment worked.

I am currently running on your snap version, here is the full crashdump of latest deployment: https://drive.google.com/open?id=1S0fhIt-BvqX1I_Hau0t2XcRhJr8MQpDL

I am seeing some units failing, like:
https://pastebin.canonical.com/p/zcTwFJrJcS/
https://pastebin.canonical.com/p/Zyqqyc5XFR/

I can see that both units cited above were on the same node. Both had this in their logs:
2019-05-07 12:53:43 DEBUG juju.api apiclient.go:888 successfully dialed "wss://REDACTED:17070/model/d389f110-5ed1-46d3-8526-796a9ddf67ca/api"
2019-05-07 12:53:43 INFO juju.api apiclient.go:608 connection established to "wss://REDACTED:17070/model/d389f110-5ed1-46d3-8526-796a9ddf67ca/api"
2019-05-07 12:54:08 DEBUG juju.worker.apicaller connect.go:155 [d389f1] failed to connect
2019-05-07 12:54:08 ERROR juju.worker.apicaller connect.go:204 Failed to connect to controller: Closed explicitly (unauthorized access)

The jujud services were marked as "exited".
This may be related to proxy configurations as well, but after restarting the jujud service on both units they moved to "active".

Changed in juju:
milestone: none → 2.5.6
Joseph Phillips (manadart) wrote :

Based on the log output in https://pastebin.canonical.com/p/qfxgQpZSqc/, I am a bit concerned that we are not getting consistent results.

My 10142 patch will be in the latest 2.5 edge channel (2.5.5+2.5-a42c953). Can you run with this version? Then we will be sure that the client/controller/agent versions are all congruent.

If we see these errors again, can you get me a crash-dump of the controller model in addition to the one you are deploying into?

Changed in juju:
status: Triaged → In Progress
Changed in juju:
status: In Progress → Incomplete
importance: Critical → High
Joseph Phillips (manadart) wrote :

Are we able to resolve this bug based on the resolution of prior proxy issues?

It will be easier to track further issues in new tickets.

Pedro Guimarães (pguimaraes) wrote :

Hi manadart, yes thanks for all the help.
I do not see the issue or any other issues on the last few deployments.
We can mark as Fix Released.

Changed in juju:
status: Incomplete → Fix Committed

I came across the exact same problem yesterday.
Unfortunately I don't have direct access to the environment, but here is my summary:

1. No proxy is set up, direct internet access.
2. The issue happens on different physical nodes.
3. The issue happens with the agents of different charms.

Hope this helps.

Changed in juju:
status: Fix Committed → Fix Released
Changed in juju:
status: Fix Released → Fix Committed
milestone: 2.5.6 → 2.7-beta1
no longer affects: juju/2.7