juju controller 2.5.2 upgrade fails with failed to deserialize conf for application "juju-db"

Bug #1820327 reported by Drew Freiberger
This bug affects 3 people
Affects: Canonical Juju
Status: Won't Fix
Importance: High
Assigned to: Tim Penhey

Bug Description

When upgrading my juju controller model from 2.4.4 to 2.5.2 (using the snapped juju for both bootstrap and upgrade), I've encountered a bug in jujud-machine agents stating:

DEBUG juju.worker.dependency engine.go:538 "state" manifold worker stopped: failed to deserialize conf for application "juju-db": strconv.Atoi: parsing "unlimited": invalid syntax.

Inspecting the /etc/systemd/system/juju-db.service file, I find the following settings:

[Service]
LimitNOFILE=65000
LimitNPROC=20000
LimitFSIZE=unlimited
LimitCPU=unlimited
LimitAS=unlimited
LimitMEMLOCK=unlimited
ExecStart=/usr/lib/juju/mongo3.2/bin/mongod *snip*

When I modify these unlimited lines, the "failed to deserialize conf" error changes to report the new string, so this appears to be a problem with juju processing the juju-db service config file when it encounters a string in a field that is typically an integer. Both kinds of value are valid, however.
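For illustration, the failure can be reproduced outside Juju with a few lines of Go. This is only a sketch of the parsing problem, not Juju's actual code: `parseLimit` is a hypothetical helper showing one way a parser could tolerate the `unlimited` keyword alongside plain integers.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseLimit is a hypothetical helper (not taken from the Juju codebase).
// It accepts either a decimal integer or the keyword "unlimited" and
// reports which of the two it saw.
func parseLimit(value string) (n int, unlimited bool, err error) {
	if value == "unlimited" {
		return 0, true, nil
	}
	n, err = strconv.Atoi(value)
	return n, false, err
}

func main() {
	// A strict integer parse yields the same error text seen in the agent log.
	_, err := strconv.Atoi("unlimited")
	fmt.Println(err) // strconv.Atoi: parsing "unlimited": invalid syntax

	// The tolerant variant handles both kinds of value from the unit file.
	for _, v := range []string{"65000", "unlimited"} {
		n, unlimited, err := parseLimit(v)
		fmt.Println(v, n, unlimited, err)
	}
}
```

Any other non-numeric string still fails in `parseLimit`, which matches the observation that the error message follows whatever string is placed in the field.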

This was discovered in a Xenial Queens FCB cloud deployment.

Drew Freiberger (afreiberger) wrote :

Proven workaround is to remove the Limit.*unlimited lines from the /etc/systemd/system/juju-db.service file(s) on all controllers and restart jujud-machine-X service(s).
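As a rough sketch of the text edit involved, the Go function below drops every `Limit<NAME>=unlimited` line from unit-file content. `stripUnlimitedLimits` is a hypothetical name used only for illustration; the sketch works on a string and deliberately does not touch files or restart services, so rewriting the unit file on each controller and restarting the jujud-machine-X services remain separate manual steps.

```go
package main

import (
	"fmt"
	"strings"
)

// stripUnlimitedLimits returns the unit-file text with every line of the
// form "Limit<NAME>=unlimited" removed. All other lines are kept as-is.
// Hypothetical helper for illustration only.
func stripUnlimitedLimits(unit string) string {
	var kept []string
	for _, line := range strings.Split(unit, "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, "Limit") && strings.HasSuffix(trimmed, "=unlimited") {
			continue // drop the line that the older agent cannot parse
		}
		kept = append(kept, line)
	}
	return strings.Join(kept, "\n")
}

func main() {
	// Sample content based on the settings quoted in the bug description.
	unit := "[Service]\nLimitNOFILE=65000\nLimitNPROC=20000\nLimitFSIZE=unlimited\nLimitCPU=unlimited\n"
	fmt.Print(stripUnlimitedLimits(unit))
}
```

Integer limits such as `LimitNOFILE=65000` pass through untouched, so only the values the older agent rejects are removed.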

Richard Harding (rharding) wrote :

These were added in 2.5.2 and seem to have hit some Go upgrade code that can't deal with those string values, while mongodb is fine with them.

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.5.3
assignee: nobody → Tim Penhey (thumper)
Drew Freiberger (afreiberger) wrote :

All nodes have these juju packages installed.

ii juju-mongo-tools3.2 3.2.4+ds-0ubuntu1 amd64 Tools to administer MongoDB.
ii juju-mongodb3.2 3.2.15-0ubuntu1~16.04.1 amd64 MongoDB object/document-oriented database for Juju

I found that the LimitNOFILE and LimitNPROC settings in the juju-db.service file on jujud-machine-0 were 65000 and 20000, whereas they were 64000 and 64000 on jujud-machine-1 and -2. Not sure if this has to do with any sort of leadership decisions, or whether it may be another upgrade-related bug.
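A comparison like this can be scripted. The Go sketch below is an illustration only: `parseLimits` and `diffLimits` are hypothetical helpers, and the sample strings contain just the two settings mentioned above rather than the full unit files.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// parseLimits collects the Limit* settings of a unit file into a map.
// Hypothetical helper for illustration only.
func parseLimits(unit string) map[string]string {
	limits := make(map[string]string)
	for _, line := range strings.Split(unit, "\n") {
		key, value, found := strings.Cut(strings.TrimSpace(line), "=")
		if found && strings.HasPrefix(key, "Limit") {
			limits[key] = value
		}
	}
	return limits
}

// diffLimits returns the sorted names of settings whose values differ
// between the two maps. A missing key compares as an empty value.
func diffLimits(a, b map[string]string) []string {
	seen := make(map[string]bool)
	var changed []string
	for _, m := range []map[string]string{a, b} {
		for key := range m {
			if !seen[key] && a[key] != b[key] {
				changed = append(changed, key)
			}
			seen[key] = true
		}
	}
	sort.Strings(changed)
	return changed
}

func main() {
	machine0 := parseLimits("LimitNOFILE=65000\nLimitNPROC=20000\n")
	machine1 := parseLimits("LimitNOFILE=64000\nLimitNPROC=64000\n")
	for _, key := range diffLimits(machine0, machine1) {
		fmt.Printf("%s: %s vs %s\n", key, machine0[key], machine1[key])
	}
}
```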

Drew Freiberger (afreiberger) wrote :

After running the workaround, I tried re-running the upgrade using:

juju upgrade-model -m controller --reset-previous-upgrade

This resulted in the juju-db.service files being re-populated with:

LimitNPROC=64000
LimitFSIZE=unlimited
LimitCPU=unlimited
LimitAS=unlimited
LimitMEMLOCK=unlimited
LimitNOFILE=64000

To head off another upgrade failure, I quickly ran a loop to remove the "unlimited" entries from the files and restart the services as noted in the workaround.

This kept the upgrade from timing out.

Tim Penhey (thumper) wrote :

This appears to be due to a failed upgrade. Can you check the controller to see if it attempted an upgrade which then failed?

You should be able to check the logs for the lines "running jujud". The 2.5.2 codebase can happily parse the unlimited values, but the 2.4.4 codebase cannot.

Changed in juju:
status: Triaged → Incomplete
Drew Freiberger (afreiberger) wrote :

I did see in the logs that it rolled back to 2.4.4 after an unsuccessful upgrade. Unfortunately, those logs rolled off my machines before I could capture the entries around the event, but a rollback is definitely what happened.

I feel like I watched it get stuck rolling forward again and I unstuck it with the removal of Limit*=unlimited lines and restarting the jujud-machine-X agents. Is it possible that the juju-db upgrade process is adding the lines into the service file before the jujud-machine is upgraded and then jujud-machine agent possibly restarts with new juju-db config but old agent rev?

Seems unlikely, and I don't have logs to prove the thought.

Changed in juju:
status: Incomplete → New
Tim Penhey (thumper) wrote :

Only Juju 2.5.2 knows how to write unlimited limits. So it wouldn't have been 2.4.7 writing that.

However, if the controller tried to upgrade to 2.5.2, that would have rewritten the mongo service file, and then fallen back to 2.4.7, which unfortunately can't parse unlimited limits.

We should make sure we have clear documentation around the process of upgrading the controllers because sometimes the controllers don't fully stop as they are expected to when upgrading. Juju has got a lot better around this recently, but there is still the potential.

If the controllers don't all check in to do the upgrade together, they get rolled back (kinda). This is the source of this problem.

Changed in juju:
status: New → Won't Fix
Xav Paice (xavpaice) wrote :

Just hit this upgrading from 2.4.7 direct to 2.5.8 - didn't touch 2.5.2. Should I be just upgrading direct to 2.6.10 and ignoring the 2.5 series entirely?

John A Meinel (jameinel) wrote :

Did you successfully end on 2.5.8? Otherwise the same applies: anything >2.5.2 will rewrite the field, which something <2.5.2 can't read. The claim earlier was that it only failed because of the rollback to 2.4.7, which means the upgrade was already failing.

John
=:->

