jujud services not starting after reboot when /var is on separate partition

Bug #1634390 reported by Sandor Zeestraten
This bug affects 6 people
Affects          Status         Importance   Assigned to   Milestone
Canonical Juju   Fix Released   High         Vinodhini
juju-core        Won't Fix      Critical     Unassigned

Bug Description

# Issue
We have machines in MAAS where we've split /var to a separate partition.
Deploying machines and services with Juju works fine. However, the Juju agent (jujud) services do not start after a machine reboots: systemd does not find the unit files at boot, which I believe is because they are symlinked from /var/lib/juju/init/.

As a workaround, you can reload systemd so that it finds the services and then enable them manually, but that is not a proper solution.
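
The workaround can be sketched roughly as follows. This is only an illustration, not a tested procedure from the report: `UNIT_DIR`, `DRY_RUN`, `run` and `restore_juju_units` are made-up names, the `jujud-*.service` pattern is an assumption about which units are affected, and the `start` step is added because enabling a unit does not start it. By default the commands are printed rather than executed.

```shell
# Sketch of the workaround: reload systemd, then enable and start each
# jujud unit found in UNIT_DIR. With DRY_RUN=1 (the default here) the
# commands are only printed. UNIT_DIR and DRY_RUN are illustrative names.
UNIT_DIR=${UNIT_DIR:-/etc/systemd/system}
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

restore_juju_units() {
    run systemctl daemon-reload
    for unit in "$UNIT_DIR"/jujud-*.service; do
        # Skip when the glob matched nothing or the symlink still dangles.
        [ -e "$unit" ] || continue
        name=$(basename "$unit")
        run systemctl enable "$name"
        run systemctl start "$name"
    done
}
```

Calling `restore_juju_units` prints the commands; running it as root with `DRY_RUN=0` would execute them, once /var is mounted.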

# Output from df and systemctl
http://pastebin.com/t4BLGKGx

# Versions
Juju 2.0.0-xenial-amd64
MAAS 2.0.0

Tags: uosci
Changed in juju:
importance: Undecided → Medium
status: New → Triaged
milestone: none → 2.1.0
Sandor Zeestraten (szeestraten) wrote :

I managed to reproduce the issue in a fresh MAAS setup.
I deployed two machines: one with only / and one with separate partitions for / and /var.

# Works OK
ubuntu@maas-node06:~$ sudo lsblk -o NAME,FSTYPE,SIZE,MOUNTPOINT,LABEL
NAME FSTYPE SIZE MOUNTPOINT LABEL
vda 20G
└─vda1 LVM2_member 20G
  └─vgroot-root ext4 9.3G /
vdb 10G

# Does not work
ubuntu@maas-node07:~$ sudo lsblk -o NAME,FSTYPE,SIZE,MOUNTPOINT,LABEL
NAME FSTYPE SIZE MOUNTPOINT LABEL
vda 20G
└─vda1 LVM2_member 20G
  ├─vgroot-root ext4 9.3G /
  └─vgroot-var ext4 4.7G /var
vdb 10G
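
To check which of the two layouts a machine has without reading the lsblk output by eye, the device IDs of / and /var can be compared. A minimal sketch, assuming GNU coreutils `stat`; the function name `on_separate_fs` is made up for illustration.

```shell
# Hypothetical helper: succeeds (exit 0) when the two paths live on
# different filesystems, judged by the device ID that GNU stat reports.
# Returns 2 if either path cannot be examined.
on_separate_fs() {
    dev_a=$(stat -c %d "$1") || return 2
    dev_b=$(stat -c %d "$2") || return 2
    [ "$dev_a" != "$dev_b" ]
}

# Example: on_separate_fs / /var && echo "/var is a separate filesystem"
```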

Changed in juju:
importance: Medium → Critical
Mick Gregg (macgreagoir) wrote :

@anastasia-macmood I've seen a few comments around the place about when systemd [re]mounts /var. I'm guessing some tweaking of the service script might help with any race.

Changed in juju:
milestone: 2.1.0 → 2.2.0
importance: Critical → High
Anastasia (anastasia-macmood) wrote :

Re-targeting to next milestone - further investigations are needed as this may not be something that can be fixed in Juju.

Sandor Zeestraten (szeestraten) wrote :

@anastasia-macmood The offending systemd service files are symlinked from /var/lib/juju/init/ to /etc/systemd/system/, by the Juju agent installer I presume.

Systemd loads these files fine when I simply place them in /etc/systemd/system/, as Juju already does with juju-clean-shutdown.service.

As @macgreagoir mentioned, someone with some more systemd knowledge might have a better idea if the service script can be tweaked or if the location of these files should be reconsidered.

Anyway, I think it is safe to say that Juju should not assume that /var is on the same partition as / as splitting these is a relatively common practice. Perhaps it could be added as a test case?

Changed in juju:
assignee: nobody → Richard Harding (rharding)
Ryan Beisner (1chb1n) wrote :

I've run into this as a side effect of working around https://bugs.launchpad.net/bugs/1492237 (which was my root symptom: controller disk fills up rapidly).

The model is 1.25.6, and is a long-running production deployment. The controllers (3 in HA) bumped up against > 98% disk space usage and it became impossible to issue juju commands or even get status.

I stopped juju services manually on each of the controller units, added storage, moved contents of /var/lib/juju, updated fstab, rebooted. But then none of the juju-* services would start.

Systemd unit files are read earlier in the boot process than mounts are handled, and since they are symlinks to files on a separate mount, the systemd unit files simply did not load.
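
One way to see which unit files are exposed to this is to list the symlinks in the unit directory whose targets resolve to somewhere under /var. A minimal sketch, assuming GNU `readlink -f`; the helper name `links_into` is hypothetical, and it only looks at entries directly inside the given directory.

```shell
# Hypothetical helper: print every symlink directly inside DIR whose
# resolved target lies under PREFIX. Relies on GNU `readlink -f`.
# Links that cannot be resolved (for example because the target
# filesystem is not mounted) are skipped silently.
links_into() {
    dir=$1
    prefix=$(readlink -f "$2") || return 1
    for entry in "$dir"/*; do
        [ -L "$entry" ] || continue
        target=$(readlink -f "$entry") || continue
        case $target in
            "$prefix"/*) printf '%s -> %s\n' "$entry" "$target" ;;
        esac
    done
}

# Example: links_into /etc/systemd/system /var
```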

I removed the symlinks and just copied the systemd unit files in place, and the controllers are happy once again, with a ton of space available. Juju status and other juju commands are back to normal.

Example, on unit 0:

sudo mv -fv /etc/systemd/system/juju-db.service /etc/systemd/system/juju-db.service.hold.$(date +%s )
sudo mv -fv /etc/systemd/system/multi-user.target.wants/juju-db.service /etc/systemd/system/multi-user.target.wants/juju-db.service.hold.$(date +%s )

sudo cp -fvp /var/lib/juju/init/juju-db/juju-db.service /etc/systemd/system/juju-db.service
sudo cp -fvp /var/lib/juju/init/juju-db/juju-db.service /etc/systemd/system/multi-user.target.wants/juju-db.service

sudo mv -fv /etc/systemd/system/jujud-machine-0.service /etc/systemd/system/jujud-machine-0.service.hold.$(date +%s )
sudo mv -fv /etc/systemd/system/multi-user.target.wants/jujud-machine-0.service /etc/systemd/system/multi-user.target.wants/jujud-machine-0.service.hold.$(date +%s )

sudo cp -fvp /var/lib/juju/init/jujud-machine-0/jujud-machine-0.service /etc/systemd/system/jujud-machine-0.service
sudo cp -fvp /var/lib/juju/init/jujud-machine-0/jujud-machine-0.service /etc/systemd/system/multi-user.target.wants/jujud-machine-0.service

That may or may not be the best approach, and will likely require careful attention on upgrades, but it got us back up and out of quite a snag.
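
The same move-then-copy pattern could be wrapped in a small helper so it can be repeated per unit file. This is only a sketch under the same caveats as above: `copy_in_place` is a made-up name, it assumes GNU `readlink -f`, and it needs the filesystem holding the real unit file to be mounted when it runs.

```shell
# Hypothetical helper: replace one symlinked unit file with a real copy
# of its target, keeping the old symlink under a timestamped .hold name.
copy_in_place() {
    link=$1
    [ -L "$link" ] || { echo "not a symlink: $link" >&2; return 1; }
    target=$(readlink -f "$link") || return 1
    [ -f "$target" ] || { echo "target missing: $target" >&2; return 1; }
    # mv renames the symlink itself; it does not touch the target.
    mv -f "$link" "$link.hold.$(date +%s)"
    cp -fp "$target" "$link"
}

# Example (as root): copy_in_place /etc/systemd/system/jujud-machine-0.service
```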

tags: added: uosci
Ryan Beisner (1chb1n) wrote :

Added juju-core as this impacts long-running production workloads, which at this point in time are vastly on 1.25.x, with no upgrade path to 2.x.

Changed in juju-core:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 1.25.10
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.10 → none
Changed in juju-core:
milestone: none → 1.25.11
Changed in juju:
assignee: Richard Harding (rharding) → nobody
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.2-beta1 → 2.2-beta2
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.11 → none
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.2-beta2 → 2.2-beta3
Changed in juju:
milestone: 2.2-beta3 → 2.2-beta4
Changed in juju:
milestone: 2.2-beta4 → 2.2-rc1
Tim Penhey (thumper) wrote :

Firstly, let's be honest, we aren't going to address this on 1.25.

Changed in juju-core:
status: Triaged → Won't Fix
Tim Penhey (thumper) wrote :

Juju shouldn't be storing the systemd files in /var and symlinking. It appears that other apps put them in /lib/systemd/system and symlink into /etc/systemd/system.

Removing the milestone instead of punting down the road.

Changed in juju:
importance: High → Medium
milestone: 2.2-rc1 → none
Oddgeir Lingaas Holmen (oddgeir-lingaas-holmen) wrote :

Would be great if this bug can be prioritized as it is a pain point in our dev and prod environments.

Sandor Zeestraten (szeestraten) wrote :

Any chance of a look at this for 2.4?

Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.4-beta1
importance: Medium → High
Ian Booth (wallyworld)
Changed in juju:
assignee: nobody → Vinodhini (vinu-b)
Ian Booth (wallyworld) wrote :

We can put the directories currently placed in /var/lib/juju/init into /lib/systemd/juju instead.
The Juju-related symlinks in /etc/systemd/system would just be retargeted.

We'll need an upgrade step to copy across existing service files.
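
As a rough illustration of what such a step would have to do (this is a shell sketch, not Juju's actual upgrade code), each per-unit directory is copied to the new location and any existing symlink for that unit is pointed at the copy. The function name `migrate_units` and its three parameters are made up; the defaults are the paths named in this comment, and the `<name>/<name>.service` layout is taken from the paths shown earlier in this report.

```shell
# Sketch only, not Juju's actual upgrade step. The three positional
# parameters are illustrative and default to the paths discussed above.
migrate_units() {
    old_root=${1:-/var/lib/juju/init}
    new_root=${2:-/lib/systemd/juju}
    link_dir=${3:-/etc/systemd/system}
    mkdir -p "$new_root" || return 1
    for src in "$old_root"/*; do
        [ -d "$src" ] || continue
        name=$(basename "$src")
        # Copy the whole per-unit directory, preserving modes and times.
        cp -Rp "$src" "$new_root/" || return 1
        unit=$name.service
        # Retarget the symlink only if one already exists for this unit.
        if [ -L "$link_dir/$unit" ]; then
            ln -sfn "$new_root/$name/$unit" "$link_dir/$unit"
        fi
    done
}
```

The sketch leaves the old files in place and does not touch the links under multi-user.target.wants; a real upgrade step would need to decide how to handle both.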

John A Meinel (jameinel) wrote : Re: [Bug 1634390] Re: jujud services not starting after reboot when /var is on separate partition

Given that they are in /etc/systemd isn't this just changing one
failure-when-on-a-different-filesystem for another?

John
=:->


Ian Booth (wallyworld) wrote :

From what I can see, many systemd services set up by a distro install are configured by placing the actual service files in /lib/systemd and then linking them into /etc/systemd. So I assume there's an expectation that /etc and /lib are on the same partition. I can see that in many cases /var would be on a different partition, as that fills with logs etc.

Ian Booth (wallyworld)
Changed in juju:
status: Triaged → In Progress
Changed in juju:
milestone: 2.4-beta1 → none
Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.4-beta2
Changed in juju:
milestone: 2.4-beta2 → none
Vinodhini (vinu-b) wrote :
Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.4-rc1
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released