iscsid autostarts on all servers when it has nothing to do

Bug #1755858 reported by Steve Langasek on 2018-03-14
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
open-iscsi (Debian)
New
Unknown
open-iscsi (Ubuntu)
High
Unassigned
Bionic
High
Dimitri John Ledkov 🌈

Bug Description

[Impact]

 * Service is running uselessly which is consuming a few cycles/memory as
   well as rasising general concerns e.g. on minimizing attack surfcae of
   a system.

 * This is also the only service in a default server install which pulls in the network-online.target, which has implications for boot ordering and speed in various configurations.

 * Fix by switching to socket activation

[Test Case]

 * After installing open-iscsi (which is default installed) the service
   iscsid is running which is mostly useless
 * After the upgrade only the iscsid.socket should run
 * Ensure that iscsid.service should come up as needed

[Regression Potential]

 * I'm not sure we can/shall SRU this, but was asked to express my
   thoughts in the regression potential. It is not that it would not
   "work", we tested in cosmic and so far all is fine - the tools will
   call the abstract socket and it will spawn.
   So it is not that I see it totally "failing"
 * I'd more be concerned that one would have e.g. scripts and other upper
   level code that does like:
     if service-is-not-running; then break; else do what you should do
   This would give up before socket-triggering it which might be too much
   to SRU. On a Upgrade to a newer release such minor adaptions are usual,
   but for SRUs?
 * But also we don't stop the service on upgrade (for safety of the data),
   so you'd have four different Bionics
   a) old iscsid.service runnign by default
   b) upgraded, but not rebooted iscsid.service still running
   c) upgraded, rebooted iscid.service disabled,
      iscsid.socket running
   d) new deploy after this (e.g. new cloud image) iscid.service disabled,
      iscsid.socket running
   a+b are similar as well as c+d.
 * OTOH there are a few things reducing this impact, first of all this is
   a config, so one can "systemctl enable iscsid.service" and will have
   the old behavior
 * Is this a real blocker, I'm not sure - so I documented as requested and
   would want the SRU team to discuss before an upload.

[Other Info]

 * n/a

---

In bionic, the open-iscsi systemd unit has the following guards to keep it from running on systems with no iscsi targets configured:

# Must have some pre-defined targets to login to
ConditionDirectoryNotEmpty=|/etc/iscsi/nodes
# or have a session to use via iscsid
ConditionDirectoryNotEmpty=|/sys/class/iscsi_session

However, iscsid starts from a separate unit and does not include this check. Thus, iscsid starts on every Ubuntu Server install, whether or not it has anything to do.

We should replicate these unit conditionals to the iscsid unit, to ensure the daemon doesn't run (consuming memory, and slowing boot) when not needed.

Related bugs:
 * bug 1630946: ubuntu-server depends on open-iscsi and runs iscsid

Related branches

Steve Langasek (vorlon) on 2018-04-18
Changed in open-iscsi (Ubuntu):
importance: Undecided → High
status: New → Triaged
assignee: nobody → Mathieu Trudel-Lapierre (cyphermox)
Download full text (3.9 KiB)

Status of a clean install in a VM and/or Container, so testable rather trivial:

$ systemctl status iscsid open-iscsi
● iscsid.service - iSCSI initiator daemon (iscsid)
   Loaded: loaded (/lib/systemd/system/iscsid.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2018-05-23 10:52:11 UTC; 7min ago
     Docs: man:iscsid(8)
  Process: 751 ExecStart=/sbin/iscsid (code=exited, status=0/SUCCESS)
  Process: 721 ExecStartPre=/lib/open-iscsi/startup-checks.sh (code=exited, status=0/SUCCESS)
 Main PID: 759 (iscsid)
    Tasks: 2 (limit: 548)
   CGroup: /system.slice/iscsid.service
           ├─754 /sbin/iscsid
           └─759 /sbin/iscsid

May 23 10:52:11 b1 systemd[1]: Starting iSCSI initiator daemon (iscsid)...
May 23 10:52:11 b1 systemd[1]: iscsid.service: Failed to parse PID from file /run/iscsid.pid: Invalid argument
May 23 10:52:11 b1 iscsid[754]: iSCSI daemon with pid=759 started!
May 23 10:52:11 b1 systemd[1]: Started iSCSI initiator daemon (iscsid).

● open-iscsi.service - Login to default iSCSI targets
   Loaded: loaded (/lib/systemd/system/open-iscsi.service; enabled; vendor preset: enabled)
   Active: inactive (dead)
Condition: start condition failed at Wed 2018-05-23 10:52:11 UTC; 7min ago
           ├─ ConditionDirectoryNotEmpty=|/etc/iscsi/nodes was not met
           └─ ConditionDirectoryNotEmpty=|/sys/class/iscsi_session was not met
     Docs: man:iscsiadm(8)
           man:iscsid(8)

It is important to note that these are not "the same" service twice.
both belong to the same package:
dpkg -S /lib/systemd/system/open-iscsi.service /lib/systemd/system/iscsid.service
open-iscsi: /lib/systemd/system/open-iscsi.service
open-iscsi: /lib/systemd/system/iscsid.service

But the two are doing rather different things:
open-iscsi.service: logs into iSCSI targets if some are configured (Condition stops it from doing so) - this is more a one-shot configure-devices and not a real service
iscsi.service: the basic service daemone, required by the service above

I think here we might learn from Fedora:
Default is:
[root@fedora ~]# systemctl status iscsid.socket iscsid.service
● iscsid.socket - Open-iSCSI iscsid Socket
   Loaded: loaded (/usr/lib/systemd/system/iscsid.socket; enabled; vendor preset: disabled)
   Active: active (listening) since Wed 2018-05-23 11:26:37 UTC; 9s ago
     Docs: man:iscsid(8)
           man:iscsiadm(8)
   Listen: @ISCSIADM_ABSTRACT_NAMESPACE (Stream)

May 23 11:26:37 fedora systemd[1]: Listening on Open-iSCSI iscsid Socket.

● iscsid.service - Open-iSCSI
   Loaded: loaded (/usr/lib/systemd/system/iscsid.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:iscsid(8)
           man:iscsiadm(8)

May 23 11:26:09 fedora systemd[1]: iscsid.service: Failed to reset devices.list: Operation not permitted
May 23 11:26:09 fedora systemd[1]: Starting Open-iSCSI...

And the related config in /etc/iscsid.conf says:
# Use socket activation, but try to make sure the socket units are listening
iscsid.startup = /bin/systemctl start iscsid.socket iscsiuio.socket
(We don't have the uio, but you get the idea - initially only ensure the socket runs.)

The socket then does:
# syste...

Read more...

Most basic tests cross check on Fedora:
First I tried a safe no-op
  $ iscsiadm -m discovery -t sendtargets -p 127.0.0.1
  This is working fine on fedora to activate the service

Then I tried on the fedora setup if the cmd we call would activate it.
  $ /sbin/iscsiadm -m node --loginall=automatic
    iscsiadm: No records found
And the service is not activated.
But that is ok, I really had no config and /lib/systemd/system/open-iscsi.service explicitly defined RC=21 as ok for just this reason.

If to configure something:
$ targetname="iqn.2016-11.foo.com:target.1.iscsi"
$ truncate --size 100M backingfile1
$ tgtadm --lld iscsi --op new --mode target --tid 1 -T "${targetname}"
$ tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL
$ portal="127.0.0.1:3260"
$ iscsiadm --mode discovery --type sendtargets --portal 127.0.0.1
$ iscsiadm --mode node --targetname "${targetname}" --portal 127.0.0.1:3260 --login
$ iscsiadm --mode node --targetname "${targetname}" --portal "${portal}" --logout

This should now be usable

Trying a hacky POC:

1. define iscsid.socket
$ cat > /lib/systemd/system/iscsid.socket << EOF
[Unit]
Description=Open-iSCSI iscsid Socket
Documentation=man:iscsid(8) man:iscsiadm(8)

[Socket]
ListenStream=@ISCSIADM_ABSTRACT_NAMESPACE

[Install]
WantedBy=sockets.target
EOF
$ systemctl enable iscsid.socket

2. remove some entries in /lib/systemd/system/open-iscsi.service
- remove iscsid from Wants=
- Remove the line ExecStartPre=/bin/systemctl --quiet is-active iscsid.service

3. disable and reload all files
$ systemctl disable iscsid
$ systemctl daemon-reload

Note: we also have still /etc/init.d/iscsid, but as long as there is a .service it doesn't matter.

=> ok - with that on a restart it doesn't start anymore on Ubuntu

Lets see if it will auto-start ...

Test 1:
iscsiadm -m discovery -t sendtargets -p 127.0.0.1
worked to activate it as needed

Test 2:
/sbin/iscsiadm -m node --loginall=automatic
Not activating it, but that is fine if nothing is defined

Test 3:
- define something with tgt, and configure it
Note: remember to load iscsi_tcp in the host if you are in a container
- after that /etc/iscsi/nodes/iqn.2016-11.foo.com\:target.1.iscsi/127.0.0.1\,3260\,1/default exists
Well it still counts as "no-records found" but that is fine commands are called fine
And if I add a iscsid depending call is in there (added as exec pre) it works to enable it.

Overall I think this could work ...

Interesting, this worked just fine without the iscsid.conf entry as in Fedora:
  iscsid.startup = /bin/systemctl start iscsid.socket

I need to check the code for what this is used.

Ok, this is the code path that tries to connect.
It has a fallback to run this command.
I think we would want anyway that this is not started in a random context, but from the socket/service.
The following connect will then pick and start via the socket.

I have a test branch and build
+ * make iscsid socket activated to only activate it as-needed (LP: #1755858)
+ - debian/open-iscsi.socket: systemd socket file for iscsid
+ - debian/open-iscsi.service: do not start or check iscsid.service
+ - debian/rules: install and enable iscsid.socket
+ - debian/patches/iscid-conf-use-systemd.socket-patch: default to the socket

Building in ppa [1] now.
It is not yet up for final review, but feel free to take a look at [2]

I Won't have time for it anymore for a bit, some tests on the ppa (upgrades and how they affect it as well as new installs - since it is there by default the latter meant purge+ppa+install).

[1]: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/3268 for further tests.
[2]: https://code.launchpad.net/~paelzer/ubuntu/+source/open-iscsi/+git/open-iscsi/+merge/346739

Fortunately we are so early in the cycle that I'd say if reviews and tests seem good we can change it. And if on any image/maas/cloud tests we break things we didn't expect there is time enough to revert it before cosmic is entering any Freeze.

On Wed, May 23, 2018 at 01:58:05PM -0000,  Christian Ehrhardt  wrote:
> Fortunately we are so early in the cycle that I'd say if reviews and
> tests seem good we can change it. And if on any image/maas/cloud tests
> we break things we didn't expect there is time enough to revert it
> before cosmic is entering any Freeze.

I think we should be concerned about SRUing this change to bionic as well.
iscsid.service declares Wants=network-online.target, which slows down the
boot in cases where the system for whatever reason can't confirm right away
that the network is up.

Lets get it right in Cosmic also to see what we come up with and if you then say this should be safe for an SRU I'll follow your guidance :-)

FYI: on the MP I'm iterating on different degrees of not-fun I have with dh_* in regard to control them to do exactly what I need for open-iscsi.

Scott Moser (smoser) on 2018-05-29
description: updated
description: updated

MP is reviewed and accepted, new version pushed to Cosmic as https://launchpad.net/ubuntu/+source/open-iscsi/2.0.874-5ubuntu3

I ran and adapted the tests, but lets see how migration works if there are any surprises.

Changed in open-iscsi (Ubuntu):
status: Triaged → Fix Committed
assignee: Mathieu Trudel-Lapierre (cyphermox) → nobody
Changed in open-iscsi (Ubuntu Bionic):
assignee: Mathieu Trudel-Lapierre (cyphermox) → nobody

Also reported to Debian and linked up their bug.
It should help them just as much.

And if they accept it maintenance/merges will be easier, while OTOH if they find issues we can fix them in our Delta - so either way it is good to make it known there.

Changed in open-iscsi (Debian):
status: Unknown → New

This worked in https://bileto.ubuntu.com/#/ticket/3268 when tested before (ppa now abandoned due to upload taking the version).
I also had amd64 tests running locally (due to that the fix to the tests that I had added)

But on the final migration it stumbles over it on LP Infra.
So it might be a race?!?
Example:
https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-cosmic/cosmic/amd64/o/open-iscsi/20180530_111813_16285@/log.gz

I need to check what is going on there other than local.
Starting a new run locally ...

To be a bit faster (in case it is a race) than last time running with 4CPU and 2G memory this time.
$ autopkgtest --apt-upgrade --shell-fail --no-built-binaries --apt-pocket=proposed=src:open-iscsi open-iscsi_2.0.874-5ubuntu3.dsc -- qemu --qemu-options='-cpu host' --cpus 4 --ram-size=2048 ~/autopkgtest-cosmic-amd64.img

Ok, I can at least reproduce now ...

Eventually it is easier than I thought.
We have an intentional no-op (that can fail and already is on an || /bin/true).
This might emit to stderr and that breaks the test.
I logged in and all the status it wanted to test after this command is actually good.

Now that Bileto tests work again I'll respin a fix and test on LP infra before the fix-upload to the archive.

Lets see if it stumbles over more after this, to have at least only one fixup-upload.

The hit in "testsuite" is more complex, as it does not reproduce for me.
From the log:
======================================================================
ERROR: test_tgt_boot (__main__.CloudImageTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/autopkgtest.BEvlMr/build.15A/src/debian/tests/test-open-iscsi.py", line 274, in test_tgt_boot
    subprocess.check_call(cmd)
  File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['/tmp/autopkgtest.BEvlMr/build.15A/src/debian/tests/tgt-boot-test', '-v', '--netdev=user,net=10.1.1.0/24,host=10.1.1.2,dns=10.1.1.4,dnssearch=example.com', '--disk=/tmp/tmpin5Hu3/output-disk.img,serial=output-disk', '--user-data-add=/tmp/tmpin5Hu3/user-data', '/tmp/autopkgtest.BEvlMr/build.15A/src/debian/tests/cosmic.d/disk.img', '/tmp/autopkgtest.BEvlMr/build.15A/src/debian/tests/cosmic.d/kernel', '/tmp/autopkgtest.BEvlMr/build.15A/src/debian/tests/cosmic.d/initrd']' returned non-zero exit status 124

This more seems than it might have timed out, also this is not only not reproducible (locally) but also only affecting x86_64.
I'll have to wait for the LP Infra rerun on Bileto on this.

This now worked a few times locally (4/4) but never on Infra (3/3).

I first thought it would be a timeout for
  qemu-system-x86_64: terminating on signal 15 from pid 25607 (timeout)
  xkvm returned 124 in 3600s
But that is only because it boots into emergency console and hangs there:
  You are in emergency mode. After logging in, type "journalctl -xb" to view

While locally (autopkgtest in VM) it boots&tests fine
  xkvm returned 0 in 200s
  cleaning up tgt mount tgt-boot-test-4mbUyK
  ok
  Ran 3 tests in 230.082s

The other arches are good, because the cloud image based test is only run on amd64.
All other tests run everywhere and are good.

Here a log of one of the many good local runs: http://paste.ubuntu.com/p/cvMMb9GDTV/

Ok, the forth was a charm - so on LP Infra it seems to be flaky which is a separate issue.
This is the working log from the ppa: https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-cosmic-ci-train-ppa-service-3279/cosmic/amd64/o/open-iscsi/20180531_060452_bd5ba@/log.gz

That said the real issue is fixed in the package I will upload.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package open-iscsi - 2.0.874-5ubuntu4

---------------
open-iscsi (2.0.874-5ubuntu4) cosmic; urgency=medium

  * debian/tests/install: ignore the potential stderr of the probing command
    that is meant to activate iscsid indirectly via the socket.

open-iscsi (2.0.874-5ubuntu3) cosmic; urgency=medium

  * make iscsid socket activated to only activate it as-needed (LP: #1755858)
    - debian/iscsid.socket: systemd socket file for iscsid
    - debian/open-iscsi.service: do not start or check iscsid.service
    - debian/rules: install and enable iscsid.socket
    - debian/patches/iscid-conf-use-systemd.socket-patch: default to the socket
    - debian/open-iscsi.postinst:
      + run restart logic only if service is running on upgrade
      + drop no more reachable upgrade path that affects iscsid
      + disable iscsid.service on upgrade
      + handle iscsid.socket to be started if the service is not running yet
    - debian/tests/install: fix tests to work with socket activation

 -- Christian Ehrhardt <email address hidden> Wed, 30 May 2018 15:42:12 +0200

Changed in open-iscsi (Ubuntu):
status: Fix Committed → Fix Released

Done for Cosmic.

Steve has mentioned he'd like to see an SRU of this to Bionic, but I think the change carries too much regression potential for an SRU. Please feel free to discuss.
To make it clear I'm not working on Bionic, I'll set Won't Fix for now.

Changed in open-iscsi (Ubuntu Bionic):
status: Triaged → Won't Fix
Steve Langasek (vorlon) on 2018-06-01
Changed in open-iscsi (Ubuntu Bionic):
status: Won't Fix → Triaged
description: updated
Steve Langasek (vorlon) on 2018-06-07
description: updated
Steve Langasek (vorlon) on 2018-06-07
Changed in open-iscsi (Ubuntu Bionic):
assignee: nobody → Dimitri John Ledkov (xnox)
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.