Shutdown hang on 16.04 with iscsi targets

Bug #1569925 reported by Matt Schulte on 2016-04-13
152
This bug affects 25 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Medium
Rafael David Tinoco
Trusty
Medium
Rafael David Tinoco
Xenial
Medium
Rafael David Tinoco
Artful
Medium
Rafael David Tinoco
Bionic
Medium
Rafael David Tinoco
open-iscsi (Ubuntu)
Medium
Rafael David Tinoco
Trusty
Medium
Rafael David Tinoco
Xenial
Medium
Rafael David Tinoco
Artful
Medium
Rafael David Tinoco
Bionic
Medium
Rafael David Tinoco

Bug Description

[Impact]

 * open-iscsi users might face hangs during OS shutdown.
 * hangs can be caused by manual iscsi configuration/setup.
 * hangs can also be caused by bad systemd unit ordering.
 * if transport layer interface vanishes before lun is
   disconnected, then the hang will happen.
 * check comment #89 for the fix decision.

[Test Case]

 * a simple way of reproducing the kernel hang is to disable
   the open-iscsi logouts. this simulates a situation when
   a service has shutdown the network interface, used by
   the transport layer, before proper iscsi logout was done.

   $ log into all iscsi luns

   $ systemctl edit --full open-iscsi.service
   ...
   #ExecStop=/lib/open-iscsi/logout-all.sh
   ...

   $ sudo reboot # this will make server to hang forever
                 # on shutdown

[Regression Potential]

 * the regression is low because the change acts on the iscsi
   transport layer code ONLY when the server is in shutdown
   state.

 * any error in logic would only appear during shutdown and
   would not cause any harm to data.

[Other Info]

 * ORIGINAL BUG DESCRIPTION

I have 4 servers running the latest 16.04 updates from the development branch (as of right now).

Each server is connected to NetApp storage using iscsi software initiator. There are a total of 56 volumes spread across two NetApp arrays. Each volume has 4 paths available to it which are being managed by device mapper.

While logged into the iscsi sessions all I have to do is reboot the server and I get a hang.

I see a message that says:

  "Reached target Shutdown"

followed by

  "systemd-shutdown[1]: Failed to finalize DM devices, ignoring"

and then I see 8 lines that say:

  "connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4311815***, last ping 43118164**, now 4311817***"
  "connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4311815***, last ping 43118164**, now 4311817***"
  "connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4311815***, last ping 43118164**, now 4311817***"
  "connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4311815***, last ping 43118164**, now 4311817***"
  "connection5:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4311815***, last ping 43118164**, now 4311817***"
  "connection6:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4311815***, last ping 43118164**, now 4311817***"
  "connection7:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4311815***, last ping 43118164**, now 4311817***"
  "connection8:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4311815***, last ping 43118164**, now 4311817***"
  NOTE: the actual values of the *'s differ for each line above.

This seems like a bug somewhere but I am unaware of any additional logging that I could turn on to pinpoint the problem.

Note I also have similar setups that are not doing iscsi and they don't have this problem.

Here is a screenshot of what I see on the shell when I try to reboot:

(https://launchpadlibrarian.net/291303059/Screenshot.jpg)

This is being tracked in NetApp bug tracker CQ number 860251.

If I log out of all iscsi sessions before rebooting then I do not experience the hang:

iscsiadm -m node -U all

We are wondering if this could be some kind of shutdown ordering problem. Like the network devices have already disappeared and then iscsi tries to perform some operation (hence the ping timeouts).

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1569925/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
tags: added: xenial
affects: ubuntu → systemd (Ubuntu)
Martin Pitt (pitti) wrote :

My first suspicion is that the interface that the iscsi device is on lives in /etc/network/interfaces (or interfaces.d/*) somewhere and thus is being shut down by networking.service. Ordinarily this network interface is being set up in initramfs by open-iscsi and should not otherwise be configured/touched by the OS (i. e. not by ifupdown, NetworkManager, or networkd).

 - Which network interface is the iscsi device on, and how did you configure this?
 - Please attach /etc/network/interfaces*

It would also be useful to attach a complete shutdown log. Do a reboot and attach /var/log/syslog, that should have sufficient data for the network interface shutdowns (it won't cover the open-iscsi error though as that happens too late, but we have that part in the screenshot).

Changed in systemd (Ubuntu):
status: New → Incomplete
Matt Schulte (gimpyestrada) wrote :

root@ICTM1612S02H1:/etc/network# cat interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

source /etc/network/interfaces.d/*

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eno1
iface eno1 inet dhcp

auto enp130s0
iface enp130s0 inet6 static
address fe80::192:12:21:115
netmask 64
mtu 9000

auto enp4s0f0
iface enp4s0f0 inet6 static
address fe80::192:12:22:115
netmask 64
mtu 9000

Matt Schulte (gimpyestrada) wrote :

In the above, the two NICs used for iSCSI traffic are the enp130s0 and the enp4s0f0.

In case you were wondering I have another server running ipv4 and it also hangs.

Martin Pitt (pitti) wrote :

Do you have your root partition on iscsi, or just some auxiliary ones like /opt?

If it's the root partition, then you can't put that on an interface which is "auto" in /etc/network/interfaces -- ifupdown will tear it down during shutdown and sever the connection, explaining the hang. In this case, open-iscsi's initramfs integration will already bring it up. (Hence my question how you set this up).

Matt Schulte (gimpyestrada) wrote :

Here is the requested syslog.

Reboot happens after:

Apr 14 09:49:42 ICTM1612S02H1 root: MDS: Rebooting Now

Matt Schulte (gimpyestrada) wrote :

Nothing for the system is on iscsi. They are just extra volumes that I use for generating block or filesystem IO to our storage.

Martin Pitt (pitti) wrote :

Unfortunately rsyslog gets shut down too early, so we don't see much from this. Can you please re-try with enabling persistent journal:

    sudo mkdir -p /var/log/journal
    sudo systemd-tmpfiles --create --prefix /var/log/journal

then reboot once to actually enable on-disk journal, then reboot again to record the shutdown sequence. After that, please do

   sudo journalctl -b -1 > /tmp/journal.txt

and attach /tmp/journal.txt here.

And again, how exactly did you set this up? Is this reproducible in a VM somehow? Can you please give me /etc/fstab and other files where you configured this?

Thanks!

Matt Schulte (gimpyestrada) wrote :

Here is the requested journal.txt file.

I had a bit of difficulty after performing the requested operations, my server kept booting into the (initramfs) prompt.

The journal log may contain more than one reboot attempt.

Matt Schulte (gimpyestrada) wrote :

This contains the configuration of my iscsi sessions. It is a tar of the entire /etc/iscsi folder, it should show what we've got going on.

Matt Schulte (gimpyestrada) wrote :

And just for good measure this is my /etc/network/interfaces file:

# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

source /etc/network/interfaces.d/*

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eno1
iface eno1 inet dhcp

auto enp130s0
iface enp130s0 inet6 static
address fe80::192:12:21:115
netmask 64
mtu 9000

auto enp4s0f0
iface enp4s0f0 inet6 static
address fe80::192:12:22:115
netmask 64
mtu 9000

Matt Schulte (gimpyestrada) wrote :

Contents of fstab:

# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
# / was on /dev/sda1 during installation
UUID=3f62d8b3-3afe-4cd7-8d36-49101b1d7b02 / ext4 errors=remount-ro 0 1
# swap was on /dev/sda5 during installation
UUID=9c54f202-7e90-4633-9134-706e6d7080fc none swap sw 0 0
10.113.242.59:/IOP /opt/iop nfs noauto 0 0

#IOMNT START#
/dev/mapper/3600a09800059d66a0000fa5c56f3e78f /iomnt-3600a09800059d66a0000fa5c56f3e78f ext3 nobarrier,_netdev 0 0
/dev/mapper/3600a09800059d7b40000223656f3ef42 /iomnt-3600a09800059d7b40000223656f3ef42 ext4 nobarrier,_netdev,discard 0 0
/dev/mapper/3600a09800059d66a0000fa5456f3e782 /iomnt-3600a09800059d66a0000fa5456f3e782 xfs nobarrier,_netdev,discard 0 0
/dev/mapper/3600a09800059d66a0000fa5056f3e77b /iomnt-3600a09800059d66a0000fa5056f3e77b btrfs nobarrier,_netdev,discard 0 0
/dev/mapper/360080e50003415e60000147d5605f7fb /iomnt-360080e50003415e60000147d5605f7fb ext3 nobarrier,_netdev 0 0
/dev/mapper/360080e500034153e0000ed065605faa2 /iomnt-360080e500034153e0000ed065605faa2 ext4 nobarrier,_netdev,discard 0 0
#IOMNT END#

sles (slesru) wrote :

I have the same problem , target is HPE MSA, and problem happens not every time, about in 1/5 of reboots- I run 16.04 , so I just tested...
Looks like iscsi not always stopped before network interfaces are down.

sles (slesru) wrote :

>Looks like iscsi not always stopped before network interfaces are down.

No, just reproduced this, iscsi was logged out...

sles (slesru) wrote :

Because it is not always reproduced, I guess this is dependency problem.
Looked into /etc/init.d/iscsid

# Required-Stop: $network $local_fs sendsigs

I guess we need $remote_fs here?

sles (slesru) wrote :

OK, I'm trying reboot and reboot and now I think this is kernel bug and it is multipath related...

Matt Schulte (gimpyestrada) wrote :

I definitely have remote filesystems mounted (using _netdev if that matters) when I experience this problem.

Matt Schulte (gimpyestrada) wrote :

I have finally been able to verify that this is reproducible with the released 16.04 from the ISO.

What else can be provided here to move this bug along?

(This is being tracked internally in CQ 860251)

rc_micha (m-niendorf) wrote :

I have the same problem with my multipath iscsi session. I just logged in without mounting the volumes.
While rebooting I get always the message "systemd-shutdown[1]: Failed to finalize DM devices, ignoring" but in most cases the server just ignores that an reboots.

The iscsi multipath volumes are blacklisted in lvm config and are currently not mounted.

Matt Schulte (gimpyestrada) wrote :

rc_micha: Your volumes aren't mounted, but the iscsi sessions are logged in? If you log out of your iscsi sessions does it shutdown clean?

Andrew Morton (amorton12) wrote :

Seeing what I believe is this bug on a 16.04 install that boots from iSCSI LUN. When I shut down, it goes just past shutting down the network interfaces, then slows down and starts throwing task blocked messages from the kernel. It eventually gets down to "Reached Target Shutdown" before getting completely stuck. I have to manually shut down or reboot the machine from that point.

daigang (daigang0701) wrote :

I got same preblom
look iscsi not logout before Network-manager cut the network

so iscsi can not logout.

Matt Schulte (gimpyestrada) wrote :

There is clearly something wrong here.

FWIW This issue did not exist in 14.04.

Launchpad Janitor (janitor) wrote :

[Expired for systemd (Ubuntu) because there has been no activity for 60 days.]

Changed in systemd (Ubuntu):
status: Incomplete → Expired
Matt Schulte (gimpyestrada) wrote :

If this is still a problem, what do we do now? Do I just keep testing with new releases and file new bugs?

Matt Schulte (gimpyestrada) wrote :

For completeness, I upgraded to 16.04.1, the issue is still present.

I upgraded to the latest xenial-updates packages as of 10/24/16 and the issue is still present.

I then re-installed with 16.10 and the issue is still present. I guess I'll open a new bug for 16.10 and refer back to this one.

Changed in systemd (Ubuntu):
status: Expired → Confirmed
description: updated
Changed in systemd (Ubuntu):
importance: Undecided → High

Seeing a similar issue here with so called "diskless" clients. That is, all of their filesystems are network mounts including the root filesystem. These systems use iscsi targets for the root filesystem and during shutdown the network is turned off before the file I/O operations are completed which causes the system to hang with filesystem IO errors.

I'm currently looking to see if there is some way to make systemd keep the network up no matter what, all the way until the machine physically reboots or shuts down (ie. disable the network shutdown completely).

This worked in 14.04 and got broken in 16.04 apparently due to the switch to systemd as far as I can tell (great tool, isn't it?).

Matt Schulte (gimpyestrada) wrote :

Any idea on this one folks?

Nish Aravamudan (nacc) wrote :

So let's separate out two issues in this bug.

@gimpeystrada, I believe your issue was always with iSCSI disks that were not in-use as the root/boot disk, correct? Looking at your fstab, I think also you don't have multipath over iSCSI, correct?

For anyone else indicating this bug is also affecting them, if @gimpeystrada replies in the affirmative to the above (Not using iSCSI for root disk, not using multipath), then I would like to use separate bugs for those issues. They may all have the same root-cause, but it becomes rather noisy, otherwise.

@gimpestrada, have you tried 16.10 or 17.04 to see if it has been fixed already? Just curious.

Changed in systemd (Ubuntu):
assignee: nobody → Nish Aravamudan (nacc)
Matt Schulte (gimpyestrada) wrote :

iSCSI disks are not the root/boot disks.

Multipath IS in use. Hence the /dev/mapper at the beginning of the disk names.

Yes I have tried 16.10, it is still present and I have another bug for that release. https://bugs.launchpad.net/bugs/1636862

Nish Aravamudan (nacc) wrote :

@gimpeystrada: thank you for the prompt response!

Err, right on multipath. Is it at all possible to test without multipath in use? To help narrow down if it's really in iSCSI or if it's in the multipath layer?

Note additional bugs aren't required (I understand at some point this one may have expired, but we'll end up creating tasks here for all affected releases).

Matt Schulte (gimpyestrada) wrote :

It is possible to test without multipath, this will take some time as I no longer have anything with 16.04 installed. For fun I may go ahead and try with 17.04 so we can figure out if it still exists then we'll kind of kill two birds with one stone.

Nish Aravamudan (nacc) wrote :

No problem -- also, you can just test in 16.10 without multipath. I can spend time reproducing it, etc. later to verify whatever fix we come up with is correct.

Matt Schulte (gimpyestrada) wrote :

Didn't take as long as I thought. Installed the most recent build of Zesty (zesty-server-amd64.iso 2017-02-22 06:54 676M).

I was able to:

1. Confirm the issue still exists.
2. Confirm that the issue is not caused by multipath (i.e. it still occurs during reboot when multipathing is disabled).

Scott (scopa) wrote :

Fresh install 16.04 download ISO from ubuntu
Can confirm this issue as well.
If I manually log out of the iscsi session reboot no issue
If I am logged into iscsi session reboot hangs. Need to physically power cycle server

On Wed, May 24, 2017 at 1:45 PM, Scott <email address hidden> wrote:
> Fresh install 16.04 download ISO from ubuntu
> Can confirm this issue as well.
> If I manually log out of the iscsi session reboot no issue

Interesting data point.

> If I am logged into iscsi session reboot hangs. Need to physically power cycle server

I think the open-iscsi or iscsid services have a killall sessions
portion when the service shuts down. I wonder if that's either a) not
running or b) somehow incorrect. I could imagine for iSCSI root it'd
be an issue, but for non-root, I'm not sure why that's not working.

Nish Aravamudan (nacc) wrote :
Download full text (3.4 KiB)

@gimpeystrada: Thank you for providing your journal logs (in the future, providing the raw journal file is better, as then we can run `journalctl` on it locally). From the logs:

Apr 26 08:52:13 ICTM1612S02H1 iscsiadm[3234]: iscsiadm: initiator reported error (8 - connection timed out)
Apr 26 08:52:13 ICTM1612S02H1 iscsiadm[3234]: iscsiadm: Could not log into all portals
Apr 26 08:52:13 ICTM1612S02H1 iscsiadm[3234]: :0113,3260] successful.
Apr 26 08:52:13 ICTM1612S02H1 systemd[1]: open-iscsi.service: Child 3234 belongs to open-iscsi.service
Apr 26 08:52:13 ICTM1612S02H1 systemd[1]: open-iscsi.service: Main process exited, code=exited, status=8/n/a
Apr 26 08:52:13 ICTM1612S02H1 systemd[1]: open-iscsi.service: Changed start -> failed
Apr 26 08:52:13 ICTM1612S02H1 systemd[1]: Sent message type=signal sender=n/a destination=n/a object=/org/freedesktop/systemd1 interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=936 reply_cookie=0 error=n/a
Apr 26 08:52:13 ICTM1612S02H1 systemd-logind[2801]: Got message type=signal sender=:1.2 destination=n/a object=/org/freedesktop/systemd1 interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=936 reply_cookie=0 error=n/a
Apr 26 08:52:13 ICTM1612S02H1 systemd[1]: open-iscsi.service: Job open-iscsi.service/start finished, result=failed
Apr 26 08:52:13 ICTM1612S02H1 systemd[1]: Failed to start Login to default iSCSI targets.
Apr 26 08:52:13 ICTM1612S02H1 systemd[1]: Sent message type=signal sender=n/a destination=n/a object=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=JobRemoved cookie=937 reply_cookie=0 error=n/a
Apr 26 08:52:13 ICTM1612S02H1 systemd[1]: open-iscsi.service: Unit entered failed state.
Apr 26 08:52:13 ICTM1612S02H1 systemd[1]: open-iscsi.service: Failed with result 'exit-code'.
Apr 26 08:52:13 ICTM1612S02H1 systemd[1]: open-iscsi.service: cgroup is empty

I *think* the issue in this case is that open-iscsi.service is actually in a failed state before shutdown, which I think means it does not run the ExecStop commands? Can you verify that is the case (boot your system, `systemctl is-system-running` should indicate "degraded", `systemctl status open-iscsi.service` should indicate Failed). The problem is that the open-iscsi.service has:

ExecStartPre=/bin/systemctl --quiet is-active iscsid.service
ExecStart=/sbin/iscsiadm -m node --loginall=automatic
ExecStart=/lib/open-iscsi/activate-storage.sh
ExecStop=/lib/open-iscsi/umountiscsi.sh
ExecStop=/bin/sync
ExecStop=/lib/open-iscsi/logout-all.sh

This indicates it, on 'start' of the service, it will try to login to and active all configured iSCSI targets (the two ExecStart lines). However, if either of those fail (as they did in the journal in this case), the ExecStop lines *do not* run. From `man systemd.service`:

           Note that if any of the commands specified in ExecStartPre=,
           ExecStart=, or ExecStartPost= fail (and are not prefixed with "-",
           see above) or time out before the service is fully up, execution
           continues with commands specified in ExecStopPost=, the commands in
           ExecStop= are skipped.

So, I think that we should be using...

Read more...

Matt Schulte (gimpyestrada) wrote :

My test configuration has long since been re-tasked. I will eventually be able to come back and test as you ask, but it may take a few weeks. If anyone else on the bug has their systems up and available, please test and reply.

Changed in systemd (Ubuntu Xenial):
assignee: nobody → Dimitri John Ledkov (xnox)
importance: Undecided → High
Changed in systemd (Ubuntu Xenial):
status: New → Confirmed
Changed in systemd (Ubuntu Xenial):
assignee: Dimitri John Ledkov (xnox) → nobody
Changed in systemd (Ubuntu Xenial):
assignee: nobody → Rafael David Tinoco (inaddy)
status: Confirmed → In Progress
32 comments hidden view all 112 comments
Rafael David Tinoco (inaddy) wrote :
Download full text (6.4 KiB)

Okay, I removed almost everything out of equation:

- removed networking interfaces from systemd
- removed open-iscsi login/logout logic from systemd
- used a single network interface for the logins, on a single iscsi portal

was still able to reproduce the issue by:

- doing a simple login after configuring the network device:
  ./net-start.sh ; sleep 1 ; sudo iscsiadm -m node --login

- shutting down the network device before any logout:
  ./net-stop.sh ; sudo shutdown -h now

There was no systemd service order in play (in between open-iscsi, iscsid and networking / ifupdown scripts) and I was able to cause the same issue 100%, same messages, same symptoms. This tells us that, definitely turning down the network while still logged iscsi sessions is causing the hangs (check #3 for why).

Summary of Causes

1)

network shuts down -> iscsi luns are logged out (attempt) -> iscsid daemon is shutdown
or
network shuts down -> iscsi daemon is shutdown -> iscsi luns are logged out

- logout would fail due to lack of network
- iscsi daemon would die and/or logout wouldn't work
- iscsi sessions would still be there
- system would hang (transport layer can't logout)

2)

iscsi daemon is shutdown -> iscsi luns are logged out -> network shuts down

- this would cause the bug i'm mentioning: daemon dies, logout doesn't work.
- some iscsi sessions would still be there
- system would hang (transport layer can't logout)
- (to check: open-iscsi.service ExecStop= running in parallel ?!?)

3)

I used KVM watchdog virtual device + NMI from host to crash the guest during the hang

http://pastebin.ubuntu.com/25394744/

And, finally, the hang is because kernel is hanged during its shutdown logic.

> 0 0 0 ffffffff81e11500 RU 0.0 0 0 [swapper/0]
> 0 0 1 ffff8801a6a20e00 RU 0.0 0 0 [swapper/1]
> 0 0 2 ffff8801a6a21c00 RU 0.0 0 0 [swapper/2]
> 0 0 3 ffff8801a6a22a00 RU 0.0 0 0 [swapper/3]

ALL CPUs were idling during the hang.

crash> runq
CPU 0 RUNQUEUE: ffff8801ae016e00
  CURRENT: PID: 0 TASK: ffffffff81e11500 COMMAND: "swapper/0"
  RT PRIO_ARRAY: ffff8801ae016fb0
     [no tasks queued]
  CFS RB_ROOT: ffff8801ae016e98
     [no tasks queued]

CPU 1 RUNQUEUE: ffff8801ae096e00
  CURRENT: PID: 0 TASK: ffff8801a6a20e00 COMMAND: "swapper/1"
  RT PRIO_ARRAY: ffff8801ae096fb0
     [no tasks queued]
  CFS RB_ROOT: ffff8801ae096e98
     [no tasks queued]

CPU 2 RUNQUEUE: ffff8801ae116e00
  CURRENT: PID: 0 TASK: ffff8801a6a21c00 COMMAND: "swapper/2"
  RT PRIO_ARRAY: ffff8801ae116fb0
     [no tasks queued]
  CFS RB_ROOT: ffff8801ae116e98
     [no tasks queued]

CPU 3 RUNQUEUE: ffff8801ae196e00
  CURRENT: PID: 0 TASK: ffff8801a6a22a00 COMMAND: "swapper/3"
  RT PRIO_ARRAY: ffff8801ae196fb0
     [no tasks queued]
  CFS RB_ROOT: ffff8801ae196e98
     [no tasks queued]

NO task was scheduled to run.

crash> ps -u
   PID PPID CPU TASK ST %MEM VSZ RSS COMM
      1 0 1 ffff8801a69b8000 UN 0.0 15484 3204 systemd-shutdow

There was just ONE SINGLE user task running (systemd-shutdown)

crash> set ffff8801a69b8000
 ...

Read more...

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1569925

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Xenial):
status: New → Incomplete
Changed in linux (Ubuntu Zesty):
status: New → Incomplete
Rafael David Tinoco (inaddy) wrote :

## SUMMARY 1

Affects open-iscsi (NOT systemd):

* ISCSID NEEDS TO BE UP & RUNNING (AND LOGGED TO PORTAL) FOR THE LOGOUT TO WORK

https://pastebin.canonical.com/197359/

Daemon shutdown documentation in iscsiadm command says (about stopping iscsid daemon with iscsiadm -k):
""" ... This will immediately stop all iscsid operations and shutdown iscsid. It does not logout any sessions. Running this command is the same as doing "killall iscsid". Neither should normally be used, because if iscsid is doing error recovery or if there is an error while iscsid is not running, the system may not be able to recover. This command and iscsid's SIGTERM handling are experimental. """

Basically meaning you're not able to shutdown the daemon in a clean manner if you have connected sessions. And basically, checking the pastebin, you see the need for running iscsid daemon to logout through iscsiadm command.

So from:

(systemctl edit --full open-iscsi.service)
...
Wants=network-online.target remote-fs-pre.target iscsid.service
After=network-online.target iscsid.service
Before=remote-fs-pre.target
...
ExecStartPre=/bin/systemctl --quiet is-active iscsid.service
ExecStart=/sbin/iscsiadm -m node --loginall=automatic
ExecStart=/lib/open-iscsi/activate-storage.sh
ExecStop=/lib/open-iscsi/umountiscsi.sh
ExecStop=/bin/sync
ExecStop=/lib/open-iscsi/logout-all.sh
...

and

(systemctl edit --full iscsid.service)
...
Wants=network-online.target remote-fs-pre.target
Before=remote-fs-pre.target
After=network.target network-online.target
...
ExecStartPre=/lib/open-iscsi/startup-checks.sh
ExecStart=/sbin/iscsid
ExecStop=/sbin/iscsiadm -k 0 2
...

If umountiscsi.sh (it doesn't kill applications using a mount point) can't umount a mount point (from fstab generator or a mount unit) that is still being used, the logout-all.sh will fail for logging out the session of the disk being used. But this service unit would be considered stopped, allowing "iscsid.service" unit to also be shutdown (killing the iscsid daemon and leaving an opened iscsi session).

This would be problematic for iscsi root disks AND for the kernel issue (comment #73), and this is being taken care by Francis Ginther (fginther) in the next iscsi SRU: https://pastebin.canonical.com/196463/ by creating an iscsi-cleanup.service that will run a script that logs out all remaining iscsi sessions (including /) right before kernel shutdown is called by systemd-shutdown like showed in comment #73.

Changed in open-iscsi (Ubuntu Xenial):
importance: Undecided → Medium
status: New → In Progress
Changed in open-iscsi (Ubuntu Zesty):
importance: Undecided → Medium
status: New → In Progress
Changed in open-iscsi (Ubuntu Artful):
importance: Undecided → Medium
status: New → In Progress
Changed in linux (Ubuntu Xenial):
importance: Undecided → Medium
status: Incomplete → In Progress
Changed in linux (Ubuntu Zesty):
importance: Undecided → Medium
status: Incomplete → In Progress
Changed in linux (Ubuntu Artful):
importance: Undecided → Medium
status: Incomplete → In Progress
no longer affects: systemd (Ubuntu)
Changed in systemd (Ubuntu Xenial):
status: In Progress → Triaged
Changed in systemd (Ubuntu Zesty):
importance: Undecided → Medium
status: New → Triaged
Changed in systemd (Ubuntu Artful):
assignee: Nish Aravamudan (nacc) → nobody
importance: High → Medium
status: Confirmed → Triaged
no longer affects: systemd (Ubuntu Xenial)
no longer affects: systemd (Ubuntu Zesty)
no longer affects: systemd (Ubuntu Artful)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Rafael David Tinoco (inaddy)
Changed in linux (Ubuntu Zesty):
assignee: nobody → Rafael David Tinoco (inaddy)
Changed in linux (Ubuntu Artful):
assignee: nobody → Rafael David Tinoco (inaddy)
3 comments hidden view all 112 comments
Rafael David Tinoco (inaddy) wrote :

2 things for previous comment #75:

1)
I pasted wrong pastebin URL. I'll describe the fix: its a service unit file that runs at the end of shutdown, right before it, re-using /run (tmpfs). Service has a preparation script that copies binaries from initrd image - so there is a minimum execution environment available - and runs the second script. Second script initializes iscsid daemon and waits in a loop for all the portals to be ONLINE and existing sessions to be LOGGED. After that it does the logout on all existing sessions (this case this would be leftovers from my previous comment AND / root disks).

2)
I tested this approach and it works good for iscsi root disks because there is a network interface that stays connected all the time. In the case of leftovers from previous unsuccessful umount attempts, the network.target would be already gone (with networkd / ifupdown interfaces already shutdown). IF I keep my interfaces configured, then this approach works for root disks AND for leftovers. For this approach to be successful for this case, this "cleanup" service would have to raise existing interfaces up again for the iscsid to login and logout to work (OR the kernel hang be resolved, then leftovers would be just left there, without logouts).

Changed in open-iscsi (Ubuntu Xenial):
assignee: nobody → Rafael David Tinoco (inaddy)
Changed in open-iscsi (Ubuntu Zesty):
assignee: nobody → Rafael David Tinoco (inaddy)
Changed in open-iscsi (Ubuntu Artful):
assignee: nobody → Rafael David Tinoco (inaddy)
tags: added: rls-aa-notfixing
Rafael David Tinoco (inaddy) wrote :

I created a workaround for this issue while its being worked on:

http://pastebin.ubuntu.com/25519820/

Create a file: /lib/systemd/system-shutdown/debug.sh and place the contents in it.

This workaround (based on some ideas from sil2100 and slangasek for iscsi / umount) is basically bringing the interfaces up and remounting / so it can finally cleanup everything.

Things to note:

- It ONLY RUNS if there are iscsi leftovers
- It uses ifupdown only for networking (/etc/network/interfaces)
- It has to remount / to do networking and to run iscsid
- If it fails to bring network it will hang like before (kernel issue)
- It waits for iscsi to be logged in again (might take awhile to shutdown)
- If logout fails, it hangs again (unless the network is left configured, i could change it)

What is this script different then altering open-iscsi.service ?

It runs at the very end of systemd shutdown and it is very unlikely that there are any services holding references to iscsi mounts, disallowing them to be logged out.

Now I'll test Debian SID and check if this is found there, to open a bug in Debian project as well.
Before moving on into open-iscsi services - to create a cleanup unit file for the open-iscsi package, like this workaround - i'll dig into the kernel issue. I'm afraid no fix will be as good as making sure kernel let the queued I/O cmd go. I have also to make sure this workaround is changed to allow root iscsi to be logged out.

tags: added: id-598b6543459f8ccf5dfbc04c
Matthijs Hoekstra (mahoekst) wrote :

Even with the script a shutdown/reboot takes a very long time. Any expectations when a fix will be ready? I hit this all the time on 2 installed servers.

Rafael David Tinoco (inaddy) wrote :
Download full text (3.2 KiB)

Hello Matthijs

Unfortunately the best way to make this not to happen is by fixing the kernel hang situation, when kernel calls sd_sync_cache() to every configured device before the shutdown. There is a single I/O cmd hanging in all scsi paths and the I/O error is never propagated to block layer (despite iscsi having proper I/O error settings). I'm finishing analysing some kernel dumps so I can finally understand what is happening in the transport layer (this happens with more recent kernels also).

The workaround was to create a script that would restore the iscsi connection, wait for the login to happen again and the paths are back online, and cleanly logout, allowing the sd_sync_cache() operation to be finalized.

If you are facing this problem, I know for sure that your iscsi connections are not being finalized before the network is off. This means that you have to pay attention on how you configured your iscsi disks:

- guarantee that iscsiadm was configured with "interfaces" so it works on startup:

  sudo iscsiadm -m iface -I ens4 --op=new -n iface.hwaddress -v 52:54:00:b4:21:bb
  sudo iscsiadm -m iface -I ens7 --op=new -n iface.hwaddress -v 52:54:00:c2:34:1b

- the discovery/login has to be made AFTER the iscsiadm had interfaces added

  sudo iscsiadm -m discovery --op=new --op=del --type sendtargets --portal $SERVER1
  sudo iscsiadm -m discovery --op=new --op=del --type sendtargets --portal $SERVER2

  # iscsiadm -m node --loginall=automatic HAS TO WORK or else init scripts will fail

  http://pastebin.ubuntu.com/25894472/

- configure the volumes in /etc/fstab with "_netdev" parameter for systemd unit ordering

  LABEL=BLUE /blue ext4 defaults,_netdev 0 1
  LABEL=GREEN /green ext4 defaults,_netdev 0 1
  LABEL=PURPLE /purple ext4 defaults,_netdev 0 1
  LABEL=RED /red ext4 defaults,_netdev 0 1
  LABEL=YELLOW /yellow ext4 defaults,_netdev 0 1

You have to make sure open-iscsi and iscsid systemd units are started after the network is available and are stopped before they disappear. That might be your problem, if configuration above is correct.

inaddy@iscsihang:~$ systemctl edit --full iscsid.service
inaddy@iscsihang:~$ systemctl edit --full open-iscsi.service

The defaults are:

[Unit]
Description=iSCSI initiator daemon (iscsid)
Documentation=man:iscsid(8)
Wants=network-online.target remote-fs-pre.target
Before=remote-fs-pre.target
After=network.target network-online.target

and

[Unit]
Description=Login to default iSCSI targets
Documentation=man:iscsiadm(8) man:iscsid(8)
Wants=network-online.target remote-fs-pre.target iscsid.service
After=network-online.target iscsid.service
Before=remote-fs-pre.target

So you can see that iscsid.service runs BEFORE open-iscsi.service. In my case, I'm configuring network using rc-local.service (since this is my lab) and I had to guarantee the ordering also:

If, after configuring your system like this, you still face problems, you can use this script:

http://pastebin.ubuntu.com/25894592/

And provide me the DEBUG=/.shutdown.log file, created after its execution, attached to this launchpad case. Its likely that you will have hang iscsi connections for some reason (services ordering, lack of volume...

Read more...

Amanda Gartner (amandagartner) wrote :

I followed all of the setup steps, but was still encountering the ping timeout issue at reboot. I then ran the script http://pastebin.ubuntu.com/25894592/ you provided and am no longer hitting the issue. Was the script supposed to fix the issue? Please let me know. Thank you!

Rafael David Tinoco (inaddy) wrote :

Hello Amanda,

I'm actively working into this, will provide better feedback soon. The script is just a workaround for now, so people can shut down properly, if they are facing this.

Despite what userland does, the kernel will always potentially hang with iscsi sessions left opened during shutdown. If, after following the setup steps, you still faced the issue, its possible that your shutdown systemd unit ordering is wrong (your mount points are being unmounted before they are released). I'm working in the kernel issue, so, it shall cover all the userland-created situations.

Rafael David Tinoco (inaddy) wrote :

Long story short -> https://goo.gl/vPyh8C

Basically the block device enqueues the last request (a SYNC scsi command coming from sd_shutdown) for every scsi device there is on the system. Unfortunately, since the OS is shutting down, in between the block request and its execution, we have userland (systemd) killing iscsid, without proper logout, and/or removing the network.

What happens next is that the mid layer (SCSI) tries to deliver the request through the transport layer (iscsi_tcp_sw) but it fails since the transport layer checks the session status and finds out that the session is not in LOGIN state.

The default behaviour of the transport layer (iscsi_tcp_sw) in such situation is to tell the mid-layer to keep resetting the request timeout timer while it tries to recover (something that will never happen because the network is gone).

Changing that default behavior to state that the scsi command was NOT handled by the transport layer (iscsi_tcp_sw) implies in making the scsi timeout function to try to "abort" the scsi command, which also creates other commands that will timeout because of the transport layer.

Best scenario so far was to change BLK_EH_NOT_HANDLED for BLK_EH_HANDLED in the scsi_times_out function and make the kernel to be able to shutdown. By doing that, I'm confirming to block device something that DID NOT happen, meaning that the command never left the transport layer.

This might be ONE of possible ways to fix this: I can mark in the transport layer that I have timed-out DURING the shutdown procedure, cancelling all the block device requests without having to invoke the scsi error handling mechanism, generating more traffic in transport layer (what would also cause more timeouts, causing a loop in the problematic sequence).

Anyway, I'll get back to this next week and hopefully identify best course of action.

Rafael David Tinoco (inaddy) wrote :

I have proposed a fix to the kernel iscsi transport code. You can follow upstream discussion, if any, in the following address:

https://marc.info/?t=151268397600008&r=1&w=2

The patch proposal is here:

https://marc.info/?l=linux-scsi&m=151268396227285&w=2

If upstream chooses to accept my proposal, which fixes the issue like demonstrated here:

https://marc.info/?l=linux-scsi&m=151274369910906&w=2

Then I'll propose Canonical kernel-team a SRU based on the upstream patch.

For now this case is on hold for upstream discussion/decision.

Thank you
-Rafael

That's awesome sir, thank you for your diligence.

Rafael David Tinoco (inaddy) wrote :

The scsi/iscsi team has accepted by fix proposal and merged the patch for 4.16 merge window:

https://<email address hidden>/msg10111.html

I'll propose the cherry-pick as SRU for Ubuntu kernels as soon as it is merged.

tags: added: sts
Rafael David Tinoco (inaddy) wrote :

Changed open-iscsi to opinion since I choose to fix the kernel instead of fixing Userland. No matter what you do in userland, the kernel had space for freezing and hanging in different scenarios depending on how userland disconnected the transport layer. I kept the "linux" source package as being affected and will SRU it through Ubuntu Kernel Team.

Changed in open-iscsi (Ubuntu Xenial):
status: In Progress → Opinion
Changed in open-iscsi (Ubuntu):
status: In Progress → Opinion
Changed in open-iscsi (Ubuntu Zesty):
status: In Progress → Opinion
Changed in open-iscsi (Ubuntu Artful):
status: In Progress → Opinion
description: updated
Changed in open-iscsi (Ubuntu Trusty):
status: New → Opinion
Changed in linux (Ubuntu Trusty):
status: New → In Progress
assignee: nobody → Rafael David Tinoco (inaddy)
importance: Undecided → Medium
Changed in open-iscsi (Ubuntu Trusty):
assignee: nobody → Rafael David Tinoco (inaddy)
importance: Undecided → Medium
no longer affects: open-iscsi (Ubuntu Zesty)
no longer affects: linux (Ubuntu Zesty)
Rafael David Tinoco (inaddy) wrote :

I have just submitted the SRU proposal to kernel team mailing list. I'll wait for the SRU to be made and this public bug to be marked as Fix Committed before verifying the fix again for the final release.

Seth Forshee (sforshee) on 2018-01-23
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Rafael David Tinoco (inaddy) wrote :

Commit has been merged upstream:

commit d754941225a7dbc61f6dd2173fa9498049f9a7ee
Author: Rafael David Tinoco <email address hidden>
Date: Thu Dec 7 19:59:13 2017 -0200

    scsi: libiscsi: Allow sd_shutdown on bad transport

    If, for any reason, userland shuts down iscsi transport interfaces
    before proper logouts - like when logging in to LUNs manually, without
    logging out on server shutdown, or when automated scripts can't
    umount/logout from logged LUNs - kernel will hang forever on its
    sd_sync_cache() logic, after issuing the SYNCHRONIZE_CACHE cmd to all
    still existent paths.

    PID: 1 TASK: ffff8801a69b8000 CPU: 1 COMMAND: "systemd-shutdow"
     #0 [ffff8801a69c3a30] __schedule at ffffffff8183e9ee
     #1 [ffff8801a69c3a80] schedule at ffffffff8183f0d5
     #2 [ffff8801a69c3a98] schedule_timeout at ffffffff81842199
     #3 [ffff8801a69c3b40] io_schedule_timeout at ffffffff8183e604
     #4 [ffff8801a69c3b70] wait_for_completion_io_timeout at ffffffff8183fc6c
     #5 [ffff8801a69c3bd0] blk_execute_rq at ffffffff813cfe10
     #6 [ffff8801a69c3c88] scsi_execute at ffffffff815c3fc7
     #7 [ffff8801a69c3cc8] scsi_execute_req_flags at ffffffff815c60fe
     #8 [ffff8801a69c3d30] sd_sync_cache at ffffffff815d37d7
     #9 [ffff8801a69c3da8] sd_shutdown at ffffffff815d3c3c

    This happens because iscsi_eh_cmd_timed_out(), the transport layer
    timeout helper, would tell the queue timeout function (scsi_times_out)
    to reset the request timer over and over, until the session state is
    back to logged in state. Unfortunately, during server shutdown, this
    might never happen again.

    Other option would be "not to handle" the issue in the transport
    layer. That would trigger the error handler logic, which would also need
    the session state to be logged in again.

    Best option, for such case, is to tell upper layers that the command was
    handled during the transport layer error handler helper, marking it as
    DID_NO_CONNECT, which will allow completion and inform about the
    problem.

    After the session was marked as ISCSI_STATE_FAILED, due to the first
    timeout during the server shutdown phase, all subsequent cmds will fail
    to be queued, allowing upper logic to fail faster.

    Signed-off-by: Rafael David Tinoco <email address hidden>
    Reviewed-by: Lee Duncan <email address hidden>
    Signed-off-by: Martin K. Petersen <email address hidden>

Applied in Bionic:

https://lists.ubuntu.com/archives/kernel-team/2018-January/089509.html

And Fix Committed.

Ack-ed for Trusty, Xenial, Artful:

https://lists.ubuntu.com/archives/kernel-team/2018-January/089774.html
https://lists.ubuntu.com/archives/kernel-team/2018-February/089809.html

Waiting to be Fix Committed there. A soon as it is released to -proposed I'll verify it and provide feedback so it can be Fix Released.

Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Trusty):
status: In Progress → Fix Committed
Matt Schulte (gimpyestrada) wrote :

I apologize for my ignorance, but now that tall these check ins have occurred, how do I know when an installable package has been built and is available in the various repos?

And thank you @inaddy for the diligence getting this one resolved.

Rafael David Tinoco (inaddy) wrote :

Hello Matt, this is part of the SRU procedure (https://wiki.ubuntu.com/StableReleaseUpdates). For the kernel, the SRU is a bit different (https://wiki.ubuntu.com/Kernel/Dev/StablePatchFormat, https://wiki.ubuntu.com/Kernel/StableReleaseCadence). Now that the bug has been marked as "Fix Committed" is means that the patch has been merged in the "master-next" branch of every one of the following Ubuntu versions:

Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Trusty):
status: In Progress → Fix Committed

When the new kernel package is built, it will be placed into -proposed for each of the Ubuntu versions and you will be able to test the kernel and provide feedback here. Since I was the one that fixed it, I'll verify each of the kernels to make sure it fixed the issue. After verification, and at least 7 days in -proposed, the packages shall be moved to -updates.

Hope it helps you understand the fixing steps. Cheers o/

Rafael David Tinoco (inaddy) wrote :

BTW, kernel team usually updates all affecting cases with the version that was built that address this issue (so you know what to text in -proposed).

Matt Schulte (gimpyestrada) wrote :

Thanks, and will notifications appear here in the bug during these stages?

Rafael David Tinoco (inaddy) wrote :

Yep, soon kernel team will advise (here) about this patch been released in a specific kernel version that is in -proposed and will call out for testing/verification. I'll give all feedback here also and will mark, if appropriate, case as "verification-done" so new kernel is marked as solving this issue.

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-artful' to 'verification-done-artful'. If the problem still exists, change the tag 'verification-needed-artful' to 'verification-failed-artful'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-artful
tags: added: verification-needed-xenial

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Rafael David Tinoco (inaddy) wrote :

Hello Kleber,

Thanks for the feedback. Working in verification.

Rafael David Tinoco (inaddy) wrote :

# xenial - 4.4.0-112

[ OK ] Reached target Shutdown.
[ 70.272171] connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294907361, last ping 4294908612, now 4294909864
[ 70.274189] connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294907361, last ping 4294908612, now 4294909864
[ 70.288082] connection5:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294907362, last ping 4294908614, now 4294909868
[ 70.290066] connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294907363, last ping 4294908614, now 4294909868
[ 70.292039] connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294907362, last ping 4294908614, now 4294909868

<hang for more than 5 minutes>

# xenial - 4.4.0-116

[ OK ] Reached target Shutdown.
[ 38.384175] connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294899386, last ping 4294900640, now 4294901892
[ 38.386107] connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294899387, last ping 4294900640, now 4294901892
[ 38.388023] connection5:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294899387, last ping 4294900640, now 4294901892
[ 38.388159] connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294899386, last ping 4294900640, now 4294901892
[ 38.388159] connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294899387, last ping 4294900640, now 4294901892
[ 89.825203] reboot: Restarting system

-> xenial-verification-done

tags: added: verification-done-xenial
removed: verification-needed-xenial
Rafael David Tinoco (inaddy) wrote :

# xenial - 4.4.0-112

[ OK ] Reached target Shutdown.
[ 70.272171] connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294907361, last ping 4294908612, now 4294909864
[ 70.274189] connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294907361, last ping 4294908612, now 4294909864
[ 70.288082] connection5:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294907362, last ping 4294908614, now 4294909868
[ 70.290066] connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294907363, last ping 4294908614, now 4294909868
[ 70.292039] connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294907362, last ping 4294908614, now 4294909868

<hang for more than 5 minutes>

# xenial - 4.4.0-116

[ OK ] Reached target Shutdown.
[ 38.384175] connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294899386, last ping 4294900640, now 4294901892
[ 38.386107] connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294899387, last ping 4294900640, now 4294901892
[ 38.388023] connection5:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294899387, last ping 4294900640, now 4294901892
[ 38.388159] connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294899386, last ping 4294900640, now 4294901892
[ 38.388159] connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294899387, last ping 4294900640, now 4294901892
[ 89.825203] reboot: Restarting system

-> xenial-verification-done

--------

# artful - 4.13.0-32

[ OK ] Reached target Shutdown.
[ 194.277712] connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294938304, last ping 4294939584, now 4294940864
[ 194.279684] connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294938304, last ping 4294939584, now 4294940864
[ 194.281483] connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294938304, last ping 4294939584, now 4294940864
[ 194.283265] connection5:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294938304, last ping 4294939584, now 4294940864
[ 194.285067] connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294938304, last ping 4294939584, now 4294940864

<hang for more than 15 minutes>

# artful - 4.13.0-35

[ OK ] Reached target Shutdown.
[ 416.109989] connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294993856, last ping 4294995136, now 4294996416
[ 416.111922] connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294993856, last ping 4294995136, now 4294996416
[ 416.113869] connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294993856, last ping 4294995136, now 4294996416
[ 416.115741] connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294993856, last ping 4294995136, now 4294996416
[ 416.117650] connection5:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294993856, last ping 4294995136, now 4294996416
[ 469.614166] reboot: Restarting system

-> artful-verification-done

tags: added: verification-done-artful
removed: verification-needed-artful
Rafael David Tinoco (inaddy) wrote :

Checking if Trusty's fix is in -proposed (no comment given from Kernel Team about it)

Launchpad Janitor (janitor) wrote :
Download full text (20.1 KiB)

This bug was fixed in the package linux - 4.13.0-36.40

---------------
linux (4.13.0-36.40) artful; urgency=medium

  * linux: 4.13.0-36.40 -proposed tracker (LP: #1750010)

  * Rebuild without "CVE-2017-5754 ARM64 KPTI fixes" patch set

linux (4.13.0-35.39) artful; urgency=medium

  * linux: 4.13.0-35.39 -proposed tracker (LP: #1748743)

  * CVE-2017-5715 (Spectre v2 Intel)
    - Revert "UBUNTU: SAUCE: turn off IBPB when full retpoline is present"
    - SAUCE: turn off IBRS when full retpoline is present
    - [Packaging] retpoline files must be sorted
    - [Packaging] pull in retpoline files

linux (4.13.0-34.37) artful; urgency=medium

  * linux: 4.13.0-34.37 -proposed tracker (LP: #1748475)

  * libata: apply MAX_SEC_1024 to all LITEON EP1 series devices (LP: #1743053)
    - libata: apply MAX_SEC_1024 to all LITEON EP1 series devices

  * KVM patches for s390x to provide facility bits 81 (ppa15) and 82 (bpb)
    (LP: #1747090)
    - KVM: s390: wire up bpb feature

  * artful 4.13 i386 kernels crash after memory hotplug remove (LP: #1747069)
    - Revert "mm, memory_hotplug: do not associate hotadded memory to zones until
      online"

  * CVE-2017-5715 (Spectre v2 Intel)
    - x86/feature: Enable the x86 feature to control Speculation
    - x86/feature: Report presence of IBPB and IBRS control
    - x86/enter: MACROS to set/clear IBRS and set IBPB
    - x86/enter: Use IBRS on syscall and interrupts
    - x86/idle: Disable IBRS entering idle and enable it on wakeup
    - x86/idle: Disable IBRS when offlining cpu and re-enable on wakeup
    - x86/mm: Set IBPB upon context switch
    - x86/mm: Only set IBPB when the new thread cannot ptrace current thread
    - x86/entry: Stuff RSB for entry to kernel for non-SMEP platform
    - x86/kvm: add MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD to kvm
    - x86/kvm: Set IBPB when switching VM
    - x86/kvm: Toggle IBRS on VM entry and exit
    - x86/spec_ctrl: Add sysctl knobs to enable/disable SPEC_CTRL feature
    - x86/spec_ctrl: Add lock to serialize changes to ibrs and ibpb control
    - x86/cpu/AMD: Add speculative control support for AMD
    - x86/microcode: Extend post microcode reload to support IBPB feature
    - KVM: SVM: Do not intercept new speculative control MSRs
    - x86/svm: Set IBRS value on VM entry and exit
    - x86/svm: Set IBPB when running a different VCPU
    - KVM: x86: Add speculative control CPUID support for guests
    - SAUCE: turn off IBPB when full retpoline is present

  * Artful 4.13 fixes for tun (LP: #1748846)
    - tun: call dev_get_valid_name() before register_netdevice()
    - tun: allow positive return values on dev_get_valid_name() call
    - tun/tap: sanitize TUNSETSNDBUF input

  * boot failure on AMD Raven + WestonXT (LP: #1742759)
    - SAUCE: drm/amdgpu: add atpx quirk handling (v2)

linux (4.13.0-33.36) artful; urgency=low

  * linux: 4.13.0-33.36 -proposed tracker (LP: #1746903)

  [ Stefan Bader ]
  * starting VMs causing retpoline4 to reboot (LP: #1747507) // CVE-2017-5715
    (Spectre v2 retpoline)
    - x86/retpoline: Fill RSB on context switch for affected CPUs
    - x86/retpoline: Add LFENCE to the retpoline/RSB filling RSB macros
    - x86/retpol...

Changed in linux (Ubuntu Artful):
status: Fix Committed → Fix Released
status: Fix Committed → Fix Released
1 comments hidden view all 112 comments
Launchpad Janitor (janitor) wrote :
Download full text (10.7 KiB)

This bug was fixed in the package linux - 4.4.0-116.140

---------------
linux (4.4.0-116.140) xenial; urgency=medium

  * linux: 4.4.0-116.140 -proposed tracker (LP: #1748990)

  * BUG: unable to handle kernel NULL pointer dereference at 0000000000000009
    (LP: #1748671)
    - SAUCE: net: ipv4: fix for a race condition in raw_sendmsg -- fix backport

linux (4.4.0-115.139) xenial; urgency=medium

  * linux: 4.4.0-115.138 -proposed tracker (LP: #1748745)

  * CVE-2017-5715 (Spectre v2 Intel)
    - Revert "UBUNTU: SAUCE: turn off IBPB when full retpoline is present"
    - SAUCE: turn off IBRS when full retpoline is present
    - [Packaging] retpoline files must be sorted
    - [Packaging] pull in retpoline files

linux (4.4.0-114.137) xenial; urgency=medium

  * linux: 4.4.0-114.137 -proposed tracker (LP: #1748484)

  * ALSA backport missing NVIDIA GPU codec IDs to patch table to
    Ubuntu 16.04 LTS Kernel (LP: #1744117)
    - ALSA: hda - Add missing NVIDIA GPU codec IDs to patch table

  * Shutdown hang on 16.04 with iscsi targets (LP: #1569925)
    - scsi: libiscsi: Allow sd_shutdown on bad transport

  * libata: apply MAX_SEC_1024 to all LITEON EP1 series devices (LP: #1743053)
    - libata: apply MAX_SEC_1024 to all LITEON EP1 series devices

  * KVM patches for s390x to provide facility bits 81 (ppa15) and 82 (bpb)
    (LP: #1747090)
    - KVM: s390: wire up bpb feature
    - KVM: s390: Enable all facility bits that are known good for passthrough

  * CVE-2017-5715 (Spectre v2 Intel)
    - SAUCE: drop lingering gmb() macro
    - x86/feature: Enable the x86 feature to control Speculation
    - x86/feature: Report presence of IBPB and IBRS control
    - x86/enter: MACROS to set/clear IBRS and set IBPB
    - x86/enter: Use IBRS on syscall and interrupts
    - x86/idle: Disable IBRS entering idle and enable it on wakeup
    - x86/idle: Disable IBRS when offlining cpu and re-enable on wakeup
    - x86/mm: Set IBPB upon context switch
    - x86/mm: Only set IBPB when the new thread cannot ptrace current thread
    - x86/kvm: add MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD to kvm
    - x86/kvm: Set IBPB when switching VM
    - x86/kvm: Toggle IBRS on VM entry and exit
    - x86/spec_ctrl: Add sysctl knobs to enable/disable SPEC_CTRL feature
    - x86/spec_ctrl: Add lock to serialize changes to ibrs and ibpb control
    - x86/cpu/amd, kvm: Satisfy guest kernel reads of IC_CFG MSR
    - x86/cpu/AMD: Add speculative control support for AMD
    - x86/microcode: Extend post microcode reload to support IBPB feature
    - KVM: SVM: Do not intercept new speculative control MSRs
    - x86/svm: Set IBRS value on VM entry and exit
    - x86/svm: Set IBPB when running a different VCPU
    - KVM: x86: Add speculative control CPUID support for guests
    - SAUCE: Fix spec_ctrl support in KVM
    - SAUCE: turn off IBPB when full retpoline is present

linux (4.4.0-113.136) xenial; urgency=low

  * linux: 4.4.0-113.136 -proposed tracker (LP: #1746936)

  [ Stefan Bader ]
  * Missing install-time driver for QLogic QED 25/40/100Gb Ethernet NIC
    (LP: #1743638)
    - [d-i] Add qede to nic-modules udeb

  * CVE-2017-5753 (Spectre v1 Intel)
    - x86/cpu/AMD: Make t...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
status: Fix Committed → Fix Released
1 comments hidden view all 112 comments
Launchpad Janitor (janitor) wrote :
Download full text (3.3 KiB)

This bug was fixed in the package linux - 4.15.0-10.11

---------------
linux (4.15.0-10.11) bionic; urgency=medium

  * linux: 4.15.0-10.11 -proposed tracker (LP: #1749250)

  * "swiotlb: coherent allocation failed" dmesg spam with linux 4.15.0-9.10
    (LP: #1749202)
    - swiotlb: suppress warning when __GFP_NOWARN is set
    - drm/ttm: specify DMA_ATTR_NO_WARN for huge page pools

  * linux-tools: perf incorrectly linking libbfd (LP: #1748922)
    - SAUCE: tools -- add ability to disable libbfd
    - [Packaging] correct disablement of libbfd

  * [Artful] Realtek ALC225: 2 secs noise when a headset plugged in
    (LP: #1744058)
    - ALSA: hda/realtek - update ALC225 depop optimize

  * [Artful] Support headset mode for DELL WYSE (LP: #1723913)
    - SAUCE: ALSA: hda/realtek - Add support headset mode for DELL WYSE

  * headset mic can't be detected on two Dell machines (LP: #1748807)
    - ALSA: hda/realtek - Support headset mode for ALC215/ALC285/ALC289
    - ALSA: hda - Fix headset mic detection problem for two Dell machines

  * Bionic update to v4.15.3 stable release (LP: #1749191)
    - ip6mr: fix stale iterator
    - net: igmp: add a missing rcu locking section
    - qlcnic: fix deadlock bug
    - qmi_wwan: Add support for Quectel EP06
    - r8169: fix RTL8168EP take too long to complete driver initialization.
    - tcp: release sk_frag.page in tcp_disconnect
    - vhost_net: stop device during reset owner
    - ipv6: addrconf: break critical section in addrconf_verify_rtnl()
    - ipv6: change route cache aging logic
    - Revert "defer call to mem_cgroup_sk_alloc()"
    - net: ipv6: send unsolicited NA after DAD
    - rocker: fix possible null pointer dereference in
      rocker_router_fib_event_work
    - tcp_bbr: fix pacing_gain to always be unity when using lt_bw
    - cls_u32: add missing RCU annotation.
    - ipv6: Fix SO_REUSEPORT UDP socket with implicit sk_ipv6only
    - soreuseport: fix mem leak in reuseport_add_sock()
    - net_sched: get rid of rcu_barrier() in tcf_block_put_ext()
    - net: sched: fix use-after-free in tcf_block_put_ext
    - media: mtk-vcodec: add missing MODULE_LICENSE/DESCRIPTION
    - media: soc_camera: soc_scale_crop: add missing
      MODULE_DESCRIPTION/AUTHOR/LICENSE
    - media: tegra-cec: add missing MODULE_DESCRIPTION/AUTHOR/LICENSE
    - gpio: uniphier: fix mismatch between license text and MODULE_LICENSE
    - crypto: tcrypt - fix S/G table for test_aead_speed()
    - Linux 4.15.3

  * bnx2x_attn_int_deasserted3:4323 MC assert! (LP: #1715519) //
    CVE-2018-1000026
    - net: create skb_gso_validate_mac_len()
    - bnx2x: disable GSO where gso_size is too big for hardware

  * ethtool -p fails to light NIC LED on HiSilicon D05 systems (LP: #1748567)
    - net: hns: add ACPI mode support for ethtool -p

  * CVE-2017-5715 (Spectre v2 Intel)
    - [Packaging] retpoline files must be sorted
    - [Packaging] pull in retpoline files

  * [Feature] PXE boot with Intel Omni-Path (LP: #1712031)
    - d-i: Add hfi1 to nic-modules

  * CVE-2017-5715 (Spectre v2 retpoline)
    - [Packaging] retpoline -- add call site validation
    - [Config] disable retpoline checks for first upload

  * Do not dup...

Read more...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'. If the problem still exists, change the tag 'verification-needed-trusty' to 'verification-failed-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-trusty
Rafael David Tinoco (inaddy) wrote :

I'm verifying for trusty and its taking sometime because looks like system still hangs on shutdown. I have generated a kdump (from a watchdog timer) in the exact moment trusty is hanging on shutdown to see why BUT crash can't handle a kdump from this new kernel and I still don't know why. Testing latest crash tool to see if its because of recent meltdown/spectre patches or something like it.

tags: added: verification-failed-trusty
removed: verification-needed-trusty
Rafael David Tinoco (inaddy) wrote :
Download full text (3.4 KiB)

I'm failing trusty's patch because:

1) It was not backported fully, missing the debug message at its bottom. That led me to initially think, when debugging, that the patch hasn't been included at all. Later I could check kernel tree and saw the it was manually merged, missing that debug message part.

2) Since functional piece of code was backported, it should, still, have worked, and it did not. Something in trusty is not making it to work like expected. I'm still investigating it through a kdump taken during a shutdown hang on trusty, but, I don't want to hold the kernel release because of that:

This is the output for my tests:

https://pastebin.ubuntu.com/p/B9fp5y5gqK/

And you can check (1) with:

[ 144.862511] session4: iscsi_eh_cmd_timed_out return nh

It should be "or shutdown" there, together.

Differently than other tests, in trusty's kernel (3.13) you can see:

[ 144.860010] session4: iscsi_eh_cmd_timed_out scsi cmd ffff880037a9a100 timedout
[ 144.861400] session4: iscsi_eh_cmd_timed_out sc on shutdown, handled
[ 144.862511] session4: iscsi_eh_cmd_timed_out return nh
[ 144.863483] session4: iscsi_queuecommand iscsi: cmd 0x3 is not queued (2)

[ 144.864794] session4: iscsi_eh_device_reset LU Reset [sc ffff880037a9a100 lun 1]
[ 144.865907] session4: iscsi_eh_device_reset dev reset result = FAILED
[ 144.866875] session4: iscsi_eh_target_reset tgt Reset [sc ffff880037a9a100 tgt iqn.2017.tgtd:tid5.lun]
[ 144.868343] session4: iscsi_eh_target_reset tgt iqn.2017.tgtd:tid5.lun reset result = FAILED
[ 144.869580] session4: iscsi_eh_session_reset wait for relogin

The iscsi transport layer, did not queue the other commands, after the first one timed out, but instead of warning the upper layer, it proceeded with a "device reset", causing the "relogin" logic to be stuck since the transport layer was already gone.

This is the hang backtrace:

crash> bt
PID: 5980 TASK: ffff880057e19800 CPU: 0 COMMAND: "halt"
 #0 [ffff880037b53a50] __schedule at ffffffff8173af59
 #1 [ffff880037b53ab8] schedule at ffffffff8173b3e9
 #2 [ffff880037b53ac8] schedule_timeout at ffffffff8173a5d5
 #3 [ffff880037b53b70] io_schedule_timeout at ffffffff8173bacb
 #4 [ffff880037b53ba0] wait_for_completion_io_timeout at ffffffff8173c1e6
 #5 [ffff880037b53bf8] blk_execute_rq at ffffffff8134b9fb
 #6 [ffff880037b53ca8] scsi_execute at ffffffff814f1a77
 #7 [ffff880037b53cf0] scsi_execute_req_flags at ffffffff814f2cfc
 #8 [ffff880037b53d58] sd_sync_cache at ffffffff81500626
 #9 [ffff880037b53dd0] sd_shutdown at ffffffff81500bb9
#10 [ffff880037b53df0] device_shutdown at ffffffff814a4495
#11 [ffff880037b53e20] kernel_power_off at ffffffff81096b75
#12 [ffff880037b53e30] SYSC_reboot at ffffffff81096d4b
#13 [ffff880037b53f70] sys_reboot at ffffffff81096ebe
#14 [ffff880037b53f80] system_call_fastpath at ffffffff81748170
    RIP: 00007ff48aa45bc6 RSP: 00007ffef9bc4be8 RFLAGS: 00010246
    RAX: 00000000000000a9 RBX: ffffffff81748170 RCX: 000000000000001e
    RDX: 000000004321fedc RSI: 0000000028121969 RDI: fffffffffee1dead
    RBP: 0000000000000000 R8: fefefefefefefeff R9: 0000000000000000
    R10: 00007ff48ad13c8c R11: 0000000000000202 R12: ffffffff8...

Read more...

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-144.193

---------------
linux (3.13.0-144.193) trusty; urgency=medium

  * linux: 3.13.0-144.193 -proposed tracker (LP: #1755227)

  * CVE-2017-12762
    - isdn/i4l: fix buffer overflow

  * CVE-2017-17807
    - KEYS: add missing permission check for request_key() destination

  * bnx2x_attn_int_deasserted3:4323 MC assert! (LP: #1715519) //
    CVE-2018-1000026
    - net: Add ndo_gso_check
    - net: create skb_gso_validate_mac_len()
    - bnx2x: disable GSO where gso_size is too big for hardware

  * CVE-2017-17448
    - netfilter: nfnetlink_cthelper: Add missing permission checks

  * CVE-2017-11089
    - cfg80211: Define nla_policy for NL80211_ATTR_LOCAL_MESH_POWER_MODE

  * CVE-2018-5332
    - RDS: Heap OOB write in rds_message_alloc_sgs()

  * ppc64el: Do not call ibm,os-term on panic (LP: #1736954)
    - powerpc: Do not call ppc_md.panic in fadump panic notifier

  * CVE-2017-17805
    - crypto: salsa20 - fix blkcipher_walk API usage

  * [Hyper-V] storvsc: do not assume SG list is continuous when doing bounce
    buffers (LP: #1742480)
    - SAUCE: storvsc: do not assume SG list is continuous when doing bounce
      buffers

  * Shutdown hang on 16.04 with iscsi targets (LP: #1569925)
    - scsi: libiscsi: Allow sd_shutdown on bad transport

  * CVE-2017-17741
    - KVM: Fix stack-out-of-bounds read in write_mmio

  * CVE-2017-5715 (Spectre v2 Intel)
    - [Packaging] pull in retpoline files

 -- Stefan Bader <email address hidden> Thu, 15 Mar 2018 15:08:03 +0100

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released
status: Fix Committed → Fix Released
Displaying first 40 and last 40 comments. View all 112 comments or add a comment.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Related questions