Ubuntu 14.04.03 LPAR hits kernel oops after serial adapter is removed from profile

Bug #1491494 reported by bugproxy
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Triaged
Importance: High
Assigned to: Thadeu Lima de Souza Cascardo

Bug Description

-- Problem Description --

The failure is related to the BELL-3 (2-port Async EIA-232) adapter. Ubuntu always hits an exception when the adapter is not present. See my test scenarios below.

Test #1: Boot Ubuntu with BELL-3 adapter
=======

- The Ubuntu LPAR had been running with the BELL-3 (2-port Async EIA-232) adapter before, so I assigned the BELL-3 adapter to the Ubuntu LPAR profile and powered on the LPAR.
=> Ubuntu booted fine this time.

Test #2: Boot Ubuntu with BELL-3 adapter removed from LPAR profile
=======

- I powered down the Ubuntu partition, removed the BELL-3 adapter from the LPAR profile, and then powered on the LPAR.
=> Ubuntu hit the exception.

Elapsed time since release of system processors: 0 mins 9 secs
error: no suitable video mode found.
OF stdout device is: /vdevice/vty@30000000
Preparing to boot Linux version 3.19.0-23-generic (buildd@denneed03) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #24~14.04.1-Ubuntu SMP Wed Jul 8 11:17:19 UTC 2015 (Ubuntu 3.19.0-23.24~14.04.1-generic 3.19.8-ckt2)
Detected machine type: 0000000000000101
Max number of cores passed to firmware: 256 (NR_CPUS = 2048)
Calling ibm,client-architecture-support... done
command line: BOOT_IMAGE=/boot/vmlinux-3.19.0-23-generic root=UUID=768190e7-f633-4c63-a1e3-588d12dea265 ro quiet splash vt.handoff=7
memory layout at init:
  memory_limit : 0000000000000000 (16 MB aligned)
  alloc_bottom : 000000000b420000
  alloc_top : 0000000010000000
  alloc_top_hi : 0000000010000000
  rmo_top : 0000000010000000
  ram_top : 0000000010000000
instantiating rtas at 0x000000000ecb0000... done
prom_hold_cpus: skipped
copying OF device tree...
Building dt strings...
Building dt structure...
Device tree strings 0x000000000b430000 -> 0x000000000b4316b1
Device tree struct 0x000000000b440000 -> 0x000000000b470000
Calling quiesce...
returning from prom_init
 -> smp_release_cpus()
spinning_secondaries = 15
 <- smp_release_cpus()
 <- setup_system()
[ 0.661510] /build/linux-lts-vivid-uV14Ja/linux-lts-vivid-3.19.0/drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
[ 0.672826] sd 0:0:1:0: [sda] Assuming drive cache: write through
[ 4.658302] device-mapper: table: 252:0: multipath: error getting device
[ 4.691990] device-mapper: table: 252:0: multipath: error getting device
[ 4.934034] device-mapper: table: 252:0: multipath: error getting device
[ 4.951977] device-mapper: table: 252:0: multipath: error getting device
 * Discovering and coalescing multipaths... [ OK ]
Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
 * Starting AppArmor profiles [ OK ]
Loading the saved-state of the serial devices...
[ 5.109665] Unable to handle kernel paging request for data at address 0xd000080000000003
[ 5.109677] Faulting instruction address: 0xc00000000060fec4
[ 5.109685] Oops: Kernel access of bad area, sig: 11 [#1]
[ 5.109691] SMP NR_CPUS=2048 NUMA pSeries
[ 5.109699] Modules linked in: dm_round_robin dm_multipath scsi_dh pseries_rng rtc_generic knem(OE) nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) mlx4_en(OE) mlx4_core(OE) mlx_compat(OE)
[ 5.109759] CPU: 1 PID: 1816 Comm: setserial Tainted: G OE 3.19.0-23-generic #24~14.04.1-Ubuntu
[ 5.109769] task: c0000000f389c880 ti: c0000000f0528000 task.ti: c0000000f0528000
[ 5.109777] NIP: c00000000060fec4 LR: c000000000617498 CTR: c00000000060fe20
[ 5.109785] REGS: c0000000f052b6b0 TRAP: 0300 Tainted: G OE (3.19.0-23-generic)
[ 5.109793] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 84002022 XER: 00000000
[ 5.109814] CFAR: c000000000008468 DAR: d000080000000003 DSISR: 42000000 SOFTE: 1
GPR00: c000000000617498 c0000000f052b930 c00000000144c700 00000000000000bf
GPR04: d000080000000003 00000000000000bf c0000000f3990000 0000000000000141
GPR08: c000000000611d20 c0000000013539e0 d000080000000000 c000000001351ba8
GPR12: c00000000060fe20 c00000000e830900 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 000000000000007d 0000000000000040 0000000000000000 0000000000000000
GPR24: 0000000000000000 c0000000f53cbc00 0000000000000001 0000000000000000
GPR28: c0000000f53cbde0 00000000000000bf 0000000000000003 c000000001754970
[ 5.109916] NIP [c00000000060fec4] io_serial_out+0xa4/0xd0
[ 5.109924] LR [c000000000617498] serial8250_do_startup+0x978/0xe50
[ 5.109931] Call Trace:
[ 5.109936] [c0000000f052b930] [c0000000f052b970] 0xc0000000f052b970 (unreliable)
[ 5.109948] [c0000000f052b970] [c000000000617498] serial8250_do_startup+0x978/0xe50
[ 5.109958] [c0000000f052ba10] [c00000000060eb00] uart_startup.part.7+0xd0/0x310
[ 5.109967] [c0000000f052ba60] [c00000000060f1ac] uart_set_info+0x46c/0x580
[ 5.109976] [c0000000f052bb90] [c00000000060f378] uart_ioctl+0xb8/0x590
[ 5.109986] [c0000000f052bc40] [c0000000005dd89c] tty_ioctl+0x21c/0xf60
[ 5.109995] [c0000000f052bd40] [c0000000002ce680] do_vfs_ioctl+0x4f0/0x7c0
[ 5.110004] [c0000000f052bde0] [c0000000002cea24] SyS_ioctl+0xd4/0xf0
[ 5.110014] [c0000000f052be30] [c000000000009258] system_call+0x38/0xd0
[ 5.110021] Instruction dump:
[ 5.110026] 38210040 e8010010 eba1ffe8 ebc1fff0 ebe1fff8 7c0803a6 4e800020 3d42fff0
[ 5.110040] 392a72e0 e9490000 7c845214 7c0004ac <98640000> 39200001 992d02bc 38210040
[ 5.110057] ---[ end trace 7c597ccc52ffb926 ]---
[ 5.114039]
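
One way to read the register dump above is to compare the faulting data address (DAR) with the general-purpose registers. The sketch below is only an illustration, not part of the original report: the helper name `parse_oops_registers` is hypothetical, and treating GPR10 as the base address is an interpretation of the dump. The two constants are the values of `UART_LCR` and `UART_LCR_CONF_MODE_B` in the Linux UART register header.

```python
import re

# Constants as defined in the Linux UART register header (serial_reg.h).
UART_LCR = 3                   # Line Control Register offset
UART_LCR_CONF_MODE_B = 0xBF    # LCR value that selects configuration mode B


def parse_oops_registers(text):
    """Return (dar, gprs) parsed from a ppc64 oops register dump.

    Hypothetical helper for illustration; it only understands the
    'DAR:' field and the 'GPRnn:' lines in the format shown above.
    """
    dar_match = re.search(r"DAR: ([0-9a-f]{16})", text)
    if dar_match is None:
        raise ValueError("no DAR field found")
    dar = int(dar_match.group(1), 16)
    gprs = {}
    # Each GPR line names the first register and lists four values.
    for m in re.finditer(r"GPR(\d\d): ((?:[0-9a-f]{16} ?){4})", text):
        first = int(m.group(1))
        for i, value in enumerate(m.group(2).split()):
            gprs[first + i] = int(value, 16)
    return dar, gprs


# Lines copied from the first oops above.
SAMPLE = """\
[ 5.109814] CFAR: c000000000008468 DAR: d000080000000003 DSISR: 42000000 SOFTE: 1
GPR00: c000000000617498 c0000000f052b930 c00000000144c700 00000000000000bf
GPR04: d000080000000003 00000000000000bf c0000000f3990000 0000000000000141
GPR08: c000000000611d20 c0000000013539e0 d000080000000000 c000000001351ba8
"""

dar, gprs = parse_oops_registers(SAMPLE)
offset = dar - gprs[10]
print(f"offset from GPR10: {offset}, value in GPR03: {gprs[3]:#x}")
```

Under that reading, the faulting store lands 3 bytes past the address in GPR10 and GPR03 holds 0xbf, which would be consistent with io_serial_out writing the LCR of a port whose base is 0x0000.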

Test #3: DLPAR remove the adapter first, then reboot the LPAR
=======

- I powered down the Ubuntu LPAR.
- I then assigned the BELL-3 adapter back to the Ubuntu LPAR profile and powered on the partition.
=> It booted fine with no problem.

root@tul7p07:~# lspci
60:00.0 Serial controller: Digi International Device 00f6

  0000:60:00.0 ttyS0 ttyS1 serial U78CB.001.WZS02NH-P1-C12-T1
                                         serial (1410f600)
        Manufacturer Name.........IBM
        Machine Type-Model........Unknown
        Device Specific.(YC)......0
        Location Code.(YL)........U78CB.001.WZS02NH-P1-C12-T1

  ttyS0 U78CB.001.WZS02NH-P1-C12-T1
                                         Serial Device
        Location Code.(YL)........U78CB.001.WZS02NH-P1-C12-T1

  ttyS1 U78CB.001.WZS02NH-P1-C12-T1
                                         Serial Device
        Location Code.(YL)........U78CB.001.WZS02NH-P1-C12-T1

- I then went into the HMC and performed a DLPAR remove of the adapter. The operation completed successfully.

- I then powered down and checked the LPAR profile (the BELL-3 adapter was no longer assigned).

- I then powered up the Ubuntu LPAR again. It still hit the exception in this case.

So Ubuntu always hits the exception when the adapter is not present.

The system does show a config file originally created on Jul 30. /etc/init.d/setserial is the startup script that attempts to configure the serial devices using either /etc/serial.conf (there isn't one) or /var/lib/setserial/autoserial.conf, which does exist.

root@tul7p07:/etc/init.d# ls -l /var/lib/setserial/autoserial.conf
-rw-r--r-- 1 root root 518 Jul 30 00:27 /var/lib/setserial/autoserial.conf
root@tul7p07:/etc/init.d# ls /etc/serial.conf
ls: cannot access /etc/serial.conf: No such file or directory
root@tul7p07:/etc/init.d# cat /var/lib/setserial/autoserial.conf
###PORT STATE GENERATED USING AUTOSAVE-ONCE###
###AUTOSAVE-ONCE###
###AUTOSAVE-ONCE###
###AUTOSAVE###
#
# If you want to configure this file by hand, use
# dpkg-reconfigure setserial
# and change the configuration mode of the file to MANUAL. If you do not do this, this file may be overwritten automatically the next time you upgrade the
# package.
#
/dev/ttyS0 uart 16950/954 port 0x0000 irq 0 baud_base 4000000 spd_normal skip_test
/dev/ttyS1 uart 16950/954 port 0x0000 irq 0 baud_base 4000000 spd_normal skip_test
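
To make the saved entries easier to inspect, the port lines above can be split into fields. This is a minimal sketch that handles only the keywords appearing in this particular file; the function name `parse_setserial_line` is hypothetical and this is not the parser setserial itself uses.

```python
def parse_setserial_line(line):
    """Parse one port line of the autoserial.conf shown above.

    Returns None for blank lines and '#' comment lines. Only the
    keywords present in this file are treated as taking a value.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    tokens = line.split()
    entry = {"device": tokens[0], "flags": []}
    valued = {"uart", "port", "irq", "baud_base"}
    i = 1
    while i < len(tokens):
        key = tokens[i]
        if key in valued and i + 1 < len(tokens):
            value = tokens[i + 1]
            if key == "uart":
                entry[key] = value          # e.g. "16950/954"
            else:
                entry[key] = int(value, 0)  # accepts "0x0000" and "4000000"
            i += 2
        else:
            entry["flags"].append(key)      # e.g. spd_normal, skip_test
            i += 1
    return entry


entry = parse_setserial_line(
    "/dev/ttyS0 uart 16950/954 port 0x0000 irq 0 "
    "baud_base 4000000 spd_normal skip_test"
)
print(entry)
```

Both saved entries carry port 0x0000 and irq 0, so the file by itself does not say which adapter the ports belonged to.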

I am thinking that if you rename or move /var/lib/setserial/autoserial.conf so that the service doesn't find it (disabling the setserial service might work, too), the system may just come up without the adapter.
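
The rename suggested here could be scripted roughly as follows. This is only a sketch of the idea: the function name `set_aside_autoserial` and the `.org` suffix are illustrative choices, and it would need to run as root against the real path.

```python
from pathlib import Path


def set_aside_autoserial(conf="/var/lib/setserial/autoserial.conf",
                         suffix=".org"):
    """Rename the autosaved serial config so the boot script cannot find it.

    Returns the backup path, or None if the config file does not exist.
    Refuses to overwrite an existing backup.
    """
    conf = Path(conf)
    if not conf.exists():
        return None
    backup = conf.with_name(conf.name + suffix)
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    conf.rename(backup)
    return backup
```

Calling `set_aside_autoserial()` with no arguments targets the path from this report; passing a different path makes it possible to try the function on a copy first.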

So, the next step is to rename or move that conf file, shut down the partition, remove the Digi adapter from the profile, and see what happens when we come back up. If it comes back up, the question will be: what should the OS do if it has autosaved configuration info for the ports and the adapter is then removed? Should the system ensure those devices are still present before attempting to tell the kernel to configure them? Should the kernel have more sanity checks?

Thanks to Luciano C. for pointing out the issue. I ran tests and confirmed that what he pointed out is correct.

So now we need to address these questions from his previous comment:

- What should the OS do if it has autosaved configuration info for the ports and the adapter is then removed?
- Should the system ensure those devices are still present before attempting to tell the kernel to configure them?
- Should the kernel have more sanity checks?
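
As one possible answer to the second question, a userspace guard could skip restoring the saved state when no PCI serial controller is visible. The sketch below is an assumption-laden illustration, not the fix adopted for this bug: both function names are hypothetical, it relies on the sysfs `class` attribute of PCI devices (class code 0x0700xx for serial controllers), and it would wrongly skip systems whose serial ports are not PCI devices.

```python
from pathlib import Path


def pci_serial_controller_present(sysfs_root="/sys"):
    """Return True if any PCI device reports a serial-controller class."""
    devices = Path(sysfs_root) / "bus" / "pci" / "devices"
    if not devices.is_dir():
        return False
    for dev in devices.iterdir():
        try:
            pci_class = (dev / "class").read_text().strip().lower()
        except OSError:
            continue  # device vanished or attribute unreadable
        # 0x07 = communication controller, 0x00 = serial subclass.
        if pci_class.startswith("0x0700"):
            return True
    return False


def should_restore_serial_state(conf_path, sysfs_root="/sys"):
    """Decide whether saved serial settings should be applied at boot.

    Assumption: the saved ports belong to a PCI adapter, as in this report.
    """
    return Path(conf_path).is_file() and pci_serial_controller_present(sysfs_root)
```

A boot script could call `should_restore_serial_state("/var/lib/setserial/autoserial.conf")` and only invoke setserial when it returns True. The `sysfs_root` parameter exists so the check can be exercised against a fake directory tree.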

Here are the tests I ran:
========================

1) First, I booted Ubuntu with serial adapter.

root@tul7p07:~# lspci
60:00.0 Serial controller: Digi International Device 00f6
root@tul7p07:~#

2) Then I renamed /var/lib/setserial/autoserial.conf (per Luciano C.'s instructions).

root@tul7p07:~# mv /var/lib/setserial/autoserial.conf /var/lib/setserial/autoserial.conf.org
root@tul7p07:~# ls -l /var/lib/setserial/autoserial.conf*
-rw-r--r-- 1 root root 305 Jul 30 00:27 /var/lib/setserial/autoserial.conf.old
-rw-r--r-- 1 root root 518 Jul 30 00:27 /var/lib/setserial/autoserial.conf.org

3) I then shut down the Ubuntu partition and removed the serial adapter from the partition's profile.
Then I booted it up again. The system came up to the login prompt.

=====
Ubuntu 14.04.3 LTS tul7p07.aus.stglabs.ibm.com hvc0

tul7p07 login: root
Password:

=================

4) I then added the serial adapter back to the Ubuntu partition's profile and booted the partition up again.

====
Ubuntu 14.04.3 LTS tul7p07.aus.stglabs.ibm.com hvc0

tul7p07 login: root
Password:
Last login: Wed Sep 2 09:49:21 CDT 2015 on hvc0
Welcome to Ubuntu 14.04.3 LTS (GNU/Linux 3.19.0-27-generic ppc64le)

 * Documentation: https://help.ubuntu.com/
root@tul7p07:~# lspci
60:00.0 Serial controller: Digi International Device 00f6
root@tul7p07:~#
====

5) I checked /var/lib/setserial/. A new autoserial.conf had been created, as expected.

root@tul7p07:~# ls -l /var/lib/setserial/
total 12
-rw-r--r-- 1 root root 15 Sep 2 09:52 autoserial.conf
-rw-r--r-- 1 root root 15 Sep 2 09:47 autoserial.conf.old
-rw-r--r-- 1 root root 518 Jul 30 00:27 autoserial.conf.org
-rw-r--r-- 1 root root 0 Jul 30 00:27 etc.serial.conf.bkp
root@tul7p07:~#

6) I then shut down the Ubuntu partition without removing or renaming the autoserial.conf file.

7) I removed the serial adapter from the Ubuntu partition's profile and booted the partition again. The kernel again tried to configure the serial port memory address, which is now a bogus address, so it hit the problem again.

==========
Preparing to boot Linux version 3.19.0-27-generic (buildd@fisher04) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #29~14.04.1-Ubuntu SMP Sun Aug 16 01:51:48 UTC 2015 (Ubuntu 3.19.0-27.29~14.04.1-generic 3.19.8-ckt5)
Detected machine type: 0000000000000101
Max number of cores passed to firmware: 256 (NR_CPUS = 2048)
Calling ibm,client-architecture-support... done
command line: BOOT_IMAGE=/boot/vmlinux-3.19.0-27-generic root=UUID=768190e7-f633-4c63-a1e3-588d12dea265 ro quiet splash vt.handoff=7
memory layout at init:
  memory_limit : 0000000000000000 (16 MB aligned)
  alloc_bottom : 000000000b400000
  alloc_top : 0000000010000000
  alloc_top_hi : 0000000010000000
  rmo_top : 0000000010000000
  ram_top : 0000000010000000
instantiating rtas at 0x000000000ecb0000... done
prom_hold_cpus: skipped
copying OF device tree...
Building dt strings...
Building dt structure...
Device tree strings 0x000000000b410000 -> 0x000000000b4116b1
Device tree struct 0x000000000b420000 -> 0x000000000b450000
Calling quiesce...
returning from prom_init
 -> smp_release_cpus()
spinning_secondaries = 15
 <- smp_release_cpus()
 <- setup_system()
[ 0.643938] /build/linux-lts-vivid-4KQgBt/linux-lts-vivid-3.19.0/drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
[ 0.656156] sd 0:0:1:0: [sda] Assuming drive cache: write through
 * Discovering and coalescing multipaths... [ OK ]
Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
 * Starting AppArmor profiles [ OK ]
Loading the saved-state of the serial devices...
[ 5.276868] Unable to handle kernel paging request for data at address 0xd000080000000003
[ 5.276880] Faulting instruction address: 0xc00000000060f684
[ 5.276888] Oops: Kernel access of bad area, sig: 11 [#1]
[ 5.276894] SMP NR_CPUS=2048 NUMA pSeries
[ 5.276902] Modules linked in: dm_multipath scsi_dh pseries_rng ib_ipoib rdma_ucm rtc_generic rdma_cm iw_cm ib_ucm ib_uverbs ib_cm ib_umad mlx4_ib ib_sa ib_mad ib_core ib_addr mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_core nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache [last unloaded: mlx5_core]
[ 5.276960] CPU: 8 PID: 1466 Comm: setserial Not tainted 3.19.0-27-generic #29~14.04.1-Ubuntu
[ 5.276969] task: c0000000f2065300 ti: c0000000f21c4000 task.ti: c0000000f21c4000
[ 5.276977] NIP: c00000000060f684 LR: c000000000616c58 CTR: c00000000060f5e0
[ 5.276985] REGS: c0000000f21c76b0 TRAP: 0300 Not tainted (3.19.0-27-generic)
[ 5.276992] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 84002022 XER: 00000000
[ 5.277012] CFAR: c000000000008468 DAR: d000080000000003 DSISR: 42000000 SOFTE: 1
GPR00: c000000000616c58 c0000000f21c7930 c00000000144cc00 00000000000000bf
GPR04: d000080000000003 00000000000000bf c0000000f54b0000 0000000000000141
GPR08: c0000000006114e0 c0000000013539e0 d000080000000000 c000000001351ba8
GPR12: c00000000060f5e0 c00000000e834800 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 000000000000007d 0000000000000040 0000000000000000 0000000000000000
GPR24: 0000000000000000 c0000000f8092000 0000000000000001 0000000000000000
GPR28: c0000000f80921e0 00000000000000bf 0000000000000003 c0000000017549f0
[ 5.277115] NIP [c00000000060f684] io_serial_out+0xa4/0xd0
[ 5.277122] LR [c000000000616c58] serial8250_do_startup+0x978/0xe50
[ 5.277129] Call Trace:
[ 5.277134] [c0000000f21c7930] [c0000000f21c7970] 0xc0000000f21c7970 (unreliable)
[ 5.277145] [c0000000f21c7970] [c000000000616c58] serial8250_do_startup+0x978/0xe50
[ 5.277155] [c0000000f21c7a10] [c00000000060e2c0] uart_startup.part.7+0xd0/0x310
[ 5.277164] [c0000000f21c7a60] [c00000000060e96c] uart_set_info+0x46c/0x580
[ 5.277173] [c0000000f21c7b90] [c00000000060eb38] uart_ioctl+0xb8/0x590
[ 5.277183] [c0000000f21c7c40] [c0000000005dd01c] tty_ioctl+0x21c/0xf60
[ 5.277192] [c0000000f21c7d40] [c0000000002ce7a0] do_vfs_ioctl+0x4f0/0x7c0
[ 5.277201] [c0000000f21c7de0] [c0000000002ceb44] SyS_ioctl+0xd4/0xf0
[ 5.277210] [c0000000f21c7e30] [c000000000009258] system_call+0x38/0xd0
[ 5.277217] Instruction dump:
[ 5.277222] 38210040 e8010010 eba1ffe8 ebc1fff0 ebe1fff8 7c0803a6 4e800020 3d42fff0
[ 5.277236] 392a6de0 e9490000 7c845214 7c0004ac <98640000> 39200001 992d02bc 38210040
[ 5.277252] ---[ end trace d5657031818c6b89 ]---
[ 5.280950]
[ 11.843975] init: openibd pre-start process (1614) terminated with status 3

=======================

Mirroring to Launchpad for Canonical folks to take a look...

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-128602 severity-critical targetmilestone-inin14043
Luciano Chavez (lnx1138)
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: nobody → Taco Screen team (taco-screen-team)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
status: New → Triaged
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2015-10-04 15:15 EDT-------
==== State: Assigned by: bellivea on 04 October 2015 10:07:01 ====

Any updates?

Adam Conrad (adconrad) wrote :

In the failure case, does the system literally have no serial ports left (ie: is userspace incorrectly sending syscalls wildly into the ether without checking), or is it that you're shuffling port assignments by adding/removing ports, and we're now trying to configure a *different* driver that doesn't take kindly to the old config?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-10-12 16:50 EDT-------
Paul Nguyen, please respond to Adam's questions especially since this was opened as a blocker issue.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-10-12 17:12 EDT-------
Externalizing Comment from Paul Nguyen 2015-10-12 13:09:36 EDT
(In reply to comment #16)
> Paul Nguyen, please respond to Adam's questions especially since this was
> opened as a blocker issue.

The adapter is completely removed from the Ubuntu LPAR, and it has no serial ports left. But when the Ubuntu LPAR is booting up, the OS incorrectly sends syscalls wildly without checking.

Please see comments #4 and #6 from Luciano, which have the analysis. I also ran tests and confirmed his analysis in comment #9.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-10-26 18:25 EDT-------
==== State: Assigned by: nguyenp on 26 October 2015 13:14:50 ====

Per Sametime with Gabriel this morning, he is working on building a workaround for the problem.

I'm lowering the severity of the defect since it's not a blocker.

tags: added: severity-high
removed: severity-critical
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-10-27 13:55 EDT-------
==== State: Assigned by: mgrosch on 27 October 2015 08:45:48 ====

#=#=# 2015-10-27 08:45:46 (CDT) #=#=#
New Fix_Potential = [GSI_HDW]
New Priority_Justification = [Since there is a workaround not deeming stop ship but needs to be resolved in a future Ubuntu Linux release]
#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#

bugproxy (bugproxy)
tags: added: targetmilestone-inin14044
removed: targetmilestone-inin14043
Mathew Hodson (mhodson)
Changed in linux (Ubuntu):
importance: Undecided → High
bugproxy (bugproxy)
tags: added: targetmilestone-inin---
removed: targetmilestone-inin14044
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-12-07 15:36 EDT-------

Changed in linux (Ubuntu):
assignee: Canonical Kernel Team (canonical-kernel-team) → Thadeu Lima de Souza Cascardo (cascardo)
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-12-07 12:38 EDT-------
Rejecting this very old bug. Please reopen if this issue occurs on a current version of Ubuntu.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-12-07 12:55 EDT-------
This CMVC defect is being cancelled by the CDE Bridge because the corresponding CQ Defect [SW317564] was transferred out of the bridge domain.
Here are the additional details:
New Subsystem = ppc_triage
New Release = unspecified
New Component = ubuntu_linux
New OwnerInfo = Chavez, Luciano (<email address hidden>)
To continue tracking this issue, please follow CQ defect [SW317564].

Brad Figg (brad-figg)
tags: added: cscc