Kernel Oops 3b in libc-2.23 unable to handle pointer dereference in kernel virtual address space

Bug #1831899 reported by bugproxy
Affects                    Status        Importance  Assigned to             Milestone
Ubuntu on IBM z Systems    Fix Released  High        Canonical Kernel Team
linux (Ubuntu)             Fix Released  Undecided   Skipper Bug Screeners
Bionic                     Fix Released  Undecided   Unassigned

Bug Description

== Comment: #0 - Robert J. Brenneman <email address hidden> - 2019-05-30 11:16:45 ==
---Problem Description---
Kernel Oops 3b in libc-2.23 unable to handle pointer dereference in virtual kernel address space

Contact Information = <email address hidden>

---uname output---
Linux ECOS0018 4.15.0-50-generic #54-Ubuntu SMP Tue May 7 05:57:08 UTC 2019 s390x s390x s390x GNU/Linux

Machine Type = z13 2964 NE1

---System Hang---
 z/VM took a VMDUMP and reIPLed
(the attached and available dumps are Linux dumps)

---Debugger---
A debugger is not configured

---Steps to Reproduce---
 boot system, start jenkins, let it run a couple days

Stack trace output:
 05/29/19 13:24:06 Call Trace:
05/29/19 13:24:06 ([<000000000012b97a>] __tlb_remove_table+0x6a/0xd0)
05/29/19 13:24:06 [<000000000012ba34>] tlb_remove_table_rcu+0x54/0x70
05/29/19 13:24:06 [<00000000001f43b4>] rcu_process_callbacks+0x1d4/0x570
05/29/19 13:24:06 [<00000000008e92d4>] __do_softirq+0x124/0x358
05/29/19 13:24:06 [<0000000000179d52>] irq_exit+0xba/0xd0
05/29/19 13:24:06 [<000000000010c412>] do_IRQ+0x8a/0xb8
05/29/19 13:24:06 [<00000000008e87f0>] ext_int_handler+0x134/0x138
05/29/19 13:24:06 [<0000000000102cee>] enabled_wait+0x4e/0xe0
05/29/19 13:24:06 ([<0000000000001201>] 0x1201)
05/29/19 13:24:06 [<000000000010303a>] arch_cpu_idle+0x32/0x48
05/29/19 13:24:06 [<00000000001c5ae8>] do_idle+0xe8/0x1a8

Oops output:
 05/29/19 13:24:06 User process fault: interruption code 003b ilc:3 in libc-2.23.so[3ffaca00000+185000]
05/29/19 13:24:06 Failing address: 0000000000000000 TEID: 0000000000000800
05/29/19 13:24:06 Fault in primary space mode while using user ASCE.
05/29/19 13:24:06 AS:0000000710b241c7 R3:0000000000000024
05/29/19 13:24:06 Unable to handle kernel pointer dereference in virtual kernel address space
05/29/19 13:24:06 Failing address: 000003dbe0000000 TEID: 000003dbe0000403
05/29/19 13:24:06 Fault in home space mode while using kernel ASCE.
05/29/19 13:24:06 AS:0000000000ea8007 R3:0000000000000024
05/29/19 13:24:06 Oops: 003b ilc:3 [#1] SMP
05/29/19 13:24:06 Modules linked in: veth xt_nat xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_con
05/29/19 13:24:06 ghash_s390 prng aes_s390 des_s390 des_generic sha512_s390 sha256_s390 dasd_fba_mod dasd_eckd_mod sha1_s390 sha_common dasd_mod
05/29/19 13:24:06 CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.15.0-50-generic #54-Ubuntu
05/29/19 13:24:06 Hardware name: IBM 2964 NE1 798 (z/VM 6.4.0)
05/29/19 13:24:06 Krnl PSW : 00000000dcb002be 0000000072762961 (__tlb_remove_table+0x56/0xd0)
05/29/19 13:24:06 R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
05/29/19 13:24:06 Krnl GPRS: ffffffffffffffba 000002b800000000 000002b800000003 0000000000eacac8
05/29/19 13:24:06 ffffffffffffffba 00000000000000b9 0700000000000000 000000000000000a
05/29/19 13:24:06 0404c00100000000 00000007d2fb5c38 00000007cf71fdf0 000003dbe0000000
05/29/19 13:24:06 000003dbe0000018 00000000008fe740 00000007cf71fd08 00000007cf71fcd8
05/29/19 13:24:06 Krnl Code: 000000000012b956: ec2c002a027f clij %r2,2,12,12b9aa
05/29/19 13:24:06 000000000012b95c: ec26001d037e cij %r2,3,6,12b996
05/29/19 13:24:06 #000000000012b962: 41c0b018 la %r12,24(%r11)
05/29/19 13:24:06 >000000000012b966: e548b0080000 mvghi 8(%r11),0
05/29/19 13:24:06 000000000012b96c: a7390008 lghi %r3,8
05/29/19 13:24:06 000000000012b970: b904002b lgr %r2,%r11
05/29/19 13:24:06 000000000012b974: c0e5000e8f8a brasl %r14,2fd888
05/29/19 13:24:06 000000000012b97a: a718ffff lhi %r1,-1
05/29/19 13:24:06 Call Trace:
05/29/19 13:24:06 ([<000000000012b97a>] __tlb_remove_table+0x6a/0xd0)
05/29/19 13:24:06 [<000000000012ba34>] tlb_remove_table_rcu+0x54/0x70
05/29/19 13:24:06 [<00000000001f43b4>] rcu_process_callbacks+0x1d4/0x570
05/29/19 13:24:06 [<00000000008e92d4>] __do_softirq+0x124/0x358
05/29/19 13:24:06 [<0000000000179d52>] irq_exit+0xba/0xd0
05/29/19 13:24:06 [<000000000010c412>] do_IRQ+0x8a/0xb8
05/29/19 13:24:06 [<00000000008e87f0>] ext_int_handler+0x134/0x138
05/29/19 13:24:06 [<0000000000102cee>] enabled_wait+0x4e/0xe0
05/29/19 13:24:06 ([<0000000000001201>] 0x1201)
05/29/19 13:24:06 [<000000000010303a>] arch_cpu_idle+0x32/0x48
05/29/19 13:24:06 [<00000000001c5ae8>] do_idle+0xe8/0x1a8
05/29/19 13:24:06 virtual machine is placed in CP mode due to a SIGP stop from CPU 01.
05/29/19 13:24:06 virtual machine is placed in CP mode due to a SIGP stop from CPU 04.
05/29/19 13:24:06 virtual machine is placed in CP mode due to a SIGP stop from CPU 05.
05/29/19 13:24:06 virtual machine is placed in CP mode due to a SIGP stop from CPU 00.
05/29/19 13:24:06 virtual machine is placed in CP mode due to a SIGP stop from CPU 06.
05/29/19 13:24:06 virtual machine is placed in CP mode due to a SIGP stop from CPU 02.
05/29/19 13:24:06 virtual machine is placed in CP mode due to a SIGP stop from CPU 03.
05/29/19 13:24:06 virtual machine is placed in CP mode due to a SIGP stop from CPU 07.
05/29/19 13:24:06 virtual machine is placed in CP mode due to a SIGP stop from CPU 07.
05/29/19 13:24:06 virtual machine is placed in CP mode due to a SIGP stop from CPU 07.
05/29/19 13:24:06 virtual machine is placed in CP mode due to a SIGP stop from CPU 07.
05/29/19 13:24:06 virtual machine is placed in CP mode due to a SIGP stop from CPU 07.
05/29/19 13:24:06 virtual machine is placed in CP mode due to a SIGP stop from CPU 07.
05/29/19 13:24:06 virtual machine is placed in CP mode due to a SIGP stop from CPU 07.
05/29/19 13:24:06 [<00000000001c5d86>] cpu_startup_entry+0x3e/0x48
05/29/19 13:24:06 [<0000000000117240>] smp_start_secondary+0x120/0x140
05/29/19 13:24:06 [<00000000008e8c46>] restart_int_handler+0x62/0x78
05/29/19 13:24:06 [<0000000000000000>] (null)
05/29/19 13:24:06 Last Breaking-Event-Address:
05/29/19 13:24:06 [<000000000012ba2e>] tlb_remove_table_rcu+0x4e/0x70
05/29/19 13:24:06
05/29/19 13:24:06 Kernel panic - not syncing: Fatal exception in interrupt

System Dump Location:
 I will attach dumps here

*Additional Instructions for <email address hidden>:
-Attach 'sysctl -a' output to the bug.

bugproxy (bugproxy)
tags: added: architecture-s39064 bugnameltc-177920 severity-high targetmilestone-inin18043
Changed in ubuntu:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
affects: ubuntu → kernel-package (Ubuntu)
Frank Heimes (fheimes)
affects: kernel-package (Ubuntu) → linux (Ubuntu)
Revision history for this message
Frank Heimes (fheimes) wrote :

Hi, does this happen on Ubuntu 18.04 or 16.04.5? (because the kernel is '4.15')
lsb_release -a

Since libc 2.23 is mentioned I assume it's 16.04.5.
apt-cache policy libc6 linux-generic

Can you please share a dump, as well as the 'dbginfo' output?

The z/VM version seems to be 6.4 - do you know if that happens on LPAR, too?
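
For reference, the 'dbginfo' data on s390x is usually collected with the dbginfo.sh script from the s390-tools package; a minimal sketch, assuming s390-tools is installed (it normally is on s390x):

$ sudo apt install s390-tools      # provides dbginfo.sh
$ sudo dbginfo.sh                  # typically writes a DBGINFO-<date>-<time>-<hostname>.tgz to /tmp
$ ls /tmp/DBGINFO-*.tgz            # this tarball is what gets attached to the bug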

Changed in ubuntu-z-systems:
importance: Undecided → High
status: New → Incomplete
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2019-06-06 11:54 EDT-------
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.2 LTS
Release: 18.04
Codename: bionic

dbginfo attached to bz

Revision history for this message
bugproxy (bugproxy) wrote : dbginfo

------- Comment (attachment only) From <email address hidden> 2019-06-06 11:56 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2019-06-06 11:59 EDT-------
I've got two dumps:
dump1 : http://pokgsa.ibm.com/~rjbrenn/public/dump.190529.1324.gz

dump2 : http://pokgsa.ibm.com/~rjbrenn/public/dump.190529.2120.gz

but they are over 2 GB each because they are VMDUMPs, not kdumps.
Is there an SFTP server I can use to make them available to you?

I also have more recent kdumps that I can provide as well.

Revision history for this message
Frank Heimes (fheimes) wrote :

Hi Robert,
you may upload them to IBM Box; I have an account there and can download them to the Canonical network - that's how I also handle dumps and other large files with my IBM project manager.

Thanks for sharing the additional information so far.

But I'm still wondering about the libc version.
The libc version in Bionic is 2.27, but I see 2.23 in the trace above, which is the version from Xenial.

rmadison --arch=s390x libc6 | egrep 'xenial|bionic'
 libc6 | 2.23-0ubuntu3 | xenial | s390x
 libc6 | 2.23-0ubuntu10 | xenial-security | s390x
 libc6 | 2.23-0ubuntu11 | xenial-updates | s390x
 libc6 | 2.27-3ubuntu1 | bionic | s390x

On your system you should see the following output, pointing to 2.27:

$ apt-cache policy libc6
libc6:
  Installed: 2.27-3ubuntu1
  Candidate: 2.27-3ubuntu1
  Version table:
 *** 2.27-3ubuntu1 500
        500 http://us.ports.ubuntu.com/ubuntu-ports bionic/main s390x Packages
        100 /var/lib/dpkg/status

Please can you double check with 'apt-cache policy libc6'?
It's important to have the correct libc6 and, in case a 2.23 is present, to find out where it's coming from:
sudo find / -iname "libc-2.*.so"

Revision history for this message
Sean Feole (sfeole) wrote :

Will keep an eye out for this during the upcoming SRU cycle on s390x and monitor the conversation with Frank.

tags: added: bionic
tags: added: s390x
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-06-06 22:30 EDT-------
$ apt-cache policy libc6
libc6:
Installed: 2.27-3ubuntu1
Candidate: 2.27-3ubuntu1
Version table:
*** 2.27-3ubuntu1 500
500 http://us.ports.ubuntu.com/ubuntu-ports bionic/main s390x Packages
100 /var/lib/dpkg/status

I suspect that the team is doing Jenkins builds of Docker images that are Xenial-based, and it's the Xenial libc in the Docker container that we're seeing when this error hits.
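
A quick way to confirm that theory would be to query the libc inside one of the running containers; a sketch, where 'jenkins-build' is only a placeholder for the real container name:

$ docker ps --format '{{.Names}}'                      # list running containers
$ docker exec jenkins-build ldd --version | head -n1   # a Xenial-based image reports glibc 2.23 here
$ docker exec jenkins-build grep VERSION= /etc/os-release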

See if you can get to the dumps here:
https://ibm.box.com/s/845jrpcdobolyrp0dstj9lf9v67w0oxi
https://ibm.box.com/s/ndoeaeav549l37ztn0s9nyexoadgqewz

and some kdumps in one tarball:
https://ibm.box.com/s/g2ecltl3o2be54iksp7du4v9lk0rnp3m

I've got access set to 'anyone in company', so I think that ought to give you access; let me know if not.

Revision history for this message
Frank Heimes (fheimes) wrote :

I get the following msg from IBM box, Robert:
"This shared file or folder link has been removed or is unavailable to you."
I guess you need to explicitly share it with me - since I'm external.
My box account is assigned to: frank.heimes <at> canonical.com

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-06-07 09:22 EDT-------
Added you specifically as a reader, Frank - try now?

Revision history for this message
Frank Heimes (fheimes) wrote :

Thx Robert, worked now.
Downloaded and transferred them to Canonical server 'mombin', folder ~fheimes/1831899

Changed in ubuntu-z-systems:
status: Incomplete → Triaged
Revision history for this message
Frank Heimes (fheimes) wrote :

Discussed with IBM that we need some help evaluating the VM dumps.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-06-18 03:36 EDT-------
The dumps are not z/VM dumps. All can be analysed by Linux tools...

------- Comment From <email address hidden> 2019-06-18 03:38 EDT-------
The comment "z/VM took a VMDUMP and reIPLed" is not from J. Brenneman.

Update information:
The Linux kernel is abending with a kernel oops, and z/VM Operations Manager, watching the virtual machine console, automatically ran VMDUMP of the virtual machine to dump the whole Linux virtual machine memory to spool. I pulled the 2 large dumps out of spool with the Linux 'vmur receive' command, using the -c argument to convert VMDUMP format to Linux kdump format. Those two large dumps are Linux dumps and should be able to be processed with Linux tools.
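
For readers unfamiliar with that flow, pulling such a dump out of the z/VM reader spool looks roughly like this (the device number and spool ID below are illustrative; vmur is part of s390-tools):

$ sudo modprobe vmur                              # z/VM unit record device driver
$ sudo chccwdev -e 000c                           # bring the reader device (typically 000c) online
$ sudo vmur list                                  # find the spool ID of the VMDUMP file
$ sudo vmur receive -f -c 0463 dump.190529.1324   # -c converts the VMDUMP while receiving it

Since -c already performs the conversion, running vmconvert on the resulting file afterwards is not expected to work.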

I later enabled Linux kdump, which gathered a couple of additional dumps on subsequent abends with the same symptom. I've added those to Box as well, so both the huge VMDUMP-originated dumps and the more reasonably sized Linux kdumps are referenced in Box links in the BZ.
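
On Ubuntu, enabling kdump as mentioned here is roughly the following, assuming the stock linux-crashdump/kdump-tools packaging:

$ sudo apt install linux-crashdump     # pulls in kdump-tools and makedumpfile
$ sudo reboot                          # so the crashkernel= memory reservation takes effect
$ kdump-config show                    # should report a state like "ready to kdump"
$ grep crashkernel /proc/cmdline       # confirm the reservation is active; dumps land in /var/crash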

There is no z/VM CP development based analysis required as far as I can tell - this is entirely a Linux kernel and glibc issue.

Revision history for this message
Frank Heimes (fheimes) wrote :

Tricky combination of glibc/libc6 and kernel.
Dumps are downloaded and bug assigned to kernel team.

Frank Heimes (fheimes)
description: updated
Revision history for this message
Juerg Haefliger (juergh) wrote :

I started to take a look at the dumps and noticed that there are faults in libc-2.17 (and not 2.23 as mentioned in the bug description). What are you running in those docker images? Can you share some details about what the machine was doing when it died?
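
For anyone following along, opening one of the attached kdumps is typically done with the crash utility plus the matching debug vmlinux; a sketch, assuming the 4.15.0-50 debug symbols from the ddebs.ubuntu.com repository are installed:

$ sudo apt install crash linux-image-4.15.0-50-generic-dbgsym   # dbgsym requires the ddebs repo to be enabled
$ crash /usr/lib/debug/boot/vmlinux-4.15.0-50-generic dump.201905311552
crash> bt      # backtrace of the crashing task
crash> log     # kernel ring buffer at the time of the crash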

Revision history for this message
Juerg Haefliger (juergh) wrote :

Also, I'm not able to convert the VMDUMP. What am I missing?

$ vmconvert dump.190529.1324
vmconvert: Input file 'dump.190529.1324' is not a vmdump

Revision history for this message
bugproxy (bugproxy) wrote : sosreport

------- Comment (attachment only) From <email address hidden> 2019-08-14 07:46 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : console message of 2 dumps - correlate timestamps

------- Comment (attachment only) From <email address hidden> 2019-08-14 07:47 EDT-------

Revision history for this message
Frank Heimes (fheimes) wrote :

Thx for the additional data.

However, I can confirm juergh that there is an issue with both dumps:
I downloaded them again from Box, but vmconvert is still not able to handle them:
$ ls -l
total 5735304
-rw-rw-r-- 1 ubuntu ubuntu 2737907971 Aug 14 08:23 dump.190529.1324
-rw-rw-r-- 1 ubuntu ubuntu 3135030835 Aug 14 09:19 dump.190529.2120
$ vmconvert ./dump.190529.1324
vmconvert: Input file './dump.190529.1324' is not a vmdump
$ vmconvert ./dump.190529.2120
vmconvert: Input file './dump.190529.2120' is not a vmdump
$

Please can you try to vmconvert them on YOUR system?
And in case it works, let us know which tool/version you used ('vmconvert -v', 'apt-cache policy s390-tools', 'lsb_release -a' and 'uname -a').
Thx

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2019-08-14 11:14 EDT-------
The 2 VMDUMP-generated dumps were already converted during the read into the Linux file.

Are you able to get to the kdumps here:

https://ibm.box.com/s/g2ecltl3o2be54iksp7du4v9lk0rnp3m

Those may be easier to parse than the VMDUMP-generated dumps.

The docker container images are probably part of a Jenkins build pipeline that is building s390x binaries for other IBM teams. We don't have logs on what they were doing at the time of failure.

Revision history for this message
Frank Heimes (fheimes) wrote :

Hi Robert, yes I can access this file and have already downloaded it.
It was just not clear to me that these are the same or similar dumps, and that the other (huge) files are not really needed...
The tgz seems to be fine, since it's extractable w/o issues:
$ tar xvfz 4_kdumps.tgz
201905311552/
201905311552/dump.201905311552
201905311552/dmesg.201905311552
201906030926/
201906030926/dmesg.201906030926
201906030926/dump.201906030926
201906041107/
201906041107/dmesg.201906041107
201906041107/dump.201906041107
201906061158/
201906061158/dmesg.201906061158
201906061158/dump.201906061158
kexec_cmd
linux-image-4.15.0-50-generic-201905311552.crash
linux-image-4.15.0-50-generic-201906030926.crash
linux-image-4.15.0-50-generic-201906041107.crash

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-08-14 15:40 EDT-------
It's the same symptom between the huge dumps and the later kdumps.
We opened with the huge dumps because that's what we had at the time; use the kdumps if they are more consumable.

Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Triaged → In Progress
Revision history for this message
Juerg Haefliger (juergh) wrote :

Is this easily reproducible? Can you retry with the latest released kernel 4.15.0-58.64? I haven't been able to reproduce the problem locally so far. It would help if you could share the container and workload, if that's possible at all. I'm still looking through the dumpfiles but haven't made much progress yet towards finding the root cause.
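
In case it helps, moving a Bionic system to the latest released kernel and verifying it is the usual update path; a sketch:

$ sudo apt update && sudo apt install linux-generic
$ sudo reboot
$ uname -r      # should now report 4.15.0-58-generic or newer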

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-09-16 11:05 EDT-------
(In reply to comment #36)
> Is this easily reproducible? Can you retry with the latest released kernel
> 4.15.0-58.64? I haven't been able to reproduce the problem locally so far.
> It would help if you could share the container and workload, if that's
> possible at all. I'm still looking through the dumpfiles but haven't made
> much progress yet towards finding the root cause.

Robert:
Please answer questions from Ubuntu.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-09-16 16:29 EDT-------
Updated one system to 4.15.0-62.69-generic and we no longer get the crash.

Revision history for this message
Frank Heimes (fheimes) wrote :

Thx for re-trying on the latest bionic kernel and confirming that the issue is gone.
Closing ticket.

Frank Heimes (fheimes)
Changed in linux (Ubuntu Bionic):
status: New → Fix Released
Changed in linux (Ubuntu):
status: New → Fix Released
Changed in ubuntu-z-systems:
status: In Progress → Fix Released
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-09-17 07:40 EDT-------
IBM Bugzilla status-> closed, Fix Released with Bionic...
