Kernel error/traffic stops during large NFS transfers from VM
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Juniper Openstack | Status tracked in Trunk | | |
R1.1 | Fix Committed | Critical | Unassigned |
Trunk | Invalid | Critical | Unassigned |
Bug Description
The issue occurs when a VM runs as an NFS server and a large data transfer takes place between the NFS server VM and other VMs or the compute nodes. The kernel on the compute node that hosts the NFS server VM hits an error, and data transfer to and from the VM stops. The following error is displayed:
[135588.871016] BUG: unable to handle kernel NULL pointer dereference at
(null)
[135588.895016] IP: [<ffffffff81141
[135588.903465] PGD 0
[135588.911785] Oops: 0000 [#1] SMP
[135588.919796] Modules linked in: vhost_net(F) macvtap(F) macvlan(F) ip6table_f
ilter(F) ip6_tables(F) ebtable_nat(F) ebtables(F) ipt_MASQUERADE(F) iptable_nat(
F) nf_nat_ipv4(F) nf_nat(F) nf_conntrack_
f_conntrack(F) ipt_REJECT(F) xt_CHECKSUM(F) iptable_mangle(F) xt_tcpudp(F) iptab
le_filter(F) ip_tables(F) x_tables(F) bridge(F) stp(F) llc(F) nbd(F) vrouter(OF)
vesafb(F) xfs(F) ib_iser(F) rdma_cm(F) ib_cm(F) iw_cm(F) ib_sa(F) ib_mad(F) ib_
core(F) ib_addr(F) iscsi_tcp(F) libiscsi_tcp(F) libiscsi(F) scsi_transport_
(F) nfsd(F) nfsv4(F) nfs_acl(F) auth_rpcgss(F) nfs(F) coretemp(F) kvm_intel(F) k
vm(F) ghash_clmulni_
lrw(F) aes_x86_64(F) lockd(F) xts(F) gf128mul(F) dm_multipath(F) scsi_dh(F) sunr
pc(F) sb_edac(F) edac_core(F) mei(F) ioatdma(F) gpio_ich(F) joydev(F) microcode(
F) wmi(F) mac_hid(F) lpc_ich(F) lp(F) parport(F) ses(F) enclosure(F) hid_generic
(F) usbhid(F) hid(F) ahci(F) libahci(F) igb(F) ixgbe(F) mpt2sas(F) dca(F) ptp(F)
scsi_transport
) libcrc32c(F)
[135589.059484] CPU 26
[135589.059610] Pid: 9475, comm: vhost-9473 Tainted: GF O 3.8.0-29-gene
ric #42~precise1-Ubuntu Supermicro SSG-6027R-
[135589.089948] RIP: 0010:[<
5/0x40
[135589.110255] RSP: 0018:ffff8806f2
[135589.120535] RAX: 0000000000000140 RBX: ffff88200412d400 RCX: ffff882007260ec
[135589.141816] RDX: 0000000000000000 RSI: ffff882007260e00 RDI: 0000000000000000
[135589.163365] RBP: ffff8806f2e75c08 R08: 0000000000000001 R09: 0000000000001000
[135589.185232] R10: ffff882003d04518 R11: 0000000000000001 R12: 0000000000000012
[135589.207945] R13: 000000000000f362 R14: ffffffff814f354b R15: 0000000000000042
[135589.230686] FS: 000000000000000
[135589.253617] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[135589.265253] CR2: 0000000000000000 CR3: 00000020258f1000 CR4: 00000000000427e0
[135589.288051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[135589.311538] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[135589.333796] Process vhost-9473 (pid: 9475, threadinfo ffff8806f2e74000, task ffff8806f2c3dd00)
[135589.356366] Stack:
[135589.367390] ffffffff815d57a8 ffff88200412d400 ffff88200412d400 ffff8806f2e75c18
[135589.389623] ffffffff815d58c5 ffff8806f2e75c38 ffffffff815d58ee ffffea006610d140
[135589.412151] ffff88200839c800 ffff8806f2e75c78 ffffffff815d5945 ffff882003d08498
[135589.434931] Call Trace:
[135589.446162] [<ffffffff815d5
[135589.457550] [<ffffffff815d5
[135589.468733] [<ffffffff815d5
[135589.479607] [<ffffffff815d5
[135589.490165] [<ffffffff814f3
[135589.500473] [<ffffffff814f3
[135589.511390] [<ffffffffa055f
[135589.521563] [<ffffffffa055f
[135589.531300] [<ffffffffa055c
[135589.541095] [<ffffffffa055c
[135589.550581] [<ffffffff8107f
[135589.559892] [<ffffffff8107f
[135589.569654] [<ffffffff816fc
[135589.578669] [<ffffffff8107f
[135589.587694] Code: fc 00 00 00 00 e8 ac fe ff ff 48 63 45 fc 65 48 01 04 25 58 08 01 00 c9 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 <48> f7 07 00 c0 00 00 55 48 89 e5 75 15 f0 ff 4f 1c 0f 94 c0 84
[135589.614638] RIP [<ffffffff81141
[135589.623362] RSP <ffff8806f2e75bf0>
[135589.631844] CR2: 0000000000000000
[135589.652617] ---[ end trace c53738dbfbdc0bdf ]---
The issue is caused by an experimental zero-copy feature introduced in the 3.8.0-29 Ubuntu kernel. It was fixed in the 3.8.0-31 Ubuntu kernel. The kernel.org commits for the fix are listed below:
https:/
https:/
https:/
https:/
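To check whether zero-copy transmit is active on a compute node, a minimal sketch follows. It assumes that the "experimental zero copy" above corresponds to the vhost_net module parameter `experimental_zcopytx`, exposed under sysfs; the report does not name the parameter, so treat both the mapping and the path as assumptions. The function name `zero_copy_tx_state` is hypothetical.

```python
from pathlib import Path

# Assumption: the zero-copy feature in this report maps to this vhost_net
# module parameter. The bug report itself does not name it.
ZCOPY_PARAM = Path("/sys/module/vhost_net/parameters/experimental_zcopytx")


def zero_copy_tx_state(param_path: Path = ZCOPY_PARAM) -> str:
    """Return 'enabled', 'disabled', 'unknown', or 'unavailable'."""
    try:
        value = param_path.read_text().strip()
    except OSError:
        # Module not loaded, parameter missing, or file not readable.
        return "unavailable"
    if value == "1":
        return "enabled"
    if value == "0":
        return "disabled"
    return "unknown"


if __name__ == "__main__":
    print("vhost_net zero-copy TX:", zero_copy_tx_state())
```

This only reports the current state and does not change it. Moving to a kernel that carries the fix remains the resolution described in this report.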
Information type: Proprietary → Public
The issue is fixed by choosing a kernel version that includes the fix. The Ubuntu kernel version chosen is 3.13.0-34-generic.
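As a rough aid for auditing compute nodes, the sketch below tests whether a kernel release string falls in the range this report describes: from 3.8.0-29, where the feature was introduced, up to but not including 3.8.0-31, where it was fixed. Treating 3.8.0-30 as affected is an assumption, since the report does not mention it. The helper names `parse_release` and `in_affected_range` are hypothetical, and the parsing assumes Ubuntu-style release strings such as `3.8.0-29-generic`.

```python
import platform
import re

# Bounds taken from this report: introduced in 3.8.0-29, fixed in 3.8.0-31.
AFFECTED_FROM = (3, 8, 0, 29)
FIXED_IN = (3, 8, 0, 31)


def parse_release(release: str) -> tuple:
    """Parse 'X.Y.Z-ABI[-flavour]' into a tuple of four integers."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def in_affected_range(release: str) -> bool:
    """True if the release lies in [3.8.0-29, 3.8.0-31)."""
    return AFFECTED_FROM <= parse_release(release) < FIXED_IN


if __name__ == "__main__":
    running = platform.release()
    try:
        verdict = "affected" if in_affected_range(running) else "not in affected range"
    except ValueError:
        verdict = "not an Ubuntu-style release string"
    print(f"{running}: {verdict}")
```

For example, `in_affected_range("3.8.0-29-generic")` is true, while `in_affected_range("3.13.0-34-generic")` is false.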