Activity log for bug #1488035

Date Who What changed Old value New value Message
2015-08-24 10:22:43 Gavin Guo bug added bug
2015-08-24 10:30:07 Brad Figg linux (Ubuntu): status New Incomplete
2015-08-24 10:43:53 Gavin Guo description [Impact] [Fix] [Test Case]
[Impact]
A node that mounts a Ceph RBD volume panics when all OSD daemons on all Ceph nodes are restarted.
[642981.871592] ------------[ cut here ]------------
[642981.912255] kernel BUG at /build/buildd/linux-3.13.0/net/ceph/osd_client.c:892!
[642981.994517] invalid opcode: 0000 [#1] SMP
[642982.037227] Modules linked in: xt_multiport iptable_mangle xt_nat xt_tcpudp veth xfs rbd libceph libcrc32c xt_addrtype xt_conntrack ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_filter ip_tables x_tables nf_nat nf_conntrack bridge aufs ipmi_devintf joydev gpio_ich x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd hid_generic mei_me ioatdma mei lpc_ich wmi ipmi_si 8021q garp stp mrp llc bonding acpi_power_meter mac_hid lp parport ixgbe usbhid dca tg3 ahci libahci hid ptp megaraid_sas mdio pps_core
[642982.528519] CPU: 0 PID: 1062099 Comm: kworker/0:6 Not tainted 3.13.0-45-generic #74-Ubuntu
[642982.648057] Hardware name: NEC Express5800/R120f-1M [N8100-2203Y]/MS-S0901, BIOS 5.0.4016 12/17/2014
[642982.775433] Workqueue: ceph-msgr con_work [libceph]
[642982.841300] task: ffff881028444800 ti: ffff880d92374000 task.ti: ffff880d92374000
[642982.973255] RIP: 0010:[<ffffffffa025f5be>] [<ffffffffa025f5be>] osd_reset+0x22e/0x2c0 [libceph]
[642983.114484] RSP: 0018:ffff880d92375d80 EFLAGS: 00010283
[642983.188540] RAX: ffff8800197f2ca8 RBX: ffff882028194750 RCX: ffff880036bcdc48
[642983.334096] RDX: ffff8800197f2ca8 RSI: ffff8800197f2c10 RDI: 0000000000000286
[642983.485552] RBP: ffff880d92375dd8 R08: 0000000000000000 R09: 0000000000000000
[642983.643277] R10: ffffffff8160afcf R11: ffffea00710cae00 R12: ffff8800197f2c58
[642983.805364] R13: ffff882028194810 R14: ffff880036bcdbf8 R15: ffff880036bcdc18
[642983.968728] FS: 0000000000000000(0000) GS:ffff88103fa00000(0000) knlGS:0000000000000000
[642984.135368] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[642984.220577] CR2: 00007f60d4cb7868 CR3: 0000000001c0e000 CR4: 00000000001407f0
[642984.383051] Stack:
[642984.459809] ffff8820281947a8 ffff882028194760 ffff8800197f2800 ffff8800197f2ca8
[642984.618038] ffff880d92375da0 ffff880d92375da0 ffff8800197f2c10 ffff8800197f2830
[Fix]
A linked list that manages OSDs in the kernel was corrupted when all OSD daemons on all Ceph nodes were restarted at almost the same time. The issue should be fixed by the following change:
libceph: must use new tid when watch is resent
http://tracker.ceph.com/issues/8806
The change consists of two patches, which have already been released:
http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/20878
[PATCH 1/2] libceph: abstract out ceph_osd_request enqueue logic
[PATCH 2/2] libceph: resend lingering requests with a new tid
The 3.18 kernel includes the fixes:
libceph: abstract out ceph_osd_request enqueue logic
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f671b581f1dac61354186b7373af5f97fe420584
libceph: resend lingering requests with a new tid
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2cc6128ab2afff7864dbdc33a73e2deaa935d9e0
[Test Case]
After setting up the Ceph environment, repeatedly issue the following command from a node to all Ceph nodes:
rsh -i key -l ubuntu sn_hostname sudo service ceph-all restart
Then verify whether any panics occur.
2015-08-24 10:44:33 Gavin Guo linux (Ubuntu): assignee Gavin Guo (mimi0213kimo)
2015-08-25 16:58:19 Gavin Guo description Old value: the description as set on 2015-08-24 10:43:53. New value: the same description with one sentence appended after the [Test Case] section: A test kernel with this fix was verified to fix this problem.
2015-08-27 14:46:30 Brad Figg nominated for series Ubuntu Trusty
2015-08-27 14:46:30 Brad Figg bug task added linux (Ubuntu Trusty)
2015-08-27 14:46:38 Brad Figg linux (Ubuntu Trusty): status New Fix Committed
2015-08-27 14:46:59 Brad Figg linux (Ubuntu): status Incomplete Invalid
2015-08-27 18:06:00 Nobuto Murata bug added subscriber Nobuto Murata
2015-09-13 22:37:58 Brad Figg tags sts trusty sts trusty verification-needed-trusty
2015-09-17 10:45:31 Gavin Guo tags sts trusty verification-needed-trusty sts trusty verification-done-trusty
2015-09-28 15:47:08 Launchpad Janitor linux (Ubuntu Trusty): status Fix Committed Fix Released
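
The [Test Case] in the description repeats one restart command against every Ceph node. A minimal sketch of that loop follows. It reuses the command from the description unchanged. The function names, the REMOTE, KEY, and ROUNDS variables, the round count, and the choice to run the restarts in parallel are all assumptions made for illustration and are not part of the report.

```shell
#!/bin/sh
# Hypothetical helper for the [Test Case] steps. Nothing here comes from
# the report except the restart command itself.
REMOTE="${REMOTE:-rsh}"   # remote-shell command, as written in the report
KEY="${KEY:-key}"         # identity file passed with -i (placeholder)
ROUNDS="${ROUNDS:-10}"    # number of restart rounds (assumed value)

# Restart the Ceph daemons on every host given as an argument.
# Each restart runs in the background so the OSDs go down at roughly
# the same time; this parallelism is an assumption.
restart_round() {
    for host in "$@"; do
        "$REMOTE" -i "$KEY" -l ubuntu "$host" sudo service ceph-all restart &
    done
    wait
}

# Repeat the restart round ROUNDS times against the same hosts.
stress_restart() {
    i=1
    while [ "$i" -le "$ROUNDS" ]; do
        restart_round "$@"
        i=$((i + 1))
    done
}
```

A run might look like `stress_restart node1 node2 node3` with placeholder host names, followed by checking the kernel log of the node that maps the RBD volume for the `kernel BUG at` line quoted in the description.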