Brief Description ----------------- Kernel errors in repetitive migration operations (Containerized STX load) Linux version 3.10.0-862.11.6.el7.36.tis.x86_64 (mockbuild@yow-cgts1-lx) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) ) #1 SMP PREEMPT Wed Jan 30 20:30:03 EST 2019 2019-01-31T19:12:48.459 localhost kernel: info [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.10.0-862.11.6.el7.36.tis.x86_64 root=UUID=c597a45c-3bd7-4d12-95aa-c750c859f605 ro module_blacklist=integrity,ima audit=0 tboot=false crashkernel=auto console=ttyS0,115200n8 intel_iommu=off usbcore.autosuspend=-1 loop.max_part=15 selinux=0 enforcing=0 nmi_watchdog=panic,1 softlockup_panic=1 biosdevname=0 user_namespace.enable=1 Severity -------- Standard Steps to Reproduce ------------------ Run instance launch and migrations (live then cold migration) Repeat again (over the course of several hours) instance was created here req-c320289f-13b0-43e9-815a-e47b59562a02 Create Feb. 1, 2019, 9:39 p.m. eeea9bd3d3fb4019bcc6c0884 Live migrated here req-05faa0f4-ddad-48dc-9bd9-baba546297f0 live-migration Feb. 1, 2019, 9:45 p.m. 393c0829f21346f998e165bd25eb4acf - Cold migration here req-64f7de0a-aa64-427a-bfd4-bf52d6aacf00 migrate Feb. 1, 2019, 9:49 p.m. 393c0829f21346f998e165bd25eb4acf Error Cold-Migrate-Confirm issued by tenant2 against instance tenant2-volume-47 owned by tenant2 on host compute-1 Cold-Migrate complete for instance tenant2-volume-47 now enabled on host compute-1 waiting for confirmation Cold migration confirm here req-55876933-4300-4861-9d6e-3a680450988a Confirm Resize Feb. 1, 2019, 9:50 p.m. 393c0829f21346f998e165bd25eb4acf Error Cold-Migrate-Confirm failed for instance tenant2-volume-47 on host compute-1 Instance tenant2-volume-47 owned by tenant2 has failed on host compute-1 Then the instance reboots a few times to try to recover here req-7c055100-d09d-4ee6-9bd2-a7237560452a Reboot Feb. 1, 2019, 10:05 p.m. 393c0829f21346f998e165bd25eb4acf - req-14b5c560-1c71-4cf9-84c3-42d3103c2d02 Reboot Feb. 1, 2019, 10:06 p.m. 393c0829f21346f998e165bd25eb4acf - req-256f317b-0ebd-4721-b2b6-2acfaf01f763 Reboot Feb. 1, 2019, 10:06 p.m. 393c0829f21346f998e165bd25eb4acf - Expected Behavior ------------------ Actual Behavior ---------------- Kernel Call trace hung task (see below) see nfv-vim-events.log log-id = 636 event-id = instance-failed event-type = action-event event-context = admin importance = high entity = tenant=bac8cdbe-598d-46ff-9d06-eab470342834.instance=f43b2460-bc67-4a08-a497-6953591490f0 reason_text = Instance tenant2-volume-47 owned by tenant2 has failed on host compute-1 additional_text = timestamp = 2019-02-01 21:51:39.050201 controller-0 kern.log 2019-02-01T21:52:47.372 controller-0 kernel: err [87714.235526] INFO: task jbd2/rbd1-8:117573 blocked for more than 120 seconds. 2019-02-01T21:52:47.372 controller-0 kernel: err [87714.243402] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2019-02-01T21:52:47.372 controller-0 kernel: info [87714.252189] jbd2/rbd1-8 D ffff908b11838000 0 117573 2 0x00000000 2019-02-01T21:52:47.372 controller-0 kernel: warning [87714.252198] Call Trace: 2019-02-01T21:52:47.372 controller-0 kernel: warning [87714.252208] [] ? blk_mq_run_hw_queue+0x14/0x20 2019-02-01T21:52:47.372 controller-0 kernel: warning [87714.252214] [] ? bit_wait+0x50/0x50 2019-02-01T21:52:47.372 controller-0 kernel: warning [87714.252218] [] schedule+0x29/0x70 2019-02-01T21:52:47.372 controller-0 kernel: warning [87714.252233] [] schedule_timeout+0x2a1/0x330 2019-02-01T21:52:47.372 controller-0 kernel: warning [87714.252238] [] ? bit_wait+0x50/0x50 2019-02-01T21:52:47.372 controller-0 kernel: warning [87714.252243] [] ? blk_flush_plug_list+0xce/0x230 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252248] [] ? bit_wait+0x50/0x50 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252253] [] io_schedule_timeout+0xad/0x130 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252257] [] io_schedule+0x18/0x20 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252263] [] bit_wait_io+0x11/0x50 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252273] [] __wait_on_bit+0x67/0x90 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252288] [] wait_on_page_bit+0x81/0xa0 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252297] [] ? wake_bit_function+0x40/0x40 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252306] [] __filemap_fdatawait_range+0x111/0x190 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252313] [] ? submit_bio+0x70/0x150 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252318] [] ? bio_alloc_bioset+0xd7/0x220 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252322] [] filemap_fdatawait_range+0x14/0x30 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252325] [] filemap_fdatawait+0x27/0x30 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252349] [] jbd2_journal_commit_transaction+0xb4f/0x1c90 [jbd2] 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252354] [] ? sched_clock_cpu+0xad/0xf0 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252358] [] ? __switch_to+0xce/0x5c0 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252361] [] ? finish_task_switch+0x4e/0x160 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252367] [] kjournald2+0xcd/0x260 [jbd2] 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252368] [] ? wake_up_atomic_t+0x30/0x30 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252372] [] ? commit_timeout+0x10/0x10 [jbd2] 2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252374] [] kthread+0xd1/0xe0 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.252376] [] ? kthread_create_on_node+0x140/0x140 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.252380] [] ret_from_fork_nospec_begin+0x7/0x21 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.252381] [] ? kthread_create_on_node+0x140/0x140 2019-02-01T21:52:47.406 controller-0 kernel: err [87714.252404] INFO: task async_133:118588 blocked for more than 120 seconds. 2019-02-01T21:52:47.406 controller-0 kernel: err [87714.260082] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2019-02-01T21:52:47.406 controller-0 kernel: info [87714.260084] async_133 D ffff908d95ba9660 0 118588 118180 0x00000000 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260084] Call Trace: 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260085] [] ? blk_mq_run_hw_queue+0x14/0x20 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260107] [] ? bit_wait+0x50/0x50 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260108] [] schedule+0x29/0x70 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260109] [] schedule_timeout+0x2a1/0x330 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260113] [] ? ktime_get_ts64+0x52/0xf0 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260115] [] ? bit_wait+0x50/0x50 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260116] [] io_schedule_timeout+0xad/0x130 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260117] [] io_schedule+0x18/0x20 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260118] [] bit_wait_io+0x11/0x50 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260121] [] __wait_on_bit+0x67/0x90 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260124] [] wait_on_page_bit+0x81/0xa0 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260125] [] ? wake_bit_function+0x40/0x40 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260126] [] __filemap_fdatawait_range+0x111/0x190 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260132] [] ? do_writepages+0x21/0x50 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260135] [] filemap_fdatawait_range+0x14/0x30 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260139] [] filemap_write_and_wait_range+0x56/0x90 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260161] [] ext4_sync_file+0xba/0x390 [ext4] 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260164] [] do_fsync+0x67/0xb0 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260168] [] ? SyS_write+0xa4/0xf0 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260182] [] SyS_fsync+0x10/0x20 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260186] [] system_call_fastpath+0x22/0x27 2019-02-01T21:52:47.406 controller-0 kernel: err [87714.260210] INFO: task async_275:118730 blocked for more than 120 seconds. 2019-02-01T21:52:47.406 controller-0 kernel: err [87714.260211] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2019-02-01T21:52:47.406 controller-0 kernel: info [87714.260213] async_275 D ffff908d96668b30 0 118730 118180 0x00000000 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260216] Call Trace: 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260217] [] schedule+0x29/0x70 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260222] [] jbd2_log_wait_commit+0xc5/0x150 [jbd2] 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260225] [] ? wake_up_atomic_t+0x30/0x30 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260229] [] jbd2_complete_transaction+0x6c/0x90 [jbd2] 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260236] [] ext4_sync_file+0x2f2/0x390 [ext4] 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260238] [] do_fsync+0x67/0xb0 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260241] [] ? SyS_write+0xa4/0xf0 2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260244] [] SyS_fsync+0x10/0x20 2019-02-01T21:52:47.407 controller-0 kernel: warning [87714.260246] [] system_call_fastpath+0x22/0x27 Reproducibility --------------- Several hours to happen on the 1st set of migrations Within 2 hours on 2nd reproduction run System Configuration -------------------- standard Branch/Pull Time/Commit ----------------------- Master as of date: 2019-01-30_20-18-00 Timestamp/Logs -------------- see kern.log attached ~ 2019-02-01T21:52:47.372 see also kern2.log attached ~ 2019-02-04T21:50:10.441