Brief Description
-----------------
Kernel hung-task errors during repetitive migration operations
(containerized STX load)
Linux version 3.10.0-862.11.6.el7.36.tis.x86_64 (mockbuild@yow-cgts1-lx) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) ) #1 SMP PREEMPT Wed Jan 30 20:30:03 EST 2019
2019-01-31T19:12:48.459 localhost kernel: info [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.10.0-862.11.6.el7.36.tis.x86_64 root=UUID=c597a45c-3bd7-4d12-95aa-c750c859f605 ro module_blacklist=integrity,ima audit=0 tboot=false crashkernel=auto console=ttyS0,115200n8 intel_iommu=off usbcore.autosuspend=-1 loop.max_part=15 selinux=0 enforcing=0 nmi_watchdog=panic,1 softlockup_panic=1 biosdevname=0 user_namespace.enable=1
Severity
--------
Standard
Steps to Reproduce
------------------
Run instance launch and migrations (live migration, then cold migration).
Repeat over the course of several hours (see the CLI sketch after the request records below).
The instance was created here:
req-c320289f-13b0-43e9-815a-e47b59562a02 Create Feb. 1, 2019, 9:39 p.m. eeea9bd3d3fb4019bcc6c0884
Live migrated here:
req-05faa0f4-ddad-48dc-9bd9-baba546297f0 live-migration Feb. 1, 2019, 9:45 p.m. 393c0829f21346f998e165bd25eb4acf -
Cold migrated here:
req-64f7de0a-aa64-427a-bfd4-bf52d6aacf00 migrate Feb. 1, 2019, 9:49 p.m. 393c0829f21346f998e165bd25eb4acf Error
Cold-Migrate-Confirm issued by tenant2 against instance tenant2-volume-47 owned by tenant2 on host compute-1
Cold-Migrate complete for instance tenant2-volume-47 now enabled on host compute-1 waiting for confirmation
Cold migration confirmed here:
req-55876933-4300-4861-9d6e-3a680450988a Confirm Resize Feb. 1, 2019, 9:50 p.m. 393c0829f21346f998e165bd25eb4acf Error
Cold-Migrate-Confirm failed for instance tenant2-volume-47 on host compute-1
Instance tenant2-volume-47 owned by tenant2 has failed on host compute-1
The instance then reboots a few times in an attempt to recover:
req-7c055100-d09d-4ee6-9bd2-a7237560452a Reboot Feb. 1, 2019, 10:05 p.m. 393c0829f21346f998e165bd25eb4acf -
req-14b5c560-1c71-4cf9-84c3-42d3103c2d02 Reboot Feb. 1, 2019, 10:06 p.m. 393c0829f21346f998e165bd25eb4acf -
req-256f317b-0ebd-4721-b2b6-2acfaf01f763 Reboot Feb. 1, 2019, 10:06 p.m. 393c0829f21346f998e165bd25eb4acf -
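The request IDs above are from the instance action log; as a minimal sketch, the same records can be pulled with the nova CLI (instance name taken from this report; credentials and endpoint as configured in the Timestamp/Logs section below):
# List all recorded actions (create, live-migration, migrate, confirmResize, reboot)
nova instance-action-list tenant2-volume-47
# Show the events behind a single request ID, e.g. the failed confirm-resize
nova instance-action tenant2-volume-47 req-55876933-4300-4861-9d6e-3a680450988a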
Expected Behavior
------------------
Instance launch and repeated live/cold migrations complete without kernel errors or hung tasks.
Actual Behavior
----------------
Kernel call trace reporting hung tasks (see below).
See nfv-vim-events.log:
log-id = 636
event-id = instance-failed
event-type = action-event
event-context = admin
importance = high
entity = tenant=bac8cdbe-598d-46ff-9d06-eab470342834.instance=f43b2460-bc67-4a08-a497-6953591490f0
reason_text = Instance tenant2-volume-47 owned by tenant2 has failed on host compute-1
additional_text =
timestamp = 2019-02-01 21:51:39.050201
controller-0 kern.log
2019-02-01T21:52:47.372 controller-0 kernel: err [87714.235526] INFO: task jbd2/rbd1-8:117573 blocked for more than 120 seconds.
2019-02-01T21:52:47.372 controller-0 kernel: err [87714.243402] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-02-01T21:52:47.372 controller-0 kernel: info [87714.252189] jbd2/rbd1-8 D ffff908b11838000 0 117573 2 0x00000000
2019-02-01T21:52:47.372 controller-0 kernel: warning [87714.252198] Call Trace:
2019-02-01T21:52:47.372 controller-0 kernel: warning [87714.252208] [<ffffffffaf4f3914>] ? blk_mq_run_hw_queue+0x14/0x20
2019-02-01T21:52:47.372 controller-0 kernel: warning [87714.252214] [<ffffffffaf9bb790>] ? bit_wait+0x50/0x50
2019-02-01T21:52:47.372 controller-0 kernel: warning [87714.252218] [<ffffffffaf9be399>] schedule+0x29/0x70
2019-02-01T21:52:47.372 controller-0 kernel: warning [87714.252233] [<ffffffffaf9bb171>] schedule_timeout+0x2a1/0x330
2019-02-01T21:52:47.372 controller-0 kernel: warning [87714.252238] [<ffffffffaf9bb790>] ? bit_wait+0x50/0x50
2019-02-01T21:52:47.372 controller-0 kernel: warning [87714.252243] [<ffffffffaf4eb57e>] ? blk_flush_plug_list+0xce/0x230
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252248] [<ffffffffaf9bb790>] ? bit_wait+0x50/0x50
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252253] [<ffffffffaf9bce7d>] io_schedule_timeout+0xad/0x130
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252257] [<ffffffffaf9bcf18>] io_schedule+0x18/0x20
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252263] [<ffffffffaf9bb7a1>] bit_wait_io+0x11/0x50
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252273] [<ffffffffaf9bb2c7>] __wait_on_bit+0x67/0x90
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252288] [<ffffffffaf365b01>] wait_on_page_bit+0x81/0xa0
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252297] [<ffffffffaf2a47d0>] ? wake_bit_function+0x40/0x40
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252306] [<ffffffffaf365c31>] __filemap_fdatawait_range+0x111/0x190
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252313] [<ffffffffaf4e9a40>] ? submit_bio+0x70/0x150
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252318] [<ffffffffaf41e387>] ? bio_alloc_bioset+0xd7/0x220
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252322] [<ffffffffaf365cc4>] filemap_fdatawait_range+0x14/0x30
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252325] [<ffffffffaf365d07>] filemap_fdatawait+0x27/0x30
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252349] [<ffffffffc0ddf3ff>] jbd2_journal_commit_transaction+0xb4f/0x1c90 [jbd2]
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252354] [<ffffffffaf2ba8dd>] ? sched_clock_cpu+0xad/0xf0
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252358] [<ffffffffaf21a59e>] ? __switch_to+0xce/0x5c0
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252361] [<ffffffffaf2b08be>] ? finish_task_switch+0x4e/0x160
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252367] [<ffffffffc0de56ad>] kjournald2+0xcd/0x260 [jbd2]
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252368] [<ffffffffaf2a4710>] ? wake_up_atomic_t+0x30/0x30
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252372] [<ffffffffc0de55e0>] ? commit_timeout+0x10/0x10 [jbd2]
2019-02-01T21:52:47.373 controller-0 kernel: warning [87714.252374] [<ffffffffaf2a3731>] kthread+0xd1/0xe0
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.252376] [<ffffffffaf2a3660>] ? kthread_create_on_node+0x140/0x140
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.252380] [<ffffffffaf9c1e1d>] ret_from_fork_nospec_begin+0x7/0x21
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.252381] [<ffffffffaf2a3660>] ? kthread_create_on_node+0x140/0x140
2019-02-01T21:52:47.406 controller-0 kernel: err [87714.252404] INFO: task async_133:118588 blocked for more than 120 seconds.
2019-02-01T21:52:47.406 controller-0 kernel: err [87714.260082] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-02-01T21:52:47.406 controller-0 kernel: info [87714.260084] async_133 D ffff908d95ba9660 0 118588 118180 0x00000000
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260084] Call Trace:
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260085] [<ffffffffaf4f3914>] ? blk_mq_run_hw_queue+0x14/0x20
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260107] [<ffffffffaf9bb790>] ? bit_wait+0x50/0x50
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260108] [<ffffffffaf9be399>] schedule+0x29/0x70
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260109] [<ffffffffaf9bb171>] schedule_timeout+0x2a1/0x330
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260113] [<ffffffffaf2e0ba2>] ? ktime_get_ts64+0x52/0xf0
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260115] [<ffffffffaf9bb790>] ? bit_wait+0x50/0x50
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260116] [<ffffffffaf9bce7d>] io_schedule_timeout+0xad/0x130
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260117] [<ffffffffaf9bcf18>] io_schedule+0x18/0x20
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260118] [<ffffffffaf9bb7a1>] bit_wait_io+0x11/0x50
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260121] [<ffffffffaf9bb2c7>] __wait_on_bit+0x67/0x90
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260124] [<ffffffffaf365b01>] wait_on_page_bit+0x81/0xa0
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260125] [<ffffffffaf2a47d0>] ? wake_bit_function+0x40/0x40
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260126] [<ffffffffaf365c31>] __filemap_fdatawait_range+0x111/0x190
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260132] [<ffffffffaf373441>] ? do_writepages+0x21/0x50
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260135] [<ffffffffaf365cc4>] filemap_fdatawait_range+0x14/0x30
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260139] [<ffffffffaf367da6>] filemap_write_and_wait_range+0x56/0x90
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260161] [<ffffffffc0e094aa>] ext4_sync_file+0xba/0x390 [ext4]
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260164] [<ffffffffaf415a37>] do_fsync+0x67/0xb0
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260168] [<ffffffffaf3e2614>] ? SyS_write+0xa4/0xf0
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260182] [<ffffffffaf415d20>] SyS_fsync+0x10/0x20
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260186] [<ffffffffaf9c1fdb>] system_call_fastpath+0x22/0x27
2019-02-01T21:52:47.406 controller-0 kernel: err [87714.260210] INFO: task async_275:118730 blocked for more than 120 seconds.
2019-02-01T21:52:47.406 controller-0 kernel: err [87714.260211] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-02-01T21:52:47.406 controller-0 kernel: info [87714.260213] async_275 D ffff908d96668b30 0 118730 118180 0x00000000
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260216] Call Trace:
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260217] [<ffffffffaf9be399>] schedule+0x29/0x70
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260222] [<ffffffffc0de5075>] jbd2_log_wait_commit+0xc5/0x150 [jbd2]
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260225] [<ffffffffaf2a4710>] ? wake_up_atomic_t+0x30/0x30
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260229] [<ffffffffc0de668c>] jbd2_complete_transaction+0x6c/0x90 [jbd2]
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260236] [<ffffffffc0e096e2>] ext4_sync_file+0x2f2/0x390 [ext4]
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260238] [<ffffffffaf415a37>] do_fsync+0x67/0xb0
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260241] [<ffffffffaf3e2614>] ? SyS_write+0xa4/0xf0
2019-02-01T21:52:47.406 controller-0 kernel: warning [87714.260244] [<ffffffffaf415d20>] SyS_fsync+0x10/0x20
2019-02-01T21:52:47.407 controller-0 kernel: warning [87714.260246] [<ffffffffaf9c1fdb>] system_call_fastpath+0x22/0x27
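The messages above come from the kernel's hung-task watchdog (khungtaskd). As a debugging sketch, the detector can be inspected and tuned via sysctl, and blocked-task stacks can be dumped on demand (the 240s value is illustrative, not a recommendation):
# Current detector settings (timeout in seconds; 0 disables the check)
sysctl kernel.hung_task_timeout_secs
sysctl kernel.hung_task_panic
# Temporarily extend the timeout while collecting data (illustrative value)
sysctl -w kernel.hung_task_timeout_secs=240
# Dump stacks of all uninterruptible (D-state) tasks via sysrq
echo w > /proc/sysrq-trigger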
Reproducibility
---------------
Took several hours to occur on the first set of migrations.
Reproduced within 2 hours on the second run.
System Configuration
--------------------
Standard
Branch/Pull Time/Commit
-----------------------
Master as of date: 2019-01-30_20-18-00
This was seen when running a containerized configuration using the Pike Docker images.
Timestamp/Logs
--------------
See kern.log attached, ~2019-02-01T21:52:47.372.
See also kern2.log attached, ~2019-02-04T21:50:10.441.
The general steps are as follows; they are repeated for several hours.
Test Step 1: Clear the local storage cache on compute-0:
rm -rf /var/lib/nova/instances/_base/*
sync; echo 3 > /proc/sys/vm/drop_caches
Step 2: Create a flavor with 2 vcpus, dedicated cpu policy, and local_image storage:
nova --os-username 'admin' --os-password '<password>' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne flavor-create local_image auto 1024 2 2
Step 3: Tenant2 launches instance1, booting it from local_image on compute-0 (collect startup time info):
nova --os-username 'tenant2' --os-password '<password>' --os-project-name tenant2 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name keypair-tenant2 --flavor <flavorid> --image <tis-centos-guest image> --nic net-id=28348793-baef-4987-ae23-0baf02036819 --nic net-id=9267348a-c940-41e4-8a57-ce1046946f42 --nic net-id=3de977f7-94d2-45ff-907d-977a715c7782 tenant2-local_image-1 --poll
Send 'ping -c 3 <ip of mgmt network for tenant2>'
Step 4: Tenant1 launches instance2 (the observer instance):
nova --os-username 'tenant1' --os-password '<password>' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name keypair-tenant1 --flavor <flavorid> --image <tis-centos-guest image> --nic net-id=f6ac6f3d-45e2-4e1f-88ad-ac10bbe8ccf2 --nic net-id=0d3b246c-f0b5-4ec4-bbd4-c76eb39d1b9b --nic net-id=3de977f7-94d2-45ff-907d-977a715c7782 tenant1-observer-2 --poll
Send 'ping -c 3 <ip of mgmt network for tenant1>'
Step 5: Collect live-migration KPI for the VM booted from local_image.
Step 6: Set up traffic for the pair instance1 and instance2.
Step 7: Cold migrate the instance and confirm the cold migration (collect KPI for each); see the sketch after this step list.
Step 8: Set up traffic for pairs between the instances (the one launched in Step 3 and the observer instance launched in Step 4).
Step 9/10: Ping instance1 from the NatBox on a new thread (collect KPI for rebuild).
Test Step 11: Perform a rebuild on instance1 again and ensure it is reachable afterwards (ping the tenant2 mgmt network).
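For Step 7, a minimal sketch of the migrate-and-confirm cycle with rough timing, assuming the nova CLI credentials from Steps 2-4 are in effect and using the instance name seen in this report:
SERVER=tenant2-volume-47          # instance name from this report
START=$(date +%s)
nova migrate "$SERVER"            # start the cold migration
# Wait until the instance reaches VERIFY_RESIZE before confirming
until nova show "$SERVER" | grep -q VERIFY_RESIZE; do sleep 5; done
nova resize-confirm "$SERVER"     # confirm the cold migration
echo "cold migrate + confirm took $(( $(date +%s) - START ))s"
The Step 5 live-migration KPI can be timed the same way with 'nova live-migration', waiting for the instance to return to ACTIVE instead of VERIFY_RESIZE.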
Teardown