reboot_server_hard can trigger corrupted file system

Bug #1847759 reported by Dr. Jens Harbott
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tempest
Expired
Undecided
Unassigned

Bug Description

This is kind of a follow up to https://bugs.launchpad.net/tempest/+bug/1583987 , there still seem to be situations where the test fails with the instance not properly working after the reboot due to a broken file system.

From https://zuul.opendev.org/t/openstack/build/be631c8c3e3d4ad4af48052121d58cfa:

2019-10-11 10:39:22.981727 | primary | ==============================
2019-10-11 10:39:22.981811 | primary | Failed 1 tests - output below:
2019-10-11 10:39:22.981896 | primary | ==============================
2019-10-11 10:39:22.981927 | primary |
2019-10-11 10:39:22.982149 | primary | tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_reboot_server_hard[id-2cb1baf6-ac8d-4429-bf0d-ba8a0ba53e32,smoke]
2019-10-11 10:39:22.982371 | primary | --------------------------------------------------------------------------------------------------------------------------------------------
2019-10-11 10:39:22.982400 | primary |
2019-10-11 10:39:22.982479 | primary | Captured traceback:
2019-10-11 10:39:22.982558 | primary | ~~~~~~~~~~~~~~~~~~~
2019-10-11 10:39:22.982679 | primary | b'Traceback (most recent call last):'
2019-10-11 10:39:22.982896 | primary | b' File "/opt/stack/new/tempest/tempest/lib/common/ssh.py", line 107, in _get_ssh_connection'
2019-10-11 10:39:22.983009 | primary | b' sock=proxy_chan)'
2019-10-11 10:39:22.983274 | primary | b' File "/opt/stack/new/tempest/.tox/tempest/lib/python3.6/site-packages/paramiko/client.py", line 368, in connect'
2019-10-11 10:39:22.983404 | primary | b' raise NoValidConnectionsError(errors)'
2019-10-11 10:39:22.983645 | primary | b'paramiko.ssh_exception.NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 172.24.5.165'
2019-10-11 10:39:22.983712 | primary | b''
2019-10-11 10:39:22.983853 | primary | b'During handling of the above exception, another exception occurred:'
2019-10-11 10:39:22.983892 | primary | b''
2019-10-11 10:39:22.983977 | primary | b'Traceback (most recent call last):'
2019-10-11 10:39:22.984179 | primary | b' File "/opt/stack/new/tempest/tempest/api/compute/servers/test_server_actions.py", line 151, in test_reboot_server_hard'
2019-10-11 10:39:22.984267 | primary | b" self._test_reboot_server('HARD')"
2019-10-11 10:39:22.984462 | primary | b' File "/opt/stack/new/tempest/tempest/api/compute/servers/test_server_actions.py", line 143, in _test_reboot_server'
2019-10-11 10:39:22.984566 | primary | b' new_boot_time = linux_client.get_boot_time()'
2019-10-11 10:39:22.984743 | primary | b' File "/opt/stack/new/tempest/tempest/common/utils/linux/remote_client.py", line 78, in get_boot_time'
2019-10-11 10:39:22.984837 | primary | b' boot_secs = self.exec_command(cmd)'
2019-10-11 10:39:22.985013 | primary | b' File "/opt/stack/new/tempest/tempest/lib/common/utils/linux/remote_client.py", line 60, in wrapper'
2019-10-11 10:39:22.985111 | primary | b' six.reraise(*original_exception)'
2019-10-11 10:39:22.985288 | primary | b' File "/opt/stack/new/tempest/.tox/tempest/lib/python3.6/site-packages/six.py", line 693, in reraise'
2019-10-11 10:39:22.985373 | primary | b' raise value'
2019-10-11 10:39:22.985582 | primary | b' File "/opt/stack/new/tempest/tempest/lib/common/utils/linux/remote_client.py", line 33, in wrapper'
2019-10-11 10:39:22.985685 | primary | b' return function(self, *args, **kwargs)'
2019-10-11 10:39:22.985868 | primary | b' File "/opt/stack/new/tempest/tempest/lib/common/utils/linux/remote_client.py", line 108, in exec_command'
2019-10-11 10:39:22.985966 | primary | b' return self.ssh_client.exec_command(cmd)'
2019-10-11 10:39:22.986117 | primary | b' File "/opt/stack/new/tempest/tempest/lib/common/ssh.py", line 151, in exec_command'
2019-10-11 10:39:22.986210 | primary | b' ssh = self._get_ssh_connection()'
2019-10-11 10:39:22.986372 | primary | b' File "/opt/stack/new/tempest/tempest/lib/common/ssh.py", line 121, in _get_ssh_connection'
2019-10-11 10:39:22.986465 | primary | b' password=self.password)'
2019-10-11 10:39:22.986621 | primary | b'tempest.lib.exceptions.SSHTimeout: Connection to the 172.24.5.165 via SSH timed out.'
2019-10-11 10:39:22.986713 | primary | b'User: cirros, Password: X8_*eC1SJRny6sI'
2019-10-11 10:39:22.986750 | primary | b''

...

2019-10-11 10:39:23.089748 | primary | b'[ 9.395436] EXT4-fs (vda1): ext4_check_descriptors: Block bitmap for group 32 not in group (block 8195)!'
2019-10-11 10:39:23.089870 | primary | b'[ 9.401545] EXT4-fs (vda1): group descriptors corrupted!'
2019-10-11 10:39:23.090054 | primary | b"[ 9.416926] EXT4-fs (vda1): couldn't mount as ext2 due to feature incompatibilities"
2019-10-11 10:39:23.090277 | primary | b'[ 9.429956] EXT4-fs (vda1): ext4_check_descriptors: Block bitmap for group 32 not in group (block 8195)!'
2019-10-11 10:39:23.090405 | primary | b'[ 9.432537] EXT4-fs (vda1): group descriptors corrupted!'
2019-10-11 10:39:23.090614 | primary | b'mount: mounting /dev/vda1 on /newroot failed: Invalid argument'
2019-10-11 10:39:23.090728 | primary | b"umount: can't umount /dev/vda1: Invalid argument"
2019-10-11 10:39:23.090951 | primary | b'mcb [info=LABEL=cirros-rootfs dev=/dev/vda1 target=/newroot unmount=cbfail callback=check_sbin_init ret=1: failed to unmount'
2019-10-11 10:39:23.091170 | primary | b'[ 9.897488] EXT4-fs (vda1): ext4_check_descriptors: Block bitmap for group 32 not in group (block 8195)!'
2019-10-11 10:39:23.091306 | primary | b'[ 9.902610] EXT4-fs (vda1): group descriptors corrupted!'
2019-10-11 10:39:23.091466 | primary | b"[ 9.907058] EXT4-fs (vda1): couldn't mount as ext2 due to feature incompatibilities"
2019-10-11 10:39:23.091653 | primary | b'[ 9.915266] EXT4-fs (vda1): ext4_check_descriptors: Block bitmap for group 32 not in group (block 8195)!'
2019-10-11 10:39:23.091773 | primary | b'[ 9.918281] EXT4-fs (vda1): group descriptors corrupted!'
2019-10-11 10:39:23.091897 | primary | b'mount: mounting /dev/vda1 on /newroot failed: Invalid argument'
2019-10-11 10:39:23.091992 | primary | b'info: copying initramfs to /dev/vda1'
2019-10-11 10:39:23.092091 | primary | b'cp: write error: No space left on device'

Not sure whether the issue with the soft-reboot mentioned in the old bug has been resolved by now, otherwise I would question whether this test is sensible to run at all.

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :
Download full text (6.5 KiB)

Seems there are variants of the above happening, too, like this kernel bug being triggered:

2019-10-08 19:54:34.768687 | controller | b'[ 13.456237] ------------[ cut here ]------------'
2019-10-08 19:54:34.768849 | controller | b'[ 13.456237] kernel BUG at /build/linux-BvkamA/linux-4.4.0/fs/ext4/ext4.h:2810!'
2019-10-08 19:54:34.768955 | controller | b'[ 13.456237] invalid opcode: 0000 [#1] SMP '
2019-10-08 19:54:34.769328 | controller | b'[ 13.456237] Modules linked in: nls_iso8859_1 isofs ip_tables x_tables pcnet32 8139cp mii ne2k_pci 8390 e1000 virtio_scsi'
2019-10-08 19:54:34.769500 | controller | b'[ 13.456237] CPU: 0 PID: 147 Comm: syslogd Not tainted 4.4.0-28-generic #47-Ubuntu'
2019-10-08 19:54:34.769684 | controller | b'[ 13.456237] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.2-1ubuntu1 04/01/2014'
2019-10-08 19:54:34.769853 | controller | b'[ 13.456237] task: ffff8800005a5780 ti: ffff880002adc000 task.ti: ffff880002adc000'
2019-10-08 19:54:34.770681 | controller | b'[ 13.456237] RIP: 0010:[<ffffffff812d32ae>] [<ffffffff812d32ae>] ext4_mb_find_by_goal+0x20e/0x2c0'
2019-10-08 19:54:34.770852 | controller | b'[ 13.456237] RSP: 0018:ffff880002adf790 EFLAGS: 00000246'
2019-10-08 19:54:34.771018 | controller | b'[ 13.456237] RAX: 0000000000000000 RBX: ffff880000e39000 RCX: ffff880002a6f800'
2019-10-08 19:54:34.771177 | controller | b'[ 13.456237] RDX: ffff880002adf958 RSI: ffff880002adf810 RDI: ffff880000e39000'
2019-10-08 19:54:34.771334 | controller | b'[ 13.456237] RBP: ffff880002adf7d8 R08: ffff880002a6f000 R09: ffff880000e39000'
2019-10-08 19:54:34.771490 | controller | b'[ 13.456237] R10: 0000000000000001 R11: ffffea0000009a80 R12: 0000000000000004'
2019-10-08 19:54:34.771654 | controller | b'[ 13.456237] R13: ffff880002a6f800 R14: ffff880002a6f000 R15: ffff880002a6f000'
2019-10-08 19:54:34.771828 | controller | b'[ 13.456237] FS: 00007f0505b036a0(0000) GS:ffff880003800000(0000) knlGS:0000000000000000'
2019-10-08 19:54:34.771962 | controller | b'[ 13.456237] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b'
2019-10-08 19:54:34.772117 | controller | b'[ 13.456237] CR2: 00007f9f7a165160 CR3: 0000000003741000 CR4: 00000000000006f0'
2019-10-08 19:54:34.772187 | controller | b'[ 13.456237] Stack:'
2019-10-08 19:54:34.772347 | controller | b'[ 13.456237] 0000000076cfafa3 ffff880000e6c0e8 ffff880000e6c0e8 0000000076cfafa3'
2019-10-08 19:54:34.772507 | controller | b'[ 13.456237] ffff880000e39000 0000000000000001 ffff880002adf954 ffff880002a6f000'
2019-10-08 19:54:34.772667 | controller | b'[ 13.456237] ffff880002a6f000 ffff880002adf878 ffffffff812d3b4d ffff880002adf818'
2019-10-08 19:54:34.772744 | controller | b'[ 13.456237] Call Trace:'
2019-10-08 19:54:34.772889 | controller | b'[ 13.456237] [<ffffffff812d3b4d>] ext4_mb_regular_allocator+0x6d/0x470'
2019-10-08 19:54:34.773023 | controller | b'[ 13.456237] [<ffffffff812a017c>] ? ext4_dirty_inode+0x5c/0x70'
2019-10-08 19:54:34.773159 | controller | b'[ 13.456237] [<ffffffff812d5846>] ext4_mb_new_blocks+0x416/0x5...

Read more...

Revision history for this message
Matt Riedemann (mriedem) wrote :

Are we sure this isn't a duplicate of bug 1808010? I ask because I see b'cp: write error: No space left on device' in the guest console output above.

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Martin Kopec (mkopec) wrote :

I went through https://bugs.launchpad.net/openstack-gate/+bug/1808010 and maybe this got fixed after the RAM increase in m1.micro and m1.nano flavors happened in May 2020? But let's keep this here for a while as a reminder to look for this kind of issue.

Revision history for this message
Martin Kopec (mkopec) wrote :

Let's mark this as Incomplete as there were no updates and no new occurrences of the issue.

Changed in tempest:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for tempest because there has been no activity for 60 days.]

Changed in tempest:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.