reboot_server_hard - can trigger loss of pub key data which fails test

Bug #1583987 reported by Ross Martyn
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
tempest
Fix Released
Undecided
Ken'ichi Ohmichi

Bug Description

This test as far as I am aware, starts an instance, logs into the instance via auto generated public keys, and once successful, performs a hard reboot immediately after. Finally it performs a second login attempt.

In our case, the hard reboot is successful, the instance is rebooted on the hypervisor, however more often than not, the second login attempt fails (This triggers a python trace of a timeout included below).

This appears to be because the hard reboot was so quick, that in a lot of cases the public key data is not persisted. Being able to persist data this quickly, is a test of the operating system, filesystem and associated configuration, and not strictly the OpenStack API or the associated reboot functionality.

I am able to replicate this by hand in a variety of ways, and I can see a couple of possible ways to improve this.

We could add a 'sync' (or similar) command during the ssh session. I realise this will not cover all Operating Systems, so a second suggestion could be a soft reboot first, this soft reboot will cause the OS to write its data to disk, before rebooting the system, and then moving on to perform the hard reboot

Another option could just be using ping, to make sure the system is back online, rather than logging in, but I appreciate its not as thorough.

Currently as a workaround I am having to set the interval to 30 seconds or so which means a full 'refstack' run takes over an hour!

----------------

tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_reboot_server_hard[id-2cb1baf6-ac8d-4429-bf0d-ba8a0ba53e32,smoke]
--------------------------------------------------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "/home/ubuntu/ref/refstack-client/.tempest/tempest/api/compute/servers/test_server_actions.py", line 125, in test_reboot_server_hard
        self._test_reboot_server('HARD')
      File "/home/ubuntu/ref/refstack-client/.tempest/tempest/api/compute/servers/test_server_actions.py", line 117, in _test_reboot_server
        new_boot_time = linux_client.get_boot_time()
      File "/home/ubuntu/ref/refstack-client/.tempest/tempest/common/utils/linux/remote_client.py", line 75, in get_boot_time
        boot_secs = self.exec_command(cmd)
      File "/home/ubuntu/ref/refstack-client/.tempest/tempest/common/utils/linux/remote_client.py", line 44, in exec_command
        return self.ssh_client.exec_command(cmd)
      File "/home/ubuntu/ref/refstack-client/.tempest/tempest/lib/common/ssh.py", line 118, in exec_command
        ssh = self._get_ssh_connection()
      File "/home/ubuntu/ref/refstack-client/.tempest/tempest/lib/common/ssh.py", line 95, in _get_ssh_connection
        time.sleep(bsleep)
      File "/home/ubuntu/ref/refstack-client/.tempest/.venv/local/lib/python2.7/site-packages/fixtures/_fixtures/timeout.py", line 52, in signal_handler
        raise TimeoutException()
    fixtures._fixtures.timeout.TimeoutException

 2016-05-20 08:55:39,571 3921 INFO [tempest.lib.common.ssh] ssh connection to ubuntu@78.31.109.199 successfully created
    2016-05-20 08:55:43,483 3921 INFO [tempest.lib.common.rest_client] Request (ServerActionsTestJSON:test_reboot_server_hard): 202 POST https://x.x.x.com:8774/v2/fec2b42806364f40bc3255c40daee034/servers/38806a0a-6127-4c54-a8f3-9542a4de5647/action 0.243s
    2016-05-20 08:55:43,606 3921 INFO [tempest.lib.common.rest_client] Request (ServerActionsTestJSON:test_reboot_server_hard): 200 GET https://x.x.x.com:8774/v2/fec2b42806364f40bc3255c40daee034/servers/38806a0a-6127-4c54-a8f3-9542a4de5647 0.122s
    2016-05-20 08:55:44,734 3921 INFO [tempest.lib.common.rest_client] Request (ServerActionsTestJSON:test_reboot_server_hard): 200 GET https://x.x.x.com:8774/v2/fec2b42806364f40bc3255c40daee034/servers/38806a0a-6127-4c54-a8f3-9542a4de5647 0.115s
    2016-05-20 08:55:44,744 3921 INFO [tempest.common.waiters] State transition "HARD_REBOOT/rebooting_hard" ==> "HARD_REBOOT/reboot_started_hard" after 1 second wait
    2016-05-20 08:55:45,877 3921 INFO [tempest.lib.common.rest_client] Request (ServerActionsTestJSON:test_reboot_server_hard): 200 GET https://x.x.x.x:8774/v2/fec2b42806364f40bc3255c40daee034/servers/38806a0a-6127-4c54-a8f3-9542a4de5647 0.130s
    2016-05-20 08:55:45,888 3921 INFO [tempest.common.waiters] State transition "HARD_REBOOT/reboot_started_hard" ==> "ACTIVE/None" after 2 second wait
    2016-05-20 08:56:15,921 3921 INFO [tempest.lib.common.ssh] Creating ssh connection to 'x.x.x.x' as 'ubuntu' with public key authentication
    2016-05-20 08:56:15,938 3921 INFO [paramiko.transport] Connected (version 2.0, client OpenSSH_6.6.1p1)
    2016-05-20 08:56:15,942 3921 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to ubuntu@x.x.x.x (). Number attempts: 1. Retry after 2 seconds.
    2016-05-20 08:56:18,455 3921 INFO [paramiko.transport] Connected (version 2.0, client OpenSSH_6.6.1p1)
    2016-05-20 08:56:18,464 3921 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to ubuntu@x.x.x.x (). Number attempts: 2. Retry after 3 seconds.

... and so on.

Appreciate any feed back, and please let me know if you need any more information.

Revision history for this message
Joseph Lanoux (joseph-lanoux) wrote :

I was about to raise this bug. I'm observing the exact same issue with any non-Cirros images servers or with a big flavor Cirros image server.

However, this issue appeared sometimes in the last several months and it used to work without problem.

I didn't have much time to debug this issue but I'm not sure it's a Tempest bug. Even if the use case of the test is really not realistic, I'm not too much in favor of modifying the test. We need to find out what is the root cause of this race condition.

description: updated
information type: Public → Public Security
information type: Public Security → Public
Revision history for this message
Ross Martyn (rossmartyn04) wrote :

Thanks for your comments Joseph.

Apologies if I am missing something obvious, but I am not sure how this could be a race condition?

I'd imagine that whilst adding a 'soft' reboot first, the test might take a little longer to complete, but the validity of the test would not be affected. I'd expect that 99.9% of operating systems would write their buffers to disk during the soft reboot, and that should cover most cases.

Revision history for this message
Joseph Lanoux (joseph-lanoux) wrote :

Ross,

Soft reboot in Tempest is broken: https://bugs.launchpad.net/tempest/+bug/1014647. I don't know the root cause of this issue. It might be a race condition. As I said, that was working before. Changing the test would just be a workaround. This issue is somewhere else.

Revision history for this message
Ross Martyn (rossmartyn04) wrote :

Hi Joseph,

OK, I see your point now. I was unaware that soft reboot was broken too!

I've noticed some other tests in refstack/defcore are commented with bug tickets, I guess I should report this to them also.

Thanks for your information.

Ross

Revision history for this message
Jordan Pittier (jordan-pittier) wrote :

Could you find a good logstash query for this ? (http://logstash.openstack.org/)

It would help to know if this bug happens often or not.

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

In general, I operate sync before rebooting a system on ubuntu/red hat linux to avoid the loss of data.
'sync' way seems simple and good for me, I don't know what linux distributions don't have sync command.

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

The scenario test manager of Tempest depends on sync command like

https://github.com/openstack/tempest/blob/master/tempest/scenario/manager.py#L631

 624 def create_timestamp(self, ip_address, dev_name=None, mount_path='/mnt',
 625 private_key=None):
 626 ssh_client = self.get_remote_client(ip_address,
 627 private_key=private_key)
 628 if dev_name is not None:
 629 ssh_client.make_fs(dev_name)
 630 ssh_client.mount(dev_name, mount_path)
 631 cmd_timestamp = 'sudo sh -c "date > %s/timestamp; sync"' % mount_path
 632 ssh_client.exec_command(cmd_timestamp)

So the sync way still is good for me.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/340723

Changed in tempest:
assignee: nobody → Ken'ichi Ohmichi (oomichi)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/340723
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=05ab856124df141df3becf7baf433e7ab7f415a6
Submitter: Jenkins
Branch: master

commit 05ab856124df141df3becf7baf433e7ab7f415a6
Author: Ken'ichi Ohmichi <email address hidden>
Date: Tue Jul 12 00:06:18 2016 -0700

    Add sync to avoid the loss of pub key data

    On the hard rebooting test, the pub key data can be loss due to the
    immediate rebooting. In addition, the existing scenario test manager
    is using 'sync' command to avoid the other data loss.
    So this patch adds a 'sync' command call.

    Change-Id: Ie55ad394cdbbde100eee45026c1e52a6f2825b77
    Closes-Bug: #1583987

Changed in tempest:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.