sethostname cause reboot make sysprep failed and vm crashed

Bug #1407842 reported by Hao Li on 2015-01-06
24
This bug affects 5 people
Affects Status Importance Assigned to Milestone
cloudbase-init
Undecided
Adrian Vladu

Bug Description

I have read cloudbase-init source code, I found it will work until "HKEY_LOCAL_MACHINE\\SYSTEM\\Setup\\Status\\SysprepStatus" is 7. This may be cause some problem:

sysprep process:
 finish making image,then run sysprep.exe /oobe /generalize /shutdown /unattend:path-------->boot instance from this image----->sysprep will work(find device,and install device driver,then set SysprepStatus 7)------------>sysprep do other things,then reboot vm--------->finish user settings------->login interface

because I have installed cloudbase-init, it start when SysprepStatus is 7, In this moment, sysprep is working and cloudbase-init is working too, when cloudbase-init finish running sethostname plugin, it will reboot vm, but this moment, sysprep is doing important system setting. So the VM will crash.

environment:
vm: windows 2012 R2 64bit, installed cloudbase-init (2014.12.08)
host: RHEL 6.4 64bit

I propose a way to solve this problem, set a flag in a flag_file, if flag_file is not exist or the content is 0, cloudbase-init should not run. set the flag 1, and cloudbase-init stop. if flag is 1, It can work. I have modify the source code:
def wait_for_boot_complete(self):
        path = CONF.sysprep_status_path
        print path

        is_exist = os.path.exists(path)
        if not is_exist:
            try:
                fp = os.open(path, os.O_CREAT | os.O_RDWR)
                os.write(fp, "1")
            except Exception, e:
                LOG.warning(e)
            finally:
                os.close(fp)
            return -1
        else:
            try:
                fp = os.open(path, os.O_RDWR)
                data = os.read(fp, 1)
            except Exception, e:
                LOG.warning(e)
                return -1

            if data == "0":
                os.lseek(fp,0,0)
                os.write(fp, "1")
                os.close(fp)
                return -1
            else:
                os.close(fp)

Hao Li (lihaosz) on 2015-01-06
description: updated

thanks for the report, we're checking it

Changed in cloudbase-init:
assignee: nobody → Adrian Vladu (avladu)

Hi, cloudbase-init, when installed through the provided MSI, it runs twice:

1) During specialize it is executed as a command, setting the hostname (avoiding one extra reboot).
2) During a regular boot it's executed as a service (checking for SysprepStatus == 7).

From the description of your issue, unless the service itself is started during specialize, there should be no uncontrolled reboot caused by cloudbase-init execution.

One are that deserves investigation is that we assume that the CLoudbase-Init service is started only after specialize, can you please provide your cloudbase-init logs for further investigation?

Thanks

PS marking as incomplete while waiting for the logs or a reply.

Changed in cloudbase-init:
status: New → Incomplete
Curt Moore (jcmoore) wrote :

I can confirm I have a very similar issue with Windows 7 and the most recent cloudbase-init x64 from 2015-03-06.

The issue appears to be intermittent as it may be a race condition where if sysprep happens to be far enough along by the time the reboot due to the hostname change happens, things are ok. If sysprep is not far enough along when the reboot happens, the sysprep process will fail with the attached error dialog.

Your comment about the cloudbase-init service being started prematurely is interesting and could explain the intermittent nature of the bug, depending upon the timing of the service starting during the sysprep boot.

I have attached the cloudbase-init output log.

Curt Moore (jcmoore) wrote :
Curt Moore (jcmoore) wrote :
Download full text (3.4 KiB)

After some further investigation, I believe I have proven that there is indeed contention between the unattended instance of cloudbase-init run during sysprep and the one run as a Windows service. Depending upon how quickly the Windows services come online during sysprep and how quickly the specialize phase executes, they can both try to execute at the same time causing the behavior described in this bug report.

After some modifications to the Cloudbase-Init installation process (no code modifications), I've been able to spin up many tens of Windows 7 instances and they've all worked perfectly, something which was previously impossible as I'd see at least 20-30% of nodes fail.

My method is as follows:

1) Install Cloudbase-Init using the MSI as normal, except DO NOT have it perform an automatic sysprep during the installation
2) After the Cloudbse-init installation is complete, run the following command to disable automatic startup of the Windows service version of cloudbase-init:
    sc config cloudbase-init start= disabled
3) Edit the unattend.xml file and add another RunSynchronousCommand node to re-enable automatic start of the cloudbase-init Windows service. This command _must_ be set as <Order>1</Order> and the existing unattended cloudbase-init node set to <Order>2</Order> so that the re-enable command runs prior to the unattended instance of cloudbase-init, which requires a reboot. The idea is that if we re-enable the cloudbase-init service (but do not restart it at this time) upon the reboot required by the unattended cloudbase-init, the Windows service instance will start as desired and will not be in contention with the unattended instance since it will have already executed.
Example:
        <RunSynchronousCommand wcm:action="add">
          <Order>1</Order>
          <Path>sc config cloudbase-init start= auto</Path>
          <Description>Re-enable auto start of cloudbase-init</Description>
          <WillReboot>Never</WillReboot>
        </RunSynchronousCommand>
4) After editing the unattend.xml file, manually run sysprep and reference this modified XML file:
    "%SYSTEMROOT%\system32\sysprep\sysprep.exe" /generalize /oobe /quit /unattend:C:\Unattend.xml
5) At this point, the VM could be manually shutdown and uploaded into Glance but after some analysis with tcpdump on the hypervisor's bridge interface with previous images, it appears that the Windows TCP stack tries to be "helpful" and will try to re-DISCOVER the same IP address the VM had prior to being sysprepped. This is troublesome when booting a VM in OpenStack as the network segments are totally different and it takes Windows ~10-20 seconds to stop trying to re-DISCOVER the old address and just issue a new DHCP REQUEST. My solution to this was to simply release the Windows IP configuration after the sysprep and then immediately shutdown the VM:
    ipconfig /release
   shutdown /s /f /t 1

At this point the Windows VM is ready to upload into Glance and be used to spool up as many nodes as desired. There could be more eloquent solutions but this approach seems to be working for me. Id' be happy to work toward implementing and testing a fix which we could get int...

Read more...

Hi Curt,

This is very valuable feedback, thanks!!

We've been doing similar tests and investigating the same area looking for the race condition cause.

The network rediscovery issue should be treated as a separate bug, your analysis on that is definitely valuable as well.

Hi Curt,

Looking at your description, we are already performing similar steps, when in the cloudbase-init installer the user chooses to sysprep. Here's the content of SetSetupComplete.cmd:

mkdir "%SYSTEMROOT%\Setup\Scripts"
sc config cloudbase-init start= demand
echo sc config cloudbase-init start= auto ^&^& net start cloudbase-init > "%SYSTEMROOT%\Setup\Scripts\SetupComplete.cmd"

Here's how the same script is called during the automated generation process:

https://github.com/cloudbase/windows-openstack-imaging-tools/blob/master/UnattendResources/Logon.ps1#L77

Would you be available for a chat on IRC or Skype? My nick is alexpilotti (FreeNode) on #openstack-dev

Thanks,

Alessandro

Curt Moore (jcmoore) wrote :

Alessandro,

I believe I see where our processes differ. I am not using the automatic sysprep option within the cloudbase-init installer so I do not get the benefit of the SetupComplete.cmd script setting the cloudbase-init service to start= demand. I am performing a manual sysprep after the cloudbase-init completes using my own unattend.xml as I need to perform a few additional customizations. In my case, it appears that cloudbase-init is set to start= auto during the cloudbase-init installation. Without the extra commands in SetupComplete.cmd to delay the service start upon first boot, I get the race condition between the standalone and service instances of cloudbase-init during sysprep.

I think the current approach is reasonable where we assume that if a user does not choose to initiate a sysprep during the installation process, that they would want cloudbase-init set to start= auto upon next boot. We may want to somehow document what is happening in the background with the enable/delay options for the cloudbase-init service in these cases so others do not encounter the same error. Once you explained what is happening with the cloudbase-init service startup options in the two scenarios, the behavior made complete sense. :-)

Thanks!
-Curt

Iwao Endo (1wao) on 2016-04-08
Changed in cloudbase-init:
status: Incomplete → Opinion
Iwao Endo (1wao) on 2016-04-08
Changed in cloudbase-init:
status: Opinion → Incomplete
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers