Comment 5 for bug 1407842

Curt Moore (jcmoore) wrote :

After some further investigation, I believe I have proven that there is indeed contention between the unattended instance of cloudbase-init run during sysprep and the one run as a Windows service. Depending on how quickly the Windows services come online during sysprep and how quickly the specialize phase executes, both instances can try to execute at the same time, causing the behavior described in this bug report.

After some modifications to the Cloudbase-Init installation process (no code modifications), I've been able to spin up many tens of Windows 7 instances and they've all worked perfectly, something which was previously impossible as I'd see at least 20-30% of nodes fail.

My method is as follows:

1) Install Cloudbase-Init using the MSI as normal, except DO NOT have it perform an automatic sysprep during the installation
2) After the Cloudbase-Init installation is complete, run the following command to disable automatic startup of the Windows service version of cloudbase-init:
    sc config cloudbase-init start= disabled
3) Edit the unattend.xml file and add another RunSynchronousCommand node that re-enables automatic start of the cloudbase-init Windows service. This command _must_ be set as <Order>1</Order>, and the existing unattended cloudbase-init node changed to <Order>2</Order>, so that the re-enable command runs before the unattended instance of cloudbase-init, which requires a reboot. The idea is that the command only re-enables the cloudbase-init service and does not start it at this time. On the reboot required by the unattended cloudbase-init, the Windows service instance will then start as desired and will not be in contention with the unattended instance, since that instance will have already executed.
Example:
        <RunSynchronousCommand wcm:action="add">
          <Order>1</Order>
          <Path>sc config cloudbase-init start= auto</Path>
          <Description>Re-enable auto start of cloudbase-init</Description>
          <WillReboot>Never</WillReboot>
        </RunSynchronousCommand>
4) After editing the unattend.xml file, manually run sysprep and reference this modified XML file:
    "%SYSTEMROOT%\system32\sysprep\sysprep.exe" /generalize /oobe /quit /unattend:C:\Unattend.xml
5) At this point, the VM could be manually shut down and uploaded into Glance. However, after some analysis of previous images with tcpdump on the hypervisor's bridge interface, it appears that the Windows TCP stack tries to be "helpful" and will try to re-DISCOVER the same IP address the VM had prior to being sysprepped. This is troublesome when booting a VM in OpenStack, as the network segments are totally different and it takes Windows ~10-20 seconds to stop trying to re-DISCOVER the old address and just issue a new DHCP REQUEST. My solution to this was to simply release the Windows IP configuration after the sysprep and then immediately shut down the VM:
    ipconfig /release
    shutdown /s /f /t 1
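
For anyone who would rather script the unattend.xml edit from step 3 than do it by hand, here is a rough Python sketch. It is not part of the Cloudbase-Init installer. It assumes the answer file uses the standard unattend and wcm XML namespaces and contains a single RunSynchronous element. The function name add_reenable_command and the paths in the usage note are my own placeholders. It bumps the <Order> of every existing command by one and inserts the re-enable command as <Order>1</Order>.

```python
import xml.etree.ElementTree as ET

# Namespaces used by Windows answer files (assumed to match your unattend.xml).
UNATTEND_NS = "urn:schemas-microsoft-com:unattend"
WCM_NS = "http://schemas.microsoft.com/WMIConfig/2002/State"

REENABLE_CMD = "sc config cloudbase-init start= auto"


def add_reenable_command(xml_text: str) -> str:
    """Return the answer-file XML with the re-enable command added as Order 1.

    Hypothetical helper for illustration only.
    """
    # Keep the original prefixes (default namespace and wcm:) on output.
    ET.register_namespace("", UNATTEND_NS)
    ET.register_namespace("wcm", WCM_NS)
    root = ET.fromstring(xml_text)

    run_sync = root.find(f".//{{{UNATTEND_NS}}}RunSynchronous")
    if run_sync is None:
        raise ValueError("no RunSynchronous element found in answer file")

    # Shift every existing command down by one so that Order 1 is free.
    for order in run_sync.iter(f"{{{UNATTEND_NS}}}Order"):
        order.text = str(int(order.text) + 1)

    # Build the new node, mirroring the XML example from step 3.
    cmd = ET.Element(
        f"{{{UNATTEND_NS}}}RunSynchronousCommand",
        {f"{{{WCM_NS}}}action": "add"},
    )
    fields = (
        ("Order", "1"),
        ("Path", REENABLE_CMD),
        ("Description", "Re-enable auto start of cloudbase-init"),
        ("WillReboot", "Never"),
    )
    for tag, text in fields:
        ET.SubElement(cmd, f"{{{UNATTEND_NS}}}{tag}").text = text

    run_sync.insert(0, cmd)
    return ET.tostring(root, encoding="unicode")
```

To use it, read the installed unattend.xml into a string, pass it to add_reenable_command, and write the result to the file you hand to sysprep in step 4 (C:\Unattend.xml in my case). The output loses the original whitespace and comments, so diff the result against the original before sysprepping.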

At this point the Windows VM is ready to upload into Glance and be used to spin up as many nodes as desired. There could be more elegant solutions, but this approach seems to be working for me. I'd be happy to work toward implementing and testing a fix which we could get into the formal installation process, as this bug was very frustrating and difficult to track down due to the inability to reproduce it on demand.

If it makes any difference, I was running my Windows VMs on Icehouse with no problems and only started seeing this behavior when we upgraded to Juno.