StarlingX Sanity Fails During Provisioning of 2nd Controller

Bug #1790213 reported by Elio Martinez on 2018-08-31
This bug affects 2 people

Affects: StarlingX
Importance: Critical
Assigned to: Erich Cordoba

Bug Description

ISO: stx-2018-08-30-97.iso
During sanity testing of the latest ISO, the following message appears (virtual environment):

Test_Install_And_Unlock_Second_Controller :: Validates that second... | FAIL |
Keyword 'Check Property Value' failed after retrying for 20 minutes. The last error was: failed != completed

This is a consequence of timing when checking whether the second controller is alive, from:
KEYWORD BuiltIn . Run Keyword If ${rc} == 0, Set To Dictionary, ${result}, stdout=${output.strip()}, ELSE IF, ${fail_if_error} == True, FAIL, ${output}, ELSE, Set To Dictionary, ${result}, stderr=${output}
00:00:00.001 KEYWORD Collections . Set To Dictionary ${result}, rc=${rc}
00:00:00.000 KEYWORD Collections . Log Dictionary ${result}
Documentation:

Logs the size and contents of the dictionary using given level.
Start / End / Elapsed: 20180830 15:48:25.202 / 20180830 15:48:25.202 / 00:00:00.000
15:48:25.202 INFO Dictionary size is 3 and it contains following items:
rc: 0
stderr:
stdout: +---------------------+-----------------------------------------+
| Property | Value |
+---------------------+-----------------------------------------+
| action | none |
| administrative | locked |
| availability | offline |
| bm_ip | None |
| bm_type | None |
| bm_username | None |
| boot_device | sda |
| capabilities | {u'Personality': u'Controller-Standby'} |
| config_applied | None |
| config_status | None |
| config_target | None |
| console | ttyS0,115200 |
| created_at | 2018-08-30T20:39:28.603776+00:00 |
| hostname | controller-1 |
| id | 4 |
| install_output | text |
| install_state | None |
| install_state_info | None |
| invprovision | None |
| location | {} |
| mgmt_ip | 192.168.204.4 |
| mgmt_mac | 52:54:00:99:17:33 |
| operational | disabled |
| personality | controller |
| reserved | False |
| rootfs_device | sda |
| serialid | None |
| software_load | 18.08 |
| task | None |
| tboot | false |
| ttys_dcd | None |
| updated_at | 2018-08-30T20:48:23.041593+00:00 |
| uptime | 0 |
| uuid | 0d854ce9-40a2-4e8d-8747-7b9651d01a01 |
| vim_progress_status | None |
+---------------------+-----------------------------------------+

And the following condition is never reached:
KEYWORD BuiltIn . Wait Until Keyword Succeeds 20 min, 20 sec, Check Property Value, ${second_controller}, install_state, completed
Documentation:

Runs the specified keyword and retries if it fails.
Start / End / Elapsed: 20180830 15:48:25.202 / 20180830 16:08:27.909 / 00:20:02.707
00:00:01.266 KEYWORD Utils . Check Property Value ${second_controller}, install_state, completed
Documentation:

Validates that property is set correctly to the expected value
Start / End / Elapsed: 20180830 15:48:25.203 / 20180830 15:48:26.469 / 00:00:01.266
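
For reference, the same check can be done by hand from controller-0; a minimal sketch, assuming the 'system host-show' table format shown above (the 60 x 20 s loop mirrors the keyword's 20 min / 20 sec retry):

for i in $(seq 1 60); do
    # pull the install_state value out of the host-show table
    state=$(system host-show controller-1 | awk '/ install_state /{print $4}')
    echo "attempt $i: install_state=$state"
    [ "$state" = "completed" ] && break
    sleep 20
done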

*********STEPS TO REPRODUCE**********
1. Get iso from http://gaponcec-desk.zpn.intel.com/ISO/
2. Install VMs with 2 controllers and 2 computes
3. Provision the 1st controller
4. Provision the 2nd controller; the system should be alive

*******CURRENT DEBUG******
Checking the latest ISOs, I found that http://gaponcec-desk.zpn.intel.com/ISO/stx-2018-08-22-76.iso does not have this problem; I will find the latest functional image and share results.
It seems that the first ISO with the issue is stx-2018-08-23-77.iso

Elio Martinez (elio1979) wrote :
Elio Martinez (elio1979) on 2018-08-31
description: updated
Elio Martinez (elio1979) on 2018-09-02
description: updated
Ghada Khalil (gkhalil) on 2018-09-04
tags: removed: stx.test

In addition, the installation of computes fails frequently; as a consequence, instances cannot be created.

Ricardo Perez (richomx) on 2018-09-04
tags: added: stx.test
Ghada Khalil (gkhalil) wrote :

Elio,
Can you please verify that your ISO has the following fix: https://review.openstack.org/597698?

tags: removed: stx.test
Ricardo Perez (richomx) on 2018-09-04
tags: added: stx.test
tags: removed: stx.test
Ghada Khalil (gkhalil) on 2018-09-05
Changed in starlingx:
status: New → Incomplete
Ghada Khalil (gkhalil) on 2018-09-05
Changed in starlingx:
importance: Undecided → Critical

I found the instructions for booting under libvirt/qemu (per the stx-tools repo and the link listed below) to have a few gaps, so I think the following information would help with triage:

- Output of 'virsh dumpxml $node' for each node in the cluster (see the example after this list)
- Confirmation that the instructions were followed, for reproducibility: https://wiki.openstack.org/wiki/StarlingX/Installation_Guide_Virtual_Environment/Controller_Storage
- The HEAD of the stx-tools and automation git repos being used
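
For example, the dumpxml output could be collected like this (a sketch; the domain names are whatever 'virsh list --all' reports for your deployment):

for node in $(virsh list --all --name); do
    # save one XML dump per libvirt domain
    virsh dumpxml "$node" > "dumpxml-$node.xml"
done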

The bug description shows controller-1 as locked/offline/disabled.

When I followed the instructions on the StarlingX wiki page for "StarlingX/Installation Guide Virtual Environment/Controller Storage", controller-1 reached unlocked/enabled/available. The load I used was built 2018-09-04_20-18-01.

Ghada Khalil (gkhalil) on 2018-09-07
tags: added: stx.config
Erich Cordoba (ericho) wrote :

These steps could help to get more information from the VM while it is installing, to rule out a kernel or installer crash (a command sketch follows after these steps).

- Go into /opt/platform/config/18.08/pxelinux.cfg. This folder has the boot options/menu for each controller/compute added to the system; the files are named after each node's MAC address.
  - Open the file that matches the node having problems and modify it:
       1) Replace console=ttyS0,115200 with console=tty0. This way you can see on the VM's screen what is going on during the installation.
       2) At the end of the APPEND line add: debug loglevel=8. This is to get more output from the kernel in case there is a crash somewhere.
   - Reboot the VM; if needed, change the boot order of the VM to boot from the network first.
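
A minimal sketch of those edits (the filename here is hypothetical; use the file whose name matches the MAC address of the problem node):

cd /opt/platform/config/18.08/pxelinux.cfg
# switch the installer console from serial to the VM screen
sed -i 's/console=ttyS0,115200/console=tty0/' 01-52-54-00-99-17-33
# ask the kernel for more verbose output on lines containing APPEND
sed -i '/APPEND/ s/$/ debug loglevel=8/' 01-52-54-00-99-17-33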

I added an "Accessing Server Consoles" section to the wiki StarlingX/Installation_Guide_Virtual_Environment/Controller_Storage (wiki.openstack.org/wiki/StarlingX/Installation_Guide_Virtual_Environment/Controller_Storage)

The text is as follows:

The XML for domains in the stx-tools repo, under deployment/libvirt, provides both graphical and text consoles.

Access the graphical console in virt-manager by right-clicking on the domain (the server) and selecting "Open".

Access the textual console with the command "virsh console $DOMAIN", where DOMAIN is the name of the server shown in virsh.

When booting controller-0 for the first time, both the serial and graphical consoles will present the initial configuration menu for the cluster. One can select a serial or graphical console for controller-0. For the other nodes, however, only the serial console is used, regardless of which option is selected.

Open the graphical console on all servers before powering them on to observe the boot device selection and PXE boot progress. Run the "virsh console $DOMAIN" command promptly after power-on to see the initial boot sequence that follows the boot device selection; one has only a few seconds to do this.
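
For example (the domain names are whatever the stx-tools deployment created; 'controller-1' here is only an assumption):

virsh list --all                # list the domains and their state
virsh console controller-1      # attach to the text console; exit with Ctrl+]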

Hi,

So far there is no way to check the log of commits that an ISO contains; a script is being developed to get that information and compare the differences between two ISOs.

Michael, there is a VM set up where I uploaded the ISO stx-2018-08-28-88.iso; that ISO hit issues 1790213 and 1790716. I will send you the credentials via email.

Next steps are:
- reproduce the steps in https://wiki.openstack.org/wiki/StarlingX/Installation_Guide_Virtual_Environment and https://wiki.openstack.org/wiki/StarlingX/Installation_Guide_Virtual_Environment/Controller_Storage with our ISO and the ISO shared by Michael.
- Check the differences between ISOs with the script being developed.

Hi, I uploaded the ISO image I used. md5sum: 35a2cbbecce73385f034603503521622

I downloaded the images you had in jc, md5sums:
a19d7d4885b66c80a7da02c3f512ff71 stx-2018-08-28-88.iso
99183d8a38c6bbf06c1414ae202d29b0 stx-2018-08-25-81.iso

stx-2018-08-28-88.iso is only 400G so I'd like to assume it is not valid for boot; I'll ignore it.

That's an interesting approach to file sharing... I'll have to chat with our IT about that.

M

I approximated the commits between your build and mine (attached commit-diff.txt)
old=2018-08-24_11-22-14
new=2018-09-04_20-18-01

I did not "spot" anything in particular, there's a lot of changes.

Cindy Xie (xxie1) wrote :

So with old=2018-08-24_11-22-14 this bug does not repro, and with new=2018-09-04_20-18-01 it does repro?
If the first time we see the bug is stx-2018-08-30-97.iso, shall we use that manifest to compare with 08-24's ISO?

No, the issue is reported by Intel against 2018-08-30-97. 2018-08-24_11-22-14 and 2018-09-04_20-18-01 are references to builds of StarlingX upstream at Wind River Systems. The text of the original report says the issue is not present in Intel build 2018-08-22-76.

2018-08-24 approximates the state of the git repos for Intel build 2018-08-25-81, since the dump of git repo HEADs is not available from Intel. The attached commit-diff.txt is the list of commits from 2018-08-24_11-22-14 up to 2018-09-04_20-18-01.

I'm starting to get confused by all the build numbers :/

I am able to provision controllers and compute using stx-2018-08-25-81.iso.

I have received stx-2018-08-28-88.iso, md5sum:

6aab54e5b6b9527f89d5a50cff4c2e98 stx-2018-08-28-88.iso

I will try that one next.

I see that stx-2018-08-30-97.iso has not yet been uploaded. I will check again tomorrow.

Hi Michel,

The ISO you shared was downloaded and tested in our test framework; no issues were found.
The ISOs generated by Intel are still failing, but those generated at Wind River are not.
We started to find issues with stx-2018-08-25-81.iso (1790716).

I guess there are differences between the way an ISO is built by Intel and by Wind River, since Intel's ISOs have issues and Wind River's do not. I think we should sync about this.

In the 'commit-diff.txt' file, does the content show the commits that 'new' contains and 'old' does not, or vice versa?

Since you could not find the 'LAST_COMMITS' file in our ISO, I guess you cannot make the comparison between your ISO and ours, right?

BTW Michel, did you run with our ISO or did you just get the commit differences?

I'm not sure that I understand the question, JC. I previously received and tested stx-2018-08-25-81. I had also received stx-2018-08-28-88, but it has only been tested now.

I have not yet received stx-2018-08-30-97, quoted in the issue description. However, stx-2018-08-28-88 presented the attached log (stx-2018-08-28-88.console.log) on the second controller's console.

Is this the issue that is being seen in this bug report? The installer is broken.

Yes, it is the same error, related to this package: Packages/python2-qpid-proton-0.24.0-1.el7.x86_64.rpm

After extracting the RPM packages from "stx-2018-08-25-81.iso", "stx-2018-08-28-88.iso" and "bootimage_michel.iso" and comparing them (a comparison sketch is shown after these lists):

stx-2018-08-25-81.iso (1364 rpms):
  - python2-qpid-proton-0.22.0-1.el7.x86_64.rpm

stx-2018-08-28-88.iso (1374 rpms):
  - python2-qpid-proton-0.24.0-1.el7.x86_64.rpm

bootimage_michel.iso (1379 rpms):
  - python2-qpid-proton-0.24.0-1.el7.x86_64.rpm

It seems that the change from v0.22 to v0.24 is causing the installation failure on our side.
Also, it seems that your ISO contains something else that helps the ISO work well, maybe other packages or dependencies.
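
A rough sketch of that comparison, assuming the two ISOs are available locally (the mount points and output file names are arbitrary):

mkdir -p /mnt/iso-a /mnt/iso-b
sudo mount -o loop,ro stx-2018-08-25-81.iso /mnt/iso-a
sudo mount -o loop,ro stx-2018-08-28-88.iso /mnt/iso-b
ls /mnt/iso-a/Packages | sort > packages-81.txt
ls /mnt/iso-b/Packages | sort > packages-88.txt
diff packages-81.txt packages-88.txt     # shows packages added, removed, or re-versioned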

Hi,

Please validate your latest build and attach the following to this bug if the condition we've agreed on persists for that build:
 - console log of controller-1
 - 'collect' from controller-0
 - output of 'virsh dumpxml ...' for the two controllers

Please also run the following commands and post the output to this bug

After that I would intend to defer to someone with more expertise with the installer and/or build process.

Please ignore "Please also run the following commands and post the output to this bug"... it was an unfinished thought that I neglected to edit out.

I uploaded a "latest" build from WR to your VM machine. It is sanitised by automation (on hardware), and by me for the virt scenario at least as far as this bug is concerned. I will following through tomorrow with more complete sanity.

md5sum:
559d5d7f6ecbc6e81e3dddfdd168e2fc /root/WR-latest-2018-09-10_20-18-00/2018-09-10_20-18-00.bootimage.iso

Erich Cordoba (ericho) wrote :

I have narrowed down the issue a little bit; these are my findings.

  - I compared bootimage_michel.iso with stx-2018-09-11-111.iso, which is today's ISO.
  - There are no significant package differences between them; minor versions have changed in some packages.
  - The issue is that once the kernel has been loaded through PXE, anaconda is unable to find the python2-qpid package. I am not sure if this is the first package that is requested.
  - From the anaconda shell I can see http://pxecontroller/ serving the files, and the repodata seems to be up to date there. I installed yum manually and was able to install the python2-qpid package (and any other available package).
  - The content of the pxe-network-installer package is not the same: the WR squashfs.img file has kernel 3.10.0-693.2.2 but ours has 3.10.0-862.6.3. We regenerated those files when the kernel was upgraded (maybe this matches the date when the issue started to arise).

So, my guess now is that the problem could be caused by the pxe-network-installer package. We create those files with the update-network-installer script. Is that the same one used in WR to generate these files?

I'll keep digging into the generation of these files to see if there is any problem.

Ghada Khalil (gkhalil) on 2018-09-12
tags: added: stx.2018.10

I finished my sanity of WR build 2018-09-10_20-18-00.bootimage.iso under virtualization, which was uploaded to the host JC set up.

I have received two new builds, with and without a new installer, described as: "I shall be producing a new build of the installer and doing a new build of the iso using the new installer."

I understand these builds are also related to this bug, where the "new installer" is a reference against what we expect the Intel team to be seeing.

I have validated the latest build "stx-2018-09-12-112.iso" and got the same error:

Starting package installation process
 Error populating transaction, retrying (1/10)
 Error populating transaction, retrying (2/10)
 Error populating transaction, retrying (3/10)
 Error populating transaction, retrying (4/10)
 Error populating transaction, retrying (5/10)
 Error populating transaction, retrying (6/10)
 Error populating transaction, retrying (7/10)
 Error populating transaction, retrying (8/10)
 Error populating transaction, retrying (9/10)
 Error populating transaction, retrying (10/10)
================================================================================
================================================================================
Error

The following error occurred while installing. This is a fatal error and
installation will be aborted.

 Error populating transaction after 10 retries: failure: Packages/python2-qpid-
proton-0.24.0-1.el7.x86_64.rpm from anaconda: [Errno 256] No more mirrors to
try.

Also attached the files you requested.

I downloaded "2018-09-10_20-18-00.bootimage.iso" and will start to test it.

Did you reproduce the issue on your side?

Hi JC, some engineers who are better experts than I are looking at the installer and related tech.

Thank you for the confirmations on the latest build stx-2018-09-12-112.iso and for the debug info in Documents.zip.

Tested WR ISO "2018-09-10_20-18-00.bootimage.iso"; no issues found. Controllers and computes were installed and provisioned correctly.

I will not have time to finish testing this before I leave, but I will follow up in the morning. Our engineer found that the lighttpd server was reporting 403 (permissions error).

host$ sudo mount -t auto stx-2018-08-28-88.iso /mnt/iso
host$ find /mnt/iso | grep qpid | xargs ls -l
-rw------- 1 1001 users 258624 Aug 28 08:04 /mnt/iso/Packages/python2-qpid-proton-0.24.0-1.el7.x86_64.rpm
-rw------- 1 1001 users 190128 Aug 28 08:04 /mnt/iso/Packages/qpid-proton-c-0.24.0-1.el7.x86_64.rpm

controller-0# find /www | grep "qpid.*.rpm" | xargs ls -l
-rw-------. 1 root root 258624 Sep 12 21:54 /www/pages/feed/rel-18.08/Packages/python2-qpid-proton-0.24.0-1.el7.x86_64.rpm
-rw-------. 1 root root 190128 Sep 12 21:54 /www/pages/feed/rel-18.08/Packages/qpid-proton-c-0.24.0-1.el7.x86_64.rpm

controller-0# # make the two RPMs world-readable so lighttpd (running as non-root) can serve them
controller-0# chmod ugo+r \
/www/pages/feed/rel-18.08/Packages/python2-qpid-proton-0.24.0-1.el7.x86_64.rpm \
/www/pages/feed/rel-18.08/Packages/qpid-proton-c-0.24.0-1.el7.x86_64.rpm

Actually, the testing passed before I left. The install of controller-1 based on stx-2018-08-28-88.iso progressed after the file permissions were fixed for those two RPM files:

<snip>
...
Running pre-installation scripts
.
Starting package installation process
Preparing transaction from installation source
Installing libgcc (1/1333)
Installing web-assets-filesystem (2/1333)
Installing grub2-common (3/1333)
Installing fontpackages-filesystem (4/1333)
...
<snip>

With load stx-2018-08-28-88.iso, when the permissions of those two RPMs are corrected in the RPM feed, both controller-1 and the computes install.

Tested the latest ISO "stx-2018-09-13-113.iso": controller-1 and the computes installed correctly; also, the issue related to Alarms/Events messages is not present (#1790718).

Michael,

What is the state of controller-1 at your site?

At my site, controller-1 was installed and provisioned correctly, but the availability state is degraded. I guess this state is not correct, right?

[wrsroot@controller-0 ~(keystone_admin)]$ system host-show controller-1
+---------------------+-----------------------------------------+
| Property | Value |
+---------------------+-----------------------------------------+
| action | none |
| administrative | unlocked |
| availability | degraded |

Hi JC,

Degraded is probably because the controllers are data-syncing. Use 'fm alarm-list' to confirm this. Data syncing can take a while... it is especially slow if you have all your disks on the virt host's root disk, or if the management network is set to capacity 1000 instead of 10000. I added a note about data-sync to the Controller_Storage wiki.
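
For example (run on controller-0 with keystone_admin credentials loaded, as in the host-show output above; the intent is to watch until the sync-related alarms clear):

fm alarm-list                    # list the currently active alarms
watch -n 30 fm alarm-list        # re-check every 30 s until the data-sync alarms clear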

Ghada Khalil (gkhalil) on 2018-09-14
tags: added: stx.build
removed: stx.config
Changed in starlingx:
assignee: nobody → Erich Cordoba (ericho)
Ghada Khalil (gkhalil) on 2018-09-14
Changed in starlingx:
status: Incomplete → Triaged
Erich Cordoba (ericho) wrote :

It has been found that the issue was bad permissions on the RPMs from the mirror.

These RPMs are stored in the ISO and then placed into /www/pages/feed/<release>/Packages/ to be used by anaconda when provisioning an additional controller or compute. As the HTTP server there runs as non-root, the files couldn't be served, causing the package installation to fail. See [0] and [1].

After changing the permissions to 644, the system is able to boot. A proposed next step is to ensure that the files under /www/pages/feed/<release>/Packages have the correct permissions to be read by the web server; this could be done through a kickstart script (a sketch follows below).
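
A minimal sketch of such a kickstart fragment, assuming a %post-style script run where the feed is populated (this is only an illustration of the permission fix, not the actual change):

%post
# ensure the feed RPMs are world-readable so the non-root web server can serve them
for rel in /www/pages/feed/rel-*; do
    find "$rel/Packages" -name '*.rpm' -exec chmod 644 {} +
done
%end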

However, as this has been root-caused, I recommend lowering the importance (maybe to Medium), as this is no longer a blocking issue.

[0] http://lists.starlingx.io/pipermail/starlingx-discuss/2018-September/001057.html
[1] http://lists.starlingx.io/pipermail/starlingx-discuss/2018-September/001056.html

Ghada Khalil (gkhalil) wrote :

My recommendation is to mark this bug as Fix Released. Please add a link to the commit/gerrit review which addressed it. This gives us traceability that the blocking issue is addressed. Then you can open a lower priority bug (or enhancement in SB) to implement the tool/kickstart to take care of all file permissions.

Bruce Jones (brucej) wrote :

Erich, ping, please update this bug as Ghada suggests and open a new one for the other issue.

Erich Cordoba (ericho) wrote :

I created this one https://bugs.launchpad.net/starlingx/+bug/1794894

This one doesn't need a code review.

Ghada Khalil (gkhalil) on 2018-09-28
Changed in starlingx:
status: Triaged → Fix Released
Ghada Khalil (gkhalil) wrote :

Marking as Fix Released. The specific issue reported in this bug report has been addressed.

Ken Young (kenyis) on 2019-04-06
tags: added: stx.1.0
removed: stx.2018.10