Illegal OpCode after Ubuntu install on ProLiant DL380 Gen9 with RAID

Bug #1543221 reported by Sergey Galkin
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
Fuel for OpenStack | Fix Released | High | Alexander Gordeev |
8.0.x | Confirmed | High | Fuel Python (Deprecated) |

Bug Description

HW: ProLiant DL380 Gen9 with HP Smart Array P840 Controller

During deploy, after first reboot from bootstrap image all nodes show red screen with "Illegal Opcode" error

Sergey Galkin (sgalkin) wrote :

snapshot

Sergey Galkin (sgalkin) wrote :

fuel --version
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "539"
  build_id: "539"
  fuel-nailgun_sha: "baec8643ca624e52b37873f2dbd511c135d236d9"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "e2d79330d5d708796330fac67722c21f85569b87"
  fuel-ostf_sha: "3bc76a63a9e7d195ff34eadc29552f4235fa6c52"
  fuel-mirror_sha: "fb45b80d7bee5899d931f926e5c9512e2b442749"
  fuelmenu_sha: "78ffc73065a9674b707c081d128cb7eea611474f"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "87dfb6bc25d4650264f09c338ed77c21a3d6fe87"

Changed in fuel:
assignee: nobody → MOS Linux (mos-linux)
importance: Undecided → High
Ivan Suzdal (isuzdal) wrote :
Changed in fuel:
status: New → Invalid
Sergey Galkin (sgalkin) wrote :

1. That link describes the HP Smart Array P410 controller ("FACT: HP Smart Array P410 controller"), but we have an HP Smart Array P840 Controller.
2. I can reset the env and reboot the nodes, and all nodes successfully boot into the bootstrap image.

Looks like this is our case http://ubuntuforums.org/showthread.php?t=1613584

Changed in fuel:
status: Invalid → New
Alexander Gordeev (a-gordeev) wrote :

Typically, illegal opcode means that the bootloader wasn't installed properly for some reason. But we're wiping the beginning and the end of every disk present in the system with zeroes. So, BIOS/UEFI/whatever code jumps to boot sector and then CPU suddenly will realize that opcode isn't correct by any means.
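
A minimal sketch of that situation, assuming a classic BIOS/MBR layout in which a bootable first sector ends with the bytes 0x55 0xAA. The helper names (`wipe_ends`, `has_boot_signature`, `make_fake_disk`) and the 1 MiB wipe size are hypothetical and are not fuel-agent's actual code; an ordinary temporary file stands in for a disk.

```python
import os
import tempfile

SECTOR = 512
BOOT_SIGNATURE = b"\x55\xaa"  # last two bytes of a bootable MBR sector


def wipe_ends(path, size=1024 * 1024):
    """Overwrite the first and last `size` bytes of `path` with zeroes."""
    total = os.path.getsize(path)
    size = min(size, total)
    with open(path, "r+b") as f:
        f.write(b"\x00" * size)      # beginning of the "disk"
        f.seek(total - size)
        f.write(b"\x00" * size)      # end of the "disk"


def has_boot_signature(path):
    """Return True if the first sector ends with the MBR boot signature."""
    with open(path, "rb") as f:
        sector = f.read(SECTOR)
    return len(sector) == SECTOR and sector[510:512] == BOOT_SIGNATURE


def make_fake_disk(size=4 * 1024 * 1024):
    """Create a temporary file whose first sector carries a boot signature."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\x90" * 510 + BOOT_SIGNATURE)  # stand-in boot code + signature
        f.write(b"\xff" * (size - SECTOR))
    return path
```

After `wipe_ends` runs on such a file, `has_boot_signature` returns False: nothing bootable is left on that disk until a bootloader is written to it again.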

node-2 and node-3 weren't provisioned. fuel-agent complained about improper disk partitioning scheme and exited.

http://paste.openstack.org/show/486318/

Looks like a bug in fuel-agent. It requires diving into the logs and the code.

Also, there are interesting messages in astute.log:

2016-02-08 17:10:13 DEBUG [723] Task time summary: reboot_provisioned_nodes with status error on node 1 took 00:04:00
2016-02-08 17:10:13 DEBUG [723] Task time summary: reboot_provisioned_nodes with status error on node 4 took 00:04:00
2016-02-08 17:10:13 DEBUG [723] Task time summary: reboot_provisioned_nodes with status error on node 5 took 00:04:00
2016-02-08 17:10:13 DEBUG [723] Task time summary: reboot_provisioned_nodes with status error on node 6 took 00:04:00

Nodes 1,4,5,6 were provisioned without errors.
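
To cross-check which nodes reported an error for reboot_provisioned_nodes, the summary lines can be pulled out of astute.log with a short script. This is only a sketch that assumes the line format shown above; the function name `failed_nodes` is hypothetical.

```python
import re

# Matches the "Task time summary" lines quoted from astute.log.
LINE_RE = re.compile(
    r"Task time summary: (?P<task>\S+) with status (?P<status>\S+) "
    r"on node (?P<node>\d+) took (?P<took>\d{2}:\d{2}:\d{2})"
)


def failed_nodes(log_text, task="reboot_provisioned_nodes"):
    """Return the sorted node ids for which `task` ended with status error."""
    nodes = []
    for match in LINE_RE.finditer(log_text):
        if match.group("task") == task and match.group("status") == "error":
            nodes.append(int(match.group("node")))
    return sorted(nodes)


SAMPLE = """\
2016-02-08 17:10:13 DEBUG [723] Task time summary: reboot_provisioned_nodes with status error on node 1 took 00:04:00
2016-02-08 17:10:13 DEBUG [723] Task time summary: reboot_provisioned_nodes with status error on node 4 took 00:04:00
2016-02-08 17:10:13 DEBUG [723] Task time summary: reboot_provisioned_nodes with status error on node 5 took 00:04:00
2016-02-08 17:10:13 DEBUG [723] Task time summary: reboot_provisioned_nodes with status error on node 6 took 00:04:00
"""

print(failed_nodes(SAMPLE))
```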

@Sergey Galkin (sgalkin),
Are you sure that all nodes were unable to boot and showed the red screen of death?

Alexander Gordeev (a-gordeev) wrote :

@Sergey Galkin,

The link you're referring to mainly contains rants about cciss disks. I didn't find any traces of cciss disks in the fuel-agent logs, so I don't think it's relevant.

On every node, the bootloader (stage0) and the /boot partition landed on /dev/sdb and /dev/sdb3 respectively.

Sergey Galkin (sgalkin) wrote :

I created https://bugs.launchpad.net/fuel/+bug/1543233 about the improper disk partitioning scheme.

Sergey Galkin (sgalkin) wrote :

@Alexander Gordeev, sorry for the incorrect description.
The controller and 3 ceph nodes show the "Illegal Opcode" error; all computes (2 nodes) stay in the bootstrap image in error status (as described in https://bugs.launchpad.net/fuel/+bug/1543233).

I thought something was wrong with grub, not with cciss.

Roman Podoliaka (rpodolyaka) wrote :

Removed tag "scale" as it has nothing to do with scale.

tags: removed: scale
Roman Podoliaka (rpodolyaka) wrote :

Per https://bugs.launchpad.net/fuel/+bug/1543221/comments/6 ("So, BIOS/UEFI/whatever code jumps to boot sector and then CPU suddenly will realize that opcode isn't correct by any means.") it's very likely that this issue is actually caused by https://bugs.launchpad.net/fuel/+bug/1543233 .

I tentatively mark this as Incomplete, until we finish the investigation for the latter.

Changed in fuel:
status: New → Incomplete
Alexander Gordeev (a-gordeev) wrote :

What actually happens:

1) nailgun-agent reported 2 disks: /dev/sda (3T) and /dev/sdb (1G).
2) fuel-agent wiped the beginning and the end of all disks, so the disks became clean, without any partition table.
3) fuel-agent chose to put the /boot partition and the bootloader only onto disks smaller than 2T where possible, due to https://bugs.launchpad.net/fuel/+bug/1461126 , so /boot was created on /dev/sdb.
4) fuel-agent finished provisioning without errors and astute rebooted the node.
5) During boot, the node tried to boot from its disks, but only one disk, /dev/sda, was available. /dev/sdb, where the bootloader and the /boot partition were written, wasn't available at that early stage of booting.
6) /dev/sda didn't have any bootloader code, so the BIOS/UEFI tried to boot from a boot sector that was invalid (in terms of executable code) and threw Illegal OpCode on the red screen of death.
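
The mismatch in the steps above can be sketched as follows. This is an illustration only, not fuel-agent's real selection code: the function name `boot_disks`, the reading of "2T" as 2 TiB, and the byte sizes used for the two disks are assumptions.

```python
TWO_TIB = 2 * 1024**4  # assumed meaning of the "2T" limit


def boot_disks(disks, limit=TWO_TIB):
    """Pick disks for /boot and the bootloader.

    `disks` maps a device name to its size in bytes. Disks smaller than
    `limit` are preferred; if none exist, every disk is a candidate.
    """
    small = [name for name, size in disks.items() if size < limit]
    return small if small else list(disks)


# Disks as reported by nailgun-agent in this bug (sizes approximated).
reported = {"/dev/sda": 3 * 1024**4, "/dev/sdb": 1 * 1024**3}

# Disks the firmware could actually boot from after the reboot.
firmware_visible = {"/dev/sda"}

chosen = boot_disks(reported)
print("bootloader written to:", chosen)
print("bootable overlap:", set(chosen) & firmware_visible)
```

With these inputs the bootloader lands only on /dev/sdb, and the overlap with the disks the firmware can boot from is empty, which matches the behaviour described above.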

Changed in fuel:
status: Incomplete → Confirmed
milestone: none → 9.0
assignee: MOS Linux (mos-linux) → Alexander Gordeev (a-gordeev)
tags: added: area-python
Alexander Gordeev (a-gordeev) wrote :

Further investigation showed that the small 1G disk (/dev/sdb) is connected via USB.

Device sdb udev properties:

{
   "DEVLINKS": "/dev/disk/by-id/usb-HP_iLO_LUN_00_Media_0_000002660A01-0:0 /dev/disk/by-path/pci-0000:00:14.0-usb-0:3.1:1.0-scsi-0:0:0:0",
   "DEVNAME": "/dev/sdb",
   "DEVPATH": "/devices/pci0000:00/0000:00:14.0/usb4/4-3/4-3.1/4-3.1:1.0/host1/target1:0:0/1:0:0:0/block/sdb",
   "DEVTYPE": "disk",
   "ID_BUS": "usb",
   "ID_INSTANCE": "0:0",
   "ID_MODEL": "LUN_00_Media_0",
   "ID_MODEL_ENC": "LUN\\x2000\\x20Media\\x200\\x20\\x20",
   "ID_MODEL_ID": "4030",
   "ID_PATH": "pci-0000:00:14.0-usb-0:3.1:1.0-scsi-0:0:0:0",
   "ID_PATH_TAG": "pci-0000_00_14_0-usb-0_3_1_1_0-scsi-0_0_0_0",
   "ID_REVISION": "2.09",
   "ID_SERIAL": "HP_iLO_LUN_00_Media_0_000002660A01-0:0",
   "ID_SERIAL_SHORT": "000002660A01",
   "ID_TYPE": "disk",
   "ID_USB_DRIVER": "usb-storage",
   "ID_USB_INTERFACES": ":080650:",
   "ID_USB_INTERFACE_NUM": "00",
   "ID_VENDOR": "HP_iLO",
   "ID_VENDOR_ENC": "HP\\x20iLO\\x20\\x20",
   "ID_VENDOR_ID": "0424",
   "MAJOR": "8",
   "MINOR": "16",
   "SUBSYSTEM": "block",
   "USEC_INITIALIZED": "898576"
}

I doubt this device is really needed for further deployment procedures. It looks like nailgun-agent must filter out all USB-connected block storage devices with no mercy.
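
A rough sketch of such a filter, keyed on the `ID_BUS` udev property shown above. It is written in Python purely for illustration and is not the agent's own code; the function names are hypothetical, the opt-in parameter name is borrowed from the option mentioned in the fix's commit message, and the /dev/sda entry is invented because its udev properties were not posted.

```python
def is_usb_device(udev_props):
    """True when udev reports the device as attached over USB."""
    return udev_props.get("ID_BUS") == "usb"


def reportable_disks(devices, report_usb_block_devices=False):
    """Drop USB block devices unless the caller explicitly opts in."""
    return [
        dev for dev in devices
        if report_usb_block_devices or not is_usb_device(dev)
    ]


DEVICES = [
    # Hypothetical entry for the RAID volume; its real properties are unknown.
    {"DEVNAME": "/dev/sda", "DEVTYPE": "disk"},
    # Subset of the sdb properties listed above.
    {"DEVNAME": "/dev/sdb", "DEVTYPE": "disk", "ID_BUS": "usb",
     "ID_VENDOR": "HP_iLO", "ID_USB_DRIVER": "usb-storage"},
]

print([dev["DEVNAME"] for dev in reportable_disks(DEVICES)])
```

Filtering by default and requiring an explicit opt-in keeps emulated media such as the iLO LUN out of the disk list while still allowing USB disks to be reported when someone really needs them.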

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-nailgun-agent (master)

Fix proposed to branch: master
Review: https://review.openstack.org/279593

Changed in fuel:
status: Confirmed → In Progress
Dmitry Pyzhov (dpyzhov)
tags: added: module-volumes
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-nailgun-agent (master)

Reviewed: https://review.openstack.org/279593
Committed: https://git.openstack.org/cgit/openstack/fuel-nailgun-agent/commit/?id=73877c75b03992f0458a19fd00c5caca2ec17474
Submitter: Jenkins
Branch: master

commit 73877c75b03992f0458a19fd00c5caca2ec17474
Author: Alexander Gordeev <email address hidden>
Date: Fri Feb 12 18:09:19 2016 +0300

    Exclude USB block devices by the default

    All USB storage devices must be filtered by the default as often this
    type of devices can be just an emulated temprorary storage for FW
    upgrade and so on.

    If one wants to get usb block devices reported to nailgun, then
    it could be either a cmdline option report_usb_block_devices or
    the same option added to the agent' config file.

    DocImpact
    Change-Id: Id609715732fd0ab393d1557b4810464fbfaf096e
    Closes-Bug: #1543221

Changed in fuel:
status: In Progress → Fix Committed
tags: added: scale
Leontii Istomin (listomin) wrote :

Hasn't been reproduced on 9.0-mos-481

Changed in fuel:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-nailgun-agent (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/363636

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-nailgun-agent (stable/8.0)

Change abandoned by Aleksandr Gordeev (<email address hidden>) on branch: stable/8.0
Review: https://review.openstack.org/363636
Reason: Not targeted for release
