Phase 1 tripleo-quickstart-promote-rocky-rdo_trunk-minimal job fail on master and rocky. Introspection failed to run hardware-detect utility

Bug #1818554 reported by Gabriele Cerami on 2019-03-04
Affects: tripleo
Importance: Critical
Assigned to: Gabriele Cerami

Bug Description

Jobs in the phase 1 promotion pipeline at

https://ci.centos.org/job/rdo_trunk-promote-rocky-current-tripleo/

and
https://ci.centos.org/job/rdo_trunk-promote-master-current-tripleo/

are failing during introspection.
The introspected nodes are libvirt VMs.

From a differential analysis, we saw that the jobs started to fail after Feb 25th. We suspect something was merged into ironic on master and backported to rocky.

Logs for ROCKY at

https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-rocky-rdo_trunk-minimal-114/undercloud/home/stack/overcloud_prep_images.log.gz

show

2019-03-04 12:23:30 | Exception introspecting nodes: {u'status': u'RUNNING', u'node_uuids': [u'595e4656-3bf0-488b-a1f7-835ff8b25a61', u'7bc2cf1b-ff95-4f71-b08f-cb094a31a79f'], u'failed_introspection': [u'595e4656-3bf0-488b-a1f7-835ff8b25a61', u'7bc2cf1b-ff95-4f71-b08f-cb094a31a79f'], u'result': u'Failure caused by error in tasks: send_message\n\n send_message [task_ex_id=a2cca1f8-c043-4dc6-9b32-c52e0c5b08ce] -> Workflow failed due to message status\n [wf_ex_id=713dff73-cd2c-4dff-80d6-0673804f3374, idx=0]: Workflow failed due to message status\n', u'introspected_nodes': {u'7bc2cf1b-ff95-4f71-b08f-cb094a31a79f': {u'uuid': u'7bc2cf1b-ff95-4f71-b08f-cb094a31a79f', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/7bc2cf1b-ff95-4f71-b08f-cb094a31a79f', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2019-03-04T11:02:35'}, u'595e4656-3bf0-488b-a1f7-835ff8b25a61': {u'uuid': u'595e4656-3bf0-488b-a1f7-835ff8b25a61', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/595e4656-3bf0-488b-a1f7-835ff8b25a61', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2019-03-04T11:02:36'}}, u'message': u'Retrying 2 nodes that failed introspection. Attempt 1 of 3 ', u'introspection_attempt': 1}
2019-03-04 12:23:30 | Waiting for introspection to finish...
2019-03-04 12:23:30 | Introspection of node 7bc2cf1b-ff95-4f71-b08f-cb094a31a79f timed out.
2019-03-04 12:23:30 | Introspection of node 595e4656-3bf0-488b-a1f7-835ff8b25a61 timed out.
2019-03-04 12:23:30 | Retrying 2 nodes that failed introspection. Attempt 1 of 3
2019-03-04 12:23:30 | Introspection of node 595e4656-3bf0-488b-a1f7-835ff8b25a61 timed out.
2019-03-04 12:23:30 | Introspection of node 7bc2cf1b-ff95-4f71-b08f-cb094a31a79f timed out.
2019-03-04 12:23:30 | Retrying 2 nodes that failed introspection. Attempt 2 of 3
2019-03-04 12:23:30 | Introspection of node 595e4656-3bf0-488b-a1f7-835ff8b25a61 timed out.
2019-03-04 12:23:30 | Introspection of node 7bc2cf1b-ff95-4f71-b08f-cb094a31a79f timed out.
2019-03-04 12:23:30 | Retrying 2 nodes that failed introspection. Attempt 3 of 3
2019-03-04 12:23:30 | Introspection of node 7bc2cf1b-ff95-4f71-b08f-cb094a31a79f timed out.
2019-03-04 12:23:30 | Introspection of node 595e4656-3bf0-488b-a1f7-835ff8b25a61 timed out.
2019-03-04 12:23:30 | Retry limit reached with 2 nodes still failing introspection
2019-03-04 12:23:30 | {u'status': u'RUNNING', u'node_uuids': [u'595e4656-3bf0-488b-a1f7-835ff8b25a61', u'7bc2cf1b-ff95-4f71-b08f-cb094a31a79f'], u'failed_introspection': [u'595e4656-3bf0-488b-a1f7-835ff8b25a61', u'7bc2cf1b-ff95-4f71-b08f-cb094a31a79f'], u'result': u'Failure caused by error in tasks: send_message\n\n send_message [task_ex_id=a2cca1f8-c043-4dc6-9b32-c52e0c5b08ce] -> Workflow failed due to message status\n [wf_ex_id=713dff73-cd2c-4dff-80d6-0673804f3374, idx=0]: Workflow failed due to message status\n', u'introspected_nodes': {u'7bc2cf1b-ff95-4f71-b08f-cb094a31a79f': {u'uuid': u'7bc2cf1b-ff95-4f71-b08f-cb094a31a79f', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/7bc2cf1b-ff95-4f71-b08f-cb094a31a79f', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2019-03-04T11:02:35'}, u'595e4656-3bf0-488b-a1f7-835ff8b25a61': {u'uuid': u'595e4656-3bf0-488b-a1f7-835ff8b25a61', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/595e4656-3bf0-488b-a1f7-835ff8b25a61', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2019-03-04T11:02:36'}}, u'message': u'Retrying 2 nodes that failed introspection. Attempt 1 of 3 ', u'introspection_attempt': 1}

Digging deeper, we found errors at

https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-rocky-rdo_trunk-minimal-114/undercloud/var/log/containers/ironic-inspector/ironic-inspector.log.gz

A series of errors like these:

2019-03-04 11:51:20.356 10 ERROR ironic_inspector.utils [req-6611cfa9-5467-4499-8615-4fa0fd2589f5 - - - - -] [node: MAC 00:b5:09:60:a5:01] Ramdisk reported error: The following errors were encountered:
* failed to run hardware-detect utility: Unexpected error while running command.
Command: hardware-detect
Exit code: 1
Stdout: u''
Stderr: u'/bin/sh: cli64: command not found\nInfo: detect_areca: No controller found\nCannot find megacli on the system\nFailed at getting disk information at /sys/block/vda/device/device/model: [Errno 2] No such file or directory: \'/sys/block/vda/device/model\'\nFailed at getting disk information at /sys/block/vda/device/device/rev: [Errno 2] No such file or directory: \'/sys/block/vda/device/rev\'\nFailed at getting disk information at /sys/block/vda/device/scsi_disk/virtio1/cache_type: [Errno 2] No such file or directory: \'/sys/block/vda/device/scsi_disk/virtio1/cache_type\'\nread_smart: Reading S.M.A.R.T information on /dev/vda\nread_smart: Reading S.M.A.R.T information on /dev/vda with -d ata\nread_smart: no device /dev/vda\nCannot get device pause settings: Operation not supported\nconnect: Connection refused\nFailed to connect to lldpad - clif_open: Invalid argument\nTraceback (most recent call last):\n File "/usr/bin/hardware-detect", line 10, in <module>\n sys.exit(main())\n File "/usr/lib/python2.7/site-packages/hardware/detect.py", line 996, in main\n _main(options)\n File "/usr/lib/python2.7/site-packages/hardware/detect.py", line 923, in _main\n if not detect_system(hrdw):\n File "/usr/lib/python2.7/site-packages/hardware/detect.py", line 666, in detect_system\n get_cpus(hw_lst)\n File "/usr/lib/python2.7/site-packages/hardware/detect.py", line 740, in get_cpus\n processor), item.lower(), lscpu[item]))\nKeyError: \'L3 cache\'\n'
2019-03-04 11:51:20.357 10 ERROR ironic_inspector.process [req-6611cfa9-5467-4499-8615-4fa0fd2589f5 - - - - -] [node: MAC 00:b5:09:60:a5:01] Hook ramdisk_error failed, delaying error report until node look up: Ramdisk reported error: The following errors were encountered:
* failed to run hardware-detect utility: Unexpected error while running command.
Command: hardware-detect
Exit code: 1
Stdout: u''
Stderr: u'/bin/sh: cli64: command not found\nInfo: detect_areca: No controller found\nCannot find megacli on the system\nFailed at getting disk information at /sys/block/vda/device/device/model: [Errno 2] No such file or directory: \'/sys/block/vda/device/model\'\nFailed at getting disk information at /sys/block/vda/device/device/rev: [Errno 2] No such file or directory: \'/sys/block/vda/device/rev\'\nFailed at getting disk information at /sys/block/vda/device/scsi_disk/virtio1/cache_type: [Errno 2] No such file or directory: \'/sys/block/vda/device/scsi_disk/virtio1/cache_type\'\nread_smart: Reading S.M.A.R.T information on /dev/vda\nread_smart: Reading S.M.A.R.T information on /dev/vda with -d ata\nread_smart: no device /dev/vda\nCannot get device pause settings: Operation not supported\nconnect: Connection refused\nFailed to connect to lldpad - clif_open: Invalid argument\nTraceback (most recent call last):\n File "/usr/bin/hardware-detect", line 10, in <module>\n sys.exit(main())\n File "/usr/lib/python2.7/site-packages/hardware/detect.py", line 996, in main\n _main(options)\n File "/usr/lib/python2.7/site-packages/hardware/detect.py", line 923, in _main\n if not detect_system(hrdw):\n File "/usr/lib/python2.7/site-packages/hardware/detect.py", line 666, in detect_system\n get_cpus(hw_lst)\n File "/usr/lib/python2.7/site-packages/hardware/detect.py", line 740, in get_cpus\n processor), item.lower(), lscpu[item]))\nKeyError: \'L3 cache\'\n': Error: Ramdisk reported error: The following errors were encountered:

On MASTER, the introspection error at

https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-master-current-tripleo-delorean-minimal-474/undercloud/home/stack/overcloud_prep_images.log.gz

is slightly different:

2019-03-04 05:54:03 | Exception occured while running the command
2019-03-04 05:54:03 | Traceback (most recent call last):
2019-03-04 05:54:03 | File "/usr/lib/python2.7/site-packages/tripleoclient/command.py", line 29, in run
2019-03-04 05:54:03 | super(Command, self).run(parsed_args)
2019-03-04 05:54:03 | File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
2019-03-04 05:54:03 | return super(Command, self).run(parsed_args)
2019-03-04 05:54:03 | File "/usr/lib/python2.7/site-packages/cliff/command.py", line 184, in run
2019-03-04 05:54:03 | return_code = self.take_action(parsed_args) or 0
2019-03-04 05:54:03 | File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_node.py", line 211, in take_action
2019-03-04 05:54:03 | run_validations=parsed_args.run_validations
2019-03-04 05:54:03 | File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/baremetal.py", line 194, in introspect_manageable_nodes
2019-03-04 05:54:03 | 'Exception introspecting nodes: {}'.format(payload['message']))
2019-03-04 05:54:03 | IntrospectionError: Exception introspecting nodes: {u'status': u'FAILED', u'node_uuids': [u'cf5b42d4-a744-4067-bb16-a98d9dc19598', u'808c7aa0-e5fc-46a9-a737-fc95c82ac038'], u'failed_introspection': [u'cf5b42d4-a744-4067-bb16-a98d9dc19598', u'808c7aa0-e5fc-46a9-a737-fc95c82ac038'], u'result': u'Failure caused by error in tasks: send_message\n\n send_message [task_ex_id=bb577555-a8df-4f6d-9bb0-17f11143a8b2] -> Workflow failed due to message status. Status:FAILED Message:Retry limit reached with 2 nodes still failing introspection\n [wf_ex_id=11cf74fe-b8c3-4555-81d1-31f1ca9b77b8, idx=0]: Workflow failed due to message status. Status:FAILED Message:Retry limit reached with 2 nodes still failing introspection\n', u'introspected_nodes': {u'cf5b42d4-a744-4067-bb16-a98d9dc19598': {u'uuid': u'cf5b42d4-a744-4067-bb16-a98d9dc19598', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/cf5b42d4-a744-4067-bb16-a98d9dc19598', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2019-03-04T05:13:14'}, u'808c7aa0-e5fc-46a9-a737-fc95c82ac038': {u'uuid': u'808c7aa0-e5fc-46a9-a737-fc95c82ac038', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/808c7aa0-e5fc-46a9-a737-fc95c82ac038', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2019-03-04T05:13:15'}}, u'message': u'Retry limit reached with 2 nodes still failing introspection', u'introspection_attempt': 3}
2019-03-04 05:54:03 | Exception introspecting nodes: {u'status': u'FAILED', u'node_uuids': [u'cf5b42d4-a744-4067-bb16-a98d9dc19598', u'808c7aa0-e5fc-46a9-a737-fc95c82ac038'], u'failed_introspection': [u'cf5b42d4-a744-4067-bb16-a98d9dc19598', u'808c7aa0-e5fc-46a9-a737-fc95c82ac038'], u'result': u'Failure caused by error in tasks: send_message\n\n send_message [task_ex_id=bb577555-a8df-4f6d-9bb0-17f11143a8b2] -> Workflow failed due to message status. Status:FAILED Message:Retry limit reached with 2 nodes still failing introspection\n [wf_ex_id=11cf74fe-b8c3-4555-81d1-31f1ca9b77b8, idx=0]: Workflow failed due to message status. Status:FAILED Message:Retry limit reached with 2 nodes still failing introspection\n', u'introspected_nodes': {u'cf5b42d4-a744-4067-bb16-a98d9dc19598': {u'uuid': u'cf5b42d4-a744-4067-bb16-a98d9dc19598', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/cf5b42d4-a744-4067-bb16-a98d9dc19598', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2019-03-04T05:13:14'}, u'808c7aa0-e5fc-46a9-a737-fc95c82ac038': {u'uuid': u'808c7aa0-e5fc-46a9-a737-fc95c82ac038', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/808c7aa0-e5fc-46a9-a737-fc95c82ac038', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2019-03-04T05:13:15'}}, u'message': u'Retry limit reached with 2 nodes still failing introspection', u'introspection_attempt': 3}
2019-03-04 05:54:03 | Waiting for introspection to finish...
2019-03-04 05:54:03 | {u'status': u'FAILED', u'node_uuids': [u'cf5b42d4-a744-4067-bb16-a98d9dc19598', u'808c7aa0-e5fc-46a9-a737-fc95c82ac038'], u'failed_introspection': [u'cf5b42d4-a744-4067-bb16-a98d9dc19598', u'808c7aa0-e5fc-46a9-a737-fc95c82ac038'], u'result': u'Failure caused by error in tasks: send_message\n\n send_message [task_ex_id=bb577555-a8df-4f6d-9bb0-17f11143a8b2] -> Workflow failed due to message status. Status:FAILED Message:Retry limit reached with 2 nodes still failing introspection\n [wf_ex_id=11cf74fe-b8c3-4555-81d1-31f1ca9b77b8, idx=0]: Workflow failed due to message status. Status:FAILED Message:Retry limit reached with 2 nodes still failing introspection\n', u'introspected_nodes': {u'cf5b42d4-a744-4067-bb16-a98d9dc19598': {u'uuid': u'cf5b42d4-a744-4067-bb16-a98d9dc19598', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/cf5b42d4-a744-4067-bb16-a98d9dc19598', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2019-03-04T05:13:14'}, u'808c7aa0-e5fc-46a9-a737-fc95c82ac038': {u'uuid': u'808c7aa0-e5fc-46a9-a737-fc95c82ac038', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/808c7aa0-e5fc-46a9-a737-fc95c82ac038', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2019-03-04T05:13:15'}}, u'message': u'Retry limit reached with 2 nodes still failing introspection', u'introspection_attempt': 3}

But digging deeper into the logs at

https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-master-current-tripleo-delorean-minimal-474/undercloud/var/log/containers/ironic-inspector/ironic-inspector.log.gz

it shows the same errors as on rocky:

2019-03-04 04:34:43.041 8 ERROR ironic_inspector.utils [req-2942d3a2-03a6-45de-845b-b05f711d65bc - - - - -] [node: MAC 00:46:bd:9c:88:37] Ramdisk reported error: The following errors were encountered:
* failed to run hardware-detect utility: Unexpected error while running command.
Command: hardware-detect
Exit code: 1
Stdout: u''
Stderr: u'/bin/sh: cli64: command not found\nInfo: detect_areca: No controller found\nCannot find megacli on the system\nFailed at getting disk information at /sys/block/vda/device/device/model: [Errno 2] No such file or directory: \'/sys/block/vda/device/model\'\nFailed at getting disk information at /sys/block/vda/device/device/rev: [Errno 2] No such file or directory: \'/sys/block/vda/device/rev\'\nFailed at getting disk information at /sys/block/vda/device/scsi_disk/virtio1/cache_type: [Errno 2] No such file or directory: \'/sys/block/vda/device/scsi_disk/virtio1/cache_type\'\nread_smart: Reading S.M.A.R.T information on /dev/vda\nread_smart: Reading S.M.A.R.T information on /dev/vda with -d ata\nread_smart: no device /dev/vda\nCannot get device pause settings: Operation not supported\nconnect: Connection refused\nFailed to connect to lldpad - clif_open: Invalid argument\nTraceback (most recent call last):\n File "/usr/bin/hardware-detect", line 10, in <module>\n sys.exit(main())\n File "/usr/lib/python2.7/site-packages/hardware/detect.py", line 996, in main\n _main(options)\n File "/usr/lib/python2.7/site-packages/hardware/detect.py", line 923, in _main\n if not detect_system(hrdw):\n File "/usr/lib/python2.7/site-packages/hardware/detect.py", line 666, in detect_system\n get_cpus(hw_lst)\n File "/usr/lib/python2.7/site-packages/hardware/detect.py", line 740, in get_cpus\n processor), item.lower(), lscpu[item]))\nKeyError: \'L3 cache\'\n'
2019-03-04 04:34:43.042 8 ERROR ironic_inspector.process [req-2942d3a2-03a6-45de-845b-b05f711d65bc - - - - -] [node: MAC 00:46:bd:9c:88:37] Hook ramdisk_error failed, delaying error report until node look up: Ramdisk reported error: The following errors were encountered:
* failed to run hardware-detect utility: Unexpected error while running command.
Command: hardware-detect
Exit code: 1
Stdout: u''

summary: Phase 1 tripleo-quickstart-promote-rocky-rdo_trunk-minimal job fail on
- master and rocky during introspection due to missing megacli command
+ master and rocky. Introspection failed to run hardware-detect utility
Derek Higgins (derekh) wrote:

Looking at the errors in the inspector logs, the two that I think caused inspection to fail look like they may be linked to two patches that merged recently into the hardware package:

1. KeyError: 'L3 cache'
   This could be a red herring, as it looks like this patch should be fixing it: https://github.com/redhat-cip/hardware/pull/79

2. Cannot find megacli on the system
   This may be linked to a patch that merged to fix how megacli commands are found:
   https://github.com/redhat-cip/hardware/commit/f9f94bec9d24d074e13f8ee0dd35eb5834f31a32

I can't confirm either of these for sure until I get onto a running ramdisk; I'm going to try to get onto one this evening.
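For reference, the traceback in the logs ends in an unguarded dict lookup on parsed lscpu output. The sketch below is illustrative only (the helper name, parsing, and sample output are not from the hardware package): it reproduces the failure mode on a guest whose lscpu output has no "L3 cache" field, and shows the `.get()`-style guard that avoids it.

```python
def parse_lscpu_text(text):
    """Parse `lscpu`-style 'Field: value' output into a dict."""
    info = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            info[key.strip()] = value.strip()
    return info

# Sample resembling a libvirt guest that exposes no L3 cache:
sample = "Architecture: x86_64\nCPU(s): 4\nL1d cache: 32K\nL2 cache: 4096K"
lscpu = parse_lscpu_text(sample)

# Direct indexing crashes on such a guest, which matches the traceback:
#   lscpu["L3 cache"]   -> KeyError: 'L3 cache'
# A guarded lookup tolerates the missing field instead:
l3 = lscpu.get("L3 cache")  # None when the guest exposes no L3 cache
```

This is consistent with the theory above that the KeyError only shows up on VMs: bare-metal hosts normally report an L3 cache line, libvirt guests often do not.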

wes hayutin (weshayutin) wrote:

waiting on another tripleo promotion to see a fix...

One thing CI could do is test the virthost for libvirt capabilities and, if they don't meet requirements, bail out with a user-friendly error.
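A pre-flight check like that could be sketched as follows. The function name, return shape, and wiring are hypothetical (not part of any existing CI role); `virt-host-validate` is the validator that ships with libvirt, and a zero exit status means all its checks passed.

```python
import shutil
import subprocess

def virthost_meets_requirements(tool="virt-host-validate"):
    """Hypothetical pre-flight check that the virthost looks KVM-capable.

    Returns (ok, message) so the caller can print a friendly error and
    abort before the undercloud is ever provisioned.
    """
    path = shutil.which(tool)
    if path is None:
        return False, "%s not installed" % tool
    result = subprocess.run([path, "qemu"], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)
    if result.returncode != 0:
        # The validator's own output names the failing capability.
        return False, result.stdout
    return True, "ok"
```

Failing fast here would turn a late, opaque introspection timeout into an immediate, readable message about the virthost itself.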

wes hayutin (weshayutin) on 2019-03-13
Changed in tripleo:
status: Triaged → In Progress
Changed in tripleo:
milestone: stein-3 → stein-rc1
wes hayutin (weshayutin) wrote:
Changed in tripleo:
status: In Progress → Fix Released