exception introspecting nodes

Bug #1784847 reported by Rafael Folco
This bug affects 1 person
Affects: tripleo
Status: Invalid
Importance: Critical
Assigned to: Gabriele Cerami
Milestone: (none)

Bug Description

http://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset021-master/e7ac777/logs/undercloud/home/zuul/overcloud_prep_images.log.txt.gz

2018-08-01 09:03:29 | Exception introspecting nodes: {u'status': u'FAILED', u'node_uuids': [u'a41d3f42-416d-4d90-8b24-42d5f83da664', u'9c8e0bb4-b02b-4bf2-be0e-8e077b35a1d3'], u'failed_introspection': [u'a41d3f42-416d-4d90-8b24-42d5f83da664', u'9c8e0bb4-b02b-4bf2-be0e-8e077b35a1d3'], u'result': u'Failure caused by error in tasks: send_message\n\n send_message [task_ex_id=ea6086d5-9177-4e0e-b713-96c446f74eb7] -> Workflow failed due to message status\n [wf_ex_id=551b2737-0137-4f00-96db-9bb8e2cc156d, idx=0]: Workflow failed due to message status\n', u'introspected_nodes': {u'9c8e0bb4-b02b-4bf2-be0e-8e077b35a1d3': {u'uuid': u'9c8e0bb4-b02b-4bf2-be0e-8e077b35a1d3', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/9c8e0bb4-b02b-4bf2-be0e-8e077b35a1d3', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2018-08-01T08:23:04'}, u'a41d3f42-416d-4d90-8b24-42d5f83da664': {u'uuid': u'a41d3f42-416d-4d90-8b24-42d5f83da664', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/a41d3f42-416d-4d90-8b24-42d5f83da664', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2018-08-01T08:23:05'}}, u'message': u'Retry limit reached with 2 nodes still failing introspection', u'introspection_attempt': 3}
2018-08-01 09:03:29 | Waiting for introspection to finish...
2018-08-01 09:03:29 | Introspection of node 9c8e0bb4-b02b-4bf2-be0e-8e077b35a1d3 timed out.
2018-08-01 09:03:29 | Introspection of node a41d3f42-416d-4d90-8b24-42d5f83da664 timed out.
2018-08-01 09:03:29 | Retrying 2 nodes that failed introspection. Attempt 1 of 3
2018-08-01 09:03:29 | Introspection of node a41d3f42-416d-4d90-8b24-42d5f83da664 timed out.
2018-08-01 09:03:29 | Introspection of node 9c8e0bb4-b02b-4bf2-be0e-8e077b35a1d3 timed out.
2018-08-01 09:03:29 | Retrying 2 nodes that failed introspection. Attempt 2 of 3
2018-08-01 09:03:29 | Introspection of node 9c8e0bb4-b02b-4bf2-be0e-8e077b35a1d3 timed out.
2018-08-01 09:03:29 | Introspection of node a41d3f42-416d-4d90-8b24-42d5f83da664 timed out.
2018-08-01 09:03:29 | Retrying 2 nodes that failed introspection. Attempt 3 of 3
2018-08-01 09:03:29 | Introspection of node 9c8e0bb4-b02b-4bf2-be0e-8e077b35a1d3 timed out.
2018-08-01 09:03:29 | Introspection of node a41d3f42-416d-4d90-8b24-42d5f83da664 timed out.
2018-08-01 09:03:29 | Retry limit reached with 2 nodes still failing introspection
2018-08-01 09:03:29 | {u'status': u'FAILED', u'node_uuids': [u'a41d3f42-416d-4d90-8b24-42d5f83da664', u'9c8e0bb4-b02b-4bf2-be0e-8e077b35a1d3'], u'failed_introspection': [u'a41d3f42-416d-4d90-8b24-42d5f83da664', u'9c8e0bb4-b02b-4bf2-be0e-8e077b35a1d3'], u'result': u'Failure caused by error in tasks: send_message\n\n send_message [task_ex_id=ea6086d5-9177-4e0e-b713-96c446f74eb7] -> Workflow failed due to message status\n [wf_ex_id=551b2737-0137-4f00-96db-9bb8e2cc156d, idx=0]: Workflow failed due to message status\n', u'introspected_nodes': {u'9c8e0bb4-b02b-4bf2-be0e-8e077b35a1d3': {u'uuid': u'9c8e0bb4-b02b-4bf2-be0e-8e077b35a1d3', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/9c8e0bb4-b02b-4bf2-be0e-8e077b35a1d3', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2018-08-01T08:23:04'}, u'a41d3f42-416d-4d90-8b24-42d5f83da664': {u'uuid': u'a41d3f42-416d-4d90-8b24-42d5f83da664', u'links': [{u'href': u'http://192.168.24.2:13050/v1/introspection/a41d3f42-416d-4d90-8b24-42d5f83da664', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2018-08-01T08:23:05'}}, u'message': u'Retry limit reached with 2 nodes still failing introspection', u'introspection_attempt': 3}

http://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset021-master/e7ac777/logs/undercloud/var/log/extra/errors.txt.gz
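
For a failure like this (both nodes stuck in the introspection "waiting" state until the retry limit is hit), a first manual check on the undercloud would be something along these lines; the UUIDs are the ones from the log above and the exact output will vary per run:

$ source ~/stackrc
# introspection status as seen by ironic-inspector
$ openstack baremetal introspection status a41d3f42-416d-4d90-8b24-42d5f83da664
$ openstack baremetal introspection status 9c8e0bb4-b02b-4bf2-be0e-8e077b35a1d3
# power and provision state as seen by ironic
$ openstack baremetal node show a41d3f42-416d-4d90-8b24-42d5f83da664 -f value -c power_state -c provision_state -c last_error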

summary: - exception instrospecting nodes on fs021
+ exception instrospecting nodes
summary: - exception instrospecting nodes
+ exception introspecting nodes
Revision history for this message
Bob Fournier (bfournie) wrote :

Can you provide access to the ironic and ironic-inspector logs when this problem occurs?

Revision history for this message
wes hayutin (weshayutin) wrote :
Revision history for this message
wes hayutin (weshayutin) wrote :
Revision history for this message
Ronelle Landy (rlandy) wrote :

The error from the logs is:

2018-08-03 01:45:06.263 8 ERROR ironic.drivers.modules.ipmitool [req-6b5988b1-0c98-4421-8aa8-bea42f7317f2 7671b7d0804e49efb877555333eeab30 c1caa2f55f904c6089920e56f170c925 - default default] IPMI Error while attempting "ipmitool -I lanplus -H 192.168.102.145 -L ADMINISTRATOR -U admin -R 12 -N 5 -f /tmp/tmpGJsFde power status" for node 66649864-0322-4039-8fd0-3feaa5eaddef. Error: Unexpected error while running command.
Command: ipmitool -I lanplus -H 192.168.102.145 -L ADMINISTRATOR -U admin -R 12 -N 5 -f /tmp/tmpGJsFde power status
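
To rule out basic BMC reachability, the same power query can be retried by hand from the undercloud. This is only a sketch: the address is the one from the log above, and the password file path is a placeholder (the CI run writes a temporary one):

$ ipmitool -I lanplus -H 192.168.102.145 -L ADMINISTRATOR -U admin -R 3 -N 5 -f /path/to/bmc_password power status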

*********************

The error from /overcloud_prep_images.log shows the failure when registering nodes:
2018-08-03 01:40:22 | + openstack overcloud node import instackenv.json
2018-08-03 01:40:30 | Waiting for messages on queue 'tripleo' with no timeout.
2018-08-03 01:45:16 | Exception registering nodes: {u'status': u'FAILED', u'message': [{u'result': u'Node fef4ef85-c7fe-40b7-aa84-5cfe8f052a60 did not reach state "manageable", the state is "enroll", error: Failed to get power state for node fef4ef85-c7fe-40b7-aa84-5cfe8f052a60. Error: IPMI call failed: power status.'}, {u'result': u'Node 66649864-0322-4039-8fd0-3feaa5eaddef did not reach state "manageable", the state is "enroll",

************

To compare, a healthy output looks like:

2018-08-02 17:29:49 | + openstack overcloud node import instackenv.json
2018-08-02 17:29:57 | Waiting for messages on queue 'tripleo' with no timeout.
2018-08-02 17:30:32 |
2018-08-02 17:30:32 |
2018-08-02 17:30:32 | 4 node(s) successfully moved to the "manageable" state.
2018-08-02 17:30:32 | Successfully registered node UUID 7b7d8f17-eb32-4dbf-9167-fc90eff39764
2018-08-02 17:30:32 | Successfully registered node UUID a0d9bec4-2e05-4af7-abbf-b69d80e2df72
2018-08-02 17:30:32 | Successfully registered node UUID 54b7d248-dd28-4d0b-9245-c37ba43f690b
2018-08-02 17:30:32 | Successfully registered node UUID b67fd3c3-f6bc-42bc-babc-03e992d7aff2
2018-08-02 17:30:32 | + openstack overcloud node introspect --all-manageable

**************

So here we have an error getting power status before introspection happens.

May be unrelated, but the errors look like what was reported in https://bugs.launchpad.net/nova/+bug/1291991 (a long time ago).
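
For completeness, once the nodes are enrolled the power path can also be exercised per node from the undercloud. A rough sketch, using one of the node UUIDs from the IPMI error above:

$ source ~/stackrc
# checks the power (and other) driver interfaces for the node
$ openstack baremetal node validate 66649864-0322-4039-8fd0-3feaa5eaddef
# current power state and the last error ironic recorded for it
$ openstack baremetal node show 66649864-0322-4039-8fd0-3feaa5eaddef -f value -c power_state -c last_error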

Revision history for this message
Ronelle Landy (rlandy) wrote :

https://github.com/openstack/tripleo-common/blob/master/workbooks/baremetal.yaml#L53 is where the error output comes from. The error is sporadic, hitting roughly 60-70% of jobs.

Installed on the undercloud:
python2-ironicclient-2.4.0-0.20180626143850.a49938d.el7.noarch
python2-ironic-inspector-client-3.2.0-0.20180529111245.6048afc.el7.noarch
puppet-ironic-13.2.0-0.20180730212733.387fffa.el7.noarch

Revision history for this message
Bob Fournier (bfournie) wrote :

This error:

2018-08-03 01:45:16 | Exception registering nodes: {u'status': u'FAILED', u'message': [{u'result': u'Node fef4ef85-c7fe-40b7-aa84-5cfe8f052a60 did not reach state "manageable", the state is "enroll", error: Failed to get power state for node fef4ef85-c7fe-40b7-aa84-5cfe8f052a60. Error: IPMI call failed: power status.'}, {u'result': u'Node 66649864-0322-4039-8fd0-3feaa5eaddef did not reach state "manageable", the state is "enroll",

is most likely the libvirt issue that has been resolved in RHEL 7.5 and 7.6 - https://bugzilla.redhat.com/show_bug.cgi?id=1576464.

See https://bugs.launchpad.net/tripleo/+bug/1782267 and specifically https://bugs.launchpad.net/tripleo/+bug/1782267/comments/7 - the problem was resolved with a later build.

Revision history for this message
Bob Fournier (bfournie) wrote :

Please pick up the latest libvirt and retest. Moving to a later libvirt build got past the same IPMI power issue in https://bugs.launchpad.net/tripleo/+bug/1782267.

Revision history for this message
Ronelle Landy (rlandy) wrote :

Bob, this error shows up in OVB deployments (not libvirt) running on CentOS. Any further thoughts?

Revision history for this message
Derek Higgins (derekh) wrote :

Stderr: u'Error: Unable to establish IPMI v2 / RMCP+ session\n': ProcessExecutionError: Unexpected error while running command.

Have you got access to the BMC logs? They may shed some light on what's going on: either virtualbmc isn't starting up, it's not getting the IPMI packets, or there is an error when it tries to talk to nova. Either way, taking a look at the BMC should show the problem.
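
Roughly what could be checked on the BMC instance itself; the process name and interface below are assumptions, adjust to whatever the OVB BMC image actually runs:

# is the openstackbmc process up, and what is it logging?
$ sudo journalctl --no-pager | grep -i openstackbmc | tail -50
# are IPMI packets from the undercloud arriving at all?
$ sudo tcpdump -c 20 -ni eth0 udp port 623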

Revision history for this message
Derek Higgins (derekh) wrote :

Just to elaborate a little further on the above comment: if the cloud (rdo-cloud in this case) is having problems with API calls or network access, then either of these can translate to IPMI errors in the CI job.
The reality may be that the cloud isn't stable at the moment and we need to wait until it is. Checking the BMC will confirm whether that API is returning errors or not. I'm not sure how we could confirm network problems (maybe setting up a ping between the undercloud and the BMC during the job).
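
Something like the following, started in the background on the undercloud at the beginning of the job, would at least tell us whether basic connectivity to the BMC drops during the run. The BMC address and log path are placeholders:

$ nohup ping -D -i 5 192.168.102.145 > /var/log/extra/bmc_ping.log 2>&1 &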

Revision history for this message
Bob Fournier (bfournie) wrote :

Thanks Derek. Agree that this looks like an issue with rdo-cloud stability, not a problem with ironic introspection. BTW, here is the BMC log that Ronelle gathered:

[84315.242975] openstackbmc[2540]: powerstate = self.get_power_state()
[84315.245362] openstackbmc[2540]: File "/usr/local/bin/openstackbmc", line 154, in get_power_state
[84315.247550] openstackbmc[2540]: state = self._instance_active()
[84315.249141] openstackbmc[2540]: File "/usr/local/bin/openstackbmc", line 149, in _instance_active
[84315.251358] openstackbmc[2540]: self.cached_status = self.novaclient.servers.get(self.instance).status
[84315.253656] openstackbmc[2540]: File "/usr/lib/python2.7/site-packages/novaclient/v2/servers.py", line 788, in get
[84315.256357] openstackbmc[2540]: return self._get("/servers/%s" % base.getid(server), "server")
[84315.258512] openstackbmc[2540]: File "/usr/lib/python2.7/site-packages/novaclient/base.py", line 351, in _get
[84315.260921] openstackbmc[2540]: resp, body = self.api.client.get(url)
[84315.262665] openstackbmc[2540]: File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 217, in get
[84315.265223] openstackbmc[2540]: return self.request(url, 'GET', **kwargs)
[84315.267283] openstackbmc[2540]: File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 80, in request
[84315.269986] openstackbmc[2540]: raise exceptions.from_response(resp, body, url, method)
[84315.272153] openstackbmc[2540]: ClientException: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
[84315.275718] openstackbmc[2540]: <class 'oslo_db.exception.DBConnectionError'> (HTTP 500) (Request-ID: req-8399f82c-01e6-4cb1-9839-1f41ddd934a6)
[84315.685674] openstackbmc[2567]: Traceback (most recent call last):
[84315.687601] openstackbmc[2567]: File "/usr/lib/python2.7/site-packages/pyghmi/ipmi/bmc.py", line 136, in handle_raw_request
[84315.690391] openstackbmc[2567]: return self.get_chassis_status(session)
[84315.692156] openstackbmc[2567]: File "/usr/lib/python2.7/site-packages/pyghmi/ipmi/bmc.py", line 56, in get_chassis_status
[84315.694867] openstackbmc[2567]: powerstate = self.get_power_state()
[84315.697021] openstackbmc[2567]: File "/usr/local/bin/openstackbmc", line 154, in get_power_state
[84315.699231] openstackbmc[2567]: state = self._instance_active()
[84315.700840] openstackbmc[2567]: File "/usr/local/bin/openstackbmc", line 149, in _instance_active
[84315.703070] openstackbmc[2567]: self.cached_status = self.novaclient.servers.get(self.instance).status
[84315.705415] openstackbmc[2567]: File "/usr/lib/python2.7/site-packages/novaclient/v2/servers.py", line 788, in get
[84315.708411] openstackbmc[2567]: return self._get("/servers/%s" % base.getid(server), "server")
[84315.710617] openstackbmc[2567]: File "/usr/lib/python2.7/site-packages/novaclient/base.py", line 351, in _get
[84315.713099] openstackbmc[2567]: resp, body = self.api.client.get(url)
[84315.714835] openstackbmc[2567]: File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 217, in get
[84315.717414] openstackbmc[2567]: return self.request(url, 'GET', **kwargs)
[84315.719402] openstackbm...

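The traceback above shows the BMC's power-status path failing on a nova API call to the host cloud (HTTP 500 / DBConnectionError), so one hedged way to confirm the host cloud is the culprit is to query that API directly from the BMC instance with whatever credentials openstackbmc is configured with. The credential file path and server UUID here are placeholders:

$ source /path/to/host-cloud-openrc
$ openstack server show <overcloud-node-instance-uuid> -f value -c status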

Revision history for this message
Derek Higgins (derekh) wrote :

Also, going back to the original bug report, the symptoms were the same, but looking into the ironic logs there are no ipmitool errors, so we may have two separate issues here. The ironic log in this case suggests that the node powered on (no IPMI errors), but the dnsmasq logs[1] for inspector don't show any DHCP requests (a quick way to double-check the DHCP side is sketched below). Three possibilities spring to mind here:
1. the nodes didn't power on (there may be clues on the BMC)
2. the nodes powered on but aren't network booting
   - problem with the ipxe-boot image used in OVB
   - or, if not using the ipxe-boot image, nova compute isn't patched with the OVB patch
3. the undercloud is blocking the DHCP requests

[1] - https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset021-master/e7ac777/logs/undercloud/var/log/containers/ironic-inspector/dnsmasq.log.txt.gz
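
A quick way to narrow down possibilities 1-3 from the undercloud side; br-ctlplane is assumed to be the provisioning bridge, and the dnsmasq log path is the one from the linked logs:

# is any DHCP traffic reaching the undercloud at all?
$ sudo tcpdump -c 20 -ni br-ctlplane udp port 67 or udp port 68
# did inspector's dnsmasq ever see the nodes?
$ sudo grep -i dhcp /var/log/containers/ironic-inspector/dnsmasq.log | tail -20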

Revision history for this message
Ronelle Landy (rlandy) wrote :

RDO Cloud has had spotty service today. The BMC log in #11 probably hit the cloud reboot. So, while we were able to see the nodes stuck in enroll, it's probably not the situation hurting us in CI.

I have been rerunning registration and introspection without any consistent results, although I do see some attempts time out:
Introspection of node 585a992d-11e6-4168-b0a1-2039ee394bfd timed out

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note: for the linked job (https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset021-master/e7ac777/logs/undercloud/home/zuul/undercloud-parameter-defaults.yaml.txt.gz) we can see the wrong net config for the undercloud. It is expected to be net-config-undercloud.yaml and not ci/common/net-config-simple-bridge.yaml. I believe that had been fixed by https://review.openstack.org/#/c/585741/ (https://bugs.launchpad.net/tripleo/+bug/1784847), but it seems that http://git.openstack.org/cgit/openstack/tripleo-quickstart/tree/config/general_config/minimal.yml#n35 is not used to configure the network for the openstack-periodic job... So the proper fix is to make sure it is.
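
A simple sanity check against the rendered file from the linked job logs (run as the zuul user on the undercloud; the grep pattern is only a guess at how the template path shows up in the file):

$ grep -n 'net-config' ~/undercloud-parameter-defaults.yaml
# expected to reference net-config-undercloud.yaml, not ci/common/net-config-simple-bridge.yaml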

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
wes hayutin (weshayutin) wrote :

This is caused either by duplicate IP addresses (the BMC interface fails to get an IP), or....

keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://phx2.cloud.rdoproject.org:13000/v2.0/tokens: HTTPSConnectionPool(host='phx2.cloud.rdoproject.org', port=13000): Max retries exceeded with url: /v2.0/tokens (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x39b2a50>: Failed to establish a new connection: [Errno -2] Name or service not known',))
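
Both theories can be checked quickly from the affected host; the interface and BMC address below are placeholders:

# duplicate address detection for the BMC IP (a reply from another MAC means a conflict)
$ sudo arping -D -c 2 -I eth0 192.168.102.145
# can the RDO Cloud keystone endpoint be resolved and reached at all?
$ getent hosts phx2.cloud.rdoproject.org
$ curl -sk https://phx2.cloud.rdoproject.org:13000/ | head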

Changed in tripleo:
status: Triaged → Invalid