CI: Exception registering nodes: Node 59ecc491-41b0-4950-8d00-ddce532ba0f4 is locked by host localhost.localdomain, please retry after the current operation is completed

Bug #1612622 reported by James Slagle on 2016-08-12
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Critical
Lucas Alvares Gomes

Bug Description

Failed job:

http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-ha/681b543/

It's a periodic job, but could be affecting current-tripleo as well.

2310 2016-08-12 07:19:35.754507 | #################
2311 2016-08-12 07:19:35.754632 | tripleo.sh -- Register nodes
2312 2016-08-12 07:19:35.754681 | #################
2313 2016-08-12 07:19:35.754736 | You must source a stackrc file for the Undercloud.
2314 2016-08-12 07:19:35.754787 | Attempting to source /home/jenkins/stackrc
2315 2016-08-12 07:19:36.008671 | Done
2316 2016-08-12 07:20:16.886448 | Exception registering nodes: Node 59ecc491-41b0-4950-8d00-ddce532ba0f4 is locked by host localhost.localdomain, please retry after the current operation is completed.

534 2016-08-12 07:19:48.682 25289 DEBUG wsme.api [req-5e5cf62d-e19e-4b8b-bdc6-00c894fc919d admin admin - - -] Client-side error: Node 59ecc491-41b0-4950-8d00-ddce532ba0f4 is locked by host localhost.localdomain, please retry a fter the current operation is completed.
535 Traceback (most recent call last):
536
537 File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 199, in inner
538 return func(*args, **kwargs)
539
540 File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 1021, in do_provisioning_action
541 % action) as task:
542
543 File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 160, in acquire
544 driver_name=driver_name, purpose=purpose)
545
546 File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 233, in __init__
547 self.release_resources()
548
549 File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
550 self.force_reraise()
551
552 File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
553 six.reraise(self.type_, self.value, self.tb)
554
555 File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 211, in __init__
556 self._lock()
557
558 File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 254, in _lock
559 reserve_node()
560
561 File "/usr/lib/python2.7/site-packages/retrying.py", line 68, in wrapped_f
562 return Retrying(*dargs, **dkw).call(f, *args, **kw)
563
564 File "/usr/lib/python2.7/site-packages/retrying.py", line 229, in call
565 raise attempt.get()
566
567 File "/usr/lib/python2.7/site-packages/retrying.py", line 261, in get
568 six.reraise(self.value[0], self.value[1], self.value[2])
569
570 File "/usr/lib/python2.7/site-packages/retrying.py", line 217, in call
571 attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
572
573 File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 247, in reserve_node
574 self.node_id)
575
576 File "/usr/lib/python2.7/site-packages/ironic/objects/node.py", line 270, in reserve
577 db_node = cls.dbapi.reserve_node(tag, node_id)
578
579 File "/usr/lib/python2.7/site-packages/ironic/db/sqlalchemy/api.py", line 269, in reserve_node
580 host=node['reservation'])
581
582 NodeLocked: Node 59ecc491-41b0-4950-8d00-ddce532ba0f4 is locked by host localhost.localdomain, please retry after the current operation is completed.
583 format_exception /usr/lib/python2.7/site-packages/wsme/api.py:221

Changed in tripleo:
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → James Slagle (james-slagle)
milestone: none → newton-3
James Slagle (james-slagle) wrote :

not seeing this issue any longer

Changed in tripleo:
status: In Progress → Incomplete

In last periodic job:
Exception registering nodes: Node e164afd1-753b-4ff9-a36e-3ce7557f8a14 is locked by host localhost.localdomain, please retry after the current operation is completed.

http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-nonha/1cd409d/console.html#_2016-08-24_07_30_30_682256

Changed in tripleo:
status: Incomplete → Triaged

I see there really long tasks, usually related to power status and power on/off set, they take from 25 to 36 seconds:

2016-08-24 06:04:33.419 29914 DEBUG ironic.conductor.task_manager [req-00061a52-3003-467c-8841-d6f85945dd18 - - - - -] Successfully released exclusive lock for changing node power state on node 7e7e6bae-605e-4466-8e34-21f3ba120c2b (lock was held 28.70 sec) release_resources /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:331

2016-08-24 10:14:58.671 28362 DEBUG ironic.conductor.task_manager [req-95d5865b-8dc9-4680-8363-6c15b43e79a8 - - - - -] Successfully released exclusive lock for provision action manage on node 153f5340-6c69-4c89-84f2-c86186dc0cbf (lock was held 36.23 sec) release_resources /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:331

But seems like we have a timeout about 20-25 seconds:
from ironic-api.log:

2016-08-23 23:44:08.016 29520 DEBUG wsme.api [req-34c76cc4-ceb0-4d42-842d-06f6a1ffbe27 admin admin - - -] Client-side error: Node 0ed88b39-4eb5-438f-aa27-80bdb47e0eba is locked by host localhost.localdomain, please retry after the current operation is completed.
Traceback (most recent call last):
.......
2016-08-23 23:44:24.779 29520 INFO eventlet.wsgi.server [req-4e4788ce-33f5-41ce-a780-9c220d345a8a admin admin - - -] 192.0.2.1 "PUT /v1/nodes/0ed88b39-4eb5-438f-aa27-80bdb47e0eba/states/provision HTTP/1.1" status: 409 len: 556 time: 2.1693511
2016-08-23 23:44:28.975 29520 INFO eventlet.wsgi.server [req-d00a9cdb-620b-4470-a1ca-64dd68ba4b3a admin admin - - -] 192.0.2.1 "PUT /v1/nodes/0ed88b39-4eb5-438f-aa27-80bdb47e0eba/states/provision HTTP/1.1" status: 409 len: 556 time: 2.1934800

So it's a race, I would go for increasing timeouts (where?) to 40-60 seconds at least. Operations themselves succeeds, it's always a timeout issue.

James Slagle (james-slagle) wrote :
Changed in tripleo:
assignee: James Slagle (james-slagle) → nobody

I suggest to increase post_deploy_get_power_state_retries to 20 in ironic.conf:

# Number of times to retry getting power state to check if
# bare metal node has been powered off after a soft power off.
# (integer value)
#post_deploy_get_power_state_retries = 6

# Amount of time (in seconds) to wait between polling power
# state after trigger soft poweroff. (integer value)
#post_deploy_get_power_state_retry_interval = 5

Changed in tripleo:
assignee: nobody → Sagi (Sergey) Shnaidman (sshnaidm)
Steven Hardy (shardy) on 2016-08-27
Changed in tripleo:
milestone: newton-3 → newton-rc1
Changed in tripleo:
status: Triaged → In Progress

There was a patch in impi library (pyghmi): https://review.openstack.org/#/c/362944/
It should have solved timeouts problem in ipmitool. Let's watch it for a while

The pyghmi patch didn't help a much. Returning to https://review.openstack.org/#/c/361915/

Changed in tripleo:
assignee: Sagi (Sergey) Shnaidman (sshnaidm) → nobody

I've proposed a fix for this here: https://review.openstack.org/#/c/365694/

Changed in tripleo:
assignee: nobody → Lucas Alvares Gomes (lucasagomes)
Changed in tripleo:
status: In Progress → Fix Committed
Changed in tripleo:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers