CI: Exception registering nodes: Node 59ecc491-41b0-4950-8d00-ddce532ba0f4 is locked by host localhost.localdomain, please retry after the current operation is completed

Bug #1612622 reported by James Slagle
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
Critical
Lucas Alvares Gomes

Bug Description

Failed job:

http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-ha/681b543/

It's a periodic job, but could be affecting current-tripleo as well.

2310 2016-08-12 07:19:35.754507 | #################
2311 2016-08-12 07:19:35.754632 | tripleo.sh -- Register nodes
2312 2016-08-12 07:19:35.754681 | #################
2313 2016-08-12 07:19:35.754736 | You must source a stackrc file for the Undercloud.
2314 2016-08-12 07:19:35.754787 | Attempting to source /home/jenkins/stackrc
2315 2016-08-12 07:19:36.008671 | Done
2316 2016-08-12 07:20:16.886448 | Exception registering nodes: Node 59ecc491-41b0-4950-8d00-ddce532ba0f4 is locked by host localhost.localdomain, please retry after the current operation is completed.

534 2016-08-12 07:19:48.682 25289 DEBUG wsme.api [req-5e5cf62d-e19e-4b8b-bdc6-00c894fc919d admin admin - - -] Client-side error: Node 59ecc491-41b0-4950-8d00-ddce532ba0f4 is locked by host localhost.localdomain, please retry a fter the current operation is completed.
535 Traceback (most recent call last):
536
537 File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 199, in inner
538 return func(*args, **kwargs)
539
540 File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 1021, in do_provisioning_action
541 % action) as task:
542
543 File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 160, in acquire
544 driver_name=driver_name, purpose=purpose)
545
546 File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 233, in __init__
547 self.release_resources()
548
549 File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
550 self.force_reraise()
551
552 File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
553 six.reraise(self.type_, self.value, self.tb)
554
555 File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 211, in __init__
556 self._lock()
557
558 File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 254, in _lock
559 reserve_node()
560
561 File "/usr/lib/python2.7/site-packages/retrying.py", line 68, in wrapped_f
562 return Retrying(*dargs, **dkw).call(f, *args, **kw)
563
564 File "/usr/lib/python2.7/site-packages/retrying.py", line 229, in call
565 raise attempt.get()
566
567 File "/usr/lib/python2.7/site-packages/retrying.py", line 261, in get
568 six.reraise(self.value[0], self.value[1], self.value[2])
569
570 File "/usr/lib/python2.7/site-packages/retrying.py", line 217, in call
571 attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
572
573 File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 247, in reserve_node
574 self.node_id)
575
576 File "/usr/lib/python2.7/site-packages/ironic/objects/node.py", line 270, in reserve
577 db_node = cls.dbapi.reserve_node(tag, node_id)
578
579 File "/usr/lib/python2.7/site-packages/ironic/db/sqlalchemy/api.py", line 269, in reserve_node
580 host=node['reservation'])
581
582 NodeLocked: Node 59ecc491-41b0-4950-8d00-ddce532ba0f4 is locked by host localhost.localdomain, please retry after the current operation is completed.
583 format_exception /usr/lib/python2.7/site-packages/wsme/api.py:221

Changed in tripleo:
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → James Slagle (james-slagle)
milestone: none → newton-3
Revision history for this message
James Slagle (james-slagle) wrote :

not seeing this issue any longer

Changed in tripleo:
status: In Progress → Incomplete
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

In last periodic job:
Exception registering nodes: Node e164afd1-753b-4ff9-a36e-3ce7557f8a14 is locked by host localhost.localdomain, please retry after the current operation is completed.

http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-nonha/1cd409d/console.html#_2016-08-24_07_30_30_682256

Changed in tripleo:
status: Incomplete → Triaged
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

I see there really long tasks, usually related to power status and power on/off set, they take from 25 to 36 seconds:

2016-08-24 06:04:33.419 29914 DEBUG ironic.conductor.task_manager [req-00061a52-3003-467c-8841-d6f85945dd18 - - - - -] Successfully released exclusive lock for changing node power state on node 7e7e6bae-605e-4466-8e34-21f3ba120c2b (lock was held 28.70 sec) release_resources /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:331

2016-08-24 10:14:58.671 28362 DEBUG ironic.conductor.task_manager [req-95d5865b-8dc9-4680-8363-6c15b43e79a8 - - - - -] Successfully released exclusive lock for provision action manage on node 153f5340-6c69-4c89-84f2-c86186dc0cbf (lock was held 36.23 sec) release_resources /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:331

But seems like we have a timeout about 20-25 seconds:
from ironic-api.log:

2016-08-23 23:44:08.016 29520 DEBUG wsme.api [req-34c76cc4-ceb0-4d42-842d-06f6a1ffbe27 admin admin - - -] Client-side error: Node 0ed88b39-4eb5-438f-aa27-80bdb47e0eba is locked by host localhost.localdomain, please retry after the current operation is completed.
Traceback (most recent call last):
.......
2016-08-23 23:44:24.779 29520 INFO eventlet.wsgi.server [req-4e4788ce-33f5-41ce-a780-9c220d345a8a admin admin - - -] 192.0.2.1 "PUT /v1/nodes/0ed88b39-4eb5-438f-aa27-80bdb47e0eba/states/provision HTTP/1.1" status: 409 len: 556 time: 2.1693511
2016-08-23 23:44:28.975 29520 INFO eventlet.wsgi.server [req-d00a9cdb-620b-4470-a1ca-64dd68ba4b3a admin admin - - -] 192.0.2.1 "PUT /v1/nodes/0ed88b39-4eb5-438f-aa27-80bdb47e0eba/states/provision HTTP/1.1" status: 409 len: 556 time: 2.1934800

So it's a race, I would go for increasing timeouts (where?) to 40-60 seconds at least. Operations themselves succeeds, it's always a timeout issue.

Revision history for this message
James Slagle (james-slagle) wrote :
Changed in tripleo:
assignee: James Slagle (james-slagle) → nobody
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

I suggest to increase post_deploy_get_power_state_retries to 20 in ironic.conf:

# Number of times to retry getting power state to check if
# bare metal node has been powered off after a soft power off.
# (integer value)
#post_deploy_get_power_state_retries = 6

# Amount of time (in seconds) to wait between polling power
# state after trigger soft poweroff. (integer value)
#post_deploy_get_power_state_retry_interval = 5

Changed in tripleo:
assignee: nobody → Sagi (Sergey) Shnaidman (sshnaidm)
Steven Hardy (shardy)
Changed in tripleo:
milestone: newton-3 → newton-rc1
Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

There was a patch in impi library (pyghmi): https://review.openstack.org/#/c/362944/
It should have solved timeouts problem in ipmitool. Let's watch it for a while

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

The pyghmi patch didn't help a much. Returning to https://review.openstack.org/#/c/361915/

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :
Changed in tripleo:
assignee: Sagi (Sergey) Shnaidman (sshnaidm) → nobody
Revision history for this message
Lucas Alvares Gomes (lucasagomes) wrote :

I've proposed a fix for this here: https://review.openstack.org/#/c/365694/

Changed in tripleo:
assignee: nobody → Lucas Alvares Gomes (lucasagomes)
Changed in tripleo:
status: In Progress → Fix Committed
Changed in tripleo:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.