ServiceBinaryExists - binary for nova-conductor already exists

Bug #1326901 reported by Corey Bryant on 2014-06-05
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Undecided
Corey Bryant
Icehouse
Undecided
Unassigned
nova (Ubuntu)
High
Corey Bryant
Trusty
High
Unassigned
Utopic
High
Corey Bryant

Bug Description

We're hitting an intermittent issue where ServiceBinaryExists is raised for nova-conductor on deployment.

From nova-conductor's upstart log ( /var/log/upstart/nova-conductor.log ):

2014-05-15 12:02:25.206 34494 INFO nova.openstack.common.periodic_task [-] Skipping periodic task _periodic_update_dns because its interval is negative
2014-05-15 12:02:25.241 34494 INFO nova.openstack.common.service [-] Starting 8 workers
2014-05-15 12:02:25.242 34494 INFO nova.openstack.common.service [-] Started child 34501
2014-05-15 12:02:25.244 34494 INFO nova.openstack.common.service [-] Started child 34502
2014-05-15 12:02:25.246 34494 INFO nova.openstack.common.service [-] Started child 34503
2014-05-15 12:02:25.246 34501 AUDIT nova.service [-] Starting conductor node (version 2014.1)
2014-05-15 12:02:25.247 34502 AUDIT nova.service [-] Starting conductor node (version 2014.1)
2014-05-15 12:02:25.247 34494 INFO nova.openstack.common.service [-] Started child 34504
2014-05-15 12:02:25.249 34503 AUDIT nova.service [-] Starting conductor node (version 2014.1)
2014-05-15 12:02:25.251 34504 AUDIT nova.service [-] Starting conductor node (version 2014.1)
2014-05-15 12:02:25.254 34505 AUDIT nova.service [-] Starting conductor node (version 2014.1)
2014-05-15 12:02:25.250 34494 INFO nova.openstack.common.service [-] Started child 34505
2014-05-15 12:02:25.261 34494 INFO nova.openstack.common.service [-] Started child 34506
2014-05-15 12:02:25.263 34494 INFO nova.openstack.common.service [-] Started child 34507
2014-05-15 12:02:25.266 34494 INFO nova.openstack.common.service [-] Started child 34508
2014-05-15 12:02:25.267 34507 AUDIT nova.service [-] Starting conductor node (version 2014.1)
2014-05-15 12:02:25.268 34506 AUDIT nova.service [-] Starting conductor node (version 2014.1)
2014-05-15 12:02:25.271 34508 AUDIT nova.service [-] Starting conductor node (version 2014.1)
/usr/lib/python2.7/dist-packages/nova/openstack/common/db/sqlalchemy/session.py:379: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
  match = pattern.match(integrity_error.message)
/usr/lib/python2.7/dist-packages/nova/openstack/common/db/sqlalchemy/session.py:379: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
  match = pattern.match(integrity_error.message)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 346, in fire_timers
    timer()
  File "/usr/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 56, in __call__
    cb(*args, **kw)
  File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 194, in main
2014-05-15 12:02:25.862 34502 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
    result = function(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/service.py", line 480, in run_service
    service.start()
  File "/usr/lib/python2.7/dist-packages/nova/service.py", line 172, in start
    self.service_ref = self._create_service_ref(ctxt)
  File "/usr/lib/python2.7/dist-packages/nova/service.py", line 224, in _create_service_ref
    service = self.conductor_api.service_create(context, svc_values)
  File "/usr/lib/python2.7/dist-packages/nova/conductor/api.py", line 202, in service_create
    return self._manager.service_create(context, values)
  File "/usr/lib/python2.7/dist-packages/nova/utils.py", line 966, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/conductor/manager.py", line 461, in service_create
    svc = self.db.service_create(context, values)
  File "/usr/lib/python2.7/dist-packages/nova/db/api.py", line 139, in service_create
    return IMPL.service_create(context, values)
  File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 146, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 521, in service_create
    binary=values.get('binary'))
ServiceBinaryExists: Service with host glover binary nova-conductor exists.
2014-05-15 12:02:25.864 34503 ERROR nova.openstack.common.threadgroup [-] Service with host glover binary nova-conductor exists.
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup Traceback (most recent call last):
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/openstack/common/threadgroup.py", line 117, in wait
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup x.wait()
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/openstack/common/threadgroup.py", line 49, in wait
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup return self.thread.wait()
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 168, in wait
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup return self._exit_event.wait()
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup return hubs.get_hub().switch()
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 187, in switch
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup return self.greenlet.switch()
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 194, in main
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup result = function(*args, **kwargs)
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/openstack/common/service.py", line 480, in run_service
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup service.start()
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/service.py", line 172, in start
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup self.service_ref = self._create_service_ref(ctxt)
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/service.py", line 224, in _create_service_ref
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup service = self.conductor_api.service_create(context, svc_values)
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/conductor/api.py", line 202, in service_create
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup return self._manager.service_create(context, values)
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/utils.py", line 966, in wrapper
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup return func(*args, **kwargs)
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/conductor/manager.py", line 461, in service_create
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup svc = self.db.service_create(context, values)
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/db/api.py", line 139, in service_create
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup return IMPL.service_create(context, values)
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 146, in wrapper
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup return f(*args, **kwargs)
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 521, in service_create
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup binary=values.get('binary'))
2014-05-15 12:02:25.864 34503 TRACE nova.openstack.common.threadgroup ServiceBinaryExists: Service with host glover binary nova-conductor exists.

After looking into the traceback, I came across this piece of code which starts a service. Of note is the 2nd except path. This exception was introduced in January to handle races when creating a record. It seems as if we are racing similarly in our situation and might need to handle ServiceBinaryExists the same way ServiceTopicExists is being handled. I'll submit a patch for this.

From nova/service.py:

 158 def start(self):
 159 verstr = version.version_string_with_package()
 160 LOG.audit(_('Starting %(topic)s node (version %(version)s)'),
 161 {'topic': self.topic, 'version': verstr})
 162 self.basic_config_check()
 163 self.manager.init_host()
 164 self.model_disconnected = False
 165 ctxt = context.get_admin_context()
 166 try:
 167 self.service_ref = self.conductor_api.service_get_by_args(ctxt,
 168 self.host, self.binary)
 169 self.service_id = self.service_ref['id']
 170 except exception.NotFound:
 171 try:
 172 self.service_ref = self._create_service_ref(ctxt)
 173 except exception.ServiceTopicExists:
 174 # NOTE(danms): If we race to create a record with a sibling
 175 # worker, don't fail here.
 176 self.service_ref = self.conductor_api.service_get_by_args(ctxt,
 177 self.host, self.binary)

$ git blame -L 158,178 nova/service.py
b3f5aba0 (Andy Smith 2010-12-09 15:25:14 -0800 158) def start(self):
481d6ff1 (Daniel P. Berrange 2012-12-17 12:17:59 +0000 159) verstr = version.version_string_with_package()
481d6ff1 (Daniel P. Berrange 2012-12-17 12:17:59 +0000 160) LOG.audit(_('Starting %(topic)s node (version %(versi
481d6ff1 (Daniel P. Berrange 2012-12-17 12:17:59 +0000 161) {'topic': self.topic, 'version': verstr})
e8386a27 (Rafi Khardalian 2013-01-23 01:55:09 +0000 162) self.basic_config_check()
065257fb (Vishvananda Ishaya 2010-09-23 12:43:41 -0700 163) self.manager.init_host()
9c98cfb4 (Vishvananda Ishaya 2010-08-30 00:55:19 -0700 164) self.model_disconnected = False
5e3da586 (Vishvananda Ishaya 2010-10-01 05:57:17 -0700 165) ctxt = context.get_admin_context()
57a103b3 (Vishvananda Ishaya 2010-09-02 14:13:22 -0700 166) try:
e9d34263 (Russell Bryant 2013-01-14 17:39:29 -0500 167) self.service_ref = self.conductor_api.service_get
e9d34263 (Russell Bryant 2013-01-14 17:39:29 -0500 168) self.host, self.binary)
e34bc343 (Wenhao Xu 2013-01-21 19:07:34 +0800 169) self.service_id = self.service_ref['id']
57a103b3 (Vishvananda Ishaya 2010-09-02 14:13:22 -0700 170) except exception.NotFound:
f6c341b4 (Dan Smith 2014-01-27 14:03:57 -0800 171) try:
f6c341b4 (Dan Smith 2014-01-27 14:03:57 -0800 172) self.service_ref = self._create_service_ref(c
f6c341b4 (Dan Smith 2014-01-27 14:03:57 -0800 173) except exception.ServiceTopicExists:
f6c341b4 (Dan Smith 2014-01-27 14:03:57 -0800 174) # NOTE(danms): If we race to create a record
f6c341b4 (Dan Smith 2014-01-27 14:03:57 -0800 175) # worker, don't fail here.
f6c341b4 (Dan Smith 2014-01-27 14:03:57 -0800 176) self.service_ref = self.conductor_api.service
f6c341b4 (Dan Smith 2014-01-27 14:03:57 -0800 177) self.host, self.binary)

Tracy Jones (tjones-i) on 2014-06-11
tags: added: conductor
Matt Riedemann (mriedem) wrote :

What release is this on, or trunk/master for Juno on 6/5?

Corey Bryant (corey.bryant) wrote :

Hi Matt,

Sorry I missed your comment. I have a fix proposed but just need a reviewer. Any chance you could review?

https://review.openstack.org/#/c/98199/

Thanks,
Corey

Changed in nova:
assignee: nobody → Corey Bryant (corey.bryant)
Changed in ubuntu:
assignee: nobody → Corey Bryant (corey.bryant)

Reviewed: https://review.openstack.org/98199
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=37247aba813fbc408901a2f1b1ad88e2acb796a1
Submitter: Jenkins
Branch: master

commit 37247aba813fbc408901a2f1b1ad88e2acb796a1
Author: Corey Bryant <email address hidden>
Date: Mon Jun 16 12:17:14 2014 -0400

    Handle service creation race by service workers

    This handle a race where a sibling has already created the service
    record in the database. This fix is similar to that of commit
    f6c341b4.

    Change-Id: I7975fc632eee34390c5d7dc275b52363ad45475e
    Closes-bug: 1326901

Changed in nova:
status: New → Fix Committed
James Page (james-page) on 2014-07-21
affects: ubuntu → nova (Ubuntu)
James Page (james-page) on 2014-07-21
tags: added: icehouse-backport-potential
Changed in nova:
milestone: none → juno-2
status: Fix Committed → Fix Released
James Page (james-page) on 2014-08-01
Changed in nova (Ubuntu Utopic):
status: New → Fix Released
importance: Undecided → High
Changed in nova (Ubuntu Trusty):
importance: Undecided → High
status: New → Triaged

Reviewed: https://review.openstack.org/108401
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=72cc37dd952e342966f583be3a020d9424d266df
Submitter: Jenkins
Branch: stable/icehouse

commit 72cc37dd952e342966f583be3a020d9424d266df
Author: Corey Bryant <email address hidden>
Date: Mon Jun 16 12:17:14 2014 -0400

    Handle service creation race by service workers

    This handle a race where a sibling has already created the service
    record in the database. This fix is similar to that of commit
    f6c341b4.

    Change-Id: I7975fc632eee34390c5d7dc275b52363ad45475e
    Closes-bug: 1326901
    (cherry picked from commit 37247aba813fbc408901a2f1b1ad88e2acb796a1)

tags: added: in-stable-icehouse
Thierry Carrez (ttx) on 2014-10-16
Changed in nova:
milestone: juno-2 → 2014.2
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers