On service start, nova-compute fails to register itself to the DB

Bug #1566783 reported by Shoham Peller on 2016-04-06
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Low
Unassigned

Bug Description

When the service starts for the first time, nova-compute fails to register itself to the DB. It seems that a nova-compute service of that node has already registered to the DB, so it's a duplicate entry.
Could be a race condition, and nova-compute tries to write itself to the DB twice?

nova version is 12.0.0, hypervisor is Libvirt + KVM

nova-compute log is attached

2016-04-05 16:31:43,938.938 25518 WARNING nova.compute.resource_tracker [req-34113df1-3be0-4126-8469-cb769e8bcbb0 - - - - -] No compute node record for node2.node.strato:node2.node.strato
2016-04-05 16:31:43,940.940 25518 DEBUG oslo_messaging._drivers.amqpdriver [req-34113df1-3be0-4126-8469-cb769e8bcbb0 - - - - -] MSG_ID is d8657f010ac243d08a3e8fb02f9e76bc _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:393
2016-04-05 16:31:46,177.177 25518 DEBUG oslo_concurrency.lockutils [req-34113df1-3be0-4126-8469-cb769e8bcbb0 - - - - -] Lock "compute_resources" released by "nova.compute.resource_tracker._update_available_resource" :: held 2.259s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:265
2016-04-05 16:31:46,178.178 25518 ERROR nova.compute.manager [req-34113df1-3be0-4126-8469-cb769e8bcbb0 - - - - -] Error updating resources for node node2.node.strato: Remote error: DBDuplicateEntry (_mysql_exceptions.IntegrityError) (1062, "Duplicate entry '3' for key 'PRIMARY'") [SQL: u'INSERT INTO compute_nodes (created_at, updated_at, deleted_at, deleted, service_id, host, vcpus, memory_mb, local_gb, vcpus_used, memory_mb_used, local_gb_used, hypervisor_type, hypervisor_version, hypervisor_hostname, free_ram_mb, free_disk_gb, current_workload, running_vms, cpu_info, disk_available_least, host_ip, supported_instances, metrics, pci_stats, extra_resources, stats, numa_topology, ram_allocation_ratio, cpu_allocation_ratio) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'] [parameters: (datetime.datetime(2016, 4, 5, 20, 31, 43, 946474), None, None, 0, None, 'node2.node.strato', 12, 63809, 29, 0, 1963, 11, 'QEMU', 1005003, 'node2.node.strato', None, None, None, None, '{"vendor": "Intel", "model": "Haswell-noTSX", "arch": "x86_64", "features": ["vme", "dtes64", "vmx", "xtpr", "est", "monitor", "smx", "abm", "tm", "acpi", "osxsave", "ht", "dca", "pdcm", "pdpe1gb", "f16c", "ds", "invtsc", "tm2", "ss", "pbe", "ds_cpl", "rdrand"], "topology": {"cores": 6, "cells": 1, "threads": 2, "sockets": 1}}', 16, '1.132.162.240', '[["i686", "qemu", "hvm"], ["i686", "kvm", "hvm"], ["x86_64", "qemu", "hvm"], ["x86_64", "kvm", "hvm"]]', None, None, None, '{}', None, 1.5, 16.0)]
[u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 447, in _object_dispatch\n return getattr(target, method)(*args, **kwargs)\n', u' File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 213, in wrapper\n return fn(self, *args, **kwargs)\n', u' File "/usr/lib/python2.7/site-packages/nova/objects/compute_node.py", line 304, in create\n db_compute = db.compute_node_create(self._context, updates)\n', u' File "/usr/lib/python2.7/site-packages/nova/db/api.py", line 244, in compute_node_create\n return IMPL.compute_node_create(context, values)\n', u' File "/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 607, in compute_node_create\n compute_node_ref.save()\n', u' File "/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/models.py", line 82, in save\n super(NovaBase, self).save(session=session)\n', u' File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/models.py", line 48, in save\n session.flush()\n', u' File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2004, in flush\n self._flush(objects)\n', u' File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2122, in _flush\n transaction.rollback(_capture_exception=True)\n', u' File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__\n compat.reraise(exc_type, exc_value, exc_tb)\n', u' File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2086, in _flush\n flush_context.execute()\n', u' File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 373, in execute\n rec.execute(self)\n', u' File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 532, in execute\n uow\n', u' File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 174, in save_obj\n mapper, table, insert)\n', u' File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 761, in _emit_insert_statements\n execute(statement, params)\n', u' File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 914, in execute\n return meth(self, multiparams, params)\n', u' File "/usr/lib64/python2.7/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection\n return connection._execute_clauseelement(self, multiparams, params)\n', u' File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement\n compiled_sql, distilled_params\n', u' File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context\n context)\n', u' File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1335, in _handle_dbapi_exception\n util.raise_from_cause(newraise, exc_info)\n', u' File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause\n reraise(type(exception), exception, tb=exc_tb)\n', u' File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context\n context)\n', u' File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 442, in do_execute\n cursor.execute(statement, parameters)\n', u' File "/usr/lib64/python2.7/site-packages/MySQLdb/cursors.py", line 174, in execute\n self.errorhandler(self, exc, value)\n', u' File "/usr/lib64/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler\n raise errorclass, errorvalue\n', u'DBDuplicateEntry: (_mysql_exceptions.IntegrityError) (1062, "Duplicate entry \'3\' for key \'PRIMARY\'") [SQL: u\'INSERT INTO compute_nodes (created_at, updated_at, deleted_at, deleted, service_id, host, vcpus, memory_mb, local_gb, vcpus_used, memory_mb_used, local_gb_used, hypervisor_type, hypervisor_version, hypervisor_hostname, free_ram_mb, free_disk_gb, current_workload, running_vms, cpu_info, disk_available_least, host_ip, supported_instances, metrics, pci_stats, extra_resources, stats, numa_topology, ram_allocation_ratio, cpu_allocation_ratio) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)\'] [parameters: (datetime.datetime(2016, 4, 5, 20, 31, 43, 946474), None, None, 0, None, \'node2.node.strato\', 12, 63809, 29, 0, 1963, 11, \'QEMU\', 1005003, \'node2.node.strato\', None, None, None, None, \'{"vendor": "Intel", "model": "Haswell-noTSX", "arch": "x86_64", "features": ["vme", "dtes64", "vmx", "xtpr", "est", "monitor", "smx", "abm", "tm", "acpi", "osxsave", "ht", "dca", "pdcm", "pdpe1gb", "f16c", "ds", "invtsc", "tm2", "ss", "pbe", "ds_cpl", "rdrand"], "topology": {"cores": 6, "cells": 1, "threads": 2, "sockets": 1}}\', 16, \'1.132.162.240\', \'[["i686", "qemu", "hvm"], ["i686", "kvm", "hvm"], ["x86_64", "qemu", "hvm"], ["x86_64", "kvm", "hvm"]]\', None, None, None, \'{}\', None, 1.5, 16.0)]\n'].

Shoham Peller (shoham-peller) wrote :
tags: added: compute

@Shoham Peller:

Would you please attach the "nova-conductor" log as well? That's the
service which handles the communication with the database. The issue
sounds a little like a race-condition. Please switch the status back
to "New" after you attached the nova-conductor log.

HINT: You can also use "sosreport" to collect the information of a node:

    $ sudo sosreport -o openstack_nova --batch

Changed in nova:
status: New → Incomplete
Shoham Peller (shoham-peller) wrote :

Attached the conductor log.
Thanks

Changed in nova:
status: Incomplete → New

I assume that the RPC messages to create a new database record for
the second compute node create a race-condition. The new
nova-compute service for host "node2.node.strato" seems to be already
saved in the database, which is unknown to a second thread from another
RPC message. The important log message seem to be these two:

nova-compute.log

    Error updating resources for node node2.node.strato:
    Cannot load 'id' in the base class

nova-conductor.log

    Expected exception during message handling
    ((_mysql_exceptions.IntegrityError)
    (1062, "Duplicate entry '3' for key 'PRIMARY'")
    [SQL: u'INSERT INTO compute_nodes [...]

I'm no expert here but I assume that we need to catch the
"DBDuplicateError" like in [1] in the method "compute_node_create" [2].
Would be cool if a DB expert could have a look.

I don't know how to reproduce it locally, but it looks like a valid
bug to me.

References:
[1] https://github.com/openstack/nova/blob/stable/liberty/nova/db/sqlalchemy/api.py#L2693-L2697
[2] https://github.com/openstack/nova/blob/stable/liberty/nova/db/sqlalchemy/api.py#L599-L609

Changed in nova:
status: New → Confirmed
importance: Undecided → Low

Fix proposed to branch: master
Review: https://review.openstack.org/304415

Changed in nova:
assignee: nobody → Rajesh Tailor (ratailor)
status: Confirmed → In Progress

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/304415
Reason: This patch has been sitting unchanged for more than 12 weeks. I am therefore going to abandon it to keep the nova review queue sane. Please feel free to restore the change if you're still working on it.

Sean Dague (sdague) wrote :

There are no currently open reviews on this bug, changing the status back to the previous state and unassigning. If there are active reviews related to this bug, please include links in comments.

Changed in nova:
status: In Progress → Confirmed
assignee: Rajesh Tailor (ratailor) → nobody
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers