2.4.0a1 upgrade failed with old node event data

Bug #1748712 reported by Mark Shuttleworth
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Won't Fix
High
Unassigned

Bug Description

I saw this today:

  Applying maasserver.0131_update_event_model_for_audit_logs...Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
psycopg2.IntegrityError: insert or update on table "maasserver_event" violates foreign key constraint "maasserver_event_node_id_dd4495a7_fk_maasserver_node_id"
DETAIL: Key (node_id)=(16) is not present in table "maasserver_node".

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3/dist-packages/maasserver/__main__.py", line 9, in <module>
    raise SystemExit(execute_from_command_line())
  File "/usr/lib/python3/dist-packages/maasserver/__init__.py", line 56, in execute_from_command_line
    management.execute_from_command_line()
  File "/usr/lib/python3/dist-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "/usr/lib/python3/dist-packages/django/core/management/__init__.py", line 356, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/lib/python3/dist-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/lib/python3/dist-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/usr/lib/python3/dist-packages/maasserver/management/commands/dbupgrade.py", line 426, in handle
    fake_initial=self._south_was_performed(database))
  File "/usr/lib/python3/dist-packages/django/core/management/__init__.py", line 131, in call_command
    return command.execute(*args, **defaults)
  File "/usr/lib/python3/dist-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/usr/lib/python3/dist-packages/django/core/management/commands/migrate.py", line 204, in handle
    fake_initial=fake_initial,
  File "/usr/lib/python3/dist-packages/django/db/migrations/executor.py", line 115, in migrate
    state = self._migrate_all_forwards(state, plan, full_plan, fake=fake, fake_initial=fake_initial)
  File "/usr/lib/python3/dist-packages/django/db/migrations/executor.py", line 145, in _migrate_all_forwards
    state = self.apply_migration(state, migration, fake=fake, fake_initial=fake_initial)
  File "/usr/lib/python3/dist-packages/django/db/migrations/executor.py", line 244, in apply_migration
    state = migration.apply(state, schema_editor)
  File "/usr/lib/python3/dist-packages/django/db/migrations/migration.py", line 129, in apply
    operation.database_forwards(self.app_label, schema_editor, old_state, project_state)
  File "/usr/lib/python3/dist-packages/django/db/migrations/operations/fields.py", line 221, in database_forwards
    schema_editor.alter_field(from_model, from_field, to_field)
  File "/usr/lib/python3/dist-packages/django/db/backends/base/schema.py", line 515, in alter_field
    old_db_params, new_db_params, strict)
  File "/usr/lib/python3/dist-packages/django/db/backends/postgresql/schema.py", line 112, in _alter_field
    new_db_params, strict,
  File "/usr/lib/python3/dist-packages/django/db/backends/base/schema.py", line 779, in _alter_field
    self.execute(self._create_fk_sql(model, new_field, "_fk_%(to_table)s_%(to_column)s"))
  File "/usr/lib/python3/dist-packages/django/db/backends/base/schema.py", line 120, in execute
    cursor.execute(sql, params)
  File "/usr/lib/python3/dist-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/usr/lib/python3/dist-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/usr/lib/python3/dist-packages/django/utils/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib/python3/dist-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
django.db.utils.IntegrityError: insert or update on table "maasserver_event" violates foreign key constraint "maasserver_event_node_id_dd4495a7_fk_maasserver_node_id"
DETAIL: Key (node_id)=(16) is not present in table "maasserver_node".

dpkg: error processing package maas-region-controller (--configure):
 installed maas-region-controller package post-installation script subprocess returned error exit status 1

This looks like a very problematic failure!

Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

Digging in there is an old record for what must be an old node:

maasdb=> select * from maasserver_event where node_id=16;
  id | created | updated | action | description | node_id | type_id
-------+-------------------------------+-------------------------------+--------+--------------------------+---------+---------
 96387 | 2016-07-01 17:03:36.883757+01 | 2016-07-01 17:03:36.883757+01 | | Power state queried: off | 16 | 6
(1 row)

I will try deleting that record and proceeding from there, but we should nonetheless have an upgrade strategy to deal with loose ends in the event table.

Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

Deleting that event row allowed the upgrade to proceed:

$ sudo dpkg --configure -a
Setting up maas-region-controller (2.4.0~alpha1-6573-g12ee2331b-0ubuntu1) ...
Operations to perform:
  Apply all migrations: auth, contenttypes, maasserver, metadataserver, piston3, sessions, sites
Running migrations:
  Applying maasserver.0131_update_event_model_for_audit_logs... OK
  Applying maasserver.0132_consistent_model_name_validation... OK
  Applying maasserver.0133_add_resourcepool_model... OK
  Applying maasserver.0134_create_default_resourcepool... OK
  Applying maasserver.0135_add_pool_reference_to_node... OK
  Applying maasserver.0136_add_user_role_models... OK
  Applying maasserver.0137_create_default_roles... OK
  Applying maasserver.0138_add_ip_and_user_agent_to_event_model... OK
  Applying maasserver.0139_add_endpoint_and_increase_user_agent_length_for_event... OK
  Applying maasserver.0140_add_usergroup_model... OK
  Applying maasserver.0141_add_default_usergroup... OK
  Applying maasserver.0142_pod_default_resource_pool... OK
  Applying maasserver.0143_blockdevice_firmware... OK
  Applying maasserver.0144_filesystem_zfsroot_support... OK
  Applying metadataserver.0016_script_model_fw_update_and_hw_config... OK
  Applying metadataserver.0017_store_requested_scripts... OK
  Applying sites.0002_alter_domain_unique... OK
Setting up maas (2.4.0~alpha1-6573-g12ee2331b-0ubuntu1) ...

summary: - Upgrade to 2.4.0a1 failed
+ 2.4.0a1 upgrade failed with old node event data
Revision history for this message
Mike Pontillo (mpontillo) wrote :

The commit that corresponds to migration 131 changes the Event model in the following relevant way:

- node = ForeignKey('Node', null=False, editable=False, on_delete=CASCADE)
+ node = ForeignKey('Node', null=True, editable=False, on_delete=SET_NULL)

It's worth noting here that It looks like in this case, Django failed to CASCADE the delete to the event table. The way this happens is noted in the Django docs as follows:

"""
Cascade deletes. Django emulates the behavior of the SQL constraint ON DELETE CASCADE and also deletes the object containing the ForeignKey.
"""

Since Django is /emulating/ the behavior of the cascade delete, it's possible that due to a Django bug (or unlucky timing when stopping a service?) Django failed to delete this row in the database when it should have CASCADEd from the deletion of a Node. With the old model, I wonder if this was implicitly allowed (no foreign key constraint existed at all).

In case anyone else runs across this, here's a general workaround that can be executed from the `sudo maas-region dbshell`:

    DELETE FROM maasserver_event where node_id NOT IN (select id from maasserver_node);

Changed in maas:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.4.x
Revision history for this message
Mike Pontillo (mpontillo) wrote :

After checking a MAAS 2.3 install, I see that there is indeed a foreign key constraint that should prevent events from being inserted for invalid nodes.

So my best guess at what happened is: a power check event was entered into the database at the same time as a different transaction was deleting the Node object. Django deleted all the referenced events, except the newly-inserted power check event. Concurrently, the insertion of the power check event was allowed because the Node existed, but the job running to delete the Node didn't cascade-delete the newly-created event... resulting in this "dangling reference" in the database.

Revision history for this message
Blake Rouse (blake-rouse) wrote :

Based on the date of that event, it was when MAAS 2.0rc1 was put out for testing. This could be related to a pre-release version issue, where the constraint was add before MAAS 2.0 went final.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

I think it's likely that this is a subtle/rare bug with postgresql (the transaction should have failed!?), but it could have been something to do with the way MAAS was setting up transaction isolation (IIRC, there were a lot of database transaction related bugs being worked on back then).

A general fix to this issue may involve significantly more than the event table. If I run the following in the MAAS repository:

    git grep AlterField | cut -d':' -f 1 | xargs cat | grep ForeignKey | wc -l

... I see that there are 71 instances in migrations where fields are both altered, and foreign keys are modified. The estimate is a bit high, because a migration could both add a foreign key and modify another field, but each of these 71 alterations implies a change to the constraints. So there are dozens of migrations that *could* cause a foreign key violation, __in the event that the referential integrity of the database has been breached__.

I would rather not implement and maintain dozens of potential workarounds (one more every time we change a foreign key in the model) for database integrity issues in MAAS -- the database should be responsible for database integrity. I feel like we should wait to see if others hit this or similar issues first. If we see this bug happening with the Event model in other environments, maybe we should attempt a one-off workaround (without considering the other foreign keys).

Thoughts?

Changed in maas:
milestone: 2.4.x → none
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

Issue not reproduced since, and the affected version of MAAS is long out of support.

Changed in maas:
status: Triaged → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.