[2.x] psycopg2.IntegrityError: update or delete on table "maasserver_node" violates foreign key constraint "maasserver_event_node_id_xxx_fk_maasserver_node_id" on table "maasserver_event" DETAIL: Key (id)=(xx) is still referenced from table "maasserver_event".

Bug #1726474 reported by Jason Hobbs
Affects: MAAS — Status: Fix Released — Importance: Critical — Assigned to: Blake Rouse
Affects: MAAS 2.2 — Status: Won't Fix — Importance: Critical — Assigned to: Unassigned

Bug Description

A "node delete" API request failed with an integrity error:

psycopg2.IntegrityError: update or delete on table "maasserver_node" violates foreign key constraint "maasserver_event_node_id_61e4521ce30bf078_fk_maasserver_node_id" on table "maasserver_event"
DETAIL: Key (id)=(26) is still referenced from table "maasserver_event".

This is in an HA setup with 3 region controllers and 3 rack controllers. Logs are available here:

https://10.245.162.101/artifacts/f6a4b8b4-7ba6-49f0-b32a-c85569499618/cpe_cloud_232/infra-logs.tar

The integrity error stack trace can be seen in the regiond.log for 10.245.208.30.

Related branches

Changed in maas:
milestone: none → 2.3.0beta3
Changed in maas:
importance: Undecided → Critical
status: New → Triaged
summary: - psycopg2.IntegrityError: update or delete on table "maasserver_node"
- violates foreign key constraint
+ [2.x] psycopg2.IntegrityError: update or delete on table
+ "maasserver_node" violates foreign key constraint
"maasserver_event_node_id_xxx_fk_maasserver_node_id" on table
"maasserver_event" DETAIL: Key (id)=(xx) is still referenced from table
"maasserver_event".
Revision history for this message
Blake Rouse (blake-rouse) wrote :

Would it be possible that you are deleting a node while it is deploying?

I believe what is happening is that Django is deleting the node, which also involves the cascade delete of all the events for that node.

1. Region A starts a transaction.
2. Region A selects all events that are related to the node.
3. Region A deletes all selected events.
4. Region B starts a transaction.
5. Region B adds a new event.
6. Region B commits transaction.
7. Region A deletes the node.
8. Region A commits the transaction (which fails, since Region B added new events)

So, back to the original question: was the node still deploying or rebooting, with events being created, when you chose to delete it?
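The interleaving above can be illustrated with a toy model. This is a sketch only; the table names match the bug, but the data structures and function are hypothetical stand-ins, not MAAS or Django code:

```python
# Toy model of the race between two regiond processes. All names here
# are hypothetical; this is an illustration, not MAAS code.

class IntegrityError(Exception):
    """Stand-in for psycopg2.IntegrityError."""

# Shared "database": a node table and an event table whose values are
# foreign keys into the node table (as in maasserver_event.node_id).
nodes = {26: "machine-26"}
events = {1: 26, 2: 26}  # event_id -> node_id

def delete_node_without_retry(node_id):
    # Steps 2-3: Region A selects the related events and deletes them.
    selected = [e for e, n in events.items() if n == node_id]
    for e in selected:
        del events[e]
    # Steps 4-6: Region B commits a new event in between (simulated
    # inline here; in reality this happens in a concurrent transaction).
    events[3] = node_id
    # Step 7: Region A deletes the node. The foreign-key check fails
    # because the freshly committed event still references it.
    if any(n == node_id for n in events.values()):
        raise IntegrityError(
            'Key (id)=(%d) is still referenced from table '
            '"maasserver_event".' % node_id)
    del nodes[node_id]
```

Calling `delete_node_without_retry(26)` raises the stand-in `IntegrityError` and leaves the node row behind, mirroring the failure in the report.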

Changed in maas:
status: Triaged → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

In foundation engine, we add new nodes without knowing their MAC address (a workaround for bug 1707216) by taking the following approach:

1) add the node with a made-up MAC address and valid BMC credentials
2) MAAS tells the node to power on and PXE boot to perform commissioning
3) we delete the node before it starts PXE booting (right after the "add node" API call returns)
4) the node PXE boots and enlists
5) we poll the node listing, matching the new node by its power parameters (the IP address of its BMC)

So, yes, the node would have been actively booting when we issued the delete API call.
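For illustration, the workaround can be sketched as request construction against the MAAS 2.x REST API. Nothing is sent here; the endpoint paths follow the documented MAAS 2.0 API layout, and the host and parameter values are assumptions:

```python
# Sketch of the foundation-engine workaround as HTTP requests (built but
# not sent). Host and parameter values are hypothetical.

MAAS_URL = "http://maas.example.com:5240/MAAS/api/2.0"

def add_machine_request(mac, power_type, power_parameters):
    # 1) add the node with a made-up MAC and valid BMC credentials
    return ("POST", f"{MAAS_URL}/machines/", {
        "mac_addresses": mac,
        "power_type": power_type,
        "power_parameters": power_parameters,
    })

def delete_machine_request(system_id):
    # 3) delete the node right after the add call returns; this is the
    # call that raced with enlistment events in this bug
    return ("DELETE", f"{MAAS_URL}/machines/{system_id}/", None)

def list_machines_request():
    # 5) poll the machine listing, then match by power parameters
    return ("GET", f"{MAAS_URL}/machines/", None)
```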

Changed in maas:
status: Incomplete → New
Revision history for this message
Björn Tillenius (bjornt) wrote :

Normally the situation Blake described would be solved by cascading deletes. However, Django implements its own cascading logic instead of relying on the database, and it doesn't seem to retry in case of errors like these.

Ideally it should be fixed in Django, but maybe we could retry the request if it fails like this?

Revision history for this message
Blake Rouse (blake-rouse) wrote :

The reason Django implements its own cascade delete logic is so it can send the pre_delete and post_delete signal for all objects that are cascade deleted. I don't think changing that logic is the correct path forward as that might cause other fallout.
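The trade-off Blake describes can be sketched with a toy collector. The function and signal hooks below are hypothetical simplifications, not Django's actual `Collector`; the point is that an application-side cascade can invoke per-object hooks, which a database-side `ON DELETE CASCADE` would bypass:

```python
# Toy sketch of an application-side cascade delete that fires
# pre_delete/post_delete hooks for every cascaded object. Hypothetical
# names; not Django's implementation.

fired = []

def pre_delete(obj):
    fired.append(("pre_delete", obj["id"]))

def post_delete(obj):
    fired.append(("post_delete", obj["id"]))

def cascade_delete(node, events):
    # Collect every event related to the node, delete each one between
    # its signals, then delete the node itself. A database-side cascade
    # would remove the rows without ever calling these hooks.
    related = [e for e in events if e["node"] == node["id"]]
    for event in related:
        pre_delete(event)
        events.remove(event)
        post_delete(event)
    pre_delete(node)
    post_delete(node)
```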

I do think you're correct that we should retry in this case. We have a built-in retry mechanism for handling serialization issues like this; it needs to be extended to handle this case as well.

Revision history for this message
Björn Tillenius (bjornt) wrote :

Right. When I said it should be fixed in Django, I meant that it should expect serialization errors and automatically retry. But I don't see that happening so we'll have to fix it in our code.
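An application-layer retry could look roughly like the decorator below. This is a minimal sketch, not the fix that landed in MAAS: it assumes each attempt re-runs the delete in a fresh transaction (so the retry sees any events committed in the meantime), and it uses a local stand-in for `psycopg2.IntegrityError`:

```python
import functools
import time

class IntegrityError(Exception):
    """Stand-in for psycopg2.IntegrityError in this sketch."""

def retry_on_integrity_error(attempts=3, delay=0.0):
    # Re-run the wrapped operation when a concurrent insert makes the
    # cascade delete fail. For this to help, the wrapped function must
    # start a fresh transaction each time, so the new attempt also
    # collects the events added since the last one.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return func(*args, **kwargs)
                except IntegrityError:
                    if attempt == attempts - 1:
                        raise  # out of attempts; surface the error
                    time.sleep(delay)
        return wrapper
    return decorator
```

A delete that fails twice due to racing inserts and then succeeds would complete on the third attempt instead of surfacing a 500 to the API caller.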

Changed in maas:
status: New → In Progress
assignee: nobody → Blake Rouse (blake-rouse)
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released