Potentially duplicated presence updates

Bug #1702671 reported by Mark Shuttleworth
This bug affects 3 people
Affects: MAAS
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

I see a lot of this in the postgres log for MAAS:

2017-07-06 12:26:44 BST [92158-1] maas@maasdb ERROR: could not serialize access due to concurrent update
2017-07-06 12:26:44 BST [92158-2] maas@maasdb STATEMENT: UPDATE "maasserver_node" SET "updated" = '2017-07-06T12:26:44.442947'::timestamp, "status_expires" = NULL WHERE "maasserver_node"."id" = 1
2017-07-06 12:26:45 BST [92165-1] maas@maasdb ERROR: could not serialize access due to concurrent update
2017-07-06 12:26:45 BST [92165-2] maas@maasdb STATEMENT: UPDATE "maasserver_staticipaddress" SET "created" = '2017-07-06T12:26:45'::timestamp, "updated" = '2017-07-06T12:26:45.597922'::timestamp, "ip" = '192.168.1.166'::inet, "alloc_type" = 6, "subnet_id" = 2, "user_id" = NULL, "lease_time" = 600 WHERE "maasserver_staticipaddress"."id" = 349572

What are we using the 'updated' field for in this table?
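
A plausible mechanism (an assumption, not confirmed in this report) is Django's auto_now behaviour: a DateTimeField declared with auto_now=True is rewritten on every save(), so a save that only clears status_expires still rewrites updated, and two concurrent saves of the same row can then collide. A minimal illustrative model, not MAAS's actual code:

    # Illustrative Django model, not MAAS's actual definition: the point is the
    # auto_now field, which is rewritten on every save().
    from django.db import models

    class Node(models.Model):
        updated = models.DateTimeField(auto_now=True)          # set on every save()
        status_expires = models.DateTimeField(null=True, blank=True)

    # node.save() emits something like:
    #   UPDATE "app_node" SET "updated" = <now>, "status_expires" = NULL WHERE "id" = 1
    # Two transactions doing this to the same row at a high enough isolation level
    # can fail with "could not serialize access due to concurrent update".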

Changed in maas:
importance: Undecided → High
milestone: none → 2.3.0
status: New → Triaged
tags: added: performance
Changed in maas:
milestone: 2.3.0 → 2.3.x
Revision history for this message
Alok G Singh (alephnull-b) wrote :

The timestamp updates cause considerable CPU load. By watching top and pg_stat_activity, I was able to catch PID 51544 in the act. The PID's actions are:

root@infr002:/var/log/postgresql# grep 51544 postgresql-9.5-ha.log
2019-01-28 01:20:13 UTC [51544-1] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-01-28 01:20:13 UTC [51544-2] maas@maasdb STATEMENT: UPDATE "maasserver_node" SET "updated" = '2019-01-28T01:19:59.158232'::timestamp, "status_expires" = NULL WHERE "maasserver_node"."id" = 1
2019-01-28 01:20:23 UTC [51544-3] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-01-28 01:20:23 UTC [51544-4] maas@maasdb STATEMENT: UPDATE "maasserver_node" SET "updated" = '2019-01-28T01:20:13.954158'::timestamp, "status_expires" = NULL WHERE "maasserver_node"."id" = 1
2019-01-28 01:20:33 UTC [51544-5] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-01-28 01:20:33 UTC [51544-6] maas@maasdb STATEMENT: UPDATE "maasserver_node" SET "updated" = '2019-01-28T01:20:23.316807'::timestamp, "status_expires" = NULL WHERE "maasserver_node"."id" = 1
2019-01-28 01:20:47 UTC [51544-7] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-01-28 01:20:47 UTC [51544-8] maas@maasdb STATEMENT: UPDATE "maasserver_service" SET "updated" = '2019-01-28T01:20:33.513096'::timestamp, "status_info" = '50% connected to region controllers.' WHERE "maasserver_service"."id" = 24
2019-01-28 01:20:55 UTC [51544-9] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-01-28 01:20:55 UTC [51544-10] maas@maasdb STATEMENT: UPDATE "maasserver_node" SET "updated" = '2019-01-28T01:20:47.634756'::timestamp, "status_expires" = NULL WHERE "maasserver_node"."id" = 1
2019-01-28 01:21:43 UTC [51544-11] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-01-28 01:21:43 UTC [51544-12] maas@maasdb STATEMENT: UPDATE "maasserver_service" SET "updated" = '2019-01-28T01:20:56.272271'::timestamp, "status_info" = '33% connected to region controllers.' WHERE "maasserver_service"."id" = 29
2019-01-28 01:23:45 UTC [52492-15] maas@maasdb DETAIL: Process 52492 waits for ShareLock on transaction 107337590; blocked by process 51544.
        Process 51544 waits for ExclusiveLock on tuple (74745,15) of relation 16829 of database 16385; blocked by process 52486.
        Process 51544: UPDATE "maasserver_node" SET "managing_process_id" = NULL WHERE "maasserver_node"."id" IN (3)
2019-01-28 01:24:04 UTC [52486-15] maas@maasdb DETAIL: Process 52486 waits for ShareLock on transaction 107337590; blocked by process 51544.
        Process 51544 waits for ExclusiveLock on tuple (74745,15) of relation 16829 of database 16385; blocked by process 54211.
        Process 51544: UPDATE "maasserver_node" SET "managing_process_id" = NULL WHERE "maasserver_node"."id" IN (3)
2019-01-28 01:24:30 UTC [52489-18] maas@maasdb DETAIL: Process 52489 waits for ShareLock on transaction 107337590; blocked by process 51544.
        Process 51544 waits for ShareLock on transaction 107337670; blocked by process 52489.
        Process 51544: UPDATE "maasserver_node" SET "managing_p...
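
For reference, the kind of lookup described above (tying a hot PID seen in top to its current SQL statement) can be scripted against the standard pg_stat_activity view. A minimal sketch; connection details are illustrative and the column names are those of PostgreSQL 9.5:

    # Minimal sketch: map a busy PID (as seen in top) to its current SQL statement
    # via pg_stat_activity. Connection details are illustrative; "waiting" is the
    # PostgreSQL 9.5 column (replaced by wait_event in 9.6+).
    import psycopg2

    TARGET_PID = 51544  # the PID observed in top

    conn = psycopg2.connect(dbname="maasdb", user="maas", host="localhost")
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT pid, state, waiting, query FROM pg_stat_activity WHERE pid = %s",
            (TARGET_PID,),
        )
        for pid, state, waiting, query in cur.fetchall():
            print(pid, state, waiting, query)
    conn.close()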


Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

Devs, this was targeted to 2.3; please address it.

Revision history for this message
Alok G Singh (alephnull-b) wrote :

Subscribing field-medium because when this situation arises it affects operations. For instance, juju ssh will fail with errors of the form:

ERROR opening environment: unexpected: Get http://172.18.247.21/MAAS/api/2.0/users/?op=whoami: EOF
ERROR opening environment: Get http://172.18.247.21/MAAS/api/2.0/version/: EOF

Rebuilding the postgres cluster seems to "fix" the problem, but I am not sure what triggers the bug.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

I’m setting this to Incomplete as there is not enough information to determine what the issue may be.

Can you provide information such as:

What version of MAAS are you using?
What is your configuration? How many controllers?
How many machines are there in your MAAS cluster, and how many machines get IPs from DHCP?
Can you provide MAAS logs from the time you see the invalid response from the API?
Also, is Juju connecting to MAAS directly, or is it connecting via haproxy? If it's haproxy, what timeout is configured?
Also, is this an issue currently affecting a production environment or just a testing environment?

Changed in maas:
milestone: 2.3.x → none
status: Triaged → Incomplete
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Also, the logs provided in #1 seem to be updates from MAAS caused by it being unable to establish full connectivity between controllers. There is one log line that stands out specifically, and it is this:

ERROR: requested WAL segment 000000010000002100000042 has already been removed

That indicates Postgres replication issues to me, and those replication issues could actually be the reason you have CPU starvation. I would like to see full logs, but otherwise this doesn't look related to the originally reported issue.
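
If replication health is the suspect, a quick check against the primary's pg_stat_replication view shows each standby's state and how far behind it is. A sketch using the PostgreSQL 9.x function names; connection details are illustrative:

    # Minimal sketch: report streaming-replication lag from the primary.
    # Uses PostgreSQL 9.x names (pg_current_xlog_location, replay_location);
    # these were renamed in PostgreSQL 10. Connection details are illustrative.
    import psycopg2

    conn = psycopg2.connect(dbname="maasdb", user="maas", host="localhost")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT application_name, state,
                   pg_xlog_location_diff(pg_current_xlog_location(),
                                         replay_location) AS lag_bytes
            FROM pg_stat_replication
        """)
        for name, state, lag_bytes in cur.fetchall():
            print(name, state, lag_bytes)
    conn.close()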

Revision history for this message
Alok G Singh (alephnull-b) wrote :

MAAS version:
aloksingh@infr001:/var/log/maas$ dpkg -l | grep '^ii' | grep maas
ii maas-cli 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all MAAS client and command-line interface
ii maas-common 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all MAAS server common files
ii maas-dhcp 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all MAAS DHCP server
ii maas-dns 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all MAAS DNS server
ii maas-proxy 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all Rack Controller for MAAS
ii maas-region-api 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all Region Controller for MAAS
ii python3-django-maas 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all MAAS server provisioning libraries (Python 3)

> What is your configuration? How many controllers?

There are three controllers running in KVMs on three infra nodes.

> How many machines are there in your MAAS cluster

Three machines.

> and how many machines get IPs from DHCP?

100-1000

> Can you provide MAAS logs from the time you see the invalid response from the API?

I will provide them in a separate comment; I will have to establish the timing first. I ran tail -f on the postgres log while running juju ssh, but didn't see anything apart from what I've posted already.

> Also, is Juju connecting to MAAS directly, or is it connecting via haproxy? If it's haproxy, what timeout is configured?

MAAS is accessed via a VIP which is managed by haproxy.
$ cat haproxy.cfg
frontend maas
    bind *:80
    retries 3
    option redispatch
    option http-server-close
    default_backend maas

backend maas
    timeout server 90s
    balance source
    hash-type consistent
    server maas-api-0 172.18.247.6:5240 check
    server maas-api-1 172.18.247.7:5240 check
    server maas-api-2 172.18.247.8:5240 check

> Also, is this an issue currently affecting a production environment or just a testing environment?

Production.

Revision history for this message
Alok G Singh (alephnull-b) wrote :

@Andres, the missing WAL segment error had been in the logs for some months before the problem manifested itself. The problem has also been seen in clouds where there are no missing WAL segments.

Changed in maas:
status: Incomplete → New
Alvaro Uria (aluria)
tags: added: canonical-bootstack
Revision history for this message
Blake Rouse (blake-rouse) wrote :

We believe this bug is fixed in the upcoming 2.6.1 release. We did a lot of performance analysis of database transactions to help reduce serialization issues and the number of retries.
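
For context, the standard client-side treatment of "could not serialize access due to concurrent update" is to roll back and retry the whole transaction a bounded number of times. The sketch below shows that general pattern with psycopg2; it is illustrative only and is not MAAS's actual implementation:

    # General retry-on-serialization-failure pattern (illustrative, not MAAS code).
    import time
    import psycopg2
    from psycopg2 import errorcodes

    def run_with_retries(conn, work, attempts=5):
        """Run work(cursor) in a transaction, retrying on serialization failures."""
        for attempt in range(attempts):
            try:
                with conn, conn.cursor() as cur:   # commit on success, rollback on error
                    return work(cur)
            except psycopg2.OperationalError as exc:
                if exc.pgcode != errorcodes.SERIALIZATION_FAILURE or attempt == attempts - 1:
                    raise
                time.sleep(0.05 * 2 ** attempt)    # back off before retrying

    # Example: the kind of single-row update seen in the logs above.
    # run_with_retries(conn, lambda cur: cur.execute(
    #     'UPDATE maasserver_node SET status_expires = NULL WHERE id = %s', (1,)))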

Changed in maas:
milestone: none → 2.6.1rc1
status: New → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released