Potentially duplicated presence updates

Bug #1702671 reported by Mark Shuttleworth
This bug affects 3 people
Affects: MAAS
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

I see a lot of this in the postgres log for MAAS:

2017-07-06 12:26:44 BST [92158-1] maas@maasdb ERROR: could not serialize access due to concurrent update
2017-07-06 12:26:44 BST [92158-2] maas@maasdb STATEMENT: UPDATE "maasserver_node" SET "updated" = '2017-07-06T12:26:44.442947'::timestamp, "status_expires" = NULL WHERE "maasserver_node"."id" = 1
2017-07-06 12:26:45 BST [92165-1] maas@maasdb ERROR: could not serialize access due to concurrent update
2017-07-06 12:26:45 BST [92165-2] maas@maasdb STATEMENT: UPDATE "maasserver_staticipaddress" SET "created" = '2017-07-06T12:26:45'::timestamp, "updated" = '2017-07-06T12:26:45.597922'::timestamp, "ip" = '192.168.1.166'::inet, "alloc_type" = 6, "subnet_id" = 2, "user_id" = NULL, "lease_time" = 600 WHERE "maasserver_staticipaddress"."id" = 349572

What are we using the 'updated' field for in this table?
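
A plausible mechanism (an assumption, not confirmed in this report) is Django's auto_now behaviour: a DateTimeField declared with auto_now=True is rewritten on every save(), so a save that only clears status_expires still rewrites updated, and two concurrent saves of the same row can then collide. A minimal illustrative model, not MAAS's actual code:

    # Illustrative Django model, not MAAS's actual definition: the point is the
    # auto_now field, which is rewritten on every save().
    from django.db import models

    class Node(models.Model):
        updated = models.DateTimeField(auto_now=True)          # set on every save()
        status_expires = models.DateTimeField(null=True, blank=True)

    # node.save() emits something like:
    #   UPDATE "app_node" SET "updated" = <now>, "status_expires" = NULL WHERE "id" = 1
    # Two transactions doing this to the same row at a high enough isolation level
    # can fail with "could not serialize access due to concurrent update".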

Changed in maas:
importance: Undecided → High
milestone: none → 2.3.0
status: New → Triaged
tags: added: performance
Changed in maas:
milestone: 2.3.0 → 2.3.x
Revision history for this message
Alok G Singh (alephnull-b) wrote :

The timestamp updates cause considerable CPU load. By watching top and pg_stat_activity, I was able to catch PID 51544 in the act. The PID's actions are:

root@infr002:/var/log/postgresql# grep 51544 postgresql-9.5-ha.log
2019-01-28 01:20:13 UTC [51544-1] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-01-28 01:20:13 UTC [51544-2] maas@maasdb STATEMENT: UPDATE "maasserver_node" SET "updated" = '2019-01-28T01:19:59.158232'::timestamp, "status_expires" = NULL WHERE "maasserver_node"."id" = 1
2019-01-28 01:20:23 UTC [51544-3] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-01-28 01:20:23 UTC [51544-4] maas@maasdb STATEMENT: UPDATE "maasserver_node" SET "updated" = '2019-01-28T01:20:13.954158'::timestamp, "status_expires" = NULL WHERE "maasserver_node"."id" = 1
2019-01-28 01:20:33 UTC [51544-5] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-01-28 01:20:33 UTC [51544-6] maas@maasdb STATEMENT: UPDATE "maasserver_node" SET "updated" = '2019-01-28T01:20:23.316807'::timestamp, "status_expires" = NULL WHERE "maasserver_node"."id" = 1
2019-01-28 01:20:47 UTC [51544-7] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-01-28 01:20:47 UTC [51544-8] maas@maasdb STATEMENT: UPDATE "maasserver_service" SET "updated" = '2019-01-28T01:20:33.513096'::timestamp, "status_info" = '50% connected to region controllers.' WHERE "maasserver_service"."id" = 24
2019-01-28 01:20:55 UTC [51544-9] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-01-28 01:20:55 UTC [51544-10] maas@maasdb STATEMENT: UPDATE "maasserver_node" SET "updated" = '2019-01-28T01:20:47.634756'::timestamp, "status_expires" = NULL WHERE "maasserver_node"."id" = 1
2019-01-28 01:21:43 UTC [51544-11] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-01-28 01:21:43 UTC [51544-12] maas@maasdb STATEMENT: UPDATE "maasserver_service" SET "updated" = '2019-01-28T01:20:56.272271'::timestamp, "status_info" = '33% connected to region controllers.' WHERE "maasserver_service"."id" = 29
2019-01-28 01:23:45 UTC [52492-15] maas@maasdb DETAIL: Process 52492 waits for ShareLock on transaction 107337590; blocked by process 51544.
        Process 51544 waits for ExclusiveLock on tuple (74745,15) of relation 16829 of database 16385; blocked by process 52486.
        Process 51544: UPDATE "maasserver_node" SET "managing_process_id" = NULL WHERE "maasserver_node"."id" IN (3)
2019-01-28 01:24:04 UTC [52486-15] maas@maasdb DETAIL: Process 52486 waits for ShareLock on transaction 107337590; blocked by process 51544.
        Process 51544 waits for ExclusiveLock on tuple (74745,15) of relation 16829 of database 16385; blocked by process 54211.
        Process 51544: UPDATE "maasserver_node" SET "managing_process_id" = NULL WHERE "maasserver_node"."id" IN (3)
2019-01-28 01:24:30 UTC [52489-18] maas@maasdb DETAIL: Process 52489 waits for ShareLock on transaction 107337590; blocked by process 51544.
        Process 51544 waits for ShareLock on transaction 107337670; blocked by process 52489.
        Process 51544: UPDATE "maasserver_node" SET "managing_p...
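
For reference, the kind of lookup described above (tying a hot PID seen in top to its current SQL statement) can be scripted against the standard pg_stat_activity view. A minimal sketch; connection details are illustrative and the column names are those of PostgreSQL 9.5:

    # Minimal sketch: map a busy PID (as seen in top) to its current SQL statement
    # via pg_stat_activity. Connection details are illustrative; "waiting" is the
    # PostgreSQL 9.5 column (replaced by wait_event in 9.6+).
    import psycopg2

    TARGET_PID = 51544  # the PID observed in top

    conn = psycopg2.connect(dbname="maasdb", user="maas", host="localhost")
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT pid, state, waiting, query FROM pg_stat_activity WHERE pid = %s",
            (TARGET_PID,),
        )
        for pid, state, waiting, query in cur.fetchall():
            print(pid, state, waiting, query)
    conn.close()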


Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

Devs, this was targeted to 2.3; please address it.

Revision history for this message
Alok G Singh (alephnull-b) wrote :

Subscribing field-medium because when this situation arises it affects operations. For instance, juju ssh will fail with errors of the form:

ERROR opening environment: unexpected: Get http://172.18.247.21/MAAS/api/2.0/users/?op=whoami: EOF
ERROR opening environment: Get http://172.18.247.21/MAAS/api/2.0/version/: EOF

Rebuilding the postgres cluster seems to "fix" the problem, but I am not sure what triggers the bug.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

I’m setting this to Incomplete as there is not enough information to determine what the issue may be.

Can you provide information such as:

What version of MAAS are you using?
What is your configuration? How many controllers?
How many machines are there in your MAAS cluster, and how many machines get IPs from DHCP?
Can you provide MAAS logs from the time you see the invalid response from the API?
Also, is Juju connecting to MAAS directly, or is it connecting via haproxy? If it's haproxy, what timeout is configured?
Also, is this an issue currently affecting a production environment or just a testing environment?

Changed in maas:
milestone: 2.3.x → none
status: Triaged → Incomplete
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Also, the logs provided in #1 seem to be updates from MAAS caused by it being unable to establish full connectivity between controllers. There is one log line that stands out specifically, and it is this:

ERROR: requested WAL segment 000000010000002100000042 has already been removed

That indicates Postgres replication issues to me, and those replication issues could actually be the reason you have CPU starvation. I would like to see full logs, but otherwise this doesn't look related to the originally reported issue.
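
If replication health is the suspect, a quick check against the primary's pg_stat_replication view shows each standby's state and how far behind it is. A sketch using the PostgreSQL 9.x function names; connection details are illustrative:

    # Minimal sketch: report streaming-replication lag from the primary.
    # Uses PostgreSQL 9.x names (pg_current_xlog_location, replay_location);
    # these were renamed in PostgreSQL 10. Connection details are illustrative.
    import psycopg2

    conn = psycopg2.connect(dbname="maasdb", user="maas", host="localhost")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT application_name, state,
                   pg_xlog_location_diff(pg_current_xlog_location(),
                                         replay_location) AS lag_bytes
            FROM pg_stat_replication
        """)
        for name, state, lag_bytes in cur.fetchall():
            print(name, state, lag_bytes)
    conn.close()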

Revision history for this message
Alok G Singh (alephnull-b) wrote :

MAAS version:
aloksingh@infr001:/var/log/maas$ dpkg -l | grep '^ii' | grep maas
ii maas-cli 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all MAAS client and command-line interface
ii maas-common 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all MAAS server common files
ii maas-dhcp 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all MAAS DHCP server
ii maas-dns 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all MAAS DNS server
ii maas-proxy 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all Rack Controller for MAAS
ii maas-region-api 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all Region Controller for MAAS
ii python3-django-maas 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.3.5-6511-gf466fdb-0ubuntu1~16.04.1 all MAAS server provisioning libraries (Python 3)

> What is your configuration? How many controllers?

There are three controllers running in KVMs on three infra nodes.

> How many machines are there in your MAAS cluster

Three machines.

> and how many machines get IPs from DHCP?

100-1000

> Can you provide MAAS logs from the time you see the invalid response from the API?

I will provide them in a separate comment; I will have to establish the timing first. I ran tail -f on the postgres log while running juju ssh, but didn't see anything apart from what I've posted already.

> Also, is Juju connecting to MAAS directly, or is it connecting via haproxy? If it's haproxy, what timeout is configured?

MAAS is accessed via a VIP which is managed by haproxy.
$ cat haproxy.cfg
frontend maas
    bind *:80
    retries 3
    option redispatch
    option http-server-close
    default_backend maas

backend maas
    timeout server 90s
    balance source
    hash-type consistent
    server maas-api-0 172.18.247.6:5240 check
    server maas-api-1 172.18.247.7:5240 check
    server maas-api-2 172.18.247.8:5240 check

> Also, is this an issue currently affecting a production environment or just a testing environment?

Production.

Revision history for this message
Alok G Singh (alephnull-b) wrote :

@Andres, the missing WAL segment error had been in the logs for some months before the problem manifested itself. The problem has also been seen in clouds where there are no missing WAL segments.

Changed in maas:
status: Incomplete → New
Alvaro Uria (aluria)
tags: added: canonical-bootstack
Revision history for this message
Blake Rouse (blake-rouse) wrote :

We believe this bug is fixed in the upcoming 2.6.1 release. We did a lot of performance analysis of database transactions to help reduce serialization issues and the number of retries.
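
For context, the standard client-side treatment of "could not serialize access due to concurrent update" is to roll back and retry the whole transaction a bounded number of times. The sketch below shows that general pattern with psycopg2; it is illustrative only and is not MAAS's actual implementation:

    # General retry-on-serialization-failure pattern (illustrative, not MAAS code).
    import time
    import psycopg2
    from psycopg2 import errorcodes

    def run_with_retries(conn, work, attempts=5):
        """Run work(cursor) in a transaction, retrying on serialization failures."""
        for attempt in range(attempts):
            try:
                with conn, conn.cursor() as cur:   # commit on success, rollback on error
                    return work(cur)
            except psycopg2.OperationalError as exc:
                if exc.pgcode != errorcodes.SERIALIZATION_FAILURE or attempt == attempts - 1:
                    raise
                time.sleep(0.05 * 2 ** attempt)    # back off before retrying

    # Example: the kind of single-row update seen in the logs above.
    # run_with_retries(conn, lambda cur: cur.execute(
    #     'UPDATE maasserver_node SET status_expires = NULL WHERE id = %s', (1,)))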

Changed in maas:
milestone: none → 2.6.1rc1
status: New → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released