archive policy still in use (random failure)

Bug #1681955 reported by gordon chung
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Gnocchi
Fix Released
High
gordon chung

Bug Description

http://logs.openstack.org/39/453239/6/check/gate-gnocchi-tox-py27-mysql-ubuntu-xenial/b4a57fb/testr_results.html.gz

for some unknown reason, gabbilive archive policy says still in use sometimes (seen it on file and ceph drivers). all resources with metrics using policy should have been deleted... and pifpaf says it's expunging deleted metrics every 1s so wtf!?! it could mean:

1) gnocchi isn't expunging (can't verify because pifpaf lack of logging)
2) metric is not marked for deletion

2017-04-11 18:17:12.866864 | gabbi.suitemaker.test_gabbi_live_live_delete_single_archive_policy_cleanup.test_request
2017-04-11 18:17:12.866912 | ---------------------------------------------------------------------------------------
2017-04-11 18:17:12.866927 |
2017-04-11 18:17:12.866945 | Captured traceback:
2017-04-11 18:17:12.866964 | ~~~~~~~~~~~~~~~~~~~
2017-04-11 18:17:12.866986 | Traceback (most recent call last):
2017-04-11 18:17:12.867043 | File "/home/jenkins/workspace/gate-gnocchi-tox-py27-mysql-ubuntu-xenial/.tox/py27-mysql/local/lib/python2.7/site-packages/gabbi/case.py", line 93, in wrapper
2017-04-11 18:17:12.867062 | func(self)
2017-04-11 18:17:12.867133 | File "/home/jenkins/workspace/gate-gnocchi-tox-py27-mysql-ubuntu-xenial/.tox/py27-mysql/local/lib/python2.7/site-packages/gabbi/case.py", line 141, in test_request
2017-04-11 18:17:12.867157 | self._run_test()
2017-04-11 18:17:12.867215 | File "/home/jenkins/workspace/gate-gnocchi-tox-py27-mysql-ubuntu-xenial/.tox/py27-mysql/local/lib/python2.7/site-packages/gabbi/case.py", line 463, in _run_test
2017-04-11 18:17:12.867252 | raise failure
2017-04-11 18:17:12.867308 | AssertionError: '400' not found in ['204'], response:
2017-04-11 18:17:12.867337 | 400 Bad Request
2017-04-11 18:17:12.867352 |
2017-04-11 18:17:12.867392 | The server could not comply with the request since it is either malformed or otherwise incorrect.
2017-04-11 18:17:12.867407 |
2017-04-11 18:17:12.867432 | Archive policy gabbilive is still in use
2017-04-11 18:17:12.867446 |
2017-04-11 18:17:12.867458 |
2017-04-11 18:17:12.867490 | gabbi.suitemaker.test_gabbi_live_live_confirm_delete_of_cleanup.test_request
2017-04-11 18:17:12.867523 | ----------------------------------------------------------------------------
2017-04-11 18:17:12.867535 |
2017-04-11 18:17:12.867552 | Captured traceback:
2017-04-11 18:17:12.867570 | ~~~~~~~~~~~~~~~~~~~
2017-04-11 18:17:12.867592 | Traceback (most recent call last):
2017-04-11 18:17:12.867647 | File "/home/jenkins/workspace/gate-gnocchi-tox-py27-mysql-ubuntu-xenial/.tox/py27-mysql/local/lib/python2.7/site-packages/gabbi/case.py", line 93, in wrapper
2017-04-11 18:17:12.867665 | func(self)
2017-04-11 18:17:12.867722 | File "/home/jenkins/workspace/gate-gnocchi-tox-py27-mysql-ubuntu-xenial/.tox/py27-mysql/local/lib/python2.7/site-packages/gabbi/case.py", line 141, in test_request
2017-04-11 18:17:12.867741 | self._run_test()
2017-04-11 18:17:12.867797 | File "/home/jenkins/workspace/gate-gnocchi-tox-py27-mysql-ubuntu-xenial/.tox/py27-mysql/local/lib/python2.7/site-packages/gabbi/case.py", line 467, in _run_test
2017-04-11 18:17:12.867818 | self._assert_response()
2017-04-11 18:17:12.867876 | File "/home/jenkins/workspace/gate-gnocchi-tox-py27-mysql-ubuntu-xenial/.tox/py27-mysql/local/lib/python2.7/site-packages/gabbi/case.py", line 176, in _assert_response
2017-04-11 18:17:12.867909 | self._test_status(self.test_data['status'], self.response['status'])
2017-04-11 18:17:12.867966 | File "/home/jenkins/workspace/gate-gnocchi-tox-py27-mysql-ubuntu-xenial/.tox/py27-mysql/local/lib/python2.7/site-packages/gabbi/case.py", line 506, in _test_status
2017-04-11 18:17:12.867995 | self.assert_in_or_print_output(observed_status, statii)
2017-04-11 18:17:12.868055 | File "/home/jenkins/workspace/gate-gnocchi-tox-py27-mysql-ubuntu-xenial/.tox/py27-mysql/local/lib/python2.7/site-packages/gabbi/case.py", line 567, in assert_in_or_print_output
2017-04-11 18:17:12.868075 | self.fail(msg)
2017-04-11 18:17:12.868130 | File "/home/jenkins/workspace/gate-gnocchi-tox-py27-mysql-ubuntu-xenial/.tox/py27-mysql/local/lib/python2.7/site-packages/unittest2/case.py", line 690, in fail
2017-04-11 18:17:12.868153 | raise self.failureException(msg)
2017-04-11 18:17:12.868181 | AssertionError: '200' not found in ['404'], response:
2017-04-11 18:17:12.868195 | {
2017-04-11 18:17:12.868214 | "name": "gabbilive",
2017-04-11 18:17:12.868234 | "aggregation_methods": [
2017-04-11 18:17:12.868251 | "mean",
2017-04-11 18:17:12.868268 | "max",
2017-04-11 18:17:12.868301 | "min"
2017-04-11 18:17:12.868329 | ],
2017-04-11 18:17:12.868351 | "definition": [
2017-04-11 18:17:12.868366 | {
2017-04-11 18:17:12.868387 | "timespan": "0:01:00",
2017-04-11 18:17:12.868409 | "granularity": "0:00:01",
2017-04-11 18:17:12.868427 | "points": 60
2017-04-11 18:17:12.868442 | },
2017-04-11 18:17:12.868457 | {
2017-04-11 18:17:12.868478 | "timespan": "0:01:00",
2017-04-11 18:17:12.868499 | "granularity": "0:00:02",
2017-04-11 18:17:12.868517 | "points": 30
2017-04-11 18:17:12.868532 | },
2017-04-11 18:17:12.868547 | {
2017-04-11 18:17:12.868567 | "timespan": "0:05:00",
2017-04-11 18:17:12.868589 | "granularity": "0:01:00",
2017-04-11 18:17:12.868618 | "points": 5
2017-04-11 18:17:12.868634 | }
2017-04-11 18:17:12.868649 | ],
2017-04-11 18:17:12.868667 | "back_window": 0
2017-04-11 18:17:12.868680 | }

gordon chung (chungg)
Changed in gnocchi:
importance: Undecided → High
status: New → Triaged
status: Triaged → New
importance: High → Medium
importance: Medium → Undecided
Julien Danjou (jdanjou)
Changed in gnocchi:
status: New → Triaged
summary: - archive policy still in use (random failture)
+ archive policy still in use (random failure)
Changed in gnocchi:
importance: Undecided → High
Revision history for this message
gordon chung (chungg) wrote :
Download full text (6.0 KiB)

something is wrong with cleanup process but i can't tell because of pifpaf. seems deleted metric is not cleaned up so it can't delete archivepolicy. patch here: https://review.openstack.org/#/c/456413/3

state after 'delete archive policy' attempt

2017-04-21 17:14:08.958924 | #### get all metrics to verify ####
2017-04-21 17:14:08.964021 | > GET http://localhost:8041/v1/metric
2017-04-21 17:14:08.966243 | > x-auth-token:
2017-04-21 17:14:08.989412 | > authorization: basic YWRtaW46
2017-04-21 17:14:08.995254 | > user-agent: gabbi/1.33.0 (Python urllib3)
2017-04-21 17:14:08.995377 |
2017-04-21 17:14:08.996602 | < 200 OK
2017-04-21 17:14:09.001726 | < Date: Fri, 21 Apr 2017 17:14:08 GMT
2017-04-21 17:14:09.007103 | < Server: WSGIServer/0.2 CPython/3.5.2
2017-04-21 17:14:09.009870 | < Content-Length: 2
2017-04-21 17:14:09.028967 | < Content-Type: application/json
2017-04-21 17:14:09.029094 |
2017-04-21 17:14:09.029363 | []
2017-04-21 17:14:09.029479 |
2017-04-21 17:14:09.054252 | #### get all deleted metrics to verify ####
2017-04-21 17:14:09.064531 | > GET http://localhost:8041/v1/metric?status=delete
2017-04-21 17:14:09.066197 | > x-auth-token:
2017-04-21 17:14:09.069386 | > authorization: basic YWRtaW46
2017-04-21 17:14:09.073763 | > user-agent: gabbi/1.33.0 (Python urllib3)
2017-04-21 17:14:09.073883 |
2017-04-21 17:14:09.074737 | < 200 OK
2017-04-21 17:14:09.078506 | < Date: Fri, 21 Apr 2017 17:14:08 GMT
2017-04-21 17:14:09.082348 | < Server: WSGIServer/0.2 CPython/3.5.2
2017-04-21 17:14:09.084569 | < Content-Length: 496
2017-04-21 17:14:09.087811 | < Content-Type: application/json
2017-04-21 17:14:09.087931 |
2017-04-21 17:14:09.088123 | [
2017-04-21 17:14:09.088533 | {
2017-04-21 17:14:09.091966 | "created_by_user_id": "admin",
2017-04-21 17:14:09.094340 | "creator": "admin",
2017-04-21 17:14:09.096846 | "resource_id": null,
2017-04-21 17:14:09.098885 | "name": "vcpus",
2017-04-21 17:14:09.102184 | "created_by_project_id": "",
2017-04-21 17:14:09.104037 | "unit": null,
2017-04-21 17:14:09.115264 | "id": "e9f190a7-2f71-42a0-8021-74e680745aa4",
2017-04-21 17:14:09.121779 | "archive_policy": {
2017-04-21 17:14:09.129021 | "name": "gabbilive",
2017-04-21 17:14:09.134894 | "definition": [
2017-04-21 17:14:09.137582 | {
2017-04-21 17:14:09.147791 | "timespan": "0:01:00",
2017-04-21 17:14:09.150210 | "points": 60,
2017-04-21 17:14:09.153689 | "granularity": "0:00:01"
2017-04-21 17:14:09.154780 | },
2017-04-21 17:14:09.155769 | {
2017-04-21 17:14:09.159013 | "timespan": "0:01:00",
2017-04-21 17:14:09.161415 | "points": 30,
2017-04-21 17:14:09.164898 | "granularity": "0:00:02"
2017-04-21 17:14:09.165949 | },
2017-04-21 17:14:09.166924 | {
2017-04-21 17:14:09.175654 | "timespan": "0:05:00",
2017-04-21 17:14:09.178018 | "points": 5,
2017-04-21 17:14:09.181525 | "granularity": "0:01:00"
2017-04-21 17:14:09.182506 | }
2017-04-21 17:14:09.183385 | ],
2017-04-21 17:14:09.186458 | "aggregation_methods": [
2017-04-21 17:14:09.188172 | "max",
2017-0...

Read more...

Revision history for this message
yatin (yatinkarel) wrote :
Revision history for this message
gordon chung (chungg) wrote :

do you use pifpaf as well? if not, you can try setting a coordination url that isn't sql or increase the number of sql connections.

we need a new pifpaf for upstream gate it seems.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to gnocchi (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/460568

Revision history for this message
yatin (yatinkarel) wrote :

@chungg Thanks for the response.

We are running tests using tempest, i think pifpaf is not used.

I tried with setting max_pool_size=0 in gnocchi.conf, still faced the same error. Is there some other config parameter has to be changed or is there any other way to fix this.

May be following config gives some hint, if any other info is needed please let me know.

[indexer]
url = mysql+pymysql://gnocchi:gnocchi@127.0.0.1/gnocchi?charset=utf8

[storage]
driver = ceph
...
coordination_url = None

Changed in gnocchi:
assignee: nobody → gordon chung (chungg)
status: Triaged → In Progress
Revision history for this message
yatin (yatinkarel) wrote :

After completion of tests gnocchi db has following data:-

MariaDB [gnocchi]> select * from archive_policy;
+-----------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+
| name | back_window | definition | aggregation_methods |
+-----------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+
| bool | 3600 | [{"points": 31536000, "granularity": 1.0, "timespan": 31536000.0}] | ["last"] |
| gabbilive | 0 | [{"points": 60, "granularity": 1.0, "timespan": 60.0}, {"points": 30, "granularity": 2.0, "timespan": 60.0}, {"points": 5, "granularity": 60.0, "timespan": 300.0}] | ["max", "mean", "min"] |
| high | 0 | [{"points": 3600, "granularity": 1.0, "timespan": 3600.0}, {"points": 10080, "granularity": 60.0, "timespan": 604800.0}, {"points": 8760, "granularity": 3600.0, "timespan": 31536000.0}] | ["std", "count", "min", "max", "sum", "mean"] |
| low | 0 | [{"points": 8640, "granularity": 300.0, "timespan": 2592000.0}] | ["std", "count", "min", "max", "sum", "mean"] |
| medium | 0 | [{"points": 10080, "granularity": 60.0, "timespan": 604800.0}, {"points": 8760, "granularity": 3600.0, "timespan": 31536000.0}] | ["std", "count", "min", "max", "sum", "mean"] |
+-----------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+
5 rows in set (0.00 sec)

MariaDB [gnocchi]> select * from archive_policy_rule;
+---------+---------------------+----------------+
| name | archive_policy_name | metric_pattern |
+---------+---------------------+----------------+
| default | low | * |
+---------+---------------------+----------------+

and gnocchi pool: (rados ls -p gnocchi) has one object: measure

Revision history for this message
yatin (yatinkarel) wrote :

I tried https://review.openstack.org/460568 and this time tests(est_gabbi_live_live_delete_single_archive_policy_cleanup.test_request) which failed earlier, passed.
Some tests(like creation of gabbilive archive policy) failed but that were due to i ran tests on the same environment.

Revision history for this message
gordon chung (chungg) wrote :

you'll need to check if there is a metric with status='deleted' but never expunged, that's what is blocking upstream gate i believe

i'm not sure what max_pool_size param is for? it seems that the issue we see in gate is that gnocchi is creating too many connections to coordinator. this is because in upstream and it seems downstream too, we dont set coordinator which ends up using mysql/postgresql as the coordination driver. both sql backends i believe set default connections to 100. this should be enough, but tooz locking in mysql/pgsql saves a connection per lock so the more locks that are created, the more connections are opened.

we have a patch to address this in tooz[1] but it's better to not rely on sql coordination i believe.

[1] https://review.openstack.org/#/c/460229/

Revision history for this message
yatin (yatinkarel) wrote :

After the test complete i checked tha database and metrices table was empty.
In our case storage driver is ceph. BTW after https://review.openstack.org/460568 RDO-CI passed.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to gnocchi (master)

Reviewed: https://review.openstack.org/460568
Committed: https://git.openstack.org/cgit/openstack/gnocchi/commit/?id=2a6c20794ad728dc6dbb985ff7780cf36b50a359
Submitter: Jenkins
Branch: master

commit 2a6c20794ad728dc6dbb985ff7780cf36b50a359
Author: Mehdi Abaakouk <email address hidden>
Date: Thu Apr 27 15:43:33 2017 +0200

    tests: more live verification

    Sometime live tests fail for an yet unknown reason.

    This change adds tests to help the debugging of this failure.

    Related-bug: #1681955
    Change-Id: Ifadf481794943b3283211dd4e276df503f73f9b0

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to gnocchi (master)

Reviewed: https://review.openstack.org/460634
Committed: https://git.openstack.org/cgit/openstack/gnocchi/commit/?id=065c0eaf4fd364fbd888e9c043a046937531daa5
Submitter: Jenkins
Branch: master

commit 065c0eaf4fd364fbd888e9c043a046937531daa5
Author: gord chung <email address hidden>
Date: Thu Apr 27 15:14:07 2017 +0000

    use redis as coordinator

    by default we use sql as coordinator but gnocchi has high locking
    requirements becuase of number of sacks which far exceeds default
    100 connection limit of sql. switch to use redis as lock manager
    instead.

    Closes-Bug: #1681955

    Change-Id: I9f7606c17535432743c3bdd0cda0e4200f03d328

Changed in gnocchi:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.