After upgrading form Pike to Queens, we see a lot of these error messages in ceilometer-agent-notification.log:
2018-11-27 06:33:42.975 146962 ERROR ceilometer.publisher.gnocchi [-] Request to http://internal.api.url:8041/v1/batch/resources/metrics/measures timed out: ConnectionTimeout: Request to http://internal.api.url:8041/v1/batch/resources/metrics/measures timed out
Ceilometer version is 10.0.2.dev2
Gnocchi client version is 7.0.1
Gnocchi version is 4.2.0
After adding some debug logging we could verify it only happens when ceilomemter tries to send 0 measures to Gnocchi. Every request sending at least 1 measure workes.
We could get around the Problem by setting batch_timeout in ceilometer.conf to None.
We were also able to reproduce the behavior with the openstack cli:
echo "{}" | openstack --timing --debug metric measures batch-resources-metrics --create-metrics -
...
REQ: curl -g -i -X POST "http://internal.api.url:8041/v1/batch/resources/metrics/measures?create_metrics=True" -H "User-Agent: osc-lib/1.9.0 keystoneauth1/3.4.0 python-requests/2.18.4 CPython/2.7.12" -H "Content-Type: application/json" -H "Accept: application/json, */*" -H "X-Auth-Token: {SHA1}***" -d '{}'
Starting new HTTP connection (1): internal.api.url
http://internal.api.url:8041 "POST /v1/batch/resources/metrics/measures?create_metrics=True HTTP/1.1" 202 0
RESP: [202] Date: Tue, 27 Nov 2018 05:33:15 GMT Server: Apache Content-Length: 0 Content-Type: application/json
clean_up CliResourcesMetricsMeasuresBatch:
run(Namespace(columns=[], fit_width=False, format_config=None, formatter='table', max_width=0, noindent=False, print_empty=False, quote_mode='nonnumeric', sort_columns=[]))
+---------------------------------------------------------------------------------------------------------+-----------+
| URL | Seconds |
+---------------------------------------------------------------------------------------------------------+-----------+
| GET https://internal.api.url:5000/v3 | 0.026579 |
| POST https://internal.api.url:5000/v3/auth/tokens | 0.410096 |
| POST https://internal.api.url:5000/v3/auth/tokens | 0.396178 |
| POST http://internal.api.url:8041/v1/batch/resources/metrics/measures?create_metrics=True | 31.054647 |
| Total | 31.8875 |
+---------------------------------------------------------------------------------------------------------+-----------+
END return value: 0
The requests takes longer as the default request_timeout in ceilometer of 6.05s.
The request return much faster sending some data:
REQ: curl -g -i -X POST "http://internal.api.url:8041/v1/batch/resources/metrics/measures?create_metrics=True" -H "User-Agent: osc-lib/1.9.0 keystoneauth1/3.4.0 python-requests/2.18.4 CPython/2.7.12" -H "Content-Type: application/json" -H "Accept: application/json, */*" -H "X-Auth-Token: {SHA1}904608e0fe75460452a748c3ecd9d8d113ccad52" -d '{"8dbaa111-c9ae-4dad-9f2d-388418462207":{"memory.usage":[{"timestamp":"2018-11-27T06:23:25.713279","value":3731}]},"791c6968-95a7-4cf9-b3f0-0664f4f9ed79":{"volume.size":[{"timestamp":"2018-11-27T06:23:25.654068","value":9}]},"eddaaf67-b5a0-4851-8024-0319ddac72bf":{"memory.usage":[{"timestamp":"2018-11-27T06:23:25.713279","value":3254}]},"4ad7eb50-4a34-4e65-9645-9ea51064bafb":{"memory.usage":[{"timestamp":"2018-11-27T06:23:25.713279","value":945}]},"670ab511-4351-4171-920f-c27184048368":{"memory.resident":[{"timestamp":"2018-11-27T06:23:25.778222","value":2259}]},"ef00c155-8f03-4e68-9137-7eaf731bd216":{"volume.size":[{"timestamp":"2018-11-27T06:23:25.654068","value":20}]},"940305f6-4637-4553-b8fc-12f773154333":{"memory.usage":[{"timestamp":"2018-11-27T06:23:25.616891","value":11067}]},"2b502727-eb0f-4489-ab3b-de9a69861c1b":{"image.size":[{"timestamp":"2018-11-27T06:23:25.743257","value":2361393152}]}}'
Starting new HTTP connection (1): internal.api.url
http://internal.api.url:8041 "POST /v1/batch/resources/metrics/measures?create_metrics=True HTTP/1.1" 202 0
RESP: [202] Date: Tue, 27 Nov 2018 06:26:25 GMT Server: Apache Content-Length: 0 Content-Type: application/json
clean_up CliResourcesMetricsMeasuresBatch:
run(Namespace(columns=[], fit_width=False, format_config=None, formatter='table', max_width=0, noindent=False, print_empty=False, quote_mode='nonnumeric', sort_columns=[]))
+---------------------------------------------------------------------------------------------------------+----------+
| URL | Seconds |
+---------------------------------------------------------------------------------------------------------+----------+
| GET https://internal.api.url:5000/v3 | 0.019071 |
| POST https://internal.api.url:5000/v3/auth/tokens | 0.410266 |
| POST https://internal.api.url:5000/v3/auth/tokens | 0.451102 |
| POST http://internal.api.url:8041/v1/batch/resources/metrics/measures?create_metrics=True | 1.131044 |
| Total | 2.011483 |
+---------------------------------------------------------------------------------------------------------+----------+
END return value: 0
An update to the above description:
I was quite sure batch_timeout = None worked in my test, but I couldn't reproduce it.
Finally I had to set request_timeout = 45.0, but this option is marked as deprecated.
I also tried setting batch_timeout = 1800, but it does not seem to work. ceilometer. publisher. gnocchi is called more often than the configured timeout.