Ceilometer API service stuck - Address already in use

Bug #1678142 reported by Dmitry Goloshubov
This bug affects 1 person
Affects             Status        Importance  Assigned to      Milestone
Mirantis OpenStack  Won't Fix     Low         MOS Maintenance
9.x                 Fix Released  High        Ilya Tyaptin

Bug Description

MOS 9.2

During the tests, the communication with the ceilometer-api on one of the controllers suddenly fails.

From the ceilometer-api.log:
2017-03-16T15:18:10.265429+09:00 cic-1 ceilometer-api[2362]: 2017-03-16 15:18:10.265 2362 INFO ceilometer.api.app [-] serving on http://192.168.6.60:8777
2017-03-16T15:18:10.265857+09:00 cic-1 ceilometer-api[2362]: 2017-03-16 15:18:10.265 2362 INFO werkzeug [-] * Running on http://192.168.6.60:8777/
2017-03-16T15:18:10.322773+09:00 cic-1 ceilometer-api[2362]: 2017-03-16 15:18:10.266 2362 CRITICAL ceilometer [-] error: [Errno 98] Address already in use

Workaround: restart the service.

Haven't found a way to reproduce that.

----------------------
Additional information:

To probe the status of the Ceilometer services, we manually sent a curl request directly to each ceilometer-api instance on the CICs.

[From any of the CICs]
curl -g -i -X 'GET' 'http://<CIC_IP_on_mmt>:8777/' -H 'User-Agent: ceilometerclient.openstack.common.apiclient' -H 'X-Auth-Token: ...'

Result: the request succeeds (200 OK) on every CIC except CIC-1, where it times out.
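The same probe can be scripted so that a hung instance shows up as an explicit timeout instead of a stalled terminal. The following is a minimal Python sketch, not part of the original report: the function name `probe` and the 5-second default timeout are assumptions, and the headers simply mirror the curl command above.

```python
import socket
import urllib.error
import urllib.request


def probe(url, token=None, timeout=5.0):
    """Return the HTTP status code, or 'timeout' / 'error: ...' as a string.

    `probe` is a hypothetical helper; the timeout value is an assumption.
    """
    headers = {"User-Agent": "ceilometerclient.openstack.common.apiclient"}
    if token:
        headers["X-Auth-Token"] = token
    request = urllib.request.Request(url, headers=headers, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, just not with a 2xx status.
        return exc.code
    except (socket.timeout, TimeoutError):
        # Connected, but no response arrived within `timeout` seconds.
        return "timeout"
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "error: %s" % exc.reason
```

Calling `probe("http://<CIC_IP_on_mmt>:8777/", token=...)` once per controller would be expected to return 200 for a healthy instance and "timeout" for one in the state described for CIC-1.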

>>>>> ceilometer-api service status <<<<<<
We checked the status of the service, with the following results:

>>>>> strace

root@cic-1:~# strace -tt -T -p 5596
Process 5596 attached
19:48:39.985419 wait4(0, <- the ceilometer-api master got stuck here

>>>>> GDB
>> CIC-1 !Faulty!
(gdb) bt
#0 0x00007f2feb0dced9 in waitpid () from /lib/x86_64-linux-gnu/libpthread.so.0 <- Stuck in wait
#1 0x000000000041d95a in ?? ()
#2 0x000000000049968d in PyEval_EvalFrameEx ()
#3 0x0000000000499ef2 in PyEval_EvalFrameEx ()
#4 0x0000000000499ef2 in PyEval_EvalFrameEx ()
#5 0x0000000000499ef2 in PyEval_EvalFrameEx ()
#6 0x00000000004a1c9a in ?? ()
#7 0x00000000004dfe94 in ?? ()
#8 0x0000000000499be5 in PyEval_EvalFrameEx ()
#9 0x00000000004a090c in PyEval_EvalCodeEx ()
#10 0x000000000049ab45 in PyEval_EvalFrameEx ()
#11 0x00000000004a090c in PyEval_EvalCodeEx ()
#12 0x000000000049ab45 in PyEval_EvalFrameEx ()
#13 0x00000000004a090c in PyEval_EvalCodeEx ()
#14 0x0000000000499a52 in PyEval_EvalFrameEx ()
#15 0x0000000000499ef2 in PyEval_EvalFrameEx ()
#16 0x0000000000499ef2 in PyEval_EvalFrameEx ()
#17 0x00000000004a1634 in ?? ()
#18 0x000000000044e4a5 in PyRun_FileExFlags ()
#19 0x000000000044ec9f in PyRun_SimpleFileExFlags ()
#20 0x000000000044f904 in Py_Main ()
#21 0x00007f2fead29f45 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#22 0x0000000000578c4e in _start ()

>> CIC-2 !working fine!
(gdb) bt
#0 0x00007f7d3376dc53 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x000000000047fbbd in ?? ()
#2 0x000000000049c4d9 in PyEval_EvalFrameEx ()
#3 0x00000000004a090c in PyEval_EvalCodeEx ()
#4 0x000000000049ab45 in PyEval_EvalFrameEx ()
#5 0x00000000004a1c9a in ?? ()
#6 0x00000000004dfe94 in ?? ()
#7 0x0000000000499be5 in PyEval_EvalFrameEx ()
#8 0x00000000004a090c in PyEval_EvalCodeEx ()
#9 0x000000000049ab45 in PyEval_EvalFrameEx ()
#10 0x00000000004a090c in PyEval_EvalCodeEx ()
#11 0x000000000049ab45 in PyEval_EvalFrameEx ()
#12 0x00000000004a090c in PyEval_EvalCodeEx ()
#13 0x0000000000499a52 in PyEval_EvalFrameEx ()
#14 0x0000000000499ef2 in PyEval_EvalFrameEx ()
#15 0x0000000000499ef2 in PyEval_EvalFrameEx ()
#16 0x00000000004a1634 in ?? ()
#17 0x000000000044e4a5 in PyRun_FileExFlags ()
#18 0x000000000044ec9f in PyRun_SimpleFileExFlags ()
#19 0x000000000044f904 in Py_Main ()
#20 0x00007f7d3369df45 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#21 0x0000000000578c4e in _start ()

>>>>> /proc/<ceilometer-api>/stack
>> CIC-1 !Faulty!
[<ffffffff8107fd11>] do_wait+0x1c1/0x230 <- Stuck in wait
[<ffffffff81080da4>] SyS_wait4+0x64/0xc0
[<ffffffff817fa4f6>] entry_SYSCALL_64_fastpath+0x16/0x75
[<ffffffffffffffff>] 0xffffffffffffffff

>> CIC-2 !Working fine!
[<ffffffff81211799>] poll_schedule_timeout+0x49/0x70
[<ffffffff8121212c>] do_select+0x58c/0x750
[<ffffffff812124bc>] core_sys_select+0x1cc/0x2d0
[<ffffffff8121266b>] SyS_select+0xab/0xf0
[<ffffffff817fa4f6>] entry_SYSCALL_64_fastpath+0x16/0x75
[<ffffffffffffffff>] 0xffffffffffffffff
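Comparing the two kernel stacks by eye works for two nodes; for more nodes the top frame can be extracted mechanically. The sketch below is an illustration only: the helper names `top_symbol` and `describe` and the small lookup table are assumptions, and the input is the text of /proc/<pid>/stack as pasted above (reading that file normally requires root). It only labels where a process is sleeping and does not by itself prove a deadlock.

```python
import re

# Matches lines such as "[<ffffffff8107fd11>] do_wait+0x1c1/0x230".
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")

# Hypothetical lookup table covering only the two frames seen in this report.
KNOWN = {
    "do_wait": "wait4/waitpid (waiting for a child process)",
    "poll_schedule_timeout": "select/poll (waiting for socket I/O)",
}


def top_symbol(stack_text):
    """Return the symbol of the first (innermost) frame, or None."""
    for line in stack_text.splitlines():
        match = FRAME.search(line)
        if match:
            return match.group(1)
    return None


def describe(stack_text):
    """Map the innermost kernel frame to a short human-readable label."""
    symbol = top_symbol(stack_text)
    return KNOWN.get(symbol, "unknown (%s)" % symbol)
```

Fed the CIC-1 stack, `top_symbol` yields `do_wait`; fed the CIC-2 stack, it yields `poll_schedule_timeout`.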

The root cause seems to be related to a deadlock in the ceilometer-api.

----------------------

Also, there is a bug that could be potentially related:
https://bugs.launchpad.net/mos/8.0.x/+bug/1566202

Changed in mos:
importance: Undecided → High
description: updated
summary: - Ceilometer API service doesn't start - Address already in use
+ Ceilometer API service can't start - Address already in use
summary: - Ceilometer API service can't start - Address already in use
+ Ceilometer API service stuck - Address already in use
Alexander Rubtsov (arubtsov) wrote :

sla1 for 9.0-updates

Changed in mos:
assignee: nobody → MOS Ceilometer (mos-ceilometer)
status: New → Confirmed
milestone: none → 10.0
tags: added: area-ceilometer
Roman Podoliaka (rpodolyaka) wrote :

Dmitry,

Blocking on wait4() is normal behavior for a "master" process. The idea is the following: a master starts, creates a WSGI application instance, *binds a TCP socket*, forks N times, and then watches the children, restarting them if needed (e.g. if one or more die) and propagating signals like SIGHUP.
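The model described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual ceilometer-api or Werkzeug code: all function names are hypothetical, each worker serves a single connection and exits, and the master only reaps its children instead of restarting them or forwarding signals.

```python
import os
import socket


def bind_listener(host="127.0.0.1", port=0):
    """Bind and listen once, in the master, before any fork."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(128)
    return listener


def worker_once(listener):
    """Hypothetical worker body: accept one connection, answer, return."""
    conn, _ = listener.accept()
    with conn:
        conn.sendall(b"served by %d\n" % os.getpid())


def start_workers(listener, count, handler):
    """Fork `count` children that share the already-bound listener."""
    pids = []
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            # Child: run the handler, then exit without returning to the caller.
            code = 1
            try:
                handler(listener)
                code = 0
            finally:
                os._exit(code)
        pids.append(pid)
    return pids


def reap(pids):
    """Master side: block in waitpid() until every child has exited.

    A real master would wait for any child and re-fork the ones that die.
    """
    statuses = {}
    for pid in pids:
        _, status = os.waitpid(pid, 0)  # the master sleeps here
        statuses[pid] = status
    return statuses
```

While the children are alive, the master in this sketch sits in `os.waitpid()`, which is the state the strace and gdb output show for the CIC-1 master.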

It's not clear to me how exactly you get "Address already in use", but this is most likely a problem with the master process. Maybe something is wrong with the init script and the previous instance of the service is not stopped properly?
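For reference, the error itself is easy to reproduce in isolation: a second bind to an address and port that another socket is still listening on fails with EADDRINUSE (errno 98 on Linux). The sketch below only demonstrates that mechanism; `try_bind` is a hypothetical helper, and it says nothing about which process held port 8777 in this incident.

```python
import errno
import socket


def try_bind(host, port):
    """Try to bind and listen; return (socket, None) or (None, errno)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc.errno


def demo():
    """Bind once on an ephemeral port, then try the same port again."""
    first, first_err = try_bind("127.0.0.1", 0)
    port = first.getsockname()[1]
    second, second_err = try_bind("127.0.0.1", port)
    first.close()
    # The second attempt is expected to fail while `first` is listening.
    return first_err, second_err == errno.EADDRINUSE
```

On a real controller, checking which PID owns the listening socket before restarting would show whether a previous instance is still holding the port.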

Changed in mos:
assignee: MOS Ceilometer (mos-ceilometer) → Ilya Tyaptin (ityaptin)
Ilya Tyaptin (ityaptin) wrote :

Change request https://review.fuel-infra.org/#/c/32958 fixes this bug.

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/ceilometer (9.0/mitaka)

Reviewed: https://review.fuel-infra.org/32958
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: 8180104fdd8afe1fe9456e5140a0cf5bc182a237
Author: Ilya Tyaptin <email address hidden>
Date: Wed Apr 12 12:53:40 2017

Add default socket timeout for werkzeug handlers

    This is needed because, under load, ceilometer-api can hold
    an empty socket open indefinitely.

Closes-bug: #1678142
Change-Id: If83d4932d02ab7a85849e042d80514bca1bb03b3
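The patch itself is not reproduced in this report. As a hedged illustration of the general technique it names, the sketch below uses the standard library's `BaseHTTPRequestHandler`, the class that Werkzeug's development-server request handler builds on: a class-level `timeout` attribute is applied to each accepted connection, so a client that connects and sends nothing is dropped instead of occupying the handler forever. The class name, the helper names, and the 1-second value are assumptions.

```python
import http.server
import socket
import threading


class TimeoutHandler(http.server.BaseHTTPRequestHandler):
    # Applied via settimeout() on each accepted connection in setup().
    # The value is a hypothetical choice, not the one used in the fix.
    timeout = 1.0

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        # Keep the demo quiet; a real service would log the timeout.
        pass


def start_server(handler=TimeoutHandler):
    """Start a single-threaded HTTP server on an ephemeral local port."""
    server = http.server.HTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def idle_connection_closed(port, wait=5.0):
    """Connect, send nothing, and report whether the server hangs up."""
    with socket.create_connection(("127.0.0.1", port), timeout=wait) as client:
        try:
            return client.recv(1) == b""  # b"" means the peer closed
        except socket.timeout:
            return False
```

With `timeout = None` (the standard library default) the same idle client would keep this single-threaded server busy until it disconnects on its own, which matches the behavior the commit message describes.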

TatyanaGladysheva (tgladysheva) wrote :

Verified on 9.2 mu2 updates.

Changed in mos:
assignee: Ilya Tyaptin (ityaptin) → MOS Maintenance (mos-maintenance)
Alexey Stupnikov (astupnikov) wrote :

Setting importance to Low and Closing as Won't Fix for MOS10.0 as we no longer support MOS10.

Changed in mos:
importance: High → Low
status: Confirmed → Won't Fix