gnocchi-metricd uses all memory and gets killed by the OOM killer

Bug #1533793 reported by Nicolas Vila
This bug affects 1 person
Affects       Status        Importance  Assigned to     Milestone
Gnocchi       Fix Released  Critical    Julien Danjou   2.0.0
1.3           Fix Released  Critical    Julien Danjou

Bug Description

When started, gnocchi-metricd uses an ever-increasing amount of memory until it segfaults and exits, leaving behind defunct PIDs:

(...)
[60290.151162] gnocchi-metricd[15021]: segfault at 24 ip 0000000000558077 sp 00007fff97572b10 error 6 in python2.7[400000+2bc000]
[60860.395947] ntpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[60860.395956] ntpd cpuset=/ mems_allowed=0
[60860.395969] CPU: 0 PID: 1761 Comm: ntpd Tainted: G D 3.13.0-46-generic #75-Ubuntu
[60860.395983] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.7.5-20150310_111955-batsu 04/01/2014
[60860.395986] 0000000000000000 ffff880427b7f968 ffffffff817212c6 ffff88042793b000
[60860.395992] ffff880427b7f9f0 ffffffff8171bb81 0000000000000000 0000000000000000
[60860.395994] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[60860.395996] Call Trace:
[60860.396062] [<ffffffff817212c6>] dump_stack+0x45/0x56
[60860.396074] [<ffffffff8171bb81>] dump_header+0x7f/0x1f1
[60860.396091] [<ffffffff8115299e>] oom_kill_process+0x1ce/0x330
[60860.396113] [<ffffffff812d70e5>] ? security_capable_noaudit+0x15/0x20
[60860.396117] [<ffffffff811530d4>] out_of_memory+0x414/0x450
[60860.396127] [<ffffffff81159440>] __alloc_pages_nodemask+0xa60/0xb80
[60860.396141] [<ffffffff811979e3>] alloc_pages_current+0xa3/0x160
[60860.396148] [<ffffffff8114f557>] __page_cache_alloc+0x97/0xc0
[60860.396151] [<ffffffff81150f65>] filemap_fault+0x185/0x410
[60860.396158] [<ffffffff81175ddf>] __do_fault+0x6f/0x530
[60860.396173] [<ffffffff8108eb82>] ? __hrtimer_start_range_ns+0x1a2/0x3a0
[60860.396176] [<ffffffff81179f82>] handle_mm_fault+0x482/0xf00
[60860.396184] [<ffffffff8107764b>] ? recalc_sigpending+0x1b/0x50
[60860.396190] [<ffffffff8172d2d4>] __do_page_fault+0x184/0x560
[60860.396194] [<ffffffff8107a9ac>] ? signal_delivered+0x5c/0x80
[60860.396197] [<ffffffff8107aa0c>] ? signal_setup_done+0x3c/0x60
[60860.396212] [<ffffffff810135c5>] ? do_signal+0x1b5/0xa40
[60860.396215] [<ffffffff8107764b>] ? recalc_sigpending+0x1b/0x50
[60860.396219] [<ffffffff81077f52>] ? __set_task_blocked+0x32/0x70
[60860.396222] [<ffffffff8172d6ca>] do_page_fault+0x1a/0x70
[60860.396225] [<ffffffff8172cd49>] do_async_page_fault+0x29/0xe0
[60860.396228] [<ffffffff81729b58>] async_page_fault+0x28/0x30
[60860.396231] Mem-Info:
[60860.396233] Node 0 DMA per-cpu:
[60860.396236] CPU 0: hi: 0, btch: 1 usd: 0
[60860.396238] CPU 1: hi: 0, btch: 1 usd: 0
[60860.396239] CPU 2: hi: 0, btch: 1 usd: 0
[60860.396241] CPU 3: hi: 0, btch: 1 usd: 0
[60860.396242] Node 0 DMA32 per-cpu:
[60860.396244] CPU 0: hi: 186, btch: 31 usd: 0
[60860.396246] CPU 1: hi: 186, btch: 31 usd: 11
[60860.396247] CPU 2: hi: 186, btch: 31 usd: 30
[60860.396249] CPU 3: hi: 186, btch: 31 usd: 0
[60860.396250] Node 0 Normal per-cpu:
[60860.396252] CPU 0: hi: 186, btch: 31 usd: 0
[60860.396253] CPU 1: hi: 186, btch: 31 usd: 79
[60860.396255] CPU 2: hi: 186, btch: 31 usd: 0
[60860.396256] CPU 3: hi: 186, btch: 31 usd: 0
[60860.396261] active_anon:3996921 inactive_anon:134 isolated_anon:0
[60860.396261] active_file:165 inactive_file:175 isolated_file:0
[60860.396261] unevictable:0 dirty:52 writeback:0 unstable:0
[60860.396261] free:33844 slab_reclaimable:3272 slab_unreclaimable:4927
[60860.396261] mapped:206 shmem:160 pagetables:9651 bounce:0
[60860.396261] free_cma:0
[60860.396265] Node 0 DMA free:15908kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[60860.396272] lowmem_reserve[]: 0 2847 15902 15902
[60860.396275] Node 0 DMA32 free:64108kB min:12088kB low:15108kB high:18132kB active_anon:2841656kB inactive_anon:92kB active_file:188kB inactive_file:204kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129212kB managed:2919140kB mlocked:0kB dirty:4kB writeback:0kB mapped:188kB shmem:116kB slab_reclaimable:1340kB slab_unreclaimable:2688kB kernel_stack:736kB pagetables:6476kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:688 all_unreclaimable? yes
[60860.396281] lowmem_reserve[]: 0 0 13054 13054
[60860.396284] Node 0 Normal free:55360kB min:55424kB low:69280kB high:83136kB active_anon:13146028kB inactive_anon:444kB active_file:472kB inactive_file:496kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:13631488kB managed:13368116kB mlocked:0kB dirty:204kB writeback:0kB mapped:636kB shmem:524kB slab_reclaimable:11748kB slab_unreclaimable:17020kB kernel_stack:2752kB pagetables:32128kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1585 all_unreclaimable? yes
[60860.396306] lowmem_reserve[]: 0 0 0 0
[60860.396309] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15908kB
[60860.396322] Node 0 DMA32: 91*4kB (UE) 94*8kB (UEM) 65*16kB (UEM) 96*32kB (UEM) 70*64kB (UEM) 50*128kB (UEM) 32*256kB (U) 30*512kB (UE) 24*1024kB (UM) 0*2048kB 0*4096kB = 64236kB
[60860.396335] Node 0 Normal: 294*4kB (UEM) 183*8kB (UEM) 203*16kB (UE) 168*32kB (UEM) 93*64kB (UEM) 40*128kB (UE) 53*256kB (UEM) 35*512kB (UE) 2*1024kB (UM) 0*2048kB 0*4096kB = 55872kB
[60860.396347] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[60860.396349] 558 total pagecache pages
[60860.396350] 0 pages in swap cache
[60860.396351] Swap cache stats: add 0, delete 0, find 0/0
[60860.396352] Free swap = 0kB
[60860.396353] Total swap = 0kB
[60860.396354] 4194173 pages RAM
[60860.396355] 0 pages HighMem/MovableOnly
[60860.396355] 65843 pages reserved
[60860.396356] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[60860.396363] [ 386] 0 386 4868 69 13 0 0 upstart-udev-br
[60860.396365] [ 402] 0 402 12443 197 27 0 -1000 systemd-udevd
[60860.396368] [ 543] 0 543 3814 57 12 0 0 upstart-socket-
[60860.396372] [ 583] 0 583 5855 74 17 0 0 rpcbind
[60860.396374] [ 612] 108 612 5385 123 16 0 0 rpc.statd
[60860.396376] [ 645] 0 645 2555 575 7 0 0 dhclient
[60860.396378] [ 895] 102 895 9804 94 23 0 0 dbus-daemon
[60860.396380] [ 911] 0 911 3818 64 12 0 0 upstart-file-br
[60860.396382] [ 942] 0 942 10862 96 26 0 0 systemd-logind
[60860.396384] [ 1041] 0 1041 3634 47 12 0 0 getty
[60860.396385] [ 1043] 0 1043 3634 45 12 0 0 getty
[60860.396387] [ 1045] 101 1045 65018 437 30 0 0 rsyslogd
[60860.396389] [ 1046] 0 1046 6926 58 18 0 0 rpc.idmapd
[60860.396391] [ 1051] 0 1051 3634 48 11 0 0 getty
[60860.396392] [ 1053] 0 1053 3634 48 12 0 0 getty
[60860.396394] [ 1055] 0 1055 3634 47 12 0 0 getty
[60860.396396] [ 1086] 0 1086 15342 178 33 0 -1000 sshd
[60860.396398] [ 1094] 0 1094 5913 62 18 0 0 cron
[60860.396399] [ 1095] 0 1095 4784 43 13 0 0 atd
[60860.396401] [ 1109] 0 1109 1091 42 8 0 0 acpid
[60860.396403] [ 1184] 0 1184 4796 65 15 0 0 irqbalance
[60860.396404] [ 1436] 109 1436 243611 1008 64 0 0 icinga2
[60860.396406] [ 1485] 107 1485 11417 694 26 0 0 snmpd
[60860.396408] [ 1569] 0 1569 3634 46 12 0 0 getty
[60860.396409] [ 1585] 0 1585 3196 46 12 0 0 getty
[60860.396411] [ 1761] 106 1761 6804 138 18 0 0 ntpd
[60860.396413] [13118] 0 13118 26408 251 55 0 0 sshd
[60860.396415] [21356] 1000 21356 26408 249 52 0 0 sshd
[60860.396424] [22434] 1000 22434 5390 562 15 0 0 bash
[60860.396426] [ 5471] 0 5471 17492 126 38 0 0 sudo
[60860.396428] [ 6430] 0 6430 16330 126 37 0 0 su
[60860.396429] [ 6467] 0 6467 5409 593 15 0 0 bash
[60860.396432] [26130] 0 26130 22118 438 46 0 0 apache2
[60860.396434] [27421] 1002 27421 1037105 373121 1000 0 0 apache2
[60860.396436] [27423] 1002 27423 1100679 405184 1052 0 0 apache2
[60860.396437] [27425] 33 27425 112125 2102 79 0 0 apache2
[60860.396439] [27426] 33 27426 111911 1736 77 0 0 apache2
[60860.396441] [14799] 0 14799 43879 11714 90 0 0 gnocchi-metricd
[60860.396442] [15022] 0 15022 3408476 3167088 6308 0 0 gnocchi-metricd
[60860.396445] [26084] 1002 26084 705711 33719 301 0 0 apache2
[60860.396446] Out of memory: Kill process 15022 (gnocchi-metricd) score 755 or sacrifice child
[60860.401764] Killed process 15022 (gnocchi-metricd) total-vm:13633904kB, anon-rss:12668352kB, file-rss:0kB
[60863.305984] init: gnocchi-metricd main process ended, respawning

This issue happens with 1, 2, and 3 workers on an instance with 16 GB of RAM. I'm using Ceph for storage. Please let me know if I can add any more information (I'm uploading gnocchi.conf).

Thanks, regards.

Revision history for this message
Nicolas Vila (nvlan) wrote :
Revision history for this message
Julien Danjou (jdanjou) wrote :

Which version of Gnocchi?

affects: ceilometer → gnocchi
Revision history for this message
Julien Danjou (jdanjou) wrote :

Please also run gnocchi-metricd with debug enabled and paste the log it outputs before dying.

Changed in gnocchi:
status: New → Incomplete
Revision history for this message
Nicolas Vila (nvlan) wrote :

I've installed from Git repository, branch master (cloned on Dec 16 2015). I'm attaching gnocchi-metricd.log.

Thanks!

Revision history for this message
Nicolas Vila (nvlan) wrote :
Revision history for this message
Julien Danjou (jdanjou) wrote :

It looks like you have millions of metrics in your database. Is that the case? Can you check how many records there are in the metric table of the gnocchi database? (Try "SELECT COUNT(*) FROM metric".)

Revision history for this message
Nicolas Vila (nvlan) wrote :

I believe it's not the case:

mysql> select count(*) from metric;
+----------+
| count(*) |
+----------+
| 4854 |
+----------+
1 row in set (0.00 sec)

I did at one point have over 50,000 messages in the metering.sample queue that weren't being processed in time, but that's no longer the case.
gnocchi-metricd starts out using little memory, but it ramps up as time passes:

root@gnocchi-api-2:/var/lib/gnocchi# free -m
             total used free shared buffers cached
Mem: 16049 465 15583 203 15 280
-/+ buffers/cache: 169 15879
Swap: 0 0 0
root@gnocchi-api-2:/var/lib/gnocchi# service gnocchi-metricd start
gnocchi-metricd start/running, process 2536
root@gnocchi-api-2:/var/lib/gnocchi# free -m
             total used free shared buffers cached
Mem: 16049 576 15472 203 15 280
-/+ buffers/cache: 280 15768
Swap: 0 0 0

Revision history for this message
Nicolas Vila (nvlan) wrote :

I do have many objects in the gnocchi Ceph pool (compared to the pools we use for images, volumes, etc.):
root@gnocchi-api-2:~# rados df

pool name KB objects clones degraded unfound rd rd KB wr wr KB
gnocchi 663410 4515911 0 0 0 61512026 745333319 40978835 219043438
nubeliu_backups 0 1 0 0 0 103 79 633 843779
nubeliu_images 223115799 27449 0 0 0 65848 268587731 68336 255736063
nubeliu_vms 0 0 0 0 0 0 0 0 0
nubeliu_volumes 45406537 21598 0 0 0 3818172 28632843 15970713 269035507
nubeliutest 52484609 12828 0 0 0 445 574 207451 52483585
  total used 737807500 4577787
  total avail 15066454868
  total space 15804262368

Revision history for this message
Julien Danjou (jdanjou) wrote :

Ok, let's assume it's a MySQL memory error, because, well, it's MySQL after all.
Remember, kid, we do recommend PostgreSQL.

You're using the MySQLdb driver – let's try with PyMySQL. Replace mysql:// in your connection string with mysql+pymysql:// (and install PyMySQL if necessary). Then launch metricd and let me know if you still see the issue.
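
For illustration, a minimal sketch of what this URL change does, assuming a standard SQLAlchemy setup (the credentials, host, and database name below are placeholders):

# The scheme of the connection URL selects the DBAPI driver that SQLAlchemy
# loads; switching it swaps MySQLdb for PyMySQL with no other change.
import sqlalchemy

old_url = "mysql://gnocchi:secret@db-host/gnocchi"          # MySQLdb (mysql-python)
new_url = "mysql+pymysql://gnocchi:secret@db-host/gnocchi"  # PyMySQL (pip install pymysql)

engine = sqlalchemy.create_engine(new_url)  # lazy: no connection is opened yet
print(engine.dialect.driver)                # -> "pymysql"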

Revision history for this message
Nicolas Vila (nvlan) wrote :
Download full text (10.1 KiB)

Hello Julien,

I applied the mysql+pymysql change you suggested, but with the same result: gnocchi-metricd ran for almost 30 minutes before segfaulting again. The gnocchi-metricd.log file shows:

(...)
2016-01-14 15:24:47.765 31023 DEBUG gnocchi.service [-] ******************************************************************************** log_opt_values /usr/local/lib/python2.7/dist-packages/oslo_config/cfg.py:2343
2016-01-14 15:25:02.653 2114 DEBUG oslo_db.sqlalchemy.engines [-] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py:256
2016-01-14 15:25:02.654 2115 DEBUG oslo_db.sqlalchemy.engines [-] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py:256
2016-01-14 15:25:02.658 2114 DEBUG gnocchi.storage [-] Processing new and to delete measures process_background_tasks /usr/local/lib/python2.7/dist-packages/gnocchi/storage/__init__.py:171
2016-01-14 15:25:02.658 2115 DEBUG gnocchi.storage [-] Processing new and to delete measures process_background_tasks /usr/local/lib/python2.7/dist-packages/gnocchi/storage/__init__.py:171
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage [-] Unexpected error during measures processing
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage Traceback (most recent call last):
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage File "/usr/local/lib/python2.7/dist-packages/gnocchi/storage/__init__.py", line 173, in process_background_tasks
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage self.process_measures(index, sync)
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage File "/usr/local/lib/python2.7/dist-packages/gnocchi/storage/_carbonara.py", line 159, in process_measures
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage metrics = indexer.get_metrics(metrics_to_process)
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage File "/usr/local/lib/python2.7/dist-packages/gnocchi/indexer/sqlalchemy.py", line 148, in get_metrics
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage metrics = list(query.all())
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2584, in all
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage return list(self)
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2732, in __iter__
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage return self._execute_and_instances(context)
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2747, in _execute_and_instances
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage result = conn.execute(querycontext.statement, self._params)
2016-01-14 15:52:10.018 2115 ERROR gnocc...

Revision history for this message
Julien Danjou (jdanjou) wrote :

Ok, so the error is different, but the problem is the same. I'm trying to find what may cause it. I'll probably send a few patches for you to try in the next few days.

Revision history for this message
Julien Danjou (jdanjou) wrote :

To reply to your question: no, this is a bug; one host with 16GB is _more than_ enough.

Revision history for this message
Nicolas Vila (nvlan) wrote :

I've just realized that the log extract I posted doesn't seem to be complete; would you like me to upload it as an attachment?

Thanks, regards.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to gnocchi (master)

Fix proposed to branch: master
Review: https://review.openstack.org/267692

Changed in gnocchi:
assignee: nobody → Julien Danjou (jdanjou)
status: Incomplete → In Progress
Revision history for this message
Julien Danjou (jdanjou) wrote : Re: gnocchi-metricd uses all memory and segfaults

I tried to guess the problem, and I wrote https://review.openstack.org/267692

Can you give it a try and tell me if it fixes your issue?

Revision history for this message
Julien Danjou (jdanjou) wrote :

It is complete; there's a "Download full text" link at the top of your message. :) I got it, don't worry! :)

Revision history for this message
Nicolas Vila (nvlan) wrote :

Hello Julien,

I just tried it out, but sadly with the same result. The trace is very similar to the one in comment #10; the last 10 lines are:

(...)
2016-01-14 17:16:34.002 13837 ERROR gnocchi.storage return self._generate_generic_binary(binary, opstring, **kw)
2016-01-14 17:16:34.002 13837 ERROR gnocchi.storage File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/compiler.py", line 938, in _generate_generic_binary
2016-01-14 17:16:34.002 13837 ERROR gnocchi.storage binary.right._compiler_dispatch(self, **kw)
2016-01-14 17:16:34.002 13837 ERROR gnocchi.storage File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/visitors.py", line 93, in _compiler_dispatch
2016-01-14 17:16:34.002 13837 ERROR gnocchi.storage return meth(self, **kw)
2016-01-14 17:16:34.002 13837 ERROR gnocchi.storage File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/compiler.py", line 524, in visit_grouping
2016-01-14 17:16:34.002 13837 ERROR gnocchi.storage return "(" + grouping.element._compiler_dispatch(self, **kwargs) + ")"
2016-01-14 17:16:34.002 13837 ERROR gnocchi.storage MemoryError
2016-01-14 17:16:34.002 13837 ERROR gnocchi.storage
2016-01-14 17:16:34.831 13837 DEBUG gnocchi.storage [-] Expunging deleted metrics process_background_tasks /usr/local/lib/python2.7/dist-packages/gnocchi/storage/__init__.py:179

Thanks, regards.

Revision history for this message
Nicolas Vila (nvlan) wrote :

Hello Julien,

After more testing, it turns out the patch *does* work. gnocchi-metricd segfaulted when the API received a /v1/status call; as long as no such request is sent, gnocchi-metricd has been working fine for the last couple of hours.

Thanks, regards.

Revision history for this message
Julien Danjou (jdanjou) wrote :

Cool, Nicolas, that confirms my first intuition. It's normal that the API call breaks it, since I didn't patch that code path.

My patch is not complete yet; it only covers your particular problem with metricd and Ceph. I'll complete it soon! Thanks for your help!

Revision history for this message
Julien Danjou (jdanjou) wrote :

Nicolas, I've sent a different and simpler version of the patch.
Can you stop metricd for a while so the Ceph pool fills with some new metrics, and then start metricd with this patch to see if it works without any problem?
The API should work fine with that new patch too.

Changed in gnocchi:
importance: Undecided → Critical
Revision history for this message
Nicolas Vila (nvlan) wrote :

Hello Julien,

So far, so good! The new patch has been working for nearly one hour, and memory usage is still within normal values (nearly 9GB out of 16GB). I have 5 days of metrics to process, so it's going to be working for a while. I've executed a "gnocchi status" from the client and am waiting for its reply. I'll let you know how it goes.

Thanks, regards!

Revision history for this message
Julien Danjou (jdanjou) wrote :

Cool! Yeah, seeing the number of measures you accumulated, I'm not surprised. If you have some statistics on the processing rate, etc., I'd be glad to hear them – even if they're not related to this bug. ;)

Revision history for this message
Nicolas Vila (nvlan) wrote :

Hello Julien,

Unfortunately, the "gnocchi status" call caused metricd to segfault. I'll stop everyone from executing it for now, and stop the ceilometer agents so that metricd can catch up closer to the current date (it's almost six days behind). That will also let me gather some statistics and see whether metricd can process a couple of days uninterrupted.

Many thanks, regards.

Revision history for this message
Julien Danjou (jdanjou) wrote :

Hm, "gnocchi" status has 0 interaction with metricd (in theory). Can you paste me your segfault trace? I imagine it's not a segfault but just the OOM killer kicking in?

Revision history for this message
Nicolas Vila (nvlan) wrote :

Hello Julien,

I've been doing several tests this weekend, and if I use more than one metricd worker (as set in gnocchi.conf), patch sets 1 and 3 eventually segfault. With only one worker, patch set 1 has now been working non-stop for over 24 hours. I'll next try configuring more than one metricd on separate hosts with a Redis server for coordination, using patch set 1 on one and patch set 3 on the other, and let you know the results.

Many thanks, kind regards.

Revision history for this message
Julien Danjou (jdanjou) wrote :

Hi Nicolas,

You say 'segfault', but I don't think it's really a segmentation fault. I understand it crashes, but do you have a backtrace, a log, or anything? It should not crash with 3 workers. Is it the OOM killer killing a process? Do the workers use too much memory? Are they processing metrics?

I'd be interested in having more info to help improve the current situation!

Revision history for this message
Nicolas Vila (nvlan) wrote :
Download full text (10.2 KiB)

Hello Julien,

Without any patch, dmesg shows "segmentation fault". Currently, I have one node using patch set 1; it has been working without issues for over 24 hours with just one worker. I changed the configuration about 30 minutes ago, set workers = 2, and restarted. The gnocchi-metricd.log file shows that it is processing measures, but after a while gnocchi-metricd invoked the oom-killer:

[500580.805296] gnocchi-metricd invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[500580.805307] gnocchi-metricd cpuset=/ mems_allowed=0
[500580.805318] CPU: 2 PID: 14524 Comm: gnocchi-metricd Tainted: G D 3.13.0-46-generic #75-Ubuntu
[500580.805321] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.7.5-20150310_111955-batsu 04/01/2014
[500580.805323] 0000000000000000 ffff8802aeab9a68 ffffffff817212c6 ffff8800a3b51800
[500580.805329] ffff8802aeab9af0 ffffffff8171bb81 0000000000000000 0000000000000000
[500580.805331] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[500580.805333] Call Trace:
[500580.805369] [<ffffffff817212c6>] dump_stack+0x45/0x56
[500580.805375] [<ffffffff8171bb81>] dump_header+0x7f/0x1f1
[500580.805388] [<ffffffff8115299e>] oom_kill_process+0x1ce/0x330
[500580.805403] [<ffffffff812d70e5>] ? security_capable_noaudit+0x15/0x20
[500580.805405] [<ffffffff811530d4>] out_of_memory+0x414/0x450
[500580.805410] [<ffffffff81159440>] __alloc_pages_nodemask+0xa60/0xb80
[500580.805422] [<ffffffff81199caa>] alloc_pages_vma+0x9a/0x140
[500580.805428] [<ffffffff8117a623>] handle_mm_fault+0xb23/0xf00
[500580.805433] [<ffffffff8172d2d4>] __do_page_fault+0x184/0x560
[500580.805436] [<ffffffff81182705>] ? change_protection+0x65/0xb0
[500580.805439] [<ffffffff811828a1>] ? mprotect_fixup+0x151/0x290
[500580.805441] [<ffffffff8172d6ca>] do_page_fault+0x1a/0x70
[500580.805444] [<ffffffff8172cd49>] do_async_page_fault+0x29/0xe0
[500580.805446] [<ffffffff81729b58>] async_page_fault+0x28/0x30
[500580.805448] Mem-Info:
[500580.805450] Node 0 DMA per-cpu:
[500580.805451] CPU 0: hi: 0, btch: 1 usd: 0
[500580.805453] CPU 1: hi: 0, btch: 1 usd: 0
[500580.805454] CPU 2: hi: 0, btch: 1 usd: 0
[500580.805455] CPU 3: hi: 0, btch: 1 usd: 0
[500580.805455] Node 0 DMA32 per-cpu:
[500580.805457] CPU 0: hi: 186, btch: 31 usd: 0
[500580.805458] CPU 1: hi: 186, btch: 31 usd: 0
[500580.805459] CPU 2: hi: 186, btch: 31 usd: 0
[500580.805460] CPU 3: hi: 186, btch: 31 usd: 169
[500580.805461] Node 0 Normal per-cpu:
[500580.805462] CPU 0: hi: 186, btch: 31 usd: 73
[500580.805463] CPU 1: hi: 186, btch: 31 usd: 9
[500580.805464] CPU 2: hi: 186, btch: 31 usd: 0
[500580.805466] CPU 3: hi: 186, btch: 31 usd: 178
[500580.805469] active_anon:3994472 inactive_anon:134 isolated_anon:0
[500580.805469] active_file:160 inactive_file:332 isolated_file:0
[500580.805469] unevictable:0 dirty:0 writeback:0 unstable:0
[500580.805469] free:34674 slab_reclaimable:3233 slab_unreclaimable:5086
[500580.805469] mapped:206 shmem:160 pagetables:10093 bounce:0
[500580.805469] free_cma:0
[500580.805472] Node 0 DMA free:15908kB min:64kB low:80kB high:96...

Revision history for this message
Julien Danjou (jdanjou) wrote :

Hi Nicolas,

Thank you for all this information!

Ok, so we have 2 different issues! The OOM killer is killing metricd *without* the patch, and that makes sense, since the patch reduces memory consumption a lot – that's really what this bug is about.

The segmentation fault you see is probably not a bug in Gnocchi itself, but in Ceph (either in python-rados or librados). That's another issue which is going to be a bit more complicated to track.

So keep it running with the patch. I'll send this bug to Mehdi in case he has an idea about the segfault in Rados. Can you tell me which versions of Ceph and the rados libraries you have installed on your system?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to gnocchi (master)

Reviewed: https://review.openstack.org/267692
Committed: https://git.openstack.org/cgit/openstack/gnocchi/commit/?id=614e13d47fdcaeea9d41bebc214014e0c83a0e83
Submitter: Jenkins
Branch: master

commit 614e13d47fdcaeea9d41bebc214014e0c83a0e83
Author: Julien Danjou <email address hidden>
Date: Thu Jan 14 17:31:14 2016 +0100

    ceph: fix the metric list to process with new measures

    Currently, the list returned in the Ceph driver contains a lot of
    duplicates because it returns a list and not a set. If 1 metric has N new
    measure batches waiting to be processed, the returned list will be of size
    N and not 1.

    Using a set() avoids that issue and the memory drain it implies.

    Closes-Bug: #1533793
    Change-Id: I3a0b726aae14a17a23a365babc1a2537fb4d1052

Changed in gnocchi:
status: In Progress → Fix Committed
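
As a rough illustration of the duplication described in the commit message above (the object-name format here is hypothetical; the real Ceph driver code differs):

# Each pending batch of measures is stored as its own object, so listing the
# pool yields one entry per batch. Building a list of metric ids therefore
# repeats a metric once per batch; building a set keeps each metric once.
pending_objects = [
    "measure_abc_1", "measure_abc_2", "measure_abc_3",  # 3 batches for metric "abc"
    "measure_def_1",                                    # 1 batch for metric "def"
]

def metric_id(obj_name):
    return obj_name.split("_")[1]

metrics_as_list = [metric_id(o) for o in pending_objects]  # ['abc', 'abc', 'abc', 'def']
metrics_as_set = {metric_id(o) for o in pending_objects}   # {'abc', 'def'}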
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to gnocchi (stable/1.3)

Fix proposed to branch: stable/1.3
Review: https://review.openstack.org/269531

Revision history for this message
Nicolas Vila (nvlan) wrote : Re: gnocchi-metricd uses all memory and segfaults

Hello Julien,

These are the versions in use on the gnocchi-api host:

root@gnocchi-api-1:~# dpkg -l | egrep -e "rados|ceph"
ii ceph 0.94.5-1trusty amd64 distributed storage and file system
ii ceph-common 0.94.5-1trusty amd64 common utilities to mount and interact with a ceph storage cluster
ii ceph-fs-common 0.94.5-1trusty amd64 common utilities to mount and interact with a ceph file system
ii ceph-fuse 0.94.5-1trusty amd64 FUSE-based client for the Ceph distributed file system
ii ceph-mds 0.94.5-1trusty amd64 metadata server for the ceph distributed file system
ii libcephfs1 0.94.5-1trusty amd64 Ceph distributed file system client library
ii librados2 0.94.5-1trusty amd64 RADOS distributed object store client library
ii libradosstriper1 0.94.5-1trusty amd64 RADOS striping interface
ii python-cephfs 0.94.5-1trusty amd64 Python libraries for the Ceph libcephfs library
ii python-rados 0.94.5-1trusty amd64 Python libraries for the Ceph librados library

Should I open a new bug report for Mehdi to proceed?

Many thanks, regards.

Revision history for this message
Julien Danjou (jdanjou) wrote :

Nicolas, I guess you can open a bug, but we'll just mark it as Incomplete, since we cannot fix it in Gnocchi: it's likely a Ceph bug, and we don't have much information.

It looks like your Ceph version is a bit old; maybe you could try a more recent version?

summary: - gnocchi-metricd uses all memory and segfaults
+ gnocchi-metricd uses all memory and get killed by OO
summary: - gnocchi-metricd uses all memory and get killed by OO
+ gnocchi-metricd uses all memory and get killed by OOM
summary: - gnocchi-metricd uses all memory and get killed by OOM
+ gnocchi-metricd uses all memory and get killed by OOM killer
Revision history for this message
Nicolas Vila (nvlan) wrote :
Download full text (18.0 KiB)

Hello Julien,

I've separated gnocchi-api from gnocchi-metricd and moved the metricd daemon to a new separate host. I applied patch set 3 and configured gnocchi.conf with workers set to 1 and aggregation_workers_number set to 1. I had gnocchi-metricd running (ps shows 3 processes), and it is segfaulting:

[570576.585198] gnocchi-metricd[30778]: segfault at 24 ip 0000000000558077 sp 00007fffd296c2d0 error 6 in python2.7[400000+2bc000]
[570576.733338] Core dump to |/usr/share/apport/apport 30778 11 0 30778 pipe failed
[570578.384777] init: gnocchi-metricd main process ended, respawning
[571985.831013] gnocchi-metricd[9803]: segfault at 24 ip 0000000000537388 sp 00007fff8e8c5880 error 6 in python2.7[400000+2bc000]
[571985.832969] gnocchi-metricd[9802]: segfault at 24 ip 0000000000558077 sp 00007fff8e8c4de0 error 6 in python2.7[400000+2bc000]
[571985.956518] Core dump to |/usr/share/apport/apport 9803 11 0 9803 pipe failed
[571987.323898] init: gnocchi-metricd main process ended, respawning

The coredump left on /usr/share/apport/apport is as follows:

---

#!/usr/bin/python3

# Collect information about a crash and create a report in the directory
# specified by apport.fileutils.report_dir.
# See https://wiki.ubuntu.com/Apport for details.
#
# Copyright (c) 2006 - 2011 Canonical Ltd.
# Author: Martin Pitt <email address hidden>
#
# This program is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation; either version 2 of the License, or (at your
# option) any later version. See http://www.gnu.org/copyleft/gpl.html for
# the full text of the license.

import sys, os, os.path, subprocess, time, traceback, pwd, io
import signal, inspect, grp, fcntl

import apport, apport.fileutils

#################################################################
#
# functions
#
#################################################################

def check_lock():
    '''Abort if another instance of apport is already running.

    This avoids bringing down the system to its knees if there is a series of
    crashes.'''

    # create a lock file
    lockfile = os.path.join(apport.fileutils.report_dir, '.lock')
    try:
        fd = os.open(lockfile, os.O_WRONLY | os.O_CREAT | os.O_NOFOLLOW)
    except OSError as e:
        error_log('cannot create lock file (uid %i): %s' % (os.getuid(), str(e)))
        sys.exit(1)

    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        error_log('another apport instance is already running, aborting')
        sys.exit(1)

def drop_privileges(pid, partial=False):
    '''Change user and group to match the given target process.'''

    stat = None
    try:
        stat = os.stat('/proc/%s/stat' % pid)
    except OSError as e:
        raise ValueError('Invalid process ID: ' + str(e))

    if partial:
        effective_gid = os.getegid()
        effective_uid = os.geteuid()
    else:
        effective_gid = stat.st_gid
        effective_uid = stat.st_uid

    os.setregid(stat.st_gid, effective_gid)
    os.setreuid(stat.st_uid, effective_uid)
    assert os.getegid() == effective_gid
    assert os.getgid() == stat.st_gid...

Revision history for this message
Julien Danjou (jdanjou) wrote :

Apport is the Python script provided by Canonical to send the core dump to their server. And it seems it crashed too :)

The best way to debug this at this stage is the following:

1. Disable apport with /etc/init.d/apport stop
2. Stop gnocchi-metricd
3. Make sure you have no limit on core dump size: ulimit -c unlimited
4. Run gnocchi-metricd in foreground: gnocchi-metricd --debug
5. Wait for it to crash :)
6. When it segfaults, you should have a "core" file in the current directory
7. With that core file, run gdb as follow: gdb /usr/bin/python2.7 core
8. At the (gdb) prompt type: bt full

That should give you the full backtrace of the segmentation fault and what caused it. It's likely caused by a bug in librados, so it's pretty likely we won't be able to fix it, but it can give us a hint.

Revision history for this message
Nicolas Vila (nvlan) wrote :

Hello Julien,

I followed your steps, but sadly gdb can't recognize the 5.9GB coredump file:

root@gnocchi-metricd:~# file core
core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), too many program header sections (451)
root@gnocchi-metricd:~# gdb /usr/bin/python2.7 core
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/python2.7...Reading symbols from /usr/lib/debug//usr/bin/python2.7...done.
done.
"/root/core" is not a core dump: File format not recognized
(gdb) bt full
No stack.

Is there any other way we could proceed?

Many thanks, regards.

Revision history for this message
Julien Danjou (jdanjou) wrote :

Try that:

1. Disable apport with /etc/init.d/apport stop
2. Stop gnocchi-metricd
3. Run gnocchi-metricd in foreground with gdb: gdb --args /usr/bin/python2.7 /usr/local/bin/gnocchi-metricd --debug
4. (gdb) run
5. Wait for it to crash :)
6. When it segfaults, at the (gdb) prompt type: bt full

The program might be slower while running with gdb.

Revision history for this message
Nicolas Vila (nvlan) wrote :

Hello Julien,

Still no luck. I've followed your last steps but gdb still says no stack:

root@gnocchi-metricd:~# gdb --args /usr/bin/python2.7 /usr/local/bin/gnocchi-metricd --debug
GNU gdb (GDB) 7.6.2
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/python2.7...(no debugging symbols found)...done.
(gdb) run
Starting program: /usr/bin/python2.7 /usr/local/bin/gnocchi-metricd --debug
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffa000
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Option "verbose" from group "DEFAULT" is deprecated for removal. Its value may be silently ignored in the future.
2016-01-20 13:50:15.543 23251 DEBUG gnocchi.service [-] ******************************************************************************** log_opt_values /usr/local/lib/python2.7/dist-packages/oslo_config/cfg.py:2367
(...)
2016-01-20 13:50:16.672 23264 DEBUG gnocchi.storage [-] Processing new and to delete measures process_background_tasks /usr/local/lib/python2.7/dist-packages/gnocchi/storage/__init__.py:182
[Inferior 1 (process 23251) exited normally]
(gdb) bt full
No stack.
(gdb)

I'm going to deploy a fresh metricd host just to be 100% sure this issue is reproducible, and I'll update this bug once I've confirmed it.

Many thanks, regards.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to gnocchi (stable/1.3)

Reviewed: https://review.openstack.org/269531
Committed: https://git.openstack.org/cgit/openstack/gnocchi/commit/?id=ee1740bf12c13f546990f01ef76a4f0b29a78aeb
Submitter: Jenkins
Branch: stable/1.3

commit ee1740bf12c13f546990f01ef76a4f0b29a78aeb
Author: Julien Danjou <email address hidden>
Date: Thu Jan 14 17:31:14 2016 +0100

    ceph: fix the metric list to process with new measures

    Currently, the list returned in the Ceph driver contains a lot of
    duplicates because it returns a list and not a set. If 1 metric has N new
    measure batches waiting to be processed, the returned list will be of size
    N and not 1.

    Using a set() avoids that issue and the memory drain it implies.

    Closes-Bug: #1533793
    Change-Id: I3a0b726aae14a17a23a365babc1a2537fb4d1052
    (cherry picked from commit 614e13d47fdcaeea9d41bebc214014e0c83a0e83)

Julien Danjou (jdanjou)
Changed in gnocchi:
milestone: none → 2.0.0
status: Fix Committed → Fix Released