gnocchi-metricd uses all memory and gets killed by the OOM killer

Bug #1533793 reported by Nicolas Vila
This bug affects 1 person
Affects       Status        Importance  Assigned to     Milestone
Gnocchi       Fix Released  Critical    Julien Danjou   2.0.0
1.3           Fix Released  Critical    Julien Danjou

Bug Description

When started, gnocchi-metricd uses an ever-increasing amount of memory until it segfaults and exits, leaving behind defunct PIDs:

(...)
[60290.151162] gnocchi-metricd[15021]: segfault at 24 ip 0000000000558077 sp 00007fff97572b10 error 6 in python2.7[400000+2bc000]
[60860.395947] ntpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[60860.395956] ntpd cpuset=/ mems_allowed=0
[60860.395969] CPU: 0 PID: 1761 Comm: ntpd Tainted: G D 3.13.0-46-generic #75-Ubuntu
[60860.395983] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.7.5-20150310_111955-batsu 04/01/2014
[60860.395986] 0000000000000000 ffff880427b7f968 ffffffff817212c6 ffff88042793b000
[60860.395992] ffff880427b7f9f0 ffffffff8171bb81 0000000000000000 0000000000000000
[60860.395994] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[60860.395996] Call Trace:
[60860.396062] [<ffffffff817212c6>] dump_stack+0x45/0x56
[60860.396074] [<ffffffff8171bb81>] dump_header+0x7f/0x1f1
[60860.396091] [<ffffffff8115299e>] oom_kill_process+0x1ce/0x330
[60860.396113] [<ffffffff812d70e5>] ? security_capable_noaudit+0x15/0x20
[60860.396117] [<ffffffff811530d4>] out_of_memory+0x414/0x450
[60860.396127] [<ffffffff81159440>] __alloc_pages_nodemask+0xa60/0xb80
[60860.396141] [<ffffffff811979e3>] alloc_pages_current+0xa3/0x160
[60860.396148] [<ffffffff8114f557>] __page_cache_alloc+0x97/0xc0
[60860.396151] [<ffffffff81150f65>] filemap_fault+0x185/0x410
[60860.396158] [<ffffffff81175ddf>] __do_fault+0x6f/0x530
[60860.396173] [<ffffffff8108eb82>] ? __hrtimer_start_range_ns+0x1a2/0x3a0
[60860.396176] [<ffffffff81179f82>] handle_mm_fault+0x482/0xf00
[60860.396184] [<ffffffff8107764b>] ? recalc_sigpending+0x1b/0x50
[60860.396190] [<ffffffff8172d2d4>] __do_page_fault+0x184/0x560
[60860.396194] [<ffffffff8107a9ac>] ? signal_delivered+0x5c/0x80
[60860.396197] [<ffffffff8107aa0c>] ? signal_setup_done+0x3c/0x60
[60860.396212] [<ffffffff810135c5>] ? do_signal+0x1b5/0xa40
[60860.396215] [<ffffffff8107764b>] ? recalc_sigpending+0x1b/0x50
[60860.396219] [<ffffffff81077f52>] ? __set_task_blocked+0x32/0x70
[60860.396222] [<ffffffff8172d6ca>] do_page_fault+0x1a/0x70
[60860.396225] [<ffffffff8172cd49>] do_async_page_fault+0x29/0xe0
[60860.396228] [<ffffffff81729b58>] async_page_fault+0x28/0x30
[60860.396231] Mem-Info:
[60860.396233] Node 0 DMA per-cpu:
[60860.396236] CPU 0: hi: 0, btch: 1 usd: 0
[60860.396238] CPU 1: hi: 0, btch: 1 usd: 0
[60860.396239] CPU 2: hi: 0, btch: 1 usd: 0
[60860.396241] CPU 3: hi: 0, btch: 1 usd: 0
[60860.396242] Node 0 DMA32 per-cpu:
[60860.396244] CPU 0: hi: 186, btch: 31 usd: 0
[60860.396246] CPU 1: hi: 186, btch: 31 usd: 11
[60860.396247] CPU 2: hi: 186, btch: 31 usd: 30
[60860.396249] CPU 3: hi: 186, btch: 31 usd: 0
[60860.396250] Node 0 Normal per-cpu:
[60860.396252] CPU 0: hi: 186, btch: 31 usd: 0
[60860.396253] CPU 1: hi: 186, btch: 31 usd: 79
[60860.396255] CPU 2: hi: 186, btch: 31 usd: 0
[60860.396256] CPU 3: hi: 186, btch: 31 usd: 0
[60860.396261] active_anon:3996921 inactive_anon:134 isolated_anon:0
[60860.396261] active_file:165 inactive_file:175 isolated_file:0
[60860.396261] unevictable:0 dirty:52 writeback:0 unstable:0
[60860.396261] free:33844 slab_reclaimable:3272 slab_unreclaimable:4927
[60860.396261] mapped:206 shmem:160 pagetables:9651 bounce:0
[60860.396261] free_cma:0
[60860.396265] Node 0 DMA free:15908kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[60860.396272] lowmem_reserve[]: 0 2847 15902 15902
[60860.396275] Node 0 DMA32 free:64108kB min:12088kB low:15108kB high:18132kB active_anon:2841656kB inactive_anon:92kB active_file:188kB inactive_file:204kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129212kB managed:2919140kB mlocked:0kB dirty:4kB writeback:0kB mapped:188kB shmem:116kB slab_reclaimable:1340kB slab_unreclaimable:2688kB kernel_stack:736kB pagetables:6476kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:688 all_unreclaimable? yes
[60860.396281] lowmem_reserve[]: 0 0 13054 13054
[60860.396284] Node 0 Normal free:55360kB min:55424kB low:69280kB high:83136kB active_anon:13146028kB inactive_anon:444kB active_file:472kB inactive_file:496kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:13631488kB managed:13368116kB mlocked:0kB dirty:204kB writeback:0kB mapped:636kB shmem:524kB slab_reclaimable:11748kB slab_unreclaimable:17020kB kernel_stack:2752kB pagetables:32128kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1585 all_unreclaimable? yes
[60860.396306] lowmem_reserve[]: 0 0 0 0
[60860.396309] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15908kB
[60860.396322] Node 0 DMA32: 91*4kB (UE) 94*8kB (UEM) 65*16kB (UEM) 96*32kB (UEM) 70*64kB (UEM) 50*128kB (UEM) 32*256kB (U) 30*512kB (UE) 24*1024kB (UM) 0*2048kB 0*4096kB = 64236kB
[60860.396335] Node 0 Normal: 294*4kB (UEM) 183*8kB (UEM) 203*16kB (UE) 168*32kB (UEM) 93*64kB (UEM) 40*128kB (UE) 53*256kB (UEM) 35*512kB (UE) 2*1024kB (UM) 0*2048kB 0*4096kB = 55872kB
[60860.396347] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[60860.396349] 558 total pagecache pages
[60860.396350] 0 pages in swap cache
[60860.396351] Swap cache stats: add 0, delete 0, find 0/0
[60860.396352] Free swap = 0kB
[60860.396353] Total swap = 0kB
[60860.396354] 4194173 pages RAM
[60860.396355] 0 pages HighMem/MovableOnly
[60860.396355] 65843 pages reserved
[60860.396356] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[60860.396363] [ 386] 0 386 4868 69 13 0 0 upstart-udev-br
[60860.396365] [ 402] 0 402 12443 197 27 0 -1000 systemd-udevd
[60860.396368] [ 543] 0 543 3814 57 12 0 0 upstart-socket-
[60860.396372] [ 583] 0 583 5855 74 17 0 0 rpcbind
[60860.396374] [ 612] 108 612 5385 123 16 0 0 rpc.statd
[60860.396376] [ 645] 0 645 2555 575 7 0 0 dhclient
[60860.396378] [ 895] 102 895 9804 94 23 0 0 dbus-daemon
[60860.396380] [ 911] 0 911 3818 64 12 0 0 upstart-file-br
[60860.396382] [ 942] 0 942 10862 96 26 0 0 systemd-logind
[60860.396384] [ 1041] 0 1041 3634 47 12 0 0 getty
[60860.396385] [ 1043] 0 1043 3634 45 12 0 0 getty
[60860.396387] [ 1045] 101 1045 65018 437 30 0 0 rsyslogd
[60860.396389] [ 1046] 0 1046 6926 58 18 0 0 rpc.idmapd
[60860.396391] [ 1051] 0 1051 3634 48 11 0 0 getty
[60860.396392] [ 1053] 0 1053 3634 48 12 0 0 getty
[60860.396394] [ 1055] 0 1055 3634 47 12 0 0 getty
[60860.396396] [ 1086] 0 1086 15342 178 33 0 -1000 sshd
[60860.396398] [ 1094] 0 1094 5913 62 18 0 0 cron
[60860.396399] [ 1095] 0 1095 4784 43 13 0 0 atd
[60860.396401] [ 1109] 0 1109 1091 42 8 0 0 acpid
[60860.396403] [ 1184] 0 1184 4796 65 15 0 0 irqbalance
[60860.396404] [ 1436] 109 1436 243611 1008 64 0 0 icinga2
[60860.396406] [ 1485] 107 1485 11417 694 26 0 0 snmpd
[60860.396408] [ 1569] 0 1569 3634 46 12 0 0 getty
[60860.396409] [ 1585] 0 1585 3196 46 12 0 0 getty
[60860.396411] [ 1761] 106 1761 6804 138 18 0 0 ntpd
[60860.396413] [13118] 0 13118 26408 251 55 0 0 sshd
[60860.396415] [21356] 1000 21356 26408 249 52 0 0 sshd
[60860.396424] [22434] 1000 22434 5390 562 15 0 0 bash
[60860.396426] [ 5471] 0 5471 17492 126 38 0 0 sudo
[60860.396428] [ 6430] 0 6430 16330 126 37 0 0 su
[60860.396429] [ 6467] 0 6467 5409 593 15 0 0 bash
[60860.396432] [26130] 0 26130 22118 438 46 0 0 apache2
[60860.396434] [27421] 1002 27421 1037105 373121 1000 0 0 apache2
[60860.396436] [27423] 1002 27423 1100679 405184 1052 0 0 apache2
[60860.396437] [27425] 33 27425 112125 2102 79 0 0 apache2
[60860.396439] [27426] 33 27426 111911 1736 77 0 0 apache2
[60860.396441] [14799] 0 14799 43879 11714 90 0 0 gnocchi-metricd
[60860.396442] [15022] 0 15022 3408476 3167088 6308 0 0 gnocchi-metricd
[60860.396445] [26084] 1002 26084 705711 33719 301 0 0 apache2
[60860.396446] Out of memory: Kill process 15022 (gnocchi-metricd) score 755 or sacrifice child
[60860.401764] Killed process 15022 (gnocchi-metricd) total-vm:13633904kB, anon-rss:12668352kB, file-rss:0kB
[60863.305984] init: gnocchi-metricd main process ended, respawning

This issue happens with 1, 2, and 3 workers on an instance with 16 GB of RAM. I'm using Ceph for storage. Please let me know if I can add any more information (I'm uploading gnocchi.conf).

Thanks, regards.

Revision history for this message
Nicolas Vila (nvlan) wrote :
Revision history for this message
Julien Danjou (jdanjou) wrote :

Which version of Gnocchi?

affects: ceilometer → gnocchi
Revision history for this message
Julien Danjou (jdanjou) wrote :

Please also run gnocchi-metricd with debug enabled and paste the log it outputs before dying.

Changed in gnocchi:
status: New → Incomplete
Revision history for this message
Nicolas Vila (nvlan) wrote :

I've installed from Git repository, branch master (cloned on Dec 16 2015). I'm attaching gnocchi-metricd.log.

Thanks!

Revision history for this message
Nicolas Vila (nvlan) wrote :
Revision history for this message
Julien Danjou (jdanjou) wrote :

It looks like you have millions of metrics in your database. Is that the case? Can you check how many records there are in the metric table of the gnocchi database? (Try "SELECT COUNT(*) FROM metric".)

Revision history for this message
Nicolas Vila (nvlan) wrote :

I believe it's not the case:

mysql> select count(*) from metric;
+----------+
| count(*) |
+----------+
| 4854 |
+----------+
1 row in set (0.00 sec)

I did at one point have over 50,000 messages in the metering.sample queue that weren't being processed in time, but that's no longer the case.
gnocchi-metricd starts out using little memory, but it ramps up as time passes:

root@gnocchi-api-2:/var/lib/gnocchi# free -m
             total used free shared buffers cached
Mem: 16049 465 15583 203 15 280
-/+ buffers/cache: 169 15879
Swap: 0 0 0
root@gnocchi-api-2:/var/lib/gnocchi# service gnocchi-metricd start
gnocchi-metricd start/running, process 2536
root@gnocchi-api-2:/var/lib/gnocchi# free -m
             total used free shared buffers cached
Mem: 16049 576 15472 203 15 280
-/+ buffers/cache: 280 15768
Swap: 0 0 0

Revision history for this message
Nicolas Vila (nvlan) wrote :

I do have many objects in the gnocchi Ceph pool (compared to the pools we use for images, volumes, etc.):
root@gnocchi-api-2:~# rados df

pool name KB objects clones degraded unfound rd rd KB wr wr KB
gnocchi 663410 4515911 0 0 0 61512026 745333319 40978835 219043438
nubeliu_backups 0 1 0 0 0 103 79 633 843779
nubeliu_images 223115799 27449 0 0 0 65848 268587731 68336 255736063
nubeliu_vms 0 0 0 0 0 0 0 0 0
nubeliu_volumes 45406537 21598 0 0 0 3818172 28632843 15970713 269035507
nubeliutest 52484609 12828 0 0 0 445 574 207451 52483585
  total used 737807500 4577787
  total avail 15066454868
  total space 15804262368

Revision history for this message
Julien Danjou (jdanjou) wrote :

Ok, let's assume it's a MySQL memory error, because, well, it's MySQL after all.
Remember, kid, we do recommend PostgreSQL.

You're using the MySQLdb driver – let's try with PyMySQL. Replace mysql:// in your connection string with mysql+pymysql:// (and install PyMySQL if necessary). Then launch metricd and let me know if you still see the issue.
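
For illustration, a minimal sketch of what this URL change does, assuming a standard SQLAlchemy setup (the credentials, host, and database name below are placeholders):

# The scheme of the connection URL selects the DBAPI driver that SQLAlchemy
# loads; switching it swaps MySQLdb for PyMySQL with no other change.
import sqlalchemy

old_url = "mysql://gnocchi:secret@db-host/gnocchi"          # MySQLdb (mysql-python)
new_url = "mysql+pymysql://gnocchi:secret@db-host/gnocchi"  # PyMySQL (pip install pymysql)

engine = sqlalchemy.create_engine(new_url)  # lazy: no connection is opened yet
print(engine.dialect.driver)                # -> "pymysql"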

Revision history for this message
Nicolas Vila (nvlan) wrote :
Download full text (10.1 KiB)

Hello Julien,

I applied the mysql+pymysql change you suggested, but with the same result: gnocchi-metricd ran for almost 30 minutes before segfaulting again. The gnocchi-metricd.log file shows:

(...)
2016-01-14 15:24:47.765 31023 DEBUG gnocchi.service [-] ******************************************************************************** log_opt_values /usr/local/lib/python2.7/dist-packages/oslo_config/cfg.py:2343
2016-01-14 15:25:02.653 2114 DEBUG oslo_db.sqlalchemy.engines [-] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py:256
2016-01-14 15:25:02.654 2115 DEBUG oslo_db.sqlalchemy.engines [-] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py:256
2016-01-14 15:25:02.658 2114 DEBUG gnocchi.storage [-] Processing new and to delete measures process_background_tasks /usr/local/lib/python2.7/dist-packages/gnocchi/storage/__init__.py:171
2016-01-14 15:25:02.658 2115 DEBUG gnocchi.storage [-] Processing new and to delete measures process_background_tasks /usr/local/lib/python2.7/dist-packages/gnocchi/storage/__init__.py:171
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage [-] Unexpected error during measures processing
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage Traceback (most recent call last):
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage File "/usr/local/lib/python2.7/dist-packages/gnocchi/storage/__init__.py", line 173, in process_background_tasks
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage self.process_measures(index, sync)
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage File "/usr/local/lib/python2.7/dist-packages/gnocchi/storage/_carbonara.py", line 159, in process_measures
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage metrics = indexer.get_metrics(metrics_to_process)
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage File "/usr/local/lib/python2.7/dist-packages/gnocchi/indexer/sqlalchemy.py", line 148, in get_metrics
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage metrics = list(query.all())
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2584, in all
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage return list(self)
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2732, in __iter__
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage return self._execute_and_instances(context)
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2747, in _execute_and_instances
2016-01-14 15:52:10.018 2115 ERROR gnocchi.storage result = conn.execute(querycontext.statement, self._params)
2016-01-14 15:52:10.018 2115 ERROR gnocc...

Revision history for this message
Julien Danjou (jdanjou) wrote :

Ok, so the error is different, but the problem is the same. I'm trying to find what may cause it. I'll probably send a few patches for you to try in the next few days.

Revision history for this message
Julien Danjou (jdanjou) wrote :

To reply to your question: no, this is a bug; one host with 16GB is _more than_ enough.

Revision history for this message
Nicolas Vila (nvlan) wrote :

I've just realized that the log extract I posted doesn't seem to be complete; would you like me to upload it as an attachment?

Thanks, regards.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to gnocchi (master)

Fix proposed to branch: master
Review: https://review.openstack.org/267692

Changed in gnocchi:
assignee: nobody → Julien Danjou (jdanjou)
status: Incomplete → In Progress
Revision history for this message
Julien Danjou (jdanjou) wrote : Re: gnocchi-metricd uses all memory and segfaults

I tried to guess the problem, and I wrote https://review.openstack.org/267692

Can you give it a try and tell me if it fixes your issue?

Revision history for this message
Julien Danjou (jdanjou) wrote :

It is complete; there's a "Download full text" link at the top of your message. :) I got it, don't worry! :)

Revision history for this message
Nicolas Vila (nvlan) wrote :

Hello Julien,

I just tried it out, but sadly with the same result. The trace is very similar to the one in comment #10; the last 10 lines are:

(...)
2016-01-14 17:16:34.002 13837 ERROR gnocchi.storage return self._generate_generic_binary(binary, opstring, **kw)
2016-01-14 17:16:34.002 13837 ERROR gnocchi.storage File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/compiler.py", line 938, in _generate_generic_binary
2016-01-14 17:16:34.002 13837 ERROR gnocchi.storage binary.right._compiler_dispatch(self, **kw)
2016-01-14 17:16:34.002 13837 ERROR gnocchi.storage File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/visitors.py", line 93, in _compiler_dispatch
2016-01-14 17:16:34.002 13837 ERROR gnocchi.storage return meth(self, **kw)
2016-01-14 17:16:34.002 13837 ERROR gnocchi.storage File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/compiler.py", line 524, in visit_grouping
2016-01-14 17:16:34.002 13837 ERROR gnocchi.storage return "(" + grouping.element._compiler_dispatch(self, **kwargs) + ")"
2016-01-14 17:16:34.002 13837 ERROR gnocchi.storage MemoryError
2016-01-14 17:16:34.002 13837 ERROR gnocchi.storage
2016-01-14 17:16:34.831 13837 DEBUG gnocchi.storage [-] Expunging deleted metrics process_background_tasks /usr/local/lib/python2.7/dist-packages/gnocchi/storage/__init__.py:179

Thanks, regards.

Revision history for this message
Nicolas Vila (nvlan) wrote :

Hello Julien,

After more testing, it turns out the patch *does* work. gnocchi-metricd segfaulted when the API received a /v1/status call; as long as no such request is sent, gnocchi-metricd has been working fine for the last couple of hours.

Thanks, regards.

Revision history for this message
Julien Danjou (jdanjou) wrote :

Cool, Nicolas, that confirms my first intuition. It's normal that the API call breaks it, since I didn't patch that code path.

My patch is not complete yet; it only covers your particular problem with metricd and Ceph. I'll complete it soon! Thanks for your help!

Revision history for this message
Julien Danjou (jdanjou) wrote :

Nicolas, I've sent a different and simpler version of the patch.
Can you stop metricd for a while so the Ceph pool fills with some new metrics, and then start metricd with this patch to see if it works without any problem?
The API should work fine with that new patch too.

Changed in gnocchi:
importance: Undecided → Critical
Revision history for this message
Nicolas Vila (nvlan) wrote :

Hello Julien,

So far, so good! The new patch has been working for nearly one hour, and memory usage is still within normal values (nearly 9GB out of 16GB). I have 5 days of metrics to process, so it's going to be working for a while. I've executed a "gnocchi status" from the client and am waiting for its reply. I'll let you know how it goes.

Thanks, regards!

Revision history for this message
Julien Danjou (jdanjou) wrote :

Cool! Yeah, seeing the number of measures you accumulated, I'm not surprised. If you have some statistics on the processing rate, etc., I'd be glad to hear them – even if they're not related to this bug. ;)

Revision history for this message
Nicolas Vila (nvlan) wrote :

Hello Julien,

Unfortunately, the "gnocchi status" call caused metricd to segfault. I'll stop everyone from executing it for now, and stop the ceilometer agents so that metricd can catch up closer to the current date (it's almost six days behind). That will also let me gather some statistics and see whether metricd can process a couple of days uninterrupted.

Many thanks, regards.

Revision history for this message
Julien Danjou (jdanjou) wrote :

Hm, "gnocchi" status has 0 interaction with metricd (in theory). Can you paste me your segfault trace? I imagine it's not a segfault but just the OOM killer kicking in?

Revision history for this message
Nicolas Vila (nvlan) wrote :

Hello Julien,

I've been doing several tests this weekend, and if I use more than one metricd worker (as set in gnocchi.conf), patch sets 1 and 3 eventually segfault. With only one worker, patch set 1 has now been working non-stop for over 24 hours. I'll next try configuring more than one metricd on separate hosts with a Redis server for coordination, using patch set 1 on one and patch set 3 on the other, and let you know the results.

Many thanks, kind regards.

Revision history for this message
Julien Danjou (jdanjou) wrote :

Hi Nicolas,

You say 'segfault', but I don't think it's really a segmentation fault. I understand it crashes, but do you have a backtrace, a log, or anything? It should not crash with 3 workers. Is it the OOM killer killing a process? Do the workers use too much memory? Are they processing metrics?

I'd be interested in having more info to help improve the current situation!

Revision history for this message
Nicolas Vila (nvlan) wrote :
Download full text (10.2 KiB)

Hello Julien,

Without any patch, dmesg shows "segmentation fault". Currently, I have one node using patch set 1; it has been working without issues for over 24 hours with just one worker. I changed the configuration about 30 minutes ago, set workers = 2, and restarted. The gnocchi-metricd.log file shows that it is processing measures, but after a while gnocchi-metricd invoked the oom-killer:

[500580.805296] gnocchi-metricd invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[500580.805307] gnocchi-metricd cpuset=/ mems_allowed=0
[500580.805318] CPU: 2 PID: 14524 Comm: gnocchi-metricd Tainted: G D 3.13.0-46-generic #75-Ubuntu
[500580.805321] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.7.5-20150310_111955-batsu 04/01/2014
[500580.805323] 0000000000000000 ffff8802aeab9a68 ffffffff817212c6 ffff8800a3b51800
[500580.805329] ffff8802aeab9af0 ffffffff8171bb81 0000000000000000 0000000000000000
[500580.805331] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[500580.805333] Call Trace:
[500580.805369] [<ffffffff817212c6>] dump_stack+0x45/0x56
[500580.805375] [<ffffffff8171bb81>] dump_header+0x7f/0x1f1
[500580.805388] [<ffffffff8115299e>] oom_kill_process+0x1ce/0x330
[500580.805403] [<ffffffff812d70e5>] ? security_capable_noaudit+0x15/0x20
[500580.805405] [<ffffffff811530d4>] out_of_memory+0x414/0x450
[500580.805410] [<ffffffff81159440>] __alloc_pages_nodemask+0xa60/0xb80
[500580.805422] [<ffffffff81199caa>] alloc_pages_vma+0x9a/0x140
[500580.805428] [<ffffffff8117a623>] handle_mm_fault+0xb23/0xf00
[500580.805433] [<ffffffff8172d2d4>] __do_page_fault+0x184/0x560
[500580.805436] [<ffffffff81182705>] ? change_protection+0x65/0xb0
[500580.805439] [<ffffffff811828a1>] ? mprotect_fixup+0x151/0x290
[500580.805441] [<ffffffff8172d6ca>] do_page_fault+0x1a/0x70
[500580.805444] [<ffffffff8172cd49>] do_async_page_fault+0x29/0xe0
[500580.805446] [<ffffffff81729b58>] async_page_fault+0x28/0x30
[500580.805448] Mem-Info:
[500580.805450] Node 0 DMA per-cpu:
[500580.805451] CPU 0: hi: 0, btch: 1 usd: 0
[500580.805453] CPU 1: hi: 0, btch: 1 usd: 0
[500580.805454] CPU 2: hi: 0, btch: 1 usd: 0
[500580.805455] CPU 3: hi: 0, btch: 1 usd: 0
[500580.805455] Node 0 DMA32 per-cpu:
[500580.805457] CPU 0: hi: 186, btch: 31 usd: 0
[500580.805458] CPU 1: hi: 186, btch: 31 usd: 0
[500580.805459] CPU 2: hi: 186, btch: 31 usd: 0
[500580.805460] CPU 3: hi: 186, btch: 31 usd: 169
[500580.805461] Node 0 Normal per-cpu:
[500580.805462] CPU 0: hi: 186, btch: 31 usd: 73
[500580.805463] CPU 1: hi: 186, btch: 31 usd: 9
[500580.805464] CPU 2: hi: 186, btch: 31 usd: 0
[500580.805466] CPU 3: hi: 186, btch: 31 usd: 178
[500580.805469] active_anon:3994472 inactive_anon:134 isolated_anon:0
[500580.805469] active_file:160 inactive_file:332 isolated_file:0
[500580.805469] unevictable:0 dirty:0 writeback:0 unstable:0
[500580.805469] free:34674 slab_reclaimable:3233 slab_unreclaimable:5086
[500580.805469] mapped:206 shmem:160 pagetables:10093 bounce:0
[500580.805469] free_cma:0
[500580.805472] Node 0 DMA free:15908kB min:64kB low:80kB high:96...

Revision history for this message
Julien Danjou (jdanjou) wrote :

Hi Nicolas,

Thank you for all this information!

Ok, so we have 2 different issues! The OOM killer is killing metricd *without* the patch, and that makes sense, since the patch reduces memory consumption a lot – that's really what this bug is about.

The segmentation fault you see is probably not a bug in Gnocchi itself, but in Ceph (either in python-rados or librados). That's another issue which is going to be a bit more complicated to track.

So keep it running with the patch. I'll send this bug to Mehdi in case he has an idea about the segfault in Rados. Can you tell me which versions of Ceph and the rados libraries you have installed on your system?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to gnocchi (master)

Reviewed: https://review.openstack.org/267692
Committed: https://git.openstack.org/cgit/openstack/gnocchi/commit/?id=614e13d47fdcaeea9d41bebc214014e0c83a0e83
Submitter: Jenkins
Branch: master

commit 614e13d47fdcaeea9d41bebc214014e0c83a0e83
Author: Julien Danjou <email address hidden>
Date: Thu Jan 14 17:31:14 2016 +0100

    ceph: fix the metric list to process with new measures

    Currently, the list returned in the Ceph driver contains a lot of
    duplicates because it returns a list and not a set. If 1 metric has N new
    measure batches waiting to be processed, the returned list will be of size
    N and not 1.

    Using a set() avoids that issue and the memory drain it implies.

    Closes-Bug: #1533793
    Change-Id: I3a0b726aae14a17a23a365babc1a2537fb4d1052

Changed in gnocchi:
status: In Progress → Fix Committed
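
As a rough illustration of the duplication described in the commit message above (the object-name format here is hypothetical; the real Ceph driver code differs):

# Each pending batch of measures is stored as its own object, so listing the
# pool yields one entry per batch. Building a list of metric ids therefore
# repeats a metric once per batch; building a set keeps each metric once.
pending_objects = [
    "measure_abc_1", "measure_abc_2", "measure_abc_3",  # 3 batches for metric "abc"
    "measure_def_1",                                    # 1 batch for metric "def"
]

def metric_id(obj_name):
    return obj_name.split("_")[1]

metrics_as_list = [metric_id(o) for o in pending_objects]  # ['abc', 'abc', 'abc', 'def']
metrics_as_set = {metric_id(o) for o in pending_objects}   # {'abc', 'def'}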
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to gnocchi (stable/1.3)

Fix proposed to branch: stable/1.3
Review: https://review.openstack.org/269531

Revision history for this message
Nicolas Vila (nvlan) wrote : Re: gnocchi-metricd uses all memory and segfaults

Hello Julien,

These are the versions in use on the gnocchi-api host:

root@gnocchi-api-1:~# dpkg -l | egrep -e "rados|ceph"
ii ceph 0.94.5-1trusty amd64 distributed storage and file system
ii ceph-common 0.94.5-1trusty amd64 common utilities to mount and interact with a ceph storage cluster
ii ceph-fs-common 0.94.5-1trusty amd64 common utilities to mount and interact with a ceph file system
ii ceph-fuse 0.94.5-1trusty amd64 FUSE-based client for the Ceph distributed file system
ii ceph-mds 0.94.5-1trusty amd64 metadata server for the ceph distributed file system
ii libcephfs1 0.94.5-1trusty amd64 Ceph distributed file system client library
ii librados2 0.94.5-1trusty amd64 RADOS distributed object store client library
ii libradosstriper1 0.94.5-1trusty amd64 RADOS striping interface
ii python-cephfs 0.94.5-1trusty amd64 Python libraries for the Ceph libcephfs library
ii python-rados 0.94.5-1trusty amd64 Python libraries for the Ceph librados library

Should I open a new bug report for Mehdi to proceed?

Many thanks, regards.

Revision history for this message
Julien Danjou (jdanjou) wrote :

Nicolas, I guess you can open a bug, but we'll just mark it as Incomplete, since we cannot fix it in Gnocchi: it's likely a Ceph bug, and we don't have much information.

It looks like your Ceph version is a bit old; maybe you could try a more recent version?

summary: - gnocchi-metricd uses all memory and segfaults
+ gnocchi-metricd uses all memory and get killed by OO
summary: - gnocchi-metricd uses all memory and get killed by OO
+ gnocchi-metricd uses all memory and get killed by OOM
summary: - gnocchi-metricd uses all memory and get killed by OOM
+ gnocchi-metricd uses all memory and get killed by OOM killer
Revision history for this message
Nicolas Vila (nvlan) wrote :
Download full text (18.0 KiB)

Hello Julien,

I've separated gnocchi-api from gnocchi-metricd and moved the metricd daemon to a new separate host. I applied patch set 3 and configured gnocchi.conf with workers set to 1 and aggregation_workers_number set to 1. I had gnocchi-metricd running (ps shows 3 processes), and it is segfaulting:

[570576.585198] gnocchi-metricd[30778]: segfault at 24 ip 0000000000558077 sp 00007fffd296c2d0 error 6 in python2.7[400000+2bc000]
[570576.733338] Core dump to |/usr/share/apport/apport 30778 11 0 30778 pipe failed
[570578.384777] init: gnocchi-metricd main process ended, respawning
[571985.831013] gnocchi-metricd[9803]: segfault at 24 ip 0000000000537388 sp 00007fff8e8c5880 error 6 in python2.7[400000+2bc000]
[571985.832969] gnocchi-metricd[9802]: segfault at 24 ip 0000000000558077 sp 00007fff8e8c4de0 error 6 in python2.7[400000+2bc000]
[571985.956518] Core dump to |/usr/share/apport/apport 9803 11 0 9803 pipe failed
[571987.323898] init: gnocchi-metricd main process ended, respawning

The coredump left on /usr/share/apport/apport is as follows:

---

#!/usr/bin/python3

# Collect information about a crash and create a report in the directory
# specified by apport.fileutils.report_dir.
# See https://wiki.ubuntu.com/Apport for details.
#
# Copyright (c) 2006 - 2011 Canonical Ltd.
# Author: Martin Pitt <email address hidden>
#
# This program is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation; either version 2 of the License, or (at your
# option) any later version. See http://www.gnu.org/copyleft/gpl.html for
# the full text of the license.

import sys, os, os.path, subprocess, time, traceback, pwd, io
import signal, inspect, grp, fcntl

import apport, apport.fileutils

#################################################################
#
# functions
#
#################################################################

def check_lock():
    '''Abort if another instance of apport is already running.

    This avoids bringing down the system to its knees if there is a series of
    crashes.'''

    # create a lock file
    lockfile = os.path.join(apport.fileutils.report_dir, '.lock')
    try:
        fd = os.open(lockfile, os.O_WRONLY | os.O_CREAT | os.O_NOFOLLOW)
    except OSError as e:
        error_log('cannot create lock file (uid %i): %s' % (os.getuid(), str(e)))
        sys.exit(1)

    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        error_log('another apport instance is already running, aborting')
        sys.exit(1)

def drop_privileges(pid, partial=False):
    '''Change user and group to match the given target process.'''

    stat = None
    try:
        stat = os.stat('/proc/%s/stat' % pid)
    except OSError as e:
        raise ValueError('Invalid process ID: ' + str(e))

    if partial:
        effective_gid = os.getegid()
        effective_uid = os.geteuid()
    else:
        effective_gid = stat.st_gid
        effective_uid = stat.st_uid

    os.setregid(stat.st_gid, effective_gid)
    os.setreuid(stat.st_uid, effective_uid)
    assert os.getegid() == effective_gid
    assert os.getgid() == stat.st_gid...

Revision history for this message
Julien Danjou (jdanjou) wrote :

Apport is the Python script provided by Canonical to send the core dump to their server. And it seems it crashed too :)

The best way to debug this at this stage is the following:

1. Disable apport with /etc/init.d/apport stop
2. Stop gnocchi-metricd
3. Make sure you have no limit on core dump size: ulimit -c unlimited
4. Run gnocchi-metricd in foreground: gnocchi-metricd --debug
5. Wait for it to crash :)
6. When it segfaults, you should have a "core" file in the current directory
7. With that core file, run gdb as follow: gdb /usr/bin/python2.7 core
8. At the (gdb) prompt type: bt full

That should give you the full backtrace of the segmentation fault and what caused it. It's likely caused by a bug in librados, so it's pretty likely we won't be able to fix it, but it can give us a hint.

Revision history for this message
Nicolas Vila (nvlan) wrote :

Hello Julien,

I followed your steps, but sadly gdb can't recognize the 5.9GB coredump file:

root@gnocchi-metricd:~# file core
core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), too many program header sections (451)
root@gnocchi-metricd:~# gdb /usr/bin/python2.7 core
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/python2.7...Reading symbols from /usr/lib/debug//usr/bin/python2.7...done.
done.
"/root/core" is not a core dump: File format not recognized
(gdb) bt full
No stack.

Is there any other way we could proceed?

Many thanks, regards.

Revision history for this message
Julien Danjou (jdanjou) wrote :

Try that:

1. Disable apport with /etc/init.d/apport stop
2. Stop gnocchi-metricd
3. Run gnocchi-metricd in foreground with gdb: gdb --args /usr/bin/python2.7 /usr/local/bin/gnocchi-metricd --debug
4. (gdb) run
5. Wait for it to crash :)
6. When it segfaults, at the (gdb) prompt type: bt full

The program might be slower while running with gdb.

Revision history for this message
Nicolas Vila (nvlan) wrote :

Hello Julien,

Still no luck. I've followed your last steps but gdb still says no stack:

root@gnocchi-metricd:~# gdb --args /usr/bin/python2.7 /usr/local/bin/gnocchi-metricd --debug
GNU gdb (GDB) 7.6.2
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/python2.7...(no debugging symbols found)...done.
(gdb) run
Starting program: /usr/bin/python2.7 /usr/local/bin/gnocchi-metricd --debug
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffa000
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Option "verbose" from group "DEFAULT" is deprecated for removal. Its value may be silently ignored in the future.
2016-01-20 13:50:15.543 23251 DEBUG gnocchi.service [-] ******************************************************************************** log_opt_values /usr/local/lib/python2.7/dist-packages/oslo_config/cfg.py:2367
(...)
2016-01-20 13:50:16.672 23264 DEBUG gnocchi.storage [-] Processing new and to delete measures process_background_tasks /usr/local/lib/python2.7/dist-packages/gnocchi/storage/__init__.py:182
[Inferior 1 (process 23251) exited normally]
(gdb) bt full
No stack.
(gdb)

I'm going to deploy a fresh metricd host just to be 100% sure this issue is reproducible, and I'll update this bug once I've confirmed it.

Many thanks, regards.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to gnocchi (stable/1.3)

Reviewed: https://review.openstack.org/269531
Committed: https://git.openstack.org/cgit/openstack/gnocchi/commit/?id=ee1740bf12c13f546990f01ef76a4f0b29a78aeb
Submitter: Jenkins
Branch: stable/1.3

commit ee1740bf12c13f546990f01ef76a4f0b29a78aeb
Author: Julien Danjou <email address hidden>
Date: Thu Jan 14 17:31:14 2016 +0100

    ceph: fix the metric list to process with new measures

    Currently, the list returned in the Ceph driver contains a lot of
    duplicates because it returns a list and not a set. If 1 metric has N new
    measure batches waiting to be processed, the returned list will be of size
    N and not 1.

    Using a set() avoids that issue and the memory drain it implies.

    Closes-Bug: #1533793
    Change-Id: I3a0b726aae14a17a23a365babc1a2537fb4d1052
    (cherry picked from commit 614e13d47fdcaeea9d41bebc214014e0c83a0e83)

Julien Danjou (jdanjou)
Changed in gnocchi:
milestone: none → 2.0.0
status: Fix Committed → Fix Released