[MongoDB] couldn't start after rebooting MongoDB nodes due to a stale lock file.

Bug #1682400 reported by Michael Kraynov
This bug affects 1 person
Affects: Mirantis OpenStack
Status: Invalid
Importance: High
Assigned to: Michael Kraynov
Milestone: —

Bug Description

Detailed bug description:
MongoDB cannot be started because of a stale lock file left over from the previous run.
Steps to reproduce:
1. Deploy the latest MOS 9.2 environment.
2. Reboot all three controller/MongoDB nodes.
Expected results:
MongoDB starts and Ceilometer connects to Mongo.
Actual result:
2017-04-12T09:32:35.606011+02:00 cic-2 mongod.27017[27919]: [initandlisten] User Assertion: 12596:old lock file
2017-04-12T09:32:35.606113+02:00 cic-2 mongod.27017[27919]: [initandlisten] exception in initAndListen: 12596 old lock file, terminating
Reproducibility:
Very often
Workaround:
Remove the stale lock file from the MongoDB dbPath and restart mongod (a sketch follows below).
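
A minimal sketch of the reported workaround, assuming the dbPath from the mongod options line in the log below (/var/lib/mongo/mongodb) and the default lock-file name mongod.lock; run it only when mongod is confirmed to be down:

  # verify mongod is really not running before touching the lock file
  pgrep -x mongod >/dev/null && { echo "mongod is still running; aborting"; exit 1; }
  rm -f /var/lib/mongo/mongodb/mongod.lock
  service mongodb start    # service name assumed from the Ubuntu packaging (/etc/mongodb.conf)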

Description of the environment:
- Operating system: Ubuntu
- Versions of components: MOS 9.2

Additional information:

2017-04-12T09:32:35.605246+02:00 cic-2 mongod.27017[27919]: [DataFileSync] BackgroundJob starting: DataFileSync
2017-04-12T09:32:35.605301+02:00 cic-2 mongod.27017[27919]: shardObjTest passed
2017-04-12T09:32:35.605356+02:00 cic-2 mongod.27017[27919]: isInRangeTest passed
2017-04-12T09:32:35.605501+02:00 cic-2 mongod.27017[27919]: shardKeyTest passed
2017-04-12T09:32:35.605551+02:00 cic-2 mongod.27017[27919]: [initandlisten] MongoDB starting : pid=27919 port=27017 dbpath=/var/lib/mongo/mongodb 64-bit host=cic-2.domain.tld
2017-04-12T09:32:35.605629+02:00 cic-2 mongod.27017[27919]: [initandlisten] db version v2.6.10
2017-04-12T09:32:35.605657+02:00 cic-2 mongod.27017[27919]: [initandlisten] git version: nogitversion
2017-04-12T09:32:35.605682+02:00 cic-2 mongod.27017[27919]: [initandlisten] OpenSSL version: OpenSSL 1.0.1f 6 Jan 2014
2017-04-12T09:32:35.605706+02:00 cic-2 mongod.27017[27919]: [initandlisten] build info: Linux 1852ca316e35 3.13.0-55-generic #94-Ubuntu SMP Thu Jun 18 00:27:10 UTC 2015 x86_64 BOOST_LIB_VERSION=1_54
2017-04-12T09:32:35.605729+02:00 cic-2 mongod.27017[27919]: [initandlisten] allocator: tcmalloc
2017-04-12T09:32:35.605760+02:00 cic-2 mongod.27017[27919]: [initandlisten] options: { config: "/etc/mongodb.conf", net: { bindIp: "127.0.0.1,192.168.2.27", port: 27017 }, operationProfiling: { mode: "slowOp" }, replication: { replSetName: "ceilometer" }, security: { authorization: "enabled", keyFile: "/etc/mongodb.key" }, storage: { dbPath: "/var/lib/mongo/mongodb", directoryPerDB: true, journal: { enabled: true } }, systemLog: { destination: "syslog", quiet: true, verbosity: 1 } }
2017-04-12T09:32:35.606011+02:00 cic-2 mongod.27017[27919]: [initandlisten] User Assertion: 12596:old lock file
2017-04-12T09:32:35.606113+02:00 cic-2 mongod.27017[27919]: [initandlisten] exception in initAndListen: 12596 old lock file, terminating
2017-04-12T09:32:35.606141+02:00 cic-2 mongod.27017[27919]: [initandlisten] dbexit:
2017-04-12T09:32:35.606164+02:00 cic-2 mongod.27017[27919]: [initandlisten] shutdown: going to close listening sockets...
2017-04-12T09:32:35.606187+02:00 cic-2 mongod.27017[27919]: [initandlisten] shutdown: going to flush diaglog...
2017-04-12T09:32:35.606209+02:00 cic-2 mongod.27017[27919]: [initandlisten] shutdown: going to close sockets...
2017-04-12T09:32:35.606264+02:00 cic-2 mongod.27017[27919]: [initandlisten] shutdown: waiting for fs preallocator...
2017-04-12T09:32:35.606297+02:00 cic-2 mongod.27017[27919]: [initandlisten] shutdown: lock for final commit...
2017-04-12T09:32:35.606343+02:00 cic-2 mongod.27017[27919]: [initandlisten] shutdown: final commit...
2017-04-12T09:32:35.606384+02:00 cic-2 mongod.27017[27919]: [initandlisten] shutdown: closing all files...
2017-04-12T09:32:35.606414+02:00 cic-2 mongod.27017[27919]: [initandlisten] closeAllFiles() finished
2017-04-12T09:32:35.606447+02:00 cic-2 mongod.27017[27919]: [initandlisten] dbexit: really exiting now

Changed in mos:
importance: Undecided → High
Revision history for this message
Michael Kraynov (mkraynov) wrote :

Impact:
The other two nodes report the following errors:
2017-04-12T18:15:11.478877+02:00 cic-1 mongod.27017[7475]: [rsHealthPoll] couldn't connect to 192.168.2.27:27017: couldn't connect to server 192.168.2.27:27017 (192.168.2.27) failed, connection attempt failed
2017-04-12T18:15:11.487226+02:00 cic-1 mongod.27017[7475]: [rsHealthPoll] Failed to connect to 192.168.2.27:27017, reason: errno:111 Connection refused
2017-04-12T18:15:11.487542+02:00 cic-1 mongod.27017[7475]: [rsHealthPoll] Failed to connect to 192.168.2.27:27017, reason: errno:111 Connection refused
2017-04-12T18:15:11.487563+02:00 cic-1 mongod.27017[7475]: [rsHealthPoll] couldn't connect to 192.168.2.27:27017: couldn't connect to server 192.168.2.27:27017 (192.168.2.27) failed, connection attempt failed

Ceilometer-collector can't connect to the primary node:

2017-04-12T09:14:26.096726+02:00 cic-1 ceilometer-collector[8529]: 2017-04-12 09:14:26.093 8529 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 192.168.2.26:5673 is unreachable: [Errno 111] Connection refused. Trying again in 1 seconds.
2017-04-12T09:14:27.116837+02:00 cic-1 ceilometer-collector[8529]: 2017-04-12 09:14:27.116 8529 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 192.168.2.25:5673 via [amqp] client
2017-04-12T09:14:42.687743+02:00 cic-1 ceilometer-collector[8518]: 2017-04-12 09:14:42.687 8518 WARNING oslo_config.cfg [-] Option "metering_secret" from group "publisher" is deprecated. Use option "telemetry_secret" from group "publisher".
2017-04-12T09:14:42.690329+02:00 cic-1 ceilometer-collector[8517]: 2017-04-12 09:14:42.689 8517 WARNING oslo_config.cfg [-] Option "metering_secret" from group "publisher" is deprecated. Use option "telemetry_secret" from group "publisher".

Changed in mos:
milestone: 9.x-updates → 9.2-mu-2
assignee: nobody → MOS Maintenance (mos-maintenance)
Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

In short: having a lock file in place means that mongodb is either still
running or was not stopped correctly. How exactly were the
servers rebooted?

From [1]:
> Mongodb always creates the mongodb.lock file when the
> server starts and drops it before mongodb is stopped.
>
> Removing mongodb.lock does not affect any data it just
> means that mongodb was not stopped correctly. So, you
> are correct in removing this file and running with the
> -repair option should fix database.

Please also take a look at [2].

[1] http://stackoverflow.com/questions/6857781/what-is-the-use-of-the-mongo-lock-file/6857973#6857973
[2] http://stackoverflow.com/questions/13700261/mongodb-wont-start-after-server-crash

Revision history for this message
Michael Kraynov (mkraynov) wrote :

> How exactly were the servers rebooted?

They were rebooted hard, as if with the reset button. Can we add an additional check to the mongo service that removes a stale lock file? (A hypothetical sketch follows below.)
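
A hypothetical pre-start hook sketching that proposal (paths taken from the log above); it is not part of the MOS packaging, and the next comment questions whether blind removal is safe:

  # hypothetical pre-start step for the mongodb init job
  DBPATH=/var/lib/mongo/mongodb            # dbPath from the mongod options in the log
  LOCK="$DBPATH/mongod.lock"
  # treat a non-empty lock file that no process holds open as stale
  if [ -s "$LOCK" ] && ! fuser "$LOCK" >/dev/null 2>&1; then
      rm -f "$LOCK"
  fi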

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

According to the manual at https://docs.mongodb.com/manual/tutorial/recover-data-following-unexpected-shutdown/, it is not correct to simply remove the lock file: additional steps are needed to recover MongoDB correctly (a sketch of the documented procedure follows below). I am not sure we can (or should) automate this, since recovery can be a fairly involved procedure.
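
A rough sketch of the documented recovery path, assuming the dbPath and service name seen in this report; the user/group on the chown line is also an assumption, and the linked tutorial remains the authoritative reference:

  service mongodb stop                              # make sure mongod is not running
  rm /var/lib/mongo/mongodb/mongod.lock             # only after a confirmed unclean shutdown
  mongod --dbpath /var/lib/mongo/mongodb --repair   # rebuild the data files; may take a while
  chown -R mongodb:mongodb /var/lib/mongo/mongodb   # user/group assumed; --repair run as root changes ownership
  service mongodb start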

Revision history for this message
Michael Kraynov (mkraynov) wrote :

But why couldn't the other two nodes accept connections?

2017-04-12T07:39:04.132922+02:00 cic-3 ceilometer-collector[8245]: 2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector [-] Dispatcher failed to handle the sample, requeue it.
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector Traceback (most recent call last):
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector File "/usr/lib/python2.7/dist-packages/ceilometer/collector.py", line 156, in sample
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector self.dispatcher_manager.map_method(self.method, samples)
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector File "/usr/lib/python2.7/dist-packages/stevedore/extension.py", line 252, in map_method
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector method_name, *args, **kwds)
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector File "/usr/lib/python2.7/dist-packages/stevedore/extension.py", line 225, in map
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector self._invoke_one_plugin(response.append, func, e, args, kwds)
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector File "/usr/lib/python2.7/dist-packages/stevedore/extension.py", line 256, in _invoke_one_plugin
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector response_callback(func(e, *args, **kwds))
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector File "/usr/lib/python2.7/dist-packages/stevedore/extension.py", line 230, in _call_extension_method
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector return getattr(extension.obj, method_name)(*args, **kwds)
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector File "/usr/lib/python2.7/dist-packages/ceilometer/dispatcher/database.py", line 95, in record_metering_data
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector self.meter_conn.record_metering_data(meter)
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector File "/usr/lib/python2.7/dist-packages/ceilometer/storage/impl_mongodb.py", line 266, in record_metering_data
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector return_document=pymongo.ReturnDocument.AFTER,
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector File "/usr/lib/python2.7/dist-packages/ceilometer/storage/mongo/utils.py", line 415, in closure
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector return call(*args, **kwargs)
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector File "/usr/lib/python2.7/dist-packages/ceilometer/storage/mongo/utils.py", line 437, in __call__
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector return self.method(*args, **kwargs)
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector File "/usr/lib/python2.7/dist-packages/pymongo/collection.py", line 1881, in find_one_and_update
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector sort, upsert, return_document, **kwargs)
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector File "/usr/lib/python2.7/dist-packages/pymongo/collection.py", line 1689, in __find_and_modify
2017-04-12 07:39:04.130 8245 ERROR ceilometer.collector with self._socket_for_writes() as sock_info:
2017-04-12 07:39:04.130 8245 ERRO...


Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

MongoDB should keep working with one node down; we found in its logs that a new primary node was elected during the failover:

==========

2017-04-12T01:24:19.092267+02:00 cic-1 mongod.27017[10826]: [rsSync] replSet SECONDARY
2017-04-12T01:24:20.092510+02:00 cic-1 mongod.27017[10826]: [rsMgr] replSet PRIMARY
2017-04-12T01:25:57.500547+02:00 cic-2 mongod.27017[14059]: [rsHealthPoll] replSet member 192.168.2.26:27017 is now in state PRIMARY
2017-04-12T01:26:12.610586+02:00 cic-2 mongod.27017[14059]: [rsSync] replSet SECONDARY
2017-04-12T01:26:12.875462+02:00 cic-1 mongod.27017[10826]: [rsHealthPoll] replSet member 192.168.2.27:27017 is now in state SECONDARY
2017-04-12T01:26:16.190612+02:00 cic-3 mongod.27017[12860]: [rsHealthPoll] replSet member 192.168.2.27:27017 is now in state SECONDARY
2017-04-12T01:26:16.190814+02:00 cic-3 mongod.27017[12860]: [rsHealthPoll] replSet member 192.168.2.26:27017 is now in state PRIMARY
2017-04-12T01:26:31.313368+02:00 cic-3 mongod.27017[12860]: [rsSync] replSet SECONDARY
2017-04-12T01:26:31.407522+02:00 cic-1 mongod.27017[10826]: [rsHealthPoll] replSet member 192.168.2.25:27017 is now in state SECONDARY
2017-04-12T01:26:31.514937+02:00 cic-2 mongod.27017[14059]: [rsHealthPoll] replSet member 192.168.2.25:27017 is now in state SECONDARY

Summary
-------
01:24:19 -> 01:26:31
cic-1 [192.168.2.26] PRIMARY
cic-2 [192.168.2.27] SECONDARY
cic-3 [192.168.2.25] SECONDARY

==========

2017-04-12T07:25:22.451440+02:00 cic-1 mongod.27017[10826]: [rsMgr] replSet SECONDARY
2017-04-12T07:27:04.642237+02:00 cic-3 mongod.27017[7278]: [rsHealthPoll] replSet member 192.168.2.26:27017 is now in state SECONDARY
2017-04-12T07:27:04.872019+02:00 cic-3 mongod.27017[7278]: [rsSync] replSet SECONDARY
2017-04-12T07:27:05.989145+02:00 cic-1 mongod.27017[10826]: [rsHealthPoll] replSet member 192.168.2.25:27017 is now in state SECONDARY
2017-04-12T07:27:12.123031+02:00 cic-1 mongod.27017[10826]: [rsMgr] replSet PRIMARY
2017-04-12T07:27:12.644713+02:00 cic-3 mongod.27017[7278]: [rsHealthPoll] replSet member 192.168.2.26:27017 is now in state PRIMARY

Summary
-------
07:25:22 -> 07:27:12
cic-1 [192.168.2.26] PRIMARY
cic-2 [192.168.2.27] DOWN?
cic-3 [192.168.2.25] SECONDARY

==========

2017-04-12T07:39:06.153556+02:00 cic-1 mongod.27017[7613]: [rsHealthPoll] replSet member 192.168.2.25:27017 is now in state SECONDARY
2017-04-12T07:39:06.156648+02:00 cic-1 mongod.27017[7613]: [rsSync] replSet SECONDARY
2017-04-12T07:39:07.156990+02:00 cic-1 mongod.27017[7613]: [rsMgr] replSet PRIMARY
2017-04-12T07:39:07.978172+02:00 cic-3 mongod.27017[7278]: [rsHealthPoll] replSet member 192.168.2.26:27017 is now in state PRIMARY

Summary
-------
07:39:06 -> 07:39:07
cic-1 [192.168.2.26] PRIMARY
cic-2 [192.168.2.27] DOWN?
cic-3 [192.168.2.25] SECONDARY

==========

2017-04-12T07:40:37.026948+02:00 cic-1 mongod.27017[28754]: [rsSync] replSet SECONDARY
2017-04-12T07:40:37.036988+02:00 cic-1 mongod.27017[28754]: [rsHealthPoll] replSet member 192.168.2.25:27017 is now in state SECONDARY
2017-04-12T07:40:38.015850+02:00 cic-3 mongod.27017[7278]: [rsHealthPoll] replSet member 192.168.2.26:27017 is now in state SECONDARY
2017-04-12T07:40:44.373900+02:00 cic-3 mongod.27017[727...


Changed in mos:
assignee: MOS Maintenance (mos-maintenance) → MOS Ceilometer (mos-ceilometer)
Changed in mos:
assignee: MOS Ceilometer (mos-ceilometer) → Alexey Stupnikov (astupnikov)
Revision history for this message
Alexey Stupnikov (astupnikov) wrote :

I have tried to reproduce the described issue and understand what is going on. I used our virtualized lab, destroyed the controllers' VMs with "virsh destroy" and started them again with "virsh create"; a sketch of the sequence follows below.
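
A sketch of that hard-reset sequence, with illustrative domain names:

  for vm in cic-1 cic-2 cic-3; do virsh destroy "$vm"; done   # hard power-off, like a reset button
  for vm in cic-1 cic-2 cic-3; do virsh start "$vm"; done     # or "virsh create <domain.xml>" for transient domains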

I was unable to reproduce the issue: mongod started successfully even though the lock file had not been deleted [1].

IMO mongod tries to recover from the failure automatically and aborts initialization if the recovery fails [2].

From my point of view, we have the following situation: mongod fails to recover and breaks the cluster. I don't think there is anything wrong with that behavior.

[1] https://paste.mirantis.net/show/11402/
[2] https://paste.mirantis.net/show/11403/

Revision history for this message
Alexey Stupnikov (astupnikov) wrote :

Moving the bug to Incomplete. I would like to ask the reporter to clarify the steps to reproduce (cluster load, cluster configuration, reboot method, lab environment).

Changed in mos:
status: New → Incomplete
assignee: Alexey Stupnikov (astupnikov) → Michael Kraynov (mkraynov)
milestone: 9.2-mu-2 → 9.x-updates
Revision history for this message
Alexey Stupnikov (astupnikov) wrote :

Moving to Invalid, since the bug has not been updated for more than 40 days.

Changed in mos:
status: Incomplete → Invalid