Multiple node online/offline

Bug #1612670 reported by Sergey Galkin
This bug affects 1 person
Affects             Status   Importance  Assigned to      Milestone
Fuel for OpenStack  Invalid  High        Georgy Kibardin
Mitaka              Invalid  High        Georgy Kibardin

Bug Description

During deployment of a cluster with 483 nodes, several nodes flip from online to offline and back.
Ping to these nodes is OK.

200 packets transmitted, 200 received, 0% packet loss, time 199001ms
rtt min/avg/max/mdev = 0.109/0.198/0.361/0.044 ms

messages.log on all of these nodes contains the following:

2016-08-12T13:57:19.540306+00:00 notice: nailgun-agent: I, [2016-08-12T13:57:19.126847 #37361] INFO -- : API URL is https://10.21.0.2:8443/api
2016-08-12T13:57:34.482304+00:00 notice: nailgun-agent: /usr/lib/ruby/vendor_ruby/httpclient/session.rb:803:in `initialize': execution expired (HTTPClient::ConnectTimeoutError)
2016-08-12T13:57:34.482399+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient/session.rb:803:in `new'
2016-08-12T13:57:34.482589+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient/session.rb:803:in `create_socket'
2016-08-12T13:57:34.482673+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient/session.rb:752:in `block in connect'
2016-08-12T13:57:34.482886+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient/session.rb:751:in `connect'
2016-08-12T13:57:34.483030+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient/session.rb:609:in `query'
2016-08-12T13:57:34.483131+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient/session.rb:164:in `query'
2016-08-12T13:57:34.483262+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient.rb:1083:in `do_get_block'
2016-08-12T13:57:34.483411+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient.rb:887:in `block in do_request'
2016-08-12T13:57:34.483538+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient.rb:981:in `protect_keep_alive_disconnected'
2016-08-12T13:57:34.483672+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient.rb:886:in `do_request'
2016-08-12T13:57:34.483805+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient.rb:774:in `request'
2016-08-12T13:57:34.483961+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient.rb:677:in `get'
2016-08-12T13:57:34.484066+00:00 notice: nailgun-agent: #011from /usr/bin/nailgun-agent:166:in `get_scheme_and_port'
2016-08-12T13:57:34.484220+00:00 notice: nailgun-agent: #011from /usr/bin/nailgun-agent:154:in `initialize'
2016-08-12T13:57:34.484342+00:00 notice: nailgun-agent: #011from /usr/bin/nailgun-agent:1131:in `new'
2016-08-12T13:57:34.484487+00:00 notice: nailgun-agent: #011from /usr/bin/nailgun-agent:1131:in `<main>'
2016-08-12T13:57:56.606626+00:00 notice: nailgun-agent: at depth 0 - 18: self signed certificate
2016-08-12T13:57:56.606789+00:00 notice: nailgun-agent: I, [2016-08-12T13:57:56.261143 #37361] INFO -- : MCollective is up to date with identity = 635
2016-08-12T13:57:56.607001+00:00 notice: nailgun-agent: I, [2016-08-12T13:57:56.261407 #37361] INFO -- : Wrote data to file '/etc/nailgun_uid'. Data: 635
2016-08-12T13:58:21.651079+00:00 notice: nailgun-agent: at depth 0 - 18: self signed certificate
2016-08-12T13:58:21.651250+00:00 notice: nailgun-agent: I, [2016-08-12T13:58:21.508248 #37773] INFO -- : API URL is https://10.21.0.2:8443/api
2016-08-12T13:58:30.083656+00:00 notice: nailgun-agent: /usr/lib/ruby/vendor_ruby/httpclient/session.rb:803:in `initialize': execution expired (HTTPClient::ConnectTimeoutError)
2016-08-12T13:58:30.083776+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient/session.rb:803:in `new'
2016-08-12T13:58:30.083929+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient/session.rb:803:in `create_socket'
2016-08-12T13:58:30.084079+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient/session.rb:752:in `block in connect'
2016-08-12T13:58:30.084220+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient/session.rb:751:in `connect'
2016-08-12T13:58:30.084360+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient/session.rb:609:in `query'
2016-08-12T13:58:30.084495+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient/session.rb:164:in `query'
2016-08-12T13:58:30.084628+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient.rb:1083:in `do_get_block'
2016-08-12T13:58:30.084764+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient.rb:887:in `block in do_request'
2016-08-12T13:58:30.084905+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient.rb:981:in `protect_keep_alive_disconnected'
2016-08-12T13:58:30.085040+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient.rb:886:in `do_request'
2016-08-12T13:58:30.085172+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient.rb:774:in `request'
2016-08-12T13:58:30.085305+00:00 notice: nailgun-agent: #011from /usr/lib/ruby/vendor_ruby/httpclient.rb:677:in `get'
2016-08-12T13:58:30.085436+00:00 notice: nailgun-agent: #011from /usr/bin/nailgun-agent:166:in `get_scheme_and_port'
2016-08-12T13:58:30.085569+00:00 notice: nailgun-agent: #011from /usr/bin/nailgun-agent:154:in `initialize'
2016-08-12T13:58:30.085701+00:00 notice: nailgun-agent: #011from /usr/bin/nailgun-agent:1131:in `new'
2016-08-12T13:58:30.085832+00:00 notice: nailgun-agent: #011from /usr/bin/nailgun-agent:1131:in `<main>'
2016-08-12T13:59:01.718390+00:00 notice: nailgun-agent: at depth 0 - 18: self signed certificate
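
Note that a successful ping only shows ICMP reachability, while the agent fails at the TCP connect stage (HTTPClient::ConnectTimeoutError). A minimal sketch of a TCP-level probe that could be run from an affected node is given below. The function name `can_connect` and the default timeout are assumptions made for illustration and are not part of nailgun-agent; the host and port in the usage note are taken from the API URL in the log above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for diagnostics only; it does not perform a TLS
    handshake or an HTTP request, it only checks that the port accepts
    connections.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_connect("10.21.0.2", 8443)` returning False while ping succeeds would point at the API endpoint (or something in between) rather than at basic network connectivity.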

Sergey Galkin (sgalkin) wrote :
Changed in fuel:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
milestone: none → 10.0
tags: added: area-python
Georgy Kibardin (gkibardin) wrote :

Could you please gather a diagnostic snapshot?

Changed in fuel:
status: Confirmed → Incomplete
Sergey Galkin (sgalkin) wrote :

The snapshot is available at http://mos-scale-share.mirantis.com/fuel-snapshot-2016-08-15_12-20-55.tar.gz

A screenshot of the nodes is attached.

Changed in fuel:
status: Incomplete → Confirmed
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Georgy Kibardin (gkibardin)
Georgy Kibardin (gkibardin) wrote :

BTW, log rotation, at least for nailgun, is broken: I can only see the first 3 hours of logs for Aug 11 and 12.

Georgy Kibardin (gkibardin) wrote :

There are 3 nodes with this error. Only one of them, 10.21.1.243, has a lot of the errors mentioned.

The logs from this node also contain strange messages like this:

2016-08-12T14:59:55.736124+00:00 debug: 14:59:55.548728 #2061] DEBUG -- : Response: status: 409 body: {"message": "Node with mac 3C:FD:FE:9C:B1:1C already exists - doing nothing", "errors": []}

kernel.log from 10.21.1.36 contains the following:

2016-08-11T19:04:18.215408+00:00 info: [ 4.440441] i40e 0000:02:00.0: MAC address: 3c:fd:fe:9c:b1:1c

So the leading hypothesis is a MAC address clash; however, I cannot prove it, because the snapshot contains no data (except some logs) from these nodes.
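
One way to look for further evidence of a clash is to scan the kernel logs of all nodes for MAC addresses reported by more than one node. The following is only a sketch under stated assumptions: the regular expression is modelled on the i40e line quoted above, and the function name `find_mac_clashes` and its input shape (a dict of node name to log lines) are hypothetical, not part of Fuel or the snapshot tooling.

```python
import re
from collections import defaultdict

# Modelled on kernel log lines such as:
#   i40e 0000:02:00.0: MAC address: 3c:fd:fe:9c:b1:1c
# Other drivers may log MAC addresses in a different format.
MAC_RE = re.compile(
    r"MAC address:\s*((?:[0-9a-f]{2}:){5}[0-9a-f]{2})", re.IGNORECASE
)


def find_mac_clashes(logs_by_node):
    """Return {mac: [nodes]} for every MAC seen in the logs of 2+ nodes.

    logs_by_node: dict mapping a node name to an iterable of kernel.log
    lines (hypothetical input shape chosen for this example).
    """
    owners = defaultdict(set)
    for node, lines in logs_by_node.items():
        for line in lines:
            match = MAC_RE.search(line)
            if match:
                # Normalize case so 3C:FD:... and 3c:fd:... compare equal.
                owners[match.group(1).lower()].add(node)
    return {mac: sorted(nodes) for mac, nodes in owners.items() if len(nodes) > 1}
```

An empty result would not rule out a clash, since the snapshot may simply lack logs from the nodes involved.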

Changed in fuel:
status: Confirmed → Incomplete
Stanislaw Bogatkin (sbogatkin) wrote :

I believe we can't catch this kind of problem in our usual tests, so I don't want to close it yet. Sergey, could you please provide more information about this problem? Is your cluster still alive?

Changed in fuel:
assignee: Georgy Kibardin (gkibardin) → Sergey Galkin (sgalkin)
Changed in fuel:
status: Incomplete → New
status: New → Incomplete
Vitaly Sedelnik (vsedelnik) wrote :

Invalid, as the bug stayed in Incomplete for more than a month. Please reopen if you get additional information.

Changed in fuel:
status: Incomplete → Invalid
Sergey Novikov (snovikov) wrote :

I've caught a similar error. The node is offline for nailgun after the reboot, but it is reachable via ssh/ping. Below is part of nailgun-agent.log from this node:

I, [2016-12-27T09:53:39.328070 #2914] INFO -- : API URL is https://10.109.36.2:8443/api
/usr/bin/nailgun-agent:338:in `_network': undefined method `gsub' for nil:NilClass (NoMethodError)
        from /usr/bin/nailgun-agent:168:in `initialize'
        from /usr/bin/nailgun-agent:1363:in `new'
        from /usr/bin/nailgun-agent:1363:in `<main>'
at depth 0 - 18: self signed certificate

If you need an env with the issue reproduced, I can prepare one for you; just let me know.

Changed in fuel:
status: Invalid → Confirmed
Alexander Kurenyshev (akurenyshev) wrote :

@Georgy, please contact Sergey Novikov and take a look at his env; it may or may not be the same problem.
If you need more info, please tell us exactly what you want to get.

Changed in fuel:
assignee: Sergey Galkin (sgalkin) → Georgy Kibardin (gkibardin)
Georgy Kibardin (gkibardin) wrote :

The cause seems to be different: one node just went offline completely, with the following in nailgun-agent.log:
I, [2017-01-10T13:29:13.228846 #5841] INFO -- : API URL is https://10.109.5.2:8443/api
/usr/bin/nailgun-agent:338:in `_network': undefined method `gsub' for nil:NilClass (NoMethodError)
        from /usr/bin/nailgun-agent:168:in `initialize'
        from /usr/bin/nailgun-agent:1385:in `new'
        from /usr/bin/nailgun-agent:1385:in `<main>'

Georgy Kibardin (gkibardin) wrote :

The cause of this exception is the absence of a default route:

root@node-2:~# ip route
10.109.5.0/24 dev br-fw-admin proto kernel scope link src 10.109.5.4
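
A check for this condition could look roughly like the sketch below, which inspects the text printed by `ip route` for a line starting with `default`. The function name `has_default_route` is hypothetical, and this is not how nailgun-agent itself reads routes; it only illustrates the precondition that appears to be missing on this node.

```python
def has_default_route(ip_route_output: str) -> bool:
    """Return True if the given `ip route` output contains a default route.

    Hypothetical helper: it only looks for a line whose first token is
    'default' and ignores policy routing tables and IPv6.
    """
    return any(
        line.split()[:1] == ["default"]
        for line in ip_route_output.splitlines()
    )
```

For the output shown above, which contains only the 10.109.5.0/24 link route, the function returns False.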

Georgy Kibardin (gkibardin) wrote :

And there is only one interface, which is also wrong. Let's create a new bug to figure out the reason for the incorrect setup and to make nailgun-agent more robust.

Changed in fuel:
status: Confirmed → Incomplete
Sergey Novikov (snovikov) wrote :
Oleksiy Molchanov (omolchanov) wrote :

Marking as Invalid because there has been no activity for more than a month.

Changed in fuel:
status: Incomplete → Invalid