ohai: yet another sporadic segfault

Bug #1572485 reported by Dmitry Guryanov
This bug affects 2 people
Affects              Status         Importance  Assigned to  Milestone
Fuel for OpenStack   Fix Committed  High        MOS Linux
Mitaka               Fix Released   High        MOS Linux

Bug Description

There is a bug in ohai. If you run this script:

require 'ohai'
100.times do
  GC.disable
  500.times do
    os = Ohai::System.new
    os.all_plugins
  end
  GC.enable
  GC.start
  sleep 2
end

it will fail with a segfault after 1000-10000 iterations, with a backtrace similar to

http://www.paste.org/80856

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: none → 10.0
assignee: nobody → MOS Linux (mos-linux)
Ivan Suzdal (isuzdal) wrote :

Please provide a link to the test results affected by this bug.

Changed in fuel:
status: Confirmed → Incomplete
status: Incomplete → New
status: New → Incomplete
importance: High → Undecided
status: Incomplete → New
Maksim Malchuk (mmalchuk) wrote :

https://custom-ci.infra.mirantis.net/view/9.0/job/9.0.custom.system_test/218/artifact/logs/fail_error_reinstall_single_regular_controller_node-fuel-snapshot-2016-04-22_10-51-54.tar.xz

nailgun.test.domain.local/var/log/remote/node-4.test.domain.local.bak/bootstrap/agent.log

2016-04-22T09:29:24.305979+00:00 info: /usr/lib/ruby/vendor_ruby/ohai/mixin/from_file.rb:29: [BUG] rb_gc_mark(): unknown data type 0x10(0x7fc3759ab7b8) non object
2016-04-22T09:29:24.306107+00:00 info: ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux]
2016-04-22T09:29:24.306235+00:00 info:
2016-04-22T09:29:24.306370+00:00 info: -- Control frame information -----------------------------------------------
2016-04-22T09:29:24.306501+00:00 info: c:0022 p:---- s:0092 b:0092 l:000091 d:000091 CFUNC :instance_eval
2016-04-22T09:29:24.306626+00:00 info: c:0021 p:0066 s:0086 b:0086 l:000085 d:000085 METHOD /usr/lib/ruby/vendor_ruby/ohai/mixin/from_file.rb:29
2016-04-22T09:29:24.306750+00:00 info: c:0020 p:0094 s:0082 b:0082 l:000071 d:000081 BLOCK /usr/lib/ruby/vendor_ruby/ohai/system.rb:215
2016-04-22T09:29:24.306874+00:00 info: c:0019 p:---- s:0077 b:0077 l:000076 d:000076 FINISH
2016-04-22T09:29:24.306997+00:00 info: c:0018 p:---- s:0075 b:0075 l:000074 d:000074 CFUNC :each
2016-04-22T09:29:24.307132+00:00 info: c:0017 p:0145 s:0072 b:0072 l:000071 d:000071 METHOD /usr/lib/ruby/vendor_ruby/ohai/system.rb:210
2016-04-22T09:29:24.307132+00:00 info: c:0016 p:0118 s:0066 b:0066 l:000046 d:000065 BLOCK /usr/lib/ruby/vendor_ruby/ohai/system.rb:139
2016-04-22T09:29:24.307279+00:00 info: c:0015 p:---- s:0060 b:0060 l:000059 d:000059 FINISH
2016-04-22T09:29:24.307402+00:00 info: c:0014 p:---- s:0058 b:0058 l:000057 d:000057 CFUNC :each
2016-04-22T09:29:24.307525+00:00 info: c:0013 p:0079 s:0055 b:0055 l:000046 d:000054 BLOCK /usr/lib/ruby/vendor_ruby/ohai/system.rb:132
2016-04-22T09:29:24.307648+00:00 info: c:0012 p:---- s:0052 b:0052 l:000051 d:000051 FINISH
2016-04-22T09:29:24.307819+00:00 info: c:0011 p:---- s:0050 b:0050 l:000049 d:000049 CFUNC :each
2016-04-22T09:29:24.307951+00:00 info: c:0010 p:0035 s:0047 b:0047 l:000046 d:000046 METHOD /usr/lib/ruby/vendor_ruby/ohai/system.rb:130
2016-04-22T09:29:24.307951+00:00 info: c:0009 p:0031 s:0044 b:0044 l:000640 d:000043 BLOCK /usr/bin/nailgun-agent:195
2016-04-22T09:29:24.308088+00:00 info: c:0008 p:0111 s:0041 b:0041 l:000750 d:000750 METHOD /usr/lib/ruby/1.9.1/timeout.rb:69
2016-04-22T09:29:24.308215+00:00 info: c:0007 p:0019 s:0029 b:0029 l:000640 d:000640 METHOD /usr/bin/nailgun-agent:193
2016-04-22T09:29:24.308343+00:00 info: c:0006 p:0125 s:0025 b:0025 l:000024 d:000024 METHOD /usr/bin/nailgun-agent:157
2016-04-22T09:29:24.308472+00:00 info: c:0005 p:---- s:0019 b:0019 l:000018 d:000018 FINISH
2016-04-22T09:29:24.308599+00:00 info: c:0004 p:---- s:0017 b:0017 l:000016 d:000016 CFUNC :new
2016-04-22T09:29:24.308726+00:00 info: c:0003 p:0598 s:0013 b:0013 l:001398 d:001880 EVAL /usr/bin/nailgun-agent:1058
2016-04-22T09:29:24.308852+00:00 info: c:0002 p:---- s:0004 b:0004 l:000003 d:000003 FINISH
2016-04-22T09:29:24.308976+00:00 info: c:0001 p:0000 s:0002 b:0002 l:001398 d:001398 TOP
2016-04-22T09:29:24.309099+00...


tags: added: swarm-blocker
Alexander Gordeev (a-gordeev) wrote :

actually, it produces scary kernel stack traces due to that.

Alexander Gordeev (a-gordeev) wrote :

However, the operating system still runs fine and remains usable.

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/ohai (master)

Fix proposed to branch: master
Change author: Ivan Suzdal <email address hidden>
Review: https://review.fuel-infra.org/20030

Alexei Sheplyakov (asheplyakov) wrote : Re: MOS version of ohai fails randomly

The "fix" is wrong and reintroduces LP #1463835. Perhaps there's more unwrapped Sigar.new calls in the code.
To solve the problem for real one should find such calls and wrap them as this patch does: https://review.fuel-infra.org/gitweb?p=packages/trusty/ohai.git;a=blob;f=debian/patches/Mirantis-sigar-segfault-workaround.patch;h=e76e28e20a50c5fbedcd65492e55ac5dcedcab91;hb=7219d5622e11e8a49f5e4713d7405986ea63c176

Alexei Sheplyakov (asheplyakov) wrote :

> Perhaps there's more unwrapped Sigar.new calls in the code.

The actual error in the stack trace [1]

> /usr/lib/ruby/vendor_ruby/ohai/mixin/from_file.rb:29: [BUG] rb_gc_mark(): unknown data type 0x10(0x7fe220c587b8) non object

looks very similar to that in LP: #1463835, see the comment [2] therein. That might be either an unwrapped Sigar object
or a different careless use of rb_gc_mark.

[1] https://bugs.launchpad.net/fuel/+bug/1572485/+attachment/4643100/+files/nailgun-agent.log
[2] https://bugs.launchpad.net/fuel/+bug/1463835/comments/12

Happy debugging,
          Alexey

Alexei Sheplyakov (asheplyakov) wrote :

> it produces scary kernel stack traces due to that

2016-04-22T12:05:11Z info kernel: [ 381.351591] potentially unexpected fatal signal 6.
2016-04-22T12:05:11Z warning kernel: [ 381.352169] CPU: 0 PID: 3086 Comm: ruby Not tainted 3.13.0-86-generic #130-Ubuntu
2016-04-22T12:05:11Z warning kernel: [ 381.352939] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20151012_155810-obs-1 04/01/2014
2016-04-22T12:05:11Z warning kernel: [ 381.353925] task: ffff8800b7890000 ti: ffff8800a6444000 task.ti: ffff8800a6444000
2016-04-22T12:05:11Z warning kernel: [ 381.354677] RIP: 0033:[<00007fdd508dccc9>] [<00007fdd508dccc9>] 0x7fdd508dccc9
2016-04-22T12:05:11Z warning kernel: [ 381.355478] RSP: 002b:00007ffe21e13f98 EFLAGS: 00000202
2016-04-22T12:05:11Z warning kernel: [ 381.356041] RAX: 0000000000000000 RBX: 00007fdd50dd09a0 RCX: ffffffffffffffff
2016-04-22T12:05:11Z warning kernel: [ 381.356770] RDX: 0000000000000006 RSI: 0000000000000c0e RDI: 0000000000000c0e
2016-04-22T12:05:11Z warning kernel: [ 381.357487] RBP: 0000000000f4a1a0 R08: 00007fdd50c669d0 R09: 00007fdd50dcf058
2016-04-22T12:05:11Z warning kernel: [ 381.358201] R10: 0000000000000008 R11: 0000000000000202 R12: 00007fdd50dd05a4
2016-04-22T12:05:11Z warning kernel: [ 381.358916] R13: 00007fdd50dd0620 R14: 00007ffe21e141d0 R15: 0000000000000000
2016-04-22T12:05:11Z warning kernel: [ 381.359636] FS: 00007fdd5129a740(0000) GS:ffff8800bec00000(0000) knlGS:0000000000000000
2016-04-22T12:05:11Z warning kernel: [ 381.360470] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2016-04-22T12:05:11Z warning kernel: [ 381.361072] CR2: 00007fdd508e0033 CR3: 00000000a6a6c000 CR4: 00000000001407f0

The program has been terminated due to an assertion failure in the Ruby interpreter (SIGABRT). What exactly is scary about it?
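
For readers who want to check how an abort looks from the parent's side, here is a minimal sketch. It assumes a POSIX `sh` is available; the child merely sends SIGABRT to itself as a stand-in for the interpreter calling abort(), so it is not the ohai failure itself.

```ruby
# Look up the number behind the "fatal signal 6" kernel message.
abrt = Signal.list['ABRT']

# Stand-in child: a shell that sends SIGABRT to itself.
pid = Process.spawn('sh', '-c', 'kill -s ABRT $$')
_, status = Process.wait2(pid)

puts "ABRT is signal #{abrt}"
puts "child killed by signal: #{status.signaled?} (termsig #{status.termsig})"
```

The parent and the rest of the system keep running; only the aborted process is gone, which matches the observation that the node stays usable.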

Alexei Sheplyakov (asheplyakov) wrote :

> if will fail after 100-500 iterations with

It runs just fine after 1400 iterations. What am I doing wrong?

Alexei Sheplyakov (asheplyakov) wrote :

Marking as Incomplete since there's no systematic way to reproduce the problem (and to test if proposed fixes actually solve it)

Maksim Malchuk (mmalchuk) wrote :

Alexey the bug 100% reproduced on custom test. Please read all the comments above!

Alexei Sheplyakov (asheplyakov) wrote :

@mmalchuk:

> the bug 100% reproduced on custom test

The `custom test' (whatever it is) is absolutely useless since a ruby/C developer has no way to run it.
The script in the bug description is an example of a reasonable test - anyone who knows ruby
can run it and get the result in a few minutes. Unfortunately that script does not reproduce the problem for me, hence I've marked the bug as Incomplete.
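
Since the crash is sporadic, a single run says little either way. A rough sketch of a harness that runs a reproducer in several separate Ruby processes and tallies how each one ended might look like the following. `tally_runs` is a hypothetical helper name, not part of ohai or nailgun-agent, and the `'exit 0'` script is only a placeholder for the script under test.

```ruby
require 'rbconfig'

# Hypothetical helper: run `script_source` in `runs` separate Ruby processes
# and count how each one ended. An interpreter crash (segfault or abort)
# shows up as termination by a signal rather than a normal exit.
def tally_runs(script_source, runs)
  tally = Hash.new(0)
  runs.times do
    pid = Process.spawn(RbConfig.ruby, '-e', script_source,
                        out: File::NULL, err: File::NULL)
    _, status = Process.wait2(pid)
    key = if status.signaled?
            "signal #{status.termsig}"
          else
            "exit #{status.exitstatus}"
          end
    tally[key] += 1
  end
  tally
end

if __FILE__ == $PROGRAM_NAME
  # Placeholder: substitute the script from the bug description here.
  repro = 'exit 0'
  p tally_runs(repro, 5)
end
```

Running each attempt in its own process keeps one crash from ending the whole experiment and gives a failure rate that can be compared before and after a proposed fix.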

description: updated
summary: - MOS version of ohai fails randomly
+ ohai: yet another sporadic segfault
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/ruby-sigar (master)

Fix proposed to branch: master
Change author: Alexei Sheplyakov <email address hidden>
Review: https://review.fuel-infra.org/20149

tags: added: centos-72-target
tags: added: centos72-target
removed: centos-72-target
tags: removed: centos72-target
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/trusty/ruby-sigar (master)

Reviewed: https://review.fuel-infra.org/20149
Submitter: Pkgs Jenkins <email address hidden>
Branch: master

Commit: f7c2f5a5b80577014117a9ff5f69417692154f80
Author: Alexei Sheplyakov <email address hidden>
Date: Wed Apr 27 11:31:34 2016

Added custom ruby-sigar package to fix segfaults for real

rb_sigar_new does not initialize rb_sigar_t->logger, as a result ruby
runtime garbage collection routines sometimes fail either with a segfault
or an assertion failure:

$ cat test2.rb

 require 'sigar'
 100.times do
   GC.disable
   sigars = []
   100.times do
     sigar = Sigar.new
     sigars << sigar
   end
   GC.enable
   GC.start
   sleep 2
 end

$ ruby test2.rb
/home/asheplyakov/tmp/test2.rb:11: [BUG] rb_gc_mark(): unknown data type 0x0(0x7fbff5cc07c8) non object
ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux]
[skipped]

Initialize rb_sigar_t->logger to avoid the problem.

The original source has been downloaded from
http://archive.ubuntu.com/ubuntu/pool/universe/r/ruby-sigar/ruby-sigar_0.7.2.orig.tar.gz
http://archive.ubuntu.com/ubuntu/pool/universe/r/ruby-sigar/ruby-sigar_0.7.2-2.debian.tar.gz

Change-Id: I8a1e95fada16ac7b8fa0ace0cfcf83a624935675
Closes-Bug: #1572485
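
A side note for anyone testing similar problems: MRI's `GC.stress` setting makes the collector run at every opportunity, which is slow but tends to surface objects with bad GC mark state much sooner than waiting for a natural collection. Below is a minimal sketch; `with_gc_stress` is a hypothetical helper, and the `Object.new` workload is only a stand-in (for this bug one would presumably allocate `Sigar.new` objects instead, which requires the ruby-sigar package). Whether it reproduces this particular crash faster has not been verified here.

```ruby
# Hypothetical helper: run the block with GC.stress enabled and restore the
# previous setting afterwards, even if the block raises.
def with_gc_stress
  previous = GC.stress
  GC.stress = true
  yield
ensure
  GC.stress = previous
end

if __FILE__ == $PROGRAM_NAME
  # Stand-in workload; replace Object.new with the allocation under test.
  with_gc_stress { 100.times { Object.new } }
  puts 'finished without a crash'
end
```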

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/ruby-sigar (9.0)

Fix proposed to branch: 9.0
Change author: Alexei Sheplyakov <email address hidden>
Review: https://review.fuel-infra.org/20748

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/trusty/ruby-sigar (9.0)

Reviewed: https://review.fuel-infra.org/20748
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0

Commit: 26f4870970baec3c137a7516118817ce57675bd8
Author: Alexei Sheplyakov <email address hidden>
Date: Tue May 17 12:54:23 2016

(cherry picked from commit f7c2f5a5b80577014117a9ff5f69417692154f80)

Alexandr Kostrikov (akostrikov-mirantis) wrote :

Checked 3 times on 9.0 RC1

no longer affects: fuel/newton
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on packages/trusty/ohai (master)

Change abandoned by Ivan Suzdal <email address hidden> on branch: master
Review: https://review.fuel-infra.org/20030
