machine lock-up, upgrade kernel on MF cluster

Bug #1042172 reported by Muharem Hrnjadovic
This bug affects 1 person
Affects: OpenQuake (deprecated)
Status: Fix Released
Importance: High
Assigned to: Muharem Hrnjadovic

Bug Description

gemmicro01 became unresponsive a week ago. Today Urs was finally granted access to the data center and we could take a look at the machine. We hooked up a console and a keyboard, but the machine was hung and we could not log in.

We also took a quick look at gemmicro02 and noticed an error on the console. It read approximately as follows:

    gemmicro02 kernel: [1546823.628018] INFO: rcu_bh detected stall on CPU 6 (t=0 jiffies)

The same error was found in /var/log/syslog on other machines, e.g.:

    bigstar04.log:Aug 27 09:04:54 bigstar04 kernel: [2986792.080036] INFO: rcu_bh detected stall on CPU 35 (t=0 jiffies)
    gemmicro01.log:Aug 17 08:16:39 gemmicro01 kernel: [1546823.628018] INFO: rcu_bh detected stall on CPU 6 (t=0 jiffies)
    gemmicro01.log:Aug 17 18:06:45 gemmicro01 kernel: [1582229.668026] INFO: rcu_bh detected stall on CPU 46 (t=0 jiffies)
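
To check the remaining machines for the same message, the collected syslog files could be scanned with a small script. The following is only a sketch: the regular expression is derived from the three lines quoted above, and the names `STALL_RE` and `find_stalls` are made up for this example.

```python
import re

# Pattern derived from the log lines quoted in this report (an assumption:
# other kernels may word the stall warning differently).
STALL_RE = re.compile(
    r"(?P<host>\S+) kernel: \[\s*(?P<uptime>\d+\.\d+)\] "
    r"INFO: (?P<flavor>rcu\w*) detected stall on CPU (?P<cpu>\d+)"
)


def find_stalls(lines):
    """Return (host, uptime_seconds, rcu_flavor, cpu) for each stall warning."""
    stalls = []
    for line in lines:
        m = STALL_RE.search(line)
        if m:
            stalls.append(
                (m.group("host"), float(m.group("uptime")),
                 m.group("flavor"), int(m.group("cpu")))
            )
    return stalls


# The three lines from this report, including the file-name prefix added by grep.
sample = [
    "bigstar04.log:Aug 27 09:04:54 bigstar04 kernel: [2986792.080036] INFO: rcu_bh detected stall on CPU 35 (t=0 jiffies)",
    "gemmicro01.log:Aug 17 08:16:39 gemmicro01 kernel: [1546823.628018] INFO: rcu_bh detected stall on CPU 6 (t=0 jiffies)",
    "gemmicro01.log:Aug 17 18:06:45 gemmicro01 kernel: [1582229.668026] INFO: rcu_bh detected stall on CPU 46 (t=0 jiffies)",
]

for host, uptime, flavor, cpu in find_stalls(sample):
    print(f"{host}: {flavor} stall on CPU {cpu} at uptime {uptime:.0f}s")
```

The bracketed number is the kernel's seconds-since-boot timestamp, so grouping the results by host shows how long each machine had been up when the warning appeared.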

Whether this is what caused gemmicro01 to lock up remains to be determined.

These kinds of errors may have been fixed in the 3.4 kernel (see https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1003081).

Also please see: https://lkml.org/lkml/2012/3/27/169

And: http://www.kernel.org/doc/Documentation/RCU/stallwarn.txt
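
Since the possible fix is tied to the 3.4 kernel, it may help to check which machines run an older release. A minimal sketch, assuming the release string has the form printed by `uname -r`; the function name `kernel_at_least` and the example release strings are illustrative and not taken from the cluster.

```python
import re


def kernel_at_least(release, minimum=(3, 4)):
    """Return True if a kernel release string is at or above `minimum`.

    Only the leading major.minor pair is compared; the patch level and any
    distribution suffix are ignored.
    """
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(m.group(1)), int(m.group(2))) >= minimum


# Illustrative release strings only (hypothetical, not read from any machine).
for release in ("3.2.0-29-generic", "3.4.0"):
    state = "ok" if kernel_at_least(release) else "older than 3.4"
    print(f"{release}: {state}")
```

Comparing the version as a tuple of integers avoids the string-comparison trap where "3.13" would sort before "3.4".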

Changed in openquake:
status: New → Confirmed
importance: Undecided → Medium
tags: added: devop mfcluster
description: updated
Muharem Hrnjadovic (al-maisan) wrote :

The kernel will be upgraded on all MF cluster machines in order to prevent this issue from occurring.

summary: - gemmicro01 lock-up
+ machine lock-up, upgrade kernel on MF cluster
Changed in openquake:
assignee: nobody → Muharem Hrnjadovic (al-maisan)
importance: Medium → High
milestone: none → 0.8.3
Changed in openquake:
status: Confirmed → Fix Committed
status: Fix Committed → Fix Released