Bootspeed completes 16-17 runs and then fails to connect to machine

Bug #1099519 reported by Max Brustkern
This bug affects 1 person
Affects: UTAH
Status: Invalid
Importance: High
Assigned to: Max Brustkern
Milestone: (none)

Bug Description

Sometimes the bootspeed process completes 16 or 17 runs' worth of data gathering, but when the machine then reboots, we time out after spending 10 minutes attempting to reconnect to it.

Changed in utah:
status: New → Triaged
importance: Undecided → High
Max Brustkern (nuclearbob) wrote :

I got a good look at this today. The syslog on the machine has no entries for 17 minutes. I guess if we had tried for 20 minutes instead of 10, we would have been able to continue. I'm going to look into raising this timeout to 30 minutes to be safe.

Max Brustkern (nuclearbob) wrote :

I raised the timeout to 30 minutes, so this should be fixed.
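
For illustration only, a reconnect wait with a configurable deadline might look like the sketch below. This is not UTAH's actual code: the function name `wait_for_port`, the choice of probing TCP port 22, and the 10-second retry interval are all assumptions; only the 30-minute default comes from the comment above.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=30 * 60, interval=10.0):
    """Return True once host:port accepts a TCP connection, False on timeout.

    Hypothetical helper: the name, port, and interval are illustrative.
    """
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        try:
            # Cap each attempt so a single hung connect cannot overrun the deadline.
            with socket.create_connection((host, port), timeout=min(interval, remaining)):
                return True
        except OSError:
            # Refused, unreachable, or timed out: pause, then retry until the deadline.
            time.sleep(min(interval, max(0.0, deadline - time.monotonic())))
```

Using a monotonic clock keeps the deadline correct even if the wall clock is adjusted while the test machine reboots.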

Changed in utah:
status: Triaged → Fix Released
Max Brustkern (nuclearbob) wrote :

I saw the process fail after a single collection attempt here:
http://10.97.0.1:8080/job/bootspeed-backfill-desktop-i386-acer-veriton-01/29/console
If this continues to occur, we can reopen the bug and attempt to recreate.

Changed in utah:
assignee: nobody → Max Brustkern (nuclearbob)
Gema Gomez (gema) wrote :

Could this be due to CPU thermal protection? If it happens again, can we check with Rick in the lab whether the CPU is refusing to restart because of thermal issues, or whether PSU thermal protection is involved?

We could set up lm-sensors on the bootspeed test setup, monitor the CPU temperature in case it overheats, and wait for it to cool down if needed:
https://help.ubuntu.com/community/SensorInstallHowto

Gema Gomez (gema) wrote :

Here is an example of how the output looks on my machine:
10:33:47 gema-pc:~ % sensors
atk0110-acpi-0
Adapter: ACPI interface
Vcore Voltage: +0.86 V (min = +0.80 V, max = +1.60 V)
 +3.3 Voltage: +3.47 V (min = +2.97 V, max = +3.63 V)
 +5 Voltage: +5.09 V (min = +4.50 V, max = +5.50 V)
 +12 Voltage: +12.32 V (min = +10.20 V, max = +13.80 V)
CPU FAN Speed: 1095 RPM (min = 600 RPM)
CHASSIS1 FAN Speed: 0 RPM (min = 600 RPM)
CHASSIS2 FAN Speed: 0 RPM (min = 600 RPM)
POWER FAN Speed: 0 RPM (min = 600 RPM)
CPU Temperature: +41.5°C (high = +60.0°C, crit = +95.0°C)
MB Temperature: +37.0°C (high = +45.0°C, crit = +75.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0: +36.0°C (high = +83.0°C, crit = +99.0°C)
Core 1: +32.0°C (high = +83.0°C, crit = +99.0°C)
Core 2: +35.0°C (high = +83.0°C, crit = +99.0°C)
Core 3: +31.0°C (high = +83.0°C, crit = +99.0°C)

So if any of the values approaches its max or high threshold on the test machine, we should insert a wait in the tests.
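
As a rough sketch of how the tests could decide when to wait, the function below parses `sensors` text like the output above and reports temperature readings that are close to their "high" threshold. The function name `hot_sensors` and the 5 °C default margin are assumptions made for illustration; nothing like this is confirmed to exist in the bootspeed tests.

```python
import re

# Matches lines such as "Core 0: +36.0°C (high = +83.0°C, crit = +99.0°C)".
_TEMP_LINE = re.compile(
    r"^(?P<label>[^:]+):\s*\+?(?P<temp>-?\d+(?:\.\d+)?)°C\s*"
    r"\(high = \+?(?P<high>-?\d+(?:\.\d+)?)°C"
)


def hot_sensors(sensors_output, margin=5.0):
    """Return (label, temp, high) for readings within `margin` °C of their high limit.

    Hypothetical helper: the name and default margin are illustrative.
    """
    hot = []
    for line in sensors_output.splitlines():
        match = _TEMP_LINE.match(line.strip())
        if match is None:
            continue  # voltage, fan, and header lines carry no temperature limit
        temp = float(match.group("temp"))
        high = float(match.group("high"))
        if temp >= high - margin:
            hot.append((match.group("label").strip(), temp, high))
    return hot
```

A test harness could capture the text with `subprocess.run(["sensors"], capture_output=True, text=True).stdout` and sleep before the next reboot while `hot_sensors` returns a non-empty list.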

Gema Gomez (gema) wrote :

This could be related to the X hang in bug 1096943.

Changed in utah:
status: Fix Released → In Progress
Max Brustkern (nuclearbob) wrote :

This still happens occasionally, but less since we've moved back to desktop.

Max Brustkern (nuclearbob) wrote :

The new bootspeed process doesn't have this issue. Possibly underlying utah improvements fixed it.

Changed in utah:
status: In Progress → Invalid