pt-heartbeat crashes with sleep argument error

Bug #1406390 reported by Robert Barabas on 2014-12-29
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona Toolkit moved to https://jira.percona.com/projects/PT
Fix Released
Medium
Frank Cizmich

Bug Description

The exception which kills the daemon:

Time::HiRes::sleep(-9.53674e-07): negative time not invented yet at /usr/local/bin/pt-heartbeat line 5279

The error is triggered by this statement:

   5279 sleep $next_interval - time;

This was noticed in v2.2.6 but the related function (make_interval_iter) has not changed in current (v2.2.12 at the time of report.)

Corresponding code segment:

   5259 # ########################################################################
   5260 # Monitor or update the heartbeat table.
   5261 # ########################################################################
   5262 my $end = $o->get('run-time') ? int(time + $o->get('run-time')) : 0;
   5263 PTDEBUG && _d($end ? ('Will exit at', ts($end)) : 'Running forever');
   5264
   5265 my $get_next_interval = make_interval_iter($interval, $skew);
   5266
   5267 while ( # Stop if...
   5268 (!$end || int(time) < $end) # runtime exceeded, or
   5269 && !-f $sentinel # sentinel file created
   5270 ) {
   5271 eval {
   5272 my $next_interval = $get_next_interval->();
   5273 if ( time >= $next_interval ) {
   5274 do { $next_interval = $get_next_interval->() }
   5275 until $next_interval > time;
   5276 PTDEBUG && _d("Missed last interval; next interval:",
   5277 ts($next_interval));
   5278 }
   5279 sleep $next_interval - time;
   5280 PTDEBUG && _d('Woke up at', ts(time));
   5281

The "make_interval_iter" subroutine:

   5444 sub make_interval_iter {
   5445 my ( $interval, $skew ) = @_;
   5446 die "I need an interval argument" unless defined $interval;
   5447 my ($s) = gettimeofday();
   5448 my $start_s = $s + 1;
   5449 my $i = 0;
   5450 my $get_next_interval = sub {
   5451 return $start_s + ($interval * $i++) + $skew;
   5452 };
   5453 return $get_next_interval;
   5454 }

summary: - pt-heartbeat crashes with error due to integer overflows
+ pt-heartbeat crashes with sleep argument error

Confirmed code with pt-heartbeat 2.2.12

nilnandan@desktop:~$ pt-heartbeat --version
pt-heartbeat 2.2.12
nilnandan@desktop:~$

# ########################################################################
   # Monitor or update the heartbeat table.
   # ########################################################################
   my $end = $o->get('run-time') ? int(time + $o->get('run-time')) : 0;
   PTDEBUG && _d($end ? ('Will exit at', ts($end)) : 'Running forever');

   my $get_next_interval = make_interval_iter($interval, $skew);

   while ( # Stop if...
      (!$end || int(time) < $end) # runtime exceeded, or
      && !-f $sentinel # sentinel file created
   ) {
      eval {
         my $next_interval = $get_next_interval->();
         if ( time >= $next_interval ) {
            do { $next_interval = $get_next_interval->() }
               until $next_interval > time;
            PTDEBUG && _d("Missed last interval; next interval:",
               ts($next_interval));
         }
         sleep $next_interval - time;
         PTDEBUG && _d('Woke up at', ts(time));
..
...
....

sub make_interval_iter {
   my ( $interval, $skew ) = @_;
   die "I need an interval argument" unless defined $interval;
   my ($s) = gettimeofday();
   my $start_s = $s + 1;
   my $i = 0;
   my $get_next_interval = sub {
      return $start_s + ($interval * $i++) + $skew;
   };
   return $get_next_interval;
}

Changed in percona-toolkit:
status: New → Confirmed
tags: added: pt-heartbeat
Frank Cizmich (frank-cizmich) wrote :

This looks like a race condition.

The code prior to the sleep statement is supposed to check that $next_interval is big enough, but that doesn't account for the nanoseconds that transpire after the loop is exited and before the sleep is executed.

Couldn't reproduce this but the flaw in the code seems clear enough.

Will fix.

Changed in percona-toolkit:
status: Confirmed → In Progress
milestone: none → 2.2.14
assignee: nobody → Frank Cizmich (frank-cizmich)
importance: Undecided → Medium
Changed in percona-toolkit:
status: In Progress → Fix Committed
Frank Cizmich (frank-cizmich) wrote :

Fixed by saving time value in a variable before entering race condition prone code.

Changed in percona-toolkit:
status: Fix Committed → Fix Released

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-670

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers