pt-heartbeat crashes with sleep argument error

Bug #1406390 reported by Robert Barabas
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona Toolkit moved to https://jira.percona.com/projects/PT
Fix Released
Medium
Frank Cizmich

Bug Description

The exception which kills the daemon:

Time::HiRes::sleep(-9.53674e-07): negative time not invented yet at /usr/local/bin/pt-heartbeat line 5279

The error is triggered by this statement:

   5279 sleep $next_interval - time;

This was noticed in v2.2.6 but the related function (make_interval_iter) has not changed in current (v2.2.12 at the time of report.)

Corresponding code segment:

   5259 # ########################################################################
   5260 # Monitor or update the heartbeat table.
   5261 # ########################################################################
   5262 my $end = $o->get('run-time') ? int(time + $o->get('run-time')) : 0;
   5263 PTDEBUG && _d($end ? ('Will exit at', ts($end)) : 'Running forever');
   5264
   5265 my $get_next_interval = make_interval_iter($interval, $skew);
   5266
   5267 while ( # Stop if...
   5268 (!$end || int(time) < $end) # runtime exceeded, or
   5269 && !-f $sentinel # sentinel file created
   5270 ) {
   5271 eval {
   5272 my $next_interval = $get_next_interval->();
   5273 if ( time >= $next_interval ) {
   5274 do { $next_interval = $get_next_interval->() }
   5275 until $next_interval > time;
   5276 PTDEBUG && _d("Missed last interval; next interval:",
   5277 ts($next_interval));
   5278 }
   5279 sleep $next_interval - time;
   5280 PTDEBUG && _d('Woke up at', ts(time));
   5281

The "make_interval_iter" subroutine:

   5444 sub make_interval_iter {
   5445 my ( $interval, $skew ) = @_;
   5446 die "I need an interval argument" unless defined $interval;
   5447 my ($s) = gettimeofday();
   5448 my $start_s = $s + 1;
   5449 my $i = 0;
   5450 my $get_next_interval = sub {
   5451 return $start_s + ($interval * $i++) + $skew;
   5452 };
   5453 return $get_next_interval;
   5454 }

Tags: pt-heartbeat
summary: - pt-heartbeat crashes with error due to integer overflows
+ pt-heartbeat crashes with sleep argument error
Revision history for this message
Robert Barabas (robert-barabas) wrote :
Revision history for this message
Nilnandan Joshi (nilnandan-joshi) wrote :

Confirmed code with pt-heartbeat 2.2.12

nilnandan@desktop:~$ pt-heartbeat --version
pt-heartbeat 2.2.12
nilnandan@desktop:~$

# ########################################################################
   # Monitor or update the heartbeat table.
   # ########################################################################
   my $end = $o->get('run-time') ? int(time + $o->get('run-time')) : 0;
   PTDEBUG && _d($end ? ('Will exit at', ts($end)) : 'Running forever');

   my $get_next_interval = make_interval_iter($interval, $skew);

   while ( # Stop if...
      (!$end || int(time) < $end) # runtime exceeded, or
      && !-f $sentinel # sentinel file created
   ) {
      eval {
         my $next_interval = $get_next_interval->();
         if ( time >= $next_interval ) {
            do { $next_interval = $get_next_interval->() }
               until $next_interval > time;
            PTDEBUG && _d("Missed last interval; next interval:",
               ts($next_interval));
         }
         sleep $next_interval - time;
         PTDEBUG && _d('Woke up at', ts(time));
..
...
....

sub make_interval_iter {
   my ( $interval, $skew ) = @_;
   die "I need an interval argument" unless defined $interval;
   my ($s) = gettimeofday();
   my $start_s = $s + 1;
   my $i = 0;
   my $get_next_interval = sub {
      return $start_s + ($interval * $i++) + $skew;
   };
   return $get_next_interval;
}

Changed in percona-toolkit:
status: New → Confirmed
tags: added: pt-heartbeat
Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

This looks like a race condition.

The code prior to the sleep statement is supposed to check that $next_interval is big enough, but that doesn't account for the nanoseconds that transpire after the loop is exited and before the sleep is executed.

Couldn't reproduce this but the flaw in the code seems clear enough.

Will fix.

Changed in percona-toolkit:
status: Confirmed → In Progress
milestone: none → 2.2.14
assignee: nobody → Frank Cizmich (frank-cizmich)
importance: Undecided → Medium
Changed in percona-toolkit:
status: In Progress → Fix Committed
Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

Fixed by saving time value in a variable before entering race condition prone code.

Changed in percona-toolkit:
status: Fix Committed → Fix Released
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-670

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.