sscanf extremely slow on large strings

Bug #1391510 reported by Shahbaz Youssefi
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
glibc (Ubuntu)
New
Undecided
Unassigned

Bug Description

This is a bug on glib, and I'm not sure if it applies upstream too (although I suspect it does). Upstream bug link:

    https://sourceware.org/bugzilla/show_bug.cgi?id=17577

The bug is that `sscanf` (`vsscanf` actually) is very slow on large strings. It _seems_ like `vsscanf` is first trying to find the end of the string (with `rawmemchr`), which is expensive given that most of the string is not going to be read.

Here's a test code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <errno.h>

#define N 50000

static int _debug_helper(const char *src, int *a, int *n)
{
#if 1
 return sscanf(src, "%d%n", a, n);
#else
 char *end;

 errno = 0;
 *a = strtol(src, &end, 10);
 *n = end - src;

 return errno == 0 && *n > 0;
#endif
}

int main()
{
 int i;
 int a;
 int n;
 int so_far = 0;

 char *big_string = malloc(N * 4 + 1);
 for (i = 0; i < N; ++i)
  strcpy(big_string + i * 4, "123 ");
 big_string[N * 4] = '\0';

 while (1)
 {
  if (_debug_helper(big_string + so_far, &a, &n) != 1)
   break;
  so_far += n;
 }

 return 0;
}

Compile with: gcc -Wall -g -O0 main.c

Running this code with `N = 50000` and using `sscanf`, I get:

$ time ./a.out

real 0m1.602s
user 0m1.596s
sys 0m0.000s

Running it with `N = 50000` and using the substitute code, I get:

$ time ./a.out

real 0m0.002s
user 0m0.000s
sys 0m0.000s

Which is considrably smaller. Note that this shows that the part with `malloc` and initialization take very small time. Running callgrind shows that almost all of the time when using `sscanf` is spent in `rawmemchr`. Indeed, using gdb and randomly hitting CTRL+C, you always end up with a stack trace like this:

#0 __rawmemchr_ia32 () at ../sysdeps/i386/i686/multiarch/../../rawmemchr.S:167
#1 0xb7e78a06 in _IO_str_init_static_internal () at strops.c:44
#2 0xb7e5c857 in __GI___isoc99_vsscanf () at isoc99_vsscanf.c:41
#3 0xb7e5c7cf in __isoc99_sscanf () at isoc99_sscanf.c:31
#4 0x08048494 in _debug_helper () at main.c:11
#5 0x08048517 in main () at main.c:41

This means that `rawmemchr` is slowing down `sscanf` by an unnecessary degree. To further prove this point (and to confirm my guess that `rawmemchr` is reading the whole string), here are a couple more tests:

With `N = 25000` and using `sscanf`:

$ time ./a.out

real 0m0.407s
user 0m0.404s
sys 0m0.000s

With `N = 12500` and using `sscanf`:

$ time ./a.out

real 0m0.106s
user 0m0.104s
sys 0m0.000s

This clearly shows an `O(N^2)` behavior. The main loop of the program is `O(N)`, which means `sscanf` is running at `O(N)`. For large `N`, this is significant. On the other hand, the actual behavior of `sscanf` should be to read from the string according to the format string and no more, which in this case (using `%d` and "123" as values) is of constant time.

---

Note: I'm on Ubuntu 14.04, using gcc 4.8.2. Most importantly, this is glibc 2.19.

Tags: performance
description: updated
description: updated
description: updated
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.