STL std::nth_element bug (fixed upstream)

Bug #1246802 reported by Garth Wells on 2013-10-31
50
This bug affects 10 people
Affects Status Importance Assigned to Milestone
FEniCS Project
Undecided
Unassigned
gcc
Fix Released
High
gcc-4.8 (Ubuntu)
Undecided
Unassigned
Saucy
Undecided
Unassigned

Bug Description

The package llibstdc++-4.8 (4.8.1-10ubuntu8) has a nasty STL std::nth_element bug that has been fixed upstream, see http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58800. Could you release an update with the fix? It is a bad bug, i.e. it completely breaks some programs on Ubuntu 13.10 that use the STL.

This program segfaults on gcc 4.8.1 from Ubuntu 13.10. I'm building a copy of the 4.8 branch to see if I can reproduce it there.

This is on x86_64. I used a gross hack to specify the input so I don't have to think about precision issues. The numbers are all well-behaved values near 0.9.

I'll attach preprocessed source. The preprocessed source crashes even if built with gcc 4.7.something.

#include <algorithm>
#include <stdint.h>
//#include <iostream>

double to_double(uint64_t x)
{
        union {double d; uint64_t x;} u;
        u.x = x;
        return u.d;
}

int main()
{
        std::vector<double> v = {
                to_double(4606672070890243784),
                to_double(4606672025854247510),
                to_double(4606671800674266141),
                to_double(4606671575494284772),
                to_double(4606672115926240057),
                to_double(4606672160962236330),
                to_double(4606672070890243784),
        };

// for (auto i : v)
// std::cout << i << std::endl;

        std::nth_element(v.begin(), v.begin() + 3, v.end());
        return 0;
}

Created attachment 31047
preprocessed source

This was generated with g++ -std=gnu++11 -E sort.cc on Ubuntu 13.10.

Please figure out a testcase involving plain integers.

Unfortunately the issue seems real. I can reproduce it with:

        std::vector<uint64_t> v = {
                4606672070890243784,
                4606672025854247510,
                4606671800674266141,
                4606671575494284772,
                4606672115926240057,
                4606672160962236330,
                4606672070890243784,
        };

Chris?

I can also reproduce with something this simple:

        std::vector<int> v = {
                207089,
                202585,
                180067,
                157549,
                211592,
                216096,
                207089
        };

-D_GLIBCXX_DEBUG reveals that we are dereferencing a past-the-end iterator. We badly need a quick fix. Say 2-3 day max or we have to revert the fix for 58437.

Don't tell me is just this:

Index: include/bits/stl_algo.h
===================================================================
--- include/bits/stl_algo.h (revision 203835)
+++ include/bits/stl_algo.h (working copy)
@@ -1917,7 +1917,7 @@
     _RandomAccessIterator __last, _Compare __comp)
     {
       _RandomAccessIterator __mid = __first + (__last - __first) / 2;
- std::__move_median_to_first(__first, __first + 1, __mid, (__last - 2),
+ std::__move_median_to_first(__first, __first + 1, __mid, (__last - 1),
       __comp);
       return std::__unguarded_partition(__first + 1, __last, __first, __comp);
     }

??

Andy, while waiting for Chris' feedback, you may want to test the above on your real applications. In 4.8.x some nth_algorithm details are different (the above is versus mainline), but you can quickly find a couple of (__last - 2) which I'm asking you to change to (__last - 1)

I changed two instances of __last - 2 to __last - 1. I'm running against r203836 on gcc-4_8-branch.

The thing that caught it was an experiment, not part of my test suite, so this is a little tricky. It already survived longer than the unpatched version did. The crash wasn't the same segfault, though -- it was a much later crash due to memory corruption. I'll let it run for a while under valgrind to see what, if anything, pops up.

FWIW, replacing that list with a random permutation seems to crash about 5% of the time.

Thanks Andy. We really want to fix this as soon as possible. In the meanwhile in my tests the tweak passed the testsuite, without regressing on the performance of std::sort (libstdc++/58437). Thus apparently we are on the right track.

This code has survived my smoke test so far, which is a good sign. If it was causing some kind of bad problem, I'd expect something to have exploded by now.

{0, 1, 2, 0} should be enough.

__unguarded_partition only stops increasing __first when *__first is not smaller than the pivot. If all elements are smaller than the pivot, boom. And when we have fewer than 5 elements (the check in __introselect is only for > 3), the pivot selection code doesn't guarantee that (some of the 3 pointers used to compute the median are the same, and with Paolo's change they become distinct).

Hi Marc. I'm coming to the conclusion that the '- 2' in the patch sent by Chris was in any case a typo. Now, sorry I didn't investigate the issue further, I'm not sure to understand if you think changing it to '- 1' is enough or not: it certainly passes or my tests so far ({0, 1, 2, 0} included)

Yes, I didn't trace all the members which call __unguarded_partition_pivot.

The better (in my opinion) fix is to change __introselect from:

__last - __first > 3

to:

__last - __first > int(_S_threshold)

Like the other call the __unguarded_partition_pivot.

I am currently testing that change.

The 'last - 2' was on purpose (although, it is causing the problem!), as 'last - 1' is the the last element of the array which we want to avoid (think of it as 'last - 1 - 1', to go with 'first + 1'. Obviously we couldn't de-ref 'last - 1'.

Ok, note that I was reviewing the description you originally sent to the mailing list and it never talked about the - 2...

Let's figure out something safe, please.

(In reply to Chris Jefferson from comment #12)
> The better (in my opinion) fix is to change __introselect from:
>
> __last - __first > 3
>
> to:
>
> __last - __first > int(_S_threshold)
>
> Like the other call the __unguarded_partition_pivot.

Note that the optimal threshold is probably not the same for sort and for select (which drops to full sorting below that threshold!).

In my last comment, "Obviously we couldn't de-ref 'last - 1'" should have been "Obviously we couldn't de-ref 'last'"

Created attachment 31049
random test

In order to catch any other issues, and stop this kind of thing happening again, I want to quickly and massively expand the algorithms testsuite.

The attached test tests a whole set of random nth_element cases (it would catch the problem we are currently having).

The problem is on my computer, unoptimised, this test takes 30 seconds, and I want to add a similar test to all the sorting related algorithms. Is there a standard way of submitting tests that take a lot of time and memory?

At the moment there isn't, but some tests self adjust to run less on simulators. For the time being I think we could certainly add a few (not 10 or 20) of the tests tou are talking about, but remember all with the bits for the simulators. We could also easily add some tests to the performance testsuite (there are no limits there and isn't run by default) which, outside the timing loop also check that the computation is correct (well, we should probably allways do that)

(In reply to Paolo Carlini from comment #13)
> Ok, note that I was reviewing the description you originally sent to the
> mailing list and it never talked about the - 2...
>
> Let's figure out something safe, please.

I miswrote in my final paragraph:

...first+1, mid, last-1 (there is no reason in this bug to consider
last-1 instead of last, but I thought it wouldn't do any harm, and
might avoid other issues of accidentally choosing a bad pivot.

Should have said:

...first+1, mid, last-2 (there is no reason in this bug to consider *last-2* instead of *last-1*, but I thought it wouldn't do any harm, and might avoid other issues of accidentally choosing a bad pivot.

So, the original code chose last-1, I moved that to last-2, not checking all the places that called the code enough (although, I could also have broken it with the first+1 if the ranges had been small enough).

Changing 'last-2' to 'last-1' might well be the easiest, smallest fix, if it does not effect the performance testsuite. We can then, once we have a much more comphrehensive testsuite, make another pass at performance improvements and tweaks.

Created attachment 31051
nth_element patch

Here is a patch. The actual patch is just changing '__last - 2' to '__last - 1' (***NOTE*** When back-applying this patch, before the recent merge of algorithms, there will be two copies of __last - 2 to change).

The larger part is a patch just for this bug, and a general big improvement in the test procedure, to sanity check no other related methods have any issues (they do not). I intend to continue to add new testsuite functions in the future, but this will do to start.

Chris, <random> isn't unconditionally available (requires _GLIBCXX_USE_C99_STDINT_TR1 defined). Please massage the new tests (and the new header) to just do nothing when <random> isn't available and send the result separately to the library mailing list. Let's first commit everywhere fix + minimal testcase.

I think you can probably simply add at the beginning of the new tests:

// { dg-require-cstdint "" }

and then the new include doesn't require any change. But please double check, we don't want new Bugzillas: say, damage check_v3_target_cstdint to artificially return false on your machine and double check that the new testcases are simply skipped.

Author: paolo
Date: Sun Oct 20 09:07:36 2013
New Revision: 203872

URL: http://gcc.gnu.org/viewcvs?rev=203872&root=gcc&view=rev
Log:
2013-10-20 Chris Jefferson <email address hidden>
     Paolo Carlini <email address hidden>

 PR libstdc++/58800
 * include/bits/stl_algo.h (__unguarded_partition_pivot): Change
 __last - 2 to __last - 1.
 * testsuite/25_algorithms/nth_element/58800.cc: New

Added:
    trunk/libstdc++-v3/testsuite/25_algorithms/nth_element/58800.cc
Modified:
    trunk/libstdc++-v3/ChangeLog
    trunk/libstdc++-v3/include/bits/stl_algo.h

Author: paolo
Date: Sun Oct 20 09:08:12 2013
New Revision: 203873

URL: http://gcc.gnu.org/viewcvs?rev=203873&root=gcc&view=rev
Log:
2013-10-20 Chris Jefferson <email address hidden>
     Paolo Carlini <email address hidden>

 PR libstdc++/58800
 * include/bits/stl_algo.h (__unguarded_partition_pivot): Change
 __last - 2 to __last - 1.
 * testsuite/25_algorithms/nth_element/58800.cc: New

Added:
    branches/gcc-4_8-branch/libstdc++-v3/testsuite/25_algorithms/nth_element/58800.cc
Modified:
    branches/gcc-4_8-branch/libstdc++-v3/ChangeLog
    branches/gcc-4_8-branch/libstdc++-v3/include/bits/stl_algo.h

Author: paolo
Date: Sun Oct 20 09:08:26 2013
New Revision: 203874

URL: http://gcc.gnu.org/viewcvs?rev=203874&root=gcc&view=rev
Log:
2013-10-20 Chris Jefferson <email address hidden>
     Paolo Carlini <email address hidden>

 PR libstdc++/58800
 * include/bits/stl_algo.h (__unguarded_partition_pivot): Change
 __last - 2 to __last - 1.
 * testsuite/25_algorithms/nth_element/58800.cc: New

Added:
    branches/gcc-4_7-branch/libstdc++-v3/testsuite/25_algorithms/nth_element/58800.cc
Modified:
    branches/gcc-4_7-branch/libstdc++-v3/ChangeLog
    branches/gcc-4_7-branch/libstdc++-v3/include/bits/stl_algo.h

Fixed for 4.7.4/4.8.3/4.9.0.

tags: added: saucy
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in gcc-4.8 (Ubuntu):
status: New → Confirmed
Phil Weir (phil-weir) wrote :

Just went to file a report and spotted the name - small world - I suspect we found this the same way, which is reassuring. Can confirm presence on 13.10.

Changed in gcc:
importance: Unknown → High
status: Unknown → Fix Released
Garth Wells (garth-wells) wrote :

Any news on getting a fix out for Saucy? This is a very bad bug and the fix is available upstream.

*** Bug 59116 has been marked as a duplicate of this bug. ***

Matthias Klose (doko) wrote :

fixed in trusty

Changed in gcc-4.8 (Ubuntu Saucy):
status: New → Confirmed
Changed in gcc-4.8 (Ubuntu):
status: Confirmed → Fix Released

Hello Garth, or anyone else affected,

Accepted gcc-4.8 into saucy-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/gcc-4.8/4.8.1-10ubuntu9 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in gcc-4.8 (Ubuntu Saucy):
status: Confirmed → Fix Committed
tags: added: verification-needed
Garth Wells (garth-wells) wrote :

gcc-4.8 4.8.1-10ubuntu9 appears to fix this bug. The runtime failures in my library are resolved by the updated package. Thanks.

tags: added: verification-done
removed: verification-needed
Moritz Hassert (mhassert) wrote :

gcc-4.8 4.8.1-10ubuntu9 in saucy-proposed fixes the issue for me.

Nico Schlömer (nschloe) wrote :

Finding out that this bug affects me cost me a day, upgrading to gcc-4.8 4.8.1-10ubuntu9 in saucy-proposed fixes the issue.

The verification of the Stable Release Update for gcc-4.8 has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regresssions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package gcc-4.8 - 4.8.1-10ubuntu9

---------------
gcc-4.8 (4.8.1-10ubuntu9) saucy-proposed; urgency=low

  * Fix miscompilation of tar on armhf. LP: #1243656.
  * Fix PR libstdc++/58800, taken from the 4.8 branch. LP: #1246802.
 -- Matthias Klose <email address hidden> Fri, 15 Nov 2013 12:20:27 +0100

Changed in gcc-4.8 (Ubuntu Saucy):
status: Fix Committed → Fix Released

*** Bug 59557 has been marked as a duplicate of this bug. ***

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.