# Error in linalg.svd for large matrix

Bug #134277 reported by Mark Poolman on 2007-08-23
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Unknown
Unknown
Invalid
Unknown
Undecided
Unassigned

### Bug Description

Binary package hint: python-scipy

When calculating a kernel, k, of a matrix A, then the matrix product A.k _must_ (by definition) = 0 (ie the zero matrix) .
Svd can be used to calculate a kernel with the following function:

def null(A,tol=1e-8):
"orthogonal nullspace, elements are floats"

u,w,vt=svd(A)
v,i = PosOfFirst(w.tolist(),lambda x: x <= tol)
if i == -1:
i = len(w)
return scipy.transpose(vt[i:])

This can readily be tested thus:

def TestNull(a):

k = null(A.A,1e-8)
zeroes = mul(a,k)
print "max=", max(max(zeroes)), "min=", min(min(zeroes))

The max and min values should be smaller than (or at least the same order of maginitude as) tol.
Results on various platforms for the same matrix are as follows:

opteron/warty max= 3.62882467131e-16 min= -3.61473004307e-16 GOOD
pentium/dapper max= 0.109976286782 min= -0.0770718487184 BAD
AMD64/dapper max= 0.0599265087162 min= -0.13143867524 BAD
athlon/dapper max= 2.28048070053e-16 min= -4.583786292e-16 GOOD (but weird eh ?)
pentium/edgy max= 2.28048070053e-16 min= -4.583786292e-16 GOOD

The matrix is quite large (702x635) but well conditioned.
Not all matrices of comparable size exhibit the same problem.
All package configurations are defalt.
The problem couldn't be replicated in octave (which probably lets linpack et al of the hook).

The test script and matrix can be found here:
http://mudshark.brookes.ac.uk/mark/SciTest/

Upgrading to edgy isn't really a starter (LTS matter to me).
From my POV this is a serious problem, and I'll be happy to offer any further info,
however I'm away from the computer until wednesday so I'll not be able to respond before then.

/best/*

Mark

 Mark Poolman (mgpoolman) wrote on 2007-08-23: #1

PS it's not the value of tol that's causing the problem.

Now I've seen the "Attachment" button, SciTest.tgz contains the test script and matrix.

 Cesare Tirabassi (norsetto) wrote on 2007-08-25: #2

By giving a quick look at the runtime libraries, I don't think the problem is related to these:

dapper:
atlas3-base (3.6.0-20)
lapack3 (3.0.20000531a-6ubuntu2)
refblas3 (1.2-8ubuntu1)

edgy:
atlas3-base (3.6.0-20.2)
lapack3 (3.0.20000531a-6ubuntu2)
refblas3 (1.2-8ubuntu1)

the only differences between the two atlas3-blase libraries being in the packaging (and of course the base libraries against which they were compiled).

Unless there is a copy&paste problem, these results:

athlon/dapper max= 2.28048070053e-16 min= -4.583786292e-16 GOOD
pentium/edgy max= 2.28048070053e-16 min= -4.583786292e-16 GOOD

seem to confirm it.
I guess in both cases you were using the 32bit version?
It could actually be an hardware problem, would it be possible to test this on the same machine with the two different versions (edgy and dapper)? Perhaps some of the test cases above covers this?

 Changed in python-scipy: status: New → Incomplete
 Mark Poolman (mgpoolman) wrote on 2007-08-31: Re: [Bug 134277] Re: Error in linalg.svd for large matrix #3

Cesare,

On Sat, 2007-08-25 at 13:09 +0000, Cesare Tirabassi wrote:

> By giving a quick look at the runtime libraries, I don't think the
> problem is related to these:
>
> dapper:
> atlas3-base (3.6.0-20)
> lapack3 (3.0.20000531a-6ubuntu2)
> refblas3 (1.2-8ubuntu1)
>
> edgy:
> atlas3-base (3.6.0-20.2)
> lapack3 (3.0.20000531a-6ubuntu2)
> refblas3 (1.2-8ubuntu1)
>
> the only differences between the two atlas3-blase libraries being in the
> packaging (and of course the base libraries against which they were
> compiled).
>
> Unless there is a copy&paste problem, these results:
>
Yes there was a copy&paste mistake, but both results were good.

> athlon/dapper max= 2.28048070053e-16 min= -4.583786292e-16 GOOD
> pentium/edgy max= 2.28048070053e-16 min= -4.583786292e-16 GOOD
>
> seem to confirm it.

Maybe, but the 64 bit libraries will presumably have been compiled with
a different compiler (at least it will have been the 64 bit version of
gcc) and possibly different flags. For example I have known gcc -O3 ...
to cause a problem on some architectures and not others with the same
source, and I'm guessing (wildly) that this is a possible cause.

> I guess in both cases you were using the 32bit version?
No, 64 bit versions on the 64 bit hardware.

> It could actually be an hardware problem,
I really hope not, it would be a *MAJOR* PITA.
> would it be possible to test this on the same machine with the two different versions (edgy and dapper)?

> Perhaps some of the test cases above covers this?
>

To summarise the relevant results plus a couple more:

Machine 1: Opteron 246
Warty: max= 3.62882467131e-16 min= -3.61473004307e-16 GOOD

Machine 2: Athlon 64 3500+
Dapper: max= 0.0599265087162 min= -0.13143867524 BAD
Edgy: max = 0.97897895303 min= -0.935360320766 BAD
Feisty: max = 0.97897895303 min= -0.935360320766 BAD

Edgy and Feisty were run from live cd, and yes, the results were
identical - it wasn't a cut&paste issue this time.

I'm not sure how far forward this takes things although the difference
in the results for dapper and edgy on the same machine presumably
suggests the involvement of software to some extent. If you know of any
working repositories for breezy I'll try that as well.

Cheers,

Mark

--
Dr. Mark Poolman
Senior Research Fellow
Cell Systems Modelling Group
School of Life Science
Oxford Brookes University

 Mark Poolman (mgpoolman) wrote on 2007-08-31: #4

Also the fact that the error is seen on the dapper/pentium argues against a hardware problem.

Mark

 Cesare Tirabassi (norsetto) wrote on 2007-08-31: #5

Right now, I think the best would be to send the test script and matrix upstream, to see if they can reproduce it.
This could give us some clues on where to start.

http://projects.scipy.org/scipy/scipy

If you open a ticket, please let us know the number so that we can track the discussion.

I've checked their tracker to see if there was any similar problem reported, but couldn't find any:

http://projects.scipy.org/scipy/scipy/search?q=svd&noquickjump=1&ticket=on&changeset=on&wiki=on

 Scott Kitterman (kitterman) wrote on 2007-08-31: #6

Also, there is a new scipy release (5.2.1). We looked at it for inclusion in Gutsy and the changes are (despite the version number) invasive and wide ranging so we decided to pass. If 5.2.1 from upstream does not demonstrate this problem, then maybe we could work together to pull out a patch from the 5.2.1 changes.

 Mark Poolman (mgpoolman) wrote on 2007-08-31: #7

On Fri, 2007-08-31 at 14:45 +0000, Cesare Tirabassi wrote:
> Right now, I think the best would be to send the test script and matrix upstream, to see if they can reproduce it.
> This could give us some clues on where to start.
>
> http://projects.scipy.org/scipy/scipy
>
> If you open a ticket, please let us know the number so that we can track
> the discussion.

 Changed in scipy: status: Unknown → New
 Mark Poolman (mgpoolman) wrote on 2007-09-03: Some progress - Re: [Bug 134277] Error in linalg.svd for large matrix #8

The cause seems almost certainly to be in the atlas libs rather than
scipy per se: Scipy will load the atlas versions of the lapack/blas libs
if available, but if not falls back to lapack3/refblas3.

Warty has atlas 3.6.0-13, and does not have the problem, dapper has
atlas 3.6.0.20 and does. The problem can be made come and go simply by
installing/removing atlas, which does affect any other packages. This
holds true for i686 and x86_64.

Mark

 Mark Poolman (mgpoolman) wrote on 2007-09-03: Re: Some progress #9
 Changed in python-scipy: status: Incomplete → Confirmed
 Cesare Tirabassi (norsetto) wrote on 2007-09-03: #10

If you can confirm that the problem is solved with 3.6.0-13 than it is not a bug introduced upstream. Both 3.6.0-13 and 3.6.0-20 should be using the same upstream version (3.6.0).
Looking at the changelog:

http://packages.debian.org/changelogs/pool/main/a/atlas3/atlas3_3.6.0-20.6/changelog

there is nothing which immediatly pops to the eye as a possible cause to this.
Could you perhaps test by using 3.6.0-19, which is still available from the debian repository?

 Changed in python-scipy: status: Confirmed → Incomplete
 Mark Poolman (mgpoolman) wrote on 2007-09-03: #11

Bouncing email ? Apolgies if you see this twice.

On Mon, 2007-09-03 at 13:16 +0000, Cesare Tirabassi wrote:
> If you can confirm that the problem is solved with 3.6.0-13 than it is not a bug introduced upstream. Both 3.6.0-13 and 3.6.0-20 should be using the same upstream version (3.6.0).

What I can confirm is that:

1) The problem is not seen on the warty/x86_64 with atlas
installed.

2) The problem is not seen on dapper/i386/x86_64 without atlas.

3) The problem is seen on dapper/i386/x86_64 with atlas.

> Looking at the changelog:
>
> http://packages.debian.org/changelogs/pool/main/a/atlas3/atlas3_3.6.0-20.6/changelog
>
> there is nothing which immediatly pops to the eye as a possible cause to this.
Probably, but

"Slightly modify ATL_F77wrap_tr{m,s}v.c to work around gcc 4.0
bug when using -fomit-frame-pointer"

Sounds as if it could be covering a multitude of sins.

> Could you perhaps test by using 3.6.0-19, which is still available from the debian repository?

Unfortunately attempting to do so hits this bug
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=315908 (which is what
the previous quote relates to). Even if you decline the resulting
invitation to abort the install, the installation hangs anyway.

 Cesare Tirabassi (norsetto) wrote on 2007-09-03: #12

Seems unlikely:

diff -r atlas3-3.6.0-13/interfaces/blas/F77/src/f77wrap/ATL_F77wrap_trmv.c atlas3-3.6.0-20/interfaces/blas/F77/src/f77wrap/ATL_F77wrap_trmv.c
107a108,110
>
> /* This is to workaround a compiler bug in gcc 4.0.0 with -fomit-frame-pointer. 20050927 CM*/
> TYPE *j=W1N( N, X, INCX );
109c112
< W1N( N, X, INCX ), *INCX );
---
> j, *INCX );
diff -r atlas3-3.6.0-13/interfaces/blas/F77/src/f77wrap/ATL_F77wrap_trsv.c atlas3-3.6.0-20/interfaces/blas/F77/src/f77wrap/ATL_F77wrap_trsv.c
110a111,112
> /* This is to workaround a compiler bug in gcc 4.0.0 with -fomit-frame-pointer. 20050927 CM*/
> TYPE *j=W1N( N, X, INCX );
112c114
< W1N( N, X, INCX ), *INCX );
---
> j, *INCX );

I was thinking more about the compiler change, which testing -19 could have confirmed.

 Mark Poolman (mgpoolman) wrote on 2007-09-04: Re: [Bug 134277] Re: Error in linalg.svd for large matrix #13

On Mon, 2007-09-03 at 15:58 +0000, Cesare Tirabassi wrote:

> I was thinking more about the compiler change, which testing -19 could
> have confirmed.
>

OK, so after a bit of a battle (had to do a manual build), compiling -19
using gcc3.4.6/dapper, the problem is _absent_.

Mark

 Mark Poolman (mgpoolman) wrote on 2007-09-04: #14

-20 is OK with gcc3.4 as well.

 Cesare Tirabassi (norsetto) wrote on 2007-09-04: #15

Thanks Mark, super job!
Unfortunately this means, unless there is some #ifdefine in ATLAS which branches as a function of the compiler, that this is a compiler bug.
Would I be asking too much, if I'd ask you to check with the latest gcc (version 4.1.3 20070831 (prerelease) (Ubuntu 4.1.2-16ubuntu1))?

 Mark Poolman (mgpoolman) wrote on 2007-09-05: #16

On Tue, 2007-09-04 at 14:19 +0000, Cesare Tirabassi wrote:
> Thanks Mark, super job!
> Unfortunately this means, unless there is some #ifdefine in ATLAS which branches as a function of the compiler, that this is a compiler bug.
> Would I be asking too much, if I'd ask you to check with the latest gcc (version 4.1.3 20070831 (prerelease) (Ubuntu 4.1.2-16ubuntu1))?
>

Unfortunately the problem just seems to have got a tad more complex.
Firstly there was a build problem with the atlas, the atlas distribution
only makes the static libraries, and debuild fails with

"make: *** No rule to make target
`debian/tmp/usr/lib/atlas/liblapack.so', needed by `build'.
Stop."

The complexity of the atlas make system places this problem beyond my
ability to solve. However, the previous results relating to locally
built atlas are almost certainly wrong.

A second, possibly more serious, complicating factor, is that I now have
a matrix whose behaviour is complementary to the original one, i.e. it
passes when the original fails and vice-versa.

This rather tends to suggest that the cause of the problem does, after
all, lie with scipy.

Another aspect that might shed light on the issue is that the dimension
of the kernel is too large (compared a against a reference of calculated
with Gauss-Jordan and arbitrary precision rationals).

In the summary of results below, matrix null(A) had dimension 125 when
calculated by Gauss-Jordan, and 130 via svd, for matrix B the dimensions
were 126 and 130 respectively.

Updated results, matrices and scripts can be found here:
http://mudshark.brookes.ac.uk/mark/SciTest/

warty/opteron

+atlas (3.6.0-13)
matrix A : max= 3.62882467131e-16 min= -3.61473004307e-16
matrix B : max= 0.0242918127535 min= -0.0329155144634

-atlas
Can't do, breaks dependencies

dapper/pentium

+atlas (3.6.0-20)
matrix A : max= 0.030491255128 min= -0.249297772779
matrix B : max= 5.63011418035e-17 min= -7.21035472861e-17

-atlas
matrix A : max= 2.28048070053e-16 min= -4.583786292e-16
matrix B : max= 1.25730556444e-16 min= -1.65316267349e-16

dapper/amd64

+atlas
matrix A : max= 0.0599265087162 min= -0.13143867524
matrix B : max= 2.16840434497e-16 min= -1.21864324187e-16

-atlas
matrix A : max= 3.62882467131e-16 min= -3.61473004307e-16
matrix B : max= 0.0242918127535 min= -0.0329155144634

dapper/opteron - identical to dapper/amd64

edgy/amd64

+atlas
matrix A : max = 0.97897895303 min= -0.935360320766
matrix B : max = 3.05137859424e-15 min= -1.76247905159e-15

-atlas
matrix A : max = 1.26382736741e-14 min= -1.76067046684e-14
matrix B : max = 0.502358991218 min= -0.792848865561

fiesty/amd64 - identical to edgy/amd64

 Changed in scipy: status: New → Invalid