Effective 2 GB limit on blend input

Bug #373398 reported by Rob Speer
Affects: Divisi
Status: Fix Committed
Importance: Medium
Assigned to: Rob Speer

Bug Description

The blending code currently multiplies all the input data, and puts it into a sparse matrix, before running the blend SVD.

There may in fact be multiple copies of all the data: the original input tensors, the blend tensor, and the CSCMatrix.

This quickly hits the 2 GB memory limit in 32-bit Python (or, equivalently, quickly eats up 4 GB or more of RAM in 64-bit Python). We need a way to conserve memory. Some possibilities:

* Incremental approaches (perhaps using Jayant's 'hit all the zeros at once' idea to make incremental SVD spiky like Lanczos SVD is)
* SVD of SVDs (add the svd.u's together, not the input matrices, and svd again; sigma and v need to be reconstructed in other ways)
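The second idea can be sketched in NumPy (an illustration only, not Divisi's code; the matrix names and sizes are made up). Because A + B equals the product of the stacked thin SVD factors of A and B, the blend's SVD can be recovered from those factors without ever assembling the summed matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 30))
B = rng.standard_normal((40, 30))

# SVD each input separately (these could be truncated in practice).
Ua, sa, Vta = np.linalg.svd(A, full_matrices=False)
Ub, sb, Vtb = np.linalg.svd(B, full_matrices=False)

# A + B == [Ua*sa | Ub*sb] @ [Va | Vb].T, so the blend's SVD can be
# computed from the thin stacked factors instead of from A + B itself.
left = np.hstack([Ua * sa, Ub * sb])   # 40 x 60
right = np.hstack([Vta.T, Vtb.T])      # 30 x 60

# Orthogonalize the right factor, then SVD the small core matrix.
Q, R = np.linalg.qr(right)             # right = Q @ R
U, s, Wt = np.linalg.svd(left @ R.T, full_matrices=False)
V = Q @ Wt.T                           # right singular vectors of A + B

# Sanity check: singular values match a direct SVD of the blend.
s_direct = np.linalg.svd(A + B, compute_uv=False)
assert np.allclose(s, s_direct)
```

With truncated per-input SVDs this becomes an approximation, which is the point of the suggestion: sigma and v fall out of the small core SVD rather than being stored alongside the full inputs.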

Tags: efficiency
Rob Speer (rspeer)
Changed in divisi:
assignee: nobody → Rob Speer (rspeer)
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Ken Arnold (kenneth-arnold) wrote: Re: [Commonsense] [Bug 373398] [NEW] Effective 2 GB limit on blend input

The blend tensor only has to keep the input tensors around if you want
to adjust the blending factors. It could throw them out otherwise.

Likewise, the conversion to a CSCMatrix could be made destructive. Or
SVDLIBC could be ported to work on a Tensor directly.
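A destructive conversion could look roughly like this (a hypothetical helper, not Divisi's CSCMatrix API): each entry is popped from the source dict as it is copied into flat arrays, so the boxed key/value objects become garbage immediately instead of surviving alongside the arrays.

```python
import numpy as np

def dict_to_coo_destructive(entries):
    """Drain a {(row, col): value} dict into flat index/value arrays.

    Hypothetical sketch: popitem() removes each entry as it is
    copied, so peak memory holds roughly one copy of the data
    rather than the dict plus the finished arrays.
    """
    n = len(entries)
    rows = np.empty(n, dtype=np.int64)
    cols = np.empty(n, dtype=np.int64)
    vals = np.empty(n, dtype=np.float64)
    for i in range(n):
        (r, c), v = entries.popitem()
        rows[i], cols[i], vals[i] = r, c, v
    return rows, cols, vals

d = {(0, 1): 2.0, (3, 4): 5.0, (2, 2): 7.0}
rows, cols, vals = dict_to_coo_destructive(d)
assert len(d) == 0   # the source dict has been emptied
```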

Another option is actually storing the biggest tensors on disk, using
(gasp!) ZODB. This is actually efficient. Sorta.

We also have some low-hanging fruit: DictTensor is storing Python
objects, not raw integers.

-Ken
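The per-entry cost behind Ken's DictTensor point is easy to measure (a rough illustration; exact byte counts vary by Python build): a dict of boxed tuple keys and float values costs an order of magnitude more per nonzero than a raw 8-byte double.

```python
import sys
import numpy as np

n = 10_000
# DictTensor-style storage: a boxed tuple key and float value per entry.
d = {(i, 0): float(i) for i in range(n)}
# Raw storage: one 8-byte double per value.
vals = np.arange(n, dtype=np.float64)

# Approximate per-entry cost: dict table plus each key and value object.
per_entry_dict = (sys.getsizeof(d)
                  + sum(sys.getsizeof(k) + sys.getsizeof(v)
                        for k, v in d.items())) / n
per_entry_raw = vals.nbytes / n   # always 8.0

print(f"dict: ~{per_entry_dict:.0f} B/entry, raw: {per_entry_raw:.0f} B/entry")
```

On a typical 64-bit CPython this comes out to well over ten times the raw cost, before even counting the index arrays a sparse format would need.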


Rob Speer (rspeer) wrote:

Ken fixed this. Input tensors can now be stored in PyTables (for a big time/space tradeoff).
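The committed fix uses PyTables; as a stand-in for the same idea (values live on disk and are read on demand, trading time for space), a numpy.memmap sketch shows the shape of the tradeoff. This is an illustration only, not Divisi's implementation; the file path and sizes are made up.

```python
import os
import tempfile
import numpy as np

# Hypothetical disk-backed value store, analogous in spirit to the
# PyTables fix: the tensor's values live in a file, not in RAM.
path = os.path.join(tempfile.mkdtemp(), "tensor_values.dat")
n = 1_000_000

store = np.memmap(path, dtype=np.float64, mode="w+", shape=(n,))
store[:] = np.arange(n)    # written through to the file
store.flush()
del store                  # drop the in-memory mapping

# Reopen read-only: slices are paged in lazily from disk.
view = np.memmap(path, dtype=np.float64, mode="r", shape=(n,))
print(view[123_456])       # touches one page, not the whole array
```

Reads now go through the disk (or the OS page cache) instead of process memory, which is exactly the time/space tradeoff the comment describes.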

Changed in divisi:
status: Confirmed → Fix Committed