apport-retrace needs more than 1.5 GB memory (when using sandbox)

Bug #2073787 reported by Benjamin Drung
Affects          Status   Importance  Assigned to
Apport           Triaged  High        Unassigned
apport (Ubuntu)  Triaged  High        Benjamin Drung

Bug Description

The system tests test_retrace_system_sandbox and test_retrace_jammy_sandbox fail on arm64, ppc64el, and s390x because apport-retrace is killed by the OOM killer. Example:

autopkgtest kernel: Out of memory: Killed process 3597 (apport-retrace) total-vm:1512420kB, anon-rss:1241460kB, file-rss:2592kB, shmem-rss:0kB, UID:0 pgtables:2554kB oom_score_adj:0

Log: https://autopkgtest.ubuntu.com/results/autopkgtest-oracular-bdrung-apport/oracular/s390x/a/apport/20240722_145904_d3c2f@/log.gz

apport-retrace should be able to retrace crashes without needing multiple gigabytes of memory.

Tags: fr-8574
Revision history for this message
Benjamin Drung (bdrung) wrote :

I tested test_retrace_system_sandbox with python3-memprof on amd64, and it showed that hundreds of megabytes were consumed by file2pkg.
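
A minimal sketch of how such a measurement could look, using only the standard-library tracemalloc module rather than python3-memprof (the cache path is an example and may differ on your system):

```python3
#!/usr/bin/python3
"""Measure roughly how much memory the unpickled file2pkg dict occupies."""

import pickle
import tracemalloc

# Example cache path; adjust to the sandbox cache location on your system.
CACHE = "sandbox_cache/Ubuntu 22.04/contents_mapping-jammy-amd64.pickle"

tracemalloc.start()
with open(CACHE, "rb") as cache_file:
    file2pkg = pickle.load(cache_file)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"entries: {len(file2pkg):,}")
print(f"current: {current / 2**20:.1f} MiB, peak: {peak / 2**20:.1f} MiB")
```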

Revision history for this message
Benjamin Drung (bdrung) wrote :

`sandbox_cache/Ubuntu 22.04/contents_mapping-jammy-amd64.pickle` is 753 MB in size. `_contents_mapping` uses `pickle.load` to load the whole cache into memory.

Idea: Instead of building the whole cache in memory and pickling it, use sqlite3 (part of the Python standard library) to build a database and query that database.
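
A minimal sketch of that idea, assuming a plain key-value table and an example database path (this is an illustration, not the actual apport implementation):

```python3
#!/usr/bin/python3
"""Sketch: query the file-to-package mapping from SQLite instead of unpickling it."""

import sqlite3

# Example database path for illustration only.
DATABASE = "sandbox_cache/Ubuntu 22.04/contents_mapping-jammy-amd64.db"


def open_mapping(database: str) -> sqlite3.Connection:
    """Create the mapping table if needed; it would be filled while parsing Contents files."""
    connection = sqlite3.connect(database)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS file2deb(file TEXT PRIMARY KEY, package TEXT)"
    )
    return connection


def lookup_package(connection: sqlite3.Connection, path: str) -> str | None:
    """Look up a single path on demand instead of holding the whole dict in memory."""
    row = connection.execute(
        "SELECT package FROM file2deb WHERE file = ?", (path,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    connection = open_mapping(DATABASE)
    # Contents files list paths without a leading slash.
    print(lookup_package(connection, "usr/bin/bash"))
    connection.close()
```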

Revision history for this message
Benjamin Drung (bdrung) wrote :

Quick size comparison:

```
$ ./convert
[ 0.000001] pickle.load
[ 2.983616] write sqlite3
[ 8.993280] write DBM (GNU)
[ 61.849732] write DBM (NDBM)
[ 87.584065] done
$ ls -Slh contents_mapping-jammy-amd64.*
-rw-rw-r-- 1 bdrung bdrung 2,1G Jul 23 15:10 contents_mapping-jammy-amd64.ndbm.db
-rw-rw-r-- 1 bdrung bdrung 1,2G Jul 23 15:09 contents_mapping-jammy-amd64.dbm
-rw-r--r-- 1 bdrung bdrung 802M Jul 23 15:08 contents_mapping-jammy-amd64.db
-rw-rw-r-- 1 bdrung bdrung 753M Jul 23 14:49 contents_mapping-jammy-amd64.pickle
```

Script for it:

```python3
#!/usr/bin/python3
"""Convert the pickled file2pkg mapping to SQLite, GNU DBM, and NDBM for a size comparison."""

import dbm.gnu
import dbm.ndbm
import pickle
import sqlite3
import sys
import time

START_TIME = time.perf_counter()

def log(text: str) -> None:
    time_ = time.perf_counter() - START_TIME
    sys.stderr.write(f"[{time_:11.6f}] {text}\n")

def write_sqlite(database, file2pkg):
    connection = sqlite3.connect(database)
    cursor = connection.cursor()
    cursor.execute("CREATE TABLE file2deb(file, package)")
    cursor.executemany("INSERT INTO file2deb VALUES(?, ?)", file2pkg.items())
    connection.commit()
    connection.close()

def write_dbm_gnu(database, file2pkg):
    db = dbm.gnu.open(database, "cf")
    for filename, package in file2pkg.items():
        db[filename] = package
    db.sync()
    db.close()

def write_ndbm(database, file2pkg):
    db = dbm.ndbm.open(database, "c")
    for filename, package in file2pkg.items():
        db[filename] = package
    db.close()

def main(file):
    log("pickle.load")
    with open(f"{file}.pickle", "rb") as file2pkg_file:
        file2pkg = pickle.load(file2pkg_file)
    log("write sqlite3")
    write_sqlite(f"{file}.db", file2pkg)
    log("write DBM (GNU)")
    write_dbm_gnu(f"{file}.dbm", file2pkg)
    log("write DBM (NDBM)")
    write_ndbm(f"{file}.ndbm", file2pkg)
    log("done")

if __name__ == "__main__":
    main("contents_mapping-jammy-amd64")
```

Revision history for this message
Benjamin Drung (bdrung) wrote (last edit):

Benchmark results on a Raspberry Pi Zero 2W:

Creating a database with only jammy-proposed (fits into memory):

* pickle: 43s for creation, 5s for querying (including pickling)
* SQLite: 45s for creation, 1s for querying
* DBM (GNU): 123s for creation, 1s for querying
* DBM (NDBM): 208s for creation, 1-2s for querying

Creating a database with jammy (does not fit into memory):

* pickle: 4h37m29s for creation, 18m50s for load and a single dict access (test_retrace_jammy_sandbox took 16075.58s = 4.5h)
* SQLite: 14m19s for creation, 1.6s for querying
* DBM (GNU): manually aborted after 10h
* DBM (NDBM): manually aborted after several hours (did not check the clock)

So DBM is disqualified and SQLite is the only remaining contender.

Benjamin Drung (bdrung)
tags: added: foundations-todo
Changed in apport (Ubuntu):
assignee: nobody → Benjamin Drung (bdrung)
Changed in apport:
milestone: none → 2.30.0
importance: Undecided → High
status: New → Triaged
Revision history for this message
Benjamin Drung (bdrung) wrote :

Here are the results of the benchmark with different database backends. I pushed the test code to the benchmark-databases branch: https://github.com/bdrung/apport/tree/benchmark-databases. You can execute the individual test scripts or use `run_all` to execute them all.

## Creation and query time

Results on jammy Contents with a Ryzen 7 5700G:

```
$ ./contents_leveldb.py -jd
Creation duration: 39.186 s
Access duration: 1.529 ms
Total duration: 39.566 s

$ ./contents_lmdb.py -jd
Creation duration: 42.202 s
Access duration: 0.292 ms
Total duration: 42.746 s

$ ./contents_sqlite_v1.py -jd
Creation duration: 47.238 s
Access duration: 1.090 ms
Total duration: 47.734 s

$ ./contents_sqlite_v2.py -jd
Creation duration: 47.416 s
Access duration: 1.232 ms
Total duration: 48.882 s

$ ./contents_sqlite_v3.py -jds
Creation duration: 54.331 s
Entries in packages table: 72,036
Entries in directories table: 92,679
Total entries: 8,037,545
Access duration: 1.537 ms
Total duration: 54.657 s

$ ./contents_sqlite_v3b.py -jds
Creation duration: 53.860 s
Entries in packages table: 72,036
Entries in directories table: 30,219
Total entries: 8,037,545
Access duration: 1.606 ms
Total duration: 54.120 s

$ ./contents_ndbm.py -jd
Creation duration: 1 min 41.623 s (= 101.623 s)
Access duration: 0.274 ms
Total duration: 1 min 43.074 s (= 103.074 s)

$ ./contents_dbm.py -jd
Creation duration: 2 min 6.741 s (= 126.741 s)
Access duration: 0.364 ms
Total duration: 2 min 7.168 s (= 127.168 s)

$ ./contents_pickle.py -jd
Creation duration: 36.603 s
Save duration: 4.301 s
Access duration: 0.017 ms
Total duration: 41.456 s

$ ./contents_unoptimized_pickle.py -jd
Creation duration: 59.445 s
Save duration: 4.354 s
Access duration: 0.018 ms
Total duration: 1 min 4.296 s (= 64.296 s)
```

Results on a Raspberry Pi Zero 2W on 2024-07-26 with the jammy Contents files (dbm and ndbm were too slow):

```
$ ./contents_leveldb.py -dsj
Creation duration: 28 min 38.402 s (= 1718.402 s)
Total entries: 7,911,114
Access duration: 153.075 ms
Total duration: 32 min 17.500 s (= 1937.500 s)

$ ./contents_lmdb.py -dsj
Creation duration: 35 min 15.257 s (= 2115.257 s)
Total entries: 7,911,114
Access duration: 508.313 ms
Total duration: 36 min 47.932 s (= 2207.932 s)

$ ./contents_sqlite_v1.py -dsj
Creation duration: 14 min 32.801 s (= 872.801 s)
Total entries: 7,911,114
Access duration: 39.372 ms
Total duration: 16 min 0.188 s (= 960.188 s)

$ ./contents_sqlite_v2.py -dsj
Creation duration: 14 min 45.833 s (= 885.833 s)
Total entries: 7,911,114
Access duration: 37.519 ms
Total duration: 15 min 55.706 s (= 955.706 s)

$ ./contents_sqlite_v3.py -dsj
Creation duration: 16 min 20.365 s (= 980.365 s)
Entries in packages table: 72,036
Entries in directories table: 92,679
Total entries: 8,037,545
Access duration: 45.323 ms
Total duration: 18 min 6.704 s (= 1086.704 s)

$ ./contents_pickle.py -dsj
Creation duration: 24 min 36.127 s (= 1476.127 s)
Save duration: 4 h 9 min 1.947 s (= 14941.947 s)
Total entries: 7,911,114
Access duration: 33.207 ms
Total duration: 4 h 37 min 24.957 s (= 16644.957 s)
```

Results on a Raspberry Pi Zero 2W on 2024-07-25 with the noble Contents files (dbm and ndbm were too slow):

```
$ ./run_all -ds
$ ./contents_leveldb.py -ds
Creation duration: 8...
```

Revision history for this message
Benjamin Drung (bdrung) wrote :

Takeaway from the benchmarks:

The current implementation (dict + pickle) is the fastest as long as there is enough memory to avoid swapping. Once the data no longer fits into memory, SQLite is the fastest (with v1 faster than v2, and v2 faster than v3). With low memory and no swap, only the SQLite variants were not killed. So I propose we implement the SQLite solution.

The question is which variant?

* v1: simple key-value store. Benefit: fastest and simplest solution. Drawback: biggest file on disk (jammy: 1511 MB vs 777 MB for the pickle file).

* v2: normalized form with the package names stored in a separate table. Benefit: smaller than v1 (jammy: 1.31 GB vs 1.51 GB), pleases database purists, only slightly slower than v1. Drawback: more code.

* v3: like v2, but the key is additionally split into the directory part and the rest of the path. Benefit: smaller than v2 and can even beat the pickle file (jammy: 710 MB vs 777 MB pickle vs 1313 MB v2). Drawbacks: significantly slower than v1 and v2, slightly more complex than v2 (see the schema sketch below).
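
A rough sketch of what the three schema variants could look like; the table and column names are my guesses based on the benchmark output above, not the actual code from the benchmark branch:

```python3
#!/usr/bin/python3
"""Sketch of the three SQLite schema variants (names are guesses, not the benchmark code)."""

import sqlite3

# v1: plain key-value table, one row per file path.
SCHEMA_V1 = "CREATE TABLE file2deb(file TEXT PRIMARY KEY, package TEXT);"

# v2: normalized; package names are deduplicated into a separate table.
SCHEMA_V2 = """
CREATE TABLE packages(id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE files(file TEXT PRIMARY KEY, package_id INTEGER REFERENCES packages(id));
"""

# v3: like v2, but the path is split into a deduplicated directory part and a basename.
SCHEMA_V3 = """
CREATE TABLE packages(id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE directories(id INTEGER PRIMARY KEY, path TEXT UNIQUE);
CREATE TABLE files(
    directory_id INTEGER REFERENCES directories(id),
    basename TEXT,
    package_id INTEGER REFERENCES packages(id),
    PRIMARY KEY (directory_id, basename)
);
"""

for name, schema in [("v1", SCHEMA_V1), ("v2", SCHEMA_V2), ("v3", SCHEMA_V3)]:
    connection = sqlite3.connect(":memory:")
    connection.executescript(schema)
    print(f"{name}: schema created")
    connection.close()
```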

Benjamin Drung (bdrung)
summary: - apport-retrace needs more than 1 GB
+ apport-retrace needs more than 1 GB memory (when using sandbox)
Revision history for this message
Benjamin Drung (bdrung) wrote : Re: apport-retrace needs more than 1 GB memory (when using sandbox)

For reference, the sizes of the Contents files:

## jammy

Compressed:

```
$ du -B M -s *jammy*Contents* | sort -n
19M jammy-proposed-Contents-amd64.gz
45M jammy-Contents-amd64.gz
122M jammy-security-Contents-amd64.gz
126M jammy-updates-Contents-amd64.gz
```

Uncompressed:

```
$ du -B M -s *jammy*Contents* | sort -n
339M jammy-proposed-Contents-amd64
717M jammy-Contents-amd64
2242M jammy-security-Contents-amd64
2306M jammy-updates-Contents-amd64
```

## noble

Compressed:

```
$ du -B M -s *noble*Contents* | sort -n
3M noble-proposed-Contents-amd64.gz
11M noble-security-Contents-amd64.gz
11M noble-updates-Contents-amd64.gz
49M noble-Contents-amd64.gz
```

Uncompressed:

```
$ du -B M -s *noble*Contents* | sort -n
44M noble-proposed-Contents-amd64
182M noble-security-Contents-amd64
188M noble-updates-Contents-amd64
791M noble-Contents-amd64
```

Revision history for this message
Benjamin Drung (bdrung) wrote :

Benchmarks can be found nicely formatted at https://github.com/bdrung/apport/blob/benchmark-databases/Benchmark.md

SQLite is the fastest on a Raspberry Pi Zero 2W and the only remaining contender for fixing this bug. Among the different SQLite implementation variants, you can trade creation time for disk space saved.

Since I was too curious, I spent some nights rewriting those SQLite benchmarks in Rust (see https://github.com/bdrung/apport-rs). The Rust implementation is obviously faster. Surprisingly, the creation time for the different database layouts is nearly the same in Rust. So I conclude that the increased creation times in Python are caused by the Python code (and not by the more complex SQL statements).

Benjamin Drung (bdrung)
summary: - apport-retrace needs more than 1 GB memory (when using sandbox)
+ apport-retrace needs more than 1.5 GB memory (when using sandbox)
Revision history for this message
Simon Chopin (schopin) wrote :

This feature should be postponed to the 2.31 version (i.e. the 25.04 cycle), as we don't fully understand the typical usage patterns of apport-retrace. We're talking about a substantial change, and targeting it at 2.30 (which needs to be cut in the coming two weeks) would be rushing it, especially since we don't actually have reports of this being an issue in production. It is only becoming a problem because we now have apport-retrace system tests during autopkgtests, which is by definition an artificial setting.

Benjamin Drung (bdrung)
Changed in apport:
milestone: 2.30.0 → 2.31.0
Benjamin Drung (bdrung)
tags: added: fr-8574
removed: foundations-todo