Note that if we change the loop to:
python -c "
from bzrlib import trace, branch
trace.enable_default_logging()
b = branch.Branch.open('launchpad-2a/devel')
b.lock_read()
maps = []
for vf_name in ['revisions', 'signatures', 'inventories', 'chk_bytes', 'texts']:
    vf = getattr(b.repository, vf_name)
    maps.append(vf.keys())
    trace.debug_memory('after %s' % (vf_name,))
b.unlock()
trace.debug_memory('after unlock')
del maps
trace.debug_memory('after del maps')
"
Internally, .keys() calls iter_all_entries() which does not cache the keys in the btree caches. At that point we have:
after revisions
VmPeak: 25900 kB
VmSize: 25900 kB
VmRSS: 23012 kB
after signatures
VmPeak: 28464 kB
VmSize: 28464 kB
VmRSS: 25560 kB
after inventories
VmPeak: 32080 kB
VmSize: 32080 kB
VmRSS: 29008 kB
after chk_bytes
VmPeak: 101212 kB
VmSize: 95252 kB
VmRSS: 92396 kB
after texts
VmPeak: 113188 kB
VmSize: 109976 kB
VmRSS: 107108 kB
after unlock
VmPeak: 113188 kB
VmSize: 109976 kB
VmRSS: 107108 kB
after del maps
VmPeak: 113188 kB
VmSize: 94596 kB
VmRSS: 91728 kB
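The VmPeak/VmSize/VmRSS triples above come from trace.debug_memory, which reads the process status file on Linux. A minimal standalone equivalent (a sketch, not bzrlib's actual implementation) looks like:

```python
import os


def debug_memory(message='', fields=('VmPeak', 'VmSize', 'VmRSS')):
    """Print selected Vm* lines from /proc/self/status (Linux only)."""
    if message:
        print(message)
    with open('/proc/%d/status' % os.getpid()) as f:
        for line in f:
            # Each line looks like 'VmRSS:      23012 kB'
            if line.split(':')[0] in fields:
                print('  ' + line.strip())


debug_memory('after building maps')
```

VmPeak only ever grows, which is why it stays at 113188 kB after the unlock even as VmSize/VmRSS drop.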
If I add this debug loop:
import gc
import pprint
from memory_dump import scanner, _scanner
scanner.dump_all_referenced(open(',,maps_refs.txt', 'wb'), maps)
size = dict.fromkeys('0123', 0)
size['0'] = _scanner.size_of(maps)
for x in maps:
    size['1'] += _scanner.size_of(x)
    for y in gc.get_referents(x):
        size['2'] += _scanner.size_of(y)
        for z in gc.get_referents(y):
            size['3'] += _scanner.size_of(z)
pprint.pprint(size)
Assuming I did it correctly, maps is a list of lists of keys (tuples of strings), and I get:
{'0': 64, '1': 15729200, '2': 26713160, '3': 92853917}
for a total of 135,296,341 bytes.
So we have:
  64 bytes allocated to the overall list,
  15.7M bytes allocated to the lists of keys,
  26.7M bytes allocated to the key tuples, and
  92.9M bytes allocated to the strings.
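The same three-level accounting can be reproduced with the stdlib alone, using sys.getsizeof in place of _scanner.size_of; the toy data below stands in for the real repository keys:

```python
import gc
import sys

# Toy stand-in for the real data: a list of lists of key tuples of strings.
maps = [[('rev-%d' % i,) for i in range(1000)] for _ in range(3)]

size = dict.fromkeys('0123', 0)
size['0'] = sys.getsizeof(maps)              # the outer list
for x in maps:                               # each list of keys
    size['1'] += sys.getsizeof(x)
    for y in gc.get_referents(x):            # each key tuple
        size['2'] += sys.getsizeof(y)
        for z in gc.get_referents(y):        # each string inside a tuple
            size['3'] += sys.getsizeof(z)

print(size)
```

gc.get_referents walks one level of references via tp_traverse, so the '1'/'2'/'3' buckets correspond to the list, tuple, and string layers respectively.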
Now the strings are, in theory, shared and deduplicated via intern(), and my size_of() loop does not account for that: a string referenced from many key tuples gets counted once per reference, not once per object.
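To see how much that over-counting matters, compare a per-reference sum against one deduplicated on object identity. This is a stdlib sketch with made-up key data, not the real _scanner code:

```python
import sys

shared = 'same-file-id'                       # stands in for an interned string
keys = [(shared, 'rev-%d' % i) for i in range(100)]

# Naive: count every string once per reference, as the size_of() loop does.
naive = sum(sys.getsizeof(s) for t in keys for s in t)

# Deduplicated: count each distinct string object once, keyed on id().
seen = {}
for t in keys:
    for s in t:
        seen[id(s)] = s
deduped = sum(sys.getsizeof(s) for s in seen.values())

# The shared string was counted 100 times naively but only once deduplicated.
print(naive - deduped)
```

The difference is exactly 99 extra copies of the shared string's size, which is why the '3' bucket above overstates the true string footprint when intern() is working.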
If I get rid of the intern() calls by editing _btree_serializer_pyx.pyx to change the call from safe_interned_string_from_size to safe_string_from_size, I end up with:
VmPeak: 162088 kB
VmSize: 158876 kB
VmRSS: 155868 kB
{'0': 64, '1': 15729200, '2': 26713160, '3': 92853917}
135,296,341
However, you can see that VmPeak went from 113188 kB to 162088 kB, so the intern() does seem to be helping. (The size_of() totals are unchanged because, as noted above, the loop counts strings per reference and so never saw the sharing in the first place.)
Most concerning to me is that we don't seem to be reclaiming the memory when we are done, which is strange.
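One way to probe that reclamation question in isolation is to snapshot VmRSS around an allocate/free cycle (a Linux-only sketch; whether the interpreter returns freed pages to the OS depends on allocator arena fragmentation, so no particular "after" value is guaranteed):

```python
import gc
import os


def vm_rss_kb():
    """Return the current VmRSS in kB from /proc/self/status (Linux only)."""
    with open('/proc/%d/status' % os.getpid()) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])


before = vm_rss_kb()
data = [('key-%d' % i,) for i in range(500000)]   # ~tens of MB of tuples
during = vm_rss_kb()
del data
gc.collect()
after = vm_rss_kb()
print(before, during, after)
```

If "after" stays near "during" rather than "before", the memory is still held in the allocator's free lists or arenas, which would match the behaviour seen above where del maps only recovered part of VmRSS.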