onboard memory usage/cpu increase and lockup

Bug #1317436 reported by Zen25000
Affects                 Status    Importance  Assigned to
Onboard                 Invalid   Undecided   Unassigned
at-spi2-core (Ubuntu)   New       Undecided   Unassigned

Bug Description

I am running Onboard in Mageia 5 (Cauldron), and recently I have been seeing a total lock-up of Onboard after the machine has been running for over 24 hours.

The tray icon is there but does nothing. At this point Onboard is using 25% CPU and 880 MB of RAM.

I am attaching the memory usage analysis for onboard (after the lockup) in the hope that it will be of use.

In the meantime I will try to monitor memory usage over a day to see if this increase is gradual or happens suddenly.

Revision history for this message
Zen25000 (zen25000) wrote :
Revision history for this message
marmuta (marmuta) wrote :

Thanks for reporting this. I've been tracking down memory leaks not long ago and found two so far.

One that leaks for every update of the word suggestions is fixed in trunk:
https://bazaar.launchpad.net/~onboard/onboard/trunk/revision/1803

The other one appears to be caused by libatspi/libdbus. Heap usage still counts against Onboard, but I can reproduce the leak with a trivial C sample, meaning I can't fix it from Onboard. The effects of this leak can be quite massive under the right circumstances: simply using the Gtk3 version of the synaptic package manager quickly adds hundreds of MB, up to GBs, to Onboard, but that seemed to be the only such occasion so far. I haven't seen significant heap increases during regular usage.

If you monitor memory usage, please apply revision 1803 first. Then it would be interesting to see whether a specific application triggers additional leaks or whether there is a more or less consistent increase across all usage.

I'm not sure about the lockup. A crash at the right spot can incapacitate Onboard, but it would usually still show some useless reactions, not lock up completely. If you can, maybe keep Onboard running in a terminal and check whether there are Python tracebacks when it happens again.

Revision history for this message
Francesco Fumanti (frafu) wrote :

I just uploaded revision 1805 of trunk into our Snapshots PPA. It should be available as soon as launchpad has finished with building it.
https://launchpad.net/~onboard/+archive/snapshots

Revision history for this message
Zen25000 (zen25000) wrote :

I rebuilt Onboard with the patch applied and ran it while the machine was idle for about 16 hours (overnight until 4 pm the following day).
On returning to the machine, clicking the Onboard tray icon failed to open it, as before.

During the test there was no direct use of Onboard at all, and the machine was idle except for the regular system cron jobs plus two others: one running a speed test against a server hourly, the other checking memory for this test. It was also running hexchat, connected to freenode.

The memory usage increased steadily with time and appears to be still increasing after the lock-up.
Unfortunately I did not run it in a terminal so have no output for that occasion.

I then forcibly killed the process and restarted it (in a terminal) just before the 16:30 cron job, to get an idea of the memory used at start, and followed this with a full breakdown (seconds later).

See notes in attached test results.

The terminal output is interesting at the bottom.

I will re-run with the terminal output piped to a file until it locks again.

Revision history for this message
Zen25000 (zen25000) wrote :

Following on from the previous post - I let this run overnight again and this morning it had locked up.

The terminal output file was last updated at 02.46 which coincides with memory use of 278MB.

There was nothing but the same repeating message in the terminal output, and nothing was added on clicking the tray icon.

journalctl -f showed nothing either.

As can be seen from the attached, the memory use continues increasing after the lockup.

Maybe you could give me some ideas on how to test/debug further?

Revision history for this message
marmuta (marmuta) wrote :

I see, we'd need to figure out what makes the heap grow. I'd first try to narrow it down by turning off features in preferences. Start with turning off word-suggestions, auto-capitalization and auto-show, in that order, restart Onboard and watch memory usage. Not too long, just long enough so you can tell memory usage is rising consistently.

Valgrind would give more detailed insights, but it's not an easy beast to handle. I'll have more time to look into the leak tomorrow, but if you want to try it, here is something to get started.

1) Install debug symbols for Onboard's dependencies; I'd start with everything at-spi and libdbus.

2) Rebuild Onboard without optimization:
cd onboard
CFLAGS="-g -O0" ./setup.py build --force

3) Copy the attached suppression file somewhere; I'll assume it's in Onboard's project directory. It will cut down on the size of valgrind.log.

4) Run ./onboard in valgrind. Do this in an xterm to avoid a feedback loop where word suggestions are updated due to Onboard's own console output. Warning: Onboard will run at <1/10 of its original speed.
xterm
cd onboard
valgrind --tool=memcheck --leak-check=full --show-leak-kinds=all --suppressions=python3.supp --log-file=valgrind.log python3 -E -tt -u ./onboard

5) Wait... not necessarily overnight, but long enough for the leak to stand out due to its size. valgrind.log will report tons of perceived problems; only very few are interesting for us.

6) Check for prominent leaks in valgrind.log. That file will be too large to post. I'm no expert on valgrind usage, but what I do is go to the end and scroll backwards until I find an entry for a large amount of lost memory in a high number of blocks. If you know Onboard grew 10 MB since the start, that's about the size I would look for first.

7) If you find suspicious entries, post them here, including the call stack below them. Make sure the call stacks' topmost entries are not filled with question marks but show function and file names. If there are question marks, you'll have to install more debug symbols and restart from 4).
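That backwards search through valgrind.log can also be done mechanically. A shell sketch (assuming memcheck's usual summary-line format; note the byte counts contain thousands separators, which must be stripped before numeric sorting):

```shell
#!/bin/sh
# Sketch: list the ten largest "definitely lost" records in valgrind.log,
# with their line numbers, so you can jump straight to their call stacks.
# Assumes memcheck's usual summary line, e.g.:
#   ==1234== 1,890,176 bytes in 118,136 blocks are definitely lost in loss record 5 of 9
# grep -n prefixes each hit with "LINE:"; tr strips the thousands
# separators so that sort can order field 2 (the byte count) numerically.
grep -n "blocks are definitely lost" valgrind.log \
    | tr -d ',' \
    | sort -k2 -rn \
    | head -n 10
```

Each output line then starts with the line number in valgrind.log where that record (and its call stack) begins.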

The "AT-SPI: Error in GetItems" warning used to be benign. I believe it is caused by accessibles getting lost, which happens all the time when their widgets are destroyed. Having this message periodically every 2 seconds is weird, though. It might be related to the way you track memory usage for this bug. Perhaps run all diagnostics and do all Onboard launches from xterms; xterm doesn't have accessibility support and is less prone to interfere with the testing.

Revision history for this message
marmuta (marmuta) wrote :

So, I spent the day on Mageia :). Results are inconclusive: I did see a tendency toward rising memory usage, but in some six hours RSS only crept up from ~40 MB to ~65 MB. A second, shorter run had it fall from ~55 to ~45 at some point too; that could be Python's garbage collection, but I'm not sure.

I did install hexchat and had it running, logged into some busy channels on freenode, but I doubt that had any effect on Onboard. There is no sign of any AT-SPI activity coming from it.

I could imagine konsole having an effect, in particular if you are running frequent periodic jobs in it. I don't remember seeing this before, but it now actually sends object:text-changed events for every text output. This is the message I found to leak memory in a simple C sample when working with synaptic. I haven't figured out yet where to file that bug report.

BTW., konsole seems closer to supporting auto-show and word-suggestions in Onboard now. With some tweaking I got suggestions on a new empty tab. The reported role is still invalid, though, should be ATSPI_ROLE_TERMINAL instead of some random number, and changed text positions seem off after the first command, so Onboard sees the wrong text.

I'll attach the script I hacked up to monitor memory usage, maybe that comes in handy.

Revision history for this message
Zen25000 (zen25000) wrote :

Hi,
I did some testing in a different Cauldron installation yesterday and today, after fully updating it beforehand.
In this system, over an hour or so, I was only seeing the rise to around 75 MB that you are seeing; however, I then ran Mozilla Thunderbird to check my mail and it immediately jumped up by 100 MB.

I then came back to my regular Cauldron, again updated it, and ran it for some time without running TB; it settled at around 75 MB while I was using hexchat, dolphin and konsole and doing a lot of package building. I then opened Thunderbird, and it shot up to over 650 MB as soon as mail downloads started.

I closed TB and it went up another 50 MB, then stayed at around 700 MB for a few hours until I ran TB again, when it jumped to 880 MB. Onboard is still working.

I think the crash may have been down to python3, as the python script I am using to check the memory was also misbehaving until I updated (there was a python3 update).

Regarding your memory script - I can't get it to run, as here "pidof onboard" always returns nothing when onboard is running. I am using mem.py, called in a bash script loop with a 5-second delay, which works well.
http://gnutoolbox.com/download/linux/scripts/mem.py
#!/bin/bash
# Usage: chkmem <process_name> <full_path_to_log_file>
[[ $# -lt 2 ]] && { echo " Usage # chkmem <process_name> <full_path_to_log_file>"; exit 1; }
while true; do
    mem_used=$(/mydata/bin/mem.py | grep "$1" | sed "s/$1//" | tr -s ' ')
    echo "$(date +%d/%m/%Y-%H:%M) $mem_used" >> "$2"
    sleep 5
done
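Incidentally, the pidof failure is expected: Onboard runs as a python3 interpreter process, so its process name is python3, not onboard. A minimal sketch of a monitor that works around this by matching the full command line instead (assumptions: Linux /proc; the first match wins; sample once per invocation and wrap it in a shell loop like the one above for periodic logging):

```python
#!/usr/bin/env python3
# Sketch: log memory of a process found by a command-line substring, since
# "pidof onboard" finds nothing (the process is the python3 interpreter).
# Assumptions: Linux /proc; first matching process wins.
import os
import sys
import time

def find_pid(pattern):
    """First PID (other than our own) whose /proc/<pid>/cmdline contains pattern."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit() or int(entry) == os.getpid():
            continue
        try:
            with open("/proc/%s/cmdline" % entry, "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode("utf-8", "replace")
        except OSError:
            continue  # process exited or is unreadable
        if pattern in cmdline:
            return int(entry)
    return None

def mem_kb(pid):
    """Return VmSize and VmRSS (in kB) from /proc/<pid>/status."""
    fields = {}
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            key, _, rest = line.partition(":")
            if key in ("VmSize", "VmRSS"):
                fields[key] = int(rest.split()[0])  # value is "<n> kB"
    return fields

if __name__ == "__main__":
    pattern = sys.argv[1] if len(sys.argv) > 1 else "onboard"
    stamp = time.strftime("%d/%m/%Y-%H:%M")
    pid = find_pid(pattern)
    if pid is None:
        print("%s process '%s' not found" % (stamp, pattern))
    else:
        m = mem_kb(pid)
        print("%s %s(%d) size=%d rss=%d"
              % (stamp, pattern, pid, m["VmSize"], m["VmRSS"]))
```

Run e.g. as "./memlog.py onboard >> onboard-mem.log" from the shell loop above instead of mem.py.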

I spent hours with valgrind, but could not find a way to check the memory used by onboard while it was running in valgrind. :\
However I will try again maybe tomorrow and will run Thunderbird during the test.

I will also repeat the overnight long test that previously caused the lock-up in this system.

Revision history for this message
marmuta (marmuta) wrote :

Yes, I can reproduce heap usage shooting up with Thunderbird, and it happens in the updated Mageia 5 as well as in Ubuntu 14.04.
2014-16-13 23:16:33 onboard(4353) size=721720 rss=53280 shared=17224 text=3408 data=329952 heap=21340
2014-16-13 23:16:53 onboard(4353) size=721720 rss=53280 shared=17224 text=3408 data=329952 heap=21340
2014-17-13 23:17:13 onboard(4353) size=721720 rss=53280 shared=17224 text=3408 data=329952 heap=21340
2014-17-13 23:17:33 onboard(4353) size=775072 rss=106184 shared=17224 text=3408 data=383304 heap=74688
2014-17-13 23:17:53 onboard(4353) size=776432 rss=107812 shared=17228 text=3408 data=384664 heap=76048
2014-18-13 23:18:13 onboard(4353) size=776432 rss=107848 shared=17224 text=3408 data=384664 heap=76048

I created a newsgroup account in Thunderbird and, with Onboard running, entered a search term in the list of available newsgroups. That was enough to triple the size of the heap, as above. It might be that filling list views with lots of rows is key for the memory leak to become noticeable. The same thing happens with synaptic.

I did manage to finish a valgrind run on Mageia. The instructions above mostly worked, but I changed python3.supp to the one provided by the valgrind package there. From the valgrind.log it appears that most of the large left-over allocations come from libatspi and libdbus, again similar to what I saw with synaptic.

There are still question marks in valgrind.log, I couldn't figure out where to get some of the debug symbols from. I have the lib64atspi-devel package installed and tried
debuginfo-install lib64atspi0
but got
No package named at-spi2-core-debuginfo

Revision history for this message
marmuta (marmuta) wrote :

Thanks for your script. Mine had to be run as 'logmem python3' before; I've fixed it to take the process name, so 'logmem onboard' works now. Yours is more accurate for overall size, but I thought I needed slightly more fine-grained info, heap size in particular.

Revision history for this message
marmuta (marmuta) wrote :

Orca is affected too, same procedure with Thunderbird filtering newsgroups as described in #9.

$ logmem orca
2014-05-14 00:52:08 orca(7036) size=530724 rss=36684 shared=12348 text=3408 data=318000 heap=11116
2014-05-14 00:52:28 orca(7036) size=531236 rss=36824 shared=12388 text=3408 data=318512 heap=11116
2014-05-14 00:52:48 orca(7036) size=531236 rss=36824 shared=12388 text=3408 data=318512 heap=11116
2014-05-14 00:53:08 orca(7036) size=531236 rss=36824 shared=12388 text=3408 data=318512 heap=11116
2014-05-14 00:53:28 orca(7036) size=540008 rss=45928 shared=12420 text=3408 data=327284 heap=18096
2014-05-14 00:53:48 orca(7036) size=631040 rss=136296 shared=12424 text=3408 data=418316 heap=106816
2014-05-14 00:54:09 orca(7036) size=631040 rss=136296 shared=12424 text=3408 data=418316 heap=106816
2014-05-14 00:54:29 orca(7036) size=631296 rss=136608 shared=12424 text=3408 data=418572 heap=106816

Onboard at the same time:

$ logmem onboard
2014-05-14 00:51:25 process 'onboard' not found
2014-05-14 00:51:45 onboard(6998) size=721792 rss=53184 shared=17196 text=3408 data=330024 heap=21416
2014-05-14 00:52:05 onboard(6998) size=721792 rss=53124 shared=17128 text=3408 data=330024 heap=21416
2014-05-14 00:52:25 onboard(6998) size=721792 rss=53176 shared=17148 text=3408 data=330024 heap=21416
2014-05-14 00:52:45 onboard(6998) size=721792 rss=53176 shared=17148 text=3408 data=330024 heap=21416
2014-05-14 00:53:05 onboard(6998) size=721792 rss=53176 shared=17148 text=3408 data=330024 heap=21416
2014-05-14 00:53:25 onboard(6998) size=721792 rss=53176 shared=17148 text=3408 data=330024 heap=21416
2014-05-14 00:53:46 onboard(6998) size=820936 rss=151388 shared=17172 text=3408 data=429168 heap=117480
2014-05-14 00:54:06 onboard(6998) size=820936 rss=151388 shared=17172 text=3408 data=429168 heap=117480
2014-05-14 00:54:26 onboard(6998) size=820936 rss=151440 shared=17224 text=3408 data=429168 heap=117480

The heap growth is almost the same:
106816-11116 = 95700
117480-21416 = 96064
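That comparison can be pulled out of the logs mechanically; a small sketch, assuming the heap=<kB> field format of the logmem lines above:

```python
# Sketch: heap growth between the first and last sample of a logmem-style
# log. Assumes each line carries a "heap=<kilobytes>" field as shown above.
import re

def heap_growth(lines):
    """Last heap= value minus first heap= value, in kB."""
    heaps = [int(m.group(1))
             for m in (re.search(r"heap=(\d+)", ln) for ln in lines)
             if m is not None]
    return heaps[-1] - heaps[0]

# First and last samples from the logs above:
orca = ["00:52:08 orca(7036) heap=11116",
        "00:54:29 orca(7036) heap=106816"]
onboard = ["00:51:45 onboard(6998) heap=21416",
           "00:54:26 onboard(6998) heap=117480"]
print(heap_growth(orca), heap_growth(onboard))  # 95700 96064
```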

Revision history for this message
Zen25000 (zen25000) wrote :

Re: at-spi2-core-debuginfo
You need to enable core-debug repo :)
Either use mcc GUI or this should do it:
# urpmi.addmedia "Core Release Debug" http://distrib-coffee.ipsl.jussieu.fr/pub/linux/Mageia/distrib/cauldron/x86_64/media/debug/core/release
Then:
# urpmi at-spi2-core-debuginfo
or what you tried before should also work, but I never used it.

marmuta (marmuta)
Changed in onboard:
status: New → Invalid
Revision history for this message
Zen25000 (zen25000) wrote :

Not sure if this is useful - a short run: I opened and closed TB, which downloaded 25 emails, then stopped Onboard:

==11274== 1,890,176 bytes in 118,136 blocks are definitely lost in loss record 17,326 of 17,338
==11274== at 0x4C2863A: realloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==11274== by 0x1070577E: set_length (dbus-string.c:352)
==11274== by 0x10706054: _dbus_string_append_len (dbus-string.c:903)
==11274== by 0x106F6153: dbus_message_iter_get_signature (dbus-message.c:2157)
==11274== by 0x104C620C: _atspi_dbus_set_interfaces (atspi-misc.c:1364)
==11274== by 0x104C6FC3: add_accessible_from_iter (atspi-misc.c:460)
==11274== by 0x104C7622: process_deferred_messages.part.14 (atspi-misc.c:681)
==11274== by 0x104C7798: process_deferred_messages_callback (atspi-misc.c:748)
==11274== by 0x953CABC: g_main_context_dispatch (gmain.c:3064)
==11274== by 0x953CE27: g_main_context_iterate.isra.24 (gmain.c:3734)
==11274== by 0x953D0E9: g_main_loop_run (gmain.c:3928)
==11274== by 0xB7D77F4: gtk_main (in /usr/lib64/libgtk-3.so.0.1200.1)

Revision history for this message
marmuta (marmuta) wrote :

> Re: at-spi2-core-debuginfo
> You need to enable core-debug repo :)
Thank you :) added to my notes

I'm not sure about #13 either. Most of these allocations seem to happen in libdbus; this one might be part of something larger that got lost. I have a valgrind.log small enough to attach as a whole, from a new C sample that reproduces the leak. Maybe that helps, we'll see. I've opened an upstream bug report and attached everything there.
https://bugzilla.gnome.org/show_bug.cgi?id=730152

Revision history for this message
Zen25000 (zen25000) wrote :

So, it seems to me that there are two separate issues here.

1. Related to Thunderbird that you have replicated and reported.

2. The steady leak over time that finally locks up Onboard as per original report.

Regarding 2 - I left onboard running last night after stopping all of the following:
httpd
crond
firefox
thunderbird
gnote
hexchat

gwenview was open.

This morning Onboard was again unresponsive and eventually had to be killed. Attached is a graph created by a modified version of your logmem script (using gnuplot).

Revision history for this message
marmuta (marmuta) wrote :

OK, I thought from comment #8 that the lockup issue was gone. Have you been able to run Onboard in a terminal (xterm preferred) during the latest lockup?

That graph is certainly interesting; the rise is >30 MB/h, at least ten times as much as I see here on either Mageia or Ubuntu. My working theory is still that 1) and 2) have the same cause, i.e. the leak I reported is happening all the time, not just in Thunderbird and synaptic. It's just more visible there due to those applications' tendency toward high-frequency bursts of at-spi events. See
https://bugs.launchpad.net/ubuntu/+source/synaptic/+bug/1244474

If that is the case, then something must be continuously emitting ten times as many events on your system as on your test install or on either of my systems. Periodic terminal output in konsole or gnome-terminal could do that. Are there any terminal jobs left that are running all the time?

I don't want to rule out Onboard itself leaking; it wouldn't be the first time. I'd even prefer that, as it would be far easier for me to fix, but the evidence doesn't seem to indicate this is happening. Also, there are many places that could leak, but a lot fewer without interaction. Off the top of my head, the only long-term idle activities are handling incoming at-spi events and attempts to auto-save learned word suggestions (which should do nothing when idle).

Things you can try:
- Run the leaktest1 sample from the gnome-bug report and monitor its memory usage. Does it show a linear increase too?

- Stop Onboard from listening to any at-spi events and monitor memory usage. Does it still rise?
gsettings set org.onboard.typing-assistance.word-suggestions enabled false
gsettings set org.onboard.typing-assistance auto-capitalization false
gsettings set org.onboard.typing-assistance auto-correction false
gsettings set org.onboard.auto-show enabled false

- Turn off anything on your system that periodically generates text in GUI applications, in particular in konsole and gnome-terminal (xterm is fine), and monitor memory usage.
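To make "does it still rise?" less subjective, here is a small sketch that fits a slope to the samples (assuming logmem-style rss=<kB> fields; the slope is per sample, so multiply by your sampling rate to get kB/h):

```python
# Sketch: estimate memory growth per sample from logmem-style lines via a
# least-squares slope; a clearly positive slope means "still rising".
import re

def rss_slope(lines):
    """Least-squares slope of rss= values (kB) over sample index."""
    ys = [int(m.group(1))
          for m in (re.search(r"rss=(\d+)", ln) for ln in lines)
          if m is not None]
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

flat = ["rss=100", "rss=100", "rss=100"]
rising = ["rss=100", "rss=200", "rss=300"]
print(rss_slope(flat), rss_slope(rising))  # 0.0 100.0
```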

Revision history for this message
Zen25000 (zen25000) wrote :

> - Run the leaktest1 sample from the gnome-bug report and monitor its memory usage. Does it show a linear increase too?

I've been short on time, but I managed to run leaktest1 on my main system. The leakage seems to be related to mouse/keyboard use to some degree. Attached is an annotated graph from my regular user account.
Under a new user account, the leakage is minimal. Overnight graph attached below.

Revision history for this message
Zen25000 (zen25000) wrote :