RPM

rpm database extremely slow after preupgrade to f16

Bug #913615 reported by Jeff Johnson
Affects   Status        Importance   Assigned to
RPM       New           Undecided    Unassigned
Fedora    Fix Released  Medium

Bug Description

tracker: slow fsync is the root cause

Tags: fedora rpmdb
Rob (rob-redhat-bugs) wrote:

Description of problem:

Performing normal operations in yum took a very long time after upgrading my desktop to f16. After some prodding directed by the fine fellows in #yum, it was concluded that this was a problem with my rpm database. It seems that interacting with the Packages database takes a lot of time.

The first thing I tried was rebuilding the rpm database (following the "oh, something's wrong with RPM, better rebuild its database" strategy) after deleting the __db* files. Rebuilding the database takes a long time:

[rob@zarniwoop ~(master)]$ time sudo rpm --rebuilddb

real 5m0.445s
user 0m11.372s
sys 0m3.372s

Also:

[root@zarniwoop rpm]$ time python -c 'import yum; yum.YumBase().rpmdb.simplePkgList()'
Loaded plugins: auto-update-debuginfo, fastestmirror, presto, refresh-packagekit

real 3m22.227s
user 0m1.779s
sys 0m1.097s

My desktop has 3237 packages installed.
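
(For comparison on another system, the installed-package count can be checked with a one-liner; a trivial sketch, not from the report:)

# count installed packages; gpg-pubkey entries are included in this count
rpm -qa | wc -l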

One suspicion was the disks in my desktop; however, they appear to be completely ship-shape. There's no funk in dmesg, I can write to a file in /tmp (same FS as /var/lib/rpm) at 80MB/sec using dd, and parted reports my partitions are aligned correctly. filefrag reports that the Packages file is in 55 chunks, which doesn't seem crazy.
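
(The checks described above amount to roughly the following; device and partition names are illustrative, not from the report:)

# look for disk errors
dmesg | tail -n 50
# raw sequential write throughput to the same filesystem as /var/lib/rpm
dd if=/dev/zero of=/tmp/ddtest bs=1M count=512 conv=fdatasync && rm -f /tmp/ddtest
# partition alignment (adjust device and partition number)
sudo parted /dev/sda align-check optimal 1
# fragmentation of the Packages file
sudo filefrag /var/lib/rpm/Packages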

skvidal suggested that I try copying my desktop's rpm database over to my laptop and rebuilding it there, to see if it was a hardware problem. My laptop also runs F16 and does not exhibit this slowness with its own rpm database (it takes 20s to rebuild and has 1907 packages installed). My laptop also displays the slowness when rebuilding the rpm database from my desktop:

[rob@megadodo rpm]$ time rpm --root=/tmp/desktop-chroot/ --rebuilddb

real 2m2.058s
user 0m12.884s
sys 0m3.166s

For comparison, my laptop rebuilding its _own_ rpm database:

[rob@megadodo rpm]$ time sudo rpm --rebuilddb

real 0m19.878s
user 0m6.495s
sys 0m1.132s

So it appears that the problem is with the database itself, rather than the machine it is running on.

Other things of potential interest:
 * `package-cleanup --dupes` reports no duplicates after churning with RPM for a long time...
 * Upon observing that I had a lot of gpg-pubkeys installed (http://pastebin.com/YXtTvNMv) on my desktop, skvidal suggested I remove them and try rebuilding (see the listing sketch just below). This didn't help. The Packages file below is from before I removed all of them.
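
(For reference, installed pubkeys can be listed and removed like this; the version-release placeholder is hypothetical and must be taken from the listing:)

# list installed gpg-pubkey packages with their full version-release
rpm -q gpg-pubkey --qf '%{name}-%{version}-%{release}\n'
# remove one key (substitute a real version-release from the list above)
sudo rpm -e gpg-pubkey-<version>-<release>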

You should find a copy of my Packages file here: https://xgoat.com/share/rpm-slowdb/Packages.xz (28MB)

Version numbers:
 rpm: rpm-4.9.1.2-1.fc16.i686
 uname: Linux zarniwoop.blob 3.1.0-7.fc16.i686.PAE #1 SMP Tue Nov 1 20:53:45 UTC 2011 i686 i686 i386 GNU/Linux

How reproducible: Very!

Rob (rob-redhat-bugs) wrote:

Created attachment 532898
xz'ed Packages file

I discovered I can attach from a URL...

Kevin (kevin-redhat-bugs) wrote:

Is this the same as https://bugzilla.redhat.com/show_bug.cgi?id=748790 ? (Not the "every 5-10 minutes" part, which matches https://bugzilla.redhat.com/show_bug.cgi?id=752564 , but the "100% CPU" part.)

Rob (rob-redhat-bugs) wrote:

Hi Kevin,

I don't think it is the same. This situation appears to be entirely IO bound. Very little CPU time is used whilst it rebuilds the db.

You can observe this for yourself by downloading the Packages file I uploaded and rebuilding it on an F16 machine. (You probably want to rebuild twice...)

Cheers,

Rob

Panu (panu-redhat-bugs) wrote:

Okay, the insane slowness is reproducible for me too (for once), excellent. Will have a look.

Just to get this straight: on F15 things were business as usual, and it only started acting up after the upgrade to F16, right? It's pretty peculiar, as the rpm version is exactly the same on both.

Are others seeing this as well, or does this just happen to be an otherwise popular/interesting bug (judging by the number of CC's we have already)?

Rob (rob-redhat-bugs) wrote:

Well, I only noticed the slowdown after the F16 upgrade. I believe it was not slow before that, so I guess whatever operations happened on the database during the upgrade led to the database becoming slow.

I still have another desktop that is running F15. I rebuilt this database there and found that it's slow there too. That machine is running rpm-4.9.1.2-1.fc15.i686.

I think it might just be an interesting bug...

Panu (panu-redhat-bugs) wrote:

FWIW, the "insane slowness" I referred to occurs when you do 'rpm --rebuilddb' from just the Packages file, triggering index autocreation, which in this situation ends up turning fsync on for the new database being built when it shouldn't. The end result is that 'rpmdb --rebuilddb' takes over 20 *minutes* on my systems, when in normal circumstances it completes in a few seconds on the same systems. That's just a dumb (but harmless) bug; will fix.

The 2-5min times for --rebuilddb are quite likely to be something else: when the index/fsync bug is triggered you'll see a "warning: Generating <N> missing indexes" message during --rebuilddb. It's a kinda known issue that rpmdb performance starts deteriorating over time (or rather, over a huge number of upgrades etc); rebuilding the db brings it back to normal, and what you're seeing might be just that: the f15 -> f16 upgrade just happened to push it past whatever limit it is where performance goes down the drain.

And I'm not able to reproduce that apparently different "a few minutes" time range. Index generation and db rebuild, when done separately, run in the expected "a few seconds" timeframe:

[pmatilai@turre ~]$ mkdir /tmp/slowdb
[pmatilai@turre ~]$ xz -cd ~/Download/Packages.xz > /tmp/slowdb/Packages
[pmatilai@turre ~]$ time rpmdb --initdb --dbpath /tmp/slowdb/
warning: Generating 12 missing index(es), please wait...

real 0m5.577s
user 0m3.888s
sys 0m0.453s
[pmatilai@turre ~]$ time rpmdb --rebuilddb --dbpath /tmp/slowdb/

real 0m7.630s
user 0m4.907s
sys 0m0.698s

Panu (panu-redhat-bugs) wrote:

Oh, one more thing, so obvious it's easy to forget: was the system actually otherwise idle when you noticed the slowness? After a distro install/upgrade, there can be any number of other io-intensive things like prelink, mlocate and man-db cronjobs running which will severely slow down other io-intensive operations such as --rebuilddb.

Robert, if you still have the full original "slow" db (all of it, not just the Packages file) backed up somewhere, can you try reproducing with that, after ensuring the system is really idle? Also the timings for separate --initdb and --rebuilddb, as shown in comment #6, might be interesting.

Yaroslav (yaroslav-redhat-bugs) wrote:

(In reply to comment #6)
My system has this bug.

It's a rawhide system, rpm version 4.9.1.2, yum version 3.4.3.

First I thought I'd just blame yum, which indeed deserves some blame of its own, but after profiling I've found that the slowness is somewhere in RPM itself.

Judging from strace, it seems that the Packages database is being read over and over and over again, via pread(), in chunks of 4 KiB. Given that Packages is 95 MiB in size, that takes a long time.

Increasing mp_mmapsize to 256 megabytes from 128, and the cache to 128 from 64, hasn't helped a bit.

strace tells me the files are in fact mmap()'ed, but my puny C knowledge says that with mmap() you usually just use the file as an array of bytes in memory. However, the many pread() calls issued after this, and the many pwrite() calls issued when indexes are being written, somehow don't fit my picture of how mmap()'ed files should be written or read. Correct me if I'm wrong.
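
(The pattern can be captured with an invocation along these lines; the flags are one reasonable choice, not necessarily what was used for the attachment below:)

# trace just the interesting I/O syscalls during --initdb
strace -f -e trace=pread64,pwrite64,fsync,fdatasync \
    -o /tmp/rpm-initdb.strace rpm --initdb --dbpath /root/slowpoke
# count the pread() calls afterwards
grep -c 'pread64' /tmp/rpm-initdb.strace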

As for timings, they are horrible: 30 minutes for the --initdb step.

[root@think slowpoke]# time rpm --initdb --dbpath /root/slowpoke
warning: Generating 12 missing index(es), please wait...

real 30m17.679s
user 0m31.035s
sys 0m14.904s

I'm afraid of doing --rebuilddb; it really seems that either the nofsync flag is ignored, or mmap() is.

Yaroslav (yaroslav-redhat-bugs) wrote:

Created attachment 533295
strace on rpm --initdb --dbpath /run/slowpoke

Notice how many 4K pread()s and pwrite()s there are.

Rob (rob-redhat-bugs) wrote:

Hi Panu,

Yes, rebuilding the database from the Packages file takes considerably longer. On two machines that I have, it took approximately 50 minutes. It's the second rebuild that takes several minutes.

The numbers I quoted in the initial post for rebuilding and simplePkgList() are on a database that has already been rebuilt from just the Packages file.

The system I was using was idle. When the rpm stuff was going, there was pretty much nothing else accessing the disk (I was watching it in iotop and gkrellm). There was pretty much nothing else using the CPU either (I was watching in top).

I'll generate some --initdb stats if you like, but since I've replicated the same behaviour on different machines with quite different hardware, I am somewhat sceptical that this'll be useful.

Cheers,

Rob

Shawn (shawn-redhat-bugs) wrote:

On my rawhide box:

[root@segfault ~]# time rpm --rebuilddb -v
real 0m48.277s

On a Fedora 15 -> 16 preupgrade:

[root@coredump ~]# time rpm --rebuilddb -v
real 2m35.502s

Panu (panu-redhat-bugs) wrote:

How much memory do the systems exhibiting this have, and what fs is /var on?

As I said, on my own systems I can't reproduce anything but the index-creation/fsync issue, but on a virtual machine with less than 2GB memory the performance suddenly goes down the drain real hard, into the "several minutes" range. What seems to cure it is *decreasing* the cache size:

In /usr/lib/rpm/macros change "mp_size=64Mb" to "mp_size=1Mb" (which incidentally is the value it had for years and years), remove the db environment: 'rm -f /var/lib/rpm/__db*', and performance should be back to normal - at least that's what I see on the virtual machines.
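
(Scripted, the workaround amounts to the two commands below; note that this edits a packaged file, so a later rpm update will overwrite the change:)

# shrink the BDB cache back to the old default
sudo sed -i 's/mp_size=64Mb/mp_size=1Mb/' /usr/lib/rpm/macros
# discard the existing database environment so it gets recreated
sudo rm -f /var/lib/rpm/__db*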

Now to figure out what the heck is actually going on in there...

Rob (rob-redhat-bugs) wrote:

Hi Panu,

My rootfs is ext3, which is what /var is on.

Changing mp_size to 1Mb does indeed bring performance back. (And changing it back to 64Mb takes it away again.)

Cheers,

Rob

Panu (panu-redhat-bugs) wrote:

So it is the same thing that I'm seeing then, with no extra factors like a completely different filesystem (ext3 vs ext4 is close enough), good.

The (dramatic) slowdown is a somewhat peculiar combination of rpmdb size, system RAM and cache size: with "average" rpmdb sizes, things are ok with the current 64Mb cache (this value was the same throughout F15) even with lower RAM, so for example your laptop's own db didn't exhibit the slowness, but the desktop system's larger db crawled on the laptop.

Rebuilding the db is a special case though, and doesn't really benefit from caching, whereas a larger (than 1Mb) cache IS beneficial for other functionality. So it's a question of finding a sweet spot that doesn't kill lower-end systems while keeping a reasonable cache for "normal" operations...

How much memory do your systems have, circa 1GB or so?

Rob (rob-redhat-bugs) wrote:

Ah, apologies. Forgot to write how much RAM these machines have:

 * desktop: 6GB (32-bit install)
 * laptop: 2GB

Cheers.

Samuel (samuel-redhat-bugs) wrote:

I'm seeing this as well. I was wondering why yum was so slow, but then I figured out it was actually the rpm Packages database. It only started after upgrading to F16. The strange thing I noticed is that a warm start doesn't help at all as if there is no caching being done. Any rpm operation is slow, even if repeated immediately after. My laptop has 5GB of RAM.

That workaround definitely fixes it for me, it's actually speedy now.

Panu (panu-redhat-bugs) wrote:

So it's not limited to systems with "low" memory. I'm curious why I'm not seeing this (except with lower available memory): when upping the cache size I did much of the testing on a laptop with 2GB memory, and was seeing definite improvements with the larger cache.

At this point it's quite obviously best to just back the cache size down to the previous value that is working so much better for others, at least until the issue is better understood.

Samuel (samuel-redhat-bugs) wrote:

My laptop does have a lot of things loaded most of the time, but still, a 64MB cache seems pretty minimal. I also have two desktops with 2GB RAM that I upgraded and that have this same problem. Did you upgrade yours, or was it a clean install? I will see if I can do some testing with that. The issue may be related to the number of packages installed as well, as all the computers involved have a lot of packages. The desktops have around 2300 and the laptop nearly 2800.

In the process of getting that information remotely while no one was logged in, I didn't see the problem. The first run of "rpm -qa" took quite a while with heavy disk access, but the second run was fast, basically entirely cached. I wonder if there is something about the disk reads with the larger cache setting that triggers the kernel not to cache under significant memory load, something that doesn't happen with the smaller cache setting. I will see if I can find the time to do some more testing.
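
(A deliberate cold-versus-warm comparison can be made by dropping the page cache first; this needs root and affects the whole system:)

sync
echo 3 | sudo tee /proc/sys/vm/drop_caches   # start from a cold cache
time rpm -qa > /dev/null                     # cold run: heavy disk access
time rpm -qa > /dev/null                     # warm run: should be mostly cached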

Hin-Tak (hin-tak-redhat-bugs) wrote:

*** Bug 754761 has been marked as a duplicate of this bug. ***

Hin-Tak (hin-tak-redhat-bugs) wrote:

As in 754761, I started noticing rpm being very slow in f16 (it used to be just yum in f15, and I assumed it was the python overhead) whenever I have a lot of idle browser windows open, i.e. the system is mostly idle but with a lot of idle processes which could be swapped out. I would just expect an interactive process, especially a root-owned one like rpm/yum, to push more of those idle processes into swap so it can proceed.

FWIW, my system is 2GB RAM + 2GB swap. During most of F15 I had probably 3000+ packages installed, but I desperately needed texlive 201x, so I got it from one of the fedora packages-being-reviewed places a week before I upgraded to F16. That is packaged into thousands of packages (up from tens in f15/f16's texlive 2007), so my package list is just under 7000 at the moment, I think. Since the switch from texlive 2007 to texlive 2011 happened just a week before f16, that probably made it worse.

If this bug is affected by a large package list, putting on texlive 2011 from http://jnovy.fedorapeople.org/texlive/ would be one way to trigger it.

(In reply to comment #12)
> In /usr/lib/rpm/macros change "mp_size=64Mb" to "mp_size=1Mb" (which
> incidentally is the value it had for years and years)...

Just for interest, when did it get changed?

Jeff Johnson (n3npq)
tags: added: fedora rpmdb
Panu (panu-redhat-bugs) wrote:

Seeing this: http://lwn.net/Articles/478440/ (original report at http://article.gmane.org/gmane.linux.kernel/1244680 for those who can't yet access the LWN article) makes me wonder if that's what is hitting rpm.

Just a theory for now, but at any rate that's the kind of thing that would explain why the very same DB configuration worked fine in (early) F15 but things got dirt slow in F16: F15 started with kernel 2.6.38 (but later updates brought in 3.0.x masquerading as 2.6.40, iirc) whereas F16 had 3.x from day one. If somebody wants to try recompiling their kernel to see if it makes any difference, this is what's supposed to fix it: http://article.gmane.org/gmane.linux.kernel.mm/73293

Stephen (stephen-redhat-bugs) wrote:

I don't think it's a disk performance issue. It's the sheer volume of disk activity that occurs.

Without the mp_size change, I'm seeing vast amounts of disk IO from rpm even from simple queries like "yum list available ...". It sustains writes (yes, writes!) of between 3MB and 15MB per second for 90 seconds... that's an insane amount of write IO for a read-only query of the bdb, and I think it's a red herring to be looking for kernel reasons why that amount of IO is slow. rpm/bdb really shouldn't be generating that much write traffic in the first place.

btw, "iostat" is a good way to watch the traffic resulting here. I usually use "iostat 2" to get it updated every couple of seconds.

With mp_size=1MB as in comment #12, the problem disappears completely. With the default /usr/lib/rpm/macros, rpm is almost unusable on every system I've got which has been updated from older versions of Fedora.

Stephen (stephen-redhat-bugs) wrote:

btw, is there any reason not to revert mp_size while we work out what the underlying problem is? This is a pretty major issue for those of us it affects.

Stephen (stephen-redhat-bugs) wrote:

One more data point: in slow mode, the write IO all seems to go to the bdb internal journal files. (This was observed while executing a simple yum query, "yum list available foobar".) The most recently modified files in /var/lib/rpm look like:

[rpm]# ls -lath
total 280M
-rw-r--r--. 1 root root 24K Feb 7 11:20 __db.001
-rw-r--r--. 1 root root 3.7M Feb 7 11:20 __db.002
-rw-r--r--. 1 root root 81M Feb 7 11:20 __db.003
-rw-r--r--. 1 root root 536K Feb 7 11:20 __db.004
-rw-r--r--. 1 rpm rpm 6.5M Feb 7 10:47 Dirnames
...

Reverting mp_size to 1MB by itself doesn't fix this; the results look the same. But after clearing the __db* files and rerunning, I still get the write traffic, though much less, and the db files look like:

[rpm]# ls -lath
total 198M
-rw-r--r--. 1 root root 24K Feb 7 11:27 __db.001
-rw-r--r--. 1 root root 240K Feb 7 11:27 __db.002
-rw-r--r--. 1 root root 1.3M Feb 7 11:27 __db.003
-rw-r--r--. 1 root root 536K Feb 7 11:27 __db.004
-rw-r--r--. 1 rpm rpm 6.5M Feb 7 10:47 Dirnames
...

Curiously, the __db.003 file (which I think is the bdb mmap cache file) has shrunk by almost exactly the same factor of 64 by which we shrank mp_size. But the actual amount of disk write according to iostat has shrunk to just a few kB.

So the real question is why we're generating so much traffic to the mmap file; and if we can't avoid that, then we should probably just minimise the issue by reverting the size of that file back to its older value.

Met (met-redhat-bugs) wrote:

I can confirm this bug on both of my F16 installations. One was upgraded with preupgrade, the other with 'yum --releasever=16 upgrade'.

In both cases I get huge amounts of IO; it takes yum up to 10min to list the packages to be upgraded, and then another 10min for yum to finish after 'cleaning' the last package. On the system updated with yum (not preupgrade), the mp_size=1MB trick did the job (on the other I haven't tried yet).

**I think it is a high-priority bug**

Panu (panu-redhat-bugs) wrote:

The "mp" (for "memory pool") thing is BDB jargon for its internal (shared) cache, through which all db IO goes. That's why you get a lot of write-IO on read-only queries too, especially after reboot which erases the whole BDB environment including the cache.

On my four (F16) systems, mp_size=64Mb vs 1Mb makes no difference whatsoever to the --initdb times (a dumb benchmark for caching-related issues, as caching is of no benefit for this operation, but it serves as a simple reproducer for this issue). However, on a VM with just 512MB this makes a huge difference; quite obviously something horribly stupid happens with the cache when memory is on the low side.
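
(The cache state can also be inspected directly with the Berkeley DB tools; a sketch, assuming db_stat is installed, e.g. from db4-utils on Fedora of this era, its version matches the BDB rpm was built against, and the __db* environment files are present:)

# memory-pool (cache) statistics for the rpmdb environment
db_stat -m -h /var/lib/rpm | head -n 30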

And it also has to do with the kernel. With mp_size=1Mb there's no measurable difference, but with 64Mb the difference between the original F15 kernel and the current F16 one is quite drastic. Using the Packages file from comment #1 as a test case, here are the timings on the virtual machine with 512M memory; the magnitude of the difference is easily reproducible here:

[root@localhost slowdb]# uname -r
2.6.38.6-26.rc1.fc15.x86_64
[root@localhost slowdb]# time rpmdb --initdb --dbpath /tmp/slowdb/
warning: Generating 12 missing index(es), please wait...

real 2m39.459s
user 0m6.173s
sys 0m2.478s

[root@localhost slowdb]# uname -r
3.2.3-2.fc16.x86_64
[root@localhost slowdb]# time rpmdb --initdb --dbpath /tmp/slowdb/
warning: Generating 12 missing index(es), please wait...

real 10m34.010s
user 0m5.728s
sys 0m2.117s

FWIW, mp_size=1Mb does the deed in roughly 8 seconds on the same VM on both kernels. This is the kind of time I get on my laptop running this VM, regardless of the mp_size setting.

So, obviously the larger cache is severely hurting rpm performance in some situations, much more than it's winning in the best scenarios. I'll switch it back to the former default for now; finding an optimum is going to take some added testing and tuning on different types of systems. But there's quite a regression in kernel performance as well; whether it's the plugging issue I referred to in comment #21 or something else, I've no clue.

Panu (panu-redhat-bugs) wrote:

Oh, and the timings were produced on the same F16 installation; the only difference is booting with the F15 vs the F16 kernel.

Panu (panu-redhat-bugs) wrote:

Dunno how much the Fedora kernel version can be trusted these days, but testing with a sampling of kernels grabbed from koji, this doesn't seem to start at 2.6.39 but at 2.6.40, gradually getting worse. This is the same test case as the numbers above; 3.2.3 was retested out of curiosity and produced a slightly less horrible figure this time, but is still by far the slowest of them all:

[root@localhost slowdb]# uname -r
2.6.38.6-26.rc1.fc15.x86_64
[root@localhost slowdb]# time rpmdb --initdb --dbpath /tmp/slowdb/
warning: Generating 12 missing index(es), please wait...

real 2m39.459s
user 0m6.173s
sys 0m2.478s

[root@localhost slowdb]# uname -r
2.6.39-1.fc16.x86_64
[root@localhost slowdb]# time rpmdb --initdb --dbpath /tmp/slowdb/
warning: Generating 12 missing index(es), please wait...

real 2m24.814s
user 0m5.939s
sys 0m2.245s

[root@localhost slowdb]# uname -r
2.6.40-4.fc15.x86_64
[root@localhost slowdb]# time rpmdb --initdb --dbpath /tmp/slowdb/
warning: Generating 12 missing index(es), please wait...

real 6m26.406s
user 0m5.772s
sys 0m2.152s

[root@localhost slowdb]# uname -r
2.6.41.10-3.fc15.x86_64
[root@localhost slowdb]# time rpmdb --initdb --dbpath /tmp/slowdb/
warning: Generating 12 missing index(es), please wait...

real 5m42.944s
user 0m5.796s
sys 0m2.164s

[root@localhost slowdb]# uname -r
3.2.3-2.fc16.x86_64
[root@localhost slowdb]# time rpmdb --initdb --dbpath /tmp/slowdb/
warning: Generating 12 missing index(es), please wait...

real 7m53.031s
user 0m5.750s
sys 0m2.217s

Fedora (fedora-redhat-bugs) wrote:

rpm-4.9.1.2-5.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/rpm-4.9.1.2-5.fc16

Fedora (fedora-redhat-bugs) wrote:

rpm-4.9.1.2-4.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/rpm-4.9.1.2-4.fc15

Fedora (fedora-redhat-bugs) wrote:

Package rpm-4.9.1.2-5.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing rpm-4.9.1.2-5.fc16'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-1504/rpm-4.9.1.2-5.fc16
then log in and leave karma (feedback).

Christoph (christoph-redhat-bugs) wrote:

I have been suffering from this for ages and tried a lot, but this update finally fixes my problems. Transactions are now going fast again. Thanks!

Fedora (fedora-redhat-bugs) wrote:

rpm-4.9.1.2-5.fc16 has been pushed to the Fedora 16 stable repository. If problems still persist, please make note of it in this bug report.

Fedora (fedora-redhat-bugs) wrote:

rpm-4.9.1.2-4.fc15 has been pushed to the Fedora 15 stable repository. If problems still persist, please make note of it in this bug report.

Mike (mike-redhat-bugs) wrote:

*** Bug 811544 has been marked as a duplicate of this bug. ***

Changed in fedora:
importance: Unknown → Medium
status: Unknown → Fix Released