Poor insert performance from xt_flush_indices on IO bound tests

Bug #367098 reported by Mark Callaghan on 2009-04-26
Affects: PBXT
Status: Fix Committed
Importance: High
Assigned to: Paul McCullagh

Bug Description

I am running an IO bound performance test -- iibench from https://launchpad.net/mysql-patch/mytools and pbxt gets slow once the data gets big. The slowdown is interesting as the row insert rate alternates between high (2000 inserts/second) and low (30 inserts/second). During the slow phase, the common thread stack is:

pwrite64,xt_pwrite_file,XTIndexLog::il_apply_log,xt_flush_indices,xt_idx_insert,xt_tab_new_record,
ha_pbxt::write_row,handler::ha_write_row,write_record,mysql_insert,mysql_execute_command,
mysql_parse,dispatch_command

il_apply_log has a loop where it reads a log entry and then performs a write. The IO is done sequentially.

Can il_apply_log be changed to use async IO (simulated or real) and then block at function end for the writes to complete? Async IO would also allow for some amount of write reordering. writev() is another option instead of async IO.
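To illustrate the writev() option: a minimal sketch of batching several index-page writes into one syscall with pwritev() instead of one pwrite() per log entry. The idx_entry_t type, flush_batch name, and batch layout are hypothetical, not PBXT's actual code; this only applies when the batched entries land at contiguous file offsets.

```c
/* Hedged sketch: batch adjacent index-page writes with pwritev()
 * instead of one pwrite() per log entry. idx_entry_t and flush_batch
 * are illustrative names, not PBXT's actual types or functions. */
#define _GNU_SOURCE
#include <sys/uio.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <assert.h>

#define MAX_BATCH 16

typedef struct {
    off_t  offset;   /* file offset of the index page */
    void  *data;     /* page image read from the index log */
    size_t size;
} idx_entry_t;

/* Write a run of entries whose file offsets are contiguous in one
 * syscall. pwritev writes the iovecs back to back, starting at the
 * offset of the first entry. */
static ssize_t flush_batch(int fd, idx_entry_t *ents, int n)
{
    struct iovec iov[MAX_BATCH];

    if (n <= 0 || n > MAX_BATCH)
        return -1;
    for (int i = 0; i < n; i++) {
        iov[i].iov_base = ents[i].data;
        iov[i].iov_len  = ents[i].size;
    }
    return pwritev(fd, iov, n, ents[0].offset);
}
```

This only collapses syscall overhead for contiguous runs; scattered offsets would still need one pwrite (or async request) each.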

This uses the latest PBXT code from Launchpad.
The test server has 2 disks in SW RAID 0.

my.cnf parameters are:
pbxt_index_cache_size=800M
pbxt_record_cache_size=200M
pbxt_checkpoint_frequency=1GB
pbxt_data_file_grow_size=16MB
pbxt_row_file_grow_size=2MB
pbxt_data_log_threshold=256MB
pbxt_log_cache_size=128MB
pbxt_log_file_threshold=256MB

iostat output with 10 second intervals during the slowdown
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
md0 0.00 0.00 130.06 141.83 3586.21 8086.59 42.93 0.00 0.00 0.00 0.00
md0 0.00 0.00 125.00 165.30 4252.00 3684.90 27.34 0.00 0.00 0.00 0.00
md0 0.00 0.00 138.20 144.60 3545.60 3274.80 24.12 0.00 0.00 0.00 0.00
md0 0.00 0.00 166.20 2.60 5393.60 36.50 32.17 0.00 0.00 0.00 0.00
md0 0.00 0.00 106.60 366.30 4104.00 8722.10 27.12 0.00 0.00 0.00 0.00
md0 0.00 0.00 159.50 79.50 5340.00 2180.70 31.47 0.00 0.00 0.00 0.00
md0 0.00 0.00 118.70 197.70 4200.80 4439.80 27.31 0.00 0.00 0.00 0.00
md0 0.00 0.00 122.30 265.30 4229.60 6319.30 27.22 0.00 0.00 0.00 0.00
md0 0.00 0.00 156.50 62.40 5316.80 1444.60 30.89 0.00 0.00 0.00 0.00
md0 0.00 0.00 134.20 176.60 5137.60 4368.30 30.59 0.00 0.00 0.00 0.00
md0 0.00 0.00 113.60 240.60 4160.00 5754.00 27.99 0.00 0.00 0.00 0.00
md0 0.00 0.00 129.90 218.70 4290.40 5226.50 27.30 0.00 0.00 0.00 0.00
md0 0.00 0.00 157.60 58.10 6137.60 1399.10 34.94 0.00 0.00 0.00 0.00
md0 0.00 0.00 174.60 9.30 5460.80 188.10 30.72 0.00 0.00 0.00 0.00
md0 0.00 0.00 75.40 492.00 1732.80 11715.50 23.70 0.00 0.00 0.00 0.00
md0 0.00 0.00 133.97 166.23 4317.28 3986.71 27.66 0.00 0.00 0.00 0.00
md0 0.00 0.00 132.50 162.10 4308.00 3905.90 27.88 0.00 0.00 0.00 0.00
md0 0.00 0.00 163.20 29.40 5365.60 841.60 32.23 0.00 0.00 0.00 0.00


Paul McCullagh (paul-mccullagh) wrote :

Hi Mark,

Thanks for the great bug report.

Doing that I/O in parallel looks like a good idea. The disk is totally underutilized.

According to the iostat output above, the disk utilization (%util) is zero, and so are the wait times (await and svctm). By issuing a lot of writes in parallel we will probably be able to use the disk capacity better.

I will give it a try...

Changed in pbxt:
assignee: nobody → paul-mccullagh
importance: Undecided → High
status: New → In Progress
Mark Callaghan (mdcallag) wrote :

I will post iostat output again for the drives used in the SW RAID setup. md0 is a SW RAID device and iostat doesn't report accurate values for await and svctm for SW RAID devices. But the read and write rates are accurate.

Paul McCullagh (paul-mccullagh) wrote :

OK, thanks!

This will help because the question is, what is the actual disk utilization? If it is already 100% (which I don't think it is), then I don't think multiple I/O threads will help us.

Mark Callaghan (mdcallag) wrote :

I assume the reads in il_apply_log are fast and the wait occurs on the random writes. The current code runs at 1/N the max speed on an N disk setup. In my case there are 2 disks but I have other servers with many more and async IO would allow all of the disks to do IO at the same time. The other benefit of async IO is that lower levels of the system can reorder requests to reduce disk head movement.
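The "issue all writes, then block at function end" idea from the report could be sketched with POSIX AIO. This is only an illustration of the pattern, not PBXT code: io_req_t and write_all_async are hypothetical names, and real code would cap or chunk the request count rather than use a fixed array. On older glibc this needs linking with -lrt.

```c
/* Hedged sketch: queue one aio_write per log entry, then block at the
 * end of the function until every write has completed. The kernel (or
 * glibc's thread pool) is free to reorder the writes to reduce seeks.
 * io_req_t and write_all_async are illustrative, not PBXT's code. */
#define _GNU_SOURCE
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <assert.h>

#define MAX_REQS 64

typedef struct {
    off_t  off;   /* target file offset */
    void  *buf;   /* page image to write */
    size_t len;
} io_req_t;

static int write_all_async(int fd, io_req_t *reqs, int n)
{
    struct aiocb cbs[MAX_REQS];
    const struct aiocb *list[MAX_REQS];

    if (n <= 0 || n > MAX_REQS)
        return -1;
    memset(cbs, 0, sizeof(cbs));
    /* Submit everything up front; nothing blocks yet. */
    for (int i = 0; i < n; i++) {
        cbs[i].aio_fildes = fd;
        cbs[i].aio_offset = reqs[i].off;
        cbs[i].aio_buf    = reqs[i].buf;
        cbs[i].aio_nbytes = reqs[i].len;
        if (aio_write(&cbs[i]) != 0)
            return -1;
        list[i] = &cbs[i];
    }
    /* Block here, at function end, for all writes to complete. */
    for (int i = 0; i < n; i++) {
        while (aio_error(&cbs[i]) == EINPROGRESS)
            aio_suspend(list, n, NULL);
        if (aio_return(&cbs[i]) != (ssize_t)reqs[i].len)
            return -1;
    }
    return 0;
}
```

With N requests in flight, an N-disk array can service writes on every spindle at once instead of one at a time, which is exactly the 1/N effect described above.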

But there are a couple of choices here. Right now the thread inserting the row detects that a flush is needed and then calls xt_flush_indices itself. Can either of these tasks be moved to a background thread?

My earlier tests showed that in some disk-bound workloads a smaller pbxt_checkpoint_frequency results in fewer fsyncs. For example, pbxt_checkpoint_frequency=100MB gave twice as many fsyncs as pbxt_checkpoint_frequency=1MB. My test case was not exactly like this one and the data set was smaller, but I think at the engine level there could be similarities. Of course this is not a solution to the entire problem, but it can improve performance...

Paul McCullagh (paul-mccullagh) wrote :

The main problem here appears to be that during flush the indexes of the table are all locked. This was required because we need to flush a snapshot of the index to disk to ensure consistency.

This problem has been fixed in PBXT 1.1 (branch attached) and later.

The new version takes a snapshot of the index by doing copy on write of the index pages.
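The copy-on-write idea can be sketched as follows. This is an illustration of the technique only, not PBXT 1.1's implementation: page_t, the flag, and the function names are all hypothetical. A writer touching a page that belongs to the flush snapshot first copies the old image aside, so the flusher writes a consistent snapshot while inserts proceed on the live page.

```c
/* Hedged sketch of copy-on-write index-page snapshotting. page_t,
 * page_before_write, and page_flush_image are illustrative names,
 * not PBXT's actual structures or functions. Locking is omitted. */
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

typedef struct {
    char  data[PAGE_SIZE];  /* live page, still updated by inserts */
    int   in_snapshot;      /* set when a flush snapshot is taken */
    char *cow_copy;         /* preserved old image for the flusher */
} page_t;

/* Called by a writer before it modifies a page: if the page is part
 * of the current flush snapshot, preserve its old image first. */
static void page_before_write(page_t *p)
{
    if (p->in_snapshot && p->cow_copy == NULL) {
        p->cow_copy = malloc(PAGE_SIZE);
        if (p->cow_copy)
            memcpy(p->cow_copy, p->data, PAGE_SIZE);
    }
}

/* The flusher writes the preserved copy if one exists; otherwise the
 * page was not modified since the snapshot and the live image is safe. */
static const char *page_flush_image(page_t *p)
{
    return p->cow_copy ? p->cow_copy : p->data;
}
```

The payoff is that the index no longer has to stay locked for the duration of the flush: writers only pay a one-page copy the first time they touch a snapshotted page.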

Paul McCullagh (paul-mccullagh) wrote :

Index flush speed has been further increased by flushing many tables in parallel.
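Flushing many tables in parallel might look roughly like the following thread-per-table sketch. The flush_job_t type and flush_worker are stand-ins for the per-table flush work, not PBXT's actual implementation.

```c
/* Hedged sketch: flush several tables in parallel, one worker thread
 * per table, joined before the checkpoint completes. flush_job_t and
 * flush_worker are illustrative stand-ins for the per-table work. */
#include <pthread.h>
#include <assert.h>

#define MAX_TABLES 8

typedef struct {
    int table_id;
    int flushed;    /* set by the worker when its table is done */
} flush_job_t;

static void *flush_worker(void *arg)
{
    flush_job_t *job = arg;
    /* ... the per-table index flush would happen here ... */
    job->flushed = 1;
    return NULL;
}

static int flush_tables_parallel(flush_job_t *jobs, int n)
{
    pthread_t tids[MAX_TABLES];

    if (n <= 0 || n > MAX_TABLES)
        return -1;
    for (int i = 0; i < n; i++)
        if (pthread_create(&tids[i], NULL, flush_worker, &jobs[i]) != 0)
            return -1;
    /* The checkpoint waits for every table's flush to finish. */
    for (int i = 0; i < n; i++)
        pthread_join(tids[i], NULL);
    for (int i = 0; i < n; i++)
        if (!jobs[i].flushed)
            return -1;
    return 0;
}
```

Combined with the copy-on-write snapshot above, the disks see concurrent writes from several tables at once instead of one sequential stream.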

Changed in pbxt:
status: In Progress → Fix Committed