PBXT "hangs" sometimes but regularly

Bug #684203 reported by Vladimir Kolesnikov
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
PBXT
In Progress
Undecided
Unassigned

Bug Description

This bug was created from the below discussion on pbxt-discuss:

There are a number of things which are strange about this test.

Firstly the flush takes 25 minutes. From the xtstat dump we can see that the main flushing on the index file stops at 101202 13:00:13.
This is correct according to the "select now()" below (12:35:46 + 24 minutes).

This may not be the problem, but rather this:

1. Checkpointing does not seem to be happening, because to-chkpt is at 14.9G, and seems to remain there.

The lack of checkpoint may be explained be second bit of strange
activity:

2. Lots of little flusing on the ilog, but no update to the index data:

    ind ind ind ind ind ind ind ind ind ilog ilog
ilog ilog ilog
     in out rd.ms wr.ms syncs/ms hits miss %use %dty in out
rd.ms wr.ms syncs/ms
      0 0 0 0 0/0 0 0 100 0.0 0
54 0 0 6/3
      0 0 0 0 0/0 0 0 100 0.0 0
36 0 0 4/2
      0 0 0 0 0/0 0 0 100 0.0 0
54 0 0 6/3

This is very strange and need to be investigated. Basically, the ilog should be written, then the ilog is read, and the index file is written. That normally concludes the flushing of the index.

It looks like the checkpointer thread is in a loop here, and does not get to the end of flushing the index, and so it does not get to the end of the checkpoint.

@Vlad or Erkan, could you please report this as a bug, with the information we have so far. Next we will have to see if we can repeat this strange index "non-flush" behavior. Maybe using a similar test, but a little less data.

On Dec 2, 2010, at 12:31 PM, erkan yanar wrote:

> Hi Paul,
>
> I repeatet it for you;)
>
> mysql> select now(); truncate rein;
> +---------------------+
> | now() |
> +---------------------+
> | 2010-12-02 12:35:46 |
> +---------------------+
> 1 row in set (0.00 sec)
>
> Query OK, 0 rows affected (24 min 49.47 sec)
>
> So it was successfull.
>
> The xtstat-ouptut is attached. So maybe yo see something that might
> helpfull.
>
> Regards
> Erkan
>
> P.S.
> Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
> avgrq-sz avgqu-sz await svctm %util
> dm-0 0.00 0.00 0.00 666.00 0.00 9040.00
> 13.57 186.48 327.86 1.50 100.00
>
>
> On Thu, Dec 02, 2010 at 09:13:48AM +0100, Paul McCullagh wrote:
>> Hi Erkan,
>>
>> Yes. Strange pause. But I cannot see what the engine is doing in that
>> time.
>>
>> There must be activity in some missing statistics, for example,
>> record, row or index file writing.
>>
>> Unfortunately, when displaying all statistics, the output is very
>> wide. What xtstat needs is a feature which logs all statistics but
>> just displays what you select.
>>
>> Then, if you see a problem that does not show up in the displayed
>> statistics you can still go back and check the others at the time.
>>
>> On Nov 30, 2010, at 3:05 AM, erkan yanar wrote:
>>
>>> Moin Paul
>>>
>>> On Mon, Nov 29, 2010 at 10:23:38AM +0100, Paul McCullagh wrote:
>>>>
>>>>
>>>> The problem here is the index cache is full, and the thread which
>>>> is flushing the index is hanging in sync (i.e. flush to disk). The
>>>> rest of the system is hanging because it needs a clean page in the
>>>> index.
>>>>
>>>> I did quite a bit of work to improve this in 1.1 and 1.5.
>>>>
>>>> But in general, if the index cache is really large, it need to be
>>>> flushed more regularly to ensure that too many changes do not
>>>> accumulate. Checkpointing should do this, but there is a bug in
>>>> 1.0.
>>>>
>>>> You checkpoint frequency is 128M, xtstat says to-chkpt = 1049M.
>>>> This
>>>> means the checkpoint has not occurred as often as he should.
>>>>
>>>> This problem should be fixed in 1.1.
>>>>
>>>
>>> While 1.5.02 Beta handles IO a lot better than 1.0.11-7, you may be
>>> interested in that checkpointing issue:
>>> pbxt_checkpoint_frequency = 67108864
>>>
>>> time time xact xact xact xact stat stat ilog
>>> ilog ilog ilog ilog xlog xlog xlog xlog xlog xlog xlog
>>> xlog to to to sweep row row row row
>>> curr msec commt rollb waits dirty read write in out
>>> rd.ms wr.ms syncs/ms in out rd.ms wr.ms syncs sy.ms hits miss
>>> chkpt write sweep waits sel ins upd del
>>> 101130 03:45:13 5001 198t 0 0 10.9t 0 198t 320M
>>> 0 83 0 0/0 0 21.1M 0 12 9 3576 69.6t
>>> 0 813M 9833K 0 13 0 198t 0 0
>>> 101130 03:45:18 5001 9501 0 0 5148 0 9501 0
>>> 0 0 0 0/0 0 0 0 0 0 5001 0
>>> 0 813M 9833K 0 1 0 9501 0 0
>>> 101130 03:45:23 5001 0 0 0 5148 0 0 0
>>> 0 0 0 0/0 0 0 0 0 0 5001 0
>>> 0 813M 9833K 0 0 0 0 0 0
>>> 101130 03:45:28 5001 80.1t 0 0 13.8t 0 80.1t 0
>>> 0 0 0 0/0 0 8192K 0 4 1 4993 1
>>> 0 814M 17.6M 115M 5 0 80.1t 0 0
>>> 101130 03:45:33 5001 179t 0 0 10.3t 0 179t 0
>>> 0 0 0 0/0 1244K 19.7M 0 8 20 4395 119t
>>> 39 837M 6772K 0 13 0 179t 0 0
>>> 101130 03:45:38 5001 12.7t 0 0 9084 0 12.7t 0
>>> 0 0 0 0/0 0 0 0 0 0 5001 5507
>>> 0 837M 3756K 0 1 0 12.7t 0 0
>>> 101130 03:45:43 5001 78.4t 0 0 8907 0 78.4t 0
>>> 0 0 0 0/0 0 8192K 0 3 1 4994 18.1t
>>> 0 840M 8192K 0 5 0 78.4t 0 0
>>> 101130 03:45:48 5001 156t 0 0 4410 0 156t 0
>>> 0 0 0 0/0 0 16.0M 0 8 2 4985 67.5t
>>> 0 856M 8192K 0 11 0 156t 0 0
>>> 101130 03:45:53 5001 78.9t 0 0 7213 0 78.9t 0
>>> 0 0 0 0/0 0 8192K 0 6 1 4991 1
>>> 0 864M 16.0M 0 7 0 78.9t 0 0
>>>
>>>
>>> Best Regards
>>> Erkan
>>>
>>> --
>>> über den grenzen muß die freiheit wohl wolkenlos sein
>>
>
> --
> über den grenzen muß die freiheit wohl wolkenlos sein
> <xtstat-truncate.log.gz>

Changed in pbxt:
status: New → In Progress
Revision history for this message
Vladimir Kolesnikov (vkolesnikov) wrote :

Hi Erkan,

what test are you running? If this is something non-standard then would it be possible to set it up on my machine?

Thanks,
Vladimir

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.