Percona XtraDB Cluster - HA scalable solution for MySQL

PXC innodb_buffer_pool_restore_at_startup and wsrep_recovery

Reported by Ilias Bertsimas on 2013-01-28
This bug affects 1 person
Affects: Percona XtraDB Cluster (status tracked in Trunk)
  5.6:   Importance: Undecided; Assigned to: Unassigned
  Trunk: Importance: Undecided; Assigned to: Raghavendra D Prabhu

Bug Description

We have noticed an issue on PXC with a 30 GB buffer pool and innodb_buffer_pool_restore_at_startup enabled so that the buffer pool is warmed up quickly during startup.
When the node joins the cluster (state JOINER) before the buffer pool restore has finished reading all the pages it had loaded last time, cluster flow control pauses and the whole cluster stalls, because the newly joined node cannot keep up: its I/O is being consumed by the buffer pool restore operation. With our workload it may stay stalled for 4-5 minutes.

This may not be a bug, but it would be best if the buffer pool restore operation could complete first and then wsrep continue joining the cluster, or the other way around. Currently Percona Server's innodb_buffer_pool_restore_at_startup and wsrep are oblivious to each other, which causes the issue described above.
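For reference, the two features that interact here can be sketched as a my.cnf fragment (values are illustrative, mirroring the 30 GB pool described above; see the Percona Server manual for the exact value semantics of the restore option):

```ini
# Illustrative my.cnf fragment, not a recommendation
[mysqld]
innodb_buffer_pool_size               = 30G
# Warm the buffer pool from the ib_lru_dump file at startup
innodb_buffer_pool_restore_at_startup = 1
# wsrep/Galera is configured independently and is unaware of the restore
wsrep_provider                        = /usr/lib64/libgalera_smm.so
wsrep_cluster_name                    = PXC
```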

Vadim Tkachenko (vadim-tk) wrote :

You can use innodb_blocking_buffer_pool_restore=ON
to make sure that the restore operation is complete first.

Changed in percona-xtradb-cluster:
status: New → Invalid
Ilias Bertsimas (d-ilias) wrote :

Will this be respected by wsrep as well, so that the node does not join the cluster until the operation is done?

Vadim Tkachenko (vadim-tk) wrote :

I am actually not sure.
If wsrep does not respect that, it should be fixed.

But I think it should be respected for xtrabackup SST.


Here is the log:

sudo mysqld-debug --user=mysql --debug
130219 4:09:08 [Warning] The syntax '--log-slow-queries' is deprecated and will be removed in a future release. Please use '--slow-query-log'/'--slow-query-log-file' instead.
130219 4:09:08 [Note] Flashcache bypass: disabled
130219 4:09:08 [Note] Flashcache setup error is : ioctl failed

130219 4:09:08 [Note] WSREP: Read nil XID from storage engines, skipping position init
130219 4:09:08 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/libgalera_smm.so'
130219 4:09:08 [Note] WSREP: wsrep_load(): Galera 2.2(r114) by Codership Oy <email address hidden> loaded succesfully.
130219 4:09:08 [Note] WSREP: Found saved state: 11264cec-06e6-11e2-0800-61616b1fc754:2669
130219 4:09:08 [Note] WSREP: Reusing existing '/var/lib/mysql//galera.cache'.
130219 4:09:08 [Note] WSREP: Passing config to GCS: base_host = 10.0.2.154; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 0.5; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
130219 4:09:08 [Note] WSREP: Assign initial position for certification: 2669, protocol version: -1
130219 4:09:08 [Note] WSREP: wsrep_sst_grab()
130219 4:09:08 [Note] WSREP: Start replication
130219 4:09:08 [Note] WSREP: Setting initial position to 11264cec-06e6-11e2-0800-61616b1fc754:2669
130219 4:09:08 [Note] WSREP: protonet asio version 0
130219 4:09:08 [Note] WSREP: backend: asio
130219 4:09:08 [Note] WSREP: GMCast version 0
130219 4:09:08 [Note] WSREP: (01b640b8-7a1c-11e2-0800-ecbf0dd5f72b, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
130219 4:09:08 [Note] WSREP: (01b640b8-7a1c-11e2-0800-ecbf0dd5f72b, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
130219 4:09:08 [Note] WSREP: EVS version 0
130219 4:09:08 [Note] WSREP: PC version 0
130219 4:09:08 [Note] WSREP: gcomm: connecting to group 'PXC', peer 'Pxc1:'
130219 4:09:08 [Note] WSREP: declaring db89bc0c-7a18-11e2-0800-517d29aa6526 stable
130219 4:09:08 [Note] WSREP: view(view_id(PRIM,01b640b8-7a1c-11e2-0800-ecbf0dd5f72b,10) memb {
        01b640b8-7a1c-11e2-0800-ecbf0dd5f72b,
        db89bc0c-7a18-11e2-0800-517d29aa6526,
} joined {
} left {
} partitioned {
})
130219 4:09:09 [Note] WSREP: gcomm: connected
130219 4:09:09 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
130219 4:09:09 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
130219 4:09:09 [Note] WSREP: Opened channel 'PXC'
130219 4:09:09 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
130219 4:09:09 [Note] WSREP: Waiting for SST to complete.
130219 4:09:09 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 02501b41-7a1c-11e2-0800-fa53e208c983
130219 4:09:09 [Note] WSREP: STATE EXCHANGE: sent state msg: 02501b41-7a1c-11e2-0800-fa53e208c983
130219 4:09:09 [Note] WSREP: STATE EXCHANGE: got state msg: 0250...


Changed in percona-xtradb-cluster:
status: Invalid → New
Ilias Bertsimas (d-ilias) wrote :

I can confirm that innodb_blocking_buffer_pool_restore=ON solves the issue, but with a weird side effect: Galera tries to restore the wsrep position, and with that option enabled it spends a lot of time on that step:

130205 13:10:38 mysqld_safe Starting mysqld daemon with databases from /data/mysql
130205 13:10:38 mysqld_safe WSREP: Running position recovery with --log_error=/tmp/tmp.nci5tZ9xZF
130205 13:16:37 mysqld_safe WSREP: Recovered position 402367df-5fd0-11e2-0800-58b321ec9eec:518510084

Almost six minutes are spent there, whereas without innodb_blocking_buffer_pool_restore=ON it is practically instant.
It is not restoring the buffer pool pages at that point, as the log shows the restore happening later:

130205 13:16:45 InnoDB: Restoring buffer pool pages from ib_lru_dump
130205 13:22:52 InnoDB: Completed reading buffer pool pages (requested: 3178478, read: 3178076)

What do you think?

So, the delay between 13:10:38 and 13:16:37 is present only when innodb_blocking_buffer_pool_restore=ON?

Also, the delay between 13:16 and 13:22 is justified since it has to read 3178076 pages (about 48 GB) in a blocking way.

Can you provide that whole log?
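As a sanity check on the "about 48 GB" figure above: with InnoDB's default 16 KiB page size, the reported read count does work out to roughly 48 GiB. A throwaway calculation, not part of the bug itself:

```shell
#!/bin/sh
# Back-of-the-envelope check: pages read during the LRU restore, times the
# default 16 KiB InnoDB page size, gives the volume of blocking I/O.
pages=3178076
page_size=16384                      # bytes; InnoDB default page size
bytes=$((pages * page_size))
gib=$((bytes / 1024 / 1024 / 1024))
echo "${gib} GiB"                    # matches the ~48 GB quoted above
```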

========================

Here is the full log of tests which I ran:

 http://sprunge.us/MfPM

 http://sprunge.us/LcVW

As you can see it takes longer with the LRU dump, but that alone can't be used for comparison here (due to caching effects of the galera cache, etc.).

Ignore the timing in http://sprunge.us/MfPM (it was skewed by the verbose logging from --debug); here is the log for

 http://sprunge.us/hPGU (for innodb_buffer_pool_restore_at_startup=true)

The buffer pool is clearly getting restored in both places:

On Feb 18, 2013, at 6:24 PM, Ilias Bertsimas <email address hidden> wrote:

> I can confirm that innodb_blocking_buffer_pool_restore=ON solves the
> issue but with a weird side-effect. Galera tries to restore the
> wsrep_position and with that option enabled it spends a lot of time on
> that part:
>
> 130205 13:10:38 mysqld_safe Starting mysqld daemon with databases from /data/mysql
> 130205 13:10:38 mysqld_safe WSREP: Running position recovery with --log_error=/tmp/tmp.nci5tZ9xZF
> 130205 13:16:37 mysqld_safe WSREP: Recovered position 402367df-5fd0-11e2-0800-58b321ec9eec:518510084

The --wsrep_recover pass starts InnoDB using our standard my.cnf, so it allocates and loads the full buffer pool (apparently without logging it).

>
> Almost 6 minutes spent above when without the innodb_blocking_buffer_pool_restore=ON it is pretty much instant.
> It is not that it restores the buffer pool pages at that point as the log indicates later on:
>
> 130205 13:16:45 InnoDB: Restoring buffer pool pages from ib_lru_dump
> 130205 13:22:52 InnoDB: Completed reading buffer pool pages (requested: 3178478, read: 3178076)
>
> What do you think ?

Again, InnoDB blocks until the buffer pool is reloaded, incurring the six-minute penalty a second time (which happens to be the time it takes your LRU dump to reload). This is what I would expect with innodb_blocking_buffer_pool_restore=ON.

So, here are the issues as I see them:
- The mandatory wsrep_recover step has some interesting side effects. Why do we even force-run it? If we must, can we ignore certain variables in the config? (I've noticed that huge buffer pools slow this step down too.)
- LRU restore obviously has some side effect on cluster operation, but it's not clear why that should be the case. This is probably the real bug here.

Jay Janssen, MySQL Consulting Lead, Percona
http://about.me/jay.janssen
Percona Live in Santa Clara, CA April 22nd-25th 2013
http://www.percona.com/live/mysql-conference-2013/

Jay,

The LRU restore process should be expected to have an impact on the cluster. LRU restore causes a lot of I/O activity, especially for large buffer pools. This puts excessive load on one node and can inadvertently cause problems for the whole cluster, because that node may not be able to keep up with cluster replication while the LRU restore is in progress.

Jay Janssen (jay-janssen) wrote :

Ovais,
  I agree with you, but it depends on what symptoms are actually being seen. If the "cluster stalls" are really just flow control, then I agree this isn't technically a bug. If they are something worse, then there may be a larger problem.

 Ilias, can you confirm the "stalls" you see? Are they just flow control because the node doing the LRU restore is busy? I would expect that node to increase wsrep_flow_control_sent during this window. Or do you see other symptoms on the cluster at this time?
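The counter check Jay describes can be done by sampling SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%' before and after the restore window and diffing the values. A minimal sketch of the diffing step, using made-up sample values in place of live `mysql -e` output:

```shell
#!/bin/sh
# Sketch: diff wsrep_flow_control_sent between two SHOW GLOBAL STATUS
# samples. The sample strings below are made-up values for illustration;
# in practice capture the output of
#   mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%'"
fc_sent() {
    # extract the wsrep_flow_control_sent value from "name value" output
    printf '%s\n' "$1" | awk '$1 == "wsrep_flow_control_sent" { print $2 }'
}

before="wsrep_flow_control_sent 12"   # sample taken before the LRU restore
after="wsrep_flow_control_sent 57"    # sample taken after it finishes
delta=$(( $(fc_sent "$after") - $(fc_sent "$before") ))
echo "flow control pauses sent during window: $delta"
```

A steadily climbing delta on the restoring node, with the rest of the cluster healthy, would point at flow control rather than a deeper fault.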

Ilias Bertsimas (d-ilias) wrote :

Jay,

Yes, it is the flow control pause that stalls the cluster; that can be solved with innodb_blocking_buffer_pool_restore=ON.
But the side effect I mentioned is that the buffer pool pages are restored twice.

Once when wsrep tries to recover the uuid:gtid in
 WSREP: Running position recovery with --log_error=/tmp/tmp.nci5tZ9xZF
and a second time at the point where the buffer pool pages would be restored even if the node were not part of a Galera cluster.

I suspect wsrep tries to read the uuid:gtid from the InnoDB tablespace, and with blocking in effect this triggers the buffer pool restore before allowing that read to happen.

On Feb 28, 2013, at 8:37 AM, Ilias Bertsimas <email address hidden> wrote:

> Yes it is flow control pause that stalls the cluster. That can be solved with innodb_blocking_buffer_pool_restore=ON

Ok, great. This bit is "not a bug" technically.

> But the side effect I mentioned is the restoration of the buffer pool pages twice.
>
> Once when wsrep tries to recover uuid:gtid in
> WSREP: Running position recovery with --log_error=/tmp/tmp.nci5tZ9xZF
> And the second time where it should recover the buffer pool pages if the node was not part of a galera cluster.

Yes, precisely. The wsrep_recover pass should not do an LRU reload. It's not technically a bug, but it's superfluous and obviously adds extra delay when bringing up the node.


Jay,

Yes, it is something of an obstruction: on one of our nodes with almost 200 GB of buffer pool, it takes almost an hour to bring the node up because the buffer pool restore runs twice.

Ovais Tariq (ovais-tariq) wrote :

Ilias,

> I suspect wsrep tries to read the uuid:gtid from the innodb tablespace and the blocking is in effect and triggers the buffer pool pages restore before allowing that to happen.

Technically, before doing so wsrep should wait for the InnoDB storage engine to report that its initialization is complete, which of course does not happen until the LRU restore is finished when innodb_blocking_buffer_pool_restore=ON.

This can be fixed by checking the wsrep_recovery variable and not starting the LRU thread when it is set, so that the LRU thread doesn't run twice.
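The shape of that fix can be sketched as a simple guard (shell pseudo-form with illustrative names; the actual change lives in the XtraDB startup code, not in a script):

```shell
#!/bin/sh
# Sketch of the guard described above: skip the LRU restore thread when the
# server is running the --wsrep-recover pass. Function and variable names
# are illustrative, not the real server symbols.
should_start_lru_restore() {
    wsrep_recovery="$1"    # 1 when mysqld was started with --wsrep-recover
    restore_enabled="$2"   # innodb_buffer_pool_restore_at_startup enabled?
    if [ "$wsrep_recovery" -eq 1 ]; then
        echo no            # recovery pass: read the position only, no warm-up
    elif [ "$restore_enabled" -eq 1 ]; then
        echo yes           # normal startup: restore the dumped pages once
    else
        echo no
    fi
}

should_start_lru_restore 1 1   # recovery pass
should_start_lru_restore 0 1   # real startup
```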

Changed in percona-xtradb-cluster:
milestone: none → 5.5.30-24.8
assignee: nobody → Raghavendra D Prabhu (raghavendra-prabhu)
status: New → Triaged
summary: - PXC innodb_buffer_pool_restore_at_startup and node joining cluster io
- resources race condition
+ PXC innodb_buffer_pool_restore_at_startup and wsrep_recovery
tags: added: lru restore wsrep
Changed in percona-xtradb-cluster:
status: Triaged → Fix Released
status: Fix Released → Fix Committed
Jay Janssen (jay-janssen) wrote :

I wonder if this should be expanded to skip other things that are unnecessary for wsrep_recover and cause it to be really slow. I'm thinking about large buffer pools and innodb_buffer_pool_populate, for example.

@Jay,

Yes, I have already added it for buffer_pool_populate as well -
you can review the diff here -- https://bazaar.launchpad.net/~raghavendra-prabhu/percona-xtradb-cluster/bug-1108035/revision/398

Let me know of any other use cases where this may be needed.

If you are hitting this regression with innodb_buffer_pool_populate, you can apply this change to mysqld_safe:

--- mysqld_safe 2013-06-06 16:05:26.000000000 +0530
+++ /tmp/mysqld_safe 2013-06-07 15:27:54.660401912 +0530
@@ -222,7 +222,7 @@
   log_notice "WSREP: Running position recovery with --log_error=$wr_logfile \
                                 --pid-file="$DATADIR/`hostname`-recover.pid""

- eval_log_error "$mysqld_cmd --log_error=$wr_logfile --wsrep-recover \
+ eval_log_error "$mysqld_cmd --log_error=$wr_logfile --innodb_buffer_pool_populate=0 --wsrep-recover \
                             --pid-file="$DATADIR/`hostname`-recover.pid""

   local rp="$(grep 'WSREP: Recovered position:' $wr_logfile)"

Note: due to a bug, mysqld --verbose --help won't show innodb_buffer_pool_populate, but the option is there.

Can a mysqld startup with 144 GB of buffer pool and innodb_buffer_pool_populate set, but wsrep_recover NOT set, be affected by this bug (or fixed by the patch) when IST ends up with:

...
130607 1:09:43 [ERROR] WSREP: Local state seqno (2249918697) is greater than group seqno (2249916946): states diverged. Aborting to avoid potential data loss. Remove '/data/mysql//grastate.dat' file and restart if you wish to continue. (FATAL)
at galera/src/replicator_str.cpp:state_transfer_required():34

and the node can join the cluster only after a full state transfer? Should I report a separate bug for this?

Should the manual mention this problem for PXC when describing the innodb_buffer_pool_populate variable?

@Valerii, that doesn't seem to be related to this. Have you been able to repeat it with innodb_buffer_pool_populate ON but not with it OFF? It looks like an entirely different issue; yes, please report a separate bug for it.

Alex Yurchenko (ayurchen) wrote :

Valerii, Raghu, this does not even look like a bug. First I'd make sure that there really was no state divergence.

Changed in percona-xtradb-cluster:
status: Fix Committed → Fix Released

Yes, we had the problem with innodb_buffer_pool_populate=ON at every node restart after a successful full state transfer. As soon as we set it to OFF, the problem went away.
