multi-master replication test failing - duplicate trx message error/ first master rpl failure.

Bug #867866 reported by Patrick Crews on 2011-10-04
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Drizzle
High
Patrick Crews
7.1
High
Patrick Crews

Bug Description

So, I have created the infrastructure to setup multi-master topologies for testing.
The gist of the test is we spin 3 servers = 2 masters, one slave replicating from both.
master1 = we create test.t1 and some records
master2 = we create test.t2 and some records
When we try:
./dbqp --suite=slave --record multi_master_basic, the test fails as follows (slave crashes)
$ cat workdir/bot0/s2/var/log/s2.err
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: 127 rollback segment(s) active.
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
(SQLSTATE 00000) Duplicate entry '772-1' for key 'PRIMARY'
Failure while executing:
INSERT INTO `sys_replication`.`queue` (`master_id`, `trx_id`, `seg_id`, `commit_order`, `originating_server_uuid`, `originating_commit_id`, `msg`) VALUES (2, 772, 1, 1, 'ac9c8ac0-8f10-474b-9bbd-b61d2cdb2b93' , 1, 'transaction_context {
  server_id: 1
  transaction_id: 772
  start_timestamp: 1317760732106016
  end_timestamp: 1317760732106017
}
event {
  type: STARTUP
}
segment_id: 1
end_segment: true
')

Replication slave: Unable to insert into queue.
Replication slave: drizzle_state_read:lost connection to server (EOF)
Lost connection to master. Reconnecting.
Replication slave: drizzle_state_connect:could not connect
111004 16:39:05 InnoDB: Starting shutdown...

Testing the setup with --start-and-exit shows that we only seem to be replicating from master 2, not master 1.
the config file is as follows:
ignore-errors

[master1]
master-host=127.0.0.1
master-port=9306
master-user=root
master-pass=''

[master2]
master-host=127.0.0.1
master-port=9312
master-user=root
master-pass=''

Related branches

Patrick Crews (patrick-crews) wrote :

The test + relevant code is in the linked branch:
./dbqp --suite=slave --record multi_master_basic

Slave logs are:
drizzle/tests/workdir/bot0/s2/var/log/s2.err

Changed in drizzle:
importance: Undecided → High
status: New → Confirmed
summary: - multi-master replication test failing
+ multi-master replication test failing - duplicate trx message error/
+ first master rpl failure.
David Shrewsbury (dshrews) wrote :

I've asked Patrick to verify that his test sets the server-id on the second master. They need to be different. This error message looks like they're both using 1.

Patrick Crews (patrick-crews) wrote :

So, I've updated the test system to use a --server-id argument for each server (my dumb mistake...)
After the changes, I am seeing the slave server crash with this output in the log:
cat workdir/bot0/s2/var/log/s2.err
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: 127 rollback segment(s) active.
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
terminate called after throwing an instance of 'std::out_of_range'
  what(): basic_string::substr
111014 14:30:35 - drizzled got signal 6;
This could be because you hit a bug. It is also possible that this binary
 or one of the libraries it was linked against is corrupt, improperly built,
 or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

read_buffer_size=131072
max_used_connections=2
connection_count=2
It is possible that drizzled could use up to
(read_buffer_size + sort_buffer_size)*thread_count
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

call_backtrace() began at drizzled/signal_handler.cc:156 for drizzled_handle_segfault()
Number of stack frames obtained: 26
/home/pcrews/bzr/work/drizzle_mm_test/drizzled/.libs/lt-drizzled() [0x6d0553]
()
gsignal()
abort()
__gnu_cxx::__verbose_terminate_handler()
()
()
()
std::__throw_out_of_range(char const*)
()
/home/pcrews/bzr/work/drizzle_mm_test/drizzled/.libs/lt-drizzled() [0x5328a6]
/home/pcrews/bzr/work/drizzle_mm_test/drizzled/.libs/lt-drizzled() [0x579d78]
drizzled::Execute::run(data_ref_basic<char const*>, drizzled::sql::ResultSet&)
()
()
()
()
()
()
()
()
()
thread_proxy()
()
clone()

Patrick Crews (patrick-crews) wrote :

If I don't get what seems to be an immediate crash / glibc output, I get this:
$ cat workdir/bot0/s2/var/log/s2.err
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: 127 rollback segment(s) active.
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
(SQLSTATE 00000) Duplicate entry '772-1' for key 'PRIMARY'
Failure while executing:
INSERT INTO `sys_replication`.`queue` (`master_id`, `trx_id`, `seg_id`, `commit_order`, `originating_server_uuid`, `originating_commit_id`, `msg`) VALUES (1, 772, 1, 1, 'a48bf5f2-6702-4735-94f5-9624ad2f91c6' , 1, 'transaction_context {
  server_id: 1
  transaction_id: 772
  start_timestamp: 1318624771081308
  end_timestamp: 1318624771081309
}
event {
  type: STARTUP
}
segment_id: 1
end_segment: true
')

Replication slave: Unable to insert into queue.
Replication slave: drizzle_state_read:lost connection to server (EOF)
Lost connection to master. Reconnecting.
Replication slave: drizzle_state_connect:could not connect
111014 16:39:44 InnoDB: Starting shutdown...

Patrick Crews (patrick-crews) wrote :
Download full text (12.2 KiB)

I see this output when the slave crashes hard:
*** glibc detected *** /drizzle_mm_test/drizzled/.libs/lt-drizzled: double free or corruption (!prev): 0x0000000003991230 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x78a96)[0x7fdd3669ba96]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x6c)[0x7fdd3669fd7c]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSs6assignERKSs+0x76)[0x7fdd3711d5c6]
/drizzle_mm_test/drizzled/.libs/lt-drizzled[0x532868]
/drizzle_mm_test/drizzled/.libs/lt-drizzled[0x579d78]
/drizzle_mm_test/drizzled/.libs/lt-drizzled(_ZN8drizzled7Execute3runE14data_ref_basicIPKcERNS_3sql9ResultSetE+0xb5)[0x578a07]
/drizzle_mm_test/plugin/.libs/libslave_plugin.so(+0x4d6eb)[0x7fdd31c846eb]
/drizzle_mm_test/plugin/.libs/libslave_plugin.so(+0x4caaf)[0x7fdd31c83aaf]
/drizzle_mm_test/plugin/.libs/libslave_plugin.so(+0x4b060)[0x7fdd31c82060]
/drizzle_mm_test/plugin/.libs/libslave_plugin.so(+0x42d59)[0x7fdd31c79d59]
/drizzle_mm_test/plugin/.libs/libslave_plugin.so(+0x425c5)[0x7fdd31c795c5]
/drizzle_mm_test/plugin/.libs/libslave_plugin.so(+0x42029)[0x7fdd31c79029]
/drizzle_mm_test/plugin/.libs/libslave_plugin.so(+0x4181e)[0x7fdd31c7881e]
/drizzle_mm_test/plugin/.libs/libslave_plugin.so(+0x40ffa)[0x7fdd31c77ffa]
/drizzle_mm_test/plugin/.libs/libslave_plugin.so(+0x3f71e)[0x7fdd31c7671e]
/usr/lib/libboost_thread.so.1.42.0(thread_proxy+0x5e)[0x7fdd380e916e]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7efc)[0x7fdd369c9efc]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fdd3670489d]
======= Memory map: ========
00400000-00cb0000 r-xp 00000000 00:15 8915151 /drizzle_mm_test/drizzled/.libs/lt-drizzled
00eb0000-00ec9000 r--p 008b0000 00:15 8915151 /drizzle_mm_test/drizzled/.libs/lt-drizzled
00ec9000-00eff000 rw-p 008c9000 00:15 8915151 /drizzle_mm_test/drizzled/.libs/lt-drizzled
00eff000-00f34000 rw-p 00000000 00:00 0
02d2a000-039ad000 rw-p 00000000 00:00 0 [heap]
7fdd14000000-7fdd14021000 rw-p 00000000 00:00 0
7fdd14021000-7fdd18000000 ---p 00000000 00:00 0
7fdd197fb000-7fdd197fc000 ---p 00000000 00:00 0
7fdd197fc000-7fdd19ffc000 rw-p 00000000 00:00 0
7fdd19ffc000-7fdd19ffd000 ---p 00000000 00:00 0
7fdd19ffd000-7fdd1a7fd000 rw-p 00000000 00:00 0
7fdd1a7fd000-7fdd1a7fe000 ---p 00000000 00:00 0
7fdd1a7fe000-7fdd1affe000 rw-p 00000000 00:00 0
7fdd1affe000-7fdd1afff000 ---p 00000000 00:00 0
7fdd1afff000-7fdd1b7ff000 rw-p 00000000 00:00 0
7fdd1b7ff000-7fdd1b800000 ---p 00000000 00:00 0
7fdd1b800000-7fdd1c000000 rw-p 00000000 00:00 0
7fdd1c000000-7fdd1c021000 rw-p 00000000 00:00 0
7fdd1c021000-7fdd20000000 ---p 00000000 00:00 0
7fdd2014a000-7fdd2014b000 ---p 00000000 00:00 0
7fdd2014b000-7fdd2094b000 rw-p 00000000 00:00 0
7fdd2094b000-7fdd2094c000 ---p 00000000 00:00 0
7fdd2094c000-7fdd2114c000 rw-p 00000000 00:00 0
7fdd2114c000-7fdd2114d000 ---p 00000000 00:00 0
7fdd2114d000-7fdd2194d000 rw-p 00000000 00:00 0
7fdd2194d000-7fdd2194e000 ---p 00000000 00:00 0
7fdd2194e000-7fdd2214e000 rw-p 00000000 00:00 0
7fdd2214e000-7fdd2214f000 ---p 00000000 00:00 0
7fdd2214f000-7fdd2294f000 rw-p 00000000 00:00 0
7fdd2294f000-7fdd229...

David Shrewsbury (dshrews) wrote :

Looks like the remerge of the multisource replication code went wonky somewhere and the primary key on the sys_replication.queue table is not correct. Missing the master_id column. I'm not sure how this code was remerged so there may be other hidden issues like this, but fixing the PK seems to be the fix for this particular bug.

I think there is an issue with the timing of the test case, though. Told Patrick about it and he's got a fix for that.

Patrick Crews (patrick-crews) wrote :

Ad hoc testing of this is looking solid. Unfortunately the test case itself needs some loving : ( Wrote bug:
https://bugs.launchpad.net/drizzle/+bug/876914

To track the testcase fix / closing this stuff out (seems they are two separate things / this could become a tedious bug to read through)

Sekhar (msekharmca) wrote :

Hi,
Please explain me more about drizzle flow architecture from clients changes the event and where it will stored and how it will replicated multi master and slaves.

Thanks,
sekhar.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers