PBXT

Endless loop on recovery

Bug #315493 reported by Philip Stoev on 2009-01-09

Affects		Status	Importance	Assigned to	Milestone
	PBXT	Fix Committed	Undecided	Vladimir Kolesnikov

Bug Description

When PBXT ran recovery after a kill -9, it entered an endless loop, as follows:

2111 while (!xt_xn_is_before(xt_xn_get_curr_id(db), db->db_xn_to_clean_id)) { // was db->db_xn_to_clean_id <= xt_xn_get_curr_id(db)
2112 xt_lock_mutex(self, &db->db_sw_lock);
2113 pushr_(xt_unlock_mutex, &db->db_sw_lock);
2114 xt_wakeup_sweeper(db);
2115 freer_(); // xt_unlock_mutex(&db->db_sw_lock)
2116 xt_sleep_100th_second(1);
2117 now = time(NULL);
2118 if (abort_time && now >= then + abort_time) {
2122 if (now >= then + 2) {
2123 if (!message) {
2111 while (!xt_xn_is_before(xt_xn_get_curr_id(db), db->db_xn_to_clean_id)) { // was db->db_xn_to_clean_id <= xt_xn_get_curr_id(db)
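
The trailing comment on line 2111 notes that the loop condition used to be a plain `<=` and now goes through xt_xn_is_before(). As a rough illustration of why such a helper can differ from `<=`, here is a minimal sketch that assumes transaction IDs are 32-bit counters that may wrap around. The report does not state this, and the names id_is_before and sweeper_has_work are hypothetical, not PBXT functions. The actual PBXT code may well differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, NOT PBXT's xt_xn_is_before(): returns true when
 * id `a` logically precedes id `b` on a 32-bit counter that may wrap
 * (the 32-bit width is an assumption made for this sketch). */
static bool id_is_before(uint32_t a, uint32_t b)
{
    /* Unsigned subtraction wraps modulo 2^32. If the wrapped distance
     * a - b lands in the upper half of the range, a is behind b. */
    return a != b && (uint32_t)(a - b) > UINT32_MAX / 2;
}

/* Hypothetical mirror of the loop condition in the trace: keep waiting
 * while the current id is not yet before the "to clean" id. */
static bool sweeper_has_work(uint32_t curr_id, uint32_t to_clean_id)
{
    return !id_is_before(curr_id, to_clean_id);
}
```

Under these assumptions, id_is_before(0xFFFFFFF0, 0x10) is true even though 0xFFFFFFF0 < 0x10 is false, which is the case where a plain comparison and a wrap-aware one disagree. Whether that difference is related to the hang reported here is not established in this report.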

Philip Stoev (pstoev) wrote on 2009-01-09:

Thread stacks (18.7 KiB, text/plain)

Vladimir Kolesnikov (vkolesnikov) wrote on 2009-01-12:

Philip,

Is this repeatable?

According to the stack dump, it looks like this happens right after recovery, when the sweeper removes uncommitted transactions after replaying the log. It might be possible to reproduce the problem by simply re-executing the SQL commands from the query log and killing the server immediately afterwards. If you can repeat the problem, please attach the database or instructions on how to repeat it.

We have a script that runs the dbt2 test and kills the server after a random period of time during the test. I was not able to reproduce the problem using that script.

Thanks.

Changed in pbxt:
assignee:	nobody → vkolesnikov
status:	New → Incomplete

Philip Stoev (pstoev) wrote on 2009-01-12:

Datadir before recovery (11.6 MiB, application/zip)

Hello,

This bug is repeatable in the sense that if you initiate recovery again on the initial datadir, PBXT will hang again.

Changed in pbxt:
status:	Incomplete → New

Philip Stoev (pstoev) wrote on 2009-01-12:

It is also repeatable when you start from scratch:

To avoid the recovery problem related to partitioning, comment out the partitions line in combinations.zz. Then execute:

$ perl runall.pl \
   --mem \
   --rows=100 \
   --threads=32 \
   --mask=2662 \
   --queries=1000000 \
   --duration=300 \
   --basedir=/build/bzr/mysql-6.0 \
   --mysqld=--plugin-dir=/build/bzr/pbxt/src/.libs/ \
   --mysqld=--plugin-load=PBXT=libpbxt.so \
   --engine=PBXT \
   --grammar=conf/combinations.yy \
   --gendata=conf/combinations.zz \
   --reporter=Deadlock,ErrorLog,Backtrace,Recovery \
   --mysqld=--loose-lock-wait-timeout=1 \
   --mysqld=--log-output=none

Note that the --mask parameter causes only certain portions of the SQL grammar to be exercised, so this test only includes queries of the following types: update, insert, replace.

Vladimir Kolesnikov (vkolesnikov) wrote on 2009-01-12:

OK, I was able to reproduce the problem on 6.0.8. Interestingly, I couldn't repeat it on 5.1.30. It looks like 5.1.30 is fine with the 6.0 table format: it didn't complain, and I was able to work with the server...

Changed in pbxt:
status:	New → Confirmed

Vladimir Kolesnikov (vkolesnikov) on 2009-02-23

Changed in pbxt:
status:	Confirmed → In Progress

Vladimir Kolesnikov (vkolesnikov) wrote on 2009-06-10:

Tested against rev. 656. The recovery of the attached example finishes successfully. The bug was fixed in a previous revision.

Changed in pbxt:
status:	In Progress → Fix Committed
