Endless loop on recovery

Bug #315493 reported by Philip Stoev
Affects: PBXT
Status: Fix Committed
Importance: Undecided
Assigned to: Vladimir Kolesnikov
Milestone: (none)

Bug Description

When PBXT ran recovery after a kill -9, it entered an endless loop, as follows:

    while (!xt_xn_is_before(xt_xn_get_curr_id(db), db->db_xn_to_clean_id)) { // was db->db_xn_to_clean_id <= xt_xn_get_curr_id(db)
        xt_lock_mutex(self, &db->db_sw_lock);
        pushr_(xt_unlock_mutex, &db->db_sw_lock);
        xt_wakeup_sweeper(db);
        freer_(); // xt_unlock_mutex(&db->db_sw_lock)
        xt_sleep_100th_second(1);
        now = time(NULL);
        if (abort_time && now >= then + abort_time) {
        ...
        if (now >= then + 2) {
            if (!message) {
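The comment on the loop condition says a plain `<=` comparison was replaced by a call to `xt_xn_is_before`. One common reason to use such a helper is that transaction IDs come from a counter that can wrap around, in which case a plain integer comparison gives the wrong order near the wrap point. The following is a minimal sketch of that general technique, not PBXT's actual implementation: the type `xn_id_t`, its 32-bit width, and the functions `id_is_before` and `id_is_before_naive` are all hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 32-bit transaction ID type (an assumption for this sketch). */
typedef uint32_t xn_id_t;

/*
 * Wraparound-aware ordering: a is "before" b if the unsigned difference
 * a - b, reinterpreted as signed, is negative. This is only meaningful
 * while the two IDs are less than 2^31 apart. The unsigned-to-signed cast
 * relies on two's complement conversion, which mainstream compilers provide.
 */
static bool id_is_before(xn_id_t a, xn_id_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Plain comparison: correct only as long as the counter never wraps. */
static bool id_is_before_naive(xn_id_t a, xn_id_t b)
{
    return a < b;
}

/* Small usage example showing where the two comparisons disagree. */
static void id_order_self_check(void)
{
    /* Far from the wrap point both versions agree. */
    assert(id_is_before(5u, 10u));
    assert(id_is_before_naive(5u, 10u));

    /* An ID issued just before the wrap precedes one issued just after. */
    assert(id_is_before(0xFFFFFFF0u, 0x10u));
    assert(!id_is_before_naive(0xFFFFFFF0u, 0x10u));

    /* An ID is never before itself. */
    assert(!id_is_before(7u, 7u));
}
```

With a comparison of this kind, a wait loop shaped like the one in the report keeps polling for as long as the current ID is not yet before the clean-up ID, so it only terminates if the sweeper actually advances the clean-up ID.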

Philip Stoev (pstoev) wrote :
Vladimir Kolesnikov (vkolesnikov) wrote :

Philip,

is this repeatable?

According to the stack dump, it looks like this happens right after recovery, when the sweeper removes uncommitted transactions after replaying the log. It might be possible to reproduce the problem by simply re-executing the SQL commands from the query log and killing the server immediately afterwards. If you can repeat the problem, please attach the database or instructions on how to repeat it.

We have a script that runs the dbt2 test and kills the server after a random period of time during the test. I was not able to reproduce the problem using that script.
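The script mentioned above is not included in this report. As a rough sketch of the same idea (start a process, let it run for a random time, then kill it with SIGKILL, the same signal as kill -9), something like the following could be used on a POSIX system. The function names `run_then_kill` and `random_delay_ms` are hypothetical, and the command to run is left to the caller; nothing here reflects the actual dbt2 script.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Pick a delay in [min_ms, max_ms]. Seed rand() with srand() beforehand. */
static unsigned random_delay_ms(unsigned min_ms, unsigned max_ms)
{
    assert(min_ms <= max_ms);
    return min_ms + (unsigned)rand() % (max_ms - min_ms + 1u);
}

/*
 * Start argv[0] as a child process, let it run for delay_ms milliseconds,
 * then send SIGKILL. Returns the number of the signal that terminated the
 * child, 0 if the child had already exited on its own, or -1 on error.
 */
static int run_then_kill(char *const argv[], unsigned delay_ms)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127); /* only reached if exec failed */
    }

    struct timespec ts;
    ts.tv_sec = (time_t)(delay_ms / 1000u);
    ts.tv_nsec = (long)(delay_ms % 1000u) * 1000000L;
    nanosleep(&ts, NULL);

    /* SIGKILL cannot be caught, so the child gets no chance to shut down. */
    kill(pid, SIGKILL);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

A harness built on this would call `run_then_kill` with the server command line and `random_delay_ms(...)` as the delay, then restart the server on the same data directory to trigger recovery.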

Thanks.

Changed in pbxt:
assignee: nobody → vkolesnikov
status: New → Incomplete
Philip Stoev (pstoev) wrote :

Hello,

This bug is repeatable in the sense that if you initiate recovery again on the initial datadir, PBXT will hang again.

Changed in pbxt:
status: Incomplete → New
Philip Stoev (pstoev) wrote :

It is also repeatable when you start from scratch:

To avoid the recovery problem related to partitioning, comment out the partitions line in combinations.zz. Then execute:

$ perl runall.pl \
   --mem \
   --rows=100 \
   --threads=32 \
   --mask=2662 \
   --queries=1000000 \
   --duration=300 \
   --basedir=/build/bzr/mysql-6.0 \
   --mysqld=--plugin-dir=/build/bzr/pbxt/src/.libs/ \
   --mysqld=--plugin-load=PBXT=libpbxt.so \
   --engine=PBXT \
   --grammar=conf/combinations.yy \
   --gendata=conf/combinations.zz \
   --reporter=Deadlock,ErrorLog,Backtrace,Recovery \
   --mysqld=--loose-lock-wait-timeout=1 \
   --mysqld=--log-output=none

Note that the --mask parameter causes only certain portions of the SQL grammar to be exercised. So this test only includes queries of the following types: update, insert, replace.

Vladimir Kolesnikov (vkolesnikov) wrote :

OK, I was able to reproduce the problem on 6.0.8. Interestingly, I couldn't repeat it on 5.1.30. It looks like 5.1.30 is OK with the 6.0 table format: it didn't complain, and I was able to work with the server...

Changed in pbxt:
status: New → Confirmed
Changed in pbxt:
status: Confirmed → In Progress
Vladimir Kolesnikov (vkolesnikov) wrote :

Tested against rev. 656. The recovery of the attached example finishes successfully. The bug was fixed in a previous revision.

Changed in pbxt:
status: In Progress → Fix Committed