MariaDB consistently crashes in collect_tables on Aria checkpoint execution

Bug #902654 reported by Scott Feldstein on 2011-12-11
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MariaDB
Fix Released
High
Michael Widenius

Bug Description

Hi,
This issue occurs on MariaDB 5.2.10 and 5.3.2-beta

OS info:
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.6 (Tikanga)

Memory (with MariaDB running):
$ free -m
             total used free shared buffers cached
Mem: 16050 14153 1897 0 242 12572
-/+ buffers/cache: 1338 14712
Swap: 4095 89 4006

This occurs just about every 5 minutes on the dot, but sometimes it crashes out of band. mysqld_safe continuously restarts it.

Attached info includes:
cpuinfo (from /proc)
mysql table creation script
my.cnf
mysqld.log

To repro, simply create a db, run the data_tables.sql script to create all the tables and it will start to crash.

Related branches

Scott Feldstein (scottmf76) wrote :
Elena Stepanova (elenst) wrote :

Hi,

Just creating the tables doesn't seem to cause the crash, at least it isn't crashing for me so far. I used the same settings as you.

Your log suggests that your server is not idle, you have some scheduled activity, in a cronjob or such -- something that connects to the server and apparently executes something. Could you find out what is executed? The connection looks local, that is whatever is running, is probably on the same box.

If you could turn on the general log temporarily, its output could help to understand what flow causes the problem. To turn on logging, set general_log=1 and global general_log_file=<file location>. If you do it on a running server, please make sure you are changing global variables (set global); but it would be more useful if you could set them in the cnf file, so we see the whole activity from the server restart and till the crash.

One more thing, in the log file, when the problem started, the server complains about mysql.user and mysql.db tables being crashed. Could you please check if they are OK now (run check table on them), just so we know we are not dealing with an underlying condition?

Thank you

Scott Feldstein (scottmf76) wrote :

attaching mysql general log

Elena Stepanova (elenst) wrote :

Hi,

Thank you for the provided log.

I have not succeeded so far at reproducing the problem, even by imitating the flow similar to yours, and using the identical data structures; ran tests both on RHEL 5.6 (although with less memory), and OpenSUSE with the same memory-related settings. I'll try to find configuration resembling yours more closely.

Is it correct that you only had the problem on RHEL 5.6, and only with aria-pagecache-buffer-size=12G?
Do you have installations on RHEL 5.6 that use a different memory configuration, and/or installations on different OSes with the same memory configuration? If you do, have you observed any problems with any of those?

Also, the general log that you provided has a bit different crashing pattern comparing to the previous error log. the first log shows that the server crashed about every 5 min for the first hour, then in 10 min, then in 50 min; and further with the same 2-hour cycle. If you happen to have a general log from any of such servers (where the crashes follow an uneven but stable pattern), it could also help.

Scott Feldstein (scottmf76) wrote :

Hi,
Attaching the log files that correspond to each other. I also generated a core dump using the mysql --core-file option. I hope you'll be able to diagnose the direct cause of the bug from that.

thanks.

Scott Feldstein (scottmf76) wrote :

Hi,
I am now able to reproduce this bug 100% of the time. I am now using 5.3.3 rc on linux.

I have attached a bundle which will allow you to do the same.

follow these instructions:

# create a db in mysql called "mydb"
# make sure that your ulimit -n is >= 8192 on your mysqld server due to the number of tables that are in this implementation
# unbundle mysql.bug.tgz
# use this script to populate the tables
$ cat aria_metric_data_tables0.sql | mysql -u <user> -p<pass> -D mydb
$ bin/mysqlreplay.sh <jdbcUrl> <user> <pass>
# e.g. bin/mysqlreplay.sh jdbc:mysql://localhost:3306/mydb myadmin mypasswd
# mariadb will crash

It turns out that the monitoring solution that we use calls

"show /*!50002 global */ status"
AND
SELECT * FROM information_schema.tables WHERE lower(table_schema) = 'mydb' AND engine is not null

by default this runs every 5 mins (same interval as the crash). When i change the interval on the monitoring solution to 20 mins, then mariadb crashes every 20 mins. When i turn it off the agent then mariadb is stable.

Please let me know if you are having issues with the script or are not able to reproduce. It works every time for me and the version of centos / rhel has nothing to do with it as i previously thought.

thanks.

Elena Stepanova (elenst) wrote :

Thank you, now I can reproduce the problem.

I am attaching an MTR testcase, it is based on the flow that Scott provided, only converted in mysqltest format and reduced. The test fails for me reliably on 5.2 and 5.3 debug server. If it doesn't cause the crash, try the algorithm suggested by Scott, it worked for me too.

Debug assertion:

mysqld: ma_checkpoint.c:949: collect_tables: Assertion `state_copy < state_copies_end' failed

#8 0xb7571014 in __assert_fail () from /lib/libc.so.6
#9 0x0860aba7 in collect_tables (str=0xb754a23c, checkpoint_start_log_horizon=4300787999)
    at ma_checkpoint.c:949
#10 0x086097a8 in really_execute_checkpoint () at ma_checkpoint.c:198
#11 0x086095c1 in ma_checkpoint_execute (level=CHECKPOINT_MEDIUM, no_wait=1 '\001')
    at ma_checkpoint.c:132
#12 0x0860a310 in ma_checkpoint_background (arg=0x1e) at ma_checkpoint.c:621
#13 0xb77ffb25 in start_thread () from /lib/libpthread.so.0

Release crash:

#3 0x0825ab96 in handle_segfault (sig=11) at mysqld.cc:2838
#4 <signal handler called>
#5 0x0847fb3a in collect_tables (level=CHECKPOINT_MEDIUM, no_wait=1 '\001')
    at ma_checkpoint.c:948
#6 really_execute_checkpoint (level=CHECKPOINT_MEDIUM, no_wait=1 '\001')
    at ma_checkpoint.c:198
#7 ma_checkpoint_execute (level=CHECKPOINT_MEDIUM, no_wait=1 '\001')
    at ma_checkpoint.c:132
#8 0x08480716 in ma_checkpoint_background (arg=0x1e) at ma_checkpoint.c:621
#9 0xb77c9b25 in start_thread () from /lib/libpthread.so.0

summary: - MariaDB consistently crashes
+ MariaDB consistently crashes in collect_tables on Aria checkpoint
+ execution
Changed in maria:
assignee: nobody → Michael Widenius (monty)
Elena Stepanova (elenst) on 2012-02-07
Changed in maria:
milestone: none → 5.2
Elena Stepanova (elenst) on 2012-02-07
Changed in maria:
importance: Undecided → High
Michael Widenius (monty) on 2012-02-20
Changed in maria:
status: New → In Progress
Michael Widenius (monty) wrote :

The bug happens when you have more than 1024 open Aria tables during checkpoint.
This is now fixed in 5.2. Fix will be in 5.2.11.
Will push to 5.3 today.

Michael Widenius (monty) on 2012-02-21
Changed in maria:
status: In Progress → Fix Committed
Michael Widenius (monty) on 2012-02-28
Changed in maria:
status: Fix Committed → Fix Released
Scott Feldstein (scottmf76) wrote :

Just fyi, been running maria 5.3.5 GA for several days now and it is stable.

I'm very impressed with the quick turnover. Thanks!

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers