core dump - Can't connect to MySQL server on '10.0.2.148' (111) at /usr/local/bin/pt-online-schema-change line 2262

Bug #1472779 reported by Dave Gregg
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona Toolkit moved to https://jira.percona.com/projects/PT
Expired
Undecided
Unassigned

Bug Description

I have a unique topology with replication.
I am attempting to attach to a Master that is ALSO a SLAVE....(10.0.2.148)

I have a dsns table set up on 10.0.2.148 that has the one entry (10.0.2.147) - the slave

There is a Grand MASTER 10.0.2.150
                                     |
                                    \/
So there is a Master-Slave 10.0.2.149 (has replication filters)
                                     |
                                    \/
                          A 2nd Master-Slave 10.0.2.148
                                     |
                                    \/
                            A Slave 10.0.2.147

The server I am attempting to connect to (10.0.2.148) and execute schema changes on has one Slave Host replicating -> ASE2TestVM (10.0.2.147)

I run a --dry-run and there are NO issues. (Debug mode tells me nothing extra???)

I run in --execute mode and it always end up with the same....A core dump and it exits. I Keep getting a message that it cannot connect to MySQL Server on '10.0.2.148' (111) at /usr/local/bin/pt-online-schema-change line 2262.

Once this happens....I notice the connection has dropped and that the SLAVE has STOPPED Replication on 10.0.2.148
The Slave host under that master/slave (10.0.2.147) - is fine.....it is still replicating.

A core dump happens and I have to clean up the triggers and the temp table manually.

The command I am running is as follows:

>pt-online-schema-change --recursion-method=dsn=h=10.0.2.148,D=percona,t=dsns --nocheck-replication-filters --max-lag=10 --critical-load Threads_running=15 --alter "modify ExtraText1 varchar(256)" h=10.0.2.148,D=Data,t=Retest,u=root,p=pswd --execute

Found 1 slaves:
  ASE2TestVM
Will check slave lag on:
  ASE2TestVM
Operation, tries, wait:
  copy_rows, 10, 0.25
  create_triggers, 10, 1
  drop_triggers, 10, 1
  swap_tables, 10, 1
  update_foreign_keys, 10, 1
Altering `ATData`.`Retest`...
Creating new table...
Created new table ATData._Retest_new OK.
Altering new table...
Altered `ATData`.`_Retest_new` OK.
2015-07-08T15:28:46 Creating triggers...
2015-07-08T15:28:46 Created triggers OK.
2015-07-08T15:28:46 Copying approximately 5406492 rows...
2015-07-08T15:28:49 Dropping triggers...
2015-07-08T15:28:49 Error dropping trigger: DBI connect('ATData;host=10.0.2.148;mysql_read_default_group=client','root',...) failed: Can't c
.0.2.148' (111) at /usr/local/bin/pt-online-schema-change line 2262.

2015-07-08T15:28:49 Error dropping trigger: DBI connect('ATData;host=10.0.2.148;mysql_read_default_group=client','root',...) failed: Can't c
.0.2.148' (111) at /usr/local/bin/pt-online-schema-change line 2262.

2015-07-08T15:28:49 Error dropping trigger: DBI connect('ATData;host=10.0.2.148;mysql_read_default_group=client','root',...) failed: Can't c
.0.2.148' (111) at /usr/local/bin/pt-online-schema-change line 2262.

2015-07-08T15:28:49 To try dropping the triggers again, execute:
DROP TRIGGER IF EXISTS `ATData`.`pt_osc_ATData_Retest_del`;
DROP TRIGGER IF EXISTS `ATData`.`pt_osc_ATData_Retest_upd`;
DROP TRIGGER IF EXISTS `ATData`.`pt_osc_ATData_Retest_ins`;
`ATData`.`Retest` was not altered.
Segmentation fault (core dumped)
dgregg@Slave6vm:/usr/local/bin$

Revision history for this message
Dave Gregg (dgregg) wrote :

Here is the dry-run
no issues...

dgregg@Slave6vm:/usr/local/bin$ pt-online-schema-change --recursion-method=dsn=h=10.0.2.148,D=percona,t=dsns --nocheck-replication-filters --check-interval=2 --max-lag=10 --critical-load Threads_running=15 --alter "modify ExtraText1 varchar(256);" h=10.0.2.148,D=ATData,t=Retest,u=root,p=pswd--dry-run
Operation, tries, wait:
  copy_rows, 10, 0.25
  create_triggers, 10, 1
  drop_triggers, 10, 1
  swap_tables, 10, 1
  update_foreign_keys, 10, 1
Starting a dry run. `ATData`.`Retest` will not be altered. Specify --execute instead of --dry-run to alter the table.
Creating new table...
Created new table ATData._Retest_new OK.
Altering new table...
Altered `ATData`.`_Retest_new` OK.
Not creating triggers because this is a dry run.
Not copying rows because this is a dry run.
Not swapping tables because this is a dry run.
Not dropping old table because this is a dry run.
Not dropping triggers because this is a dry run.
2015-07-08T15:58:54 Dropping new table...
2015-07-08T15:58:54 Dropped new table OK.
Dry run complete. `ATData`.`Retest` was not altered.
dgregg@Slave6vm:/usr/local/bin$
#

Revision history for this message
Dave Gregg (dgregg) wrote :

I have attempted to stop replication on the server (10.0.2.148) I am connecting to....before I execute the command.
Still no change - we get a core dump.

Revision history for this message
Dave Gregg (dgregg) wrote :

I have verified - I can connect to mysql DB from my remote host that I am executing the "pt-online-schema-change" from, to all the servers in the topology. Also - All the servers can connect to each other.

Revision history for this message
Dave Gregg (dgregg) wrote :

I can connect to the server (10.0.2.148) from the remote host...no issues.
>mysql -h 10.0.2.148 -u root -p

>show slave status;
The server continues replication...

So normal connectivity appears fine....and causes no issues.

Revision history for this message
Dave Gregg (dgregg) wrote :

I tried running a schema change on something I knew would work.....but it ALSO CORE DUMPED........not sure what has changed?????? this created the triggers + table and cleaned up ?? this is operating on 105 million records and it did this immediately upon starting the operation ???

dgregg@Slave6vm:/usr/local/bin$ pt-online-schema-change --recursion-method=dsn=h=10.0.10.110,D=percona,t=dsns --nocheck-replication-filters --check-interval=2 --max-lag=5 --critical-load Threads_running=10 --alter "add column BulkID varchar (30) NOT NULL, add key ( BulkID );" h=10.0.10.110,D=ATDB,t=AccelSummary,u=root,p=pswd --execute
Found 1 slaves:
  ChiefDB2
Will check slave lag on:
  ChiefDB2
Operation, tries, wait:
  copy_rows, 10, 0.25
  create_triggers, 10, 1
  drop_triggers, 10, 1
  swap_tables, 10, 1
  update_foreign_keys, 10, 1
Altering `ATDB`.`AccelSummary`...
Creating new table...
Created new table ATDB._AccelSummary_new OK.
Waiting forever for new table `ATDB`.`_AccelSummary_new` to replicate to ChiefDB2...
Altering new table...
Altered `ATDB`.`_AccelSummary_new` OK.
2015-07-09T11:28:40 Creating triggers...
2015-07-09T11:28:40 Created triggers OK.
2015-07-09T11:28:40 Copying approximately 105672117 rows...
2015-07-09T11:28:40 Dropping triggers...
2015-07-09T11:28:40 Dropped triggers OK.
2015-07-09T11:28:40 Dropping new table...
2015-07-09T11:28:40 Dropped new table OK.
`ATDB`.`AccelSummary` was not altered.
Segmentation fault (core dumped)

Revision history for this message
Dave Gregg (dgregg) wrote :

Removed the dsns from the equation....It still core dumps ??

pt-online-schema-change h=10.0.10.110,D=ATDB,t=AccelSummary,u=root,p=pswd --nocheck-replication-filters --check-interval=2 --max-lag=60 --critical-load Threads_running=10 --alter "add column BulkID varchar (30) NOT NULL, add key ( BulkID );" --execute
Cannot connect to D=ATDB,h=10.0.2.145,p=...,u=root
Found 1 slaves:
  ChiefDB2
Will check slave lag on:
  ChiefDB2
Operation, tries, wait:
  copy_rows, 10, 0.25
  create_triggers, 10, 1
  drop_triggers, 10, 1
  swap_tables, 10, 1
  update_foreign_keys, 10, 1
Altering `ATDB`.`AccelSummary`...
Creating new table...
Created new table ATDB._AccelSummary_new OK.
Waiting forever for new table `ATDB`.`_AccelSummary_new` to replicate to ChiefDB2...
Altering new table...
Altered `ATDB`.`_AccelSummary_new` OK.
2015-07-09T11:47:28 Creating triggers...
2015-07-09T11:47:28 Created triggers OK.
2015-07-09T11:47:28 Copying approximately 105672117 rows...
2015-07-09T11:47:29 Dropping triggers...
2015-07-09T11:47:29 Dropped triggers OK.
2015-07-09T11:47:29 Dropping new table...
2015-07-09T11:47:29 Dropped new table OK.
`ATDB`.`AccelSummary` was not altered.
Segmentation fault (core dumped)

Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

Dave,

The "core dump" may be a sign that it's related to this other issue:

https://bugs.launchpad.net/percona-toolkit/+bug/1446928

There is a patch in the first comment there. It's scheduled for next release.

Apart from there being a serious problem with the error reporting (throwing errors in a loop and core dumping) , the original 'user' error was often a wrong --alter command.

Try applying the patch and see if you get a more meaningful error report.

Regards

tags: added: pt-online-schema-change
Revision history for this message
Dave Gregg (dgregg) wrote :

You know what........

This might be attributed to disk space !!!! or lack there of !!!

I just had a schema change CORE DUMP and clean up because the disk space was low........it ran out of space on the master
as the bin-logs continued to build up........it cleaned up and core dumped.

On the complex replication topology - the servers I am dealing with only have about 20GB on the Master available free and about 16GB on the Slave free. I am attempting to modify 25 million records.

Would the tool recognize this immediately and die? It dies pretty quickly.....

~ Dave

Revision history for this message
Dave Gregg (dgregg) wrote :

Also - are there notes on how to apply the patch?

Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

if you know where pt-online-schema-change is located (usually /usr/bin/)

cd /usr/bin
patch pt-online-schema-change < patch_file

need root permission or sudo

Revision history for this message
Dave Gregg (dgregg) wrote :

Attached a Trace Log.
I added the patch - it's still not giving me much. What I see is that for some reason, suddenly I am loosing connection to MYSQL
to the "Master/Slave" I am connecting to (10.0.2.148) and then it cleans up.

Everything seems to be very successful....initially.....table is created....triggers are created....data is actually copied over into the temp table.....

Then I see around - :NibbleIterator:5328 5311 Nibble: Insert into ATData._Retest_new..........
we have this connection issue that I can't explain??

Again - my command is....

PTDEBUG=1 pt-online-schema-change --recursion-method=dsn=h=10.0.2.148,D=percona,t=dsns --nocheck-replication-filters --check-interval=2 --max-lag=60 --critical-load Threads_running=10 --alter "modify ExtraText1 varchar(256)" h=10.0.2.148,D=ATData,t=Retest,u=root,p=pswd --execute > ASE_Trace.log 2>&1

Revision history for this message
Dave Gregg (dgregg) wrote :

Also I notice that when I run the Online Schema Change tool against the MASTER/SLAVE system........replication is STOPPED
on that server. (10.0.2.148)

So I even attempted to Stop the Slave Replication first and then try it.....it still fails the same exact way....

Revision history for this message
Dave Gregg (dgregg) wrote :

Not sure what else to look at...???

MySQL version on all servers is 5.6.19
(now this is a difference as I have been running against other servers successfully (not this crazy replication topology) that have MySQL 5.6.23 running)!!! So the MySQL version is different here.....

and the pt-online-schema-change tool is 2.2.13

As I always do - I execute the "pt-online-schema-change" tool from a remote server....same as I do for all the other changes I have made in the past.

Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

At least the "core dump" is gone right?

I see you are using:
--recursion-method=dsn=h=10.0.2.148,D=percona,t=dsns

The debug output mentions slave 10.0.0.199 , not 10.0.0.147
Not sure if this could be related. Can you verify if the content of percona.dsns on 10.0.0.148 is ok?
It should just contain dsn for 10.0.0.147 in this case.

Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

Slight typo in my previous comment:

I meant 10.0.2.* instead of 10.0.0.*

Revision history for this message
Dave Gregg (dgregg) wrote :

Yes - Core Dump is gone.

Sorry for the confusion with IP addresses. Since I entered this ticket - my IP addresses for the VMs I am using have changed. DHCP.
The Hierarchy is exactly the same...just an IP address change....
So we have the following: And yes - (the only dsn entry I have is: 10.0.2.199)

There is a Grand MASTER 10.0.10.200
                                     |
                                    \/
So there is a Master-Slave 10.0.2.145 (has replication filters)
                                     |
                                    \/
                          A 2nd Master-Slave 10.0.2.148
                                     |
                                    \/
                            A Slave 10.0.2.199

Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

Thanks Dave,

As I understand, the tool never got near finishing copying the rows to the temp table right? It fails just a few seconds into the operation?

Also, do you see anything strange in 10.0.2.148's mysql error.log ?

Revision history for this message
Dave Gregg (dgregg) wrote :

Hi,

Yes that is correct....
The Triggers are created on the Master/Slave 10.0.2.148
The Triggers are created on the Slave 10.0.2.199
The Temp Tables are created on both...
And there are about 20,000 records copied into these tables.......
Then....it seems there is some kind of ODD connection issue.....and the clean-up is attempted.
It attempts to clean up the triggers....but does not.

Thank you for addressing this - this is a major issue for us.

Thanks
~Dave

Revision history for this message
Dave Gregg (dgregg) wrote :

Not sure what to try at this point?

Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

We have to find out why the "server is going away"
Did you check the error.log ? That can give a clue to why it's disconnecting.
If you can run the server with --log-warnings=2 , this should give more information.

Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

Please upload the error log if you can. It would be great to nail this issue.

Revision history for this message
Dave Gregg (dgregg) wrote :

Hi Frank,

Yes I agree! I would love to find out what's happening here! I have a meeting in a bit and then after that I will make another test run to monitor the logs and see why we are disconnecting.

Thanks ~ Dave

Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

a couple more things ( other than posting the error.log )

use the tool with the "--statistics" option , this reveals some more info in case of failure

probably not the cause but still worth a try: check the timeout options in the server:

"show variables like '%timeout%';"

Revision history for this message
Dave Gregg (dgregg) wrote :

This is the Trace for the following command...

PTDEBUG=1 pt-online-schema-change --statistics --recursion-method=dsn=h=10.0.2.148,D=percona,t=dsns --nocheck-replication-filters --check-interval=2 --max-lag=60 --critical-load Threads_running=10 --alter "modify ExtraText1 varchar(256)" h=10.0.2.148,D=ATData,t=Retest,u=root,p=likeskin --execute > ASE_Trace.log 2>&1

Revision history for this message
Dave Gregg (dgregg) wrote :

This is the Error Log for the Master/Slave Server that died when I ran the Online Schema Tool

Revision history for this message
Dave Gregg (dgregg) wrote :

You can see in the mysql-error.log - everything fails pretty quickly....the failure starts at:

2015-07-16 17:24:46 7f203c52f700 InnoDB: Page dump in ascii and hex (16384 bytes):

Revision history for this message
Dave Gregg (dgregg) wrote :

Hi Frank,

Any Status on this?? Or anything else we can try?

Thanks
Dave

Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

Hello Dave,

Thanks for sending the error log.
Just been able to take a glance at it. Looks uglier than what I expected. (e.g. not a simple timeout or similar)
Since this goes into MySQL "hardcore" debugging territory I'll ask my experienced colleagues to take a look and see if they can decipher that output.

Get back to you as soon as I know something.

Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

Ok,

Here's the feedback. It's either disk corruption or some deep MySQL bug

1. check disk corruption;
2. Turn mrr optimization OFF (optimizer_switch="mrr=off"), then run the tool again
error is in this optimization, so if this is a bug it will help

If it does not help, it's most probably disk corruption.

Revision history for this message
Dave Gregg (dgregg) wrote :

Morning Frank,

OK - we will give this a try. I will let you know later today. This is getting more and more urgent to get a solution as we have a production topology that operates 24x7 that needs to be updated soon.

The background on these servers I am testing. They are VMWare VMs. They were initially created with 50GB hard drives.
I expanded the drives to be 200GB drives for these servers in this particular topology.

I have not done any disk Drive Validation.....I can do so and see if anything shows up.

So that is the history on these particular systems. They appear OK - but who knows?

So you are thinking that this Scenario Should Work ? A combination "MASTER/Slave" Server (it is BOTH) is being
updated and it replicates to a Salve underneath or subordinate to it......and it is ALSO is a slave to another Master above it....?
Should not be an issue??

~ Dave

Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

If the modification you are trying to make is not part of what the master above 10.0.2.148 is replicating, then I see no obvious limitation on using pt-osc there.

Revision history for this message
Dave Gregg (dgregg) wrote :

Just a quick FYI - the VMs we are working with do have some File System / Disk Corruption.

I am rebuilding the VMs so they are clean. I will let you know what this looks like next week.

Thanks ~ Dave

Revision history for this message
Dave Gregg (dgregg) wrote :

Hey Frank,

I think we now have this working!! I reinstalled multiple VMs and verified we have no file system corruption. Looks good in our initial testing. Just giving you a heads up.

Thanks
Dave

Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

Great to hear that!

The old "reinstall and/or reboot" engineering solution to "everything".

 ;-)

Revision history for this message
Sveta Smirnova (svetasmirnova) wrote :

Dave, I assume you don't see the issue after you re-installed virtual machines and this was practically not pt-table-checksum bug? Please confirm.

Thanks in advance.

Changed in percona-toolkit:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Percona Toolkit because there has been no activity for 60 days.]

Changed in percona-toolkit:
status: Incomplete → Expired
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-1293

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.