pt-query-digest fails to parse non-SQL errors

Reported by Joe on 2013-01-22
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona Toolkit
High
Daniel Nichter

Bug Description

When using the tcpdump mode with pt-query-digest to troubleshoot failed connections to the database, I noticed that there are certain situations where it would not show the connections that were failing. If you try to connect with bad password it would show the failed connection in the report, but if you do a `telnet dbserver 3306`, it does not print the connect error in the report.

How to recreate:

Start tcpdump.
sudo /usr/sbin/tcpdump -i any port 3306 -s 65535 -x -nn -q -tttt > server.tcpdump

"Connect to mysql"
telnet dbserver 3306

Disconnect from mysql

Stop tcpdump

Run report
cat server.tcpdump | pt-query-digest --type tcpdump --show-all host --limit 100% --inherit-attributes ts --no-for-explain

pt-query-digest 2.1.7

Joe (joegrasse) on 2013-01-22
summary: - pt-query-digest does not report all failed connect to the database
+ pt-query-digest does not report all failed connections to the database
tags: added: tcpdump
Changed in percona-toolkit:
milestone: none → 2.1.9
importance: Undecided → High
status: New → Triaged
Changed in percona-toolkit:
status: Triaged → In Progress
assignee: nobody → Daniel Nichter (daniel-nichter)
Daniel Nichter (daniel-nichter) wrote :

There is a bug here: MySQLProtocolParser::parser_error_packet() doesn't understand non-SQL errors, i.e. errors that do not have this format:

# Error packet structure:
# Offset Bytes Field
# ====== ================= ====================================
# 00 00 00 01 MySQL proto header (already removed)
# ff Error (already removed)
# 0 00 00 Error number
# 4 23 SQL state marker, always '#'
# 6 00 00 00 00 00 SQL state
# 16 00 ... Error message

Connection errors, for example, don't have a SQL state, they have only FF, error number, then error message.

summary: - pt-query-digest does not report all failed connections to the database
+ pt-query-digest fails to parse non-SQL errors
Daniel Nichter (daniel-nichter) wrote :

Joe, will you try the code in the linked branch? It should report all errors now.

Joe (joegrasse) wrote :

I have tried the linked branch and it still doesn't seem to report the telnet scenario.

Also, while you are in there, it would be nice to fix this bug as well.
https://bugs.launchpad.net/percona-toolkit/+bug/823411

Daniel Nichter (daniel-nichter) wrote :

Joe, can you send me your tcpdump for the telnet scenario that doesn't work? If you "telnet mysql 3306" and do nothing, MySQL will timeout and close the connection. In this case, there's no error so nothing to report. But if you telnet and do something to generate packets from the client, those packets will basically crash the auth/handshake and that error is now captured.

Joe (joegrasse) wrote :

Here is the tcpdump file you requested. IP addresses have been changed.

Steps to recreate.

1. On 100.0.0.1, Start tcpdump
2. On 100.0.0.2, telnet 100.0.0.1 3306
3. type ctl-]
3. type quit
4. On 100.0.0.1, Aborted_connects will increment by 1, which means mysql is reporting it as a failed connection attempt.

Daniel Nichter (daniel-nichter) wrote :

One the one hand, there's no error in the protocol data:

# MySQLProtocolParser:473 2953 Packet is from server; client state: undef
# MySQLProtocolParser:1161 2953 Server handshake data: ...
# MySQLProtocolParser:451 2953 Done parsing packet; client state: server_handshake

# MySQLProtocolParser:473 2953 Packet is from server; client state: server_handshake
# MySQLProtocolParser:477 2953 TCP retransmission
# MySQLProtocolParser:451 2953 Done parsing packet; client state: server_handshake

(from client)
# MySQLProtocolParser:282 2953 Client 100.0.0.2:44432
# MySQLProtocolParser:321 2953 Client state: server_handshake
# MySQLProtocolParser:342 2953 TCP control: ACK FIN

Since the client closes the connection, that's not an error. But since it closes it half-way through connecting, MySQL counts that internally towards Aborted_connects.

I have managed to capture this as "administrator command: Connect aborted\G", aka "CONNECT ABORTED" in the profile. This is caught when the client sends a FIN packet while it's in a handshake state.

Joe, please try the latest from the linked branch. It should work for you now too.

Joe (joegrasse) wrote :
Download full text (3.6 KiB)

Yes, this new version is now showing this aborted connects as well.

Just one question though. A connection with a bad password and the telnet connection like above will both be counted as an Aborted_connects by MySQL. However, the report generated by pt-query-digest now has two sections for these types of aborted connects. My question is, should both of these appear under "administrator command: Connect" just with different error messages.

So instead of...
# Query 1: 0 QPS, 0x concurrency, ID 0x1AE4CFF32EE9545A at byte 0 ________
# This item is included in the report because it matches --limit.
# Scores: Apdex = 0.50 [1.0]*, V/M = 0.00
# Query_time sparkline: | ^ |
# Time range: all events occurred at 2013-01-29 10:14:46.681057
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count 50 1
# Exec time 99 4s 4s 4s 4s 4s 0 4s
# Rows affecte 0 0 0 0 0 0 0 0
# Query size 55 38 38 38 38 38 0 38
# Warning coun 0 0 0 0 0 0 0 0
# String:
# Errors none
# Hosts 100.0.0.2
# Query_time distribution
# 1us
# 10us
# 100us
# 1ms
# 10ms
# 100ms
# 1s ################################################################
# 10s+
administrator command: Connect aborted\G

# Query 2: 0 QPS, 0x concurrency, ID 0x5D51E5F01B88B79E at byte 5969 _____
# This item is included in the report because it matches --limit.
# Scores: Apdex = 1.00 [1.0]*, V/M = 0.00
# Query_time sparkline: | ^ |
# Time range: all events occurred at 2013-01-29 10:15:08.324292
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count 50 1
# Exec time 0 2ms 2ms 2ms 2ms 2ms 0 2ms
# Rows affecte 0 0 0 0 0 0 0 0
# Query size 44 30 30 30 30 30 0 30
# Warning coun 0 0 0 0 0 0 0 0
# String:
# Databases
# Error msg Access denied for user 'root'@'mysqlb1.ts1.p...
# Errors 1045
# Hosts 100.0.0.2
# Users root
# Query_time distribution
# 1us
# 10us
# 100us
# 1ms ################################################################
# 10ms
# 100ms
# 1s
# 10s+
administrator command: Connect\G

Something like this...

# Query 1: 0 QPS, 0x concurrency, ID 0x5D51E5F01B88B79E at byte 5969 _____
# This item is included in the report because it matches --limit.
# Scores: Apdex = 1.00 [1.0]*, V/M = 0.00
# Query_time sparkline: | ^ |
# Time range: 2013-01-29 10:14:46.681057 to 2013-01-29 10:15:08.324292
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count 100 2
# Exec time 0 2ms 2ms 2ms 2ms 2ms 0 2ms
# Rows affecte 0 0 0 0 0 0 ...

Read more...

Daniel Nichter (daniel-nichter) wrote :

There's a gray area here. The MySQL docs say that aborted connects happen for these reasons:

* A client does not have privileges to connect to a database.
* A client uses an incorrect password.
* A connection packet does not contain the right information.
* It takes more than connect_timeout seconds to get a connect packet.

MySQL will send an error packet for the first 3, but it won't for the last one or the telnet case when the connection is closed with CTRL-] because there's no error, the client just disappeared.

So "admin command: Connect aborted" in the current output is for that last case and the CTRL-] case. In other words, it means "client disappeared while connecting", which is why the event has no error number or error string.

I did that because I think that type of event is different from those first 3 reasons for aborted connects because those reasons result in MySQL sending an error packet with and error number and string.

They're all related to "Connect" events, but some have MySQL errors and others don't. I'm rather indifferent about whether we lump them altogether as "admin: Connect" or if we have "admin: Connect aborted" separate. What do you think Joe?

Joe (joegrasse) wrote :

Either way it is kind of a hack for #4 on your list since MySQL doesn't send an error packet, but I think I prefer combining them into one admin command since they are all connects. That way you can see stats for total connects vs failed all in one.

Also, I want an Error msg (and/or Error #) so I can do something like --filter '($event->{Error_msg})' to find all errors.

Ok, I'll put everything in admin: Connect. And I'm fixing bug 823411 (the "none" issue).

Filtering for Error_no OR Error_msg will work now since Error_no|msg is only present if there's an error number|message. Since case #4 above, the best method is probably filtering for Error_msg because if there's an Error_no there should always be an Error_msg, but the opposite is not true for case #4: there's an Error_msg which is now "Client closed connection during handshake", but not Error_no.

Changed in percona-toolkit:
status: In Progress → Fix Committed
Joe (joegrasse) wrote :

Looks good thanks.

Changed in percona-toolkit:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Bug attachments