MAX_ALLOWED_PACKET and MySQL >= 5.1.31

Bug #622194 reported by fisharebest
Affects: webtrees · Status: Fix Released · Importance: Medium · Assigned to: fisharebest

Bug Description

MySQL cannot import gedcoms larger than the value specified in the system parameter max_allowed_packet.

MySQL 5.1.30 and earlier allow an application to change this parameter - webtrees does this and it works OK.

MySQL 5.1.31 and later ignore requests to change this parameter, and imports fail.

To fix this, we will need to split the gedcom file into "chunks", and store each chunk in a separate row in a new table.

This will cause various problems, including splitting multi-byte characters. We do not know the file's encoding until after the import, but we need to know it beforehand to avoid splitting characters.
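To illustrate the character-splitting problem, here is a minimal Python sketch (an assumption for illustration, not the webtrees code) of chunking a byte stream without cutting a UTF-8 character in half. A UTF-8 continuation byte always has the bit pattern 10xxxxxx, so a chunk boundary can be moved back until it lands between characters:

```python
def split_utf8_safe(data: bytes, chunk_size: int):
    """Yield chunks of at most chunk_size bytes without splitting a character.

    chunk_size must be at least 4 (the longest UTF-8 sequence).
    """
    pos = 0
    while pos < len(data):
        end = min(pos + chunk_size, len(data))
        # 0b10xxxxxx marks a UTF-8 continuation byte; back up to a boundary.
        while end > pos and end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        yield data[pos:end]
        pos = end

text = "naïve café".encode("utf-8")
chunks = list(split_utf8_safe(text, 5))
assert b"".join(chunks) == text          # nothing lost
for c in chunks:
    c.decode("utf-8")                    # every chunk decodes cleanly
```

Note that this trick only works once the encoding is known to be UTF-8; for ANSEL (the other common GEDCOM encoding mentioned in the report) the boundary test would be different, which is exactly why not knowing the encoding up front makes chunking awkward.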

Changed in webtrees:
status: New → Confirmed
Revision history for this message
kiwi (kiwi3685-deactivatedaccount) wrote :

In the short term (before the 1.0.0 release), is it worth adding a warning where the MySQL version is identified as later than 5.1.30 and the file size is greater than the max_allowed_packet setting, with a link to our wiki section on changing the setting (assuming such a check can be done)?

Revision history for this message
ToyGuy (toyguy) wrote :

Remember, the import does NOT fail. Rather, if the file is very large, it simply takes a VERY LONG TIME to begin showing the import bar, and still longer to complete the import. If the setting cannot be changed, a warning to be extremely patient is appropriate. ...Stephen

Revision history for this message
Mark Hattam (mark-dxradio) wrote :

Given that the only change in a "new" database is the addition of about 768 KiB to the wt_gedcom table ... and nothing else happens ... the import does fail.
Changing max_allowed_packet to 100M makes it work.

Changing it back to 1M, and re-importing ... the wt_gedcom table goes from 10.2 MiB to 9.0 KiB ... the other tables stay as they were.
And nothing else happens ... 20 minutes ... nothing ... there is no mysqld activity.

Changing max_allowed_packet to 2M ... wt_gedcom table goes to 1.2 MiB ... other tables are emptied ready for the import, and I get a "Loading data from GEDCOM: 0.0%" progress ... but again nothing then happens.

Changing back to 100M ... and it works again.

If max_allowed_packet is smaller than the GEDCOM file size, the import fails; MySQL doesn't attempt to chunk the GEDCOM file to get it into the table in packets that are "allowed".

MySQL 5.1.49 is running on my local machine (hence I can change settings and stop/start mysqld at will). But people using a shared hosting service are not going to be able to change a setting like this themselves.
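For reference, a sketch of how the setting is inspected and raised (the 100M value is the one used in this thread, not a recommendation; on the affected versions the variable is read-only at runtime, so it can only be changed in the server configuration):

```shell
# Check the current value (works on any MySQL version):
mysql -e "SHOW VARIABLES LIKE 'max_allowed_packet'"

# Raising it requires editing the server configuration and restarting
# mysqld, which shared-hosting users usually cannot do:
#   [mysqld]
#   max_allowed_packet = 100M
```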

Mark

Revision history for this message
ToyGuy (toyguy) wrote :

Mark
Mine took almost 1 hour to show the import bar and another 1.5 hours to complete the import, but it works at the default of 1 MB, albeit very slowly, and you have to be very patient; hence my caution. The tables did not appear to update frequently, and using phpMyAdmin to examine them during the import eventually corrupted it; when I left it alone, the import was perfect.

Revision history for this message
fisharebest (fisharebest) wrote : Re: [Bug 622194] Re: MAX_ALLOWED_PACKET and MySQL >= 5.1.31

On 23 August 2010 01:15, Mark Hattam <email address hidden> wrote:
> If the max_allowed_packet is smaller than the gedcom filesize it fails,
> MySQL doesn't attempt to chunk the gedcom file to get it into the table
> in packets that are "allowed".

"Chunking" is the way to go. The original version of the importer broke the
file into 64 KB chunks.

The problem was that if multi-byte characters get split, all sorts of nasty
things happen, and we don't know whether the file is UTF-8, ANSEL or something
else until afterwards. This can be dealt with, but it slowed the import
and caused lots of timeout problems.

The current "load it all" technique avoids timeout problems because all
the work is handled in a single SQL statement, and (on UNIX systems)
time spent in SQL does not count towards the PHP execution time limit.

Greg
--
Greg Roach, Stony Stratford, Bucks, England, <email address hidden>
"Don't eat animals - it's not good for them and they don't like it"

Revision history for this message
fisharebest (fisharebest) wrote :

SVN9289: The installation script checks the version and issues an information note, and instructions on how to proceed.

Changed in webtrees:
status: Confirmed → In Progress
assignee: nobody → fisharebest (fisharebest)
Revision history for this message
fisharebest (fisharebest) wrote :

FYI, MySQL 5.1.31 was released in January 2009

I don't know what versions of MySQL are used by common web hosts.

Mine is still on 5.0 ;-)

Revision history for this message
Mark Hattam (mark-dxradio) wrote :

Pair Networks seems to be using : MySQL 5.0.90

Mark

Revision history for this message
Mark Hattam (mark-dxradio) wrote :

I've successfully imported 5 of my 6 GEDCOMs using MySQL 5.0.90 ... but the large one failed.

It took 30 minutes after clicking "Import" then "Save" to get to drawing the Progress Bar. But at 0% it gave up saying "no Head record found".

The BLOB in wt_gedcom is only 1.2 MiB for that GEDCOM (it's a 10.7 MB file in reality) ... and does have
0 HEAD
on the first line, like my other gedcoms, and it imports OK on my local machine.

Fortunately it hasn't (yet) blown away the data imported by my pgv_to_wt, so the website is still available. It obviously failed before reaching the DELETE FROM commands.

Going back to the editgedcoms page, it looks like it might still be trying to do something, as the page is still "loading", and there are no "Configuration", "Export", "Import" etc. links in that row. I have little option but to leave that page open, though I'm not confident that it's going to work. More relevant questions are "Am I constantly using (exhausting) resources on the web host, which might generate a help-desk email?" and "How do I abort the import?"

Mark

Revision history for this message
Mark Hattam (mark-dxradio) wrote :

The actual error text says

Invalid GEDCOM file - no header record found.

But quite clearly the GEDCOM has
0 HEAD
and is perfectly good.

Mark

Revision history for this message
ToyGuy (toyguy) wrote :

Mark
On my home machine, with a small max_allowed_packet (1 MB), my 52 MB file took 45 minutes to start the import bar and another 1.5 hours to import - but it did not fail.
Sure that there is no BOM on that GEDCOM?
-Stephen

Revision history for this message
Mark Hattam (mark-dxradio) wrote :

If there were a BOM on it, it would fail too on PGV on both machines, and it would fail on my home machine's webtrees.

When I less or vi it on the Internet server, it's just the same as my other gedcoms ... in structure at least.

According to phpMyAdmin's showing of "variables", the max_allowed_packet is 2,097,152. MySQL is 5.0.90.

I do actually get a progress bar ... after about 25 mins, but it gets to 0% and then errors with
Invalid GEDCOM file - no header record found.
And doesn't empty the database of items pertaining to that gedcom. My smaller gedcoms import apparently without a problem on the 5.0.90 database (they're all well under 1 MB)

phpMyAdmin shows the BLOB only gets to 1.2 MiB for the large gedcom, but I'd expect and hope it would get to about 10.7 MiB.

Maybe it doesn't concatenate it properly, leaving just the last 1.2 MiB in the BLOB, which of course won't have a
0 HEAD
at the top of it.

But I can't "see" the contents of the BLOB to find out what's there or not.

Mark

Revision history for this message
Mark Hattam (mark-dxradio) wrote :

Are you using InnoDB or MyISAM on your home machine for webtrees?

Revision history for this message
fisharebest (fisharebest) wrote :

Mark - it seems strange that you get only 1.2MB of a 10.7MB file. The documented (and observed by me!) behaviour of exceeding the max_allowed_packet size is that the BLOB is set to NULL. I've never seen any truncation.

<<But I can't "see" the contents of the BLOB to find out what's there or not.>>

How are you trying to look? In phpMyAdmin, there is a check-box for "show blob contents". Alternatively, try

SELECT LEFT(CONVERT(import_gedcom USING utf8), 1000) FROM wt_gedcom;

Revision history for this message
fisharebest (fisharebest) wrote :

PS - the delay before the progress bar hits 0% is when we are deleting the existing gedcom data.

PPS - BOMs are not an issue. They are removed automatically. However, a blank line or other whitespace before the "0 HEAD" will cause problems.
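The header check described here can be sketched as follows (an illustrative Python version, not the actual webtrees code): strip a UTF-8 byte-order mark if present, then require that the very next bytes are "0 HEAD", so any leading blank line or whitespace fails:

```python
UTF8_BOM = b"\xef\xbb\xbf"

def has_valid_header(data: bytes) -> bool:
    """True if the GEDCOM starts with '0 HEAD' (after an optional BOM)."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]        # BOMs are removed automatically
    return data.startswith(b"0 HEAD")      # leading whitespace fails here

assert has_valid_header(b"\xef\xbb\xbf0 HEAD\n0 TRLR\n")
assert not has_valid_header(b"\n0 HEAD\n")  # blank line before the header
```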

Revision history for this message
Mark Hattam (mark-dxradio) wrote :

Yes, it seems to have the last 1.2 MB of the file only.

Browsing (in phpMyAdmin, having turned on the rather well-hidden option to show BLOB contents), that GEDCOM appears to start partway through a census transcription. Copying/pasting to a new document shows it to be 1.2 MB, and it ends with the TRLR line.

vi-ing it on the server, shows nothing strange at that point in the gedcom.

The only curious thing is that all the other GEDCOMs displayed in phpMyAdmin have \n\n between lines, and this one has \r\n ... but that could be because your importing routine was only partially done. I'm uploading them with Windows line endings ... vi shows them with ^M at the end of every line ... less shows nothing at the end of the line ... and thus far PGV and webtrees on the home iMac are perfectly happy with the files.

There is no white space or blank lines above the 0 HEAD.

Mark

Revision history for this message
ToyGuy (toyguy) wrote :

Mark
Since hooking up with Greg, I have consistently used InnoDB as the only storage engine, both on my big server and on the laptop and iMac at home. When the topic of max_allowed_packet first arose, I reset mine on the two home machines to 1 MB and, while I must be extremely patient, the GEDCOM imports (in about 2.5-3 hours). With the proper setting, it imports in about 28 minutes. On the octo-core, with proper settings, it imports in about 12 minutes. ...Stephen

Revision history for this message
Mark Hattam (mark-dxradio) wrote :

What I'm seeing here, though, is a problem where the BLOB only has the last section of the GEDCOM, and hence it no longer has the
0 HEAD
when it gets to that stage of the import; it therefore fails, correctly claiming that it is (now) an invalid GEDCOM.

Being extraordinarily patient isn't a factor in this case - it is failing with an error.

MySQL 5.0.90
php 5.2.13

Revision history for this message
fisharebest (fisharebest) wrote :

Mark - the double \n\n is normal.

We need to
a) convert MSDOS/MAC line endings to UNIX
b) ignore blank lines

So, I just do a replace of \r with \n as the first import step.

I tried replacing \r\n with \n, but when the search/replace strings are different lengths, this can be quite slow. Ignoring blank lines later on is quite quick, so I just replace \r characters.
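The normalization described above can be sketched like this (an illustrative Python version, not the PHP code in editgedcoms.php): replace every \r with \n, which is a same-length substitution and hence fast, then skip the resulting blank lines when iterating, so both MS-DOS (\r\n) and old Mac (\r) line endings are handled:

```python
def gedcom_lines(raw: bytes):
    """Yield non-blank GEDCOM lines regardless of original line endings."""
    text = raw.replace(b"\r", b"\n")       # \r\n becomes \n\n (same length)
    for line in text.split(b"\n"):
        if line.strip():                   # ignore the blanks created above
            yield line

raw = b"0 HEAD\r\n1 CHAR UTF-8\r\n0 TRLR\r\n"
assert list(gedcom_lines(raw)) == [b"0 HEAD", b"1 CHAR UTF-8", b"0 TRLR"]
```

This also explains the \n\n that Mark saw between lines in phpMyAdmin: it is the normal by-product of the same-length replacement.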

I've had another look at the code. (editgedcoms.php, import_gedcom_file(), line ~75 if you want to take a look).

In MySQL, max_allowed_packet is (unfortunately) used for many different things. It is documented as being the size of the largest query that can be received over the network (to prevent DoS attacks). Depending on the version of MySQL, it also seems to be used as the largest size of a buffer to use when manipulating strings.

So, the current logic tries to load the file in chunks of 75% of max_allowed_packet.

On my MySQL 5.0 system, this truncates the entire result when it gets too large. On yours, it seems to keep just part of the expression.

Stephen - this issue is very dependent on the exact version of MySQL. While it may work for you, it isn't working for us. Mark and I both get the error, although with different symptoms.

The solution, as stated elsewhere, is to break the file into chunks in PHP, and store this in a "gedcom_chunk" table.

I resisted that solution initially, because the current system is very fast (when it works!), and means we don't need to handle timeout issues.

Revision history for this message
fisharebest (fisharebest) wrote :

It appears that the change in 5.1.31 (to make this setting read only) was backported to 5.0, and affects 5.0.84 onwards.

SVN has been updated, so that MySQL 5.0 users will now get a suitable error message, instead of a silent failure.

Revision history for this message
fisharebest (fisharebest) wrote :

Fixed in svn:9646

We now load the data in 64 KB blocks into a holding table, then import these blocks one at a time.
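The shape of the fix can be sketched as follows, using Python and sqlite3 purely for a self-contained demonstration (webtrees itself is PHP on MySQL, and the report earlier calls the holding table "gedcom_chunk"): each INSERT carries at most 64 KB, so no single statement approaches max_allowed_packet, and the chunks are read back one row at a time in order:

```python
import sqlite3

CHUNK = 64 * 1024  # 64 KB per row, well under any sane max_allowed_packet

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE gedcom_chunk (
                    chunk_id   INTEGER PRIMARY KEY,
                    chunk_data BLOB NOT NULL)""")

# A stand-in for an uploaded GEDCOM file, larger than one chunk.
data = b"0 HEAD\n" + b"1 NOTE x\n" * 20000 + b"0 TRLR\n"

# Load phase: many small statements instead of one huge one.
for i in range(0, len(data), CHUNK):
    conn.execute("INSERT INTO gedcom_chunk (chunk_data) VALUES (?)",
                 (data[i:i + CHUNK],))

# Import phase: process chunks one at a time, in insertion order.
rows = conn.execute(
    "SELECT chunk_data FROM gedcom_chunk ORDER BY chunk_id").fetchall()
assert b"".join(r[0] for r in rows) == data
```

This naive byte split can still cut a multi-byte character at a chunk boundary; a real importer would adjust the boundaries to character edges, as discussed earlier in the thread.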

Changed in webtrees:
importance: Undecided → Medium
status: In Progress → Fix Committed
Revision history for this message
fisharebest (fisharebest) wrote :

Fix released in webtrees-1.0.5

Changed in webtrees:
status: Fix Committed → Fix Released