Percona Server moved to https://jira.percona.com/projects/PS

Fulltext search can not find word which contains punctuation marks

Bug #1689268 reported by Sveta Smirnova on 2017-05-08

This bug affects 1 person

	Status	Importance	Assigned to	Milestone
MySQL Server	Unknown	Unknown	mysql-bugs #86164
Percona Server moved to https://jira.percona.com/projects/PS	Status tracked in 5.7
5.6	Won't Fix	Wishlist	Unassigned
5.7	Fix Released	Wishlist	Laurynas Biveinis	Percona Server moved to https://jira.percona.com/projects/PS 5.7.21-20

Bug Description

Originally reported at https://bugs.mysql.com/bug.php?id=86164

Description:
Fulltext search can not find word which contains punctuation marks like "!?,.{}/" in boolean mode BUT it does in natural language mode.

There is one case reported for "." and "," but according our tests it includes much more punctuation marks.

The reported case is https://bugs.mysql.com/bug.php?id=85876

How to repeat:
The problem can be reproduced as follows:

=================================
CREATE TABLE `test` (
`seq` int(11) NOT NULL AUTO_INCREMENT,
`cont` text COLLATE utf8mb4_bin,
PRIMARY KEY (`seq`),
FULLTEXT KEY `fx_txts` (`cont`) /*!50100 WITH PARSER `ngram` */
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;

insert into test (cont) values ('aBc!efg');
insert into test (cont) values ('aBc@efg');
insert into test (cont) values ('aBc#efg');
insert into test (cont) values ('aBc$efg');
insert into test (cont) values ('aBc%efg');
insert into test (cont) values ('abc^efg');
insert into test (cont) values ('abc&efg');
insert into test (cont) values ('abc*efg');
insert into test (cont) values ('abc(efg');
insert into test (cont) values ('abc)efg');

Then try to select whatever value using boolean mode:
mysql [localhost] {msandbox} (test) > select * from test where match(cont) against ('"c!e"' in boolean mode);
Empty set (0.00 sec)

mysql [localhost] {msandbox} (test) > select * from test where match(cont) against ('"c#e"' in boolean mode);
Empty set (0.00 sec)

mysql [localhost] {msandbox} (test) > select * from test where match(cont) against ('"c$e"' in boolean mode);
Empty set (0.00 sec)

mysql [localhost] {msandbox} (test) > select * from test where match(cont) against ('"c!e"' in boolean mode);
Empty set (0.00 sec)

If change to natural language mode it works:
mysql [localhost] {msandbox} (test) > select * from test where match(cont) against ('"c!e"' in natural language mode);
+-----+---------+
| seq | cont |
+-----+---------+
| 1 | aBc!efg |
| 11 | abc!efg |
+-----+---------+
2 rows in set (0.00 sec)

The index entries are created as can be seen in INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE:

mysql [localhost] {msandbox} (test) > SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE; +------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+------+--------------+-------------+-----------+--------+----------+
| !e | 2 | 12 | 2 | 2 | 3 |
| !e | 2 | 12 | 2 | 12 | 3 |
| #e | 4 | 4 | 1 | 4 | 3 |
| $e | 5 | 5 | 1 | 5 | 3 |
| %e | 6 | 6 | 1 | 6 | 3 |
| &e | 8 | 8 | 1 | 8 | 3 |
| (e | 10 | 10 | 1 | 10 | 3 |
| )e | 11 | 11 | 1 | 11 | 3 |
| *e | 9 | 9 | 1 | 9 | 3 |
| @e | 3 | 3 | 1 | 3 | 3 |
| ^e | 7 | 7 | 1 | 7 | 3 |
<...>
| c! | 2 | 12 | 2 | 2 | 2 |
| c! | 2 | 12 | 2 | 12 | 2 |
| c# | 4 | 4 | 1 | 4 | 2 |
| c$ | 5 | 5 | 1 | 5 | 2 |
| c% | 6 | 6 | 1 | 6 | 2 |
| c& | 8 | 8 | 1 | 8 | 2 |
| c( | 10 | 10 | 1 | 10 | 2 |
| c) | 11 | 11 | 1 | 11 | 2 |
| c* | 9 | 9 | 1 | 9 | 2 |
| c@ | 3 | 3 | 1 | 3 | 2 |
| c^ | 7 | 7 | 1 | 7 | 2 |
<...>
+------+--------------+-------------+-----------+--------+----------+
55 rows in set (0.00 sec)

Tags:

Laurynas Biveinis (laurynas-biveinis) on 2017-05-09

tags:

added: upstream

Revision history for this message

Sveta Smirnova (svetasmirnova) wrote on 2017-07-11:

If we decide to let user pass symbols they want to search via option it would be good not to limit them to ASCII, but make Uft8mb4 compatible

Sveta Smirnova (svetasmirnova) on 2017-10-02

tags:

added: sfr-80

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2017-11-29:

Sveta -

On the upstream bug you write "ngram parser is not required to repeat this bug. It is repeatable with MyISAM as well." Could you please provide an example without the ngram parser?

Revision history for this message

Sveta Smirnova (svetasmirnova) wrote on 2017-11-29:

This is exactly same case, but with skipped /*!50100 WITH PARSER `ngram` */ part:

mysql> CREATE TABLE `test` (
    -> `seq` int(11) NOT NULL AUTO_INCREMENT,
    -> `cont` text COLLATE utf8mb4_bin,
    -> PRIMARY KEY (`seq`),
    -> FULLTEXT KEY `fx_txts` (`cont`)
    -> ) ENGINE=myisam DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
Query OK, 0 rows affected (0.06 sec)

mysql> set names utf8mb4;
Query OK, 0 rows affected (0.02 sec)

mysql> insert into test (cont) values ('aBc!efg');
Query OK, 1 row affected (0.03 sec)

mysql> insert into test (cont) values ('aBc@efg');
Query OK, 1 row affected (0.00 sec)

mysql> insert into test (cont) values ('aBc#efg');
Query OK, 1 row affected (0.00 sec)

mysql> insert into test (cont) values ('aBc$efg');
Query OK, 1 row affected (0.00 sec)

mysql> insert into test (cont) values ('aBc%efg');
Query OK, 1 row affected (0.00 sec)

mysql> insert into test (cont) values ('abc^efg');
Query OK, 1 row affected (0.00 sec)

mysql> insert into test (cont) values ('abc&efg');
Query OK, 1 row affected (0.00 sec)

mysql> insert into test (cont) values ('abc*efg');
Query OK, 1 row affected (0.00 sec)

mysql> insert into test (cont) values ('abc(efg');
Query OK, 1 row affected (0.00 sec)

mysql> insert into test (cont) values ('abc)efg');
Query OK, 1 row affected (0.00 sec)

mysql> select * from test where match(cont) against ('"c!e"' in boolean mode);
Empty set (0.13 sec)

mysql> select * from test where match(cont) against ('"c#e"' in boolean mode);
Empty set (0.00 sec)

mysql> select * from test where match(cont) against ('"c$e"' in boolean mode);
Empty set (0.00 sec)

mysql> select * from test where match(cont) against ('"c!e"' in boolean mode);
Empty set (0.00 sec)

mysql> select * from test where match(cont) against ('"c!e"' in natural language mode);
Empty set (0.00 sec)

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2017-11-30:

Introducing a per-session variable for extra word characters. On setting, it will be checked not to intersect with the ft_boolean_syntax variable value.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2017-12-04:

A change that would affect tokenizing on writing the FTS is not safe enough. Since the ngram parser tokenizer already does what we want here, and the use case involves that parser, only introduce a new variable that affects tokenizing only for FTS queries.

Revision history for this message

Sveta Smirnova (svetasmirnova) wrote on 2017-12-12:

Would be great if fix will work for both ngram and mecab fulltext indexes.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2017-12-13:

Ack, I will update on the mecab status later

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2018-01-23:

https://github.com/percona/percona-server/pull/2078

Changes done for MeCab as well, but only tested not to have caused regressions in their existing tests.

Revision history for this message

Shahriyar Rzayev (rzayev-sehriyar) wrote on 2018-01-25:

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-2501

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

mysql-bugs #86164 Edit

Bug watches keep track of this bug in other bug trackers.