Problems with mixed cp1251/cp866 content in qdiff (Failed to decode using charmap, falling back to latin1)

Bug #814117 reported by Vyacheslav Garashchenko
This bug affects 1 person
Affects: QBzr
Status: Fix Released
Importance: Medium
Assigned to: Alexander Belchenko
Milestone: 0.22b1

Bug Description

When I use qdiff under Windows 7 to view changes that contain Cyrillic letters in the cp1251 code page, the internal diff shows those places with the wrong encoding. The problem occurs in both 2.4b5 and 2.3.4. When I use an external diff utility (csdiff, for example), everything is fine.
If I change the encoding to cp866, koi8r, or mac-cyrillic, I can see that the encoding really changes, and changes correctly, but when I try cp1251 I see incorrectly encoded text again.

Here the log file for 2.4b5:
Thu 2011-07-21 17:06:04 +0300
.128 bazaar version: 2.4b5
.129 bzr arguments: [u'qdiff']
.232 looking for plugins in C:/Users/Slava/AppData/Roaming/bazaar/2.0/plugins
.251 looking for plugins in C:/Program Files/Bazaar/plugins
.255 Plugin name xmloutput already loaded
.417 encoding stdout as osutils.get_user_encoding() 'cp1251'
.053 opening working tree 'C:/Users/Slava/Documents/CATI/MyTests/10067_test'
[ 6832] 2011-07-21 17:06:05.546 INFO: Failed to decode using charmap, falling back to latin1
[ 6832] 2011-07-21 17:06:05.549 INFO: Failed to decode using charmap, falling back to latin1
[ 6832] 2011-07-21 17:06:21.829 INFO: Failed to decode using charmap, falling back to latin1

Here the log file for 2.3.4:
Thu 2011-07-21 15:56:07 +0300
0.187 bazaar version: 2.3.4
0.187 bzr arguments: [u'qversion']
0.265 looking for plugins in C:/Users/Slava/AppData/Roaming/bazaar/2.0/plugins
0.281 looking for plugins in C:/Program Files/Bazaar/plugins
0.281 Plugin name xmloutput already loaded
0.515 encoding stdout as osutils.get_user_encoding() 'cp1251'
11.996 return code 0
[ 2924] 2011-07-21 15:57:45.753 INFO: Failed to decode using utf8, falling back to latin1
[ 2924] 2011-07-21 15:57:45.769 INFO: Failed to decode using utf8, falling back to latin1
[ 2924] 2011-07-21 15:57:56.019 INFO: Failed to decode using utf8, falling back to latin1

Attached are two screenshots, one for 2.3.4 and one for 2.4b5, where you can see the incorrect encoding on the left (where the cp1251 code page is set), partly incorrect encoding on the right (where mac-cyrillic is set), and the correct encoding in the external diff utility.

Tags: qdiff unicode


Vyacheslav Garashchenko (slavagt) wrote :

Vyacheslav Garashchenko (slavagt) wrote :

Sorry, I attached the wrong file (part of the log file) in the previous message. Here is the screenshot for 2.4b5.

Vyacheslav Garashchenko (slavagt) wrote :

And here is the screenshot for 2.3.4. You can see that both the stable 2.3.4 and the new beta 2.4b5 have the same bug with the cp1251 encoding.

Vyacheslav Garashchenko (slavagt) wrote :

PS to my comment #3: The CP1251 text on the screen looks the same whether I set UTF8 or CP1251. It looks as if selecting "cp1251" changes nothing and the default UTF8 is still used, which cannot handle the single-byte CP1251 encoding.

Alexander Belchenko (bialix) wrote :

Vyacheslav, the second picture clearly shows that utf-8 is used on the left and mac-cyrillic on the right. Something is wrong here; I don't see cp1251.

affects: bzr → qbzr
Changed in qbzr:
status: New → Incomplete

Alexander Belchenko (bialix) wrote :

Hint: set the encoding for your branch in .bzr/branch/branch.conf file as

encoding = cp1251

and invoke qdiff again. Will it show you the text in the right encoding?

You can set this option globally in your bazaar.conf (in DEFAULT section).

Alexander Belchenko (bialix) wrote :

Stab in the dark: perhaps qdiff refuses to represent your text as cp1251 because there are some characters that cannot be decoded to unicode using cp1251 codec, because they're not in cp1251 character set.
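
The kind of failure suspected here can be reproduced in isolation. Python's cp1251 codec leaves byte 0x98 undefined, whereas cp866, mac_cyrillic, and latin1 map every byte to some character, and in cp866 that same byte 0x98 is the capital letter Ш. The sketch below is Python 3 (bzr of this period ran on Python 2), and `can_decode` is a made-up helper name, not part of QBzr. The codec name reported by the exception is "charmap", which presumably is where the "Failed to decode using charmap" log message gets its wording.

```python
# Byte 0x98 is undefined in Python's cp1251 codec.
data = b"\x98"


def can_decode(raw, encoding):
    """Return True if raw decodes strictly (no error handler) with the codec."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


results = {
    enc: can_decode(data, enc)
    for enc in ("cp1251", "cp866", "mac_cyrillic", "latin1")
}
print(results)

# The exception names the generic charmap machinery, not "cp1251".
try:
    data.decode("cp1251")
    codec_name = None
except UnicodeDecodeError as exc:
    codec_name = exc.encoding
print(codec_name)
```

Only cp1251 rejects the byte, which would match the observation that switching the panel to cp866 or mac-cyrillic works while cp1251 does not.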

Vyacheslav Garashchenko (slavagt) wrote :

Thank you for the quick answer. I will try your solution and add a new screenshot for version 2.3.4 within the next 30-40 minutes. Thank you.

Vyacheslav Garashchenko (slavagt) wrote :

Here is a screenshot for bzr 2.4b5 where you can see the incorrect encoding on the left, and the menu showing that cp1251 is selected.

Vyacheslav Garashchenko (slavagt) wrote :

Here is a screenshot for bzr 2.4b5 where you can see the incorrect encoding on the right, and the menu showing that cp1251 is selected.

I also checked .bzr/branch/branch.conf for encoding = cp1251, and it was there.

Now I have to uninstall 2.4b5 to make a screenshot for 2.3.4.

Vyacheslav Garashchenko (slavagt) wrote :

Here is the new screenshot for version 2.3.4. cp1251 is set on both sides.

Vyacheslav Garashchenko (slavagt) wrote :

"perhaps qdiff refuses to represent your text as cp1251 because there are some characters that cannot be decoded to unicode using cp1251 codec"

Please look at the screenshots in messages 3 and 11. You can see the output of the external diff utility on the same screen. For the external diff utility this is normal CP1251 text that can be decoded into unicode without any problems, but for the internal diff utility "there are some characters that cannot be decoded to unicode using cp1251 codec"?

Also, in my messages 2 and 3 you can see the same text with mac-cyrillic selected: there the internal diff finds no "characters that cannot be decoded", yet with cp1251 the same text supposedly has "some characters that cannot be decoded"?

Vyacheslav Garashchenko (slavagt) wrote :

PS to comment #12. Below is a fragment of the log file (version 2.4b5) with the error that appears when the internal diff utility opens the text with CP1251 on the left side and mac-cyrillic on the right, followed by the log fragment from opening the external diff, csdiff. After that come the hex codes of the characters in the program fragment with Russian text where the internal diff showed the incorrect encoding, and the original text of the program where the cp1251 problem occurred. The screenshot is attached. You can see that mac-cyrillic has no errors with this text, but cp1251 cannot decode it and falls back to latin1. You can also see again that the external diff utility works well for the same text.
==================================================
Fri 2011-07-22 10:27:39 +0300
0.172 bazaar version: 2.4b5
0.172 bzr arguments: [u'qdiff']
0.219 looking for plugins in C:/Users/Slava/AppData/Roaming/bazaar/2.0/plugins
0.234 looking for plugins in C:/Program Files/Bazaar/plugins
0.343 encoding stdout as osutils.get_user_encoding() 'cp1251'
0.867 opening working tree 'C:/Users/Slava/Documents/CATI/MyTests/10067_test'
[20940] 2011-07-22 10:27:40.213 INFO: Failed to decode using charmap, falling back to latin1
[20940] 2011-07-22 10:27:40.213 INFO: Failed to decode using charmap, falling back to latin1
[20940] 2011-07-22 10:28:11.740 INFO: Failed to decode using charmap, falling back to latin1

Fri 2011-07-22 10:30:07 +0300
0.093 bazaar version: 2.4b5
0.093 bzr arguments: [u'qsubprocess', u'--bencode', u'l4:diff7:--using52:C:\\Program Files\\ComponentSoftware\\CSDiff\\CSDiff.exee']
0.124 looking for plugins in C:/Users/Slava/AppData/Roaming/bazaar/2.0/plugins
0.140 looking for plugins in C:/Program Files/Bazaar/plugins
0.265 encoding stdout as osutils.get_user_encoding() 'cp1251'
0.421 bazaar version: 2.4b5
0.421 bzr arguments: [u'diff', u'--using', u'C:\\Program Files\\ComponentSoftware\\CSDiff\\CSDiff.exe']
0.421 encoding stdout as osutils.get_user_encoding() 'cp1251'
0.483 opening working tree 'C:/Users/Slava/Documents/CATI/MyTests/10067_test'

===========Here are the hex values of the fragment where the differences are=====================
0000000000: 4C 61 6E 67 75 61 67 65 20 71 73 6C 3B 09 09 0D
0000000010: 0A 64 61 74 61 6E 61 6D 65 20 4D 6F 62 69 6C 65
0000000020: 3B 09 09 09 3C 20 20 31 35 2E 30 38 2E 32 30 30
0000000030: 38 20 20 3E 0D 0A 0D 0A 3C 20 CF F0 EE E2 E5 F0
0000000040: EA E0 20 E5 F9 E5 20 F0 E0 E7 20 2D 20 F7 F2 EE
0000000050: E1 FB 20 EF EE F1 EC EE F2 F0 E5 F2 FC 20 EA E0
0000000060: EA 20 EF EE EA E0 E7 FB E2 E0 E5 F2 20 EE F2 EB
0000000070: E8 F7 E8 FF 20 EF F0 E8 20 EF EE EB F3 F7 E5 ED
0000000080: E8 E8 20 E8 E7 EC E5 ED E5 ED E8 E9 20 F1 20 EE
0000000090: F7 ED EE E2 ED EE E9 20 E2 E5 F2 EA E8 20 3E 0D
00000000A0: 0A 3C 20 C0 20 F2 E5 EF E5 F0 FC 20 E2 ED EE F1
00000000B0: E8 EC 20 E8 E7 EC E5 ED E5 ED E8 FF 20 E2 20 E4
00000000C0: EE EF EE EB ED E8 F2 E5 EB FC ED EE E9 20 E8 20
00000000D0: F1 EC EE F2 F0 E8 EC 20 EA E0 EA 20 EF E5 F0 E5
00000000E0: E4 E0 FE F2 F1 FF 20 E2 20 EE F1 ED EE E2 ED F3
00000000F0: FE 20 3E 0D 0A 0D 0A 3C 20 2D 2D 2D 2D 2D 2D 2D
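
As a quick check of the dump, the first Cyrillic run (offsets 0x3A to 0x49) can be decoded by hand. The sketch below is Python 3 and only looks at those 16 bytes, so it says nothing about the rest of the file; it shows that this particular run is valid cp1251, while the same bytes read as cp866 also decode without error but give different characters.

```python
# First Cyrillic run from the hex dump (offsets 0x3A-0x49).
fragment = bytes.fromhex("CF F0 EE E2 E5 F0 EA E0 20 E5 F9 E5 20 F0 E0 E7")

# Strict decoding: raises UnicodeDecodeError if any byte is undefined.
as_cp1251 = fragment.decode("cp1251")

# cp866 defines all 256 bytes, so this never raises, but the text is wrong.
as_cp866 = fragment.decode("cp866")

print(as_cp1251 == "Проверка еще раз")  # True
print(as_cp1251 == as_cp866)            # False
```

A strict decode that succeeds therefore proves nothing about whether the chosen codec is the right one; it only proves that no byte in the input is undefined in that codec.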

==========Here the txt fragment where differe...


Vyacheslav Garashchenko (slavagt) wrote :

PS: After setting cp1251 globally (C:\Users\Slava\AppData\Roaming\bazaar\2.0\bazaar.conf) in the DEFAULT section, the problem is the same.

Vyacheslav Garashchenko (slavagt) wrote :

Here is an additional screenshot of the internal diff with two files in different Cyrillic encodings. The first file is CP866 (DOS), the second is CP1251. The left panel is set to the CP866 encoding, the right panel to CP1251, and the external diff utility is set to CP1251 for comparison.
You can see that CP866 works well; earlier in this report there are screenshots showing that mac-cyrillic also works well. The problem is with CP1251.
In this screenshot the encoding is correct on the left (CP866 source, left panel set to CP866) and wrong on the right for the second file. On the right, the first file is CP866, the second is CP1251, and the panel is set to CP1251. So the first file (CP866) is displayed incorrectly in the right panel (set to CP1251), as expected, but the second file is CP1251 and it is also displayed incorrectly.
For comparison, the same screenshot shows the external diff utility with the second (CP1251) file open, and there the text has no errors.

Alexander Belchenko (bialix) wrote :

Slava, to fix this bug I'd like to see your file. If you can, please send me the problematic file in private mail. Thank you.

Changed in qbzr:
importance: Undecided → High
tags: added: qdiff unicode

Vyacheslav Garashchenko (slavagt) wrote :

While preparing the files to send to you, I found out when this problem appears. It appears ONLY when a file has some text in the CP1251 encoding and some text in CP866. It happened to me because I took some files from a working project to test whether we could use bazaar in our work, and forgot that the files had been converted for upload to the server (SCO). In the real world this problem will probably appear very rarely, for example when a program contains text in several encodings for several console encodings, which is "wrong style of programming"... So this is a bug, but it is likely to affect very few people.
This is my fault: I should have noticed this at the very beginning :(
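
A file like the one described can be checked for such lines before it is sent around. The following is a minimal Python 3 sketch, not anything shipped with bzr or QBzr; `find_undecodable_lines` is a hypothetical helper and the sample bytes are invented for the example. It reports only bytes that are undefined in the chosen codec, so most cp866 text will still pass a cp1251 check and simply display as wrong characters.

```python
def find_undecodable_lines(raw, encoding):
    """Return (line_number, byte_offset_in_line) for every line of raw
    that fails strict decoding with the given codec."""
    problems = []
    for number, line in enumerate(raw.splitlines(), start=1):
        try:
            line.decode(encoding)
        except UnicodeDecodeError as exc:
            # exc.start is the index of the first offending byte in the line.
            problems.append((number, exc.start))
    return problems


# Invented sample: line 1 is cp1251, line 2 is cp866 and begins with
# byte 0x98, which cp1251 does not define.
sample = "Проверка\r\n".encode("cp1251") + "Шаг\r\n".encode("cp866")

print(find_undecodable_lines(sample, "cp1251"))  # [(2, 0)]
print(find_undecodable_lines(sample, "cp866"))   # []
```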

Vyacheslav Garashchenko (slavagt) wrote : Re: [Bug 814117] Re: Problems with cyrillic cp1251 in internal diff (Failed to decode using charmap, falling back to latin1)

If you don't mind, I'll use Russian: I think it is closer to both of us than
English, and it is perfectly acceptable in private mail.

I have to apologize for my own blunder. The bug exists, but it is not that
critical: it shows up only when win1251 and CP866 are both present in the
program at the same time. That is an unlikely situation, although a possible
one. Moreover, the problem arises only for the CP1251 encoding and does not
arise for mac_cyrillic, which is strange, nor for other diff utilities. I
should have paid attention to this from the very beginning. I simply took the
files of (one of) our working projects to check whether the bazaar system is
suitable for us (it is clearly more convenient than many others, especially
when the team is small and maintaining a "dedicated server" is not justified)
and did not notice that the file had already been converted for upload to the
server (SCO, where CP866 is used).

The file itself is attached.


Alexander Belchenko (bialix) wrote :

Vyacheslav Garashchenko writes:
> While preparing the files to send to you, I found out when this problem appears. It appears ONLY when a file has some text in the CP1251 encoding and some text in CP866. It happened to me because I took some files from a working project to test whether we could use bazaar in our work, and forgot that the files had been converted for upload to the server (SCO). In the real world this problem will probably appear very rarely, for example when a program contains text in several encodings for several console encodings, which is "wrong style of programming"... So this is a bug, but it is likely to affect very few people.
> This is my fault: I should have noticed this at the very beginning :(

Thanks for this information. As far as I can see, my suspicions have been
confirmed. I'll try to provide a patch to address this problem.
And I'll need your help to test the patch with your mixed-encoding file.

--
All the dude wanted was his rug back

Alexander Belchenko (bialix) wrote :

Vyacheslav Garashchenko writes:
> While preparing the files to send to you, I found out when this problem appears. It appears ONLY when a file has some text in the CP1251 encoding and some text in CP866. It happened to me because I took some files from a working project to test whether we could use bazaar in our work, and forgot that the files had been converted for upload to the server (SCO). In the real world this problem will probably appear very rarely, for example when a program contains text in several encodings for several console encodings, which is "wrong style of programming"... So this is a bug, but it is likely to affect very few people.
> This is my fault: I should have noticed this at the very beginning :(

I can confirm this behavior after looking at the real code. For mixed
cp1251/cp866 content Python can't convert the content to unicode safely,
so we fall back to the latin-1 encoding instead. I think we can try to
convert to unicode without falling back to latin-1 in such cases.
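
One way to read that last idea, sketched in Python 3 under the assumption that it means keeping the user-selected codec and substituting only the bytes it cannot map, rather than re-decoding the whole text as latin-1. This is not the actual QBzr patch; `decode_for_display` is a hypothetical name and the mixed sample is invented.

```python
def decode_for_display(raw, encoding):
    """Decode raw with the chosen codec; on failure keep that codec and
    replace each undecodable byte with U+FFFD instead of using latin-1."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return raw.decode(encoding, errors="replace")


# Invented mixed content: a cp1251 word followed by a cp866 word whose
# first byte (0x98) is undefined in cp1251.
mixed = "Проверка".encode("cp1251") + b" " + "Шаг".encode("cp866")

shown = decode_for_display(mixed, "cp1251")
latin1_fallback = mixed.decode("latin1")

print(shown.startswith("Проверка"))            # True
print("\ufffd" in shown)                       # True
print(latin1_fallback.startswith("Проверка"))  # False
```

With this approach the cp1251 part of the file stays readable and only the foreign bytes are marked, whereas a whole-text latin-1 fallback turns the valid cp1251 text into accented Latin characters as well.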

--
All the dude wanted was his rug back

Changed in qbzr:
status: Incomplete → Confirmed
importance: High → Medium
summary: - Problems with cyrillic cp1251 in internal diff (Failed to decode using
- charmap, falling back to latin1)
+ Problems with mixed cp1251/cp866 content in qdiff (Failed to decode
+ using charmap, falling back to latin1)

Vyacheslav Garashchenko (slavagt) wrote :

I really program only for Linux; I last worked with M$ many years ago,
before Win95 appeared, so I cannot compile from source under Windows.
If you send me a compiled binary for M$ Windows for testing, I will
test it.
Thank you!
PS: I sent you the test file with mixed encoding in the previous
letter.


Alexander Belchenko (bialix) wrote :

Vyacheslav Garashchenko writes:
> If you don't mind, I'll use Russian: I think it is closer to both of us than
> English, and it is perfectly acceptable in private mail.

You replied not in private but to the bug mail. I will download your file and
delete it from the tracker just in case.

Alexander Belchenko (bialix) wrote :

Slava, you don't need to compile anything from source. The QBzr plugin is written in pure Python (an interpreted language) and can be run with the existing bzr.exe without problems.

Changed in qbzr:
milestone: none → 0.22b1
Changed in qbzr:
assignee: nobody → Alexander Belchenko (bialix)
Changed in qbzr:
status: Confirmed → Fix Committed
Changed in qbzr:
status: Fix Committed → Fix Released