file-position is confused by utf-8 buffering

Bug #657183 reported by Faré
This bug affects 1 person
Affects: SBCL
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

file-position seems to be computed as (- position-at-end-of-buffer (- end-of-buffer index-in-buffer)), which is bogus when a character can occupy more than one byte, and the resulting position cannot be used as an argument to (file-position s position). Demonstration:

# cat ~/bug/sbcl-file-position.lisp
(in-package :cl-user)
(defparameter *u* "/tmp/u")
(with-open-file (s *u* :direction :output :if-exists :rename-and-delete)
  (princ "Faré λ 自由 foo" s))
(with-open-file (s *u* :direction :input)
  (format t "~&file length: ~D~%" (file-length s))
  (loop :for pos = (file-position s)
    :for c = (read-char s nil nil)
    :for nil = (format t "~&pos ~2D ~S~%" pos c)
    :while c))
(delete-file *u*)
(quit)

# sbcl --load ~/bug/sbcl-file-position.lisp
This is SBCL 1.0.42.37, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.

SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
file length: 19
pos 0 #\F
pos 7 #\a
pos 8 #\r
pos 9 #\LATIN_SMALL_LETTER_E_WITH_ACUTE
pos 10 #\
pos 11 #\GREEK_SMALL_LETTER_LAMDA
pos 12 #\
pos 13 #\U81EA
pos 14 #\U7531
pos 15 #\
pos 16 #\f
pos 17 #\o
pos 18 #\o
pos 19 NIL
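
For comparison, here is a minimal sketch of the byte offsets one would expect for this string if positions were true UTF-8 byte offsets. The helper names utf8-char-size and utf8-byte-offsets are hypothetical (they are not part of SBCL), and the sketch assumes that char-code returns the Unicode code point, as it does on a Unicode-enabled SBCL.

```lisp
;; Hypothetical helpers for illustration only; not SBCL functions.
;; Assumes CHAR-CODE returns the Unicode code point.
(defun utf8-char-size (char)
  "Number of bytes CHAR occupies in UTF-8, derived from its code point."
  (let ((code (char-code char)))
    (cond ((< code #x80) 1)
          ((< code #x800) 2)
          ((< code #x10000) 3)
          (t 4))))

(defun utf8-byte-offsets (string)
  "Byte offset before each character of STRING, followed by the total length."
  (let ((offset 0)
        (offsets '()))
    (loop for char across string
          do (push offset offsets)
             (incf offset (utf8-char-size char)))
    ;; The final OFFSET is the total encoded length in bytes.
    (nreverse (cons offset offsets))))
```

Applied to the test string, (utf8-byte-offsets "Faré λ 自由 foo") gives (0 1 2 3 5 6 8 9 12 15 16 17 18 19). The reported values instead jump from 0 to 7 and then advance by one per character, which is consistent with subtracting the 12 remaining characters, counted as one byte each, from the 19-byte end of the buffer.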

This is on Linux amd64, which shouldn't matter, with a recentish SBCL.
I'm building the latest SBCL to test it, but from the git log
I don't think it has been magically solved.

NB: I found this problem while writing a function using file-position
to read backwards in a file and portably find the stream-line-column
at the current position of a file stream despite encoding issues. It
was quite annoying to not have read/write consistency for the position.
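
The read/write consistency I expected can be sketched as a round-trip check: record the position before each character, then seek back to each recorded position and verify that the same character is read again. The function name file-position-round-trips-p is hypothetical, and the set of accepted :external-format values is implementation-defined, so treat this as a sketch rather than a portable test.

```lisp
;; Hypothetical helper for illustration; not part of SBCL or ASDF.
(defun file-position-round-trips-p (path &key (external-format :default))
  "True if every position reported before a READ-CHAR can be passed back
to FILE-POSITION and yields the same character again."
  (handler-case
      (with-open-file (s path :direction :input
                              :external-format external-format)
        ;; First pass: remember (position . character) for each character.
        (let ((samples (loop for pos = (file-position s)
                             for c = (read-char s nil nil)
                             while c
                             collect (cons pos c))))
          ;; Second pass: seek to each remembered position and re-read.
          (loop for (pos . c) in samples
                always (and (file-position s pos)
                            (eql c (read-char s nil nil))))))
    ;; Seeking into the middle of a multi-byte character may signal a
    ;; decoding error; count that as a failed round trip.
    (error () nil)))
```

Calling (file-position-round-trips-p "/tmp/u") on the file written by the demonstration above should return true once the reported positions are consistent.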

PS: thanks to Nikodemus for updating ASDF to 2.009.

Faré (fahree) wrote :

Note that the code implementing this was correct back in the days when CMUCL only had 8-bit encodings.

Faré (fahree) wrote :

Also note that if I use :external-format :utf-8 when I open, I don't have the problem.
The issue is that sb-impl::fd-stream-char-size is wrongly set to 1 when the :external-format is :default.

set-fd-stream-routines should probably set the char-size function when the supplied external-format is :default.

Faré (fahree) wrote :

The patch attached fixes the issue for me. Please review and commit.

Nikodemus Siivola (nikodemus) wrote :

Yep. It seems that :DEFAULT doesn't get treated right.

* (sb-impl::default-external-format)

:UTF-8
* (sb-impl::external-format-char-size :default)

1
* (sb-impl::get-external-format :default)

NIL
NIL

However, I think the right place to fix this is making GET-EXTERNAL-FORMAT understand :DEFAULT.
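
As a rough illustration of that idea (not the actual SBCL code), a lookup can normalize :DEFAULT to the concrete default format before consulting the table, so that every caller, including the char-size computation, sees the same entry. All names below are toy stand-ins invented for this sketch, and the table contents are assumptions.

```lisp
;; Toy stand-ins invented for this sketch; none of these are SBCL internals.
(defparameter *toy-default-external-format* :utf-8
  "Plays the role of the runtime's default external format.")

(defparameter *toy-external-formats*
  '((:latin-1 . 1)          ; fixed one byte per character
    (:utf-8 . :variable))   ; one to four bytes per character
  "Maps a format name to its character size.")

(defun toy-get-external-format (name)
  "Look up NAME, treating :DEFAULT as an alias for the current default.
Returns the (name . char-size) entry, or NIL if NAME is unknown."
  (let ((resolved (if (eq name :default)
                      *toy-default-external-format*
                      name)))
    (assoc resolved *toy-external-formats*)))
```

With this shape, (toy-get-external-format :default) returns the :UTF-8 entry rather than NIL, so code that derives the character size from the lookup no longer falls back to 1.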

Changed in sbcl:
importance: Undecided → Medium
status: New → Triaged
Faré (fahree) wrote :

I don't fully understand the code, which I think could use some refactoring. The reason I didn't just modify GET-EXTERNAL-FORMAT is that I believe :DEFAULT means something special only if the ELEMENT-TYPE is CHARACTER, so GET-EXTERNAL-FORMAT would need a second argument (possibly &OPTIONAL) to make sure it doesn't do anything silly when the ELEMENT-TYPE is (UNSIGNED-BYTE 8) or something similar.

Nikodemus Siivola (nikodemus) wrote :

:EXTERNAL-FORMAT :DEFAULT always means the same thing.

Maybe you're confusing things with :ELEMENT-TYPE :DEFAULT?

Anyways, fixing this looks simple enough, as long as I or someone else has a bit of time.

A merge-ready patch including a test-case will of course speed things along, but I should be able to attend to this before the week is over.

Changed in sbcl:
assignee: nobody → Nikodemus Siivola (nikodemus)
status: Triaged → In Progress
Nikodemus Siivola (nikodemus) wrote :

Fixed in 1.0.43.52.

Changed in sbcl:
assignee: Nikodemus Siivola (nikodemus) → nobody
status: In Progress → Fix Committed
Changed in sbcl:
status: Fix Committed → Fix Released