Port dulwich to Python 3

Bug #883315 reported by Chris Eberle
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Dulwich
Fix Committed
Medium
Chris Eberle

Bug Description

Python 3 is coming, there's really no avoiding it. And I'd really like to see dulwich working on Python 3.

I've started porting it over myself (see https://github.com/eberle1080/dulwich-py3k), but I can tell you it's a rather substantial effort. 99% of the problems thus far has been converting from python 2 strings to python 3 unicode strings / byte strings. I'll keep plugging away on this, but there's almost no chance this will nicely merge in with the original code base (such is the nature of Python 3).

Tags: port python3
Revision history for this message
Chris Eberle (eberle1080) wrote :

Assigned to myself, still could use help

Changed in dulwich:
assignee: nobody → Chris Eberle (eberle1080)
assignee: Chris Eberle (eberle1080) → nobody
assignee: nobody → Chris Eberle (eberle1080)
status: New → In Progress
Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Cool, thanks for working on this!

I agree that it will be hard to merge this in the original codebase, so I think we'll have to end up with a python2 and a python3 branch. It'll be a bit painful to keep both up to date, but it shouldn't be too bad as long as patches to either one of the branches is ported to the other.

Revision history for this message
Chris Eberle (eberle1080) wrote :

Just wanted to give an update. I've managed to convert everything over to python 3. However all that means at this point is that the unit tests all run. Whether things work "as expected" is yet to be seen. I need to run some real-world tests to see how it does (volunteers?)

Basically my original statement was dead on. The big challenge is knowing when to use a regular unicode string, and when to use a byte string. Because they were used interchangeably in python 2, it made the port rather difficult. To make life easier, I created a decorator which transparently converts parameters of functions to the right type.

While the decorators were absolutely a necessity (for my own sanity), they do slow the code down noticeably. It's also confusing from an API standpoint when you ask a class to store a string and then later on it gives you back bytes. So my next task is to slowly remove these decorators in favor or type enforcement (i.e. you have to call the methods with the correct types or you get a TypeError).

Furthermore, the original codebase never cleaned up open file handles. I guess this is legal in python 2, and I'm certainly guilty of writing code like that. But in python 3 the interpreter yells at you when it garbage-collects unclosed files. And so I also converted everything over to use the "with" semantics to support nicely closing everything (no more warnings, my inner perfectionist can sleep tonight).

All of which is to say, so far so good.

Revision history for this message
Jelmer Vernooij (jelmer) wrote : Re: [Bug 883315] Re: Port dulwich to Python 3

On 08/11/11 05:26, Chris Eberle wrote:
> Just wanted to give an update. I've managed to convert everything over
> to python 3. However all that means at this point is that the unit tests
> all run. Whether things work "as expected" is yet to be seen. I need to
> run some real-world tests to see how it does (volunteers?)
>
Thanks for the update, and your work on porting. Is there any
preliminary code available somewhere?

Cheers,

Jelmer

Revision history for this message
Chris Eberle (eberle1080) wrote :

Yep, the current head revision
(60d6803fbced0dfdaa7ab98c2dfc8225399eae6f) of the following project:

https://github.com/eberle1080/dulwich-py3k

I managed to run "dulwich clone" and "dulwich log" on a remote repo
(over ssh). Also all of the unit tests work fine. That's pretty much
all I know at this point, to be honest I'm not 100% sure what the
capabilities are of the library (I only use it for a very limited
subset of functions for myself), so it would be helpful to get someone
who's a lot more familiar with how the library should behave to look
at the new code and verify things are working as they should.

--
Chris Eberle
"Give me ambiguity or give me something else."

Quoting "Jelmer Vernooij" <email address hidden>:

> On 08/11/11 05:26, Chris Eberle wrote:
>> Just wanted to give an update. I've managed to convert everything over
>> to python 3. However all that means at this point is that the unit tests
>> all run. Whether things work "as expected" is yet to be seen. I need to
>> run some real-world tests to see how it does (volunteers?)
>>
> Thanks for the update, and your work on porting. Is there any
> preliminary code available somewhere?
>
> Cheers,
>
> Jelmer
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/883315
>
> Title:
> Port dulwich to Python 3
>
> Status in Python Git Library:
> In Progress
>
> Bug description:
> Python 3 is coming, there's really no avoiding it. And I'd really like
> to see dulwich working on Python 3.
>
> I've started porting it over myself (see https://github.com/eberle1080
> /dulwich-py3k), but I can tell you it's a rather substantial effort.
> 99% of the problems thus far has been converting from python 2 strings
> to python 3 unicode strings / byte strings. I'll keep plugging away on
> this, but there's almost no chance this will nicely merge in with the
> original code base (such is the nature of Python 3).
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/dulwich/+bug/883315/+subscriptions
>
>

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

On 08/11/11 21:01, Chris Eberle wrote:
> Yep, the current head revision
> (60d6803fbced0dfdaa7ab98c2dfc8225399eae6f) of the following project:
>
> https://github.com/eberle1080/dulwich-py3k
>
> I managed to run "dulwich clone" and "dulwich log" on a remote repo
> (over ssh). Also all of the unit tests work fine. That's pretty much
> all I know at this point, to be honest I'm not 100% sure what the
> capabilities are of the library (I only use it for a very limited
> subset of functions for myself), so it would be helpful to get someone
> who's a lot more familiar with how the library should behave to look
> at the new code and verify things are working as they should.
That's great, thanks.

I've rebased your work on the current trunk, fixed a few more of the
test failures and pushed it to the current dulwich repository as the
"py3k" branch.

There are now 4 remaining test failures, all with a very similar traceback.

So it seems this is very close. Nice work. :-)

Cheers,

Jelmer

Revision history for this message
Chris Eberle (eberle1080) wrote :

OK, well I'm calling the port done. The tests all pass for me (I'm using python 3.2.2). I got rid of most of the old compat functions since they were meant for python 2.x compatibility issues (no longer an issue, obviously). I also removed any references to pypy since, as far as I know, there is no python3-compatible version of pypy. All of my old wrapper code is gone (i.e. the decorators), which means the calls are a lot more predictable and consistent, and the performance is improved.

The single most reused chunk of code was converting between the different representations of a SHA1 sum (unicode hex <-> bytes hex <-> raw bytes). To that end, I created a new class called Sha1Sum which nicely abstracts all of this. I also created unit tests for this new class.

I've documented the major porting issues here: https://github.com/eberle1080/dulwich-py3k/wiki
And I've generated API documentation here: http://eberle1080.github.com/dulwich-py3k/index.html

I tried my very hardest to keep up to date with the main branch, and as far as I'm aware everything from the main branch has been merged, ported, and is working. Please let me know if you encounter any issues with the port (and if you do find any, please tell me which version of python 3 you're using). I'll continue merging and porting as the main branch is updated.

Changed in dulwich:
status: In Progress → Fix Committed
Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Hi Chris,

Am 21/11/11 00:56, schrieb Chris Eberle:
> OK, well I'm calling the port done. The tests all pass for me (I'm using
> python 3.2.2). I got rid of most of the old compat functions since they
> were meant for python 2.x compatibility issues (no longer an issue,
> obviously). I also removed any references to pypy since, as far as I
> know, there is no python3-compatible version of pypy. All of my old
> wrapper code is gone (i.e. the decorators), which means the calls are a
> lot more predictable and consistent, and the performance is improved.
Very cool!

There still seem to be some remaining issues. In particular, it seems
like paths are now unicode and it is assumed paths in git itself are
utf8, rather than just plain bytes.

> The single most reused chunk of code was converting between the
> different representations of a SHA1 sum (unicode hex<-> bytes hex<->
> raw bytes). To that end, I created a new class called Sha1Sum which
> nicely abstracts all of this. I also created unit tests for this new
> class.
>
> I've documented the major porting issues here: https://github.com/eberle1080/dulwich-py3k/wiki
> And I've generated API documentation here: http://eberle1080.github.com/dulwich-py3k/index.html
>
> I tried my very hardest to keep up to date with the main branch, and as
> far as I'm aware everything from the main branch has been merged,
> ported, and is working. Please let me know if you encounter any issues
> with the port (and if you do find any, please tell me which version of
> python 3 you're using). I'll continue merging and porting as the main
> branch is updated.
I wonder what the best way is to maintain this in the long term. I would
be happy to keep it up to date as part of the main dulwich project, and
just regularly merge in changes from the main Dulwich branch. It would
be nice to actually have this rebased on the main dulwich branch, so
that revisions can just be merged in, rather than having to be rebased
(and changing identity, which makes it hard to see what has been merged
and what hasn't).

What do you think?

Cheers,

Jelmer

Revision history for this message
Chris Eberle (eberle1080) wrote :

To which paths are you referring? The only time I ask for real unicode strings is when using disk paths (this is python's preference). Everything else is actual byte strings. However for my part, yes when converting to strings I have been assuming a utf-8 encoding. I figured this was a relatively safe assumption since the old code assumed 8-bit characters. Now of course, one of the advantages of the old way is that there was no implied encoding, so you could just stuff bytes in there. That's why in general I've been preferring bytes objects. However in the few instances where strings make more sense (like committer name / email address / commit message / diff strings / email messages, etc) I've been assuming utf-8. I think there's somewhere in one of my dozens of commits where I said something like "I guess I'll be using utf-8 until someone corrects me". Frankly it was a guess. The default encoding is 7-bit ascii which I found to be insufficient for most things (e.g. I couldn't even view the git log for dulwich because some commiters have international characters in their names). So obviously I'm open to any and all suggestions. But as mentioned, there are actually some places where real unicode strings are required by python.

As for rebasing everything, I'm all for that. I can definitely see that as an advantage. I say go for it. :)

Revision history for this message
Chris Eberle (eberle1080) wrote :

I've rebased the python3 branch onto the main branch. The history is a lot cleaner, and all patches in the master branch can now be found in the python 3 branch:

https://github.com/eberle1080/dulwich-py3k/tree/python3-rebased

Let me know if you think this needs some more cleanup. I was tempted to squash it all into one big commit, but I didn't. Frankly I could live with either.

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Hi Chris,

On 11/28/2011 01:07 AM, Chris Eberle wrote:
> I've rebased the python3 branch onto the main branch. The history is a
> lot cleaner, and all patches in the master branch can now be found in
> the python 3 branch:
>
> https://github.com/eberle1080/dulwich-py3k/tree/python3-rebased
That's great, thanks! I'm going to pull this in, we can of course make
more improvements as necessary.
>
> Let me know if you think this needs some more cleanup. I was tempted to
> squash it all into one big commit, but I didn't. Frankly I could live
> with either.
It's fine with me as is, it's often useful to see the history of
particular changes.

And I can't say this enough, thanks for working on this. :-)

Cheers,

Jelmer

Revision history for this message
Chris Eberle (eberle1080) wrote :

Happy to help, it's just a shame that Python made the transition so
difficult (their 2to3 tool was totally ineffective).

I really like the project by the way -- I was half tempted to start my
own Scala-based knockoff, I think it's a really cool idea. Great work
you guys have all done on it.

--
Chris Eberle
"Give me ambiguity or give me something else."

Quoting "Jelmer Vernooij" <email address hidden>:

> Hi Chris,
>
> On 11/28/2011 01:07 AM, Chris Eberle wrote:
>> I've rebased the python3 branch onto the main branch. The history is a
>> lot cleaner, and all patches in the master branch can now be found in
>> the python 3 branch:
>>
>> https://github.com/eberle1080/dulwich-py3k/tree/python3-rebased
> That's great, thanks! I'm going to pull this in, we can of course make
> more improvements as necessary.
>>
>> Let me know if you think this needs some more cleanup. I was tempted to
>> squash it all into one big commit, but I didn't. Frankly I could live
>> with either.
> It's fine with me as is, it's often useful to see the history of
> particular changes.
>
> And I can't say this enough, thanks for working on this. :-)
>
> Cheers,
>
> Jelmer
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/883315
>
> Title:
> Port dulwich to Python 3
>
> Status in Python Git Library:
> Fix Committed
>
> Bug description:
> Python 3 is coming, there's really no avoiding it. And I'd really like
> to see dulwich working on Python 3.
>
> I've started porting it over myself (see https://github.com/eberle1080
> /dulwich-py3k), but I can tell you it's a rather substantial effort.
> 99% of the problems thus far has been converting from python 2 strings
> to python 3 unicode strings / byte strings. I'll keep plugging away on
> this, but there's almost no chance this will nicely merge in with the
> original code base (such is the nature of Python 3).
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/dulwich/+bug/883315/+subscriptions
>
>

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Am 26/11/11 23:15, schrieb Chris Eberle:
> To which paths are you referring? The only time I ask for real unicode
> strings is when using disk paths (this is python's preference).
> Everything else is actual byte strings. However for my part, yes when
> converting to strings I have been assuming a utf-8 encoding. I figured
> this was a relatively safe assumption since the old code assumed 8-bit
> characters. Now of course, one of the advantages of the old way is that
> there was no implied encoding, so you could just stuff bytes in there.
> That's why in general I've been preferring bytes objects. However in the
> few instances where strings make more sense (like committer name / email
> address / commit message / diff strings / email messages, etc) I've been
> assuming utf-8. I think there's somewhere in one of my dozens of commits
> where I said something like "I guess I'll be using utf-8 until someone
> corrects me". Frankly it was a guess. The default encoding is 7-bit
> ascii which I found to be insufficient for most things (e.g. I couldn't
> even view the git log for dulwich because some commiters have
> international characters in their names). So obviously I'm open to any
> and all suggestions. But as mentioned, there are actually some places
> where real unicode strings are required by python.
Git itself doesn't really define an encoding (there is a "encoding"
attribute in the commit message, but it's optional and used by almost
nobody), so it's usually really up to the user to interpret the data.

In fact, there are quite a few git repositories out there which don't
have data valid in utf8. Encoding/decoding back and forth also adds
quite a bit of overhead in terms of performance, especially when you're
serializing/deserializing lots of objects.

UTF-8 would probably be a safe bet for the far majority of the cases.
That said, implicit encoding being problematic was also one of the
reasons why Python3 added the clear split between string objects and
byte objects (ie code always working until somebody happens to pass in a
string with a non-ascii character). So I'd prefer to just hand back
plain bytes to the user, and leave it up to them to interpret those
bytes in whatever way they see fit.

Cheers,

Jelmer

Revision history for this message
Chris Eberle (eberle1080) wrote :

Aww, man, you took out Sha1Sum? No!!! :-) Heh, yeah I knew that one might raise a few eyebrows. One of the problems of doing long sessions of coding -- sudden loss of long-term judgement.

Well I think the theme here is performance -- so I agree with you both about the Sha1Sum class and the UTF strings. There were a few places where strings absolutely had to be used, namely any time you need to construct a path name (for functions like open, os.path.join, etc). So I guess the question is -- is there some way to get the filesystem's encoding? Or do we just assume some default encoding?

Jelmer Vernooij (jelmer)
Changed in dulwich:
importance: Undecided → Medium
Revision history for this message
Benedikt Sauer (filmor) wrote :

Is there a chance of getting those patches upstream?

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Chris' patches are in the upstream repository in the 'python3' branch. They can' t be merged upstream without breaking python2 compatibility.

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Unfortunately the python3 branch has bitrotted a bit; any help getting it updated to the current state of trunk welcome.

Revision history for this message
Chris Eberle (eberle1080) wrote :

Hey thanks for the request. I'd love to help if I had the time, but
unfortunately right now life is keeping me too busy for personal
projects. Sorry. When time frees up, I'd love to contribute again.

On 01/18/2014 10:30 AM, Jelmer Vernooij wrote:
> Unfortunately the python3 branch has bitrotted a bit; any help getting
> it updated to the current state of trunk welcome.
>

--Chris

Revision history for this message
Andreas Klöckner (inform) wrote :
Revision history for this message
Jelmer Vernooij (jelmer) wrote :

That's an old port though, it misses about 3 years of bug fixes and new features in dulwich.

On 26 January 2015 01:16:14 CET, "Andreas Klöckner" <email address hidden> wrote:
>https://github.com/eberle1080/dulwich-py3k
>
>--
>You received this bug notification because you are subscribed to
>Dulwich.
>https://bugs.launchpad.net/bugs/883315
>
>Title:
> Port dulwich to Python 3
>
>To manage notifications about this bug go to:
>https://bugs.launchpad.net/dulwich/+bug/883315/+subscriptions

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.