[focal/core20][python3.7+] staging conflicts when multiple python parts have the same python dependencies

Bug #1882535 reported by Dmitrii Shcherbakov on 2020-06-08
This bug affects 1 person
Affects: Snapcraft; Importance: High; Assigned to: Sergio Schvezov
Affects: pip; Status: Unknown; Importance: Unknown
Affects: python-pip (Ubuntu); Importance: Undecided; Assigned to: Unassigned

Bug Description

TL;DR: as of Python 3.7, .pyc files by default include the timestamp and size of the source file, so the hash of a .pyc file changes whenever it is regenerated from a source file with a different modification time. This results in staging conflicts for python parts.
https://docs.python.org/3/library/py_compile.html#py_compile.compile
https://docs.python.org/3/library/py_compile.html#py_compile.PycInvalidationMode.TIMESTAMP
Even if SOURCE_DATE_EPOCH is set to switch to hash-based invalidation, there is still an issue in pip: https://github.com/pypa/pip/issues/8414

Description/analysis:

When building a project with multiple identical python dependencies I consistently get an error like this:

Failed to stage: Parts 'openstack-projects' and 'cluster' have the following files, but with different contents:
    bin/activate
    bin/activate.csh
    bin/activate.fish
    bin/python3
    pyvenv.cfg
    bin/python3
    lib/python3.8/site-packages/Flask-1.1.2.dist-info/RECORD
    lib/python3.8/site-packages/__pycache__/easy_install.cpython-38.pyc
    lib/python3.8/site-packages/certifi/__pycache__/__init__.cpython-38.pyc
    lib/python3.8/site-packages/certifi/__pycache__/__main__.cpython-38.pyc
# many other .pyc files ...

While snapcraft suggests that I use something like `organize`, `filesets` and `stage`, the issue is that the source files for those dependencies are identical - there is no reason for any manual work here.

Source hashes are the same:

snapcraft-microstack # sha256sum ./parts/cluster/install/lib/python3.8/site-packages/click/_textwrap.py
6a30b3933165cb9b639bd7e843937dfcc39e69824c063025b6e15aebd9f88976 ./parts/cluster/install/lib/python3.8/site-packages/click/_textwrap.py

snapcraft-microstack # sha256sum ./parts/openstack-projects/install/lib/python3.8/site-packages/click/_textwrap.py
6a30b3933165cb9b639bd7e843937dfcc39e69824c063025b6e15aebd9f88976 ./parts/openstack-projects/install/lib/python3.8/site-packages/click/_textwrap.py

.pyc files are different:

snapcraft-microstack # sha256sum ./parts/openstack-projects/install/lib/python3.8/site-packages/click/__pycache__/_textwrap.cpython-38.pyc
398b47a5abfc87e9da73153e42d48dcd5d917bd637a0e0af1eb6999f19fb1085 ./parts/openstack-projects/install/lib/python3.8/site-packages/click/__pycache__/_textwrap.cpython-38.pyc

snapcraft-microstack # sha256sum ./parts/cluster/install/lib/python3.8/site-packages/click/__pycache__/_textwrap.cpython-38.pyc
d4642cfecd727d228944a1d31ff728e7ef6529a7a88898f6568ea6e96d1f8f82 ./parts/cluster/install/lib/python3.8/site-packages/click/__pycache__/_textwrap.cpython-38.pyc

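RECORD files include hashes and sizes of the installed files, hence they also differ. In the diff below the differing entry is bin/flask: its size differs by 11 bytes, which equals the difference in length between the part names 'openstack-projects' and 'cluster', suggesting that the generated script embeds the part's install path: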

snapcraft-microstack # diff ./parts/openstack-projects/install/lib/python3.8/site-packages/Flask-1.1.2.dist-info/RECORD ./parts/cluster/install/lib/python3.8/site-packages/Flask-1.1.2.dist-info/RECORD
1c1
< ../../../bin/flask,sha256=VXQqccMeG03Rn8_yN8Kq3Up13rzyaoHsEckFnCxHor4,242
---
> ../../../bin/flask,sha256=NAzPpe84iZFX3PYsCZEirt3fAFObAjBuCpM25792kSU,231
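
For reference, a RECORD line has the form path,sha256=<digest>,<size>, where the digest is the URL-safe base64 encoding of the SHA-256 hash with the trailing '=' padding removed. The sketch below shows how any byte-level difference in an installed file changes its RECORD line. The two shebang strings are hypothetical stand-ins for the generated bin/flask scripts and are not taken from the actual build.

```python
import base64
import hashlib


def record_entry(path, data):
    """Build a RECORD-style line: path, URL-safe base64 SHA-256 without padding, size."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


# Hypothetical script contents: only the interpreter path in the shebang differs.
script_cluster = b"#!/root/parts/cluster/install/bin/python3\n"
script_openstack = b"#!/root/parts/openstack-projects/install/bin/python3\n"

entry_cluster = record_entry("../../../bin/flask", script_cluster)
entry_openstack = record_entry("../../../bin/flask", script_openstack)

print(entry_cluster)
print(entry_openstack)
print("entries identical:", entry_cluster == entry_openstack)
```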

Apparently, as of python 3.7, .pyc files include a timestamp and a size of the source by default (PycInvalidationMode.TIMESTAMP). There is a way to override this behavior by setting the SOURCE_DATE_EPOCH environment variable to switch py_compile to using PycInvalidationMode.CHECKED_HASH:

https://docs.python.org/3/library/py_compile.html
py_compile.compile(file, cfile=None, dfile=None, doraise=False, optimize=-1, invalidation_mode=PycInvalidationMode.TIMESTAMP, quiet=0)

invalidation_mode should be a member of the PycInvalidationMode enum and controls how the generated bytecode cache is invalidated at runtime. The default is PycInvalidationMode.CHECKED_HASH ***if the SOURCE_DATE_EPOCH environment variable is set***, otherwise ***the default is PycInvalidationMode.TIMESTAMP***.

https://docs.python.org/3/library/py_compile.html#py_compile.PycInvalidationMode.TIMESTAMP
TIMESTAMP
The .pyc file includes the timestamp and size of the source file, which Python will compare against the metadata of the source file at runtime to determine if the .pyc file needs to be regenerated.

https://docs.python.org/3/library/py_compile.html#py_compile.PycInvalidationMode.CHECKED_HASH
CHECKED_HASH
The .pyc file includes a hash of the source file content, which Python will compare against the source at runtime to determine if the .pyc file needs to be regenerated.
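
The difference between the two modes can be checked directly. The following is a minimal sketch that passes invalidation_mode explicitly rather than relying on SOURCE_DATE_EPOCH; the file name mod.py and the two modification times are arbitrary values chosen for illustration. It compiles the same source twice from the same path, changing only the source file's mtime in between.

```python
import os
import py_compile
import tempfile
from py_compile import PycInvalidationMode


def compile_twice(mode):
    """Compile one source file twice, changing only its mtime in between."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "mod.py")
        with open(src, "w") as f:
            f.write("x = 1\n")
        outputs = []
        for i, mtime in enumerate((1_500_000_000, 1_600_000_000)):
            os.utime(src, (mtime, mtime))  # simulate unpacking at a different time
            cfile = os.path.join(tmp, f"out{i}.pyc")
            py_compile.compile(src, cfile=cfile, doraise=True,
                               invalidation_mode=mode)
            with open(cfile, "rb") as f:
                outputs.append(f.read())
        return outputs


ts_a, ts_b = compile_twice(PycInvalidationMode.TIMESTAMP)
hash_a, hash_b = compile_twice(PycInvalidationMode.CHECKED_HASH)

print("TIMESTAMP output identical:   ", ts_a == ts_b)
print("CHECKED_HASH output identical:", hash_a == hash_b)
```

Note that the source path is the same for both compilations here, so this only isolates the effect of the timestamp.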

Adding something like this seems to be needed:
    build-environment:
      - SOURCE_DATE_EPOCH: '1591640328'

However, see https://bugs.launchpad.net/snapcraft/+bug/1882535/comments/2

summary: - focal/core20: staging conflicts when multiple python parts have the same
- python dependencies (python 3.7+)
+ [focal/core20][python3.7+] staging conflicts when multiple python parts
+ have the same python dependencies
description: updated
Sergio Schvezov (sergiusens) wrote :

Great, this shall be in 4.1 then

Changed in snapcraft:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Sergio Schvezov (sergiusens)
Dmitrii Shcherbakov (dmitriis) wrote :

It looks like my workaround doesn't quite work. The cause is not snapcraft, which correctly passes the environment variable down.

At first pip did not look like the problem either. I was using a simple package installation as a test case and found that py_compile correctly picks up the SOURCE_DATE_EPOCH variable and decides that the CHECKED_HASH invalidation mode should be used.
https://paste.ubuntu.com/p/63d5wD3DsZ/

So, in the end, we correctly get to this point where the .pyc content is generated:
https://github.com/python/cpython/blob/0f5a28f834bdac2da8a04597dc0fc5b71e50da9d/Lib/py_compile.py#L164-L172

However, the importlib._bootstrap_external._code_to_hash_pyc function returns a different value for the same source in different executions (that is, in separate processes):

root@ftest:~/test# rm -r ./*
root@ftest:~/test# pip3 install -t ./ setuptools

> /usr/lib/python3.8/py_compile.py(169)compile()
-> bytecode = importlib._bootstrap_external._code_to_hash_pyc(
(Pdb) print(invalidation_mode) ; import hashlib ; hashlib.sha256(source_bytes).hexdigest() ; hashlib.sha256(source_hash).hexdigest() ; hashlib.sha256(importlib._bootstrap_external._code_to_hash_pyc(code, source_hash, (invalidation_mode == PycInvalidationMode.CHECKED_HASH))).hexdigest() ; importlib._bootstrap_external.MAGIC_NUMBER
PycInvalidationMode.CHECKED_HASH
'a33213258b106b1cedbec418662b29f4226a91e9f579dfc4218722f2a826a2b5'
'6750d9e9657c3ce3ad48843f9d615ece9310755cbfb9deb037c3c9ce2d5e249f'
'4e537d6107457ed2e875030d2717961f319c4da55b0d012fcb09f38d91eda6cd'
b'U\r\r\n'

root@ftest:~/test# rm -r ./*
root@ftest:~/test# pip3 install -t ./ setuptools

(Pdb) print(invalidation_mode) ; import hashlib ; hashlib.sha256(source_bytes).hexdigest() ; hashlib.sha256(source_hash).hexdigest() ; hashlib.sha256(importlib._bootstrap_external._code_to_hash_pyc(code, source_hash, (invalidation_mode == PycInvalidationMode.CHECKED_HASH))).hexdigest() ; importlib._bootstrap_external.MAGIC_NUMBER
PycInvalidationMode.CHECKED_HASH
'a33213258b106b1cedbec418662b29f4226a91e9f579dfc4218722f2a826a2b5'
'6750d9e9657c3ce3ad48843f9d615ece9310755cbfb9deb037c3c9ce2d5e249f'
'8d7d7ab7695b88f4f5c4ee8d297228241c6488b0e39b6d71cd2de9d0bb7a9790'
b'U\r\r\n'

(Pdb) cfile
'/tmp/pip-unpacked-wheel-lx3lxwld/setuptools/command/__pycache__/bdist_egg.cpython-38.pyc'

Within the same process the value returned by importlib._bootstrap_external._code_to_hash_pyc for the same source does not change:

> /usr/lib/python3.8/py_compile.py(169)compile()
-> bytecode = importlib._bootstrap_external._code_to_hash_pyc(
print(invalidation_mode) ; import hashlib ; hashlib.sha256(source_bytes).hexdigest() ; hashlib.sha256(source_hash).hexdigest() ; hashlib.sha256(importlib._bootstrap_external._code_to_hash_pyc(code, source_hash, (invalidation_mode == PycInvalidationMode.CHECKED_HASH))).hexdigest() ; importlib._bootstrap_external.MAGIC_NUMBER
PycInvalidationMode.CHECKED_HASH
'a33213258b106b1cedbec418662b29f4226a91e9f579dfc4218722f2a826a2b5'
'6750d9e9657c3ce3ad48843f9d615ece9310755cbfb9deb037c3c9ce2d5e249f'
'46827abbb669a51e7072be45d72353ee21db9b2b28f5a17bd6c957b0181baa64'
b'U\r\r\n'
(Pdb) print(invalidation_mode) ; import hashlib ; hashlib.sha256(source_bytes).hexdigest() ...


description: updated
Dmitrii Shcherbakov (dmitriis) wrote :

I made a diff between the two .pyc files generated by different invocations. By the looks of it, a tmp path is somehow included in the .pyc file:

/tmp/pip-unpacked-wheel-w3yhi95n/setuptools/command/bdist_egg.py
vs
/tmp/pip-unpacked-wheel-uj8jixvx/setuptools/command/bdist_egg.py

root@ftest:~# diff -a 1-bdist_egg.cpython-38.pyc 2-bdist_egg.cpython-38.pyc
9c9
< EntryPoint)�Library)�Command�SetuptoolsDeprecationWarning)get_path�get_python_versioncCtd�S)N�purelib)r�rr�@/tmp/pip-unpacked-wheel-w3yhi95n/setuptools/command/bdist_egg.py�
         _get_purelibr)�get_python_librcCtd�S)NF)rrrrrr scCs2d|krtj�|�d}|�d�r.|dd�}|S)N�.r�modulei����)�os�pathsplitextendswith)filenamerrr�
                                                                                                                                            strip_module$s
---
> EntryPoint)�Library)�Command�SetuptoolsDeprecationWarning)get_path�get_python_versioncCtd�S)N�purelib)r�rr�@/tmp/pip-unpacked-wheel-uj8jixvx/setuptools/command/bdist_egg.py�
         _get_purelibr)�get_python_librcCtd�S)NF)rrrrrr scCs2d|krtj�|�d}|�d�r.|dd�}|S)N�.r�modulei����)�os�pathsplitextendswith)filenamerrr�
                                                                                                                                            strip_module$s

That's what causes the .pyc file hashes to be different while the sources are the same. The sources themselves definitely don't include the tmp path.

(Pdb) '/tmp/pip-unpacked-wheel-' in str(source_bytes)
False

Dmitrii Shcherbakov (dmitriis) wrote :

Going further:

* _code_to_hash_pyc takes a code object in (not the source code itself)
https://github.com/python/cpython/blob/843c27765652e2322011fb3e5d88f4837de38c06/Lib/importlib/_bootstrap_external.py#L608-L616

* MAGIC_NUMBER is the same across different invocations and other transformations are static;

* _code_to_hash_pyc uses `marshal` to dump bytes of a code object and those bytes contain a dynamic tmp path:
    data.extend(marshal.dumps(code))

(Pdb) '/tmp/pip-unpacked-wheel-' in str(marshal.dumps(code))
True

(Pdb) marshal.dumps(code)
b'\xe3\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00@\x00\x00\x00s\x84\x01\x00\x00d\x00Z\x00d\x01d\x02l\x01m\x02Z\x02\x01\x00d\x01d\x03l\x03m\x04Z\x04m\x05Z\x05\x01\x00d\x01d\x04l\x06m\x07Z\x07\x01\x00d\x01d\x05l\x08m\tZ\t\x01\x00d\x01d\x06l\nZ\nd\x01d\x06l\x0bZ\x0bd\x01d\x06l\x0cZ\x0cd\x01d\x06l\rZ\rd\x01d\x06l\x0eZ\x0ed\x01d\x06l\x0fZ\x0fd\x01d\x07l\x10m\x11Z\x11\x01\x00d\x01d\x08l\x12m\x13Z\x13m\x14Z\x14m\x15Z\x15\x01\x00d\x01d\tl\x12m\x16Z\x16\x01\x00d\x01d\nl\x17m\x18Z\x18\x01\x00d\x01d\x0bl\x19m\x1aZ\x1am\x1bZ\x1b\x01\x00z\x1cd\x01d\x0cl\x1cm\x1dZ\x1dm\x1eZ\x1e\x01\x00d\rd\x0e\x84\x00Z\x1fW\x00n,\x04\x00e k\nr\xf8\x01\x00\x01\x00\x01\x00d\x01d\x0fl!m"Z"m\x1eZ\x1e\x01\x00d\x10d\x0e\x84\x00Z\x1fY\x00n\x02X\x00d\x11d\x12\x84\x00Z#d\x13d\x14\x84\x00Z$d\x15d\x16\x84\x00Z%G\x00d\x17d\x18\x84\x00d\x18e\x1a\x83\x03Z&e\'\xa0(d\x19\xa0)\xa1\x00\xa1\x01Z*d\x1ad\x1b\x84\x00Z+d\x1cd\x1d\x84\x00Z,d\x1ed\x1f\x84\x00Z-d d!d"\x9c\x02Z.d#d$\x84\x00Z/d%d&\x84\x00Z0d\'d(\x84\x00Z1d)d*d+d,g\x04Z2d1d/d0\x84\x01Z3d\x06S\x00)2z6setuptools.command.bdist_egg\n\nBuild .egg distributions\xe9\x00\x00\x00\x00)\x01\xda\x13DistutilsSetupError)\x02\xda\x0bremove_tree\xda\x06mkpath)\x01\xda\x03log)\x01\xda\x08CodeTypeN)\x01\xda\x03six)\x03\xda\x12get_build_platform\xda\x0cDistribution\xda\x10ensure_directory)\x01\xda\nEntryPoint)\x01\xda\x07Library)\x02\xda\x07Command\xda\x1cSetuptoolsDeprecationWarning)\x02\xda\x08get_path\xda\x12get_python_versionc\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00C\x00\x00\x00s\x08\x00\x00\x00t\x00d\x01\x83\x01S\x00)\x02N\xda\x07purelib)\x01r\x0f\x00\x00\x00\xa9\x00r\x12\x00\x00\x00r\x12\x00\x00\x00\xfa@/tmp/pip-unpacked-wheel-1fei0ikx/setuptools/command/bdist_egg.py
# ...

This is where the variability in contents is coming from.
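
The same effect can be reproduced without pip. A minimal sketch, using the built-in compile() in place of the loader's source_to_code; the two directory names are made up for illustration and nothing is read from disk:

```python
import marshal

source = "def f():\n    return 1\n"

# Same source text, compiled under two different (made-up) file names.
code_a = compile(source, "/tmp/pip-unpacked-wheel-aaaaaaaa/mod.py", "exec")
code_b = compile(source, "/tmp/pip-unpacked-wheel-bbbbbbbb/mod.py", "exec")

dump_a = marshal.dumps(code_a)
dump_b = marshal.dumps(code_b)

print(code_a.co_filename)
print("dumps identical:", dump_a == dump_b)
print("path embedded in dump:", b"pip-unpacked-wheel-aaaaaaaa" in dump_a)
```

The file name is stored on the code object as co_filename, and marshal serializes it together with the bytecode.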

Dmitrii Shcherbakov (dmitriis) wrote :

source_to_code is used to load the code object which has a source path specified:

https://github.com/python/cpython/blob/0f5a28f834bdac2da8a04597dc0fc5b71e50da9d/Lib/py_compile.py#L144-L145
https://docs.python.org/3/library/importlib.html#importlib.abc.InspectLoader.source_to_code
"The path argument should be the “path” to where the source code originated from, which can be an abstract concept (e.g. location in a zip file)."

This seems to be the reason a path is added to the .pyc file:
https://docs.python.org/3/library/compileall.html#cmdoption-compileall-d
"This will appear in compilation time tracebacks, and is also compiled in to the byte-code file, where it will be used in tracebacks and other messages in cases where the source file does not exist at the time the byte-code file is executed."

The path specified in the second argument to source_to_code is included in the marshalled dump of the code object:

python -m py_compile ./bdist_egg.py
(Pdb) cfile
'./__pycache__/bdist_egg.cpython-38.pyc'
(Pdb) dfile
(Pdb) file
'./bdist_egg.py'

# vs

pip3 install -t ./ setuptools
(Pdb) cfile
'/tmp/pip-unpacked-wheel-79blnlyq/setuptools/command/__pycache__/bdist_egg.cpython-38.pyc'
(Pdb) dfile
(Pdb) file
'/tmp/pip-unpacked-wheel-79blnlyq/setuptools/command/bdist_egg.py'

(Pdb) '/tmp/pip-unpacked' in str(marshal.dumps(loader.source_to_code(source_bytes, dfile or file, _optimize=optimize)))
True

(Pdb) '/tmp/pip-unpacked' in str(marshal.dumps(loader.source_to_code(source_bytes, './setuptools/command/bdist_egg.py', _optimize=optimize)))
False

(Pdb) './setuptools/command/bdist_egg.py' in str(marshal.dumps(loader.source_to_code(source_bytes, './setuptools/command/bdist_egg.py', _optimize=optimize)))
True

The absolute paths are passed in from pip:
https://github.com/pypa/pip/blob/20.0.2/src/pip/_vendor/distlib/util.py#L596
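
To illustrate the role of the path argument, here is a minimal sketch using only py_compile. It compiles the same source from two different temporary directories, first without dfile and then with a fixed dfile. The directory prefix and the './mod.py' value are assumptions chosen for the example; this is not what pip does and not a proposed fix.

```python
import os
import py_compile
import tempfile
from py_compile import PycInvalidationMode

SOURCE = "def f():\n    return 1\n"


def build(dfile=None):
    """Write SOURCE into a fresh temporary directory, compile it, return the .pyc bytes."""
    with tempfile.TemporaryDirectory(prefix="pip-unpacked-wheel-") as tmp:
        src = os.path.join(tmp, "mod.py")
        with open(src, "w") as f:
            f.write(SOURCE)
        cfile = os.path.join(tmp, "mod.pyc")
        py_compile.compile(src, cfile=cfile, dfile=dfile, doraise=True,
                           invalidation_mode=PycInvalidationMode.CHECKED_HASH)
        with open(cfile, "rb") as f:
            return f.read()


# Without dfile the absolute (random) temporary path ends up in the .pyc.
plain_a, plain_b = build(), build()
# With a fixed dfile the recorded path no longer depends on the directory.
fixed_a, fixed_b = build("./mod.py"), build("./mod.py")

print("identical without dfile:", plain_a == plain_b)
print("identical with dfile:   ", fixed_a == fixed_b)
```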

no longer affects: python3.8 (Ubuntu)
Dmitrii Shcherbakov (dmitriis) wrote :

To sum up:

* snapcraft still needs to pass the same SOURCE_DATE_EPOCH for different parts to change the invalidation mode to PycInvalidationMode.CHECKED_HASH;
  note: it needs to be set to something after 1980 apparently, otherwise
  ValueError('ZIP does not support timestamps before 1980')
  will be raised.
* The upstream pip issue needs to be fixed https://github.com/pypa/pip/issues/8414
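
The 1980 restriction comes from the ZIP format and can be checked with the standard zipfile module. A small sketch; the helper names are hypothetical and the entry name example.py is arbitrary:

```python
import zipfile
from datetime import datetime, timezone


def zip_date_time(epoch):
    """Convert a SOURCE_DATE_EPOCH-style value to the 6-tuple that zipfile expects."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).timetuple()[:6]


def accepted_by_zip(epoch):
    """Return True if zipfile accepts a timestamp derived from this epoch value."""
    try:
        zipfile.ZipInfo("example.py", date_time=zip_date_time(epoch))
    except ValueError as exc:
        print(f"epoch {epoch}: {exc}")
        return False
    return True


print(accepted_by_zip(1591640328))  # the value used in the workaround above
print(accepted_by_zip(0))           # 1970-01-01, before the ZIP epoch
```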

description: updated
Dmitrii Shcherbakov (dmitriis) wrote :

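I tried to disable .pyc generation altogether by setting PIP_COMPILE=false and PYTHONDONTWRITEBYTECODE=false (Python treats any non-empty value of PYTHONDONTWRITEBYTECODE, including 'false', as a request not to write bytecode).

However, this turned out not to be possible, for the reasons given below.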

https://pip.pypa.io/en/stable/user_guide/#environment-variables (pip allows environment variables instead of command-line arguments)
https://docs.python.org/3/using/cmdline.html#envvar-PYTHONDONTWRITEBYTECODE (PYTHONDONTWRITEBYTECODE disables writing .pyc files at the interpreter level)

    build-environment: &python-build-environment
      - PIP_COMPILE: 'false' # disable .pyc generation in pip
      - PYTHONDONTWRITEBYTECODE: 'false' # disable .pyc generation by setup.py
      - SOURCE_DATE_EPOCH: '1591640328'

However, I still had some .pyc files created.

The reason is that snapcraft creates a venv which, in turn, uses the `ensurepip` command built into the interpreter (and also adds -I to filter out PYTHON* variables). `ensurepip` has some code to ignore any environment variable prefixed with "PIP_" which includes PIP_COMPILE.

See below:

1) venv used by snapcraft

https://github.com/snapcore/snapcraft/blob/71cebabd8155937fa329c94f7d3559b6b2e723b7/snapcraft/plugins/v2/python.py#L117-L120
"${SNAPCRAFT_PYTHON_INTERPRETER}" -m venv ${SNAPCRAFT_PYTHON_VENV_ARGS} "${SNAPCRAFT_PART_INSTALL}"

For example:
echo $SNAPCRAFT_PYTHON_INTERPRETER $SNAPCRAFT_PYTHON_VENV_ARGS $SNAPCRAFT_PART_INSTALL
python3 /root/parts/cluster/install

python3 -m venv /root/parts/cluster/install

2) ensurepip used by venv with -I option passed.

https://github.com/python/cpython/blob/58ec58a42bece5b2804b178c7a6a7e67328465db/Lib/venv/__init__.py#L291-L298
        cmd = [context.env_exe, '-Im', 'ensurepip', '--upgrade',
                                                    '--default-pip']
        subprocess.check_output(cmd, stderr=subprocess.STDOUT)

https://docs.python.org/3/using/cmdline.html#id2
"-I
... All PYTHON* environment variables are ignored, too. Further restrictions may be imposed to prevent the user from injecting malicious code."

3) ensurepip's filtering of all environment variables prefixed with "PIP_".

https://github.com/python/cpython/blob/0f5a28f834bdac2da8a04597dc0fc5b71e50da9d/Lib/ensurepip/__init__.py#L47-L56
    keys_to_remove = [k for k in os.environ if k.startswith("PIP_")]
    for k in keys_to_remove:
        del os.environ[k]
https://github.com/python/cpython/blob/0f5a28f834bdac2da8a04597dc0fc5b71e50da9d/Lib/ensurepip/__init__.py#L88-L119
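
Both effects can be observed in isolation. The sketch below starts a child interpreter with and without -I and prints its sys.dont_write_bytecode, then applies a side-effect-free version of the "PIP_" filter to a sample dictionary. The helper names are hypothetical, and the sample environment simply mirrors the build-environment values above.

```python
import os
import subprocess
import sys


def child_dont_write_bytecode(isolated):
    """Report sys.dont_write_bytecode from a child interpreter, with or without -I."""
    env = dict(os.environ, PYTHONDONTWRITEBYTECODE="false")
    cmd = [sys.executable]
    if isolated:
        cmd.append("-I")  # isolated mode: PYTHON* variables are ignored
    cmd += ["-c", "import sys; print(sys.dont_write_bytecode)"]
    return subprocess.check_output(cmd, env=env, text=True).strip()


def without_pip_vars(environ):
    """Return a copy of environ without the variables prefixed with PIP_."""
    return {k: v for k, v in environ.items() if not k.startswith("PIP_")}


print("without -I:", child_dont_write_bytecode(isolated=False))
print("with -I:   ", child_dont_write_bytecode(isolated=True))
print(without_pip_vars({"PIP_COMPILE": "false",
                        "SOURCE_DATE_EPOCH": "1591640328"}))
```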
