Creating new subordinate repositories fails

Bug #1853174 reported by Colin Watson
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
turnip
Triaged
Critical
Unassigned

Bug Description

Since the migration to the new deployment last night, creation of new subordinate repositories (i.e. ones where there's already a target-default repository for their target) is failing. The traceback from turnip-api looks like this:

Traceback (most recent call last):
  File "/srv/turnip/turnip-api/payloads/f486417f5fd1674241677fe2f1afb886b3aa3b24/env/local/lib/pyt
hon2.7/site-packages/gunicorn/workers/sync.py", line 130, in handle
    self.handle_request(listener, req, client, addr)
  File "/srv/turnip/turnip-api/payloads/f486417f5fd1674241677fe2f1afb886b3aa3b24/env/local/lib/pyt
hon2.7/site-packages/gunicorn/workers/sync.py", line 171, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/srv/turnip/turnip-api/payloads/f486417f5fd1674241677fe2f1afb886b3aa3b24/env/local/lib/pyt
hon2.7/site-packages/pyramid/router.py", line 242, in __call__
    response = self.invoke_subrequest(request, use_tweens=True)
  File "/srv/turnip/turnip-api/payloads/f486417f5fd1674241677fe2f1afb886b3aa3b24/env/local/lib/pyt
hon2.7/site-packages/pyramid/router.py", line 217, in invoke_subrequest
    response = handle_request(request)
  File "/srv/turnip/turnip-api/payloads/f486417f5fd1674241677fe2f1afb886b3aa3b24/env/local/lib/pyt
hon2.7/site-packages/pyramid/tweens.py", line 46, in excview_tween
    response = view_callable(exc, request)
  File "/srv/turnip/turnip-api/payloads/f486417f5fd1674241677fe2f1afb886b3aa3b24/env/local/lib/pyt
hon2.7/site-packages/pyramid/config/views.py", line 385, in viewresult_to_response
    result = view(context, request)
  File "/srv/turnip/turnip-api/payloads/f486417f5fd1674241677fe2f1afb886b3aa3b24/env/local/lib/pyt
hon2.7/site-packages/pyramid/tweens.py", line 21, in excview_tween
    response = handler(request)
  File "/srv/turnip/turnip-api/payloads/f486417f5fd1674241677fe2f1afb886b3aa3b24/env/local/lib/pyt
hon2.7/site-packages/pyramid/router.py", line 163, in handle_request
    response = view_callable(context, request)
  File "/srv/turnip/turnip-api/payloads/f486417f5fd1674241677fe2f1afb886b3aa3b24/env/local/lib/pyt
hon2.7/site-packages/pyramid/config/views.py", line 596, in __call__
    return view(context, request)
  File "/srv/turnip/turnip-api/payloads/f486417f5fd1674241677fe2f1afb886b3aa3b24/env/local/lib/pyt
hon2.7/site-packages/pyramid/config/views.py", line 329, in attr_view
    return view(context, request)
  File "/srv/turnip/turnip-api/payloads/f486417f5fd1674241677fe2f1afb886b3aa3b24/env/local/lib/pyt
hon2.7/site-packages/pyramid/config/views.py", line 305, in predicate_wrapper
    return view(context, request)
  File "/srv/turnip/turnip-api/payloads/f486417f5fd1674241677fe2f1afb886b3aa3b24/env/local/lib/python2.7/site-packages/pyramid/config/views.py", line 355, in rendered_view
    result = view(context, request)
  File "/srv/turnip/turnip-api/payloads/f486417f5fd1674241677fe2f1afb886b3aa3b24/env/local/lib/python2.7/site-packages/pyramid/config/views.py", line 501, in _requestonly_view
    response = view(request)
  File "/srv/turnip/turnip-api/payloads/f486417f5fd1674241677fe2f1afb886b3aa3b24/env/local/lib/python2.7/site-packages/cornice/service.py", line 527, in wrapper
    response = view_()
  File "/srv/turnip/turnip-api/payloads/f486417f5fd1674241677fe2f1afb886b3aa3b24/turnip/api/views.py", line 77, in collection_post
    repo, clone_from=repo_clone, clone_refs=clone_refs)
  File "/srv/turnip/turnip-api/payloads/f486417f5fd1674241677fe2f1afb886b3aa3b24/turnip/api/store.py", line 167, in init_repo
    import_into_subordinate(sub_path, clone_from)
  File "/srv/turnip/turnip-api/payloads/f486417f5fd1674241677fe2f1afb886b3aa3b24/turnip/api/store.py", line 138, in import_into_subordinate
    os.link(from_path, sub_path)
OSError: [Errno 13] Permission denied

But it's not currently clear to me why this is getting EACCES; the permissions look fine on disk. https://pastebin.canonical.com/p/s4SYJR3rgw/ shows lp:ubuntu/+source/germinate (27656) and a failed attempt to create lp:~cjwatson/ubuntu/+source/germinate (43224). Based on the find output (assuming that find iterates over directories in the same order as os.listdir, which the output seems to suggest), the EACCES is probably on objects/pack/pack-ccdf56ab39b00807e62fc0e290def554e6935b05.pack, but there seems no clear reason why that should be the case.

I haven't been able to reproduce this on staging, and manually running ln on the probable file in question succeeded. Stracing all the gunicorn processes slowed things down to the point where repository creation attempts timed out (which is a separate bug). Maybe this is an NFS race involving new directories or something?

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
Download full text (9.2 KiB)

######### On the NFS client side ###########

# link(): SVr4, 4.3BSD, POSIX.1-2001 (but see NOTES), POSIX.1-2008.

int link(const char *oldpath, const char *newpath);

ERRORS:

EACCES Write access to the directory containing newpath is denied, or search permission is denied for one of the directories in the path prefix of oldpath or newpath.

Notes:

- Hard links, as created by link(), cannot span filesystems. Use symlink(2) if this is required.

Bugs:

On NFS filesystems, the return code may be wrong in case the NFS server performs the link creation and dies before it can say so. Use stat(2) to find out if the link got created.

----

So, unless ganesha deviates from POSIX.1-2008, OR the ganesha nfsd RPC thread crashes, EACCESS should be a result of either NEWPATH having bad permissions for NFS user OR there is an issue with directory (+x) permissions in ONE directory of EITHER oldpath or newpath (/aaa/bbb/cccc <- the full path I mean).

----

######### On the NFS SERVER side ###########

Considering the NFS server is giving us EACCESS, then...

With https://github.com/nfs-ganesha/nfs-ganesha/wiki/NFS-Ganesha-Architecture:

the EACCESS comes from the def_handle_ops abstraction (FSAL/default_methods.c): fsal_test_access() or fsal_obj_ops->test_access(). We also have the MDCACHE in between final fsal_test_access() and the caller.

The stacktrace:

fsal_print_access_by_acl()
  fsacl_check_access_acl()
    fsal_test_access()
      mdcache_test_access() -> comes from mdcache(ops->test_access()) call
        mdcache_handle_ops_init() -> initializes fsal_obj_ops for:
          mdcache_fsal_init() ops->test_access = mdcache_test_access;
            start_fsals()

would tell us the exact error if:

 if (!isFullDebug(COMPONENT_NFS_V4_ACL))
  return;

debug was enabled.

--------

For the fsal_test_access():

/* test_access
 * common (default) access check method for fsal_obj_handle objects.
 * NOTE: A fsal can replace this method with their own custom access
 * checker. If so and they wish to have an option to switch
 * between their custom and this one, it their test_access
 * method's responsibility to do that test and select this one.
 */

looks like a filesystem abstraction can replace the check method. fsal_test_access() is the DEFAULT FSAL check access but not the only one.

NOTE: I'm not sure what is the underlying FSAL you're using (ceph ?)

--------

From the FSAL abstraction layer:

/**
 * @brief Check access for a given user against a given object
 *
 * This function checks whether a given user is allowed to perform the
 * specified operations against the supplied file. The goal is to
 * allow filesystem specific semantics to be applied to cached
 * metadata.
 *
 * This method must read attributes and/or get them from a cache.
 *
 * @param[in] obj_hdl Handle to check
 * @param[in] access_type Access requested
 * @param[out] allowed Returned access that could be granted
 * @param[out] denied Returned access that would be granted
 * @param[in] owner_skip Skip test if op_ctx->creds is owner
 *
 * @return FSAL status.
 */
  fsal_status_t (*test_access)(struct fsal_obj_handle *obj_hdl,
          f...

Read more...

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Sorry for being prolix in last comment, I was reading ganesha code, by my own personal interest, to check what could have happened and the TL;DR I have is this:

TODO: Enabling Debug will give you the exact acl/uid/gid for the broken access, together with the file/directory server was trying to access. From the acl you will get a the ACES which is the FSAL ACE structure giving you the: access type, erm, uid, gid, so you will have the exact EACCESS issue.

This will get you the EACCESS you're having in FSAL layer coming from the link() operation. Independently if you're getting it from NFS3 or NFS4 clients, LogFullDebug() or LogDebug() will get you the exact EACCESS issue you're getting (Also independently of the underlying filesystem you're using).

I hope this helps as I don't know if you can reproduce the issue outside production environment.

----

Exporting:

#export COMPONENT_DISPATCH=NIV_FULL_DEBUG
#export COMPONENT_DISPATCH=NIV_DEBUG

in the systemd .service unit might do the trick

----

#define LogDebug(component, format, args...) \
 do { \
  if (unlikely(component_log_level[component] \
      >= NIV_DEBUG)) \
   DisplayLogComponentLevel(component, __FILE__,\
       __LINE__, \
        __func__, \
       NIV_DEBUG, format, ## args); \
 } while (0)

----

#define LogFullDebug(component, format, args...) \
 do { \
  if (unlikely(component_log_level[component] \
      >= NIV_FULL_DEBUG)) \
   DisplayLogComponentLevel(component, __FILE__,\
       __LINE__, \
        __func__, \
       NIV_FULL_DEBUG, \
       format, ## args); \
 } while (0)

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Alright, just saw you - IRC - already found the EACCESS root cause... g'luck =)

Revision history for this message
William Grant (wgrant) wrote :

The EACCESS/EPERM was an apparmor deny on ganesha's linkat(), with olddirfd being an open_by_handle_at on a cached handle. It's quite possibly an apparmor bug, since with apparmor put into complain mode it works fine.

Everything should be working fine now, but we still need to work out why apparmor breaks things.

Revision history for this message
Seth Arnold (seth-arnold) wrote :

Are there any ALLOWED lines in logs after the complain mode change?

Thanks

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.