Command exec on kolla_toolbox container fails sporadically

Bug #1763533 reported by Sameer Goel
This bug affects 3 people
Affects        Status   Importance  Assigned to  Milestone
kolla-ansible  Invalid  Undecided   Unassigned

Bug Description

Kolla - Dev tip 04/09 (debian, source, queens-20180412)
Kolla-Ansible - Dev tip 04/09

Debian Stretch + Kernel 4.14-arm64

Kolla Images - https://hub.docker.com/r/linaro tag: queens-20180412
Eg: https://hub.docker.com/r/linaro/debian-source-mariadb/tags/

While deploying using kolla-ansible I see an error with the following signature:
"msg": "Can not parse the inner module output: rpc error: code = 14 desc = grpc: the connection is unavailable\r\n"

The above happens at different tasks, but when I execute the failing command manually afterwards it works fine.

I do not see this failure for an AIO deployment if I add a 1s sleep in kolla_toolbox.py:

===================================================
    for exp in [JSON_REG, NON_JSON_REG]:
        m = exp.match(output)
        if m:
            inner_output = m.groupdict().get('stdout')
            break
    else:
        module.fail_json(
            msg='Can not parse the inner module output: %s' % output)

    time.sleep(1)  # <<<<< added
==================================================

I checked that the kolla_toolbox container is running, and I can start a shell in it after the above failure. As seen in the log, other commands before this one ran fine. Usually this failure indicates an issue with the docker container, so I'm wondering why kolla_toolbox would return random failures.

Sameer Goel (sameergoel) wrote :

Marcin Juszkiewicz (hrw) wrote :

Linaro ERP 17.12 == Debian 'stretch' + 4.14 kernel

Jeffrey Zhang (jeffrey4l) wrote :

The gRPC error should be raised by the docker daemon.
This is weird; it sounds like the docker daemon cannot handle the request. I haven't seen a similar issue.

1. Could you check your docker engine logs?
2. Are you sure the `time.sleep(1)` is added after the `module.fail_json` line? That does not make sense, because it would never run.

Sameer Goel (sameergoel) wrote :

I do not see the error if I put in the 1s sleep. The sleep only makes sure that there is a 1s delay after every successful execution of a command in kolla_toolbox. As I said in the bug report this does seem to be an error with something in the docker framework not being ready.
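
The control flow being discussed can be illustrated with a small standalone sketch. It assumes, as is the case for Ansible's `module.fail_json`, that the failure call terminates the process instead of returning. The `fail_json` function, the `PATTERN` regular expression, and the `calls` list below are hypothetical stand-ins for illustration, not the code from kolla_toolbox.py.

```python
import re

# Simplified stand-in for the patterns used in kolla_toolbox.py (assumption).
PATTERN = re.compile(r'^ok: (?P<stdout>.*)$')

calls = []  # records which steps ran, for illustration only


def fail_json(msg):
    # Hypothetical stand-in: like AnsibleModule.fail_json, it never returns.
    raise SystemExit(msg)


def parse(output):
    for exp in [PATTERN]:
        m = exp.match(output)
        if m:
            inner_output = m.group('stdout')
            break
    else:
        # Reached only when the loop ended without a break (no pattern matched).
        fail_json('Can not parse the inner module output: %s' % output)
    # Anything placed here runs only on the success path.
    calls.append('post-success step')
    return inner_output
```

With this structure, a statement placed after the `for`/`else` block is reached only when the output was parsed, which matches the description of a delay after every successful command.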

I'll post the docker daemon log.

Sameer Goel (sameergoel) wrote :

Attached dockerd logs.

Sameer Goel (sameergoel) wrote :

root@linaro-erp-os-control:~# docker-containerd -v
containerd version 0.2.3 commit: 6e23458c129b551d5c9871e5174f6b1b7f6d1170

Sameer Goel (sameergoel)
description: updated
Sameer Goel (sameergoel) wrote :

On further triage, the issue seems to be related to the following line in the containerd log:
time="2018-04-18T11:27:35.765002258-06:00" level=fatal msg="containerd: epoll wait" error="Failed to wait epoll"

The monitor thread exits because the epoll_wait(fd, events, 128, -1) system call fails in epoll_arm64.go.
The call is failing with an error code that is not EINTR.
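
For context, the common pattern around a blocking wait is to retry only when the call was interrupted by a signal (EINTR) and to treat any other error as fatal. The following is a minimal Python sketch of that pattern, not containerd's Go code; `wait_fn` is a hypothetical placeholder for the blocking wait call.

```python
import errno


def wait_ignoring_eintr(wait_fn):
    """Call wait_fn until it returns; retry only when interrupted by a signal."""
    while True:
        try:
            return wait_fn()
        except OSError as e:
            if e.errno == errno.EINTR:
                continue  # interrupted system call: safe to retry
            raise  # any other errno is propagated as a fatal error
```

Under this pattern, an error code other than EINTR ends the wait loop, which is consistent with the fatal log line above.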

Wondering if anyone else has seen the above issue.

Dockerd respawns containerd, but the operation (exec) during which the above error was seen is not retried.
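
As a possible caller-side workaround, the exec could be retried when its output carries the transient gRPC message. This is only a sketch under assumptions: `run_exec` is a hypothetical callable that performs the exec and returns its raw output, the retry count and delay are arbitrary, and nothing here is part of kolla-ansible. It is only appropriate for commands that are safe to run more than once.

```python
import time

# Error text taken from the failure message in the bug description.
TRANSIENT_MARKER = 'grpc: the connection is unavailable'


def run_with_retry(run_exec, attempts=3, delay=1.0, sleep=time.sleep):
    """Run run_exec() and retry while its output carries the transient marker."""
    output = run_exec()
    for _ in range(attempts - 1):
        if TRANSIENT_MARKER not in output:
            break
        sleep(delay)  # give dockerd time to respawn containerd
        output = run_exec()
    return output
```

If every attempt fails, the last error output is returned unchanged, so the caller still sees the original failure.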

Gema Gomez (gema) wrote :

What is the error code the call is failing with?

Harry Kominos (hkominos) wrote :

Some version of this is also affecting me on aarch64 with kernel 4.14.0-49.10.1 on CentOS.

During the Apex installations, at some random point the docker-containerd process dies with the same error: level=fatal msg="containerd: epoll wait" error="Failed to wait epoll"

Restarting the process seems to bring it back to life, but in the meantime you lose control of the containers and the installation fails.

I should mention that this happens with all the docker versions that I have found in the CentOS and delorean repos, even with the ones that I compile myself.

Harry Kominos (hkominos) wrote :
Changed in kolla-ansible:
status: New → Invalid