Collector heat can get stuck in request with no timeout

Bug #1927122 reported by Gabriel Hartmann
This bug affects 2 people
Affects: os-collect-config
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hello,

Under certain circumstances the "heat" collector gets stuck while collecting config.
This happens, for example, when the connection to heat-api has already been established and is then unexpectedly interrupted.
The heatclient then waits for the request to complete without any timeout.
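
The underlying mechanism can be illustrated outside of os-collect-config with a minimal sketch using only the Python standard library. It is not taken from heatclient; the helper names `silent_server` and `read_with_timeout` are made up for this illustration. A local server accepts the connection and then never answers, which roughly mimics a peer that went away after the connection was established. With a socket timeout set, the read gives up; with no timeout it would block like the `read(3, ...` call seen in strace.

```python
import socket
import threading


def silent_server(listener, stop):
    # Accept one connection and never send a byte, like a peer that vanished.
    conn, _ = listener.accept()
    stop.wait()
    conn.close()


def read_with_timeout(port, timeout):
    # Return "timed out" if no data arrives within `timeout` seconds.
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.settimeout(timeout)  # timeout=None would block indefinitely
        try:
            client.recv(1024)
            return "data"
        except socket.timeout:
            return "timed out"


listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
stop = threading.Event()
threading.Thread(
    target=silent_server, args=(listener, stop), daemon=True
).start()

result = read_with_timeout(listener.getsockname()[1], 0.5)
stop.set()
listener.close()
print(result)
```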

I noticed that os-collect-config still had a socket open.
After I closed the socket manually, os-collect-config continued:

[root@toolbox workdir]# ps aux|grep collect
root 2071 0.0 1.6 51048 33888 ? S Apr20 0:16 /usr/bin/python3 /usr/local/bin/os-collect-config --debug
root 1199103 0.0 0.1 10448 2308 pts/0 S+ 12:55 0:00 grep --color=auto collect
[root@toolbox workdir]# strace -p 2071
strace: Process 2071 attached
read(3, ^Cstrace: Process 2071 detached
 <detached ...>
[root@toolbox workdir]# netstat -np |grep 2071
tcp 0 0 10.XXX.XXX.XXX:44124 XXX.XXX.XXX.XXX:8004 ESTABLISHED 2071/python3
[root@toolbox workdir]# ss --kill state established src :44124
Netid Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp 0 0 10.XXX.XXX.XXX:44124 XXX.XXX.XXX.XXX:8004
[root@toolbox workdir]# netstat -np |grep 2071
[root@toolbox workdir]# strace -p 2071
strace: Process 2071 attached
select(0, NULL, NULL, NULL, {tv_sec=20, tv_usec=598993}) = 0 (Timeout)
stat("/var/lib/os-collect-config/ec2.json", {st_mode=S_IFREG|0600, st_size=651, ...}) = 0
openat(AT_FDCWD, "/var/lib/os-collect-config/ec2.json", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0600, st_size=651, ...}) = 0

As a temporary fix I'm passing a hardcoded timeout (timeout=60) to the heatclient in the code of heat.py (https://github.com/openstack/os-collect-config/blob/224af052afd3ee59911dc8809ad6758983a95884/os_collect_config/heat.py#L92).

The changed line looks like this:
                '1', endpoint, token=ks.auth_token, timeout=60)

This prevents the collector from getting stuck during such network interruptions.

It would be great to have a proper fix for this issue, however.
My suggestion would be an optional config option for the timeout, which could be set in the heat section of /etc/os-collect-config.conf and would be passed to the heatclient.
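
A rough sketch of how such an option might be read and turned into a client argument follows. It rests on assumptions: the option name `timeout` is only my proposal, the helper `heat_client_kwargs` is hypothetical, and the standard-library `configparser` stands in here for the project's real configuration handling so that the example is self-contained.

```python
import configparser


def heat_client_kwargs(conf_text):
    # Build the extra keyword arguments for the heat client from the
    # [heat] section. The "timeout" option name is hypothetical.
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # fallback=None keeps the current behaviour (no timeout) when unset.
    timeout = parser.getint("heat", "timeout", fallback=None)
    kwargs = {}
    if timeout is not None:
        kwargs["timeout"] = timeout
    return kwargs


with_timeout = heat_client_kwargs("[heat]\ntimeout = 60\n")
without_timeout = heat_client_kwargs("[heat]\n")
print(with_timeout, without_timeout)
```

The resulting dictionary could then be passed along with the existing arguments, so that installations which do not set the option behave exactly as they do today.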

Best regards
Gabriel
