Collector heat can get stuck in request with no timeout

Bug #1927122 reported by Gabriel Hartmann
This bug affects 2 people
Affects            Status  Importance  Assigned to  Milestone
os-collect-config  New     Undecided   Unassigned

Bug Description

Hello,

Under certain circumstances the "heat" collector gets stuck while collecting config.
This happens, for example, when the connection to the heat-api has already been established and is then unexpectedly interrupted.
The heatclient then waits for the request to complete without any timeout.
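
A minimal sketch of this failure mode, using only the Python standard library. The local server that accepts a connection and never replies is a stand-in for an interrupted heat-api connection, and the function name `read_times_out` is hypothetical. It is not os-collect-config code. With a timeout set, the blocked read raises an error; with `timeout=None` the same `recv` call would block indefinitely, which is consistent with the pending `read(3,` in the strace output.

```python
import socket
import threading


def read_times_out(timeout):
    """Connect to a local server that never replies.

    Returns True if the client read gave up after `timeout` seconds.
    Do not pass timeout=None here: the read would then block forever.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # let the OS pick a free port
    server.listen(1)
    accepted = []

    def accept_and_stall():
        conn, _ = server.accept()
        accepted.append(conn)  # keep the connection open, never send data

    worker = threading.Thread(target=accept_and_stall, daemon=True)
    worker.start()

    client = socket.create_connection(server.getsockname(), timeout=timeout)
    try:
        client.recv(1024)  # blocks until data arrives or the timeout expires
        return False
    except socket.timeout:
        return True
    finally:
        client.close()
        worker.join(timeout=1)
        for conn in accepted:
            conn.close()
        server.close()
```

Calling `read_times_out(0.2)` returns after roughly 0.2 seconds instead of hanging.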

I noticed that os-collect-config still had a socket open.
By manually closing the socket I was able to get os-collect-config to continue:

[root@toolbox workdir]# ps aux|grep collect
root 2071 0.0 1.6 51048 33888 ? S Apr20 0:16 /usr/bin/python3 /usr/local/bin/os-collect-config --debug
root 1199103 0.0 0.1 10448 2308 pts/0 S+ 12:55 0:00 grep --color=auto collect
[root@toolbox workdir]# strace -p 2071
strace: Process 2071 attached
read(3, ^Cstrace: Process 2071 detached
 <detached ...>
[root@toolbox workdir]# netstat -np |grep 2071
tcp 0 0 10.XXX.XXX.XXX:44124 XXX.XXX.XXX.XXX:8004 ESTABLISHED 2071/python3
[root@toolbox workdir]# ss --kill state established src :44124
Netid Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp 0 0 10.XXX.XXX.XXX:44124 XXX.XXX.XXX.XXX:8004
[root@toolbox workdir]# netstat -np |grep 2071
[root@toolbox workdir]# strace -p 2071
strace: Process 2071 attached
select(0, NULL, NULL, NULL, {tv_sec=20, tv_usec=598993}) = 0 (Timeout)
stat("/var/lib/os-collect-config/ec2.json", {st_mode=S_IFREG|0600, st_size=651, ...}) = 0
openat(AT_FDCWD, "/var/lib/os-collect-config/ec2.json", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0600, st_size=651, ...}) = 0

As a temporary fix I'm passing a hardcoded timeout (timeout=60) to the heatclient in the code of heat.py (https://github.com/openstack/os-collect-config/blob/224af052afd3ee59911dc8809ad6758983a95884/os_collect_config/heat.py#L92).

The changed line looks like this:
                '1', endpoint, token=ks.auth_token, timeout=60)

This prevents the collector from getting stuck during such network interruptions.

However, it would be great to have a proper fix for this issue.
My suggestion would be an optional config option for the timeout, which can be set in the heat section of /etc/os-collect-config.conf and is then passed to the heatclient.
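
A rough sketch of how such an option could be handled, assuming a hypothetical option name `timeout` in the `[heat]` section. It uses the standard-library `configparser` only to stay self-contained; the actual option parsing in os-collect-config may look different, and the helper name `heat_client_kwargs` is made up for illustration. The idea is that the timeout is only passed on when it is configured, so the current behaviour stays the default.

```python
import configparser


def heat_client_kwargs(conf_text, auth_token):
    """Build keyword arguments for the heatclient call.

    `timeout` is a hypothetical, optional option in the [heat] section.
    If it is absent, no timeout is passed (current behaviour).
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)

    kwargs = {"token": auth_token}
    # fallback=None also covers a missing [heat] section
    timeout = parser.getfloat("heat", "timeout", fallback=None)
    if timeout is not None:
        kwargs["timeout"] = timeout
    return kwargs
```

The resulting dictionary could then be expanded into the existing client call, for example `Client('1', endpoint, **kwargs)`, in place of the hardcoded `timeout=60`.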

Best regards
Gabriel
