Tracker bug: POST_FAILURE in upstream jobs. Storage issues at the moment.
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Fix Released | Critical | Unassigned |
Bug Description
Jobs die here:
2021-02-10 10:49:20.634194 | TASK [upload-logs-swift : Upload logs to swift]
Working with infra to resolve.
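The failing task is part of Zuul's post-run log upload, which is why the jobs end as POST_FAILURE rather than a normal failure. One mitigation raised in the IRC log below is increasing the post-run timeout. A minimal sketch of what that looks like in a Zuul job definition; the job name, parent, and the 3600-second value are illustrative only, not the actual tripleo configuration:

```yaml
# .zuul.yaml (sketch): give the post-run playbooks, including the
# upload-logs-swift step, more time before Zuul aborts them.
# Job name, parent, and the 3600-second value are assumptions.
- job:
    name: tripleo-ci-example-job
    parent: tripleo-ci-base
    # post-timeout bounds the post-run playbooks separately from the
    # main run timeout; the swift log upload happens in post-run.
    post-timeout: 3600
```

The other mitigation mentioned, shrinking the collected logs, would instead reduce the time the upload task needs in the first place.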
[07:55:31] <arxcruz|ruck> weshay|ruck: o/
[07:56:11] <weshay|ruck> zbr, fungi quick question... we have some jobs failing in the gate w/ post_failure.. the upload logs to swift is timing out. Is that something we can fix on our own by reducing the log size?
[07:56:24] <weshay|ruck> https:/
[07:56:27] <weshay|ruck> for example
[07:56:49] <weshay|ruck> or.. is there something else going on?
[07:57:07] <zbr> weshay|ruck: did the log size increase recently? last time I saw this it was caused by some swift issues, not log size.
[07:57:12] <fungi> weshay|ruck: probably, or by increasing the post-run timeout, but i'll see if the output gives me any other suspicions
[07:57:18] <weshay|ruck> zbr, checking
[07:58:21] <weshay|ruck> zbr I do track the log size.. but not a historical trend
[07:58:23] <weshay|ruck> but I should do that
[08:00:00] <weshay|ruck> ya.. noticing post_failures in a lot of non-tripleo jobs as well
[08:01:36] <zbr> weshay|ruck: last execution on collect logs looks like ~9.5min, which is a lot but far below the 30min limit I know for post.
[08:01:53] <fungi> there may be a problem with one of our swift donors, i'm trying to correlate
[08:05:22] <weshay|ruck> tox example https:/
[08:05:28] <weshay|ruck> fungi++
[08:05:29] <weshay|ruck> thanks
[08:06:57] <fungi> yeah, i'm digging in the executor debug log to see if there were any obvious errors emitted during that part of the log upload
[08:08:44] <fungi> that's actually where the console log normally ends because that's when the console log is uploaded
Thanks Infra!
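On the side note in the IRC log above about tracking log size but not its historical trend, a minimal sketch of recording the staged log size per build so the trend can be plotted later; the log directory and output CSV paths are assumptions, not the actual CI layout:

```python
#!/usr/bin/env python3
"""Hypothetical helper: append the staged log size for a build to a CSV
so the trend over time can be plotted. Paths are assumptions only."""
import csv
import datetime
import os
import pathlib

LOG_DIR = pathlib.Path(os.environ.get("LOG_DIR", "/var/log/ci-logs"))
TREND_FILE = pathlib.Path(os.environ.get("TREND_FILE", "log-size-trend.csv"))


def dir_size_bytes(root: pathlib.Path) -> int:
    """Sum the sizes of every regular file under root."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def main() -> None:
    # One row per build: UTC timestamp plus total staged log size in bytes.
    timestamp = datetime.datetime.utcnow().isoformat(timespec="seconds") + "Z"
    with TREND_FILE.open("a", newline="") as handle:
        csv.writer(handle).writerow([timestamp, dir_size_bytes(LOG_DIR)])


if __name__ == "__main__":
    main()
```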