Shotgun crashes when tries to collect a broken link

Bug #1541390 reported by Maksim Malchuk
This bug affects 1 person
Affects             Status        Importance  Assigned to          Milestone
Fuel for OpenStack  Won't Fix     Critical    Alexander Kislitsky
8.0.x               Fix Released  Critical    Alexander Kislitsky

Bug Description

https://product-ci.infra.mirantis.net/job/8.0.ubuntu.bvt_2/465/
failed with an error during puppet deploy on nodes

1) the diagnostic snapshot is missing at least:
 - nailgun.test.domain.local/var/log/astute
 - nailgun.test.domain.local/var/log/docker-logs

2) log files differ from the original files on the slave nodes:
 - node-6/var/log/puppet.log (for example, truncated in the middle of the file)

snapshot attached

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :
Dmitry Pyzhov (dpyzhov)
tags: added: area-ib
tags: added: area-library
removed: area-ib
Changed in fuel:
assignee: nobody → Sergey Novikov (snovikov)
Revision history for this message
Sergey Novikov (snovikov) wrote :

I tried to generate a diagnostic snapshot via a simple REST API call and got a snapshot with missing log files, although the "dump" task was marked as "ready".

Changed in fuel:
status: New → Confirmed
assignee: Sergey Novikov (snovikov) → Fuel Python Team (fuel-python)
Dmitry Pyzhov (dpyzhov)
tags: added: area-python
removed: area-library
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Sorry, wrong link. The last BVT where we can find docker-logs:

- https://product-ci.infra.mirantis.net/job/8.0.ubuntu.bvt_2/464

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

I think something broke in nailgun, because:

 - https://github.com/openstack/fuel-web/blob/master/nailgun/nailgun/settings.yaml#L837-L841

this specifies that all files in /var/log should be included except /var/log/atop, but the latter is present in the broken snapshots.

Several changes to nailgun during the last two weeks could have affected this:

 - https://github.com/openstack/fuel-web/commit/32eb8a8b0aa86a47ccf9a598c4230bfb06b3284e

/etc/fuel_build_id and /etc/fuel_build_number do not exist in the snapshot

 - https://github.com/openstack/fuel-web/commit/a8f8f880fb03a48912260705d2c8241c27ee1fed

this commit changes where the archive is created - maybe the problem is there

There are also changes to logging:

 - https://github.com/openstack/fuel-web/commit/31b3dbb7ae4598436f49db38f7aa58d7dfb5e049
 - https://github.com/openstack/fuel-web/commit/45950e02988bd3092168d1a0bdf2acc198bb6eff

maybe the logs weren't flushed while being written to disk?

Revision history for this message
Fedor Zhadaev (fzhadaev) wrote :

Looks like this bug appeared after https://github.com/openstack/fuel-web/commit/a8f8f880fb03a48912260705d2c8241c27ee1fed. I tried to manually undo this change on my env and got a 'normal' snapshot.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/275960

Changed in fuel:
assignee: Fedor Zhadaev (fzhadaev) → Alexey Shtokolov (ashtokolov)
status: Confirmed → In Progress
Revision history for this message
Alexey Shtokolov (ashtokolov) wrote : Re: bvt's diagnostic snapshot doesn't contain logs

Looks like a permissions issue for shotgun due to the typo in https://review.openstack.org/#/c/275159/
    /var/dump/ -> /var/log/dump

Custom ISO with https://review.openstack.org/275960 :
http://jenkins-product.srt.mirantis.net:8080/job/custom_8.0_iso/

Revision history for this message
Alexey Shtokolov (ashtokolov) wrote :

Custom ISO Build #1786 ^^

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/275969

Changed in fuel:
assignee: Alexey Shtokolov (ashtokolov) → Maksim Malchuk (mmalchuk)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/275969
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=dfad93c21744f49e26d902240246de5c622a35af
Submitter: Jenkins
Branch: master

commit dfad93c21744f49e26d902240246de5c622a35af
Author: Maksim Malchuk <email address hidden>
Date: Thu Feb 4 01:59:57 2016 +0300

    Fix path to diagnostic snapshot

    Fix typo in https://review.openstack.org/#/c/273094/
    Closes-bug: 1541390

    Change-Id: I1961bd18a36b20015de184f8df3a66714b85afdd

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/276030

Revision history for this message
Fedor Zhadaev (fzhadaev) wrote : Re: bvt's diagnostic snapshot doesn't contain logs

Status returned to 'In progress' because the patch for fuel-web is still in review.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/275960
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=e6236895707bc21779ac14f1ffc6a10b47c42bad
Submitter: Jenkins
Branch: master

commit e6236895707bc21779ac14f1ffc6a10b47c42bad
Author: Alexey Shtokolov <email address hidden>
Date: Thu Feb 4 01:38:18 2016 +0300

    Fix path to diagnostic snapshot

    Fix typo in https://review.openstack.org/#/c/275159/
    Closes-bug: 1541390

    Change-Id: Ib97492595770cdca7d71bfcec757458299b18714

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/276119

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/8.0)

Reviewed: https://review.openstack.org/276030
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=93edeba714699a19da4be21cc02757d92c80e3c1
Submitter: Jenkins
Branch: stable/8.0

commit 93edeba714699a19da4be21cc02757d92c80e3c1
Author: Maksim Malchuk <email address hidden>
Date: Thu Feb 4 01:59:57 2016 +0300

    Fix path to diagnostic snapshot

    Fix typo in https://review.openstack.org/#/c/273094/
    Closes-bug: 1541390

    Change-Id: I1961bd18a36b20015de184f8df3a66714b85afdd
    (cherry picked from commit dfad93c21744f49e26d902240246de5c622a35af)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (stable/8.0)

Change abandoned by Maksim Malchuk (<email address hidden>) on branch: stable/8.0
Review: https://review.openstack.org/276119
Reason: due https://review.openstack.org/#/c/276156/

Revision history for this message
Ihor Kalnytskyi (ikalnytskyi) wrote : Re: bvt's diagnostic snapshot doesn't contain logs

Well, the root cause is that shotgun tries to resolve symlinks (thanks to the Fabric library, which doesn't let us configure that), and when it finds a broken one it simply fails.

Sometimes we have a broken "atop_current" symlink in the "/var/log/atop/" folder which points to a non-existent file. We fail there, and don't even try to download the other sub-folders of "/var/log", including "docker-logs".

If we restart the atop service on the master node, the symlink will be fixed.

Since we can't fix Fabric internals, we must do something to prevent broken links.
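
To see which links would trip a recursive download, the broken symlinks under a directory can be listed before collecting it. A minimal Python sketch, assuming local filesystem access on the master node; `find_broken_symlinks` is a hypothetical helper, not part of shotgun or Fabric:

```python
import os


def find_broken_symlinks(root):
    """Return sorted paths under root that are symlinks with a missing target."""
    broken = []
    # followlinks=False: never descend into symlinked directories.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Broken links show up in filenames, links to directories in dirnames.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)


if __name__ == "__main__":
    for path in find_broken_symlinks("/var/log"):
        print(path)
```

On an affected master node this would be expected to report the "atop_current" link described above, but that depends on the state of the environment.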

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/276218

Revision history for this message
Michael Polenchuk (mpolenchuk) wrote : Re: bvt's diagnostic snapshot doesn't contain logs

The commit mentioned above is not the root cause. The symlink is created as before. The question is why atop doesn't create the actual file.
Also, I'm not sure that #/c/276218 will resolve the issue.
I'd rather go through the files and find the broken links to be sure.

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

This is a problem with reverts in the CI, so it affects only our systems and should be fixed elsewhere.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

The following cron job creates the symlink:
/etc/cron.daily/atop_retention

It may create a broken link if another cron job was skipped:

[root@nailgun ~]# cat /etc/cron.d/atop
# start atop daily at midnight
0 0 * * * root /bin/systemctl try-restart atop.service > /dev/null 2>&1 || :

Frankly speaking - it's a mess. But we need to fix this asap without huge refactoring due to HCF.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

https://review.openstack.org/276258 - with this patch we at least won't try to create a broken link, so it should fix the problem
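
The idea of that patch (create the symlink only when the destination exists, and remove it if it is broken) can be illustrated as follows. This is only a Python sketch of the logic, not the actual fuel-library change, which is not reproduced here; `ensure_current_link` and its arguments are hypothetical names:

```python
import os


def ensure_current_link(target, link):
    """Point link at target only if target exists; drop link if it is broken.

    Returns True when the link was (re)created, False otherwise.
    """
    if os.path.exists(target):
        # Build the new link under a temporary name, then rename it over
        # the old one so readers never observe a missing link.
        tmp = link + ".tmp"
        if os.path.lexists(tmp):
            os.remove(tmp)
        os.symlink(target, tmp)
        os.replace(tmp, link)
        return True
    # Target is missing: do not create anything, and clean up a stale link.
    if os.path.islink(link) and not os.path.exists(link):
        os.remove(link)
    return False
```

With this shape, a skipped atop restart leaves no link at all rather than a link to a file that was never created.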

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/276258

Revision history for this message
Andrey Maximov (maximov) wrote : Re: bvt's diagnostic snapshot doesn't contain logs

Folks, shouldn't we downgrade this bug to High?
It doesn't affect deployment, and it is not a security vulnerability. We know a workaround, and in real deployments our users often have to download diag snapshots manually.
So my proposal: downgrade it to High and fix it in MU1.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

I've checked the changes from 'Fix path to diagnostic snapshot' and we have a problem. The archive is generated in the container, and we add a link to it on the master node in /var/www/nailgun/dump/last

[root@nailgun dump]# cat last
/var/dump/fuel-snapshot-2016-02-04_14-39-12.tar.xz

But this file is present only in the mcollective container and not in the host system.

Problem iso: http://jenkins-product.srt.mirantis.net:8080/job/custom_8.0_iso/1791/
BVT without logs with this iso: http://jenkins-product.srt.mirantis.net:8080/job/8.0.custom.ubuntu.bvt_2/625/

Revision history for this message
Andrey Maximov (maximov) wrote :

Downgraded to High at the triage meeting with QA.

Revision history for this message
Andrey Maximov (maximov) wrote :

Rationale: this issue doesn't affect deployment, only debugging convenience.

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Vladimir, patches 'Fix path to diagnostic snapshot' were reverted. Please have a look again.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Michael Polenchuk (<email address hidden>) on branch: master
Review: https://review.openstack.org/276218
Reason: won't resolve

Revision history for this message
Nastya Urlapova (aurlapova) wrote : Re: bvt's diagnostic snapshot doesn't contain logs

@Andrey, the priority for this issue is Critical without any doubt: we are not able to debug any env, and it is a blocker for master. How can it be High?

Revision history for this message
Andrey Maximov (maximov) wrote :

@Nastya,

https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Confirm_and_triage_bugs
Critical = can't deploy anything and there's no trivial workaround; data loss; or security vulnerability

We can deploy, there is a workaround, there is no data loss, and this is not a security vulnerability, so that's why we decided to downgrade it to High.

tags: added: swarm-blocker
summary: - bvt's diagnostic snapshot doesn't contain logs
+ BVT's diagnostic snapshot doesn't contain logs
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

@Andrey, you can easily revert the patch, which brought a regression.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/276258
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=18ec584ab23aaa54cc7b82b80b3701593c63a89c
Submitter: Jenkins
Branch: master

commit 18ec584ab23aaa54cc7b82b80b3701593c63a89c
Author: Aleksandr Didenko <email address hidden>
Date: Thu Feb 4 15:14:43 2016 +0100

    Ensure atop_current is not a broken symlink

    We should create symlink only when destination file actually exists
    and remove it if it is broken.

    Change-Id: Iaec6cf2b1b32578e017641c7971bc24433b992d8
    Related-bug: #1541390

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (stable/8.0)

Related fix proposed to branch: stable/8.0
Review: https://review.openstack.org/276609

Revision history for this message
Vladimir Sharshov (vsharshov) wrote : Re: BVT's diagnostic snapshot doesn't contain logs

I can confirm that logs are now being dumped. As far as I understand, we have 2 bugs in one bug description.
The first, Critical: no 'docker-logs' in the dumped logs, or no dumped logs at all. It was successfully fixed yesterday evening (all changes were merged and work).

The second: atop_current, which happens only sometimes and only in CI jobs. The second bug looks like High and should be split into a new bug to avoid misunderstandings.

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

No, Vladimir, this is one bug: the broken atop_current link causes shotgun to leave many files out of the diagnostic snapshot.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (stable/8.0)

Reviewed: https://review.openstack.org/276609
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=40195cd45798612aedeff5ba0299c756842d16cd
Submitter: Jenkins
Branch: stable/8.0

commit 40195cd45798612aedeff5ba0299c756842d16cd
Author: Aleksandr Didenko <email address hidden>
Date: Thu Feb 4 15:14:43 2016 +0100

    Ensure atop_current is not a broken symlink

    We should create symlink only when destination file actually exists
    and remove it if it is broken.

    Change-Id: Iaec6cf2b1b32578e017641c7971bc24433b992d8
    Related-bug: #1541390
    (cherry picked from commit 18ec584ab23aaa54cc7b82b80b3701593c63a89c)

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote : Re: BVT's diagnostic snapshot doesn't contain logs

I'm reopening the bug for 9.0 and changing the priority from Critical to High. We still need to handle the case when shotgun fails because of broken links.

summary: - BVT's diagnostic snapshot doesn't contain logs
+ Shotgun crashes when tries to collect a broken link
no longer affects: fuel/mitaka
Revision history for this message
Vladimir Khlyunev (vkhlyunev) wrote :
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
importance: High → Critical
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Alexander Kislitsky (akislitsky)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/277376

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Alexander Kislitsky (akislitsky) wrote :

It seems that Shotgun handles the combination of exclude and dir incorrectly.
I tested the issue on a live env as follows:

- dump snapshot
- check snapshot doesn't contain docker-logs
- add settings for docker-logs to the Nailgun settings.yaml
- restart nailgun
- dump snapshot
- check snapshot contains docker-logs

I guess it's too risky to introduce changes into Shotgun now. For the moment we have a hotfix, but we need to fix it properly in Shotgun for 9.0.
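
The "check snapshot contains docker-logs" step above can be scripted instead of inspecting the archive by hand. A minimal sketch, assuming the snapshot is a tar archive (the thread mentions a `.tar.xz` file) that has been downloaded locally; `snapshot_contains` is a hypothetical helper:

```python
import tarfile


def snapshot_contains(archive_path, component):
    """Return True if any member path has `component` as a path element."""
    # "r:*" lets tarfile detect the compression (xz, gz, ...) on its own.
    with tarfile.open(archive_path, "r:*") as tar:
        return any(component in name.split("/") for name in tar.getnames())
```

For example, `snapshot_contains(path_to_snapshot, "docker-logs")` would be run once before and once after changing settings.yaml and restarting nailgun.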

Revision history for this message
Ihor Kalnytskyi (ikalnytskyi) wrote :

Alexander K.,

It has nothing to do with how Shotgun handles it. We ask Fabric to get /var/log, and it doesn't support any *excludes*. So it fails to download the `atop` files, even though we aren't interested in them.

BTW, I thought we fixed that broken atop link on the library side. Does the ISO contain that patch?

Revision history for this message
Ihor Kalnytskyi (ikalnytskyi) wrote :

Well, I was wrong. It's indeed about Fabric. My investigations reveal the following things:

- We store snapshots in `/var/log/dump`
- Shotgun tries to download the whole `/var/log`
- Although Shotgun has `dump/` in the exclusion list for `/var/log`, the actual exclusion takes place after the fact (once everything has been downloaded)
- If there are files in `/var/log/dump/`, there's a chance that the path to them is too long and that Fabric fails to retrieve them.
- As we know from my previous explanation above, Fabric tries to download `/var/log` file by file recursively, but if it fails to download even one file, the whole download is aborted.
- So it fails to download a file due to the long path (File name too long) and we miss docker-logs in the snapshot.

So basically, I see two workarounds:

* Do not use the whole `/var/log` and point to `/var/log/docker-logs` + some other helpful stuff.
* Move `/var/log` to the beginning of objects in shotgun config, so dump/ will be probably empty and we don't catch that error.

Meanwhile, we MUST consider re-implementing shotgun as part of bug 1543119.
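
For comparison, here is what the two missing behaviours look like in a collector: exclusions applied during the walk rather than after the download, and a single failing file recorded instead of aborting everything. This is a local-filesystem Python sketch only; it does not use Fabric or reflect shotgun's real code, and `copy_tree_tolerant` is a hypothetical name:

```python
import os
import shutil


def copy_tree_tolerant(src, dst, exclude=()):
    """Copy src into dst, skipping excluded relative paths up front.

    Returns a list of (path, error) pairs for files that could not be copied.
    """
    exclude = {os.path.normpath(p) for p in exclude}
    failed = []
    for dirpath, dirnames, filenames in os.walk(src):
        rel = os.path.relpath(dirpath, src)
        # Prune excluded directories in place so os.walk never enters them.
        dirnames[:] = [
            d for d in dirnames
            if os.path.normpath(os.path.join(rel, d)) not in exclude
        ]
        target_dir = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            if os.path.normpath(os.path.join(rel, name)) in exclude:
                continue
            source = os.path.join(dirpath, name)
            try:
                shutil.copy2(source, os.path.join(target_dir, name))
            except OSError as exc:
                # Broken symlinks, over-long names, etc.: note and continue.
                failed.append((source, str(exc)))
    return failed
```

Called as `copy_tree_tolerant("/var/log", some_dir, exclude={"dump"})`, the `dump/` directory is never visited, and a broken link is reported in the returned list while the remaining sub-folders are still copied.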

Revision history for this message
Alexander Kislitsky (akislitsky) wrote :

This bug doesn't affect 9.0 due to the removal of docker containers: https://review.openstack.org/#/q/topic:bp/get-rid-docker-containers.

We no longer have /var/log/docker-logs on the master. Checked on ISO 199: https://product-ci.infra.mirantis.net/view/9.0-liberty/job/9.0-liberty.all/199/

Changed in fuel:
status: In Progress → Won't Fix
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/277806

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (master)

Change abandoned by Alexander Kislitsky (<email address hidden>) on branch: master
Review: https://review.openstack.org/277376
Reason: Doesn't affect 9.0 after docker containers removing.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (stable/8.0)

Reviewed: https://review.openstack.org/277806
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=ed2e0cde96ae7bc064e689f7409470e69c57772e
Submitter: Jenkins
Branch: stable/8.0

commit ed2e0cde96ae7bc064e689f7409470e69c57772e
Author: Alexander Kislitsky <email address hidden>
Date: Tue Feb 9 15:02:23 2016 +0300

    docker-logs added to snapshot

    Fabric failes on long file names due to copy docker-logs
    subdirectories into snapshot.
    As hotfix we implicitly set configuration for docker-logs in the Nailgun
    settings.yaml.

    Change-Id: I016ae5182b87b93c0ed474608feab566c84d8d2d
    Closes-Bug: #1541390

tags: added: on-verification
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

verified on 8.0-552 that docker-logs are present
created another bug with generating snapshots https://bugs.launchpad.net/fuel/+bug/1544966

tags: removed: on-verification