Ostf tests start to fail with 500 error in tests fill root filesystem

Bug #1525214 reported by Tatyanka
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Fix Released
High
Roman Podoliaka
8.0.x
Fix Released
High
Roman Podoliaka
Mitaka
Fix Released
High
Roman Podoliaka

Bug Description

fill root filesystem on primary controller and check pacemaker on 3 Controller, 3 Compute & Ceph on Neutron/VLAN
Scenario:
0. Setup master node
1. Configuration releases
2. Bootstrap slave and make snapshot
3. Revert bootstrapped nodes
4. Create Fuel Environment
5. Add nodes to environment
6. Run network checker
7. Deploy environment
8. Run network checker
9. Run health checker
10. Load existent environment from snapshot or save it
11. Get controllers initial status in pacemaker
12. Filling root filesystem on primary controller
13. Check stopping pacemaker resources
14. Run health checker
15. Fill root more to below rabbit disk free limit
16. Check stopping pacemaker resources
17. Run health checker
18. Clean up space on root filesystem on primary controller
19. Check starting pacemaker resources
20. Run health checker Sanity, Smoke and HA

Actual:
Ostf failed on step 15
Failed 8 OSTF tests; should fail 0 tests. Names of failed tests:
- Request snapshot list (failure) Snapshots list is unavailable. Please refer to OpenStack logs for more details.
- Check create, update and delete image actions using Glance v1 (failure) Error creating image. Please refer to Openstack logs for more information. Please refer to OpenStack logs for more details.
- Check create, update and delete image actions using Glance v2 (failure) Error creating image. Please refer to Openstack logs for more information. Please refer to OpenStack logs for more details.
- Create volume and boot instance from it (failure) Image can not be retrieved. Please refer to OpenStack logs for more details
- Create volume and attach it to instance (failure) Failed to get to expected status. In error state. Please refer to OpenStack logs for more details.
- Check network connectivity from instance via floating IP (failure) Image can not be retrieved. Please refer to OpenStack logs for more details
- Launch instance (failure) Image can not be retrieved. Please refer to OpenStack logs for more details
- Launch instance, create snapshot, launch instance from snapshot (failure) Image can not be retrieved. Please refer to OpenStack logs for more details

Os services returns 500

lientException: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'glanceclient.exc.HTTPUnauthorized'> (HTTP 500) (Request-ID: req-ca169ecd-41aa-4ca1-9daa-3db32410d538)

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "283"
  build_id: "283"
  fuel-nailgun_sha: "073beb9f02e103e8f78abe9425d4c9feb5fab6e4"
  python-fuelclient_sha: "f96659066e522e28b389de3cc685f6f2aacca3da"
  fuel-agent_sha: "2f18b7596bc7da79d2f28c34f42620b2090d8a35"
  fuel-nailgun-agent_sha: "a33a58d378c117c0f509b0e7badc6f0910364154"
  astute_sha: "f7df18ab6b2089649abcfa8cf667ffd6b332be1a"
  fuel-library_sha: "66fc43373039a23fb63c26e7919a60b3560e0db1"
  fuel-ostf_sha: "632730169e8c01afe7fd5d78a898f00d4646358b"
  fuel-mirror_sha: "31b9df814960ec69b644ca9b689dacec0c7e10a1"
  fuelmenu_sha: "680b720291ff577f4c058cee25f85e563c96312e"
  shotgun_sha: "a0bd06508067935f2ae9be2523ed0d1717b995ce"
  network-checker_sha: "a3534f8885246afb15609c54f91d3b23d599a5b1"
  fuel-upgrade_sha: "1e894e26d4e1423a9b0d66abd6a79505f4175ff6"
  fuelmain_sha: "1577a306c2c9e7bd12f28c0e16cf3652997da004"

Tags: area-qa
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Changed in fuel:
importance: Undecided → High
importance: High → Medium
assignee: nobody → Fuel Enhancements (fuel-enhancements-team)
milestone: none → 8.0
summary: - Ostf tests start to fail with 550 error in tests fill root filesystem
+ Ostf tests start to fail with 500 error in tests fill root filesystem
Changed in fuel:
status: New → Confirmed
tags: added: area-python ostf
tags: added: swarm-fail-driver
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Moved to High, because we have to fix it in 8.0 milestone.

Changed in fuel:
importance: Medium → High
Changed in fuel:
assignee: Fuel Enhancements (fuel-enhancements-team) → Alex Schultz (alex-schultz)
Dmitry Pyzhov (dpyzhov)
tags: added: team-enhancements
Revision history for this message
Alex Schultz (alex-schultz) wrote :

In looking at the test and the history, it seems that this test passes sometimes and sometimes it doesn't. I think we need to add an additional wait after all services have been stopped prior to starting the health checks.

https://github.com/openstack/fuel-qa/blob/master/system_test/tests/strength/filling_root.py#L46-L47

I believe we are hitting the resources before the pools have completely cleared the controller that is no longer running.

Changed in fuel:
status: Confirmed → In Progress
Dmitry Pyzhov (dpyzhov)
tags: removed: swarm-fail-driver
Revision history for this message
Alex Schultz (alex-schultz) wrote :

I've reproduced this and it appears to be an issue with the token used with testing sometimes. The errors for the failed services seems to point to an invalid token being used.

2015-12-24 17:49:17.065 27310 DEBUG keystoneclient.session [-] REQ: curl -g -i -X GET http://10.109.11.3:5000/v3/auth/tokens -H "X-Subject-Token: {SHA1}89dcee625240d1b957500f
2db6828f9a7e9d181c" -H "User-Agent: python-keystoneclient" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}488a4adea195462d09c734ec65265eb9de57da33" _http_log_request /
usr/lib/python2.7/dist-packages/keystoneclient/session.py:198
2015-12-24 17:49:17.213 27310 DEBUG keystoneclient.session [-] RESP: [404] content-length: 98 vary: X-Auth-Token server: Apache connection: close date: Thu, 24 Dec 2015 17:49
:17 GMT content-type: application/json x-openstack-request-id: req-e402c5ff-0051-4e68-a5ba-840e9a2fb397
RESP BODY: {"error": {"message": "This is not a recognized Fernet token", "code": 404, "title": "Not Found"}}

Revision history for this message
Alex Schultz (alex-schultz) wrote :

So i've spent a bunch of time trying to track down this issue and it appears to be somewhere in the nova/glance/keystone world where the token being used by nova to talk to glance is being reported as not a valid token. I can't figure out how to debug this further as it seems to be something within the openstack clients or services and not specific to fuel library. I'm wondering if it might be because of some sort of time skew with the tokens as the time on the keystone server is behind the the glance/nova servers. If a token is generated on a controller that is ahead from the controller that validates it, will the controller that is behind report the token as invalid because it's in the future?

For example here are the logs for one of the ostf glance failures.
Nova:
http://paste.openstack.org/show/482896/

Glance:
http://paste.openstack.org/show/482897/

keystone:
http://paste.openstack.org/show/482898/

The token issue seems to be pretty consistent in the failure related to the filling root tests.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

assigning to mos-nova to take a look as the nova api is what is throwing the error and I've verified that we are passing what appears to be a valid token

Changed in fuel:
status: In Progress → Confirmed
assignee: Alex Schultz (alex-schultz) → MOS Nova (mos-nova)
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Alex, I believe you are right and the problem is in the clock skew (based on logs you provided the host with keystone-api is more than 1 minute behind) and keystone does the following checks of tokens in cryptography:

https://github.com/pyca/cryptography/blob/master/src/cryptography/fernet.py#L94-L95

Max clock skew allowed is 60 seconds:

https://github.com/pyca/cryptography/blob/master/src/cryptography/fernet.py#L26

Thus, tokens from the future are considered to be invalid.

QA team, could you please make sure we sync time in the test?

Changed in fuel:
assignee: MOS Nova (mos-nova) → Fuel QA Team (fuel-qa)
tags: added: area-qa
removed: area-python team-enhancements
tags: removed: ostf
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

@Roma, yep I can confirm that we sync time in the test.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

@Nastya, I know that

But the fact is the time is not synced, as we see in logs provided by Alex:

Nova:
http://paste.openstack.org/show/482896/

Glance:
http://paste.openstack.org/show/482897/

keystone:
http://paste.openstack.org/show/482898/

Keystone node is more than 1 minute behind.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

All services work properly: Keystone refuses to authenticate a token, as it's from the "future" as far as Keystone is concerned, and both Glance and Nova simply propagate the error to the user.

If there is no clock skew between nodes - everything works fine.

Revision history for this message
Sergey Shevorakov (sshevorakov) wrote :

Moving the bug to Incomplete state after discussion with NastyaU.
Also area-qa tag has been removed.

tags: removed: area-qa
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Nastya

Could someone from your team investigate why there is clock skew between nodes after reverting a snapshot?

Sergey

I'm not sure why you've removed area-qa tag. As I stated in my comment: all services work as expected - we can't guarantee you won't see errors, if there is clock skew between nodes.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/268242

Changed in fuel:
assignee: MOS Nova (mos-nova) → Roman Podoliaka (rpodolyaka)
status: Invalid → In Progress
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Ok, I took another look at one of the recent failures ( https://product-ci.infra.mirantis.net/view/8.0_swarm/job/8.0.system_test.ubuntu.filling_root/45/consoleText )

What I see is exactly the same behaviour Alex and I described before:

after making a snapshot in the middle of the case one of the controller nodes is 1-2 minutes behind others, which causes Keystone Ferten tokens to be treated as invalid.

According to the job console log time sync is done once for boostrap nodes, but not after a snapshot of an environment is taken later.

fuel-devops can't do a snapshot atomically, so it suspends nodes one by one first, which must cause this slight clock skew. Unlike cases when we revert a snapshot later, which enforces a time sync, save_load_environment action only makes a snapshot and resume the environment, time sync is skipped.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

 https://review.openstack.org/268242 must help here, but I want to give it a try first.

tags: added: area-qa
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/268242
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=7fac46b45ecd7271520242df103d1de2f276b19b
Submitter: Jenkins
Branch: master

commit 7fac46b45ecd7271520242df103d1de2f276b19b
Author: Roman Podoliaka <email address hidden>
Date: Fri Jan 15 19:22:02 2016 +0200

    Force time sync after making a snapshot

    fuel-devops suspends nodes one by one before taking a snapshot of
    each node. This can lead to slight clock skew after resuming of the
    environment later (what we see in #1525214 , when one of controller
    nodes is 1-2 minutes behind, which causes Keystone to treat issued
    Fernet tokens as invalid).

    Force time sync after making a snapshot and resuming an environment
    in save_load_environment action.

    Change-Id: I93ea8bbc297b1b41b6badef5bf5194d8f75c68b8
    Closes-Bug: #1525214

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/268570

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (stable/8.0)

Reviewed: https://review.openstack.org/268570
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=5716cb11842206458a16d2a2ad07184a32daa697
Submitter: Jenkins
Branch: stable/8.0

commit 5716cb11842206458a16d2a2ad07184a32daa697
Author: Roman Podoliaka <email address hidden>
Date: Fri Jan 15 19:22:02 2016 +0200

    Force time sync after making a snapshot

    fuel-devops suspends nodes one by one before taking a snapshot of
    each node. This can lead to slight clock skew after resuming of the
    environment later (what we see in #1525214 , when one of controller
    nodes is 1-2 minutes behind, which causes Keystone to treat issued
    Fernet tokens as invalid).

    Force time sync after making a snapshot and resuming an environment
    in save_load_environment action.

    Change-Id: I93ea8bbc297b1b41b6badef5bf5194d8f75c68b8
    Closes-Bug: #1525214
    (cherry picked from commit 7fac46b45ecd7271520242df103d1de2f276b19b)

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Still reproduces:
filling_root on 8.0-466
https://product-ci.infra.mirantis.net/job/8.0.system_test.ubuntu.filling_root/56/testReport/(root)/Case_FillRootPrimaryController__Config_ceph_all_on_neutron_vlan/
Failed 9 OSTF tests; should fail 0 tests. Names of failed tests:
- Request snapshot list (failure) Snapshots list is unavailable. Please refer to OpenStack logs for more details.
- Request volume list (failure) Volume list is unavailable. Please refer to OpenStack logs for more details.
- Request user list (failure) User list is unavailable. Please refer to OpenStack logs for more details.
- Request list of networks (failure) Networks list is unavailable. Please refer to OpenStack logs for more details.
- Create volume and boot instance from it (failure) Image can not be retrieved. Please refer to OpenStack logs for more details
- Check network connectivity from instance via floating IP (failure) Image can not be retrieved. Please refer to OpenStack logs for more details
- Launch instance (failure) Image can not be retrieved. Please refer to OpenStack logs for more details
- Launch instance, create snapshot, launch instance from snapshot (failure) Image can not be retrieved. Please refer to OpenStack logs for more details
- Create user and authenticate with it. (failure) User role can not be created. Please refer to OpenStack logs for more details.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

I took at look at the latest failure (https://product-ci.infra.mirantis.net/job/8.0.system_test.ubuntu.filling_root/59) and it still appears that it might be time related as the exception error was at 2016-01-25 00:03:59.452 but the log about the 500 being returned was at 2016-01-25T00:05:02.005193+00:00. Are we sure the previous fix for time syncing is in place?

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Alex, good catch! This works on master, but fails on 8.0, because the following commit is missing:

http://paste.openstack.org/show/484979/

In 8.0 sync_time() looks like:

https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/models/environment.py#L547-L573

and if nailgun_nodes parameter is not passed explicitly, it will only sync time on the master node, what we see in the job logs:

http://paste.openstack.org/show/484981/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/272593

Revision history for this message
Egor Kotko (ykotko) wrote :
Download full text (4.0 KiB)

Reproduced on BAREMETAL - in my case the haproxy was dead:

[root@fuel-lab-cz5555 ~]# ssh node-14 "ps axuf | grep [h]aproxy"
Warning: Permanently added 'node-14' (ECDSA) to the list of known hosts.
[root@fuel-lab-cz5555 ~]# ssh node-15 "ps axuf | grep [h]aproxy"
Warning: Permanently added 'node-15' (ECDSA) to the list of known hosts.
haproxy 7058 0.2 0.0 45420 15364 ? Ss 11:35 0:09 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf 2260
[root@fuel-lab-cz5555 ~]# ssh node-16 "ps axuf | grep [h]aproxy"
Warning: Permanently added 'node-16' (ECDSA) to the list of known hosts.
haproxy 6374 0.2 0.0 33580 3456 ? Ss 11:34 0:07 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf 1735

[root@fuel-lab-cz5555 ~]# ssh node-14 "df -h"
Warning: Permanently added 'node-14' (ECDSA) to the list of known hosts.
Filesystem Size Used Avail Use% Mounted on
udev 7.9G 12K 7.9G 1% /dev
tmpfs 1.6G 32M 1.6G 2% /run
/dev/dm-4 44G 44G 0 100% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
none 5.0M 0 5.0M 0% /run/lock
none 7.9G 60M 7.8G 1% /run/shm
none 100M 0 100M 0% /run/user
/dev/sda3 196M 39M 148M 21% /boot
/dev/mapper/logs-log 9.8G 125M 9.1G 2% /var/log
/dev/mapper/image-glance 352G 46M 352G 1% /var/lib/glance
/dev/mapper/mysql-root 20G 1.6G 17G 9% /var/lib/mysql
/dev/mapper/horizon-horizontmp 11G 33M 11G 1% /var/lib/horizon

[root@fuel-lab-cz5555 ~]# fuel node
id | status | name | cluster | ip | mac | roles | pending_roles | online | group_id
---|----------|------------------|---------|---------------|-------------------|------------|---------------|--------|---------
15 | ready | Untitled (4c:54) | 3 | 172.16.39.139 | 9e:26:cb:48:b5:dc | controller | | True | 3
16 | ready | Untitled (4c:50) | 3 | 172.16.39.141 | 9e:18:a3:a7:3c:68 | controller | | True | 3
14 | ready | Untitled (43:68) | 3 | 172.16.39.140 | a0:2b:b8:1f:43:68 | controller | | True | 3
13 | discover | Untitled (57:e1) | None | 172.16.39.138 | 52:54:00:d4:57:e1 | | | False | None
17 | discover | Untitled (49:64) | None | 172.16.39.143 | a0:2b:b8:1f:49:64 | | | True | None
18 | discover | Untitled (4b:ec) | None | 172.16.39.142 | a0:2b:b8:1f:4b:ec | | | True | None
8 | discover | Untitled (e9:62) | None | 172.16.39.130 | 0c:c4:7a:6a:e9:62 | | | True | None
12 | discover | Untitled (0c:a3) | None | 172.16.39.137 | 52:54:00:c8:0c:a3 | | | False | None

Snapshot:
https://drive.google.com/file/d/0BzWDM1PONYEuUjhpQ0tnVEdYM3M/view?usp=sharing

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
 ...

Read more...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (stable/8.0)

Reviewed: https://review.openstack.org/272593
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=a317fe5453da56b54f35734ac27f51970a625a33
Submitter: Jenkins
Branch: stable/8.0

commit a317fe5453da56b54f35734ac27f51970a625a33
Author: Roman Podoliaka <email address hidden>
Date: Tue Jan 26 17:23:42 2016 +0200

    Fix time sync after taking a snapshot

    Currently time sync is only performed for the admin node, as an empty
    list of nodes (default argument value) is passed to sync_time()
    method.

    On master branch this works as expected, as sync_time() behaviour
    was changed in commit 03159d5cfb740ec1fb41420d3867a2477f31b2d1. The
    latter seems to be a rather big change to be backported to the stable
    branch, though. The alternative is to fix this in 8.0 only by passing
    the list of nodes to sync_time().

    No commit is done to master first as API of sync_time() changed
    completely since 8.0 and those two would be entirely different
    changes (and master defaults to all nodes of the cluster - so
    we don't need to pass a list of nodes explicitly at all).

    Closes-Bug: #1525214

    Change-Id: Ibf639a5605f0ce29c9c3d3105b0c9c5880c8a965

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-qa (master)

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.openstack.org/262813

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

verified on 8.0 iso #529

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "529"
  build_id: "529"
  fuel-nailgun_sha: "baec8643ca624e52b37873f2dbd511c135d236d9"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "e2d79330d5d708796330fac67722c21f85569b87"
  fuel-ostf_sha: "3bc76a63a9e7d195ff34eadc29552f4235fa6c52"
  fuel-mirror_sha: "fb45b80d7bee5899d931f926e5c9512e2b442749"
  fuelmenu_sha: "e071216cb214e34b4d861478033425ee6a54a3be"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "a365f05b903368225da3fea9aa42afc1d50dc9b4"

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.openstack.org/260751

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.