swift-proxy is shown as running even if it's not working

Bug #1637443 reported by Raoul Scarazzini
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Emilien Macchi
Milestone: ocata-1

Bug Description

While testing a TripleO environment I stop the core resources (Galera, Redis and RabbitMQ), then wait 20 minutes; if the related services report no problems, either from a systemd perspective or from the cluster side (failed actions), I restart those core services. If everything goes fine, the test passes.
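
For reference, a minimal sketch of this procedure against Pacemaker-managed resources (hedged: the resource names galera, redis and rabbitmq are the usual TripleO ones and may differ in other deployments):

[root@overcloud-controller-0 ~]# pcs resource disable galera
[root@overcloud-controller-0 ~]# pcs resource disable redis
[root@overcloud-controller-0 ~]# pcs resource disable rabbitmq
[root@overcloud-controller-0 ~]# sleep 1200   # 20 minutes; meanwhile check systemd and "pcs status" for failed actions
[root@overcloud-controller-0 ~]# pcs resource enable galera
[root@overcloud-controller-0 ~]# pcs resource enable redis
[root@overcloud-controller-0 ~]# pcs resource enable rabbitmq
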
If all of that succeeds, I then run another test, which is basically to deploy an instance. While doing this, loading an image into Glance, I get this error:

Error finding address for http://172.20.0.11:9292/v2/images/cf4bb285-d6ef-4737-88fe-fd7397c953cf/file: Unable to establish connection to http://172.20.0.11:9292/v2/images/cf4bb285-d6ef-4737-88fe-fd7397c953cf/file

Looking at the image status in glance I get this:

[stack@haa-01 ~]$ glance image-list
+--------------------------------------+--------+-------------+------------------+------+--------+
| ID | Name | Disk Format | Container Format | Size | Status |
+--------------------------------------+--------+-------------+------------------+------+--------+
| cf4bb285-d6ef-4737-88fe-fd7397c953cf | CirrOS | raw | bare | | queued |
+--------------------------------------+--------+-------------+------------------+------+--------+

So the image is queued and not available. This is related to Swift, because after the last test Swift is not responding correctly: commands like **swift stat** hang.
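
A bounded probe makes the hang visible (a hedged sketch: it assumes the proxy binds its default port 8080; a wedged proxy times out here even while systemd reports the unit as running):

[root@overcloud-controller-2 ~]# timeout 5 curl -sf http://127.0.0.1:8080/healthcheck || echo "proxy not responding"
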
The problem is in swift-proxy; the status of the service looks like this:

● openstack-swift-proxy.service - OpenStack Object Storage (swift) - Proxy Server
   Loaded: loaded (/usr/lib/systemd/system/openstack-swift-proxy.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2016-10-28 03:16:54 UTC; 4h 33min ago
 Main PID: 127252 (swift-proxy-ser)
   CGroup: /system.slice/openstack-swift-proxy.service
           ├─127252 /usr/bin/python2 /usr/bin/swift-proxy-server /etc/swift/proxy-server.conf
           └─127476 /usr/bin/python2 /usr/bin/swift-proxy-server /etc/swift/proxy-server.conf

Oct 28 07:45:25 overcloud-controller-2 proxy-server[127476]: STDERR: ERROR:oslo.messaging._drivers.impl_rabbit:[2b9e269f-786a-4309-991a-9f21b1b5d10f] AMQP server 172.17.0.22:567...05812c7fa)
Oct 28 07:45:57 overcloud-controller-2 proxy-server[127476]: STDERR: ERROR:oslo.messaging._drivers.impl_rabbit:[8e649560-d97d-4a74-b7b5-0987f5d7b08a] AMQP server 172.17.0.22:567...05812c9ad)
Oct 28 07:46:11 overcloud-controller-2 proxy-server[127476]: STDERR: ERROR:oslo.messaging._drivers.impl_rabbit:[61ad7c0a-3215-40a2-a8f3-7cb0455b4b72] AMQP server 172.17.0.22:567...05812c69e)
Oct 28 07:46:43 overcloud-controller-2 proxy-server[127476]: STDERR: ERROR:oslo.messaging._drivers.impl_rabbit:[25a47c93-44df-4c16-ad44-042a405622d7] AMQP server 172.17.0.22:567...05812caed)
Oct 28 07:47:15 overcloud-controller-2 proxy-server[127476]: STDERR: ERROR:oslo.messaging._drivers.impl_rabbit:[001478b5-663d-4f62-96cd-9f58b69c32be] AMQP server 172.17.0.22:567...05812c69e)
Oct 28 07:47:47 overcloud-controller-2 proxy-server[127476]: STDERR: ERROR:oslo.messaging._drivers.impl_rabbit:[c39374ab-1abc-4fa0-9687-ab3963a6ad4a] AMQP server 172.17.0.22:567...05812c7fa)
Oct 28 07:48:13 overcloud-controller-2 proxy-server[127476]: STDERR: ERROR:oslo.messaging._drivers.impl_rabbit:[140d692b-fe68-4a0a-b803-b2ca45876732] AMQP server 172.17.0.22:567...ket closed
Oct 28 07:48:45 overcloud-controller-2 proxy-server[127476]: STDERR: ERROR:oslo.messaging._drivers.impl_rabbit:[7d0e62a6-5aae-42ea-b998-f9c3a64fa7cd] AMQP server 172.17.0.22:567...05812cf3d)
Oct 28 07:49:07 overcloud-controller-2 proxy-server[127476]: STDERR: ERROR:oslo.messaging._drivers.impl_rabbit:[78bed003-fe52-4e91-8d03-d0b7bfe0314c] AMQP server 172.17.0.22:567...05812cc2d)
Oct 28 07:49:39 overcloud-controller-2 proxy-server[127476]: STDERR: ERROR:oslo.messaging._drivers.impl_rabbit:[def8f475-d0b1-4e40-86d9-afc8e14076d1] AMQP server 172.17.0.22:567...ket closed
Hint: Some lines were ellipsized, use -l to show in full.

This is really wrong: even though the status of the service shows *running*, the service is not working.
The only way to fix things is to restart the service:

[root@overcloud-controller-2 ~]# systemctl restart openstack-swift-proxy.service

Then the status is alright:

[root@overcloud-controller-2 ~]# systemctl status openstack-swift-proxy.service
● openstack-swift-proxy.service - OpenStack Object Storage (swift) - Proxy Server
   Loaded: loaded (/usr/lib/systemd/system/openstack-swift-proxy.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2016-10-28 07:50:44 UTC; 8s ago
 Main PID: 6690 (swift-proxy-ser)
   CGroup: /system.slice/openstack-swift-proxy.service
           ├─6690 /usr/bin/python2 /usr/bin/swift-proxy-server /etc/swift/proxy-server.conf
           └─6707 /usr/bin/python2 /usr/bin/swift-proxy-server /etc/swift/proxy-server.conf

Oct 28 07:50:45 overcloud-controller-2 proxy-server[6690]: Pipeline was modified. New pipeline is "gatekeeper ceilometer catch_errors healthcheck proxy-logging cache ratelimit ...xy-server".
Oct 28 07:50:45 overcloud-controller-2 proxy-server[6690]: Starting Keystone auth_token middleware
Oct 28 07:50:45 overcloud-controller-2 proxy-server[6690]: Using /var/cache/swift as cache directory for signing certificate
Oct 28 07:50:46 overcloud-controller-2 proxy-server[6690]: Started child 6707
Oct 28 07:50:46 overcloud-controller-2 proxy-server[6707]: Adding required filter copy to pipeline at position 12
Oct 28 07:50:46 overcloud-controller-2 proxy-server[6707]: Adding required filter dlo to pipeline at position 13
Oct 28 07:50:46 overcloud-controller-2 proxy-server[6707]: Adding required filter gatekeeper to pipeline at position 0
Oct 28 07:50:46 overcloud-controller-2 proxy-server[6707]: Pipeline was modified. New pipeline is "gatekeeper ceilometer catch_errors healthcheck proxy-logging cache ratelimit ...xy-server".
Oct 28 07:50:46 overcloud-controller-2 proxy-server[6707]: Starting Keystone auth_token middleware
Oct 28 07:50:46 overcloud-controller-2 proxy-server[6707]: Using /var/cache/swift as cache directory for signing certificate
Hint: Some lines were ellipsized, use -l to show in full.

And you are able to interact with Swift again:

[root@overcloud-controller-2 ~]# swift stat
Auth version 1.0 requires ST_AUTH, ST_USER, and ST_KEY environment variables
to be set or overridden with -A, -U, or -K.

Auth version 2.0 requires OS_AUTH_URL, OS_USERNAME, OS_PASSWORD, and
OS_TENANT_NAME OS_TENANT_ID to be set or overridden with --os-auth-url,
--os-username, --os-password, --os-tenant-name or os-tenant-id. Note:
adding "-V 2" is necessary for this.
[root@overcloud-controller-2 ~]# source /home/heat-admin/overcloudrc
[root@overcloud-controller-2 ~]# swift stat
        Account: AUTH_3a8bec1b907f4907b3513865e2c45792
     Containers: 0
        Objects: 0
          Bytes: 0
X-Put-Timestamp: 1477641134.80437
    X-Timestamp: 1477641134.80437
     X-Trans-Id: tx78a006105d294548b2d10-00581303ae
   Content-Type: text/plain; charset=utf-8

This should not happen: the service must be able to reconnect by itself, without any manual intervention.

I've attached the sosreports from all of the overcloud controllers.

Revision history for this message
Raoul Scarazzini (rasca) wrote :
Changed in tripleo:
assignee: nobody → Christian Schwede (cschwede)
status: New → In Progress
Revision history for this message
Christian Schwede (cschwede) wrote :

This is most likely related to ceilometermiddleware, which is placed before the catch_errors middleware and raises these exceptions. Because these errors are not caught properly, the proxy-server can't process requests anymore.

Related bug report: https://bugs.launchpad.net/tripleo/+bug/1637471
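
For illustration, a hedged sketch of what this means for the pipeline in /etc/swift/proxy-server.conf (the full filter list is elided; only the relative order of catch_errors and ceilometer matters here):

[pipeline:main]
# ordering from the logs above: catch_errors sits after ceilometer,
# so exceptions raised by the ceilometer filter are never trapped
#pipeline = gatekeeper ceilometer catch_errors healthcheck ... proxy-server
# with catch_errors ahead of ceilometer, its exceptions would be caught
pipeline = catch_errors gatekeeper healthcheck ... ceilometer proxy-server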

Revision history for this message
Christian Schwede (cschwede) wrote :

Raoul, could you please verify that the RabbitMQ service is working properly? These error messages should only occur if the RabbitMQ service is not reachable.

If I stop RabbitMQ and do a "swift stat" (or any other Swift CLI command) it behaves like you reported. However, once RabbitMQ is up again Swift immediately works again, even when the ceilometermiddleware is in the wrong pipeline position (i.e. on current master).
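
A quick reachability check against the broker address from the log (a hedged sketch; adjust the IP and port to your environment) separates an unreachable broker from a client that simply stopped reconnecting:

[root@overcloud-controller-2 ~]# timeout 5 bash -c 'exec 3<>/dev/tcp/172.17.0.22/5672' && echo reachable || echo unreachable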

Revision history for this message
Raoul Scarazzini (rasca) wrote :

Hi Christian,
as I wrote in the bug report, this is part of a test in which RabbitMQ (along with Galera and Redis) is unavailable for 20 minutes and then started again. The test aims to prove that all the services are able, without a restart, to work properly after such an outage.
During this test, if I don't restart swift-proxy I'm not able (for example) to load new images into Glance.
So, yes, RabbitMQ is working while I run the test, but it was stopped for the 20 minutes before.
I see similar behavior with Neutron, which for some time after the restart is not able to create a new router, but after waiting for the internal reconnection everything starts working again. This does not happen with swift-proxy, where a manual restart is needed.

Revision history for this message
Christian Schwede (cschwede) wrote :

It looks like this only happens in HA setups, where Pacemaker controls RabbitMQ.

If I use a single-controller OOO deployment it doesn't happen; after restarting RabbitMQ, all messages are sent by the ceilometermiddleware again.

But with a three-node controller setup controlled by pcs this won't work. Even after the controllers are up again, the ceilometermiddleware (or, more exactly, oslo.messaging) won't reconnect successfully. The following message is repeated over and over:

Oct 31 10:04:00 host-192-0-2-15 proxy-server: STDERR: ERROR:oslo.messaging._drivers.impl_rabbit:[275d772c-b7d4-4010-b07e-01dd10c3b1a4] AMQP server on 172.17.0.16:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 14 seconds. Client port: 43564 (txn: txa0bd3ae6c911476aa0698-0058174f25) (client_ip: 172.18.0.18)

However, the server is up: connecting manually to 172.17.0.16:5672 (for example using the telnet client) succeeds. But looking at other services, these are connecting to different RabbitMQ instances after RabbitMQ was restarted.

So I think it makes sense to add the other RabbitMQ hosts to the list of nodes as well.

Another option is to use the nonblocking_notify option in ceilometermiddleware, which also fixes the problem on its own; both options are sketched below.
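
For illustration, a hedged sketch of both options in the ceilometer filter section of /etc/swift/proxy-server.conf (user, password and addresses are placeholders, not values from this deployment):

[filter:ceilometer]
# option 1: list every RabbitMQ node, so oslo.messaging can fail over
# to a surviving broker instead of retrying a single address
url = rabbit://user:password@172.17.0.16:5672,user:password@172.17.0.17:5672,user:password@172.17.0.18:5672/
# option 2: emit notifications from a background thread, so a broker
# outage cannot block the proxy's request path
nonblocking_notify = True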

I'm going to propose another upstream patch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (master)

Fix proposed to branch: master
Review: https://review.openstack.org/391890

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (master)

Change abandoned by Emilien Macchi (<email address hidden>) on branch: master
Review: https://review.openstack.org/391890
Reason: addressed by https://review.openstack.org/#/c/391862/ and even fixes ipv6

Changed in tripleo:
assignee: Christian Schwede (cschwede) → Emilien Macchi (emilienm)
Revision history for this message
Christian Schwede (cschwede) wrote :
Changed in tripleo:
importance: Undecided → High
milestone: none → ocata-1
tags: added: newton-backport-potential
removed: glance reconnect swift
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/393770

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (stable/newton)

Change abandoned by Christian Schwede (<email address hidden>) on branch: stable/newton
Review: https://review.openstack.org/393770
Reason: Not merged on master yet.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/391862
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=8a4fc9c18e8ebfccc7f5bd0c4820c87bebd61d31
Submitter: Jenkins
Branch: master

commit 8a4fc9c18e8ebfccc7f5bd0c4820c87bebd61d31
Author: Emilien Macchi <email address hidden>
Date: Mon Oct 31 10:39:43 2016 -0400

    swift/proxy: configure rabbitmq properly

    Use rabbitmq_node_ips to find out where rabbitmq nodes are, and have
    correct ipv6 syntax if required.

    Closes-Bug: 1637443
    Change-Id: Ibc0ed642931dd3ada7ee594bb8c70a1c3462206d
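
For illustration, a hedged sketch of the kind of url this produces when rabbitmq_node_ips contains IPv6 addresses (placeholder values; the point is that IPv6 hosts must be bracketed so the colon before the port stays unambiguous):

url = rabbit://user:password@[fd00:fd00:fd00:2000::10]:5672,user:password@[fd00:fd00:fd00:2000::11]:5672/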

Changed in tripleo:
status: In Progress → Fix Released
tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/newton)

Reviewed: https://review.openstack.org/393770
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=e34dc3d4e8fb0ea62b8c46c3f0bfc6887aed15c7
Submitter: Jenkins
Branch: stable/newton

commit e34dc3d4e8fb0ea62b8c46c3f0bfc6887aed15c7
Author: Emilien Macchi <email address hidden>
Date: Mon Oct 31 10:39:43 2016 -0400

    swift/proxy: configure rabbitmq properly

    Use rabbitmq_node_ips to find out where rabbitmq nodes are, and have
    correct ipv6 syntax if required.

    Closes-Bug: 1637443
    Change-Id: Ibc0ed642931dd3ada7ee594bb8c70a1c3462206d
    (cherry picked from commit a072c4d4a2a48f0f63b97733b1a3af7094f2987b)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 5.4.0

This issue was fixed in the openstack/puppet-tripleo 5.4.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 6.0.0

This issue was fixed in the openstack/puppet-tripleo 6.0.0 release.
