firewall manifest in tripleo breaks some assumptions

Bug #1781147 reported by Michele Baldessari
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Michele Baldessari
Milestone: (none listed)

Bug Description

The tripleo firewall module has fundamentally three pieces:
1) firewall::pre (allows existing connections/ssh/icmp)
2) firewall::rule (allows services traffic)
3) firewall::post (drops all traffic)

One of the assumptions coded in the module is the following line:
Service<||> -> Class['tripleo::firewall::post']

Which has been added so that:
"""
use ordering to make sure we start all Services in catalog before post
rules. It ensure that we don't drop all traffic before starting the
services, which could lead to services errors (e.g. trying to reach database or amqp)
"""
(see also bug LP#1643575)

Now the problem is that, while we guarantee that pre comes before post and that services start before post, nothing guarantees that the service rules are applied before post.
In fact, in my deployment I see the following:
Jul 10 05:04:13 overcloud-controller-1 systemd: Started OpenSSH server daemon.
Jul 10 05:04:13 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Ssh::Server::Service/Service[sshd]) Triggered 'refresh' from 2 events
Jul 10 05:04:13 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[000 accept related established rules]/Firewall[000 accept related established rules ipv4]/ensure) created
Jul 10 05:04:13 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[000 accept related established rules]/Firewall[000 accept related established rules ipv6]/ensure) created
Jul 10 05:04:13 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[001 accept all icmp]/Firewall[001 accept all icmp ipv4]/ensure) created
Jul 10 05:04:13 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[001 accept all icmp]/Firewall[001 accept all icmp ipv6]/ensure) created
Jul 10 05:04:13 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[002 accept all to lo interface]/Firewall[002 accept all to lo interface ipv4]/ensure) created
Jul 10 05:04:13 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[002 accept all to lo interface]/Firewall[002 accept all to lo interface ipv6]/ensure) created
Jul 10 05:04:13 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[003 accept ssh]/Firewall[003 accept ssh ipv4]/ensure) created
Jul 10 05:04:13 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[003 accept ssh]/Firewall[003 accept ssh ipv6]/ensure) created
Jul 10 05:04:13 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[004 accept ipv6 dhcpv6]/Firewall[004 accept ipv6 dhcpv6 ipv6]/ensure) created
Jul 10 05:04:13 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[998 log all]/Firewall[998 log all ipv4]/ensure) created
Jul 10 05:04:14 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[998 log all]/Firewall[998 log all ipv6]/ensure) created
Jul 10 05:04:14 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[999 drop all]/Firewall[999 drop all ipv4]/ensure) created
Jul 10 05:04:14 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[999 drop all]/Firewall[999 drop all ipv6]/ensure) created
Jul 10 05:04:14 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall/Tripleo::Firewall::Service_rules[cinder_api]/Tripleo::Firewall::Rule[119 cinder]/Firewall[119 cinder ipv4]/ensure) created
Jul 10 05:04:14 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall/Tripleo::Firewall::Service_rules[cinder_api]/Tripleo::Firewall::Rule[119 cinder]/Firewall[119 cinder ipv6]/ensure) created
Jul 10 05:04:14 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall/Tripleo::Firewall::Service_rules[cinder_volume]/Tripleo::Firewall::Rule[120 iscsi initiator]/Firewall[120 iscsi initiator ipv4]/ensure) created

As we can see above, the service rules (item 2) were added after the post rules, which breaks the assumption that the services are up, running, and reachable.

In fact, I am hitting this issue while trying to scale up controllers: the cluster is already up and running, and only later do we apply pre+post. Since it takes some time to apply all the iptables rules, the cluster thinks the other nodes are unreachable and fences them.
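
A minimal sketch of the missing constraint, not the actual patch: the tag names below are hypothetical, and notify resources stand in for the real tripleo::firewall::rule instances so that the manifest can be applied without the firewall module. It only illustrates how tags plus chained resource collectors can force pre, then service rules, then post.

```puppet
# Hypothetical stand-ins for the three groups of rules.
# The tag names are invented for this sketch.
notify { '003 accept ssh':
  tag => 'sketch-firewall-pre',
}

notify { '119 cinder':
  tag => 'sketch-firewall-service',
}

notify { '999 drop all':
  tag => 'sketch-firewall-post',
}

# Chain the collectors: every resource tagged as "pre" is applied
# before every "service" resource, which in turn is applied before
# every "post" resource.
Notify<| tag == 'sketch-firewall-pre' |>
-> Notify<| tag == 'sketch-firewall-service' |>
-> Notify<| tag == 'sketch-firewall-post' |>
```

Applying this with puppet apply should report the three notices in pre, service, post order. Without the chain, Puppet is free to pick any order that satisfies the remaining dependencies, which is how the drop-all rule can land before the service rules.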

Changed in tripleo:
assignee: nobody → Michele Baldessari (michele)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (master)

Fix proposed to branch: master
Review: https://review.openstack.org/581634

Changed in tripleo:
status: Triaged → In Progress
Bogdan Dobrelya (bogdando) wrote :

I believe the service downtime associated with this issue makes it critical.

Changed in tripleo:
importance: High → Critical
tags: added: queens-backport-potential
tags: added: pike-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/581634
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=c525c64f6a9535b0b040cca1a7813e93dfaa797c
Submitter: Zuul
Branch: master

commit c525c64f6a9535b0b040cca1a7813e93dfaa797c
Author: Michele Baldessari <email address hidden>
Date: Wed Jul 11 11:01:13 2018 +0200

    Enforce proper ordering when applying firewall rules

    The tripleo firewall module has fundamentally three pieces:
    1) firewall::pre (allows existing connections/ssh/icmp)
    2) firewall::rule (allows services traffic)
    3) firewall::post (drops all traffic)

    One of the assumptions coded in the module is the following line:
    Service<||> -> Class['tripleo::firewall::post']

    Which has been added so that (see also bug LP#1643575):
    """
    use ordering to make sure we start all Services in catalog before post
    rules. It ensure that we don't drop all traffic before starting the
    services, which could lead to services errors (e.g. trying to reach database or amqp)
    """

    The problem is that there is nothing specifying that the firewall rules
    created by tripleo services need to be implemented between the pre and
    post classes. So the following can happen:
    Jul 10 05:04:13 overcloud-controller-1 systemd: Started OpenSSH server daemon.
    Jul 10 05:04:13 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Ssh::Server::Service/Service[sshd]) Triggered 'refresh' from 2 events
    Jul 10 05:04:13 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[000 accept related established rules]/Firewall[000 accept related established rules ipv4]/ensure) created
    ...
    Jul 10 05:04:13 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[998 log all]/Firewall[998 log all ipv4]/ensure) created
    ...
    Jul 10 05:04:14 overcloud-controller-1 puppet-user[32418]: (/Stage[main]/Tripleo::Firewall/Tripleo::Firewall::Service_rules[cinder_api]/Tripleo::Firewall::Rule[119 cinder]/Firewall[119 cinder ipv4]/ensure) created

    This means that we can actually open the traffic for our services *after*
    said traffic has been completely blocked. In order to fix this we
    tag the pre/post rules with a different tag and add resource collectors
    to actually enforce proper ordering. We now get:
    ...
    Jul 11 08:54:43 overcloud-controller-0 puppet-user[32554]: (/Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[000 accept related established rules]/Firewall[000 accept related established rules ipv4]/ensure) created
    ...
    Jul 11 08:54:43 overcloud-controller-0 puppet-user[32554]: (/Stage[main]/Tripleo::Firewall/Tripleo::Firewall::Service_rules[cinder_api]/Tripleo::Firewall::Rule[119 cinder]/Firewall[119 cinder ipv4]/ensure) created
    ...
    Jul 11 08:54:52 overcloud-controller-0 puppet-user[32554]: (/Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[998 log all]/Firewall[998 log all ipv4]/ensure) created

    Tested this by doing 20 deploys of 1ctrl+1cmp and then scaling up the
    overcloud to 3ctrl+2cmp.

    The reason t...

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/582942

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/queens)

Reviewed: https://review.openstack.org/582942
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=9fdb07ec6653c751d79e66e7f0ab4a8ce68d120c
Submitter: Zuul
Branch: stable/queens

commit 9fdb07ec6653c751d79e66e7f0ab4a8ce68d120c
Author: Michele Baldessari <email address hidden>
Date: Wed Jul 11 11:01:13 2018 +0200

    Enforce proper ordering when applying firewall rules

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/583107

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/pike)

Reviewed: https://review.openstack.org/583107
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=5013eb278d3352dc997911079c391a75c01c7fa6
Submitter: Zuul
Branch: stable/pike

commit 5013eb278d3352dc997911079c391a75c01c7fa6
Author: Michele Baldessari <email address hidden>
Date: Wed Jul 11 11:01:13 2018 +0200

    Enforce proper ordering when applying firewall rules

tags: added: in-stable-pike
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 9.2.0

This issue was fixed in the openstack/puppet-tripleo 9.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 7.4.15

This issue was fixed in the openstack/puppet-tripleo 7.4.15 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 8.3.5

This issue was fixed in the openstack/puppet-tripleo 8.3.5 release.
