Classic provisioning failed on some nodes due to DNS resolution failure during OS installation in ks-post

Bug #1458533 reported by Andrey Sledzinskiy
This bug affects 1 person
Affects             Status         Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Committed  Undecided   Vladimir Kuklin
6.1.x               Fix Committed  High        Vladimir Kuklin

Declined for 3.2.x by Aleksandra Fedorova
Declined for 4.0.x by Aleksandra Fedorova

Bug Description

{
    "build_id": "2015-05-24_15-51-50",
    "build_number": "462",
    "release_versions": {
        "2014.2.2-6.1": {
            "VERSION": {
                "build_id": "2015-05-24_15-51-50",
                "build_number": "462",
                "api": "1.0",
                "fuel-library_sha": "889c2534ceadf8afd5d1540c1cadbd913c0c8c14",
                "nailgun_sha": "76441596e4fe6420cc7819427662fa244e150177",
                "feature_groups": [
                    "mirantis"
                ],
                "openstack_version": "2014.2.2-6.1",
                "production": "docker",
                "python-fuelclient_sha": "e19f1b65792f84c4a18b5a9473f85ef3ba172fce",
                "astute_sha": "0bd72c72369e743376864e8e8dabfe873d40450a",
                "fuel-ostf_sha": "9a5f55602c260d6c840c8333d8f32ec8cfa65c1f",
                "release": "6.1",
                "fuelmain_sha": "5c8ebddf64ea93000af2de3ccdb4aa8bb766ce93"
            }
        }
    },
    "auth_required": true,
    "api": "1.0",
    "fuel-library_sha": "889c2534ceadf8afd5d1540c1cadbd913c0c8c14",
    "nailgun_sha": "76441596e4fe6420cc7819427662fa244e150177",
    "feature_groups": [
        "mirantis"
    ],
    "openstack_version": "2014.2.2-6.1",
    "production": "docker",
    "python-fuelclient_sha": "e19f1b65792f84c4a18b5a9473f85ef3ba172fce",
    "astute_sha": "0bd72c72369e743376864e8e8dabfe873d40450a",
    "fuel-ostf_sha": "9a5f55602c260d6c840c8333d8f32ec8cfa65c1f",
    "release": "6.1",
    "fuelmain_sha": "5c8ebddf64ea93000af2de3ccdb4aa8bb766ce93"
}

Steps:
1. Create the following cluster: CentOS, HA, classic provisioning, Neutron GRE, 3 controllers, 2 computes
2. Start deployment

Actual result: provisioning failed on node-2 and node-5.
Error in anaconda.log:
2015-05-25 05:15:48 ERR Error downloading treeinfo file: [Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 404 Not Found"
2015-05-25 05:15:48 ERR Error downloading treeinfo file: [Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 404 Not Found"

Logs are attached

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This looks very much like a connectivity issue between the master node and the slave node.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel provisioning team (fuel-provisioning)
Revision history for this message
Alexander Gordeev (a-gordeev) wrote :

Yes, there are traces of network connectivity issues, such as:

./10.109.9.2/var/log/docker-logs/remote/node-5.test.domain.local/install/ks-post.log:2015-05-25T05:19:16.480007+00:00 info: http://mirror.fuel-infra.org/mos/centos-6/mos6.1/updates/repodata/repomd.xml: [Errno 14] PYCURL ERROR 6 - "Couldn't resolve host 'mirror.fuel-infra.org'"
./10.109.9.2/var/log/docker-logs/remote/node-5.test.domain.local/install/anaconda.log:2015-05-25T05:15:49.800332+00:00 err: Error downloading treeinfo file: [Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 404 Not Found"
./10.109.9.2/var/log/docker-logs/remote/node-5.test.domain.local/install/anaconda.log:2015-05-25T05:15:49.801185+00:00 err: Error downloading treeinfo file: [Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 404 Not Found"
./10.109.9.2/var/log/docker-logs/remote/node-3.test.domain.local/install/anaconda.log:2015-05-25T05:15:56.843486+00:00 err: Error downloading treeinfo file: [Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 404 Not Found"
./10.109.9.2/var/log/docker-logs/remote/node-3.test.domain.local/install/anaconda.log:2015-05-25T05:15:56.843800+00:00 err: Error downloading treeinfo file: [Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 404 Not Found"
./10.109.9.2/var/log/docker-logs/remote/node-2.test.domain.local/install/ks-post.log:2015-05-25T05:19:11.777919+00:00 info: http://mirror.fuel-infra.org/mos/centos-6/mos6.1/updates/repodata/repomd.xml: [Errno 14] PYCURL ERROR 6 - "Couldn't resolve host 'mirror.fuel-infra.org'"

Is it still reproducible?

Also, the 404 errors look related to improper repo metadata: http://bugs.centos.org/view.php?id=6277

The MOS Linux folks should take a look at that too.

Changed in fuel:
status: New → Incomplete
assignee: Fuel provisioning team (fuel-provisioning) → MOS Linux (mos-linux)
Revision history for this message
Aleksander Mogylchenko (amogylchenko) wrote :

network connectivity != 404 errors or DNS resolution problems.
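
The distinction is visible in the log lines quoted earlier: the number after `PYCURL ERROR` is a libcurl error code, where 6 means the host name could not be resolved and 22 means the server returned an HTTP error (here a 404). A minimal sketch of grouping such lines by node and error kind, assuming the snapshot path layout shown in the excerpts above; the function name `classify` and the kind labels are illustrative only:

```python
import re
from collections import defaultdict

# libcurl error codes: 6 = could not resolve host, 22 = HTTP error returned.
# The labels on the right are illustrative names, not taken from any tool.
PYCURL_KINDS = {6: "dns-resolution", 22: "http-error"}

# Assumes the snapshot layout seen in the excerpts:
#   .../remote/<node>/install/<logfile>:<timestamp> <level>: <message>
LINE_RE = re.compile(
    r"/remote/(?P<node>[^/]+)/install/[\w.-]+:\S+ \w+: .*PYCURL ERROR (?P<code>\d+)"
)


def classify(lines):
    """Count PYCURL errors per node, split by error kind."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a PYCURL error line in the expected layout
        kind = PYCURL_KINDS.get(int(match.group("code")), "other")
        counts[match.group("node")][kind] += 1
    return {node: dict(kinds) for node, kinds in counts.items()}


# Two lines copied from the excerpts above.
SAMPLE = [
    './10.109.9.2/var/log/docker-logs/remote/node-5.test.domain.local/install/ks-post.log:'
    '2015-05-25T05:19:16.480007+00:00 info: http://mirror.fuel-infra.org/mos/centos-6/mos6.1/'
    'updates/repodata/repomd.xml: [Errno 14] PYCURL ERROR 6 - '
    '"Couldn\'t resolve host \'mirror.fuel-infra.org\'"',
    './10.109.9.2/var/log/docker-logs/remote/node-3.test.domain.local/install/anaconda.log:'
    '2015-05-25T05:15:56.843486+00:00 err: Error downloading treeinfo file: '
    '[Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 404 Not Found"',
]

print(classify(SAMPLE))
```

Separating the two kinds per node makes it easier to see which nodes hit a DNS failure in ks-post and which only logged the treeinfo 404 in anaconda.log.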

Changed in fuel:
assignee: MOS Linux (mos-linux) → Fuel provisioning team (fuel-provisioning)
Revision history for this message
Alexander Gordeev (a-gordeev) wrote :

Closing as invalid, as it looks like some sort of temporary networking outage. It's not reproducible.

Changed in fuel:
status: Incomplete → Invalid
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Reproduced on {
    build_id: "2015-05-31_20-55-26",
    build_number: "490",
    api: "1.0",
    fuel-library_sha: "c9a86ac0e6da95d36e328ce5130715792a2eb177",
    nailgun_sha: "3830bdcb28ec050eed399fe782cc3dd5fbf31bde",
    feature_groups: [
        "mirantis"
    ],
    openstack_version: "2014.2.2-6.1",
    production: "docker",
    python-fuelclient_sha: "4fc55db0265bbf39c369df398b9dc7d6469ba13b",
    astute_sha: "5d570ae5e03909182db8e284fbe6e4468c0a4e3e",
    fuel-ostf_sha: "7413186490e8d651b8837b9eee75efa53f5e230b",
    release: "6.1",
    fuelmain_sha: "6b5712a7197672d588801a1816f56f321cbceebd"
}

scenario 1:
1. Create cluster
2. Add 3 nodes with controller role
3. Add 2 nodes with compute role
4. Deploy the cluster
5. Run network verification
6. Run OSTF

scenario 2:
1. Create cluster
2. Add 3 nodes with controller and ceph OSD roles
3. Add 1 node with ceph OSD roles
4. Add 2 nodes with compute and ceph OSD roles
5. Deploy the cluster

Changed in fuel:
status: Invalid → Confirmed
Revision history for this message
Nastya Urlapova (aurlapova) wrote :
Revision history for this message
Nastya Urlapova (aurlapova) wrote :
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Nastya,

please file a new bug. The problem you've found (ceph deployment failure) has nothing to do with the original one (CentOS nodes fail to provision).

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

@Aleksey, I've reverted the env, and the issue was not in deployment: provisioning failed on one node.

Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Alexander Gordeev (a-gordeev) wrote :

Confirmed.

Provisioning of the nodes failed on timeout.

Failure scenario:
1) cobbler tried to install the OS
2) then it removed the ruby and rubygems packages and tried to reinstall them in ks-post.

rpm -e --nodeps ruby
+ rpm -e --nodeps ruby
yum install --exclude=ruby21*,ruby-2.1.1* -y ruby rubygems
+ yum install '--exclude=ruby21*,ruby-2.1.1*' -y ruby rubygems
Loaded plugins: fastestmirror, priorities
http://mirror.fuel-infra.org/mos/centos-6/mos6.1/updates/repodata/repomd.xml: [Errno 14] PYCURL ERROR 6 - "Couldn't resolve host 'mirror.fuel-infra.org'"
Trying other mirror.
Error: Cannot retrieve repository metadata (repomd.xml) for repository: mos-updates. Please verify its path and try again
yum update -y --exclude --exclude=ruby*
+ yum update -y --exclude '--exclude=ruby*'
Loaded plugins: fastestmirror, priorities
Determining fastest mirrors
Setting up Update Process
No Packages marked for Update

3) cobbler reboots the nodes once the installation completes.
4) After the reboot, the node doesn't appear as alive. With no ruby installed, nailgun-agent cannot start, which later causes the provisioning timeout to be exceeded.
E.g.:
./10.109.10.2/var/log/docker-logs/remote/node-1.test.domain.local/nailgun-agent.log:2015-06-01T05:30:01.658422+00:00 notice: /usr/bin/env: ruby: No such file or directory

The questions are: why did only a few nodes get DNS resolution errors, and how could it happen?

summary: - Classic provisioning failed on some nodes with Error downloading
- treeinfo file: [Errno 14] PYCURL ERROR 22 - "The requested URL returned
- error: 404 Not Found"
+ Classic provisioning failed on some nodes due to DNS resolution failure
+ during OS installation in ks-post
Revision history for this message
Mike Scherbakov (mihgen) wrote :

What do the dnsmasq logs on the master node say? Are there any other logs showing connectivity issues at the same time?

Revision history for this message
Alexander Gordeev (a-gordeev) wrote :

The DNS resolution failure occurred, let's say, within 04:02:10-04:02:20.

The dnsmasq log showed a lot of incoming SIGTERMs:
http://paste.openstack.org/show/256285/

It looks like 2 SIGTERMs within 2 seconds could lead to a DNS outage:
Jun 1 04:02:12 dnsmasq[1994]: exiting on receipt of SIGTERM
Jun 1 04:02:14 dnsmasq[2103]: exiting on receipt of SIGTERM
Jun 1 04:02:20 dnsmasq[2132]: exiting on receipt of SIGTERM
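
To check a dnsmasq log for this pattern, here is a short sketch that extracts the SIGTERM exits and reports consecutive pairs at most 2 seconds apart. It assumes syslog-style lines as quoted above (an optional host name field is tolerated). Because such lines carry no year, the caller supplies one, and the 2-second threshold comes from the observation above rather than from any dnsmasq setting. The function names are illustrative.

```python
import re
from datetime import datetime, timedelta

# Matches e.g. "Jun 1 04:02:12 dnsmasq[1994]: exiting on receipt of SIGTERM",
# with an optional host name between the timestamp and the program name.
SIGTERM_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d) (?:\S+ )?dnsmasq\[\d+\]: "
    r"exiting on receipt of SIGTERM"
)


def sigterm_times(lines, year):
    """Return the timestamps of dnsmasq SIGTERM exits found in the lines."""
    times = []
    for line in lines:
        match = SIGTERM_RE.match(line)
        if match:
            # Collapse repeated spaces (syslog pads single-digit days).
            stamp = " ".join(match.group("ts").split())
            times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times


def bursts(times, max_gap=timedelta(seconds=2)):
    """Return consecutive SIGTERM pairs separated by at most max_gap."""
    ordered = sorted(times)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a <= max_gap]


# The three lines quoted above.
LOG = [
    "Jun 1 04:02:12 dnsmasq[1994]: exiting on receipt of SIGTERM",
    "Jun 1 04:02:14 dnsmasq[2103]: exiting on receipt of SIGTERM",
    "Jun 1 04:02:20 dnsmasq[2132]: exiting on receipt of SIGTERM",
]

for first, second in bursts(sigterm_times(LOG, year=2015)):
    print(first.time(), "->", second.time())
```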

cobbler produced a warning with an error:
http://paste.openstack.org/show/256283/

It looks like those warnings/errors in cobbler trigger a dnsmasq restart.

I searched for '$tCD3X7ji' and found it in:

./node-4.test.domain.local/root/anaconda-ks.cfg:rootpw --iscrypted $6$tCD3X7ji$1urw6qEMDkVxOkD33b4TpQAjRiCeDZx0jmgMhDYhfB9KuGfqO9OcMaKyUxnGGWslEDQ4HxTw7vcAMP85NxQe61
./node-4.test.domain.local/root/cobbler.ks:rootpw --iscrypted $6$tCD3X7ji$1urw6qEMDkVxOkD33b4TpQAjRiCeDZx0jmgMhDYhfB9KuGfqO9OcMaKyUxnGGWslEDQ4HxTw7vcAMP85NxQe61
./node-4.test.domain.local/root/cobbler.ks:sshpw --username root --iscrypted $6$tCD3X7ji$1urw6qEMDkVxOkD33k2jjklHSDG2hg2234kJHESJ3hwhsjHshSJshHSJSh333je34DHJHDr4je4AMP85NxQe61
./node-5.test.domain.local/root/anaconda-ks.cfg:rootpw --iscrypted $6$tCD3X7ji$1urw6qEMDkVxOkD33b4TpQAjRiCeDZx0jmgMhDYhfB9KuGfqO9OcMaKyUxnGGWslEDQ4HxTw7vcAMP85NxQe61
./node-5.test.domain.local/root/cobbler.ks:rootpw --iscrypted $6$tCD3X7ji$1urw6qEMDkVxOkD33b4TpQAjRiCeDZx0jmgMhDYhfB9KuGfqO9OcMaKyUxnGGWslEDQ4HxTw7vcAMP85NxQe61
./node-5.test.domain.local/root/cobbler.ks:sshpw --username root --iscrypted $6$tCD3X7ji$1urw6qEMDkVxOkD33k2jjklHSDG2hg2234kJHESJ3hwhsjHshSJshHSJSh333je34DHJHDr4je4AMP85NxQe61
./node-3.test.domain.local/root/anaconda-ks.cfg:rootpw --iscrypted $6$tCD3X7ji$1urw6qEMDkVxOkD33b4TpQAjRiCeDZx0jmgMhDYhfB9KuGfqO9OcMaKyUxnGGWslEDQ4HxTw7vcAMP85NxQe61
./node-3.test.domain.local/root/cobbler.ks:rootpw --iscrypted $6$tCD3X7ji$1urw6qEMDkVxOkD33b4TpQAjRiCeDZx0jmgMhDYhfB9KuGfqO9OcMaKyUxnGGWslEDQ4HxTw7vcAMP85NxQe61
./node-3.test.domain.local/root/cobbler.ks:sshpw --username root --iscrypted $6$tCD3X7ji$1urw6qEMDkVxOkD33k2jjklHSDG2hg2234kJHESJ3hwhsjHshSJshHSJSh333je34DHJHDr4je4AMP85NxQe61
./node-2.test.domain.local/root/anaconda-ks.cfg:rootpw --iscrypted $6$tCD3X7ji$1urw6qEMDkVxOkD33b4TpQAjRiCeDZx0jmgMhDYhfB9KuGfqO9OcMaKyUxnGGWslEDQ4HxTw7vcAMP85NxQe61
./node-2.test.domain.local/root/cobbler.ks:rootpw --iscrypted $6$tCD3X7ji$1urw6qEMDkVxOkD33b4TpQAjRiCeDZx0jmgMhDYhfB9KuGfqO9OcMaKyUxnGGWslEDQ4HxTw7vcAMP85NxQe61
./node-2.test.domain.local/root/cobbler.ks:sshpw --username root --iscrypted $6$tCD3X7ji$1urw6qEMDkVxOkD33k2jjklHSDG2hg2234kJHESJ3hwhsjHshSJshHSJSh333je34DHJHDr4je4AMP85NxQe61
./node-1.test.domain.local/root/anaconda-ks.cfg:rootpw --iscrypted $6$tCD3X7ji$1urw6qEMDkVxOkD33b4TpQAjRiCeDZx0jmgMhDYhfB9KuGfqO9OcMaKyUxnGGWslEDQ4HxTw7vcAMP85NxQe61
./node-1.test.domain.local/root/cobbler.ks:rootpw --iscrypted $6$tCD3X7ji$1urw6qEMDkVxOkD33b4TpQAjRiCeDZx0jmgMhDYhfB9KuGfqO9OcMaKyUxnGGWslEDQ4HxTw7vcAMP85NxQe61
./node-1.test.domain.local/root/cobbler.ks:sshpw --username root --iscrypted $6$tCD3X7ji$1urw6qEMDkVxOkD33k2jjklHSDG2hg2234kJHESJ3hwhsjHshSJshHSJSh333je34DHJHDr4je4AMP85NxQe61

So it's a cobbler issue. Finally triaged.

fuel has hardcoded root password in kickstar...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/187598

Changed in fuel:
assignee: Fuel provisioning team (fuel-provisioning) → Aleksandr Gordeev (a-gordeev)
status: Confirmed → In Progress
no longer affects: fuel
no longer affects: fuel/4.1.x
no longer affects: fuel/5.0.x
no longer affects: fuel/5.1.x
no longer affects: fuel/6.0.x
Revision history for this message
Ryan Moe (rmoe) wrote :

I don't think the unescaped string is the root cause. According to the cobbler documentation [0], escaping isn't required: if the variable doesn't exist, the string is left as-is (I could only find documentation for 2.6, so it's possible it doesn't work this way in 2.4.4). I can reproduce this issue, and I see warnings in the cobbler log (although it still generates a correct kickstart file), but dnsmasq is never restarted. The dnsmasq restarts in the diagnostic snapshot above all seem to result from sync_dhcp actions.

[0] http://www.cobblerd.org/manuals/2.6.0/3/5_-_Kickstart_Templating.html (See the section on escaping)

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Let's add a retry to yum.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-main (master)

Fix proposed to branch: master
Review: https://review.openstack.org/188075

Changed in fuel:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/188331

Changed in fuel:
assignee: Dmitry Ilyin (idv1985) → Vladimir Kuklin (vkuklin)
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Adding retries to the individual yum calls is a great quick fix. Yum developers want yum to fail fast on particular "hard" errors, such as DNS resolution failure. Our case, where dnsmasq goes down briefly, is rare. We could configure DNS to use the Fuel master plus an upstream DNS server for the duration of provisioning. That should solve our issues in a less invasive way.

Another option is modifying the urlgrabber library that yum uses to treat PYCURL ERROR 6 as a non-fatal error (-1 instead of 14). I'll prepare a patch for it and test.

Finally, another solution could be to use BIND 9, which can do live updates without any downtime.

Revision history for this message
Ryan Moe (rmoe) wrote :

The issue is caused by the snippet 'kickstart_done' [0]. The dnsmasq restart is triggered by setting nopxe on the current system in that snippet [1]. That snippet is the very last thing that runs in %post. So there is a possible race condition where one node can finish provisioning and trigger a dnsmasq restart while another node that is slower or started later is still installing packages. Given that, I agree with Matt that having yum retry is a good solution to this problem.

[0] https://github.com/cobbler/cobbler/blob/v2.4.4/snippets/kickstart_done#L16
[1] https://github.com/cobbler/cobbler/blob/v2.4.4/cobbler/remote.py#L1243
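
The race can be made concrete with a toy timeline model. All numbers below are hypothetical and only illustrate the overlap check: each node has a window during which yum runs in %post, each node triggers a dnsmasq restart when it reaches `kickstart_done`, and DNS is assumed to be unavailable for a fixed number of seconds after each restart. The node names and the function `exposed_nodes` are made up for this sketch.

```python
from typing import Dict, List, Tuple

Window = Tuple[float, float]  # (start, end) of a node's yum phase, in seconds


def exposed_nodes(
    yum_windows: Dict[str, Window],
    restart_times: Dict[str, float],
    outage: float,
) -> List[str]:
    """Return nodes whose yum window overlaps a DNS outage caused by another node."""
    exposed = []
    for node, (start, end) in yum_windows.items():
        for other, restart_at in restart_times.items():
            if other == node:
                continue  # a node's own restart happens after its yum phase
            # The outage covers [restart_at, restart_at + outage];
            # two closed intervals overlap when each starts before the other ends.
            if restart_at <= end and start <= restart_at + outage:
                exposed.append(node)
                break
    return sorted(exposed)


# Hypothetical timings: node-a is fast, node-b started later.
YUM_WINDOWS = {"node-a": (100.0, 160.0), "node-b": (190.0, 250.0)}
DONE_AT = {"node-a": 200.0, "node-b": 300.0}

print(exposed_nodes(YUM_WINDOWS, DONE_AT, outage=8.0))
```

With these made-up timings only node-b is reported, because node-a finishes provisioning while node-b is still running yum. This would also be consistent with only some nodes failing in a given run.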

Revision history for this message
Mike Scherbakov (mihgen) wrote :

I had a discussion with Ryan about it. Ryan will check the following:
1) if image-based provisioning is also affected
2) why cobbler restarts dnsmasq if it could potentially just do reload (SIGHUP)

In general, it looks like we started to see this issue once we started using a domain name in the repo URL; before that we always used the IP of the master node.

Revision history for this message
Ryan Moe (rmoe) wrote :

1) Image-based provisioning should not be affected by this.
2) dnsmasq has limited support for re-reading configuration with a SIGHUP e.g. a separate static leases file or /etc/ethers. Cobbler doesn't manage the static leases in a separate file so it has to restart dnsmasq anytime it adds/updates a system definition.
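
For reference, a reload (as opposed to a restart) amounts to sending SIGHUP to the running process. According to the dnsmasq man page, SIGHUP clears the cache and re-reads /etc/hosts, /etc/ethers and files given through options such as --dhcp-hostsfile or --addn-hosts, but not the main configuration file. A minimal sketch, assuming the PID file is at /var/run/dnsmasq.pid (the dnsmasq default, which a deployment may override); `reload_dnsmasq` is a hypothetical helper and not part of cobbler:

```python
import os
import signal


def reload_dnsmasq(pid_file="/var/run/dnsmasq.pid", kill=os.kill):
    """Send SIGHUP to the dnsmasq process named in pid_file and return its PID.

    The kill parameter exists only so the function can be exercised
    without signalling a real process.
    """
    with open(pid_file) as handle:
        pid = int(handle.read().strip())
    kill(pid, signal.SIGHUP)  # the process keeps running; no DNS downtime
    return pid
```

This only helps when the data that changed lives in one of the files dnsmasq re-reads on SIGHUP, which is the limitation described above.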

Revision history for this message
Mike Scherbakov (mihgen) wrote :

So let's do the yum retry for this release and then see how often the issue occurs afterwards.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/188331
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=22b33e4515fe345db20f28d92927a74a86f33095
Submitter: Jenkins
Branch: master

commit 22b33e4515fe345db20f28d92927a74a86f33095
Author: Vladimir Kuklin <email address hidden>
Date: Thu Jun 4 13:43:04 2015 +0300

    Rewrite yum in kickstart to do retries on failures

    This commit makes yum run in installation retry
    up to 10 times in case of failures making
    classic provisioning for CentOS more tolerant
    do DNS and repo connectivity failures

    Change-Id: Ibc450de102c0f76b10c945ada4c35dd6845969ef
    Partial-bug: 1458533
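
The retry behaviour described in this commit message can be sketched as follows. This is not the merged change itself, which lives in the kickstart template and is not quoted in this report. The helper name `run_with_retries`, the 5-second pause between attempts, and the injectable `runner` and `sleep` parameters are assumptions made for illustration; only the limit of 10 attempts comes from the commit message.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=10, delay=5.0,
                     runner=subprocess.call, sleep=time.sleep):
    """Run cmd until it exits with status 0, at most `attempts` times.

    Returns 0 on success, otherwise the exit status of the last attempt.
    """
    code = 1
    for attempt in range(1, attempts + 1):
        code = runner(cmd)
        if code == 0:
            return 0
        if attempt < attempts:
            # Give a briefly unavailable DNS server or mirror time to recover.
            sleep(delay)
    return code


# Example call (not executed here), using the yum command from the ks-post log:
# run_with_retries(["yum", "install", "--exclude=ruby21*,ruby-2.1.1*",
#                   "-y", "ruby", "rubygems"])
```

A pause between attempts matters here because the dnsmasq outage discussed above lasts a few seconds, so immediate retries could all fall inside the same outage.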

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Aleksandr Gordeev (<email address hidden>) on branch: master
Review: https://review.openstack.org/187598
Reason: cosmetic change

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-main (master)

Change abandoned by Dmitry Ilyin (<email address hidden>) on branch: master
Review: https://review.openstack.org/188075
