Github.com is used in TripleO CI for cloning distgit repos

Bug #1730931 reported by Sagi (Sergey) Shnaidman
This bug report is a duplicate of: Bug #1721702: Possible DNS issue in the CI.
This bug affects 1 person
Affects: tripleo
Status: Triaged
Importance: Critical
Assigned to: Sagi (Sergey) Shnaidman

Bug Description

Gate jobs fail because github.com cannot be resolved. We shouldn't use github.com in our code at all:

http://logs.openstack.org/79/516979/2/gate/legacy-tripleo-ci-centos-7-nonha-multinode-oooq/aa9d782/job-output.txt.gz#_2017-11-08_10_37_03_808711

2017-11-08 10:37:03,720 INFO:dlrn-repositories:Getting https://github.com/rdo-packages/tripleo-heat-templates-distgit.git to ./data/openstack-tripleo-heat-templates_distro (rpm-master)
2017-11-08 10:37:03,755 ERROR:dlrn-repositories:Error cloning https://github.com/rdo-packages/tripleo-heat-templates-distgit.git into ./data/openstack-tripleo-heat-templates_distro:

  RAN: /usr/bin/git clone https://github.com/rdo-packages/tripleo-heat-templates-distgit.git ./data/openstack-tripleo-heat-templates_distro

  STDOUT:
Cloning into './data/openstack-tripleo-heat-templates_distro'...

  STDERR:
fatal: unable to access 'https://github.com/rdo-packages/tripleo-heat-templates-distgit.git/': Could not resolve host: github.com; Unknown error

Traceback (most recent call last):
  File "/home/zuul/dlrn-venv/bin/dlrn", line 11, in <module>
    sys.exit(main())
  File "/home/zuul/dlrn-venv/lib/python2.7/site-packages/dlrn/shell.py", line 257, in main
    db_connection=config_options.
  File "/home/zuul/dlrn-venv/lib/python2.7/site-packages/dlrn/shell.py", line 661, in getinfo
    dev_mode=dev_mode)
  File "/home/zuul/dlrn-venv/lib/python2.7/site-packages/dlrn/drivers/rdoinfo.py", line 111, in getinfo
    full_path=distro_dir_full)
  File "/home/zuul/dlrn-venv/lib/python2.7/site-packages/dlrn/repositories.py", line 32, in refreshrepo
    sh.git.clone(url, path)
  File "/home/zuul/dlrn-venv/lib/python2.7/site-packages/sh.py", line 1427, in __call__
    return RunningCommand(cmd, call_args, stdin, stdout, stderr)
  File "/home/zuul/dlrn-venv/lib/python2.7/site-packages/sh.py", line 774, in __init__
    self.wait()
  File "/home/zuul/dlrn-venv/lib/python2.7/site-packages/sh.py", line 792, in wait
    self.handle_command_exit_code(exit_code)
  File "/home/zuul/dlrn-venv/lib/python2.7/site-packages/sh.py", line 815, in handle_command_exit_code
    raise exc
sh.ErrorReturnCode_128:

  RAN: /usr/bin/git clone https://github.com/rdo-packages/tripleo-heat-templates-distgit.git ./data/openstack-tripleo-heat-templates_distro

  STDOUT:
Cloning into './data/openstack-tripleo-heat-templates_distro'...

  STDERR:
fatal: unable to access 'https://github.com/rdo-packages/tripleo-heat-templates-distgit.git/': Could not resolve host: github.com; Unknown error

Changed in tripleo:
milestone: none → queens-2
wes hayutin (weshayutin) wrote :

We could probably rework the query here [1] to include that error

[1] https://review.openstack.org/#/c/517776/

tags: added: promotion-blocker
Paul Belanger (pabelanger) wrote :

Clarkb and I talked about this at the PTG in Denver. One option is to download the SRPM and extract the spec files from it. That would let jobs use the contents of the reverse proxy cache we have in each cloud region.

Alan Pevec (apevec) wrote :

From Wes: the tripleo CI role rebuilding RPMs from distgit is
https://github.com/openstack/tripleo-quickstart-extras/tree/master/roles/install-built-repo

My initial brain dump:
That's where the dlrn call could be replaced by taking the SRPM from the trunk.rdo mirror-cache in infra and unpacking it (not just the spec, it'd be rpmbuild -rp <package.src.rpm>), then replacing the tarball in SOURCES and rebuilding in mock. The last two steps could be part of the dlrn tool, TBD.

Alan Pevec (apevec) wrote :

@Wes, the role above is not building packages. I guess the role building them is https://github.com/openstack/tripleo-quickstart-extras/tree/master/roles/build-test-packages ?

Alan Pevec (apevec) wrote :

One more TBD: moving to SRPMs will break Upstream-Id feature in dlrn
https://github.com/softwarefactory-project/DLRN/commit/9c670dcd8a837d7701ddc522d613e3f7fae125b7

Javier Peña (jpena-c) wrote :

My spider-sense tells me that fetching the spec (and associated files from the distgit) from a (maybe outdated if we use current-tripleo) .src.rpm is not the best idea, but we should be able to do it without modifying DLRN code.

In [1] we copy the project source Git to DLRN/data/<project_name>. We could fetch the .src.rpm, extract everything but the source tar.gz (something we can detect by parsing Source0 in the spec), then copy that into DLRN/data/<project_name>_distro. We are already passing the --dev switch to DLRN, so that should be transparent.

We could do all that in dlrn-build.yml, ideally with a toggle so we can still reuse the playbook outside the OpenStack infra and then fetch the distgit from GitHub.

I still have some other doubts about this approach:

- How can we detect the proper file name for the .src.rpm to fetch? We know it should be {{ project_name_mapped.stdout }}-*.src.rpm, but we cannot use that wildcard with wget.

- We would still be cloning the rdoinfo repo from GitHub, do we have plans to mirror that inside the OpenStack infra?

[1] - https://github.com/openstack/tripleo-quickstart-extras/blob/8d85a55436e8a93b8081b351f06ad4100e6dd825/roles/build-test-packages/tasks/dlrn-build.yml#L50-L64
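A minimal sketch of the "extract everything but the source tar.gz" step, not taken from DLRN or tripleo-quickstart-extras. The function names are hypothetical, and it assumes the spec text has already had its macros expanded (real specs usually build Source0 from macros, so something like `rpmspec -P` would be needed first) and that the list of files unpacked from the .src.rpm is already known:

```python
import re

# Matches "Source0:" or a bare "Source:" line, but not Source1, Source10, etc.
SOURCE0_RE = re.compile(r"^\s*Source0?\s*:\s*(\S+)", re.IGNORECASE | re.MULTILINE)


def source0_basename(spec_text):
    """Return the file name Source0 points at, or None if there is no Source0.

    Assumes macros in spec_text are already expanded.
    """
    match = SOURCE0_RE.search(spec_text)
    if match is None:
        return None
    # Source0 is often a URL; only its last path component is the file name.
    return match.group(1).rsplit("/", 1)[-1]


def distgit_files(srpm_files, spec_text):
    """Return the unpacked SRPM files minus the Source0 tarball.

    The result is what would be copied into DLRN/data/<project_name>_distro.
    """
    tarball = source0_basename(spec_text)
    return [name for name in srpm_files if name != tarball]
```

With a made-up spec containing `Source0: https://example.org/foo/foo-1.2.3.tar.gz`, `distgit_files(["foo.spec", "foo-1.2.3.tar.gz", "foo.service"], spec)` keeps only the spec and the service file.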

Alan Pevec (apevec) wrote :

We could do wget -r --accept '{{ project_name_mapped.stdout }}-*.src.rpm' https://trunk.rdoproject.org/centos7-master/current/

but robots.txt prevents recursion. Also, the pattern could match multiple SRPMs,
e.g. 'instack-*.src.rpm' would match both instack and instack-undercloud.
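One way around the ambiguous wildcard, sketched under the assumption that a listing of .src.rpm file names is available from somewhere (how to obtain it is left open): since RPM version and release fields cannot contain "-", the package name is everything before the last two dashes. The helper names and the file names in the usage note are made up for illustration.

```python
SRPM_SUFFIX = ".src.rpm"


def srpm_package_name(filename):
    """Return the name part of 'name-version-release.src.rpm', or None.

    Version and release never contain '-', so splitting from the right
    on the last two dashes isolates the package name.
    """
    if not filename.endswith(SRPM_SUFFIX):
        return None
    parts = filename[: -len(SRPM_SUFFIX)].rsplit("-", 2)
    if len(parts) != 3:
        return None
    return parts[0]


def pick_srpms(filenames, package):
    """Return only the SRPMs whose package name is exactly `package`."""
    return [f for f in filenames if srpm_package_name(f) == package]
```

For the listing `["instack-8.1.0-0.1.el7.src.rpm", "instack-undercloud-8.1.0-0.2.el7.src.rpm"]`, `pick_srpms(listing, "instack")` returns only the first entry.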

Alan Pevec (apevec) wrote :

@jpena in the backtrace, dlrn is attempting to git clone rdoinfo,
while dlrn is called with --info-repo rdoinfo and also rdopkg findpkg -l rdoinfo

We could also try to change artg_rdoinfo_repo_url parameter to https://review.rdoproject.org/r/rdoinfo to avoid github and/or configure in SF additional git mirror to trunk.rdoproject.org, which is already proxy-cached in upstream infra?

Alan Pevec (apevec) wrote :

BTW, "Could not resolve host: github.com": isn't that a local DNS issue?

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-quickstart-extras (master)

Fix proposed to branch: master
Review: https://review.openstack.org/519833

Alan Pevec (apevec) wrote :

Another idea: could we proxy-cache in upstream infra https://review.rdoproject.org/r/* and https://softwarefactory-project.io/r/* ?

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@Alan, I'm not sure that's better. GitHub has more resources and is more fault tolerant, and just as github.com can't be resolved, softwarefactory-project.io and review.rdoproject.org could fail to resolve in the same way. The problem is not with the hostname, but with resolving.

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

For downloading a specific RPM we'd better use repoquery, then wget, curl, etc.
But it won't solve the resolving problem that happens in the OpenStack infrastructure, because we resolve through unbound, which is set up by infra. So the problem is either with unbound (unlikely) or with the forwarders in infra; maybe they can't accept all requests.

There are also a few workarounds that could help:
1) We can resolve github.com at the beginning of the job (by ping, wget, whatever). If unbound is configured properly by infra, it should cache the result until the job ends. If it fails, we can repeat it until it resolves, because if it doesn't, it makes no sense to continue the job.

2) We can add a public DNS server in addition to 127.0.0.1 for resolving in case unbound fails.
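Workaround 1 could look roughly like the sketch below. It is not part of any existing job; the function name, retry count and delay are assumptions, and the resolver and sleep function are passed in only so the logic can be exercised without network access. How long a successful answer stays cached still depends on the record's TTL and on how unbound is configured.

```python
import socket
import time


def wait_for_dns(host, attempts=5, delay=2.0,
                 resolve=socket.getaddrinfo, sleep=time.sleep):
    """Try to resolve `host` up to `attempts` times.

    Returns True as soon as one lookup succeeds, False if all fail.
    A job would call this first and abort early on False.
    """
    for attempt in range(1, attempts + 1):
        try:
            resolve(host, 443)
            return True
        except socket.gaierror:
            # Resolution failed; pause before retrying unless this was the last try.
            if attempt < attempts:
                sleep(delay)
    return False
```

A job could run `wait_for_dns("github.com")` before any clone and fail fast with a clear DNS error instead of a git traceback.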

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

I submitted two patches to help debugging DNS issues:

https://review.openstack.org/#/c/520007
https://review.openstack.org/#/c/520055

Maybe it's worth enabling unbound debug logging for all jobs while we are hitting these issues.

Paul Belanger (pabelanger) wrote :

When this was brought up at the PTG in Denver, there was also some concern from openstack-infra members that tripleo was depending on unreleased source (hosted on github) for workflow development. That is the main reason we shouldn't be downloading things from there. The expectation was that Python projects would tag a release, upload it to PyPI, and jobs would use that.

We likely wouldn't proxy review.rdoproject.org or softwarefactory-project.io for the same reason: it would be along the lines of using development versions of something.

The DNS issues are unrelated to that discussion. We need to narrow down where the failures are happening. So far only centos-7 appears to be reporting the issue, but it could also be specific to one cloud, e.g. tripleo-test-cloud-rh1. Either way, it seems networking related to me and we should be able to debug it.

Alfredo Moralejo (amoralej) wrote :

Does ".. unreleased source (hosted on github) for workflow development.." mean the distgits? Distgits are not released by themselves but are used to create a set of deliverables, the RPM packages.

Alan Pevec (apevec) wrote :

I guess the concern was about rdoinfo specifically? It is currently a mix of data (it's the RDO distro description) and some code to parse it. The latter is planned to be moved to a generic Python module (working module name: distroinfo) and published on PyPI.

If it was about distgits, then as Alfredo said, it's not releasable code but a packaging description. Is the solution to move all distgits to openstack/rdo-* ?

Javier Peña (jpena-c) wrote :

About rdoinfo, we could release it on PyPI while we move the generic Python code out. The only issue I see is that we would need to release new versions pretty frequently, but that's about it.

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@Paul, DNS issues are the topic of this bug and the source of the problem in gate jobs. So maybe it's better to split this bug in two and talk about services in a separate bug? We are trying to discuss two different issues here in parallel, and that's not productive for solving either.

And back to the topic of the bug: as I see from the unbound logs, we query for these hostnames:
http://logs.openstack.org/55/520055/4/check/legacy-tripleo-ci-centos-7-nonha-multinode-oooq/fce0b13/logs/undercloud/var/lib/unbound/unbound.log.txt.gz

mirror.centos.org
mirror.iad.rax.openstack.org.
mirror01.iad.rax.openstack.org.
trunk.rdoproject.org.
trunk-backup.rdoproject.org.
pypi.python.org.
prod.python.map.fastly.net.
github.com.
download.cirros-cloud.net. (handled in https://review.openstack.org/520349)

I'd like to set unbound to debug logging; it doesn't cost us anything. If DNS resolution fails, we will catch the problem in the log.

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@Paul, the first message in this bug report is an example of resolving not working. It happened in cloud "inap-mtl01", and I don't know of any failures in the rh1 cloud.

Alfredo Moralejo (amoralej) wrote :

If the concern is about reproducibility of the jobs, we could report the commit of the repos used to build the packages in each TripleO job (rdoinfo and distgit). In koji, the commit of the distgit is reported for each build: https://koji.fedoraproject.org/koji/taskinfo?taskID=22328544
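Reporting the commits could be as small as the sketch below. It is only an illustration: the function names and the checkout paths in the usage note are hypothetical, and the command runner is a parameter so the logic can be tried without real checkouts. It uses `cwd=` rather than `git -C`, since older git releases do not have the `-C` option.

```python
import subprocess


def head_commit(repo_dir, run=subprocess.check_output):
    """Return the commit hash that HEAD points at in the checkout `repo_dir`."""
    output = run(["git", "rev-parse", "HEAD"], cwd=repo_dir)
    return output.decode("utf-8").strip()


def build_provenance(repos, run=subprocess.check_output):
    """Map each repo label (e.g. 'rdoinfo', 'distgit') to its HEAD commit.

    `repos` maps a label to the path of its local checkout.
    """
    return {label: head_commit(path, run) for label, path in sorted(repos.items())}
```

A job could log `build_provenance({"rdoinfo": "<path to rdoinfo checkout>", "distgit": "<path to distgit checkout>"})` next to the built packages so a failed run can be traced back to exact commits.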

Paul Belanger (pabelanger) wrote :

Yes, feel free to propose a change to project-config, in our nodepool elements, to enable debug logs on unbound. Enabling them shouldn't be an issue.

Bogdan Dobrelya (bogdando) wrote :

Folks, do you think there should be a more generic bug https://bugs.launchpad.net/openstack-gate/+bug/1713703 for that?

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@Paul, I submitted a change to nodepool, please review/approve:
Set unbound logging to debug: https://review.openstack.org/521500

Paul Belanger (pabelanger) wrote :

I added some comments in https://bugs.launchpad.net/tripleo/+bug/1721702. The TTL on github.com is 30 seconds, which makes it a poor fit for caching in unbound. So this likely explains the failures we are getting; possibly we are being rate limited again.
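To see why a 30-second TTL defeats the cache, here is a toy model of a TTL cache. It is a sketch, not how unbound is implemented: the function name and the lookup pattern are assumptions, and `min_ttl` stands in for a floor on cached TTLs such as unbound's `cache-min-ttl` option.

```python
def upstream_queries(lookup_times, ttl, min_ttl=0):
    """Count lookups that miss a simple TTL cache and go to the forwarders.

    lookup_times: seconds since job start at which the name is looked up.
    ttl: TTL of the record as served upstream.
    min_ttl: optional floor applied to the cached TTL.
    """
    effective_ttl = max(ttl, min_ttl)
    misses = 0
    expires_at = None
    for t in sorted(lookup_times):
        if expires_at is None or t >= expires_at:
            # Cache is empty or the entry has expired: ask upstream again.
            misses += 1
            expires_at = t + effective_ttl
    return misses
```

With one lookup per minute over an hour (`range(0, 3600, 60)`), a 30-second TTL gives 60 upstream queries, one per lookup, while an assumed floor of 900 seconds reduces that to 4.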

Arx Cruz (arxcruz)
Changed in tripleo:
assignee: nobody → Sagi (Sergey) Shnaidman (sshnaidm)
Changed in tripleo:
milestone: queens-2 → queens-3
Alan Pevec (apevec) wrote :

The actual issue was the DNS min. TTL, fixed in bug 1721702.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart-extras (master)

Change abandoned by Sagi Shnaidman (<email address hidden>) on branch: master
Review: https://review.openstack.org/519833
Reason: According to TripleO policy of patch abandonment https://github.com/openstack/tripleo-specs/blob/master/specs/policy/patch-abandomment.rst, this patch is abandoned. If you would like to continue to work on it, please ask for restoration either on #tripleo in Freenode IRC, or openstack dev mailing list - <email address hidden> with [TripleO] in subject. Thanks.
