buildd-manager intermittently trying to make incorrect DB connection

Bug #578338 reported by Gary Poster on 2010-05-10
Affects: Launchpad itself
Importance: High
Assigned to: [LEGACY] Canonical WebOps

Bug Description

2010-05-04 10:49:47+0100 [-] Log opened.
2010-05-04 10:49:47+0100 [-] twistd 10.0.0 (/usr/bin/python2.5 2.5.2) starting up.
2010-05-04 10:49:47+0100 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2010-05-04 10:49:47+0100 [-] Starting scanning cycle.
2010-05-04 10:49:47+0100 [-] Scanning failed with: could not connect to server: No such file or directory
2010-05-04 10:49:47+0100 [-] Is the server running locally and accepting
2010-05-04 10:49:47+0100 [-] connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
2010-05-04 10:49:47+0100 [-]
2010-05-04 10:49:47+0100 [-] Traceback (most recent call last):
2010-05-04 10:49:47+0100 [-] File "/home/pqm/for_rollouts/production/eggs/Twisted-10.0.0-py2.5-linux-x86_64.egg/twisted/scripts/_twistd_unix.py", line 317, in startApplication
2010-05-04 10:49:47+0100 [-]
2010-05-04 10:49:47+0100 [-] File "/home/pqm/for_rollouts/production/eggs/Twisted-10.0.0-py2.5-linux-x86_64.egg/twisted/application/app.py", line 713, in startApplication
2010-05-04 10:49:47+0100 [-]
2010-05-04 10:49:47+0100 [-] File "/home/pqm/for_rollouts/production/eggs/Twisted-10.0.0-py2.5-linux-x86_64.egg/twisted/application/service.py", line 278, in startService
2010-05-04 10:49:47+0100 [-]
2010-05-04 10:49:47+0100 [-] File "/srv/launchpad.net/codelines/soyuz-production-rev-9329/lib/lp/buildmaster/manager.py", line 250, in startService
2010-05-04 10:49:47+0100 [-] d = defer.maybeDeferred(self.scan)
2010-05-04 10:49:47+0100 [-] --- <exception caught here> ---
2010-05-04 10:49:47+0100 [-] File "/home/pqm/for_rollouts/production/eggs/Twisted-10.0.0-py2.5-linux-x86_64.egg/twisted/internet/defer.py", line 117, in maybeDeferred
2010-05-04 10:49:47+0100 [-]
2010-05-04 10:49:47+0100 [-] File "/srv/launchpad.net/codelines/soyuz-production-rev-9329/lib/canonical/librarian/db.py", line 38, in retry_transaction_decorator
2010-05-04 10:49:47+0100 [-] return func(*args, **kwargs)
2010-05-04 10:49:47+0100 [-] File "/srv/launchpad.net/codelines/soyuz-production-rev-9329/lib/canonical/database/sqlbase.py", line 738, in reset_store_decorator
2010-05-04 10:49:47+0100 [-] _get_sqlobject_store().reset()
2010-05-04 10:49:47+0100 [-] File "/srv/launchpad.net/codelines/soyuz-production-rev-9329/lib/canonical/database/sqlbase.py", line 119, in _get_sqlobject_store
2010-05-04 10:49:47+0100 [-] return getUtility(IStoreSelector).get(MAIN_STORE, DEFAULT_FLAVOR)
2010-05-04 10:49:47+0100 [-] File "/srv/launchpad.net/codelines/soyuz-production-rev-9329/lib/canonical/launchpad/webapp/adapter.py", line 571, in get
2010-05-04 10:49:47+0100 [-] return db_policy.getStore(name, flavor)
2010-05-04 10:49:47+0100 [-] File "/srv/launchpad.net/codelines/soyuz-production-rev-9329/lib/canonical/launchpad/webapp/dbpolicy.py", line 86, in getStore
2010-05-04 10:49:47+0100 [-] store_name, 'launchpad:%s' % store_name)
2010-05-04 10:49:47+0100 [-] File "/home/pqm/for_rollouts/production/eggs/storm-0.15danilo_storm_launchpad_r342-py2.5-linux-x86_64.egg/storm/zope/zstorm.py", line 156, in get
2010-05-04 10:49:47+0100 [-]
2010-05-04 10:49:47+0100 [-] File "/home/pqm/for_rollouts/production/eggs/storm-0.15danilo_storm_launchpad_r342-py2.5-linux-x86_64.egg/storm/zope/zstorm.py", line 133, in create
2010-05-04 10:49:47+0100 [-]
2010-05-04 10:49:47+0100 [-] File "/home/pqm/for_rollouts/production/eggs/storm-0.15danilo_storm_launchpad_r342-py2.5-linux-x86_64.egg/storm/store.py", line 73, in __init__
2010-05-04 10:49:47+0100 [-]
2010-05-04 10:49:47+0100 [-] File "/home/pqm/for_rollouts/production/eggs/storm-0.15danilo_storm_launchpad_r342-py2.5-linux-x86_64.egg/storm/database.py", line 381, in connect
2010-05-04 10:49:47+0100 [-]
2010-05-04 10:49:47+0100 [-] File "/home/pqm/for_rollouts/production/eggs/storm-0.15danilo_storm_launchpad_r342-py2.5-linux-x86_64.egg/storm/database.py", line 182, in __init__
2010-05-04 10:49:47+0100 [-]
2010-05-04 10:49:47+0100 [-] File "/srv/launchpad.net/codelines/soyuz-production-rev-9329/lib/canonical/launchpad/webapp/adapter.py", line 385, in raw_connect
2010-05-04 10:49:47+0100 [-] raw_connection = super(LaunchpadDatabase, self).raw_connect()
2010-05-04 10:49:47+0100 [-] File "/home/pqm/for_rollouts/production/eggs/storm-0.15danilo_storm_launchpad_r342-py2.5-linux-x86_64.egg/storm/databases/postgres.py", line 328, in raw_connect
2010-05-04 10:49:47+0100 [-]
2010-05-04 10:49:47+0100 [-] psycopg2.OperationalError: could not connect to server: No such file or directory
2010-05-04 10:49:47+0100 [-] Is the server running locally and accepting
2010-05-04 10:49:47+0100 [-] connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?

See:
https://irclogs.canonical.com/2010/05/04/%23launchpad-code.html#t11:34
https://lists.ubuntu.com/mailman/private/canonical-launchpad/2010-May/056269.html

This is a new issue as of 10.04.

I have a tar available of the pertinent logfiles (the files will be rotated so keeping them around might be useful). From bigjools and danilo:

- look for "psycopg2.OperationalError: could not connect to server: No such file or directory"
- specifically it tries to connect to PG via /var/run/postgresql/.s.PGSQL.5432
- the only way I know that can happen is if LPCONFIG is not set, so I guess we also need to look at the startup scripts (the ones that start up the daemons)
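
The socket path in the error matches what libpq does when a connection string names no host: it connects through a local Unix domain socket instead of TCP. A minimal Python sketch of that rule follows. It is not Launchpad or libpq code; `describe_target`, the example host name, and the database name are hypothetical, and the socket directory is assumed to be the one shown in the log above.

```python
# Illustrative sketch only, not Launchpad or libpq code.
# Assumption: the socket directory is the one seen in the log output.
SOCKET_DIR = "/var/run/postgresql"
DEFAULT_PORT = "5432"


def describe_target(dsn):
    """Return where a simple 'key=value ...' connection string would connect."""
    # Only handles unquoted values without spaces, which is enough here.
    params = dict(item.split("=", 1) for item in dsn.split())
    port = params.get("port", DEFAULT_PORT)
    host = params.get("host")
    if not host:
        # No host given: the client falls back to a local Unix domain socket.
        return "%s/.s.PGSQL.%s" % (SOCKET_DIR, port)
    if host.startswith("/"):
        # A host beginning with a slash names a socket directory.
        return "%s/.s.PGSQL.%s" % (host, port)
    return "%s:%s" % (host, port)


# Hypothetical connection strings: with and without a host option.
print(describe_target("dbname=example_db host=db.example.com"))
print(describe_target("dbname=example_db"))
```

The second call yields the same socket path as the traceback, which is consistent with the daemon having loaded a configuration that lacks the remote host setting.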

Michael Nelson (michael.nelson) wrote :

Just a bit of info from when I tried to look into this on Friday:

On Fri, May 7, 2010 at 5:47 PM, Michael Nelson <email address hidden> wrote:
> So I've been playing with r9336 production stable, and running the
> buildd-manager with a non-local postgres connection (and killing my
> local psql), and I cannot reproduce this without removing the host
> option for the db config.
>
> So far, I cannot see why we are trying to connect to a local postgres
> instance on cesium, or why the issue disappears after a restart, nor
> reproduce the issue :/
>
> The possibilities:
> 1) LPCONFIG env. variable was not set during first run - unlikely but possible?
> 2) A change to some of the dbconfig loading options (there were some
> during the 10.04 cycle) introduced some kind of config loading issue -
> very unlikely.
>
> I'd recommend that we ask a losa to re-update cesium to
> production-stable and restart again (and find out exactly what command
> they run to restart it - the wiki only has examples for the old slave
> scanner/buildd-sequencer).

It would have been nice to be able to test the above using the same startup script used in production.

Stuart Bishop (stub) wrote :

Which LPCONFIG is being used when this error is triggered?

Does this always happen, or does the system sometimes start up correctly?

Tom tells me it's LPCONFIG=ftpmaster that's being used. Since we had it during original rollout, it failed at first and then when it was restarted it succeeded.

Gary Poster (gary) wrote :

Reminder to self: pertinent logs are devpad:/srv/launchpad.net-logs/soyuz/cesium/buildd-manager.log.XX

Gary Poster (gary) wrote :

Given the comments from bigjools, danilo, and noodles, LPCONFIG looks like the key. Chex let me look at the script that starts the buildd-manager: https://pastebin.canonical.com/32161/ . I think it is very suspicious that the script does not export LPCONFIG in start_buildd_manager, as it does for stop_buildd_manager. As danilo pointed out on IRC, this might explain the observed behavior: first it doesn't work, then you stop (and LPCONFIG is exported!) and then you start and it does work.

Could LOSAs add the export in start_buildd_manager and see if it addresses the problem?

Thank you
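
To illustrate the hypothesis above: a child process only sees a variable that is actually present in the environment it is started with, which is what `export` arranges in a shell script. The Python sketch below is hypothetical and does not reproduce the production startup script; `lpconfig_seen_by_child` is an invented helper, and "ftpmaster" is the configuration name mentioned earlier in this thread.

```python
import os
import subprocess
import sys

# Child process: report the LPCONFIG it inherited, or a marker if there is none.
CHILD_CODE = "import os; print(os.environ.get('LPCONFIG', '<unset>'))"


def lpconfig_seen_by_child(lpconfig=None):
    """Start a child process and return the LPCONFIG value it sees."""
    env = dict(os.environ)
    env.pop("LPCONFIG", None)  # make sure nothing leaks in from our own environment
    if lpconfig is not None:
        env["LPCONFIG"] = lpconfig  # rough equivalent of `export LPCONFIG=...`
    output = subprocess.check_output([sys.executable, "-c", CHILD_CODE], env=env)
    return output.decode().strip()


print(lpconfig_seen_by_child())             # started without the variable
print(lpconfig_seen_by_child("ftpmaster"))  # started with the variable exported
```

If the start function launches the daemon the first way, the daemon has no configuration name to work with, whatever the calling script believes LPCONFIG to be.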

Changed in launchpad-foundations:
assignee: nobody → Canonical LOSAs (canonical-losas)

Michael Nelson (michael.nelson) wrote :

Thanks Gary... that certainly makes sense of the situation.

Stuart Bishop (stub) wrote :

If it is just a matter of ensuring LPCONFIG is exported correctly, a quick fix would be to create ~/.lpconfig - the contents of this will be used as the configuration name if LPCONFIG environment variable is not set.
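
A rough sketch of the lookup order described above, assuming the environment variable wins and ~/.lpconfig is consulted only when it is unset. This is not the actual Launchpad implementation: `resolve_config_name` is a hypothetical helper, and what happens when neither source exists is left open here (the sketch simply returns None).

```python
import os
import tempfile


def resolve_config_name(environ, home_dir):
    """Pick a configuration name: LPCONFIG first, then ~/.lpconfig, else None."""
    name = environ.get("LPCONFIG")
    if name:
        return name
    fallback = os.path.join(home_dir, ".lpconfig")
    if os.path.exists(fallback):
        with open(fallback) as handle:
            # The file content is used as the configuration name.
            return handle.read().strip() or None
    return None


# Demonstration with a throwaway home directory containing a .lpconfig file.
demo_home = tempfile.mkdtemp()
with open(os.path.join(demo_home, ".lpconfig"), "w") as handle:
    handle.write("ftpmaster\n")

print(resolve_config_name({}, demo_home))                          # falls back to the file
print(resolve_config_name({"LPCONFIG": "production"}, demo_home))  # variable takes priority
```

With such a file in place, a start script that forgets to export LPCONFIG would still end up with the intended configuration name.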

Tom Haddon (mthaddon) wrote :

We've added "export LPCONFIG=${LPCONFIG}" to the start_buildd_manager function.

Changed in launchpad-foundations:
status: Triaged → Fix Released