Too many ephemeral ports open on postgres container

Bug #1440857 reported by Charlie Ott
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Invalid
Medium
Fuel Library (Deprecated)
6.0.x
Won't Fix
Medium
Matthew Mosesohn
7.0.x
Invalid
Medium
Charlie Ott
8.0.x
Invalid
Medium
Fuel Library (Deprecated)

Bug Description

While trying to rebuild my postgres container and restore from pgsql/9.3 data directory from a previously lost container(http://docs.mirantis.com/openstack/fuel/master/operations.html#docker-metadata-corruption-loses-containers), i am having trouble connecting to the postgres database after starting the postgres container. 'dockerctl start postgres'

The TCP connection never happens and is timed out over and over. Thus I cannot log in to the fuel UI:

2015-04-06 17:29:29.881 868 TRACE keystone.common.wsgi OperationalError: (OperationalError) could not connect to server: Connection refused
2015-04-06 17:29:29.881 868 TRACE keystone.common.wsgi Is the server running on host "10.20.0.2" and accepting
2015-04-06 17:29:29.881 868 TRACE keystone.common.wsgi TCP/IP connections on port 5432?

So I checked the container status with docker:
[root@fuel ~]# docker ps -a | grep postgres | grep Up
d78296d733f2 fuel/postgres_6.0:latest /usr/local/bin/start 34 minutes ago Up 23 minutes 10.20.0.2:5432->5432/tcp, 127.0.0.1:5432->5432/tcp fuel-core-6.0-postgres

and it looks good, so then looking at netstat inside the container:

dockerctl shell postgres su postgres -c "netstat -a"

Shows that there are 748 waiting connections on my instance. (Currently)

This seems very excessive and inefficient.

using openstack paste, as this output is very long and will make this ticket ridiculous to scroll through: http://paste.openstack.org/show/198743/

ps -ef inside the container shows postmaster running and listening on the right port... but running it multiple times shows another instance starting/stopping over and over...???

[root@fuel ~]# dockerctl shell postgres su postgres -c "ps -ef | grep pgsql"
root 2895 1 0 19:14 ? 00:00:00 sudo -u postgres /usr/pgsql-9.3/bin/postmaster -p 5432 -D /var/lib/pgsql/9.3/data
postgres 2896 2895 9 19:14 ? 00:02:20 /usr/pgsql-9.3/bin/postmaster -p 5432 -D /var/lib/pgsql/9.3/data
root 11654 0 0 19:37 ? 00:00:00 su postgres -c ps -ef | grep pgsql
postgres 11661 11654 0 19:37 ? 00:00:00 bash -c ps -ef | grep pgsql
postgres 11663 11661 0 19:37 ? 00:00:00 grep pgsql
postgres 11664 2896 0 19:37 ? 00:00:00 /usr/pgsql-9.3/bin/postmaster -p 5432 -D /var/lib/pgsql/9.3/data

[root@fuel ~]# dockerctl shell postgres su postgres -c "ps -ef | grep pgsql"
root 2895 1 0 19:14 ? 00:00:00 sudo -u postgres /usr/pgsql-9.3/bin/postmaster -p 5432 -D /var/lib/pgsql/9.3/data
postgres 2896 2895 9 19:14 ? 00:02:30 /usr/pgsql-9.3/bin/postmaster -p 5432 -D /var/lib/pgsql/9.3/data
root 4732 0 0 19:39 ? 00:00:00 su postgres -c ps -ef | grep pgsql
postgres 4738 4732 0 19:39 ? 00:00:00 bash -c ps -ef | grep pgsql
postgres 4741 4738 0 19:39 ? 00:00:00 grep pgsql

[root@fuel ~]# dockerctl shell postgres su postgres -c "ps -ef | grep pgsql"
root 2895 1 0 19:14 ? 00:00:00 sudo -u postgres /usr/pgsql-9.3/bin/postmaster -p 5432 -D /var/lib/pgsql/9.3/data
postgres 2896 2895 9 19:14 ? 00:02:31 /usr/pgsql-9.3/bin/postmaster -p 5432 -D /var/lib/pgsql/9.3/data
root 6480 0 0 19:39 ? 00:00:00 su postgres -c ps -ef | grep pgsql
postgres 6487 6480 0 19:39 ? 00:00:00 bash -c ps -ef | grep pgsql
postgres 6489 6487 0 19:39 ? 00:00:00 grep pgsql
postgres 6490 2896 0 19:39 ? 00:00:00 /usr/pgsql-9.3/bin/postmaster -p 5432 -D /var/lib/pgsql/9.3/data

Revision history for this message
Charlie Ott (charlieott) wrote :

this is on a fuel 6.0 instance i have been upgrading since 5.0, (approx 8 months)

description: updated
description: updated
description: updated
description: updated
Charlie Ott (charlieott)
description: updated
description: updated
Changed in fuel:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Matthew Mosesohn (raytrac3r)
tags: added: customer-found
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

This isn't part of normal operations, but instead is related to an issue when recovering postgres container. Moving to medium. Charlie, can you let me know when you're available again?

Changed in fuel:
importance: High → Medium
milestone: none → 6.1
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Hi, Charlie

This is a medium priority bug and we reaching code freeze stage, so we will not fix it in 6.1 release, byt Matthew will post a workaround here to make it possible for you to fix your particular issue.

Revision history for this message
Charlie Ott (charlieott) wrote :

I understand. Thank you Vladimir for confirming and responding.

Matthew, I would like to spend some time tomorrow (4/7/2015) on this. I'll find you via #fuel on IRC and shoot you a message around 8AM EST.

thanks again to the mirantis team, you guys have a great product and great support.

Revision history for this message
Charlie Ott (charlieott) wrote :

following the script in /usr/local/bin/start.sh I discovered postmaster has a startup log declaring that logs were being handled with the postgres logging process and future logs would be generated in 'pg_log'

Then, found some errors in the log @ /var/lib/pgsql/9.3/data/pg_log/postgresql-Mon.log with yesterday's timestamps,

http://paste.openstack.org/show/199728/

investigating further.

Revision history for this message
Charlie Ott (charlieott) wrote :

bottom of the log actually shows this:
< 2015-04-07 19:59:35.746 BST >FATAL: database "nailgun" does not exist
< 2015-04-07 19:59:35.750 BST >FATAL: database "nailgun" does not exist
< 2015-04-07 19:59:35.754 BST >FATAL: database "nailgun" does not exist
< 2015-04-07 19:59:35.758 BST >FATAL: database "nailgun" does not exist
< 2015-04-07 19:59:35.762 BST >FATAL: database "nailgun" does not exist
< 2015-04-07 19:59:35.766 BST >FATAL: database "nailgun" does not exist
< 2015-04-07 19:59:35.769 BST >FATAL: database "nailgun" does not exist
< 2015-04-07 19:59:35.773 BST >FATAL: database "nailgun" does not exist
< 2015-04-07 19:59:35.776 BST >FATAL: database "nailgun" does not exist
< 2015-04-07 19:59:35.781 BST >FATAL: database "nailgun" does not exist
< 2015-04-07 19:59:35.784 BST >FATAL: database "nailgun" does not exist
< 2015-04-07 19:59:35.788 BST >FATAL: database "nailgun" does not exist

the log is 1.3 GB

Revision history for this message
Charlie Ott (charlieott) wrote :

so i started a new postgres container with 'dockerctl start postgres', let it boot and then stopped it once it kept timing out connecting to postgres. i dont know why it was timing out and restarting, but if i had to guess something like supervisord must be starting it over and over and the db config had some kind of issue.

i then exported my old postgres container i found using the docker logs and fetching the container via devicemapper, and saved it as a gzipped tarball. NOTE: use the --numeric-owner flag with tar to preserve the postgres uid/gid (26/26), then started the container in centos 6.6 instance and fiddled with permission issues and disabling selinux on the docler host to get the database daemon to start.

next i took a text based export of the databases: nailgun, keystone and ostf. did the with pg_dump. and scp'd the exports to the fuel master.

next i started the container with 'docketctl start postgres' and copied the exports to /var/docker/devicemapper/mnt/containerid/rootfs/var/lib/postgres/

last i used 'dockerctl shell postgres' with the postgres user and directed each sql export into the psql binary using the postgres user to the proper database. e.g. import the nailgun export into the nailgun db.

after a bit i checked the connections inside postgres container with dockerctl shell netstat (which apparently is deprecated by ss in debian jessie???) and saw that other connections were established with the container. hit the fuel ui and ran health check and all is well.

i recommend configuring an auto backup strategy for postgres db and allow 'snapshot roll back' option in the fuel UI; to automate this process. maybe change this status to feature request.

Matthew,
when i get a chance i'll re-read your docs and see if anything can be clarified better for someone less familiar with devicemapper and docker. re-mounting the old container with devicemapper and getting the exports,, i think was most challenging.

tags: added: qa-agree-7.0 release-notes
no longer affects: fuel/6.1.x
Changed in fuel:
status: Confirmed → Won't Fix
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Charlie, you promised to provide more details on the issue. Have you figured out the details.

Changed in fuel:
status: Incomplete → Won't Fix
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

I don't think we need release notes for this one. If you follow directions for recovering postgres container, you you should stop nailgun/ostf/keystone containers during this process. That will prevent too many connections from appearing and impacting performance.

tags: removed: release-notes
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

This bug doesn't apply to 6.1, 7.0 or 8.0. Marking won't fix.

Dmitry Pyzhov (dpyzhov)
tags: added: area-library
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Charlie Ott (charlieott) → Fuel Library Team (fuel-library)
milestone: 6.1 → 8.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.