filesystem getting full on a contrail analytics node installation

Bug #1749900 reported by Prashanth Nageshappa on 2018-02-16
This bug affects 2 people
Affects            Status                   Importance  Assigned to          Milestone
Juniper Openstack  Status tracked in Trunk
  R4.0             Invalid                  Critical    Sundaresan Rajangam
  R4.1             Invalid                  High        Sundaresan Rajangam
  R5.0             Invalid                  High        Sundaresan Rajangam
  Trunk            Invalid                  High        Sundaresan Rajangam

Bug Description

The file system is getting full on a Contrail analytics node installation.

root@anc1-prd1-csp-can-03:~# df -h
Filesystem Size Used Avail Use% Mounted on
udev 16G 4.0K 16G 1% /dev
tmpfs 3.2G 3.4M 3.2G 1% /run
/dev/sda1 280G 251G 15G 95% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
none 5.0M 0 5.0M 0% /run/lock
none 16G 1.4M 16G 1% /run/shm
none 100M 0 100M 0% /run/user
none 280G 251G 15G 95% /var/lib/docker/aufs/mnt/0cdd862de767c10f139ad8f7d6979350509bf8112c70937ab72679b0154c2933
shm 64M 0 64M 0% /var/lib/docker/containers/ea0084b2ed9bc46a949b391d9de6ef24cc7c4eb700d19433e351e82ad6ed1b6c/shm
none 280G 251G 15G 95% /var/lib/docker/aufs/mnt/c1c83a33f0fcd9ed24a2f1a314f93e50ca5e63d7f816a1a9de8e59d823f6538a
shm 64M 0 64M 0% /var/lib/docker/containers/8f92055f1cdcf00b14a426ea8c850e04be9760bf906bf4484666dd6f29a18b68/shm
none 280G 251G 15G 95% /var/lib/docker/aufs/mnt/45dd227e0405a3305116e1613a2b4ef35b881de3a6f58d6f6cb00edbf9389401
shm 64M 148K 64M 1% /var/lib/docker/containers/c37694501abdaa500b64136ac030455a9bf39484d3bbaa7457ca53e340f0b917/shm

root@anc1-prd1-csp-can-03:/# du -sh /var/lib/docker/*
241G /var/lib/docker/aufs
4.2G /var/lib/docker/containers
8.2M /var/lib/docker/image
60K /var/lib/docker/network
20K /var/lib/docker/plugins
4.0K /var/lib/docker/swarm
4.0K /var/lib/docker/tmp
4.0K /var/lib/docker/trust
28K /var/lib/docker/volumes
root@anc1-prd1-csp-can-03:/# du -sh /var/lib/docker/aufs/*
121G /var/lib/docker/aufs/diff
224K /var/lib/docker/aufs/layers
121G /var/lib/docker/aufs/mnt

Inside the controller container:
root@anc1-prd1-csp-can-03(controller):/# du -sh /var/lib/zookeeper/*
0 /var/lib/zookeeper/myid
34G /var/lib/zookeeper/version-2
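
The drill-down above, running du one level at a time and following the largest entry, can be sketched as a small helper. This is only an illustration: `largest_entries` is a hypothetical name, and the example path is simply the one from the output above.

```shell
#!/bin/sh
# Hypothetical helper: print the N largest entries directly under a
# directory, largest first. Sizes are reported in KiB (du -k) so that
# sort -n can compare them numerically.
largest_entries() {
    dir=$1
    count=${2:-5}          # default to the top 5 entries
    du -sk -- "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Example, using a path from the output above:
# largest_entries /var/lib/docker/aufs/diff 5
```

Repeating the call on the largest entry it reports narrows the search one level at a time.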

information type: Proprietary → Public
summary: - filesystem getting full
+ filesystem getting full on a contrail analytics node installation
Ramkantha R Guthi (rkrguthi) wrote :

Copying comments history from JIRA bug https://aspg-jira.juniper.net/browse/CXU-18060?filter=14433

Hi Prasanth,

The customer has 3 CAN nodes running on the 3.0 release.

On one of the nodes the / partition is 100% full, hence no containers are running on it.

I have logged in to the other CAN nodes and ~80G was occupied by /var/crash. That explains where 80G of space went, but each CAN node has ~280G allocated to the / partition. Do you think cleaning up /var/crashes in each container will help the situation? Also, since containers are not running on the node with the 100% full / partition (anc1-prd1-csp-adm-01), how are we going to clean up space on that node (assuming you are targeting only /var/crashes)?

Please find the session log attached.

Thanks
Ram

Prashanth Nageshappa added a comment - 2/15/18 19:58
Clearing core files will help to some extent, but still we need to look at what else is taking ~200GB space.
We will need to have Contrail team debug this to understand why so much space is being taken. Will discuss with Mohan to check who from Contrail team can be involved to look into this.

Prashanth Nageshappa added a comment - 2/15/18 22:19
I have opened contrail bug https://bugs.launchpad.net/juniperopenstack/+bug/1749900

Changed in juniperopenstack:
importance: Undecided → Critical
Ramkantha R Guthi (rkrguthi) wrote :

Any update on this?

The customer is looking for a resolution, as the CAN nodes have been in this state for a while.

Ramkantha R Guthi (rkrguthi) wrote :

Hi Sundar,

I have attached other logs collected from the customer's CAN nodes. These logs were uploaded to the CSO JIRA bug and are missing in Launchpad.

As per the logs, it seems more space is consumed by active containers than by old images, so we can rule out the issues mentioned in GitHub forums about stale or old images.

root@anc1-prd1-csp-can-01:/var/lib/docker# cd aufs/
root@anc1-prd1-csp-can-01:/var/lib/docker/aufs# ls
diff layers mnt

root@anc1-prd1-csp-can-01:/var/lib/docker/aufs/diff# du -sh *
12K 00a0206078b9c8a520e4d2ebd82ec1033aa4c91001f87fe9b165a46aeee71f83
20K 00cf9f653764931f31c55cc3f0a041721fe646e1a0f56c9fb1c696309d4b8709

..
..
24K 12b85e14145df7328c7bc706fc3cfb1b506830de770fca2a4d9a7feb09ba104c
20K 12b85e14145df7328c7bc706fc3cfb1b506830de770fca2a4d9a7feb09ba104c-init
...

1.3G 56fde5f11d9656d27c25d3f1f2d15da30267240f4488e95e113bda26d71ec17f
24K 56fde5f11d9656d27c25d3f1f2d15da30267240f4488e95e113bda26d71ec17f-init
..
37G 6e31524d1bedfc2f27bfdecd91c2ca0bbcc0e25eda5d41e3acf75daba8b9257c
24K 6e31524d1bedfc2f27bfdecd91c2ca0bbcc0e25eda5d41e3acf75daba8b9257c-init
..
86G bdeeb2089d2e519d9c4f5fe9392fd4e9c3db83be41b481336e68e0b08e891531
24K bdeeb2089d2e519d9c4f5fe9392fd4e9c3db83be41b481336e68e0b08e891531-init

CAN-2 node :

root@anc1-prd1-csp-can-02:/var/lib/docker/aufs# du -sh *
119G diff
224K layers
119G mnt

root@anc1-prd1-csp-can-02:/var/lib/docker/aufs/mnt# du -sh *
4.0K 0560635be9b776566523735df40f8b1dffe5afc62485f9b526334c6c8e9fca4c
4.0K 074953fd509f9dd6b51825278c37e482e75cff7fd71ce3f4cc1eece020cea018

..
..
33G 25f577baf2f68a1d21a2292057930699c0aef7e318edaea9fbfbe261fe80fe45
4.0K 25f577baf2f68a1d21a2292057930699c0aef7e318edaea9fbfbe261fe80fe45-init
..
..
82G 5fe1c6812ec09036e909831a7960d2cf4f197969d645e6e84de7f8507c56e772
4.0K 5fe1c6812ec09036e909831a7960d2cf4f197969d645e6e84de7f8507c56e772-init
..
..
4.0K ecf1d4413761230bc46c24d4a2bc07e5f6784191cd7a57ad076b4e521271ab61-init
4.0K ede479c43090e998f2b64c3828b4f764424ef0c97b214a87165cf0991fd4657f
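
To check which container owns one of the large directories listed above, the directory name can be compared against the mount IDs that Docker records per container. The sketch below assumes the layout used by Docker's aufs driver in releases with content-addressable storage, where `image/aufs/layerdb/mounts/<container-id>/mount-id` holds the name of the container's writable layer. `container_for_layer` is a hypothetical helper name, and the layout should be verified on the node before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: given an aufs layer directory name, print the ID of
# the container whose writable layer it is.
# ASSUMPTION: mount IDs live under <root>/image/aufs/layerdb/mounts/.
container_for_layer() {
    layer=$1
    root=${2:-/var/lib/docker}
    for f in "$root"/image/aufs/layerdb/mounts/*/mount-id; do
        [ -f "$f" ] || continue
        if [ "$(cat "$f")" = "$layer" ]; then
            # The parent directory name is the container ID.
            basename "$(dirname "$f")"
            return 0
        fi
    done
    return 1               # no container claims this layer
}

# Example, using a directory name from the listing above:
# container_for_layer 5fe1c6812ec09036e909831a7960d2cf4f197969d645e6e84de7f8507c56e772
```

The printed ID can then be matched against the output of `docker ps --no-trunc`.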

As discussed in the other email thread, we need answers to the points below from an analytics perspective. Based on the above logs, the analytics containers should account for about 120G, so we need more answers from the analytics team. Is there any auto-purge mechanism in place?

From: Himanshu Bahukhandi
Sent: Monday, March 5, 2018 9:06 AM
To: Prashanth Nageshappa <email address hidden>; Soumit Mishra <email address hidden>; Santosh Gupta <email address hidden>; Jeba Paulaiyan <email address hidden>; Kamlesh Parmar <email address hidden>; Ritam Gangopadhyay <email address hidden>; Contrail Systems Analytics Team <email address hidden>
Cc: Rudra Rugge <email address hidden>; Abhay Joshi <email address hidden>; Viswanath KJ <email address hidden>; Srinivasan Dhamotharan <email address hidden>; Ramakantha Guthi <email address hidden>
Subject: Re: CAN production server reached 100% disk space

Hello,
We can rule out 1 and 2 for now as customer moved to 3.2.1 and doesn't see a lot of zookeeper snapshots. Also, there are no analytics core in the new CSO build that they are running. We still need to focus on the analytics table size that grows with time. T...

Ramkantha R Guthi (rkrguthi) wrote :

From: Ramakantha Guthi <email address hidden>
Date: Thursday, March 8, 2018 at 12:15 PM
To: Prashanth Nageshappa <email address hidden>, Himanshu Bahukhandi <email address hidden>, Santosh Gupta <email address hidden>, Sundaresan Rajangam <email address hidden>
Cc: Rudra Rugge <email address hidden>, Abhay Joshi <email address hidden>, Viswanath KJ <email address hidden>, Srinivasan Dhamotharan <email address hidden>, Soumit Mishra <email address hidden>, Kamlesh Parmar <email address hidden>, Contrail Systems Analytics Team <email address hidden>, Ritam Gangopadhyay <email address hidden>, Jeba Paulaiyan <email address hidden>
Subject: Re: CAN production server reached 100% disk space

Hi Prasanth / Santosh / Sundar,

I have updated LP #1749900. I think we can ignore the stale images option, as the logs show that most of the space under /var/lib/docker/aufs/diff is occupied by active containers.

Per Himanshu's earlier email, we need more answers on:

1. Zookeeper snapshot files in controller container (~38G)
/var/lib/zookeeper/version-2/

2. Contrail Collector process crash files in the analytics container (~80G)
 /var/crashes/core.contrail-collec.*

3. Cassandra database table in the analyticsdb container (~80G)
/var/lib/cassandra/data/ContrailAnalyticsCql/statstablebystrtagv3-3028f790891111e78c7b0f519d1da8b1
   Also, /var/lib/docker/aufs/diff shows active container images taking ~120G.

Currently issues 1 & 2 are not seen in the customer setup, as they moved from CSO 3.0 to CSO 3.2.1. If possible, it would be good to know whether issues #1 and #2 are fixed in the CSO 3.2.1 build.

Thanks
Ram
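
For item 1, ZooKeeper has its own retention settings, `autopurge.snapRetainCount` and `autopurge.purgeInterval`, which control how many snapshots are kept and how often (in hours) old ones are purged; a purge interval of 0 disables purging. A minimal sketch for checking whether they are set follows. The function name is hypothetical, and the configuration path in the example is an assumption, since the thread does not say where zoo.cfg lives in the controller container.

```shell
#!/bin/sh
# Hypothetical helper: show the autopurge settings in a zoo.cfg-style file,
# or report that none are present.
autopurge_settings() {
    cfg=$1
    grep -E '^[[:space:]]*autopurge\.(snapRetainCount|purgeInterval)[[:space:]]*=' "$cfg" \
        || echo "no autopurge settings found in $cfg"
}

# Example (the path is an assumption, adjust to the actual location):
# autopurge_settings /etc/zookeeper/conf/zoo.cfg
```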

Jeba Paulaiyan (jebap) on 2018-03-20
tags: added: analytics
tags: added: 2018-0129-0643 jtac
tags: added: gci
Sundaresan Rajangam (srajanga) wrote :

The analytics data TTL values (in contrail-collector.conf) do not seem to have been set appropriately for the available resources. Hence marking the bug as Invalid.
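
A quick way to see which TTLs are currently configured is to list the TTL keys in the collector configuration. This is a sketch under assumptions: the key pattern matched below (names such as `analytics_data_ttl` or `analytics_flow_ttl`) and the file path are what Contrail installations are generally understood to use, not something confirmed in this report, and `show_ttls` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print every analytics_*ttl setting in a config file.
# ASSUMPTION: TTL keys follow the analytics_<name>_ttl naming pattern.
show_ttls() {
    conf=$1
    grep -E '^[[:space:]]*analytics_[a-z_]*ttl[[:space:]]*=' "$conf"
}

# Example (the path is an assumption, adjust to the actual location):
# show_ttls /etc/contrail/contrail-collector.conf
```

Comparing the reported values with the disk available to the analyticsdb container gives a first indication of whether the retention period is too long for the node.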
