Landscape breaks if the filesystem is full

Bug #1488516 reported by Junien F
This bug affects 6 people
Affects: Landscape Client
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Hi,

While tracking down why a computer (an OpenStack instance) was removed from Landscape, I found that a removal profile had removed it after 3 days of inactivity.

I found nothing wrong on the computer (no network issue, and the Landscape processes are the same ones started when the server booted), except for the following logs. Note how they all go nearly silent after 21:56:

broker.log :
2015-08-21 21:26:58,380 INFO [MainThread] Message exchange completed in 0.41s.
2015-08-21 21:41:58,384 INFO [MainThread] Starting message exchange with https://landscape.example.com/message-system.
2015-08-21 21:41:58,760 INFO [PoolThread-twisted.internet.reactor-1] Sent 1308 bytes and received 236 bytes in 0.37s.
2015-08-21 21:41:58,765 INFO [MainThread] Message exchange completed in 0.38s.
2015-08-21 21:56:58,767 INFO [MainThread] Starting message exchange with https://landscape.example.com/message-system.
2015-08-23 06:27:17,978 INFO [MainThread] Landscape Logs rotated
2015-08-23 06:27:18,015 INFO [MainThread] Landscape Logs rotated

manager.log :
2015-08-21 21:26:47,974 INFO [MainThread] Got notification of impending exchange. Notifying all plugins.
2015-08-21 21:41:48,385 INFO [MainThread] Got notification of impending exchange. Notifying all plugins.
2015-08-21 21:56:48,769 INFO [MainThread] Got notification of impending exchange. Notifying all plugins.
2015-08-23 06:27:17,982 INFO [MainThread] Landscape Logs rotated
2015-08-23 06:27:18,027 INFO [MainThread] Landscape Logs rotated
2015-08-23 06:27:18,185 INFO [MainThread] Landscape Logs rotated

And finally monitor.log, which is where I think the root cause lies. Notice how there is no more queueing of messages after the ENOSPC:

2015-08-21 21:26:47,979 INFO [MainThread] Got notification of impending exchange. Notifying all plugins.
2015-08-21 21:26:48,011 INFO [MainThread] Queueing a message with updated data watcher info for landscape.monitor.activeprocessinfo.ActiveProcessInfo.
2015-08-21 21:41:48,389 INFO [MainThread] Got notification of impending exchange. Notifying all plugins.
2015-08-21 21:41:48,419 INFO [MainThread] Queueing a message with updated data watcher info for landscape.monitor.activeprocessinfo.ActiveProcessInfo.
2015-08-21 21:55:36,673 INFO [MainThread] 240 of 240 expected load average snapshot events (100.00%) occurred in the last 3600.00s.
2015-08-21 21:55:36,677 INFO [MainThread] 240 of 240 expected memory/swap snapshot events (100.00%) occurred in the last 3600.00s.
2015-08-21 21:55:36,677 INFO [MainThread] 12 of 12 expected mount info snapshot events (100.00%) occurred in the last 3600.00s.
2015-08-21 21:55:36,713 INFO [MainThread] 120 of 119 expected CPU usage snapshot events (100.84%) occurred in the last 3600.00s.
2015-08-21 21:55:36,714 WARNING [MainThread] 0 of 719 expected Swift device usage snapshot events (0.00%) occurred in the last 3600.00s.
2015-08-21 21:55:36,714 WARNING [MainThread] 0 of 119 expected Ceph usage snapshot events (0.00%) occurred in the last 3600.00s.
2015-08-21 21:56:48,775 INFO [MainThread] Got notification of impending exchange. Notifying all plugins.
2015-08-21 21:56:48,784 ERROR [MainThread] Error running event handler landscape.broker.client.Monitor.notify_exchange() for event type 'impending-exchange' with args () {}.
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/landscape/reactor.py", line 92, in fire
    results.append(handler(*args, **kwargs))
  File "/usr/lib/python2.7/dist-packages/landscape/broker/client.py", line 220, in notify_exchange
    self.exchange()
  File "/usr/lib/python2.7/dist-packages/landscape/monitor/monitor.py", line 34, in exchange
    self.flush()
  File "/usr/lib/python2.7/dist-packages/landscape/monitor/monitor.py", line 29, in flush
    self.persist.save(self.persist_filename)
  File "/usr/lib/python2.7/dist-packages/landscape/lib/persist.py", line 158, in save
    self._backend.save(filepath, self._hardmap)
  File "/usr/lib/python2.7/dist-packages/landscape/lib/persist.py", line 633, in save
    file.write(self._bpickle.dumps(map))
IOError: [Errno 28] No space left on device
2015-08-21 21:56:48,813 INFO [MainThread] Queueing a message with updated data watcher info for l2015-08-21 23:55:36,674 INFO [MainThread] 240 of 239 expected load average snapshot events (100.42%) occurred in the last 3600.00s.
2015-08-21 23:55:36,676 INFO [MainThread] 240 of 240 expected memory/swap snapshot events (100.00%) occurred in the last 3600.00s.
2015-08-21 23:55:36,676 INFO [MainThread] 12 of 12 expected mount info snapshot events (100.00%) occurred in the last 3600.00s.
2015-08-21 23:55:36,713 INFO [MainThread] 120 of 119 expected CPU usage snapshot events (100.84%) occurred in the last 3600.00s.
2015-08-21 23:55:36,714 WARNING [MainThread] 0 of 719 expected Swift device usage snapshot events (0.00%) occurred in the last 3600.00s.
2015-08-21 23:55:36,714 WARNING [MainThread] 0 of 119 expected Ceph usage snapshot events (0.00%) occurred in the last 3600.00s.
2015-08-22 00:55:36,672 INFO [MainThread] 240 of 239 expected load average snapshot events (100.42%) occurred in the last 3600.00s.
2015-08-22 00:55:36,674 INFO [MainThread] 240 of 239 expected memory/swap snapshot events (100.42%) occurred in the last 3600.00s.
2015-08-22 00:55:36,675 INFO [MainThread] 12 of 11 expected mount info snapshot events (109.09%) occurred in the last 3600.00s.

Could the ENOSPC make some part of Landscape die, which would in turn make it stop exchanging with the server? If so, I think this should be fixed in some way.

Thanks
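
For illustration only, here is a minimal sketch of how a persist step could tolerate ENOSPC instead of letting the error propagate out of the exchange handler. This is not Landscape's code: `save_state` and `exchange` are hypothetical names, and the standard `pickle` module stands in for the bpickle backend seen in the traceback. The state is written to a temporary file and renamed into place, so a full disk leaves the previous file intact, and the exchange is still attempted when the save fails.

```python
import errno
import logging
import os
import pickle
import tempfile


def save_state(state, path):
    """Write state to path atomically; return False if the disk is full.

    Hypothetical helper, not part of landscape-client.
    """
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = None
    try:
        # Write to a temporary file in the same directory, then rename it
        # over the target so a failed write never truncates the old file.
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as handle:
            handle.write(pickle.dumps(state))
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
        return True
    except OSError as error:
        # Clean up the partial temporary file, if one was created.
        if tmp_path is not None and os.path.exists(tmp_path):
            os.unlink(tmp_path)
        # Only a full disk is tolerated here; other errors still propagate.
        if error.errno != errno.ENOSPC:
            raise
        logging.warning("Could not persist state to %s: %s", path, error)
        return False


def exchange(state, path, send):
    """Persist if possible, then attempt the exchange regardless."""
    persisted = save_state(state, path)
    send(state)  # `send` is any callable that delivers the data
    return persisted
```

Whether skipping the save is acceptable depends on what the persisted data is used for; the point of the sketch is only that a failed write does not have to stop the exchange.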

Junien F (axino) wrote :

This is landscape 14.12-0ubuntu0.14.04 by the way.

tags: added: is
Tom Haddon (mthaddon) wrote :

We were just hit by this on an archive server that ran out of disk space due to Apache holding open hidden files.

Tom Haddon (mthaddon)
Changed in landscape-client:
status: New → Confirmed
Changed in landscape-client:
status: Confirmed → Triaged
importance: Undecided → High
tags: added: bug-squad kanban
tags: removed: kanban
Bogdana Vereha (bogdana)
Changed in landscape-client:
assignee: nobody → Bogdana Vereha (bogdana)
status: Triaged → In Progress
Bogdana Vereha (bogdana)
Changed in landscape-client:
assignee: Bogdana Vereha (bogdana) → nobody
status: In Progress → Triaged
David Britton (dpb) wrote :

@mthaddon -- do you have a suggestion on how you would like this to work conceptually? Right now it's basically working as designed.

Changed in landscape-client:
status: Triaged → Incomplete
James Troup (elmo) wrote :

landscape-client, on noticing write failures (due to ENOSPC or any other reason), could send an emergency one-shot notification back to the Landscape server, which could then be used a) to inform the administrator of a problem with the machine and/or b) to stop the machine from being auto-purged due to ping timeout.
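
A rough sketch of the one-shot idea, under stated assumptions: `WriteFailureNotifier`, the `"write-failure"` message type, and the `send` callable are all hypothetical and do not correspond to any existing landscape-client API. The notifier sends a single message for the first write failure and stays quiet until it is cleared, so a full disk does not produce a flood of notifications.

```python
import errno


class WriteFailureNotifier:
    """Send at most one notification per failure episode (hypothetical)."""

    def __init__(self, send):
        self._send = send        # callable that delivers a dict to the server
        self._notified = False   # True once a notification has gone out

    def report(self, error, path):
        """Notify the server about a write failure; return True if sent."""
        if self._notified:
            return False
        message = {
            "type": "write-failure",  # hypothetical message type
            "path": path,
            "errno": error.errno,
            "reason": errno.errorcode.get(error.errno, "unknown"),
        }
        self._send(message)
        self._notified = True
        return True

    def clear(self):
        """Re-arm the notifier, e.g. after a write succeeds again."""
        self._notified = False
```

A caller would invoke `report()` from the `except` branch around a failed write and `clear()` once writes succeed again. How the server would treat such a message (alerting the administrator, suspending auto-removal) is left open here, as it is in the comment above.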

Changed in landscape-client:
status: Incomplete → New
Jacek Nykis (jacekn)
tags: added: canonical-is
Changed in landscape-client:
status: New → Triaged