Monitoring the CONFINE testbed presents specific challenges that the designed system should address:

  • The system must cope with a large volume of data that is accessed infrequently.
  • The monitoring system should support active measurements that provide insight into the functioning of a node without revealing too much information about what is running on it.
  • It should also gather Slice and Sliver specific information.
  • The system should be flexible enough to add new metrics without hampering existing functionality.
  • Monitoring logs should never lose precision and should support passively measured data such as the last time SSH succeeded, the number of ports in use, and resource hogs (which experiments are using the most CPU, memory, bandwidth and ports).

General-purpose monitoring systems do not meet these special-purpose requirements of CONFINE; they are designed for different workloads and properties. Slice-specific and sliver-specific information (LXC monitoring) cannot be obtained directly from any of the existing monitoring systems. Nagios, Zenoss, ntop, Ganglia and Cacti use RRDtool for storing data.

RRDtool (Round Robin Database tool) is great for storing time series data and aggregating information, but it is quite inflexible, so a compromise between flexibility and efficiency becomes necessary. Adding new metrics would require updating the database file (RRD). Once an RRD and its Round Robin Archives (RRAs) are created, existing values can be changed and new data added, but it is not possible to add or remove metrics or change their properties. If the data model is not considered carefully, this can lead to a series of updates whenever new slivers are created on a node. Slice-specific data spans different nodes and would result in a dynamic list of RRDs, which in turn would need additional scripts to fetch, aggregate and display the data. For instance, in CoMon (the monitoring system of PlanetLab) the data model is carefully chosen, but old database files are still deleted when the format changes.

In many cases (depending on configuration), if an update is made to an RRD series but is not followed by another update soon afterwards, the original update is lost. This makes RRDtool less suitable for recording data such as operational metrics. There is also no way to back-fill data in an RRD series and, depending on the data model, a single RRD receiving data from multiple sources can be affected by this. Given the large scale, varying resource consumption and dynamic nature of CONFINE, flexibility is a key requirement.
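The limitations described above (a metric set fixed at creation time, no back-filling, and a fixed-size archive) can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyRoundRobinStore`, `demo` and the metric names are hypothetical, and the code does not use or reproduce RRDtool's actual API or file format.

```python
from collections import deque


class ToyRoundRobinStore:
    """Toy fixed-schema, fixed-size time series store (NOT RRDtool's API)."""

    def __init__(self, metrics, slots):
        # Assumption: the metric set is frozen at creation time.
        self.metrics = tuple(metrics)
        # A bounded deque drops the oldest row once `slots` rows are stored.
        self.rows = deque(maxlen=slots)
        self.last_timestamp = None

    def update(self, timestamp, values):
        # Reject metrics that were not declared when the store was created.
        unknown = set(values) - set(self.metrics)
        if unknown:
            raise KeyError(f"metric(s) not in schema: {sorted(unknown)}")
        # Reject timestamps that are not newer than the last update (no back-fill).
        if self.last_timestamp is not None and timestamp <= self.last_timestamp:
            raise ValueError("cannot back-fill: timestamp is not newer than last update")
        self.last_timestamp = timestamp
        # Metrics missing from this update are stored as None (unknown).
        self.rows.append((timestamp, tuple(values.get(m) for m in self.metrics)))


def demo():
    store = ToyRoundRobinStore(metrics=["cpu", "mem"], slots=3)
    for t in (10, 20, 30, 40):
        store.update(t, {"cpu": t / 100, "mem": 512})
    # Only the three newest rows survive; the row at t=10 was overwritten.
    outcomes = {"oldest_kept": store.rows[0][0], "rows": len(store.rows)}
    try:
        store.update(50, {"sliver42_cpu": 0.3})  # metric for a newly created sliver
    except KeyError:
        outcomes["new_metric_rejected"] = True
    try:
        store.update(25, {"cpu": 0.2})  # sample that arrives late
    except ValueError:
        outcomes["backfill_rejected"] = True
    return outcomes
```

In this toy model, a metric for a newly created sliver cannot be recorded without creating a new store, and a sample that arrives late is refused. The section argues that these kinds of constraints make an RRD-based design awkward for CONFINE.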

Apart from that, sliver-centric information is not easily integrated into node-centric data provided by off-the-shelf monitoring systems. This kind of data gathering is also part of the motivation for developing a separate monitoring system to meet the specific needs of CONFINE.

monitor/motivation.txt · Last modified: 2014/07/22 18:21 by esunly