
Resource monitoring

Code SRRM-1
Responsible Ivan Vilata
Components testbed server, testbed node

Description

The status of the resource should be available to the system or administrator both in real time and aggregated over time periods (hours, days), in order to measure the impact of experiments and to support automatic and manual resource allocation policies.

Comments

This is different from resource information (SRRM-3) that is more static.

This is a usual practice in most management infrastructures. It doesn't imply that we need to use different tools, just that there are different requirements:

  • Used/Available CPU
  • Used/Available disk
  • Used/Available bandwidth
  • Average uptime/Fail probability

Related to requirement SRRM-7.

Analysis

Details

Node or testbed administrators should be able to receive or access information about the usage of node resources in a readily comprehensible form, supported by tables, graphs, maps and any other means that aid understanding. This information would reflect the current state of the node or historical data for different time spans and resolutions. It should include node uptime, CPU load, available memory, disk, processes and bandwidth, the number of running, queued and deployed slivers, the number of available and used interfaces and the size of the sliver queue for those, and any other measurement deemed useful for administration. Setting triggers that fire when these parameters reach certain limits would enable admin notifications. Making historical trends visible should also help in dimensioning the node (global testbed trends would help in dimensioning the testbed itself), and giving programmatic access to at least the current measurements should help automated resource allocation.
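As an illustration of the trigger idea above, here is a minimal sketch of checking one measurement sample against configured limits. The metric names (cpu_load, disk_free_mb, sliver_queue) and the limit values are assumptions made up for this example; the requirement itself does not define them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Trigger:
    metric: str         # key in the sample dict (hypothetical metric name)
    limit: float        # threshold value (hypothetical)
    above: bool = True  # fire when value > limit; if False, when value < limit


def fired_triggers(sample: Dict[str, float], triggers: List[Trigger]) -> List[Trigger]:
    """Return the triggers whose limit is crossed by the given sample."""
    fired = []
    for trigger in triggers:
        value = sample.get(trigger.metric)
        if value is None:
            continue  # metric not reported in this sample; nothing to compare
        crossed = value > trigger.limit if trigger.above else value < trigger.limit
        if crossed:
            fired.append(trigger)
    return fired


# Hypothetical limits an admin might configure.
TRIGGERS = [
    Trigger("cpu_load", 4.0),
    Trigger("disk_free_mb", 100.0, above=False),
    Trigger("sliver_queue", 10.0),
]

sample = {"cpu_load": 5.2, "disk_free_mb": 2048.0, "sliver_queue": 3.0}
alerts = [t.metric for t in fired_triggers(sample, TRIGGERS)]
```

How a fired trigger is turned into an admin notification (mail, messaging, a flag on the server) is left open here, as it is in the text.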

Solutions

Resource monitoring data is collected by several existing tools in community networks. For CONFINE nodes we may rely on that data, but the set of measurements varies considerably among different tools and networks. This may justify adopting our own tool (new or existing) for uniformity inside and among CONFINE testbeds. Since resource monitoring data changes constantly, it may be convenient to keep it out of relational databases and in specialised databases like RRDtool's (a common practice in existing tools), which also make it easier to graph data averaged over different time spans.
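The round-robin storage idea mentioned above can be sketched in a few lines. This is a simplified illustration of the concept (a fixed number of rows, each the average of a fixed number of primary samples), not RRDtool's file format or API; the class name and parameters are assumptions for this example.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-size archive: every `steps` samples are averaged into one row,
    and only the newest `rows` rows are kept (older ones are overwritten)."""

    def __init__(self, steps: int, rows: int):
        self.steps = steps
        self.rows = deque(maxlen=rows)  # oldest row drops out automatically
        self._pending = []              # samples not yet consolidated

    def update(self, value: float) -> None:
        self._pending.append(value)
        if len(self._pending) == self.steps:
            self.rows.append(sum(self._pending) / self.steps)
            self._pending.clear()


# One fine-grained archive and one coarse archive fed by the same samples.
fine = RoundRobinArchive(steps=1, rows=4)
coarse = RoundRobinArchive(steps=3, rows=2)
for load in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]:
    fine.update(load)
    coarse.update(load)
```

Storage stays constant in size regardless of how long the node runs, and the coarse archive already holds the averaged values needed for a graph over a longer time span.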

Synchronicity options:

  1. Asynchronous (push): nodes send status to server/admin.
    • Delivery options:
      1. Register subscribers in node.
      2. Query external host for node's subscribers.
      3. Use external host as a relay.
  2. Synchronous (pull): server/admin queries node for its status.
    • Accessibility options:
      1. Data is publicly available.
      2. Access to data is restricted.
    • Collection options:
      1. Node runs an SNMP daemon.
      2. Node runs a custom daemon with an HTTP interface to data.

Result format options:

  1. Multimedia data, designed for human consumption (e.g. web page navigation with text, graphs, maps…).
  2. Structured data, designed for machine consumption (e.g. API-like access to XML, YAML, JSON…).
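To make the "custom daemon with an HTTP interface" and "structured data" options concrete, here is a minimal sketch using only the Python standard library. The /status path, the JSON field names and the choice of measurements are assumptions for illustration; a real node would report the measurements listed under Details.

```python
import json
import os
import shutil
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect_status() -> dict:
    """Collect a few current measurements (field names are hypothetical)."""
    usage = shutil.disk_usage("/")
    status = {
        "timestamp": int(time.time()),
        "disk_total_bytes": usage.total,
        "disk_free_bytes": usage.free,
    }
    try:
        status["load_1min"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass  # load average is not available on every platform
    return status


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/status":  # hypothetical endpoint
            self.send_error(404)
            return
        # Each request triggers a fresh collection (no caching).
        body = json.dumps(collect_status()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(port: int = 8080) -> HTTPServer:
    """Build the server; call serve_forever() on the result to run it."""
    return HTTPServer(("", port), StatusHandler)
```

Running `make_server().serve_forever()` on a node would let a server or admin fetch the current status with a plain HTTP GET, and the same JSON could back both an API and a rendered web page.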

Unless caching is used (which is unfit for such rapidly changing data), each access to the monitor triggers a collection of resource data. Asynchronous monitoring should therefore keep monitoring overhead low even with many subscribers, since one update triggers many notifications. However, we expect very few subscribers per node (a CONFINE server, a node admin). This may not justify configuring the required messaging system (SMTP, XMPP, OStatus…), which may also be more sensitive to connectivity problems since the node must take care of delivery.
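The overhead argument can be made concrete with a small back-of-the-envelope calculation. It assumes, for illustration only, that there is no caching, that every subscriber polls at the same interval in pull mode, and that the node collects once per interval in push mode; the numbers are made up.

```python
def collections_per_hour(mode: str, subscribers: int, interval_s: int) -> int:
    """Number of resource data collections a node performs per hour."""
    updates = 3600 // interval_s
    if mode == "push":
        return updates                # one collection feeds all subscribers
    if mode == "pull":
        return updates * subscribers  # every query triggers its own collection
    raise ValueError(f"unknown mode: {mode!r}")


# Few subscribers (a server and an admin) versus many, at a 5 minute interval.
few = {m: collections_per_hour(m, subscribers=2, interval_s=300) for m in ("push", "pull")}
many = {m: collections_per_hour(m, subscribers=50, interval_s=300) for m in ("push", "pull")}
```

Under these assumptions the difference is small with two subscribers (12 versus 24 collections per hour) and only becomes significant with many (12 versus 600), which matches the reasoning that push may not be worth its setup cost here.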

Synchronous monitoring is used at guifi.net (public SNMP), Freifunk Oldenburg and wlan slovenija (public HTTP). Monitoring overhead may be high (amounting to a DoS attack in the extreme) if no access control is performed, but no messaging system is needed and connectivity problems don't interfere with monitoring. SNMP is an industry standard available on many platforms, but HTTP access to monitoring results can be very simple to implement and allows for both web and API access. Public access may be more sensitive to DoS attacks, but it should also enable unplanned, third-party experiments on nodes' data while not revealing security-sensitive information.
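Should access be restricted rather than public, one simple form of access control is an address allowlist checked before any data is collected. The sketch below is only one possible approach; the networks shown are documentation address ranges standing in for a testbed server and an admin network, not real CONFINE addresses.

```python
import ipaddress

# Hypothetical allowlist: the testbed server and an admin network.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.10/32"),
    ipaddress.ip_network("2001:db8:1::/48"),
]


def is_allowed(client_ip: str) -> bool:
    """Return True if the client address belongs to an allowed network."""
    address = ipaddress.ip_address(client_ip)
    # An IPv4 address is never "in" an IPv6 network and vice versa.
    return any(address in network for network in ALLOWED_NETWORKS)
```

A node daemon could call `is_allowed()` with the peer address of each request and refuse the request before collecting anything, so that unauthorised queries add almost no monitoring overhead.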

Structured data is amenable to collection by servers, which can offer a global view of the testbed while taking care of presentation tasks, reducing monitoring overhead on nodes. All the examples above use structured data, while Freifunk Dresden routers offer a web site with human-readable text and graphs.

Using a server and asynchronous monitoring can make the representation of periodic and global data more complicated.

Recommendation

I recommend synchronous monitoring initiated by the CONFINE testbed server, with nodes providing structured text data publicly via HTTP. Monitoring data should be stored at the server in RRD files. Human-readable renderings of monitoring data should be produced by the server from those files. This is the model followed by Freifunk Oldenburg and wlan slovenija, so we may consider using or adapting their software.
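A minimal sketch of the server side of this recommendation follows, assuming nodes expose JSON at some status URL. The URLs and function names are hypothetical, and for brevity the samples are kept in an in-memory dict rather than in RRD files as recommended.

```python
import json
import urllib.request
from typing import Callable, Dict, List


def fetch_status(url: str, timeout: float = 5.0) -> dict:
    """Pull one structured status document from a node over HTTP."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def poll_nodes(
    node_urls: Dict[str, str],
    history: Dict[str, List[dict]],
    fetch: Callable[[str], dict] = fetch_status,
) -> List[str]:
    """Query every node once, append samples to history, return unreachable nodes."""
    unreachable = []
    for name, url in node_urls.items():
        try:
            sample = fetch(url)
        except (OSError, ValueError):
            # Network errors and timeouts are OSError; malformed JSON is ValueError.
            unreachable.append(name)
            continue
        history.setdefault(name, []).append(sample)
    return unreachable
```

The server would call `poll_nodes()` periodically. An unreachable node then shows up as a gap in its history without affecting the other nodes, which matches the point that connectivity problems on one node do not disturb pull-based monitoring as a whole.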

requirements/resource-monitoring.txt · Last modified: 2012/01/31 18:56 by ivilata