Node Metrics

The daemon running on the Research Device (RD) provides node-centric data, including sliver-specific information, and performs monitoring periodically (e.g. every sixty seconds). Slice-centric information is generated by analysing the node-centric logs and is stored in the database. The client daemon accepts HTTP requests and returns HTTP responses, so the data can be accessed from web browsers as well as by automated systems. Responses are provided in JSON format so that researchers can query and use the monitored data.
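As an illustration of how an automated system might consume the daemon's JSON responses, here is a minimal Python sketch. This page does not document the daemon's port, path or field names, so the URL is left to the caller and the function names `parse_node_metrics` and `fetch_node_metrics` are hypothetical.

```python
import json
import urllib.request


def parse_node_metrics(body: bytes) -> dict:
    """Decode a JSON response body (UTF-8 assumed) into a dictionary."""
    return json.loads(body.decode("utf-8"))


def fetch_node_metrics(url: str, timeout: float = 10.0) -> dict:
    """Send an HTTP GET to the client daemon and return the parsed JSON.

    `url` must be supplied by the caller; the daemon's real address,
    port and path are not given on this page.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_node_metrics(response.read())
```

Keeping the parsing step separate from the network call makes it possible to reuse `parse_node_metrics` on documents that were saved earlier or read from the database.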

Figure 9 shows a node-type document in JSON format, received by the server at time = 1385076755.63 and containing node-centric and sliver-specific data for the RD with IPv6 address fdf5:5351:1dfd:9b::2. As can be observed, the name of the document is a combination of the IPv6 address and the timestamp: [fdf5:5351:1dfd:9b::2]-1385076755.63.
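The naming scheme above can be sketched in Python as follows. The helper names `make_doc_name` and `parse_doc_name` are hypothetical, and the sketch assumes the timestamp is written in plain decimal form, as in the example name.

```python
import ipaddress


def make_doc_name(ipv6: str, server_timestamp: float) -> str:
    """Build a document name of the form [<IPv6>]-<timestamp>."""
    return f"[{ipv6}]-{server_timestamp}"


def parse_doc_name(name: str) -> tuple[str, float]:
    """Split a document name back into (IPv6 address, timestamp)."""
    address_part, separator, timestamp_part = name.rpartition("]-")
    if not separator or not address_part.startswith("["):
        raise ValueError(f"unexpected document name: {name!r}")
    ipv6 = address_part[1:]
    ipaddress.IPv6Address(ipv6)  # raises ValueError if not a valid IPv6 address
    return ipv6, float(timestamp_part)


name = make_doc_name("fdf5:5351:1dfd:9b::2", 1385076755.63)
ipv6, timestamp = parse_doc_name(name)
```

Splitting on the last "]-" rather than on "-" avoids any ambiguity with the colons and digits inside the address.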

                                   Figure 9: Monitored metrics in the Research Device              

The system reports the following OS-provided metrics:

  • Uptime: A measure of how long the system has been running since it was last restarted, given as the total number of seconds since the last boot.
  • Network data:
    • Total number of bytes sent and received.
    • The number of bytes sent and received during the last second.
    • For each network interface on the RD, the total number of bytes sent and received and the number of bytes received during the last minute.
  • Average load: This metric is relative to the number of cores in the RD's CPU. For example, with 2 cores and load_avg = 2, the system is running at maximum capacity and the CPU is occupied all the time. If load_avg > 2, the system is overloaded and tasks are waiting for the CPU.
    • load_avg_15min, load_avg_5min, load_avg_1min: how many tasks were waiting or running on the CPU over the last 15 min, 5 min and 1 min respectively. These metrics give the recent history of the CPU load and an idea of its evolution.
    • Tasks scheduled to run.
    • Total number of tasks.
  • Virtual memory utilization:
    • “available” is the total memory space available in bytes.
    • “total” is the total memory size in bytes.
    • “used” is the total memory space used in bytes.
    • “percent_used” is the total percentage of memory space used.
  • Disk:
    • “size” is the disk size in bytes.
    • For each disk partition we have the total size in bytes, the space used (in bytes and in percentage) and the free disk space in bytes.
  • CPU utilization:
    • “num_cpus” is the number of cores in the CPU.
    • For each processor we have the percentage of CPU usage.
    • The total percentage of CPU utilization.
  • Slivers:
    • For each sliver running on the RD we have its state and metrics of the memory and CPU utilization.
    • We also have the name of the slice to which each sliver belongs.
  • Timestamps: We have three different timestamps in the node documents.
    • “server_timestamp” is the local time of the server when it receives the JSON document with the monitored information.
    • “monitored_timestamp” is the local time on the RD when the client performs monitoring tasks.
    • “relative_timestamp” is calculated by the client daemon (see the Timestamp section of the client daemon). It allows the server to know the relative time interval between all the different node documents generated by a specific RD.
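The rule given under "Average load" in the list above (load compared against the number of cores) can be sketched as a small Python helper. The function name `load_status` and the three labels it returns are illustrative assumptions, not part of the monitoring system.

```python
def load_status(load_avg: float, num_cpus: int) -> str:
    """Classify a load average relative to the number of CPU cores.

    Follows the rule described above: load equal to the core count
    means the CPU is fully occupied; anything higher means tasks
    are waiting for the CPU.
    """
    if num_cpus <= 0:
        raise ValueError("num_cpus must be positive")
    if load_avg > num_cpus:
        return "overloaded"
    if load_avg == num_cpus:
        return "at capacity"
    return "below capacity"
```

Applying the same helper to the 15-, 5- and 1-minute values gives a rough picture of whether the load on an RD is rising or falling.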

The “relative_timestamp” allows us to visualize on the monitor homepage not only the most recent information (from the “most recent” JSON documents) but also the evolution of the node metrics over time.
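A minimal sketch of how node documents from one RD could be ordered by “relative_timestamp” to obtain a time series is shown below. It assumes each document is a dictionary with the metric stored at the top level; the real document layout (see Figure 9) may be nested differently, and the function name `metric_series` and the key `cpu_percent` used in the example are hypothetical.

```python
def metric_series(docs: list[dict], metric_key: str) -> list[tuple[float, float]]:
    """Return (relative_timestamp, value) pairs sorted by relative_timestamp.

    `metric_key` names a top-level field in each document; this flat
    layout is an assumption made for the sketch.
    """
    ordered = sorted(docs, key=lambda doc: doc["relative_timestamp"])
    return [(doc["relative_timestamp"], doc[metric_key]) for doc in ordered]


# Example with made-up values, received out of order.
example_docs = [
    {"relative_timestamp": 120, "cpu_percent": 30.0},
    {"relative_timestamp": 60, "cpu_percent": 10.0},
]
series = metric_series(example_docs, "cpu_percent")
```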

As an example, Figure 10 shows the CPU usage over time for a Research Device.

                                     Figure 10: Average CPU usage

Also, the status of slivers along with their IP address and resource usage is shown in Figure 11.

                                     Figure 11: Sliver usage

The system maintains only a manageable set of metrics that help researchers gain insight into any strange behaviour in a given node. To help researchers select nodes on which to run their experiments, the web interface provides a Treemap view of all the nodes based on the (customizable) historical trend of resource usage. Figure 12 shows a Treemap view of all the monitored Research Devices in the testbed.

                                          Figure 12: Treemap representation
monitor/metrics/node.txt · Last modified: 2014/07/22 18:55 by esunly