
Testbed maintenance

As a testbed operator, there are some recurring tasks that you will need to perform to keep the testbed working properly. Here you will find some recommendations to maintain the Controller, ways to apply changes to many nodes at once, and tools to debug possible trouble.

Upgrading Controller to a newer version

From time to time a new version of the Controller software is released. It is announced on the confine-devel mailing list and the release changes are published on Release Notes.

To check the current version of your Controller log into the controller server (e.g. via SSH) as the system user and run the following command (use your own system user and testbed name if different):

$ python ~/mytestbed/manage.py controllerversion
1.0.1

Then, you can check if you have the latest version by running:

$ pip search confine-controller
confine-controller     - Django-based framework for building control servers for
                         computer networking and distributed systems testbeds.
  INSTALLED: 1.0.1
  LATEST:    1.0.2

If you want to upgrade, check Release Notes for the latest version and then run:

$ sudo python ~/mytestbed/manage.py upgradecontroller

In a few minutes you will have your Controller up to date.
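
Once it finishes, you can confirm the new version by running the version command again; it should now print the version you just installed (1.0.2 in this example):

$ python ~/mytestbed/manage.py controllerversion
1.0.2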

Remote node maintenance

Via the Maintenance Application, the Controller provides a mechanism to perform maintenance operations remotely on a set of nodes of the testbed.

If you go to Administration > Maintenance > Operation (e.g. http://controller.example.com/admin/maintenance/operation/), you can create a new operation (e.g. a script to fix a wrong configuration), select the nodes where you want to run it and check the status of the execution.

Please note that node administrators can choose whether or not to allow centralized access to their nodes, which means that the Controller will only be able to perform maintenance operations on the nodes which accept its SSH key.
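
As an illustration, an operation can simply be a small shell script like the following sketch; the configuration file and values are made-up examples, not a real node setting:

#!/bin/sh
# Hypothetical maintenance operation: detect a wrong value in a
# (made-up) configuration file, fix it and report what happened.
if grep -q 'wrong_value' /etc/example.conf; then
    sed -i 's/wrong_value/right_value/' /etc/example.conf
    echo "configuration fixed"
else
    echo "nothing to do"
fi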

Testbed status monitoring

The Controller provides several monitoring tools which help testbed operators detect and debug problems, and which also give an overview of the health of the testbed. There are three Controller applications in charge of collecting information about the current status of the testbed: one of them monitors the Controller itself (the Monitor Application), while the others keep track of the other components of the testbed (the State Application and the Pings Application).

Besides the tools provided by these applications, the Controller offers a set of reports that give a general overview of the testbed:

  • Slices allocation per group in the testbed: Slices > Slices > Status overview (e.g. http://controller.example.com/admin/state/state/slices/)
  • Slivers allocation by node in the testbed: Slices > Slivers > Status overview (e.g. http://controller.example.com/admin/state/state/slivers/)
  • Testbed status report: Nodes > Summary (e.g. http://controller.example.com/admin/state/state/report/)
  • Map of the testbed: Nodes > Nodes Map (e.g. http://controller.example.com/gis/map/)

Monitor application

The Monitor Application collects information about the status of the server where the Controller is running: memory usage, CPU load, storage, bandwidth, etc. You can get an overview of the historical resource usage by clicking on the State link of the main server: Nodes > Server > Main server (e.g. http://controller.example.com/admin/nodes/server/monitor/).

State application

The State Application retrieves information about nodes and slivers via the Node API (see CONFINE REST API). The information retrieved is shown as a summary in the node list (e.g. http://controller.example.com/admin/nodes/node/) and sliver list (e.g. http://controller.example.com/admin/slices/sliver/), displaying states such as OFFLINE, PRODUCTION, or STARTED. You can get detailed information by clicking on them or going to the state page of a node or sliver (e.g. http://controller.example.com/admin/nodes/node/1/state). The node firmware version is also reported, so you can tell which version of the CONFINE node system each node is running.

Pings application

The Pings Application checks the connectivity through the management network. Periodically, the Controller pings the other components of the testbed (nodes, hosts and slivers). Go to the State page of the component and then click on the Pings button of the action links (in the top right corner) to see the result of the last pings.
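
If you need to debug connectivity by hand, you can also ping a component directly from the controller server over the management network; the address below is only an illustrative placeholder, use the management address of the component you are interested in:

$ ping6 -c 3 fdc0:7246:b03f::5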

Controller logs

You can find the logs related to the Controller environment in /var/log/ (use your own testbed name if different):

  • NginX (web server): /var/log/nginx/[access|error].log. These can grow unwieldy in big testbeds, so you may want to set access_log off in the /api locations of your configuration (e.g. /etc/nginx/conf.d/mytestbed.conf).
  • tinc (management network overlay): grep tinc.mytestbed /var/log/syslog
  • Celery (task queue): /var/log/celery/
    • w1 & w2: Celery workers
    • beat: periodic task executer
    • celeryev: celery monitor
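
For example, to follow the Celery workers and the tinc daemon while debugging a problem (the exact file names under /var/log/celery/ may differ on your installation):

$ tail -f /var/log/celery/*.log
$ grep tinc.mytestbed /var/log/syslog | tail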

Controller Celery tasks

Celery is a distributed task queue used by the Controller to execute tasks in an asynchronous way (e.g. firmware generation, monitoring tasks…). The admin site provides an interface to manage tasks run by Celery workers. You can access it via Administration > Djcelery > Tasks (e.g. http://controller.example.com/admin/djcelery/taskstate/). There you get an overview of the tasks handled by Celery and their current state, which may help you to debug failures (the task state may include a traceback).
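
From the shell you can also quickly check that the Celery daemons (workers, beat and celeryev) are actually running; this is just a rough check and the exact process names may vary between installations:

$ ps aux | grep [c]elery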

Controller disk usage

The Controller has a complex environment which involves several services (Celery, RabbitMQ, PostgreSQL…), each of them with its own particularities. As a testbed operator you should take care of all of them to avoid exhausting disk space.

One of the known issues when there is not enough free disk space (less than 1 GiB) is that the RabbitMQ daemon (the messaging system used by the Controller to communicate with Celery) stops (see RabbitMQ disk alarms for more details).
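
A quick way to keep an eye on the free space of the partitions holding the database, the media files and /tmp is simply:

$ df -h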

The following subsections provide some recommendations to keep your disk clean.

Firmware generation temporary files

During firmware generation, the Controller creates a temporary directory /tmp/tmpXXXXXX to unpack and customize the base image. In some situations, the Controller may fail to clean up the workspace and leave this temporary directory behind.

$ ls /tmp/
tmprJlfI_
$ du -hs /tmp/tmprJlfI_
257M	/tmp/tmprJlfI_

You can safely remove them if they are older than one day.
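
For instance, assuming no other software on the server keeps similarly named directories that you still need, leftover workspaces older than one day could be removed with something like:

$ find /tmp -maxdepth 1 -type d -name 'tmp??????' -mtime +0 -exec rm -rf {} +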

Clean orphan files

Firmware base images and built firmware images, as well as slice and sliver data and templates, take a lot of disk space and can cause insufficient disk space errors (see issue #326).

The Controller provides a periodic task (disabled by default) that keeps the filesystem clean by deleting files which are no longer associated with any model (e.g. firmware builds after their node has been deleted); see related issue #192. An extra configuration step is required to enable this clean orphan files task.

This task requires the django-orphaned app to be installed (as root):

# pip install https://github.com/ledil/django-orphaned/archive/master.zip

Add django-orphaned to INSTALLED_APPS in the Controller settings file (see Controller configuration) and configure the cleanup apps:

INSTALLED_APPS = (
    'django_orphaned',
    ...
)
 
import os
from firmware.settings import FIRMWARE_BUILD_IMAGE_PATH, FIRMWARE_BASE_IMAGE_PATH
from slices.settings import (SLICES_TEMPLATE_IMAGE_DIR,
    SLICES_SLICE_DATA_DIR, SLICES_SLIVER_DATA_DIR)
 
FW_BUILD_IMAGE_ROOT = os.path.join(PRIVATE_MEDIA_ROOT, FIRMWARE_BUILD_IMAGE_PATH)
FW_BASE_IMAGE_ROOT = os.path.join(MEDIA_ROOT, FIRMWARE_BASE_IMAGE_PATH)
SLICES_TEMPLATE_ROOT = os.path.join(MEDIA_ROOT, SLICES_TEMPLATE_IMAGE_DIR)
SLICES_DATA_ROOT = os.path.join(MEDIA_ROOT, SLICES_SLICE_DATA_DIR)
SLIVER_DATA_ROOT = os.path.join(MEDIA_ROOT, SLICES_SLIVER_DATA_DIR)
 
ORPHANED_APPS_MEDIABASE_DIRS = {
    'firmware': {
        'root': (FW_BUILD_IMAGE_ROOT, FW_BASE_IMAGE_ROOT),
        'exclude': ('.gitignore',)
    },
    'slices': {
        'root': (SLICES_TEMPLATE_ROOT, SLICES_DATA_ROOT, SLIVER_DATA_ROOT),
        'exclude': ('.gitignore',)
    },
}

You can check if it is properly configured by running:

$ python ~/mytestbed/manage.py deleteorphaned --info
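
If the reported directories look correct, running the same command without --info should delete the orphaned files right away (this is also what the periodic task does once it is enabled):

$ python ~/mytestbed/manage.py deleteorphaned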

Controller database

The monitoring applications of the Controller (the Pings Application and the State Application) make intensive use of the database, since they periodically store monitoring data there. Although the Controller implements mechanisms to reduce the disk usage footprint by aggregating old data, in some situations the database may become huge and some write operations very slow, or they may even fail.

Here you can find some tips to monitor and optimize the database:

  • Adjust the aggregation periods used for downsampling the pings (according to the accuracy you want for older pings) in the Controller settings file (see Controller configuration):
    PING_DEFAULT_INSTANCE['downsamples'] = (
        # Limitations: you cannot say 16 months or 40 days,
        #              but you can say 2 years or 2 months.
        # Pings older than 1 year are aggregated into 4-hour samples.
        (relativedelta(years=1), timedelta(minutes=240)),
        # Pings older than 6 months are aggregated into 1-hour samples.
        (relativedelta(months=6), timedelta(minutes=60)),
        # Pings older than 2 weeks are aggregated into 5-minute samples.
        (relativedelta(weeks=2), timedelta(minutes=5)),
    )

    If the periodic downsampling task regularly fails with TimeLimitExceeded, you may want to run the following code manually in manage.py shell_plus:

    from pings.settings import PING_INSTANCES
    from pings.tasks import downsample
    for instance in PING_INSTANCES:
        downsample(instance['model'])

    Please note that this may take several hours to complete.

  • Check the database size and consider removing very old data to reduce it:
    $ psql controller
    controller=> select pg_size_pretty(pg_database_size('controller'));
     pg_size_pretty 
    ----------------
     529 MB
    (1 row)
    
    controller=> SELECT nspname || '.' || relname AS "relation",
      pg_size_pretty(pg_relation_size(C.oid)) AS "size"
      FROM pg_class C
      LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
      WHERE nspname NOT IN ('pg_catalog', 'information_schema')
      ORDER BY pg_relation_size(C.oid) DESC
      LIMIT 5;
    
                        relation                    |  size   
    ------------------------------------------------+---------
     public.monitor_timeserie                       | 216 MB
     public.monitor_timeserie_name_42a3bc16d24d56d6 | 101 MB
     public.monitor_timeserie_pkey                  | 36 MB
     public.pings_ping                              | 34 MB
     public.pings_ping_date_f29a98176e19536         | 32 MB
  • Execute operations to optimize tables, like VACUUM and REINDEX. For instance, to reclaim all unused space in all tables, return it to the operating system, and reindex all tables (which can take a while), run VACUUM FULL and REINDEX DATABASE controller (as shown below).
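
A possible way to run them from the shell is shown below; keep in mind that VACUUM FULL takes exclusive locks on the tables it rewrites, so the Controller may be unresponsive while it runs:

$ psql controller -c 'VACUUM FULL;'
$ psql controller -c 'REINDEX DATABASE controller;'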

Caching API requests

A testbed with many nodes, slivers and users may result in high Controller CPU usage and long response times while replying to API requests. If the Controller has enough free memory, the web server can be configured to cache some responses, so that they are temporarily stored and served again quickly with little delay and CPU usage.

An example caching configuration for NginX in the /etc/nginx/conf.d/mytestbed.conf file may include the following options for a 50 MiB in-memory cache (under /dev/shm/nginx):

# Define an in-memory cache storage named ``cache`` with 50 MiB.
proxy_cache_path /dev/shm/nginx levels=1:2 keys_zone=cache:50m;
server {
    listen [fdc0:7246:b03f::2]:443 ssl;  # cache mgmt net requests…
    […]
    location /api/ {  # …for the API
        […]
        proxy_cache       cache;
        proxy_cache_key   $host$uri$is_args$args$http_accept_encoding$http_accept;
        proxy_cache_valid 1m;  # keep entries for at most 1 minute
        expires           1m;

        set $skip_cache 0;

        if ($request_method != GET) {  # only cache GET requests…
            set $skip_cache 1;
        }
        if ($http_cookie) {  # …without cookies…
           set $skip_cache 1;
        }
        if ($http_authorization) { # …and anonymous (like most nodes')
           set $skip_cache 1;
        }

        proxy_cache_bypass  $skip_cache;
        […]
    }
}

Restart NginX when done with service nginx restart.
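
To check that caching is working, you can request the same API URL twice and compare response times; with a warm cache the second request should be noticeably faster. The URL below is only an example, adjust it to your server's management address and an API path of interest:

$ curl -gks -o /dev/null -w '%{time_total}s\n' https://[fdc0:7246:b03f::2]/api/
$ curl -gks -o /dev/null -w '%{time_total}s\n' https://[fdc0:7246:b03f::2]/api/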
