Troubleshooting

Here you may find ways to fix some common known issues with the software used to run a CONFINE testbed.

Problem with generated certificates

As discussed in issue #625, a firmware generation bug affects the generation of the certificates used by uhttpd (the node web server). Although the bug is fixed in Controller version 0.11.7, operators of existing testbeds need to perform a few actions in their Controller as the system user.

  1. Upgrade the Controller to version 0.11.7 or later (see Upgrading Controller to a newer version):
    $ sudo python ~/mytestbed/manage.py upgradecontroller \
      --controller_version=0.11.7
  2. Patch the controller server API certificate in the testbed registry:
    $ # Get path of server certificate.
    $ python ~/mytestbed/manage.py print_settings | grep PKI_CA_CERT_PATH
    PKI_CA_CERT_PATH                         = '/var/lib/vct/server/pki/ca/cert'
    $ # Backup current certificate.
    $ mv ~/mytestbed/pki/ca/cert ~/mytestbed/pki/ca/cert.old
    $ # Show current certificate information (keep it to generate new certificate).
    $ openssl x509 -in ~/mytestbed/pki/ca/cert.old -text
    [...]
    $ # Generate a new certificate with version 3 (0x2).
    $ python ~/mytestbed/manage.py setuppki  # include your organization details
  3. Remove the invalid node certificate from NodeKeys (NOT from node@registry:/api/cert) using python ~/mytestbed/manage.py shell_plus:
    from M2Crypto import RSA, X509
     
    def get_node_certificate_version(node):
        """Return the X.509 version field of the node's certificate (None if missing)."""
        if node.keys.cert is None:
            return None
        pem_string = str(node.keys.cert)
        cert = X509.load_cert_string(pem_string)
        return cert.get_version()
     
    def is_valid_node_certificate_version(node):
        """A broken certificate has version field 3 (0x3); valid ones use 2 (0x2)."""
        return get_node_certificate_version(node) != 3
     
    def fix_node_certificate_version(node):
        """
        Remove invalid stored /etc/uhttpd.crt.pem file
        Will be regenerated on next firmware build (node.api.cert too)
        NOTE: should be executed with patched controller (X509 version 0x2)
        """
        assert not is_valid_node_certificate_version(node)
        cert = node.files.get(path=NodeKeys.CERT)
        assert cert.content == node.api.cert, "Node %s" % node.pk
        cert.delete()
     
    # Get nodes with invalid certificate version.
    affected_nodes = []
    for node in Node.objects.all():
        if not is_valid_node_certificate_version(node):
            affected_nodes.append(node.pk)
            #fix_node_certificate_version(node)  # UNCOMMENT to fix all affected nodes
     
    print "Found %i affected node(s)" % len(affected_nodes)
  4. Upgrade the affected nodes (see Node upgrade).
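To double-check that certificates produced by the patched Controller carry the correct version field, you can inspect them with openssl. The snippet below uses a throwaway self-signed certificate purely for illustration (the file names and subject are made up); run the same `openssl x509` command against your real certificate paths instead:

```shell
# Generate a throwaway self-signed certificate (illustration only;
# file names and subject are arbitrary).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=example" \
    -keyout /tmp/test-key.pem -out /tmp/test-cert.pem 2>/dev/null

# A correctly generated certificate reports "Version: 3 (0x2)".
openssl x509 -in /tmp/test-cert.pem -noout -text | grep 'Version'
```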

Celery tasks not shown on Django admin interface

If you go to Administration > Djcelery > Tasks and the lists show no objects or only old tasks (e.g. received yesterday or 2 days ago), check whether the Celery monitor components are running:

  • celeryev (e.g. ps ax | grep celeryev)
  • celerybeat (e.g. ps ax | grep celerybeat)

If any of them is not running, start it as root with service SERVICE start (SERVICE being celeryevcam or celerybeat). Otherwise, you may try restarting it with service SERVICE restart.
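If you prefer to script the check above, a minimal sketch using Python's standard subprocess module (the process names are the ones listed above; everything else is illustrative):

```python
import subprocess

def is_running(name):
    """Return True if some process command line contains `name`."""
    output = subprocess.check_output(["ps", "ax"]).decode()
    # Ignore the grep processes that a manual check would spawn.
    return any(name in line and "grep" not in line
               for line in output.splitlines())

for service in ("celeryev", "celerybeat"):
    state = "running" if is_running(service) else "NOT running"
    print("%s is %s" % (service, state))
```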

Uploading templates or other big files fails

You have probably reached NginX's maximum POST size. This limit exists to discourage the upload of big sliver templates (since they are expected to be transferred over not-so-reliable community networks).

You should remove or increase client_max_body_size in your NginX configuration (it may appear more than once).
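For example, a fragment like the following (the 500m value is only an example; adjust it to your template sizes, and remember the directive may appear in the http, server or location context):

```nginx
http {
    # Allow POST bodies (e.g. sliver template uploads) up to 500 MB.
    # Example value; check every place the directive appears.
    client_max_body_size 500m;
}
```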

Celery task fails with "Too many open files"

The operating system limits the number of files a process may open. Since the ping and state apps use one file descriptor per node and per sliver, this limit can be reached on testbeds with many nodes and slivers (e.g. 240 nodes and 800 slivers mean 1040 open files). In this situation Celery tasks show the FAILED state with this error message:

OperationalError: could not create socket: Too many open files

You can check the limit by running:

$ ulimit -Hn  # hard limit
4096
$ ulimit -Sn  # soft limit
1024

To temporarily increase this limit you can run as root:

# ulimit -Sn 2048
# ulimit -Sn
2048
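The same limits can also be read and adjusted programmatically with Python's standard resource module, which may be handy from Celery's startup code (a sketch; 2048 is just an example value):

```python
import resource

# Read the current soft and hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft=%d hard=%d" % (soft, hard))

# Adjust the soft limit, never exceeding the hard limit;
# only root may raise the hard limit itself.
wanted = 2048
new_soft = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
print("new soft=%d" % resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```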

To make this change permanent for Celery, add the previous command to Celery's initialization script and restart the daemon. See the Debian Wiki page on limits for more information.

Other issues

If you have found an issue that is not solved here, and you have not found any related information in the CONFINE wiki or CONFINE's Redmine site, consider asking for help on the confine-devel mailing list.

admin/troubleshooting.txt · Last modified: 2015/09/04 16:09 by ivilata