Open Data Sets

During the project, data sets are collected and made available to external researchers.

Open data sets are available in this repository

Background information

Open data and data sets are receiving a lot of interest these days. Some background references:

An interesting motivation can be found in this speech by Neelie Kroes.

A similar experiment, on internet scale

Data Set Types

During the Athens meeting of March 2012, two types of data sets were considered: flow data and network topology data.

Flow Data

Access to all packets in a network is a rich source of information about network usage. However, this data is very privacy sensitive, so a number of remarks have to be made immediately:

  • to protect privacy, only packet headers can be exposed: we publish only flow data, never payloads
  • anonymisation should be applied to the headers
  • the community members have to agree to this; we cannot just start sniffing
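
As a minimal sketch of the header anonymisation mentioned above, the following Python fragment replaces the IPv4 addresses in a flow record with keyed pseudonyms and drops any payload. The record layout (the src, dst and payload keys) and the SECRET_KEY value are assumptions made for illustration, not the format of any existing tool. The mapping does not preserve prefixes, so a prefix-preserving scheme such as Crypto-PAn may be preferable if subnet structure matters to researchers.

```python
import hashlib
import hmac
import ipaddress

# Assumption: a secret chosen per data set and never published.
SECRET_KEY = b"replace-with-a-per-data-set-secret"


def pseudonymise_ip(address: str, key: bytes = SECRET_KEY) -> str:
    """Map an IPv4 address to a stable pseudonymous IPv4 address.

    The HMAC-SHA256 digest is truncated to 4 bytes, so two distinct
    addresses can (rarely) map to the same pseudonym.
    """
    packed = ipaddress.IPv4Address(address).packed
    digest = hmac.new(key, packed, hashlib.sha256).digest()
    return str(ipaddress.IPv4Address(digest[:4]))


def anonymise_flow(flow: dict, key: bytes = SECRET_KEY) -> dict:
    """Return a copy of a flow record with pseudonymised addresses and no payload.

    The keys "src", "dst" and "payload" are hypothetical field names.
    """
    out = dict(flow)
    out["src"] = pseudonymise_ip(flow["src"], key)
    out["dst"] = pseudonymise_ip(flow["dst"], key)
    out.pop("payload", None)  # never publish payloads
    return out
```

Because the same key always yields the same pseudonym, flows between the same hosts stay linkable within one data set; using a fresh key for each release would prevent linking across releases.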

While some tools exist to anonymise NetFlow data, providing these data sets will require a lot of preparation. Open questions:

  • what is permissible for each community?
  • how do we notify and get agreement from participating community members?
  • which tools do we use to anonymise data?
  • how do we collect log data in a live network without also capturing the traffic generated by the logging itself?

Network Topology Data

The topology of a community network is very interesting for network researchers. It also offers a good source of information about real-world protocol deployments. This information is less privacy sensitive, although care should still be taken. Again, tools to gather this information continuously have to be identified. As this depends heavily on the specific community network, this task will require cooperation from each network that wants to provide this information. Open questions:

  • does each community network want to give this information?
  • which information is available?
  • which tools will we use?

Node Information

Each community network maintains a database of nodes (WIND DB, a Drupal node DB, …), which is currently being standardised in the commonNodeDB.

This information is static, but easily accessible and a good candidate for correlation.

Open questions:

  • are there possible privacy issues with this data?
  • is it straightforward to convert from one DB format to another?
  • can this data be released?
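
Whether conversion between node databases is straightforward depends largely on how well their fields line up. The following is a minimal sketch, assuming two hypothetical source layouts (db_a and db_b) and a hypothetical common record (CommonNode); none of these field names are taken from the WIND DB, a Drupal node DB or the commonNodeDB.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CommonNode:
    """Hypothetical common node record, for illustration only."""
    node_id: str
    name: str
    lat: Optional[float]
    lon: Optional[float]


# Hypothetical mappings: source field name -> CommonNode field name.
FIELD_MAPS = {
    "db_a": {"id": "node_id", "title": "name", "latitude": "lat", "longitude": "lon"},
    "db_b": {"nid": "node_id", "hostname": "name", "lat": "lat", "lon": "lon"},
}


def _to_float(value) -> Optional[float]:
    """Convert a coordinate to float, keeping missing values as None."""
    return None if value is None else float(value)


def to_common(record: dict, source: str) -> CommonNode:
    """Convert one source record to the common layout.

    The node id is prefixed with the source name so that ids from
    different databases cannot clash.
    """
    mapping = FIELD_MAPS[source]
    values = {target: record.get(field) for field, target in mapping.items()}
    return CommonNode(
        node_id=f"{source}:{values['node_id']}",
        name=str(values["name"]),
        lat=_to_float(values["lat"]),
        lon=_to_float(values["lon"]),
    )
```

Keeping the mappings in a table rather than in code means that adding another database is a matter of adding one more entry, provided its fields have direct equivalents.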

The integration of DLEP (Dynamic Link Exchange Protocol) in our software will allow us to easily collect layer 2 and layer 1 (L2/L1) information. This can be very helpful for analysing routing protocols and estimating link quality. This information is strongly related to the topology data, and should preferably be collected simultaneously.

Open questions:

  • which tool do we use to continuously record DLEP information?
  • can any privacy implications be expected from this data?
  • do the community networks mind if this information is released?

Radio Planning Data

Some community networks, e.g. FunkFeuer, use radio planning software like RadioMobile to predict attenuation between different radio nodes. Together with the DLEP information, this could be valuable information to estimate link quality and assess both radio planning software quality and hardware quality.

Note: this depends on the quality of the propagation model and the input parameter sets (e.g. ground reflection coefficient, …) used to configure the model, which might require additional tuning and information.
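
One possible way to assess the planning software is to compare its predicted attenuation with the attenuation actually observed on each link. The sketch below assumes that both values are available in dB for every link; the function name and the sample numbers are purely illustrative.

```python
import math


def prediction_error(pairs):
    """Summarise how far predictions are from measurements.

    `pairs` is a sequence of (predicted_db, measured_db) tuples, one per link.
    Returns (bias, rmse): the mean signed error and the root mean square error.
    """
    errors = [predicted - measured for predicted, measured in pairs]
    if not errors:
        raise ValueError("at least one (predicted, measured) pair is required")
    n = len(errors)
    bias = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return bias, rmse


# Illustrative values only: two links, one over- and one under-predicted by 2 dB.
print(prediction_error([(100, 98), (110, 112)]))  # → (0.0, 2.0)
```

A bias close to zero combined with a large RMSE would suggest that the model is not systematically off but is imprecise for individual links, which may point to the input parameters rather than to the propagation model itself.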

Open questions:

  • this data is created during planning; is it easily available afterwards?
  • is referring researchers to the radio planning tools sufficient?

Data Set Processing

After obtaining the data, the results also have to be “polished”: normalised and, where possible, correlated. This depends heavily on the collected data, and might involve the development of additional data processing and correlation tools.

Researchers using the data sets will have to perform similar tasks themselves, but correlation in particular might require action from our side.
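
To illustrate what normalisation and correlation could look like, the sketch below first brings link observations into a uniform shape and then joins them with node positions. The record keys (from, to, quality) and the layout of the nodes dictionary are assumptions made for this example, not an agreed format.

```python
def normalise_link(link: dict) -> dict:
    """Normalise one link observation.

    Node identifiers are trimmed and lower-cased, and the two endpoints are
    sorted so that A-B and B-A describe the same undirected link.
    """
    a, b = sorted(
        (str(link["from"]).strip().lower(), str(link["to"]).strip().lower())
    )
    return {"a": a, "b": b, "quality": float(link["quality"])}


def correlate(links, nodes):
    """Join normalised links with node positions.

    `nodes` maps a node identifier to a (lat, lon) tuple. Links whose
    endpoints are not both known are returned separately so that they
    can be inspected rather than silently dropped.
    """
    matched, unmatched = [], []
    for link in map(normalise_link, links):
        if link["a"] in nodes and link["b"] in nodes:
            matched.append(
                {**link, "pos_a": nodes[link["a"]], "pos_b": nodes[link["b"]]}
            )
        else:
            unmatched.append(link)
    return matched, unmatched
```

Returning the unmatched links explicitly is a deliberate choice here: gaps between the node database and the measured topology are themselves useful information when preparing a data set.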

Examples is a great example of possibilities of open data sets from a community network

