Describing configurations of software experiments as Linked Data
- 1IDLab, Department of Electronics and Information Systems, Ghent University – imec
- 2Enterprise Information Systems Department, University of Bonn
Abstract
Within computer science engineering, research articles often rely on software experiments in order to evaluate contributions. Reproducing such experiments involves setting up software, benchmarks, and test data. Unfortunately, many articles ambiguously refer to software by name only, leaving out crucial details for reproducibility, such as module and dependency version numbers or the configuration of individual components in different setups. To address this, we created the Object-Oriented Components ontology for the semantic description of software components and their configuration. This article discusses the ontology and its application, and demonstrates with a use case how to publish experiments and their software configurations on the Web. In order to enable semantic interlinking between configurations and modules, we published the metadata of all 500,000+ JavaScript libraries on npm as 200,000,000+ RDF triples. Through our work, research articles can refer by URL to fine-grained descriptions of experimental setups. This brings us faster to accurate reproductions of experiments, and facilitates the evaluation of new research contributions with different software configurations. In the future, software could be instantiated automatically based on these descriptions and configurations, reasoning and querying can be applied to software configurations for meta-research purposes.
Canonical HTML view of this article:
https://linkedsoftwaredependencies.org/articles/describing-experiments/Notifications and annotations
In reply to
Introduction
A large number of computer science articles describe experimental software evaluations, but many of them refer to that software only by name or version number. This information is insufficient for readers to understand which exact version of the software, which versions of its dependencies, and which detailed configuration of the software’s components has obtained the reported results. Therefore, potential users cannot always mirror the correct software installation that will behave according to the article’s conclusions. Other researchers might fail to reproduce the results because of differences in these aspects.
As Claerbout’s Principle [1] explains,
an article about computational science in a scientific publication
is not the scholarship itself, it is merely advertising of the scholarship.
The actual scholarship is the complete software development environment
and the complete set of instructions which generated the figures.
This stresses the importance of reproducibility,
and essentially mandates a detailed description
of the executed experiment,
all of the involved artefacts and actors,
and the processing of the retrieved data.
Using Linked Data [2] to publish experiment descriptions provides two immediate benefits: the experimental setup and parts thereof can be identified by IRIs, and their details can be retrieved by dereferencing those IRIs. Therefore, if research articles complement their textual explanation of an experiment with the IRI of the full setup, reproducibility is strongly facilitated. Moreover, the IRIs of the entire experiment or its parts can be reused in other articles or experiments to unambiguously refer to the same conditions.
In this article, we focus on the description of software configurations and software modules, such that an evaluated software setup can be referred to unambiguously by an IRI.
This article is structured as follows. In Section 2, we discuss related work. Section 3 introduces the semantic description of software modules and discusses a semantic description of software components and configurations. Section 4, describes a use case where we apply software descriptions to an experimental evaluation. Finally, we discuss our conclusions and future work in Section 5.
Ontology
To fully describe experiments in RDF, we need a way to link to the specific software components that were used when running the experiment. As a use case, we considered the largest ecosystem of the popular JavaScript language: the registry of the Node Package Manager (npm). We converted the entire npm registry, consisting of over 500,000 JavaScript packages, to RDF. After conversion, we published the resulting 200,000,000+ triples through multiple interfaces. We opted for the npm registry due to it being one of the largest package repositories available and JavaScript’s close ties to Web technology.
We also created an ontology to describe how software components can be configured. That way, we can not only describe all the packages used by the software, but also how the software itself is configured.
Software modules
There are several levels of granularity on which software can be described, going from a high-level package overview to a low-level description of the actual code. In descriptions, we can use several of these layers, depending on the context and the requirements. Drilling down from the top to the bottom, we have the following layers:
- a bundle is a container with metadata about the software and its functionality across different points in time. An example is the N3.js library.
- a module or version is a concrete software package and an implementation of a bundle. N3.js 0.10.0 is a module.
- a component is a specific part of a module that can be called in a certain way with a certain set of parameters. The N3.js 0.10.0 Parser is a component.
Bundles and modules are described in the npm dataset. For describing components, we will use our newly created ontology.
Node Package Manager (npm)
All npm data is stored in a CouchDB
instance with one entry per bundle.
This corresponds to the metadata, manually added by the package developer in a package.json
file,
with additional metadata automatically added by the npm publishing process.
To uniquely identify and interlink software components,
we developed a server
that converts the JSON metadata provided by the npm registry to RDF.
An important focus of the conversion process was URI re-use,
maximizing the available links between different resources.
Describing components and their configuration
The Object-Oriented Components ontology is an ontology for describing software components and their instantiation in a certain configuration. Within this ontology, we reuse Fowler’s definition of a software component [14] as a “glob” of software. The purpose of a component is to encapsulate functionality that can be reused by other components. The instantiation of a component can require certain parameters, similar to how object-oriented programming (OOP) languages allow constructors to have certain arguments. We assume OOP in the broad sense of the word, which only requires classes, objects and constructor parameters. Fig. 1 shows an overview of the ontology.
We define oo:Component
as a subclass of rdfs:Class
.
The parameters to construct a component can therefore be defined as an rdfs:Property
on a component.
This class structure enables convenient semantic descriptions of components instantiations
through the regular rdf:type
predicate.
For instance,
a software module representing a certain datasource
can be described as
ldfs:Datasource:Hdt rdf:type oo:Class.
,
and a concrete instance is
:myHdtDatasource rdf:type ldfs:Datasource:Hdt
.
To actually link components to their modules there is the oo:component
predicate,
combined with the oo:Module
class.
Several oo:Component
subclasses are defined.
An oo:Component
can be an oo:Class
, which means that it can be instantiated based on parameters.
Each component can refer to its path within a module using the oo:componentPath
predicate,
which can for instance be the package name in Java.
All instantiations of oo:Class
instances are an oo:Instance
.
An oo:Class
can also be an oo:AbstractClass
, which does not allow directly instantiating this component type.
Abstract components can be used to define a set of shared parameters in a common ancestor.
Conforming to the RDF semantics, components can have multiple ancestors, and are indicated using the rdfs:subClassOf
predicate.
The parameters that are used to instantiate an oo:Class
to an oo:Instance
are of type oo:Parameter
.
An oo:Parameter
is a subclass of rdfs:Property
, which simplifies its usage as an RDF property.
oo:defaultValue
allows parameters to have a default value when no other values have been provided.
The oo:uniqueValue
predicate is a flag that can be set to indicate whether the parameter can only have a single value.
Use case: describing a Linked Data Fragments experiment
In this section, we provide a semantic description of the experiment performed in a previous research article, as a guiding example on how to create such descriptions for other evaluations. The intention is that future research articles directly describe their experimental setup this way, either through HTML with embedded RDFa or as a reference to an IRI of an external RDF document, using, for example, the ontologies described in Subsection 2.2.
This experiment we will describe originates from an article on query interfaces [15] and involves specific software configurations of a Linked Data Fragments (LDF) client and server. We have semantically described the LDF server module and its 32 components. Instead of the former domain-specific JSON configuration file, the semantic configuration is Linked Data. Furthermore, we provide an automatically generated semantic description of all concrete installed dependency versions for both the LDF client and server. This is necessary because, modules indicate a compatibility range instead of a concrete version.
The LDF experiment can be described using the following workflow (RDFa-annotated):
- Create 1 virtual machine for the server.
- Create 1 virtual machine for a cache.
- Create 60 virtual machines for clients.
- Copy a generated Berlin SPARQL benchmark \[16\] dataset to the server.
-
Install the server software configuration, implementing the TPF specification, with its dependencies on the server.
-
Install the client software configuration, implementing the SPARQL 1.1 protocol, with its dependencies on each client.
- Execute four processes of the Berlin SPARQL benchmark \[16\] with the client software for each client machine.
- Record CPU time, RAM usage of each client, the CPU time and RAM usage of the server, and measure the ingoing and outgoing bandwidth of the cache.
-
Publish the results online.
An executed workflow corresponding to the abstract experiment workflow above generates entities based on each activity as performed by various agents. The resulting observations of the experiment are among other valuable immutable provenance level data, which plays a vital role in verifying and reproducing the steps that led to the outcome. Concretely, the conclusions in the article have the resulting data as provenance, which in turn was generated by applying the steps above.
Crucially, in the description above, we refer to the exact software configurations by their IRI, their specific dependency versions, and the specifications they implement. These serve as further documentation of the provenance. The Object-Oriented Components ontology captures the low-level wiring between components, enabling researchers to swap individual algorithms or component settings.
For example, based on the above description, the exact same experiment can be performed with different client-side algorithms [17] or different server-side interfaces [18]. A common practice to achieve this currently, as done in the aforementioned works [17][18], is to implement modifications in separate code repository branches. Unfortunately, such branches typically diverge from the main code tree, and hence cannot easily be evaluated afterwards with later versions of other components. By implementing them as independent modules instead, they can be plugged in automatically by minimally altering the declarative experiment description. This simultaneously records their provenance, and ensures their sustainability.
Conclusion
The core idea of the scientific process is standing on the shoulders of giants. This not only means we should build upon the work of others, but also that we should enable others to build upon our work. Reproducibility is an essential aspect of this process. This concept obviously applies to Web research as well—moreover, the Web is an ideal platform to improve the scientific process as a whole.
In this article, we introduced an ontology for semantically describing software components and their configuration. Publishing this information alongside experimental results is beneficial for the reproduction of experiments, and completes the provenance of experimental results.
Through this work,
vague textual references to software configurations
in works of research
can be replaced with unambiguous URLs.
These URLs point to Linked Data
that exactly captures a single software configuration,
which in turn also uses URLs
to identify existing software modules.
We already made such module URLs and corresponding descriptions
available for all npm modules,
which we interconnected using RDF triples.
The same approach can be applied
to other software ecosystems,
such as for instance the Maven Central repository
of the Java language,
the RubyGems repository for the Ruby language,
the Comprehensive Perl Archive Network (CPAN) repository for the Perl language,
or the Python Package Index (PyPI) for Python.
The usage of Linked Data to describe the software and experiments
has the added advantage of linking different descriptions together,
which allows the answering of such questions as
Which experiments made use of this benchmark?
Semantic descriptions can serve many more roles in addition to identifying and describing software configurations. Since the descriptions are exact and machine-readable, an automated dependency injection [14] framework could automatically instantiate the software based on a configuration URI and wire its dependent components together. Swapping components in and out, and trying different configurations, comes down to simply editing the configuration, which can be shared again as Linked Data. This leverages the power of the Web to simplify the reproduction of existing experiments and the creation of new ones.