Reproducible software experiments through semantic configurations

  • IDLab, Department of Electronics and Information Systems, Ghent University – imec
  • Enterprise Information Systems Department, University of Bonn

Abstract

The scientific process requires reproducible experiments and findings to foster trust and accountability. Within computer science engineering, reproducing experiments involves setting up the exact same software with the same benchmarks and test data, which often requires non-trivial manual work. Unfortunately, many research articles ambiguously refer to software by name only, leaving out crucial details such as module and dependency version numbers or the configuration of the individual components. To this end, we created vocabularies for the semantic description of software components and their configuration, which can be published as Linked Data alongside experimental results. We implemented a dependency injection framework to accurately instantiate these described experimental configurations. This article discusses the approach and its application, and explains with a use case how to publish experiments and their software configurations on the Web. In order to enable semantic interlinking between configurations and modules, we published the metadata of all 480,000+ JavaScript libraries on npm as 174,000,000+ RDF triples. Through our work, research articles can refer by URL to fine-grained, instantiatable descriptions of experimental setups, completing the provenance chain from specifications to implementations, dependencies, and configurations all the way to experimental results. This ultimately brings faster and more accurate reproductions of experiments, and facilitates the evaluation of new research contributions. Moreover, this work can serve other use cases, such as general software instantiation outside of experiments, and reasoning or querying over software configuration metadata.

Introduction

A large number of computer science articles describe experimental software evaluations, but many of them refer to that software only by name or version number. This information is insufficient for readers to understand which exact version of the software, which versions of its dependencies, and which detailed configuration of the software’s components produced the reported results. Therefore, potential users do not necessarily obtain the correct software installation that will behave according to the article’s conclusions. Moreover, other researchers might fail to reproduce the same results because of differences in any of these aspects.

As Claerbout’s Principle [1] explains, an article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures. This stresses the importance of reproducibility, and essentially mandates a detailed description of the executed experiment, all of the involved artefacts and actors, and the processing of the retrieved data.

Using Linked Data [2] to publish such descriptions provides two immediate benefits: the experimental setup and parts thereof can be identified by IRIs, and their details can be retrieved by dereferencing those IRIs. Therefore, if research articles complement their textual explanation of an experiment with the IRI of the full setup, reproducibility is strongly facilitated. Moreover, the IRIs of the entire experiment or its parts can be reused in other articles or experiments to unambiguously refer to the same conditions. Fig. 1 illustrates how this leads to a chain of provenance from the research article to the data and the experiment that generates it, as well as all aspects surrounding that experiment.

[description diagram]

Fig. 1: A research article is based on result data, which are the outcomes of an experiment. The experiment in turn also has (multiple) provenance chains, and this article focuses on software configurations and software modules.

In this article, we focus on the description of software configurations and software modules, such that an evaluated software setup can be referred to unambiguously by an IRI. We further facilitate the reproduction of experiments through a mechanism that automatically instantiates the software configuration based on its Linked Data description. Our contributions are the following:

  • the RDF-based description of software modules, applied to the 480,000+ bundles of npm (Node.js);
  • the RDF-based description of available components within software modules;
  • the RDF-based description of a precise configuration of software modules;
  • the automated instantiation of such a configuration;
  • a use case explaining the usage of the resulting Linked Data in scientific articles.

This article is structured as follows. In Section 2, we discuss related work. Section 3 introduces the semantic description of software modules. Next, Section 4 discusses the semantic description of software components and configurations, followed by the introduction of a dependency injection framework that can instantiate these in Section 5. Section 6 describes a use case where we apply software descriptions to an experimental evaluation. Finally, we discuss our conclusions and future work in Section 7.

Describing software modules

There are several levels of granularity at which software can be described, ranging from a high-level package overview to a low-level description of the actual code. In descriptions, we can use several of these layers, depending on the context and the requirements. Drilling down from the top to the bottom, we have the following layers:

  • a bundle is a container with metadata about the software and its functionality across different points in time. An example is the N3.js library.
  • a module or version is a concrete software package and an implementation of a bundle. N3.js 0.10.0 is a module.
  • a component is a specific part of a module that can be called in a certain way with a certain set of parameters. The N3.js 0.10.0 Parser is a component.

Within this section, we will focus on bundles and modules, while components are described more in-depth in Section 4.

Node Package Manager (npm)

An example of a large collection of bundles and modules is the npm registry. It contains over 480,000 JavaScript libraries, each with its own features and requirements. Using our terminology, an npm package is a bundle, while a specific version of such a package is a module. The bundle contains the description of the project together with all its versions, while a module contains the specific dependencies and a link to the actual implementation.

All this npm data is stored in a CouchDB instance with one entry per bundle. Each entry corresponds to the metadata added by the package developer in a package.json file, complemented with metadata that is automatically added by the npm publishing process. To be able to uniquely identify software components and, more importantly, interlink them, we converted the JSON metadata provided by the npm registry to RDF, for which we set up a server.

Interpreting package.json using JSON-LD

Since the input data is JSON, we opted to convert it to JSON-LD [18], an RDF syntax specifically designed for adding semantics to JSON. JSON-LD achieves this by adding a so-called context to the JSON data, which describes how the JSON tags should be interpreted. For example, having "name":"foaf:name" in your context implies that all name tags should be interpreted as the predicate foaf:name. Other JSON-LD keywords can be used to identify whether certain values are IRIs, or whether an entity has a specific type. For the data where we could not achieve the desired output using just the JSON-LD context, such as concatenating values to create an IRI, we modified some of the input JSON before exporting it to JSON-LD.
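As an illustration of this mechanism, the following sketch attaches a small, hypothetical context fragment to a piece of package.json-style metadata and converts it to triples with the jsonld.js library. The actual context published at https://linkedsoftwaredependencies.org/contexts/npm.jsonld is considerably more extensive, and the exact tag-to-predicate mappings shown here are assumptions based on Listing 1.

const jsonld = require('jsonld');

// Hypothetical context fragment: maps package.json tags to RDF predicates.
const doc = {
  '@context': {
    doap: 'http://usefulinc.com/ns/doap#',
    dcterms: 'http://purl.org/dc/terms/',
    name: 'doap:name',                                    // package name
    description: 'dcterms:abstract',                      // textual description
    keywords: 'dcterms:subject',                          // keyword list
    homepage: { '@id': 'doap:homepage', '@type': '@id' }  // value is an IRI
  },
  '@id': 'https://linkedsoftwaredependencies.org/bundles/npm/n3',
  name: 'n3',
  homepage: 'https://github.com/RubenVerborgh/N3.js#readme'
};

// Convert the interpreted JSON-LD to N-Quads (promise-based API of recent jsonld.js versions).
jsonld.toRDF(doc, { format: 'application/n-quads' })
  .then(nquads => console.log(nquads));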

Bundles

A bundle represents the general npm package. An example of a JSON representation of an npm bundle can be found at https://registry.npmjs.org/n3/. It contains the general descriptions that apply to all modules (versions) of this bundle, such as the name, homepage and description.

To adapt this JSON to RDF, we start by adding our context, https://linkedsoftwaredependencies.org/contexts/npm.jsonld, which already maps many of the npm tags to corresponding RDF predicates. This allows these tags to remain the same in the JSON-LD representation. The limitations of context mapping necessitated some other changes, the most important one relating to the specific versions of the bundle. This can be seen by retrieving https://linkedsoftwaredependencies.org/bundles/npm/n3 with an Accept: application/ld+json header. In this case, the bundle contains links to its corresponding modules, providing semantic connections between them. Additionally, some tags were added to provide identifiers and link to the original repository.

Since JSON-LD is an RDF representation, it can easily be converted to other syntaxes, of which several are supported by our server, such as Turtle and N-Triples. These can be retrieved by sending the corresponding Accept headers. An example of some of the data generated this way can be seen in Listing 1.

npm:n3 a doap:Project;
  dcterms:abstract "Lightning fast, asynchronous, streaming...";
  dcterms:subject "turtle", "rdf", "n3", "streaming", "asynchronous";
  spdx:licenseDeclared <https://spdx.org/licenses/MIT.html>;
  doap:release 
    <http://linkedsoftwaredependencies.org/bundles/npm/n3/0.10.0>;
  doap:bug-database <https://github.com/RubenVerborgh/N3.js/issues>;
  doap:homepage <https://github.com/RubenVerborgh/N3.js#readme>;
  doap:name "n3";
  owl:sameAs <https://www.npmjs.com/package/n3>;
  foaf:maker users:rubenverborgh.
users:rubenverborgh foaf:name "Ruben Verborgh".

Listing 1: This listing shows a partial representation of https://linkedsoftwaredependencies.org/bundles/npm/n3 in the Turtle syntax. Prefixes omitted for brevity.
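As a small illustration of the content negotiation mentioned above, the following sketch requests the same bundle IRI as Turtle instead of JSON-LD; node-fetch is used here merely as an example HTTP client.

const fetch = require('node-fetch');

// Ask the server for a Turtle representation of the n3 bundle.
fetch('https://linkedsoftwaredependencies.org/bundles/npm/n3', {
  headers: { accept: 'text/turtle' }
})
  .then(response => response.text())
  .then(turtle => console.log(turtle));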

Modules

A module is a specific version of a package. Continuing with the examples shown above, the JSON metadata of version 0.10.0 of the N3 bundle can be found at https://registry.npmjs.org/n3/0.10.0, while the IRI in our namespace is https://linkedsoftwaredependencies.org/bundles/npm/n3/0.10.0. Similarly, many of the tags are mapped by the context, while other tags had to be modified to provide more relevant triples. An example of some of the data generated for this module can be seen in Listing 2.

<https://linkedsoftwaredependencies.org/bundles/npm/n3/0.10.0> 
  doap:revision "0.10.0";
  foaf:maker users:rubenverborgh;
  npm:devDependency 
    <https://linkedsoftwaredependencies.org/bundles/npm/async/%5E2.0.1>;
  npm:nodeVersion 
    <https://linkedsoftwaredependencies.org/engines/node/6.7.0>.

Listing 2: This listing shows a partial representation of https://linkedsoftwaredependencies.org/bundles/npm/n3/0.10.0 in the Turtle syntax. Prefixes omitted for brevity.

An important part of an npm package description is its dependencies and their semantic versions. For example, N3 0.10.0 has a dependency on async ^2.0.1. Here, ^2.0.1 is a semantic version range that matches any version of async from 2.0.1 up to, but not including, 3.0.0. As can be seen in the JSON-LD, this async dependency is converted to https://linkedsoftwaredependencies.org/bundles/npm/async/%5E2.0.1, with %5E being the URL-encoded character ^. If accessed, the server detects the highest version number satisfying the range and redirects to that module. Additionally, the body of the redirect contains the relevant metadata describing this, which in this case results in the following triple (prefixed for clarity):

async:%5E2.0.1 npm:maxSatisfying async:2.4.0.
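This resolution behaviour mirrors what npm itself does with the semver library; the following sketch (an illustration, not part of our system) shows how the highest version satisfying the ^2.0.1 range can be computed. The list of published versions is illustrative.

const semver = require('semver');

// Given the versions published for async, find the highest one in the range.
const published = ['1.5.2', '2.0.1', '2.1.5', '2.4.0'];
console.log(semver.maxSatisfying(published, '^2.0.1')); // '2.4.0'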

Additionally, to properly describe which modules are being used on a machine, we created a tool that outputs, in RDF, the actual dependencies used by a specific package installation. This way, the exact installation that was used can be described, without having to rely on the interpretation of semantic version ranges, which can change over time.
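A minimal sketch of such a dependency dump is shown below: it walks the locally installed node_modules of a package and emits one triple per installed module, using the bundle and module IRI patterns shown earlier. The predicate IRI is a placeholder, and the actual tool handles more cases (scoped and nested packages, among others).

const fs = require('fs');
const path = require('path');

// Emit one triple per directly installed dependency of the package at `root`.
function installedDependencyTriples(root) {
  const rootPkg = JSON.parse(fs.readFileSync(path.join(root, 'package.json'), 'utf8'));
  const subject =
    `<https://linkedsoftwaredependencies.org/bundles/npm/${rootPkg.name}/${rootPkg.version}>`;
  const triples = [];
  const modulesDir = path.join(root, 'node_modules');
  for (const name of fs.readdirSync(modulesDir)) {
    const pkgFile = path.join(modulesDir, name, 'package.json');
    if (!fs.existsSync(pkgFile)) continue; // skip .bin and other non-packages
    const { version } = JSON.parse(fs.readFileSync(pkgFile, 'utf8'));
    triples.push(`${subject} ` +
      `<https://example.org/vocabulary/npm#installedDependency> ` + // placeholder predicate
      `<https://linkedsoftwaredependencies.org/bundles/npm/${name}/${version}>.`);
  }
  return triples;
}

console.log(installedDependencyTriples(process.cwd()).join('\n'));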

Publication

The 480,000+ npm packages correspond to more than 174,000,000 triples when the information from all packages is combined. In addition to the subject pages for each bundle, module and user, we also publish all of this data through a Triple Pattern Fragments [19] interface and as HDT [20] and Turtle [21] dumps. This data is republished daily to stay up-to-date with the available information on npm. Every day, we collect all triples that are generated by our system in a Turtle file. After that, we convert this Turtle file to an HDT file. Finally, this HDT file is loaded into a TPF server instance, which allows us to publish this data through a low-cost interface that still enables querying. Using TPF, custom SPARQL queries can be evaluated over this dataset, such as retrieving all dependencies of a bundle or finding the author of a bundle.
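For example, the following sketch evaluates such a query over the TPF interface with the ldf-client library, looking up the author of the n3 bundle via the predicates shown in Listing 1. The fragments entry-point URL is a placeholder, and the snippet assumes the library’s documented client-side query usage.

var ldf = require('ldf-client');

// Placeholder entry point of the Triple Pattern Fragments interface.
var fragmentsClient = new ldf.FragmentsClient(
  'https://linkedsoftwaredependencies.org/fragments');

var query = [
  'PREFIX foaf: <http://xmlns.com/foaf/0.1/>',
  'SELECT ?name WHERE {',
  '  <https://linkedsoftwaredependencies.org/bundles/npm/n3> foaf:maker ?maker.',
  '  ?maker foaf:name ?name.',
  '}'
].join('\n');

// The iterator streams query solutions as they are computed over the fragments.
var results = new ldf.SparqlIterator(query, { fragmentsClient: fragmentsClient });
results.on('data', function (binding) { console.log(binding); });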

Describing components and their configuration

In this section, we introduce the Object-Oriented Components ontology for describing software components and their instantiation in a certain configuration, and provide an example of its application to JavaScript. Within this ontology, we reuse Fowler’s definition of a software component [15] as a “glob” of software. The purpose of a component is to provide operations that can be used by other components. The instantiation of a component can require certain parameters, just like object-oriented programming (OOP) languages allow constructors to have certain arguments. In this section, we assume OOP in the broad sense of the word, which only requires classes, objects and constructor parameters. Fig. 2 shows an overview of the ontology.

[Object-Oriented Components ontology diagram]

Fig. 2: Classes and properties in the Object-Oriented Components ontology, using the prefix oo.

Following Section 3, we have defined a module as a collection of components. Within OOP languages, this can correspond to, for example, a software library or an application, which can contain a multitude of components.

We define oo:Component as a subclass of rdfs:Class. The parameters to construct a component can therefore be defined as an rdfs:Property on a component. This class structure enables convenient semantic descriptions of component instantiations through the regular rdf:type predicate. For instance, a component representing an HDT datasource can be described as ldfs:Datasource:Hdt a oo:Class, and a concrete instance as :myHdtDatasource a ldfs:Datasource:Hdt.

Several oo:Component subclasses are defined. An oo:Component can be an oo:Class, which means that it can be instantiated based on parameters. Each component can refer to its path within a module using the oo:componentPath predicate, which can for instance be the package name in Java. All instantiations of an oo:Class are of type oo:Instance. An oo:Class can also be an oo:AbstractClass, which does not allow this component type to be instantiated directly. Abstract components can be used to define a set of shared parameters in a common ancestor. Conforming to the RDF semantics, components can have multiple ancestors, which are indicated using the rdfs:subClassOf predicate.

The parameters that are used to instantiate an oo:Class to an oo:Instance are of type oo:Parameter. An oo:Parameter is a subclass of rdfs:Property, which simplifies its usage as an RDF property. oo:defaultValue allows parameters to have a default value when no other value has been provided: upon instantiation (Section 5), a closed world is assumed. The oo:uniqueValue predicate is a flag that can be set to indicate that the parameter can take at most one value.

The resulting description can be included in the module (for instance, as a JSON-LD file), or can be created and referred to externally. Afterwards, it can be reused by multiple dependents.

Listing 3 shows a simplified example of the Linked Data Fragments (LDF) server npm module using the components ontology. It exposes several components, such as an HDT datasource, a SPARQL datasource and a TPF server, each of which can take multiple parameters. These are provided with a unique identifier and definition, such that the software configuration can receive a semantic interpretation. For example, Listing 4 illustrates how instances of these component types can be declared.

<https://linkedsoftwaredependencies.org/bundles/npm/ldf-server/2.2.2>
  a oo:Module;
  oo:component ldfs:Server:Tpf, ldfs:Datasource:Hdt, ldfs:Datasource:Sparql.
ldfs:Server:Tpf a oo:Class;
  oo:parameter ldfs:datasource, ldfs:port.
ldfs:Datasource a oo:AbstractClass;
  oo:parameter ldfs:Datasource:title.
ldfs:Datasource:Hdt a oo:Class;
  rdfs:subClassOf ldfs:Datasource;
  oo:parameter ldfs:Datasource:Hdt:file.
ldfs:Datasource:Sparql a oo:Class;
  rdfs:subClassOf ldfs:Datasource;
  oo:parameter ldfs:Datasource:Sparql:endpoint.

ldfs:datasource                 a oo:Parameter; rdfs:range ldfs:Datasource.
ldfs:port                       a oo:Parameter; rdfs:range xsd:integer.
ldfs:Datasource:title           a oo:Parameter; rdfs:range xsd:string.
ldfs:Datasource:Hdt:file        a oo:Parameter; rdfs:range ldfs:HdtFile.
ldfs:Datasource:Sparql:endpoint a oo:Parameter; rdfs:range ldfs:SparqlEndpoint.

Listing 3: The LDF server module contains, among others, an HDT and a SPARQL-based datasource component, which both extend from the abstract datasource component. The HDT and SPARQL datasources are classes, which both inherit the title parameter from the abstract datasource. The HDT datasource takes an HDT file as parameter. The SPARQL datasource takes a SPARQL endpoint IRI as parameter.

ex:myServer a ldfs:Server:Tpf;
  ldfs:datasource ex:myHdtDatasource, ex:mySparqlDatasource.
ex:myHdtDatasource a ldfs:Datasource:Hdt;
  ldfs:Datasource:title "A DBpedia 2016 datasource";
  ldfs:Datasource:Hdt:file <http://example.org/dbpedia-2016.hdt>.
ex:mySparqlDatasource a ldfs:Datasource:Sparql;
  ldfs:Datasource:title "A SPARQL-based DBpedia 2016 datasource";
  ldfs:Datasource:Sparql:endpoint <http://example.org/sparql/dbpedia-2016>.

Listing 4: ex:myServer is a TPF server that will be loaded with an HDT and a SPARQL-based datasource.

Instantiating component configurations

In the previous section, we introduced a vocabulary for describing software components and their instantiation. In this section, we introduce a dependency injection framework based on these component descriptions. With this, we take semantic software component descriptions a step further: we not only describe components, but also allow them to be instantiated automatically.

Components.js dependency injection framework

We have implemented Components.js, an open-source dependency injection framework for JavaScript, and made it available on npm. It is able to construct component instances based on declarative component instantiations described in RDF using the vocabulary introduced in Section 4. It accepts raw triple streams or URLs of RDF documents containing these declarations. At the time of writing, the parser accepts RDF documents serialized as JSON-LD, Turtle, TriG, N-Triples or N-Quads.

Listing 5 illustrates how components can be instantiated using Components.js. It provides a Loader class that acts as an assembler. This Loader provides constructor injection: it dynamically calls the constructor of the component and passes the configured parameters in a single object argument. Additionally, simplified mechanisms are in place for developers who want to use the dependency injector directly without having to semantically describe the component.

const Loader = require('lsd-components').Loader;
const loader = new Loader();
loader.registerConfig('http://example.org/my-ldf-server');
let myServer = loader.instantiate('http://example.org/config-ldf#myServer');

Listing 5: First, a new component loader is created, after which the component definitions are registered. Finally, a declaratively described component instantiation is requested by providing the component instance IRI.

While Linked Data is based on the open-world assumption, our dependency injector closes the world when we enter the realm of OOP. This is because a closed-world assumption is required for features such as default arguments: we have to assume that the arguments available to the loader are all there is.

Defining object mappings

The constructor injection described above works out of the box with single-argument constructors that accept a map, as is quite common in JavaScript. Components.js then creates a map whose keys are the property IRIs and whose values are the corresponding objects of all triples that have the instance as subject. This map is then passed to the constructor, which reads its settings from it. Depending on a flag, the keys and values are either full IRIs or abbreviated JSON-LD strings.
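As a sketch of what such a component might look like (all names here are purely illustrative), the constructor simply reads its settings from that single map:

// A component written for the default constructor injection: one argument
// that maps parameter IRIs (or abbreviated JSON-LD strings) to values.
class ExampleComponent {
  constructor(args) {
    this.title = args['http://example.org/vocabulary#title'];
    this.port  = args['http://example.org/vocabulary#port'];
  }
}
module.exports = ExampleComponent;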

New libraries that use Components.js can be designed for such single-parameter constructors. For all other types of constructors, a mapping mechanism is needed between the RDF properties and the concrete parameter order of the constructor. To this end, we introduce the Object Mapping ontology. Fig. 3 shows an overview of all its classes and predicates.

[Object Mapping ontology diagram]

Fig. 3: Overview of the classes and properties in the Object Mapping ontology, using the prefix om.

The ontology introduces the object mapping and the array mapping. An object map can have several object mapping entries, where each entry has a field name and a field value. An array map can have several array mapping entries, where each entry only has a value. Together, they can express all ways in which the flat object from the RDF description maps to an ordered list of simple or complex constructor parameters.

Listing 6 shows the mapping of the LDF component parameters to the constructor implementation. This description complements the component definitions from Listing 3 as it provides an implementation view on the component constructors. Like the component definitions, a mapping is only necessary once per module and can be reused across dependents.

ldfs:Server:Tpf oo:constructorArguments ([ om:field
    [ om:fieldName "datasources"; om:fieldValue
      [ om:fieldName ldfs:Datasource:title; om:fieldValue rdf:object ] ],
    [ om:fieldName "port"; om:fieldValue ldfs:port ]
]).
ldfs:Datasource:Hdt oo:constructorArguments ([ om:field
    [ om:fieldName "title"; om:fieldValue ldfs:Datasource:title ],
    [ om:fieldName "file";  om:fieldValue ldfs:Datasource:Hdt:file ]
]).
ldfs:Datasource:Sparql oo:constructorArguments ([ om:field
    [ om:fieldName "title";    om:fieldValue ldfs:Datasource:title ],
    [ om:fieldName "endpoint"; om:fieldValue ldfs:Datasource:Sparql:endpoint ]
]).

Listing 6: The HDT and SPARQL-based datasource constructors both take a custom object as argument for the constructor. The entries of this object are mapped from the parameter values using this mapping. The TPF server constructor similarly requires a custom object, where the datasources entry points to an object that is a mapping from titles to datasources.
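For clarity, the following sketch shows a constructor shape that would match the mapping of Listing 6 for the TPF server component. The class itself is hypothetical and only illustrates how the mapped fields could arrive in the constructor.

// One options object: "datasources" maps datasource titles to instances,
// "port" holds the configured port number.
class TpfServer {
  constructor(options) {
    this.datasources = options.datasources; // e.g. { 'A DBpedia 2016 datasource': hdtDatasource }
    this.port = options.port;               // e.g. 3000
  }
}
module.exports = TpfServer;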

Use case: describing a Linked Data Fragments experiment

In this section, we provide a semantic description of the experiment performed in a previous research article, as a guiding example of how to create such descriptions for other evaluations. The intention is that future research articles directly describe their experimental setup this way, either through HTML with embedded RDFa or by referring to the IRI of an RDF document.

The experiment we describe originates from an ISWC2014 article [19] and involves specific software configurations of a Linked Data Fragments (LDF) client and server. We have semantically described the LDF server module and its 32 components. Instead of the former domain-specific JSON configuration file, the semantic configuration is Linked Data and can be instantiated automatically by Components.js. Furthermore, we provide an automatically generated semantic description of all concretely installed dependency versions for both the LDF client and server. This is necessary because, as discussed in Subsection 3.4, modules indicate a compatibility range instead of a concrete version.

The ISWC2014 LDF experiment can be described using the following workflow:

  1. Create 1 virtual machine for the server.
  2. Create 1 virtual machine for a cache.
  3. Create 60 virtual machines for clients.
  4. Copy a generated Berlin SPARQL benchmark [22] dataset to the server.
  5. Install the server software configuration, implementing the TPF specification, with its dependencies on the server.
  6. Install the client software configuration, implementing the SPARQL 1.1 protocol, with its dependencies on each client.
  7. Execute four processes of the Berlin SPARQL benchmark [22] with the client software for each client machine.
  8. Record CPU time, RAM usage of each client, the CPU time and RAM usage of the server, and measure the ingoing and outgoing bandwidth of the cache.
  9. Publish results online.

An executed workflow corresponding to the abstract experiment workflow above generates entities based on each activity, as performed by various agents. Essentially, the resulting observations of the experiment are, among other artefacts, valuable immutable provenance-level data, which plays a vital role in verifying and reproducing the steps that led to the outcome. Concretely, the conclusions in the article have the resulting data as provenance, which in turn was generated by applying the steps above.

Crucially, in the description above, we refer to the exact software configurations by their IRI, their specific dependency versions, and the specifications they implement. These serve as further documentation of the provenance. Additionally, based on these IRIs, other researchers can immediately instantiate the same configuration, or derive their own similar configurations to create comparative experiments. While software container solutions (such as Docker) could also provide immediate instantiation, their configuration is on a much higher level. Instead, the Object-Oriented Components ontology captures the low-level wiring between components, enabling researchers to swap individual algorithms or component settings.

For example, based on the above description, the exact same experiment can be performed with different client-side algorithms [23] or different server-side interfaces [24]. A common practice to achieve this currently, as done in the aforementioned works [23][24], is to implement modifications in separate code repository branches. Unfortunately, such branches typically diverge from the main code tree, and hence cannot easily be evaluated afterwards with later versions of other components. By implementing them as independent modules instead, they can be plugged in automatically by minimally altering the declarative experiment description. This simultaneously records their provenance, enables their automated instantiation, and ensures their sustainability.

Conclusion

The core idea of the scientific process is to stand on the shoulders of giants. This means building further upon previous work to derive new work, but also enabling others to build upon our work. Reproducibility, for instance, is an essential aspect of this process. Not only does this concept apply to Web research, but the Web also makes an ideal platform for improving the scientific process as a whole.

In this article, we introduce vocabularies for semantically describing software components and their configuration. Publishing this information alongside experimental results is beneficial for the reproduction of experiments. Furthermore, we introduce Components.js, a dependency injection framework that can understand such configurations, and instantiate software in the exact same way as originally described.

In future work, we aim to make the creation of semantic component files more developer-friendly. A tool could automatically parse source code and derive the appropriate semantic description of how components can be instantiated and with which parameters. Additionally, these semantic component definition files provide an interesting platform for validating software dependency relations. Reasoning could for instance be done on parameter restrictions to check whether or not different bundle versions will break certain component invocations. Furthermore, the semantic description of software metadata provides a useful platform for simplifying tasks that require a lot of manual work, such as discovering license incompatibilities between projects, which is now possible using a SPARQL query. It even allows us to formulate SPARQL queries corresponding to some of the questions that the Web was intended to answer [3], such as Where is this module used? and Who wrote this code?

Through this work, we make it easier to build sustainable research platforms, which helps pave the stairs to the shoulders of giants. The Linked Data Fragments server, for instance, is a reusable research platform. The LDF server and client can be compatible with multiple APIs, support multiple algorithms, etc. Through this work, only one “core version” is necessary, and many different configurations can co-exist, where support for different APIs and algorithms is provided by pluggable components that are referred to within a configuration. Since components and configurations are identified by an IRI, they can exist anywhere on the Web. Based on an IRI, the injection framework can therefore instantiate software, and wire its dependent components together automatically. We thereby leverage the power of the Web to simplify the reproduction of existing experiments and the creation of new ones.