To address research questions, researchers design and execute experiments that yield data, which are then processed, analysed and interpreted. Through these activities, researchers generate results that address the research question and communicate their findings as scientific knowledge in scholarly literature.

In experiments, researchers typically perform a sequence of activities that transform primary data (e.g., observational, experimental or computational data) into secondary, tertiary or even higher-order data. For example, primary sensor data may be processed into secondary data; these may be analysed and interpreted to derive tertiary data about instances of a population; and these may in turn be processed to derive descriptive statistics about the population, i.e., quaternary data that are communicated in scholarly literature. In the context of the present study, such data are information, i.e., well-formed, meaningful and truthful data (Floridi, 2011). This project aims to acquire and curate such information.
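The progression from primary to quaternary data can be illustrated with a minimal sketch. The sensor values, the names (`plant_a`, `to_secondary`, and so on) and the choice of mean and standard deviation as descriptive statistics are assumptions made purely for illustration; they are not taken from the project's use cases.

```python
from statistics import mean, stdev

# Primary data: raw sensor readings per individual (hypothetical values).
# None marks a failed reading.
primary = {
    "plant_a": [10.1, 10.3, None, 10.2],
    "plant_b": [12.0, 11.8, 12.2, None],
    "plant_c": [9.9, 10.0, 10.1, 10.0],
}


def to_secondary(readings):
    """Process primary data: drop failed readings."""
    return [r for r in readings if r is not None]


def to_tertiary(cleaned):
    """Analyse secondary data: derive one value per instance of the population."""
    return mean(cleaned)


def to_quaternary(per_instance):
    """Summarise tertiary data: descriptive statistics about the population."""
    values = list(per_instance.values())
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}


secondary = {name: to_secondary(r) for name, r in primary.items()}
tertiary = {name: to_tertiary(r) for name, r in secondary.items()}
quaternary = to_quaternary(tertiary)

print(quaternary)
```

In this sketch, only `quaternary` would typically appear in an article, while the lower-order data that produced it stay behind in the analysis environment.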

The typical scientific article is filled with information, i.e., data (such as the characters forming a sentence or the pixels of an image representing a diagram) that, in the context of the article, are well-formed, meaningful and truthful. The article integrates this information into a "body of knowledge" consisting of what is already known and relevant. This project aims to acquire and curate (some of) that body of knowledge and to represent it in machine-interpretable form. The project also aims to develop infrastructure that supports the publishing of, access to and further processing of scientific knowledge communicated in scholarly literature.

We are exploring numerous pathways for information and knowledge acquisition. The most advanced at present is a prospective pathway. In prospective pathways, semantic scientific knowledge is available before the article is written. They contrast with retrospective pathways, in which semantic scientific knowledge is acquired after the article is written.

As documented by the use cases in earth and life sciences, the aim of this pathway is to advance data analysis infrastructure so that derived data are machine-interpretable. We suggest that higher-order data (i.e., information) communicated in scholarly literature will be semantic as a result of performing data analysis on infrastructure engineered for this purpose.
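As a rough sketch of what machine-interpretable derived data could look like, the analysis step below returns a structured description of its result rather than a bare number. The `ex:` vocabulary terms, the function name `describe_mean` and the JSON-LD-like shape are all hypothetical; the project's actual infrastructure and vocabularies may differ.

```python
import json
from statistics import mean


def describe_mean(values, variable, unit):
    """Compute a mean and return it together with a machine-readable description."""
    return {
        "@type": "ex:MeanEstimate",  # hypothetical vocabulary term
        "ex:variable": variable,     # what was measured
        "ex:unit": unit,             # unit of measurement
        "ex:sampleSize": len(values),
        "ex:value": mean(values),
    }


result = describe_mean([2.0, 4.0, 6.0], "ex:leafLength", "cm")

# The description can be serialised and passed on to downstream systems.
print(json.dumps(result, indent=2))
```

Because the result is described at the moment it is computed, no later extraction from the article text is needed, which is the point of the prospective pathway.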

Another prospective pathway is the acquisition of semantic scientific knowledge through annotation of text and other content (e.g., figures) at the time of writing the article. We plan to develop this pathway for the current use cases.

Among retrospective pathways, we are developing form-based, text-mining and crowd-sourcing approaches to acquire semantic scientific knowledge communicated in scholarly literature. The most advanced at present are form-based approaches, which we aim to integrate into article submission workflows. Text-mining and crowd-sourcing approaches support and complement the various pathways and are particularly relevant for already published literature.

To support these aims, we build on existing technologies and develop new infrastructure. An important infrastructure component is the system for the curation and processing of semantic scientific knowledge. Here, we are currently working on a Neo4j-based graph database that supports the curation of RDF-based use cases but extends the RDF data model to enable important features such as statement versioning.
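To make the idea of statement versioning concrete, here is a minimal in-memory sketch. It is not the project's data model and does not use Neo4j; the classes `StatementVersion` and `StatementStore`, the identifier scheme and the example terms are assumptions for illustration only. The underlying point is that a plain RDF triple has no identifier of its own, so tracking revisions requires giving each statement an identity that its versions can be attached to.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass(frozen=True)
class StatementVersion:
    """One immutable version of a subject-predicate-object statement."""
    subject: str
    predicate: str
    obj: str
    version: int


@dataclass
class StatementStore:
    """Toy store: each statement has an identifier and a version history."""
    _history: dict = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))

    def add(self, subject, predicate, obj):
        """Create a new statement and return its identifier."""
        sid = f"S{next(self._ids)}"
        self._history[sid] = [StatementVersion(subject, predicate, obj, 1)]
        return sid

    def revise(self, sid, obj):
        """Append a new version with a changed object; older versions are kept."""
        latest = self._history[sid][-1]
        revised = StatementVersion(
            latest.subject, latest.predicate, obj, latest.version + 1
        )
        self._history[sid].append(revised)
        return revised

    def current(self, sid):
        """Return the latest version of a statement."""
        return self._history[sid][-1]

    def history(self, sid):
        """Return all versions of a statement, oldest first."""
        return list(self._history[sid])


store = StatementStore()
sid = store.add("ex:Study1", "ex:reportsMeanHeight", "10.7")
store.revise(sid, "10.73")

print(store.current(sid))
print(len(store.history(sid)))
```

In a property graph, the statement identifier and version number could be stored as properties on a relationship or on an intermediate node; which of these the project uses is not stated here.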


Floridi, L. (2011). The Philosophy of Information. Oxford University Press.