Semantic Descriptive Statistics

Some research groups in aerosol science study so-called new particle formation events. These are atmospheric events in which new particulate matter forms and grows in diameter over time. Such events are studied to improve our understanding of the mechanisms involved and to estimate the quantity of particulate matter formed, since it is a factor in climate science and a concern for respiratory health.

To study events, researchers first need to determine where and when they occur. For this, they employ what we call here primary observational data, since these data originate from sensing devices. Detected events then need to be described in terms of their properties, i.e., spatiotemporal and intrinsic characteristics such as the growth rate or event classification. The result of primary data interpretation is derivative information, i.e., meaningful data. Such derivative information describing individual events is subsequently used in statistical analysis, e.g., to derive descriptive statistics about events such as the mean duration. Such descriptive statistics are, among other things, published in scholarly literature. An example can be found in Hamed et al. (2007), "Nucleation and growth of new particles in Po Valley, Italy" (Table 3). Descriptive statistics are thus scientific information communicated in scholarly literature.
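The step from derivative information about individual events to a descriptive statistic can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's actual code: the event records, the field names `start` and `end`, and the helper functions are all assumptions made for the example, and the timestamps are invented rather than taken from any real observation.

```python
from datetime import datetime, timedelta

# Hypothetical event descriptions (invented values, not real observations).
events = [
    {"start": "2007-04-12T09:10", "end": "2007-04-12T15:00"},
    {"start": "2007-04-15T10:30", "end": "2007-04-15T16:10"},
    {"start": "2007-05-02T08:45", "end": "2007-05-02T14:45"},
]


def duration(event):
    """Return the duration of one event as a timedelta."""
    start = datetime.fromisoformat(event["start"])
    end = datetime.fromisoformat(event["end"])
    return end - start


def mean_duration(events):
    """Return the arithmetic mean of the event durations."""
    durations = [duration(e) for e in events]
    return sum(durations, timedelta()) / len(durations)


def format_hhmm(delta):
    """Format a timedelta as HH:MM, rounded to the nearest minute."""
    total_minutes = round(delta.total_seconds() / 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours:02d}:{minutes:02d}"


print(format_hhmm(mean_duration(events)))  # prints 05:50
```

The three invented events last 5:50, 5:40 and 6:00, so the mean duration is 05:50. The result here is a bare string, which carries no record of what it measures or how it was computed.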

While the reality of the research lifecycle is more complex, the Jupyter notebook presented here demonstrates the essential elements of how primary observational data are interpreted and how derivative information is processed to compute a descriptive statistic. The notebook also demonstrates how, in scholarly communication, authors writing their article can explicitly link the descriptive statistic to bibliographic information about the article in which this scientific information is published. This explicit and machine-readable link enables us to formulate some interesting queries on the resulting knowledge graph.
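To make the idea of an explicit link and a query over it more concrete, the following sketch stores a few statements as subject-predicate-object triples and asks which articles report an average value. It assumes nothing about the notebook's real vocabulary or tooling: every identifier with the `ex:` prefix, the placeholder article title, and the functions `match` and `articles_reporting` are hypothetical, and a real knowledge graph would more likely be queried with SPARQL.

```python
# Triples as (subject, predicate, object). All "ex:" identifiers and the
# article title are hypothetical placeholders, not the notebook's vocabulary.
triples = {
    ("ex:meanDuration", "rdf:type", "ex:AverageValue"),
    ("ex:meanDuration", "ex:hasValue", "05:47"),
    ("ex:meanDuration", "ex:isAbout", "ex:NewParticleFormationEvents"),
    ("ex:article", "rdf:type", "ex:ScholarlyArticle"),
    ("ex:article", "ex:title", "An example article"),
    ("ex:article", "ex:reports", "ex:meanDuration"),
}


def match(pattern, graph):
    """Yield triples matching a pattern in which None is a wildcard."""
    for triple in graph:
        if all(p is None or p == t for p, t in zip(pattern, triple)):
            yield triple


def articles_reporting(statistic_type, graph):
    """Return sorted (title, value) pairs for articles reporting such a statistic."""
    stats = {s for s, _, _ in match((None, "rdf:type", statistic_type), graph)}
    found = []
    for article, _, stat in match((None, "ex:reports", None), graph):
        if stat in stats:
            title = next(o for _, _, o in match((article, "ex:title", None), graph))
            value = next(o for _, _, o in match((stat, "ex:hasValue", None), graph))
            found.append((title, value))
    return sorted(found)


print(articles_reporting("ex:AverageValue", triples))
```

Because the statistic and the article are connected by an explicit statement, the query can traverse from the type of statistic to the publication that reports it, which is not possible when the value only appears as text in a table.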

The use case is also demonstrated in a D4Science Virtual Research Environment (VRE), with notebooks in a JupyterLab instance operated by EGI, and a CKAN-based catalogue. The VRE supports researchers in detecting and describing new particle formation events and in computing descriptive statistics. The key advancement compared to classical computing environments is that the infrastructure takes care of representing a value, say, 05:47 for mean duration, as an information object, namely a scalar measurement datum and average value that is a specified output of an arithmetic mean calculation with a data set as specified input. As an example of a system that integrates EOSC services, the VRE was also included in the EOSC Portal Marketplace.
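The difference between a bare value and such an information object can be illustrated with a small Python sketch. The class and attribute names below mirror the wording used above (scalar measurement datum, average value, specified input, specified output), but they are assumptions made for this example and do not reproduce the infrastructure's actual data model; the durations are invented values in minutes.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class DataSet:
    """The specified input: a labelled collection of durations in minutes."""
    label: str
    values_minutes: tuple


@dataclass(frozen=True)
class AverageValue:
    """The specified output: a typed value that remembers its origin."""
    types: tuple
    value_minutes: float
    unit: str
    is_specified_output_of: "ArithmeticMeanCalculation"


@dataclass(frozen=True)
class ArithmeticMeanCalculation:
    """The calculation that connects the input data set to the output datum."""
    has_specified_input: DataSet

    def execute(self):
        return AverageValue(
            types=("scalar measurement datum", "average value"),
            value_minutes=mean(self.has_specified_input.values_minutes),
            unit="minute",
            is_specified_output_of=self,
        )


# Invented durations in minutes, for illustration only.
durations = DataSet("event durations", (350, 340, 360))
datum = ArithmeticMeanCalculation(durations).execute()

print(datum.types, datum.value_minutes, datum.unit)
print(datum.is_specified_output_of.has_specified_input.label)
```

Starting from the datum, a consumer can recover what kind of value it is, which calculation produced it, and which data set went in, without relying on surrounding prose.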