Semantic Descriptive Statistics

Some research groups in aerosol science study so-called new particle formation events. These are atmospheric events in which new particulate matter forms and grows in diameter over time. Such events are studied to improve our understanding of the mechanisms involved and to estimate the quantity of particulate matter formed, since it is a factor in climate science and a concern for respiratory health.

To study events, researchers first need to determine where and when they occur. For this, they employ what we call here primary data, since these data originate from sensing devices. Detected events then need to be described in terms of their properties, i.e., their spatiotemporal and intrinsic characteristics, such as the growth rate or event classification. The result of interpreting primary data is derivative information, i.e., meaningful data. Such derivative information describing individual events is subsequently used in statistical analysis, e.g., to derive descriptive statistics about events, such as the mean duration. Such descriptive statistics are, among other things, published in scholarly literature. An example can be found in Hamed et al. (2007), "Nucleation and growth of new particles in Po Valley, Italy" (doi:10.5194/acp-7-355-2007, Table 3). Descriptive statistics of this kind are thus scientific information communicated in scholarly literature.
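As a minimal sketch of the last step, the following Python snippet computes a mean event duration from a list of described events. The event records, the timestamp format, and the function names are assumptions made for illustration; the start and end times are invented (chosen so that the result matches the example value 05:47 used in this text) and are not taken from Hamed et al. (2007).

```python
from datetime import datetime, timedelta

# Hypothetical derivative information: one (start, end) pair per detected event.
# These timestamps are invented for illustration only.
events = [
    ("2020-04-01 09:00", "2020-04-01 14:30"),
    ("2020-04-02 10:15", "2020-04-02 16:15"),
    ("2020-04-05 08:40", "2020-04-05 14:31"),
]

TIME_FORMAT = "%Y-%m-%d %H:%M"  # assumed timestamp layout


def mean_duration(records):
    """Return the arithmetic mean of the event durations as a timedelta."""
    durations = [
        datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
        for start, end in records
    ]
    return sum(durations, timedelta()) / len(durations)


def format_hhmm(delta):
    """Format a timedelta as HH:MM, rounded to the nearest minute."""
    total_minutes = round(delta.total_seconds() / 60)
    return f"{total_minutes // 60:02d}:{total_minutes % 60:02d}"


print(format_hhmm(mean_duration(events)))
```

The result of such a computation is, by itself, only a formatted value; nothing in it records what was averaged or how.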

For this use case, we have developed a prototype that demonstrates how a D4Science Virtual Research Environment, notebooks of a Jupyter Lab instance operated by EGI, and a CKAN-based catalogue can support researchers in performing such data analysis: detecting and describing new particle formation events and computing descriptive statistics. The key advancement compared to classical computing environments is that the infrastructure represents the value 05:47 for mean duration as an information object, namely a scalar measurement datum and average value that is the specified output of an arithmetic mean calculation whose specified input is a data set.
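To make the idea of an information object more concrete, the sketch below models the description given above with plain Python dataclasses. It is an illustration under stated assumptions, not the prototype's implementation: the class names, field names, and the data set label are hypothetical, and the type labels are simply the terms used in this text rather than identifiers from a particular vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    """Hypothetical stand-in for the data set of described events."""
    label: str
    types: tuple = ("data set",)


@dataclass(frozen=True)
class Calculation:
    """Hypothetical stand-in for the process that produced the value."""
    specified_input: DataSet
    types: tuple = ("arithmetic mean calculation",)


@dataclass(frozen=True)
class InformationObject:
    """A value together with its types and the calculation that produced it."""
    label: str
    value: str
    is_specified_output_of: Calculation
    types: tuple = ("scalar measurement datum", "average value")


def describe(obj):
    """Render the information object as a one-line, human-readable statement."""
    calc = obj.is_specified_output_of
    return (
        f"{obj.label} = {obj.value} "
        f"[{', '.join(obj.types)}] "
        f"is specified output of {calc.types[0]} "
        f"with specified input '{calc.specified_input.label}'"
    )


# Assumed label for the input data set; the value 05:47 is the example from the text.
event_descriptions = DataSet(label="described new particle formation events")
mean_event_duration = InformationObject(
    label="mean duration",
    value="05:47",
    is_specified_output_of=Calculation(specified_input=event_descriptions),
)

print(describe(mean_event_duration))
```

In contrast to a bare string such as "05:47", an object of this kind carries what the value is and how it was obtained, which is what allows the value to be traced back to its input data.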

The use case demonstrates not only the semantic representation of scientific information communicated in scholarly literature, but also how research infrastructure can represent the provenance of such information as it is derived from primary data.