Operational versus Analytical Data
While many data practitioners have a good intuitive grasp of the differences between operational and analytical data, digging deeper and making these differences explicit allows for a more profound understanding. That understanding helps in designing information systems in today’s world, where operational and analytical solutions are increasingly intertwined: more and more analytics-based solutions are entering operations, and analytics has always relied heavily on the availability of operational data.
Etymologically, the word operation descends from the Latin expression for “I work”. Today it has several meanings, one of which designates a category of work. Considering the principal players in an organisation, Mintzberg¹ states that the “operators do the basic operating work: producing the products, rendering the customer services, and whatever supports these directly”. Another perspective is that operational work corresponds to the activities of an organisation that are not intended to change the behaviour or structure of the organisation itself. Based on those definitions, operational data can be plainly defined as data collected and managed within the information systems supporting operational work.
Consequently, operational data focuses on the individual entities involved in the operational work. This focus is the first essential characteristic of operational data. Overall, three main categories of operational data are distinguished: master data, transactional data and reference data. Master data corresponds to the entities with which the organisation maintains a long-term relationship, providing context to the transactional data. Transactional data records the individual business events, such as orders or payments, in which those entities take part. Reference data has a supporting role by standardising values.
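To make the three categories concrete, a minimal sketch might look as follows; the entity names and values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

# Master data: entities with which the organisation maintains
# a long-term relationship.
@dataclass
class Customer:
    customer_id: str
    name: str
    country_code: str  # points into the reference data below

# Reference data: standardised values that other data refers to.
ISO_COUNTRY_CODES = {"BE": "Belgium", "NL": "Netherlands", "FR": "France"}

# Transactional data: records of individual business events,
# given context by the master data they reference.
@dataclass
class Order:
    order_id: str
    customer_id: str  # links the transaction to master data
    order_date: date
    amount: float

alice = Customer("C001", "Alice", "BE")
order = Order("O1001", alice.customer_id, date(2023, 5, 1), 250.0)
```

Note how the transaction only becomes meaningful through its links to master and reference data.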
Any organisation will keep its operational data in sync with reality. Depending on the context, this will be done with more or less rigour, but in any case, it is the second key characteristic of operational data. Events make the state of entities evolve. Because operational data represents this state, events trigger data updates. Operational data is intended to correspond with the most recent snapshot of reality. The notion of eventual consistency demonstrates that, while sometimes hard to achieve instantaneously, the goal of a complete and thus consistent correspondence with reality prevails. The MERODE² method makes the importance of these time considerations explicit in data modelling: cardinalities on the relationships between entities express the situation at one point in time, not throughout time.
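This event-driven updating can be sketched as follows; the store and event names are illustrative, not taken from any particular system.

```python
from dataclasses import dataclass

@dataclass
class CustomerRecord:
    customer_id: str
    address: str

class CustomerStore:
    """An operational store holds only the current snapshot of each entity."""

    def __init__(self):
        self._records = {}

    def apply_moved_event(self, customer_id: str, new_address: str) -> None:
        # The event updates the state in place; the previous address
        # is overwritten, because only the latest snapshot matters here.
        if customer_id in self._records:
            self._records[customer_id].address = new_address
        else:
            self._records[customer_id] = CustomerRecord(customer_id, new_address)

    def current_address(self, customer_id: str) -> str:
        return self._records[customer_id].address

store = CustomerStore()
store.apply_moved_event("C001", "Main Street 1")
store.apply_moved_event("C001", "Station Road 5")
```

After both events, only the most recent address survives; keeping history would already be a step towards analytical data.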
So where operational data represents individual entities at specific points in time, analytical data is about populations. It is first and foremost obtained by collecting operational data through space — not just one entity but many — and time. This is somewhat contradictory from an etymological perspective since the word analytical means to take things apart. While operational data supports the day-to-day work, analytical data enables an organisation to understand the composition, behaviour and evolution of a population of interest and to use that understanding for decision-making.
Insights and patterns or models derived from analytical data are often applied to individual entities. Examples are evaluating an investment opportunity by scrutinising data about similar companies, or a machine learning model detecting objects in a specific image. However, the statistical variation present in a population implies that drawing such conclusions about an individual entity is uncertain. This uncertainty is inherent to the population and remains regardless of how many data points are used; it is aptly called the irreducible error. Article 22 of the GDPR caters for this by protecting data subjects from decisions based solely on automated processing, e.g. by providing the right to human intervention.
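The irreducible error appears explicitly in the standard bias-variance decomposition of the expected squared prediction error, here for a fitted model \(\hat{f}\) of a true relationship \(y = f(x) + \varepsilon\) with noise variance \(\sigma^2\):

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

More data or a better model can shrink the first two terms, but the noise term \(\sigma^2\) is a property of the population itself.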
The meaning of operational data is — at least partially — provided by the applications and services that work with this data in their specific context. Analytical data, on the other hand, stands on its own and can be used for different purposes in various contexts. Therefore, there is a more pressing need for additional mechanisms such as metadata to capture its semantics. This detachment from the operational context also implies that data quality can no longer be judged by how well the corresponding operational processes are supported, but by how the objectives of the analytical processing are affected.
The maxim form follows function also applies to the structuring of data. An elementary but helpful view of an information system is that it takes data through a sequence of transformations to realise its behaviour. Well-structured data eases the implementation of these transformations. The specificities of the supported processes determine the desired behaviour of an operational system. Domain-Driven Design³ describes practical techniques to discover the optimal interplay between structure and behaviour and vividly evokes how challenging such a design journey can be.
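The view of an information system as a sequence of transformations can be sketched minimally; the function names and the VAT rate are illustrative assumptions.

```python
# Each transformation is a plain function from data to data;
# the system's behaviour is realised by composing them.
def validate(order: dict) -> dict:
    if order["amount"] <= 0:
        raise ValueError("amount must be positive")
    return order

def enrich(order: dict) -> dict:
    # Well-structured input makes this step trivial to implement.
    return {**order, "vat": round(order["amount"] * 0.21, 2)}

def persistable(order: dict) -> dict:
    # Keep only the fields the storage layer expects.
    return {k: order[k] for k in ("order_id", "amount", "vat")}

def process(order: dict) -> dict:
    return persistable(enrich(validate(order)))

result = process({"order_id": "O1", "amount": 100.0})
```

How well the data is structured at each step determines how simple each transformation can remain.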
The strong link between the characteristics of the domain and the structure of the data does not exist in the analytical world. On the contrary: analytical techniques and models prescribe the structure they require. As a consequence, the intricacies of the domain must be captured in the data’s content. Consider, for example, how a neural network requires the data to be mapped to the nodes of its input layer, or how the distinction between facts and dimensions in dimensional modelling enables the fundamental analytical operations of slicing and aggregating. Furthermore, composing or merging data is probably the most crucial transformation in analytics. Individual analytical data sets therefore need to support composition well, which can be achieved by following the principles put forward in the Tidy Data⁴ paper. These principles ensure that structure and semantics work well together.
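The core Tidy Data principle, each variable a column and each observation a row, can be illustrated with a plain-Python reshaping; the small population data set is invented for the example.

```python
# A 'wide' table: the year variable is spread across the column
# headers, which hinders composition with other data sets.
wide = [
    {"country": "BE", "2021": 11.6, "2022": 11.7},
    {"country": "NL", "2021": 17.5, "2022": 17.6},
]

def melt(rows, id_col, var_name, value_name):
    """Reshape wide records into tidy long form: one observation per row."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key != id_col:
                tidy.append({id_col: row[id_col], var_name: key, value_name: value})
    return tidy

tidy_rows = melt(wide, "country", "year", "population")
# Each row is now a single observation: a country, a year, a value.
```

In the long form, joining these rows with any other data set keyed on country and year becomes a straightforward composition.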