
The subject of data quality tends to be near the top of everyone's agenda when I talk to data leaders, but is data quality really what we should be talking about? The real outcome here is the confidence to make data-driven decisions: trusting the data you are looking at (or feeding into your ML model training, or providing to your GenAI chatbots).

Trust is subjective, so all we can really do is ensure the data is demonstrably trustworthy, and this is enabled through the happy collaboration of three data management activities:

  1. Understanding data lineage
  2. Managing data quality
  3. Ensuring consistency of understanding (this being what you are trying to achieve through a business glossary and similar activities).

Together these three activities help us understand where and what to check and correct. Trust in data is often difficult not because we know the data is wrong, but because we suspect it might be wrong, or simply cannot be sure it is right. It is not just about ensuring the data is right; it is about demonstrating – consistently and on demand – that it is.

Show Me The Way – Data Lineage

Not all data is created equal – any activity around improving the trustworthiness of data should focus on the most important and impactful data. Organisations tend to have a lot of data, and it is not all of equal importance. When it comes to analytics and reports that inform decision-making, there will normally be a set of key reports or metrics with a significant impact on the business. These will be fed by data from different sources, but they will represent a relatively small minority of the total volume of data.

A key step in the process is identifying important data and focussing improvement efforts there. This is supported by capturing data lineage – how the data moves and flows through the business.

With a map showing which data sets feed the key reports and metrics, and in turn which data sets feed those, the origins of the data can be understood and documented. This can be done manually, although like many manual activities it is time-consuming, prone to human error, and not automatically kept up to date. Fortunately, there are also tools to capture data lineage information – sometimes this is a feature of a data cataloguing tool, but there are also specialist tools for the purpose.

Data lineage can be documented at a number of levels:

  • Level 1: Showing linkages between systems (e.g. recording that the CRM feeds the ERP system)
  • Level 2: Showing linkages between tables/data sets (e.g. two specific tables from the CRM database feed a specific visualisation in Power BI)
  • Level 3: Showing column-to-column linkages (e.g. which fields in each of the source tables feed each field in a target visualisation)
  • Level 4: Showing transformations, describing how the data is manipulated and transformed as it is moved (e.g. the SQL query used to aggregate, group, sort or otherwise transform the data)

While a lot can be achieved around data trustworthiness by identifying which tables are involved in the creation of the key tables (i.e. level 2), a complete understanding requires level 4, which in turn requires more specialist data lineage tools.
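Level 3 lineage is, at heart, a graph of column-to-column edges that can be walked upstream to find the true origins of a figure. A minimal sketch of that idea (the table and column names here are illustrative, not from any real system):

```python
# Level 3 (column-to-column) lineage held as a simple graph:
# each target column maps to the source columns that feed it.
lineage = {
    "powerbi.revenue_chart.total_revenue": ["erp.invoices.amount"],
    "erp.invoices.amount": ["crm.orders.unit_price", "crm.orders.quantity"],
}

def upstream(column, graph):
    """Return every column that directly or indirectly feeds `column`."""
    sources = set()
    for src in graph.get(column, []):
        sources.add(src)
        sources |= upstream(src, graph)  # follow the chain back to origin
    return sources

# The full set of origins behind the key metric:
origins = upstream("powerbi.revenue_chart.total_revenue", lineage)
print(sorted(origins))
```

Specialist lineage tools build and maintain this graph automatically (often by parsing SQL), but the traversal they perform to answer "where does this number come from?" is essentially the above.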

Managing Data Errors – Data Quality Management

The idea of measuring data against the various dimensions of data quality – accuracy, consistency, completeness and so on – is a well-trodden path. However, this activity is just as valid as it has always been.

In many cases it can be difficult to make progress using the traditional 'waterfall' sequential approach: first defining the business rules the data should adhere to, then translating them into tests that measure how well the data meets those rules. The preparation and planning can be very time-consuming, and the 'blank sheet of paper' approach of asking the business "what should things be like?" tends to produce vague and somewhat intangible answers.

An alternative approach is to start by discovering how the data actually is, and use that understanding to prompt more specific questions about what 'good' looks like. For example, column 'x', which stores customer IDs, has some blank entries – leading to the question of whether it is acceptable for a customer not to have a customer ID. If not, then this becomes a business rule, and controls can be put in place both to measure adherence to the rule and to take steps to prevent breaches from happening.

In this way, looking at the data supports the planning process – business rules are determined from observation rather than in a vacuum. It also lets you focus on quality issues that are actually occurring, rather than theoretical rules and requirements that may never cause a problem.
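The discover-first loop can be sketched in a few lines: profile a field, let the observation prompt the rule, then keep the rule as a repeatable check. The records and field names below are illustrative:

```python
# Step 1: discovery - profile the data and surface oddities.
records = [
    {"customer_id": "C001", "name": "Alice"},
    {"customer_id": "",     "name": "Bob"},    # blank ID found by profiling
    {"customer_id": "C003", "name": "Cara"},
]

def profile_blanks(rows, field):
    """How many rows have a blank/missing value in `field`?"""
    return sum(1 for r in rows if not r.get(field))

# Step 2: the observation becomes a business rule we can measure on demand.
def rule_customer_id_present(rows):
    """Business rule: every customer record must carry a customer ID."""
    failures = [r for r in rows if not r.get("customer_id")]
    return len(failures) == 0, failures

blank_count = profile_blanks(records, "customer_id")  # prompts the question
passed, failures = rule_customer_id_present(records)  # measures adherence
print(blank_count, passed)
```

Dedicated profiling and data quality tools do the same thing at scale, but the shape is identical: observation first, rule second, ongoing measurement third.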

Managing Ambiguity and Misunderstanding – Business Glossary et al

The business glossary is a list of terms relevant to the business with corresponding definitions. However, this is just the first rung of a ladder of tools to help with semantics and understanding, from taxonomies and reference data to full ontologies.  This area is all about documenting a shared understanding of what things are called, what they are like and how they relate to each other.

When it comes to data trustworthiness, a lot of errors in data come down to misinterpretation – finance figures contradict sales because finance used invoiced figures for revenue while sales used booked figures. A notable example is the loss of the Mars Climate Orbiter, where two pieces of NASA software failed to use the same unit for measuring force: a correct value in imperial units was interpreted as if it were metric, so the right number became the wrong number. The data was absolutely correct; its interpretation was the problem.

Ensuring that all terminology involved in the data flow for critical reports and metrics (the nouns in report titles, the axes and legends of charts, the names of KPIs) is clearly and universally understood avoids such issues. Creating a business glossary to store and share these terms with their definitions allows all data elements to be checked against an agreed definition. Another useful addition is to attach business terms to data elements in the data catalogue, with a natural mapping: entities such as 'customer' or 'product' map to tables, and associated attributes such as 'age', 'name' or 'product ID' map to columns.
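The glossary-to-catalogue mapping can be pictured as two small lookups: terms with agreed definitions, and catalogue elements tagged with those terms. A sketch, with illustrative definitions and element names:

```python
# The business glossary: terms with their agreed definitions.
glossary = {
    "Customer": "A person or organisation that has placed at least one order.",
    "Revenue":  "Invoiced (not booked) sales value, excluding tax.",
}

# Catalogue elements tagged with glossary terms:
# entities map to tables, attributes map to columns.
catalogue_tags = {
    "crm.customers":             "Customer",
    "crm.customers.customer_id": "Customer",
    "finance.invoices.amount":   "Revenue",
}

def definition_for(element):
    """Surface the agreed meaning behind a catalogued data element."""
    term = catalogue_tags.get(element)
    if term is None:
        return "No glossary term attached - an ambiguity to resolve."
    return f"{term}: {glossary[term]}"

print(definition_for("finance.invoices.amount"))
```

Catalogue tools offer this tagging as a built-in feature; the value is that anyone reading a report axis labelled "Revenue" can trace it back to one agreed definition rather than guessing between invoiced and booked figures.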

Remediation Versus Monitoring

As with many activities within the field of data management, we are introducing new controls on our data that have never been applied before, because the need had not previously been there. As a result, we can expect that things are not where we would hope, and that a degree of remediation will be needed to reach a happy state. This may take some time and a fair amount of work. It normally looks like a single assessment that highlights a lot of shortcomings, followed by a concerted effort to overcome those issues – proactive work in response to a requirement for change.

Once remediation is complete, the job switches to maintaining the situation and responding if there is ever a dip back below the expected levels. This is reactive work in response to:

  • Changes in the data lineage which might indicate a shift in where your focus should lie.
  • Changes in the data quality identified by quality measurement tools which alert you based on thresholds you have set.
  • Issues raised by users – whether with quality or ambiguity.
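The second bullet – alerting against thresholds you have set – reduces to a simple comparison that quality and observability tools run on every measurement cycle. A sketch with illustrative metric names and values:

```python
# Thresholds agreed during remediation: the levels we promise to hold.
thresholds = {
    "customer_id_completeness": 0.99,   # at least 99% of rows populated
    "order_date_validity":      0.995,  # at least 99.5% of dates parseable
}

# The latest round of measurements from the quality checks.
latest = {
    "customer_id_completeness": 0.97,   # dipped below threshold
    "order_date_validity":      0.999,  # still healthy
}

def breaches(measurements, limits):
    """Return the metrics whose latest value fell below their threshold."""
    return {m: v for m, v in measurements.items()
            if m in limits and v < limits[m]}

alerts = breaches(latest, thresholds)
print(alerts)  # the reactive work: respond to whatever appears here
```

Real tools add scheduling, notification channels and trend history on top, but the monitoring mode described above is fundamentally this comparison repeated over time.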

A Quick Look

To drive data trust, organisations can start by understanding their current position and determining where they aim to be. This involves identifying the key subsets of critical data that require attention through data lineage exploration, examining the existing profile of that critical data to establish what constitutes 'good' or 'correct' data, and capturing and defining the terms used around the critical data to eliminate ambiguity, especially at the consumer end.

The next step is to measure the gap by testing the data for correctness using data quality management tools and validating the correct application of terminology throughout the data flow, aided by detailed data lineage information, including column-to-column mappings and data transformations.

Finally, to close the gap, they should take appropriate remedial actions. Keeping the gap closed means switching to a monitoring mode in which data quality management/data observability tools alert you when the situation falls below acceptable standards, users can highlight concerns or issues, and any issues raised are swiftly addressed to reinforce trust.

Allan Watkins

At the heart of Allan's professional journey, spanning more than 26 years, lies a deep-seated passion for acquiring knowledge and understanding the mechanics behind how things operate. His thought leadership content mirrors this curiosity, enabling readers to broaden their understanding of data governance and its complex web of policies.


© Nephos Technologies Ltd.