Removing unnecessary data is becoming increasingly worthwhile, whether to mitigate the risks of sensitive information falling into the wrong hands or to cut storage costs. This isn’t a new challenge. Many organisations have attempted to trim their data, but I am yet to encounter one that has truly mastered it.
With the growing emphasis on and maturity of data governance, we’re now better positioned to tackle the core issue that has long made this so difficult: viewing it solely as an IT problem.
Understanding the challenge
In my experience, it’s entirely possible to analyse your data to understand what’s there, assess its sensitivity, and even identify what may no longer be needed. However, what often seems to be missing is a clear mechanism to make the ultimate decision: “Yes, delete this.” Historically, this is because data deletion has been treated as an IT project. IT teams often lack the context to fully appreciate what the data represents, the impact of deleting it, and whether it’s still in use – it’s not their data, after all. Even when decision-making authority is placed with the right people (those who understand the data’s value and can confidently determine it’s no longer needed), technical complexities can still create a barrier. If the details are presented in a way that’s hard to grasp, uncertainty can lead to the default choice of, “Not sure, so leave it.”
Governance is good
Without diving into a lengthy debate about what data governance is or isn’t, the principles of effective governance apply here as they do everywhere. With a solid understanding of your environment and a good dose of common sense, you can ensure your efforts are well-directed and inefficiencies are minimised. When it comes to data disposition, this means:
- Decision-making rests with the right people – those who create, understand, and use the data are best placed to decide what should be kept or removed.
- A clear mechanism exists for identifying candidates for removal, making it easier to search and act on unnecessary data.
Where to start
As always, it comes down to return on investment – where you can achieve the biggest impact quickly with minimal effort.
Impact
The impact depends on your objective. If your goal is to reduce your data footprint and cut IT costs, focus on high-volume data stored on the most expensive infrastructure. If the priority is mitigating the risk of sensitive data leaks, target areas where the most sensitive data is most likely to be exposed.
Effectiveness
Start small and build gradually – whether you prefer ‘low-hanging fruit’ or ‘don’t boil the ocean,’ the idea remains the same. When identifying categories of data for removal, consider these factors:
- Ease of execution: Simple criteria, like categorising files by type or age, are straightforward to implement. In contrast, processing files to detect specific words, people, or content (e.g., in text, images, or audio) can be computationally intensive.
- Return on investment: A policy that removes just a handful of files may not justify the effort, while one that clears terabytes of sensitive data clearly delivers greater value. However, overly broad policies can introduce uncertainty and caveats, making it harder to gain approval for deletion.
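To make the two factors concrete, here is a minimal Python sketch of a "simple criteria" policy: files of a given type that have not been modified for a given number of days. The names `find_candidates`, `summarise` and `Candidate` are hypothetical, and using last-modified time as a stand-in for "age" is an assumption; a file that is still read regularly may not have been modified for years. The sketch only reports candidates and deletes nothing, so the decision stays with the people who own the data.

```python
import time
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Candidate:
    path: Path
    size_bytes: int
    age_days: float


def find_candidates(root, extensions, min_age_days, now=None):
    """Return files under root that match one of the extensions and were
    last modified at least min_age_days ago (assumption: mtime ~ age)."""
    now = time.time() if now is None else now
    wanted = {ext.lower() for ext in extensions}
    candidates = []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        stat = path.stat()
        age_days = (now - stat.st_mtime) / 86400  # seconds per day
        if age_days >= min_age_days:
            candidates.append(Candidate(path, stat.st_size, age_days))
    return candidates


def summarise(candidates):
    """File count and total size, to judge whether a policy is worth pursuing."""
    return {
        "files": len(candidates),
        "total_bytes": sum(c.size_bytes for c in candidates),
    }
```

Calling `summarise(find_candidates("/some/share", [".log"], 365))` on a path of your choosing gives a rough sense of whether a policy would clear a handful of files or a meaningful volume, before anyone is asked to approve it.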
Chicken or egg?
The logical approach to a project like this is to define a policy for a specific category of data to remove, identify where that data exists, and then delete it. While this makes sense in theory, technological constraints often get in the way. Can you accurately classify which data fits the criteria? Some data is far easier to identify than others, and accuracy can vary. Factoring these limitations into the policy-making process ensures that the policies you define are actually actionable.
Starting from a blank page can also be challenging. Gaining an initial understanding of your data first can spark conversations about good removal candidates and help test hypotheses about which policies are likely to deliver meaningful results.
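One way to get that initial understanding is a rough profile of what is stored, before any policy exists. The Python sketch below groups files by extension and reports the count, total size and age of the oldest file in each group, largest first. The function name `profile_by_extension` is hypothetical, and file extension and last-modified time are only crude proxies for "type" and "age"; treat it as a conversation starter rather than a classification.

```python
import time
from collections import defaultdict
from pathlib import Path


def profile_by_extension(root, now=None):
    """Summarise files under root per extension, sorted by total bytes."""
    now = time.time() if now is None else now
    profile = defaultdict(lambda: {"files": 0, "bytes": 0, "oldest_days": 0.0})
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        stat = path.stat()
        # Files without an extension are grouped under a placeholder key.
        entry = profile[path.suffix.lower() or "(none)"]
        entry["files"] += 1
        entry["bytes"] += stat.st_size
        age_days = (now - stat.st_mtime) / 86400  # seconds per day
        entry["oldest_days"] = max(entry["oldest_days"], age_days)
    return sorted(profile.items(), key=lambda kv: kv[1]["bytes"], reverse=True)
```

The top few rows of the result suggest where a policy is likely to pay off, and the extensions that are hard to interpret show where classification will need more than simple criteria.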