Data Hygiene and Why it Matters

By Emma Warrillow

Published on 30 Nov, 2022

Your marketing efforts are only as strong as the quality of your data, and with marketing budgets tightening up, it is increasingly important to ensure that your efforts (and your marketing dollars) are not wasted.

An alarming 25% of U.S. organizations believe that their data is inaccurate. Poor data quality, or ‘messy data’, can quickly lead to wasted time and money, and could even damage your brand’s reputation. Dirty data costs the average business 15% to 25% of revenue, and costs the U.S. economy over $3 trillion annually.

So, what makes data quality good or bad?

What is Data Hygiene?

Data hygiene is the set of processes carried out to ensure the cleanliness of data. Data is considered clean if it is relatively error-free and current.

5 key characteristics of clean, quality data:

Valid – Does your data conform to the defined business rules and objectives? You are looking to see if the value is in the accepted set of values. For example, in a numeric field for month, the value of 25 is impossible.
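A validity rule like the month example can be written as a simple check. The following is a minimal sketch, not a definitive implementation; the record layout and the function names (is_valid_month, find_invalid_months) are assumptions made for illustration.

```python
# Hypothetical business rule: a month field must be an integer from 1 to 12.
VALID_MONTHS = range(1, 13)

def is_valid_month(value):
    """Return True if value is an integer in the accepted set of months."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return value in VALID_MONTHS

def find_invalid_months(records, field="month"):
    """Return the records whose month field falls outside the accepted set."""
    return [r for r in records if not is_valid_month(r.get(field))]

# Made-up sample records for illustration only.
records = [
    {"id": 1, "month": 11},
    {"id": 2, "month": 25},    # impossible month
    {"id": 3, "month": None},  # a missing value also fails the rule
]
invalid = find_invalid_months(records)
print([r["id"] for r in invalid])  # [2, 3]
```

In practice, a check like this would typically run at the point of data entry, so that an impossible value is rejected before it is ever stored.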

Accurate – Is the information correct? How well does your data reflect the truth? Data accuracy refers to error-free records that can be used as a reliable source of information. For example, inaccurate data may come from customers themselves, such as someone entering a false birthdate to appear older (or younger) than they really are.

Complete – Are there critical gaps in your data? Completeness refers to how comprehensive the information is. When looking at data completeness, think about whether all of the data you need is available; for example, marketers are frequently concerned with whether an email address is available for each customer that they wish to contact.
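One way to put a number on completeness is to measure the share of records that have a value for a critical field. This is a rough sketch under stated assumptions: the customer records are made up, and the completeness function is a hypothetical helper, not part of any particular tool.

```python
def completeness(records, field):
    """Return the share of records with a non-empty value for field (0.0 to 1.0)."""
    if not records:
        return 0.0
    # Treat None, empty strings and whitespace-only strings as missing.
    filled = sum(1 for r in records if str(r.get(field) or "").strip())
    return filled / len(records)

# Made-up sample data for illustration only.
customers = [
    {"name": "A", "email": "a@example.com"},
    {"name": "B", "email": ""},
    {"name": "C", "email": None},
    {"name": "D", "email": "d@example.com"},
]
print(f"{completeness(customers, 'email'):.0%}")  # 50%
```

Tracking this figure over time for each critical field is a simple way to see whether gaps are closing or growing.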

Consistent – How confident are you with the data sources? Are they reliable? Database consistency rules require that data be written and formatted in ways that support your system’s definition of valid data. One way to ensure data consistency is through anomaly detection, which helps you to identify unexpected values or events in your data set. For example, if each record is meant to include a country and a time zone, and “Spain” is paired with “Eastern Standard Time Zone”, the other, consistent entries tell us that this record is an anomaly. This often happens when data is merged from a variety of sources that do not collect or maintain data under the same rules.
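The Spain example can be expressed as a cross-field check against a reference table. The sketch below is illustrative only: the EXPECTED_TIME_ZONES table is a deliberately incomplete assumption (a real one would come from a maintained source), and the field names are hypothetical.

```python
# Illustrative, deliberately incomplete reference table (an assumption).
EXPECTED_TIME_ZONES = {
    "Spain": {"Central European Time"},
    "Canada": {"Eastern Standard Time", "Pacific Standard Time"},
}

def find_inconsistent(records):
    """Return records whose time zone is not expected for their country."""
    flagged = []
    for record in records:
        allowed = EXPECTED_TIME_ZONES.get(record["country"])
        # Countries missing from the reference table are skipped, not flagged.
        if allowed is not None and record["time_zone"] not in allowed:
            flagged.append(record)
    return flagged

# Made-up sample records for illustration only.
records = [
    {"id": 1, "country": "Spain", "time_zone": "Central European Time"},
    {"id": 2, "country": "Spain", "time_zone": "Eastern Standard Time"},
    {"id": 3, "country": "Canada", "time_zone": "Eastern Standard Time"},
]
print([r["id"] for r in find_inconsistent(records)])  # [2]
```

Flagged records are candidates for review rather than automatic correction, since the error could sit in either field.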

Timely – How current is the information you’re referencing? Timeliness refers to the speed of the dissemination of the data, that is, the lapse of time between the end of a reference period (or a reference date) and the dissemination of the data. Data also generally decays over time; for example, marital status collected in 2012 may no longer be current in 2022.
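Data decay can be monitored by comparing the date a value was collected against a maximum acceptable age. This is a minimal sketch: the five-year threshold is an arbitrary assumption, is_stale is a hypothetical helper, and a year is approximated as 365 days.

```python
from datetime import date

def is_stale(collected_on, as_of, max_age_years):
    """Return True if the value is older than max_age_years as of the given date."""
    age_days = (as_of - collected_on).days
    # Approximation: one year is treated as 365 days.
    return age_days > max_age_years * 365

# Marital status collected in 2012, reviewed in late 2022,
# with an assumed five-year shelf life.
print(is_stale(date(2012, 6, 1), date(2022, 11, 30), max_age_years=5))  # True
print(is_stale(date(2022, 1, 15), date(2022, 11, 30), max_age_years=5))  # False
```

Different fields decay at different rates, so the threshold would normally be set per field rather than once for the whole database.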

Real-time data refers to data that is presented as it is acquired. The idea of real-time data handling has become increasingly popular in new technologies such as those that deliver up-to-the-minute information to mobile devices. However, storing that data over time may mean it is no longer as useful as it was in the moment.

What causes dirty data?

Dirty data can be caused by a number of factors including duplicate records, incomplete or outdated data, and the improper parsing of record fields from disparate systems. This inconsistent, messy data can lead to misinformed, and perhaps fatal, business decisions.
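Duplicate records often hide behind small formatting differences. As a rough sketch, assuming email address is the matching key (an assumption, since many organizations match on several fields), duplicates can be found by normalizing the key before comparing. The helper names are hypothetical.

```python
def normalize_email(email):
    """Trim whitespace and lowercase so trivially different entries match."""
    return email.strip().lower()

def deduplicate(records, key="email"):
    """Keep the first record seen for each normalized key value."""
    seen = {}
    for record in records:
        normalized = normalize_email(record[key])
        # setdefault keeps the first record and ignores later duplicates.
        seen.setdefault(normalized, record)
    return list(seen.values())

# Made-up sample records for illustration only.
contacts = [
    {"name": "Jane Doe", "email": "jane@example.com"},
    {"name": "J. Doe", "email": " Jane@Example.com "},
    {"name": "Sam Lee", "email": "sam@example.com"},
]
unique = deduplicate(contacts)
print(len(unique))  # 2
```

Keeping the first record is the simplest possible policy; a real merge would usually combine the best fields from each duplicate.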

In general, a poor data strategy leads to dirty data, but common things we have seen include:

Poor process around manually entering data (unclear naming or taxonomy standards)
Broken automation or workflows
Poor business process or duplicative processes
System setup – e.g., free-form text fields where a pull-down menu would be better
Systems being brought together, e.g., two companies are acquired and their data is brought into one database

In some cases, a poor understanding of the value and potential uses of data is what causes the dirty data in the first place. For example, many sales representatives enter only the minimal data they need for their own purposes into a CRM (e.g., 4th Floor, RBC Tower) rather than the whole address, not considering that a central team might want to use the data to plan client visits, mail a holiday card to that contact, etc.

Ensuring everyone on your team understands the value of data is critical.

Why is data hygiene so important?

Having clean data will ultimately increase overall productivity and allow for the highest quality information in your decision-making. Benefits of clean data sets include:

More efficient business practices
Quicker and more accurate decision-making
Fewer errors
Happier customers/clients
Less-frustrated employees

What is data cleaning?

Tableau defines data cleaning as “the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct.”

Conducting a thorough data audit to uncover deficiencies, duplicates and inconsistencies should be the first step you tackle as part of a coherent, centralized data quality strategy. While some data cleaning is straightforward and clear-cut, there is often a need for further validation from the client or a third-party data source; for example, an incorrect address can be corrected using address accuracy software.

Cleaning data, however, requires more than a one-off fix. Of equal importance in your strategy is understanding how data issues are occurring and which processes are causing them. Fixing upstream processes can include everything from adding drop-down menus or predictive-fill algorithms to training front-line staff and identifying broken data connections across data sets.

Preventative action upfront avoids the creation of dirty data and saves organizations considerable money. Consider the 1-10-100 rule coined in 1992 by George Labovitz and Yu Sang Chang. They posit that if prevention (verifying the quality up front) costs $1, then correction (cleansing, deduplicating, etc.) costs $10; furthermore, leaving the data ‘broken’ costs $100 in failure costs, because marketing or sales may rely on it, resulting in missed opportunities, reputational cost, etc. So, costs rise tenfold at each stage the longer you let dirty data fester.
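To make the 1-10-100 rule concrete, here is a small worked calculation. It is a sketch under a stated assumption: the $1, $10 and $100 figures are applied as per-record costs, and the batch of 500 records is made up for illustration.

```python
# Relative costs from the 1-10-100 rule, applied per record (an assumption).
COST_PER_RECORD = {"prevention": 1, "correction": 10, "failure": 100}

def total_cost(num_records, stage):
    """Return the cost in dollars of handling num_records at the given stage."""
    return num_records * COST_PER_RECORD[stage]

# Hypothetical batch of 500 problem records.
for stage in COST_PER_RECORD:
    print(f"{stage}: ${total_cost(500, stage):,}")
# prevention: $500
# correction: $5,000
# failure: $50,000
```

The absolute numbers matter less than the ratio: each stage you wait multiplies the bill by ten.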

Other considerations

In addition to hygiene itself, there are other critical considerations for data that often impact its usefulness.

Relevant – Is the data you’re collecting actionable? Relevancy reflects the degree to which statistical information meets the real needs of clients. Are the data points you are collecting tied to your overall organizational objectives? What data would you love to have access to, but don’t? How would that information make a difference to your organization, if you were able to analyze it? Is the data just noise and irrelevant to your objectives?

Accessible – One other characteristic to consider when reviewing your data hygiene practices is whether the data is accessible to everyone who needs it. Is there a single source of reliable data that everyone can act on? If not, your updated data strategy should prioritize the accessibility of data throughout your organization.

Not surprisingly, 31% of organizations report a lack of internal communication between departments as a source of inaccurate data. Those darn silos!

In conclusion

Data has become the foundation for making successful business decisions, and yet 91% of organizations suffer from common data errors. Clean, well-organized data not only drives smart decisions, but also stands to dramatically improve both your customers’ and your employees’ experiences.

Need support with your data hygiene practices or in mapping out a coherent, centralized data quality strategy? Talk to Shift Paradigm today!


Written By Emma Warrillow

SVP, Research & Data Insights at Shift Paradigm, Emma has over 25 years of experience helping organizations with the strategic use of customer data to drive business results. Her work has spanned a variety of industries in B2C and B2B, including those that market to intermediaries (like brokers).