While the previous section addressed data that is simply missing, we now turn our attention to data that is present but incorrect. Errors can creep into datasets in numerous ways: manual data entry mistakes (typos), faulty sensors or measurement equipment, variations in formatting over time, or bugs in the systems that originally captured the data. Correcting these errors is a significant step towards creating reliable data suitable for analysis or loading into a target system.
Think about common mistakes you might see in everyday information: a phone number with too many digits, a birth date listed in the future, or a state abbreviated inconsistently ('CA', 'Calif.', 'California'). These might seem minor, but in large datasets, such inconsistencies can lead to inaccurate reports, failed analyses, and flawed decision making.
Types of Data Errors
Errors can manifest in various forms. Here are some common categories you'll encounter:
- Incorrect Data Types: A field expected to contain numbers (like quantity or price) might hold text characters ('N/A', 'unknown', or even just a typo like '1O' instead of '10').
- Out-of-Range Values: Data might fall outside plausible limits. For example, an 'Age' field containing '200' or a percentage value greater than 100.
- Invalid Values: A field might contain values that are syntactically correct but don't belong to a defined set of possibilities. For instance, a 'State' field containing 'XX' when only valid US state abbreviations are allowed.
- Inconsistent Formatting: The same piece of information might be represented differently across records. Dates ('MM/DD/YYYY' vs. 'YYYY-MM-DD'), state names ('NY' vs. 'New York'), or units ('kg' vs. 'pounds') are frequent examples.
- Contradictory Data: Different fields within the same record might conflict. A record might list a city as 'Los Angeles' but have a zip code belonging to 'New York'.
- Duplicate Records: Sometimes, the same entity (like a customer or a transaction) appears multiple times in the dataset, potentially with slight variations that make simple removal difficult. While deduplication is usually handled as a separate step, identifying near-duplicates typically requires correcting inconsistent fields first.
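To make these categories concrete, here is a small, purely hypothetical example table built with pandas (the library choice, column names, and values are illustrative assumptions, not part of any real system); most of the error types above appear somewhere in it.

```python
import pandas as pd

# Hypothetical orders data containing several of the error types described above.
orders = pd.DataFrame({
    "customer":   ["Ann Lee", "Ann Lee", "Bob Roy", "Cara Diaz"],   # rows 0 and 1 are near-duplicates
    "age":        [34, 34, 200, 29],                                # 200 is out of range
    "quantity":   ["10", "1O", "3", "5"],                           # '1O' is a typo, not a number
    "state":      ["CA", "Calif.", "XX", "NY"],                     # inconsistent ('Calif.') and invalid ('XX')
    "city":       ["Los Angeles", "Los Angeles", "Chicago", "Los Angeles"],  # last row contradicts state 'NY'
    "order_date": ["2023-01-15", "01/15/2023", "2023-02-01", "2023-03-10"],  # mixed date formats
})

print(orders)
```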
Identifying Errors
Before you can correct errors, you need to find them. This often involves a combination of techniques:
- Data Profiling: This involves analyzing the data to understand its structure, content, and quality. Simple profiling techniques, demonstrated in the sketch after this list, include:
  - Finding the minimum and maximum values for numerical columns (quickly reveals outliers).
  - Counting the frequency of distinct values in categorical columns (highlights inconsistencies like 'CA' vs. 'Calif.').
  - Checking data types against expectations.
- Validation Rules: Define explicit rules that the data must adhere to. These can be simple checks (e.g., `age > 0 AND age < 120`) or more complex patterns (e.g., checking whether a product code matches a specific format using regular expressions). Data that violates these rules is flagged as erroneous.
- Consistency Checks: Compare related fields to ensure they make sense together. Does the city match the state or zip code? Is the order date before the shipping date?
- Using Reference Data: Compare values against known, trusted lists (lookup tables). For instance, check if a 'Country' field contains values from an official list of country codes.
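The sketch below combines these identification techniques on the small hypothetical orders table from earlier. It is a minimal illustration, not a prescribed approach: the pandas library, the column names, the rule thresholds, and the (truncated) list of valid state codes are all assumptions made for the example.

```python
import pandas as pd

# Re-create the small hypothetical orders table (ISO dates for simplicity).
orders = pd.DataFrame({
    "age":        [34, 34, 200, 29],
    "quantity":   ["10", "1O", "3", "5"],
    "state":      ["CA", "Calif.", "XX", "NY"],
    "order_date": ["2023-01-15", "2023-01-15", "2023-02-01", "2023-03-10"],
    "ship_date":  ["2023-01-20", "2023-01-10", "2023-02-03", "2023-03-12"],
})

# Data profiling: types, ranges, and value frequencies.
print(orders.dtypes)                             # are the types what we expect?
print(orders["age"].min(), orders["age"].max())  # min/max quickly reveals the 200 outlier
print(orders["state"].value_counts())            # frequencies expose 'Calif.' and 'XX'

# Validation rules: age > 0 AND age < 120, and quantity must be numeric.
bad_age = ~orders["age"].between(1, 119)
bad_qty = pd.to_numeric(orders["quantity"], errors="coerce").isna()

# Reference data: compare against a trusted (here truncated) list of state codes.
valid_states = {"CA", "NY", "TX", "IL"}
bad_state = ~orders["state"].isin(valid_states)

# Consistency check: an order should not be placed after it ships.
bad_dates = pd.to_datetime(orders["order_date"]) > pd.to_datetime(orders["ship_date"])

# Rows violating at least one rule are candidates for correction or review.
flagged = orders[bad_age | bad_qty | bad_state | bad_dates]
print(flagged)
```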
Strategies for Correction
Once errors are identified, you need a plan to fix them. The appropriate strategy depends on the type of error, the context, and the potential impact of the correction.
- Rule-Based Correction: Implement automatic rules to fix common, predictable errors (several of these strategies appear in the sketch after this list).
  - Example: If `State` is 'Calif.' or 'California', replace it with 'CA'.
  - Example: If `PhoneNumber` contains characters like '(', ')', '-', or spaces, remove them to standardize the format.
- Lookup Tables: Use external reference tables to standardize or validate values.
  - Example: A `ZipCode` lookup table can provide the correct `City` and `State`, allowing you to fix inconsistencies.
- Statistical Approaches: For numerical outliers that are clearly impossible (like `Age = 200`), you might replace them with a statistical measure such as the mean or median age from valid records. Use this approach cautiously, as it alters the original data distribution; it is generally better suited for impossible values than merely unusual ones.
- Parsing and Reformatting: For inconsistencies in complex fields like dates or addresses, you might need to parse the field into components, standardize them, and then reconstruct the field in the desired format.
- Flagging: Sometimes, an error cannot be reliably corrected automatically. In these cases, the best approach might be to flag the record or the specific field as erroneous, potentially routing it for manual review or excluding it from certain analyses.
- Manual Review: For complex or ambiguous errors, especially in smaller datasets or critical fields, manual inspection and correction by a human might be necessary. This is often implemented as an "exception handling" process where flagged records are presented to a data steward.
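Here is a minimal sketch of several of these strategies applied to a small hypothetical table with pandas. The mappings, lookup table, and thresholds are illustrative assumptions; a real pipeline would draw them from maintained reference data and business rules.

```python
import pandas as pd

# Hypothetical records with several correctable problems.
orders = pd.DataFrame({
    "age":        [34, 200, 29],
    "state":      ["Calif.", "XX", "NY"],
    "phone":      ["(212) 555-0101", "415-555-0102", "212 555 0103"],
    "zip":        ["90001", "60601", "10001"],
    "order_date": ["2023-01-15", "01/15/2023", "2023-03-10"],
})

# Rule-based correction: map known variants to standard codes,
# and strip punctuation and spaces from phone numbers.
state_map = {"Calif.": "CA", "California": "CA", "New York": "NY"}
orders["state"] = orders["state"].replace(state_map)
orders["phone"] = orders["phone"].str.replace(r"[()\s-]", "", regex=True)

# Lookup table: derive the trusted city/state from the zip code.
zip_lookup = pd.DataFrame({
    "zip":       ["90001", "60601", "10001"],
    "city":      ["Los Angeles", "Chicago", "New York"],
    "state_ref": ["CA", "IL", "NY"],
})
orders = orders.merge(zip_lookup, on="zip", how="left")

# Statistical approach: replace the impossible age with the median of valid ages.
valid_ages = orders.loc[orders["age"].between(1, 119), "age"]
orders.loc[~orders["age"].between(1, 119), "age"] = int(valid_ages.median())

# Parsing and reformatting: normalize two known date formats to ISO.
parsed = pd.to_datetime(orders["order_date"], format="%Y-%m-%d", errors="coerce")
parsed = parsed.fillna(pd.to_datetime(orders["order_date"], format="%m/%d/%Y", errors="coerce"))
orders["order_date"] = parsed.dt.strftime("%Y-%m-%d")

# Flagging: anything that still contradicts the reference data goes to manual review.
orders["needs_review"] = orders["state"].ne(orders["state_ref"])
print(orders)
```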
Here's a simple flow illustrating how you might decide on a correction strategy for an invalid state abbreviation:
*Figure: Decision process for handling potentially invalid state abbreviations.*
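Expressed as code, that decision process might look roughly like the sketch below; the valid-state set and the variant mapping are hypothetical stand-ins for real reference data.

```python
# Hypothetical reference data; in practice these come from trusted lookup tables.
VALID_STATES = {"CA", "NY", "TX", "IL"}                       # truncated official list
KNOWN_VARIANTS = {"Calif.": "CA", "California": "CA", "N.Y.": "NY", "New York": "NY"}

def correct_state(value: str) -> tuple[str, bool]:
    """Return (corrected_value, needs_manual_review)."""
    cleaned = value.strip()
    if cleaned.upper() in VALID_STATES:
        return cleaned.upper(), False          # already a valid abbreviation
    if cleaned in KNOWN_VARIANTS:
        return KNOWN_VARIANTS[cleaned], False  # fixable with a known mapping
    return cleaned, True                       # cannot fix automatically: flag for review

print(correct_state("Calif."))  # ('CA', False)
print(correct_state("XX"))      # ('XX', True) -- routed to manual review
```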
Correcting errors is often an iterative process. You might apply one set of rules, profile the data again, discover new issues, and refine your correction logic. The goal is not necessarily perfection, which can be unattainable, but rather to significantly improve data quality and consistency for its intended use in the loading stage.