Data rarely arrives in a pristine, ready-to-use state. Just like ingredients need preparation before cooking, data needs cleaning before analysis. Understanding how data gets messy is the first step in learning how to fix it. Imperfections can creep in at various stages, from initial collection to storage and processing. Let's look at some frequent origins of these data quality problems.
Manual Data Entry Errors
Humans are involved in collecting and entering data in many scenarios, and humans make mistakes. These are often the most common and varied sources of dirty data:
- Typos: Simple spelling errors in text fields (like `New Yrok` instead of `New York`) or transposed digits in numbers (`132` instead of `123`).
- Inconsistent Formatting: Entering dates in different ways (`01/15/2023`, `15-Jan-2023`, `2023/01/15`), using different units without notation (miles vs. kilometers), or inconsistent capitalization (`california`, `California`, `CA`).
- Inconsistent Abbreviations/Labels: Using different terms for the same thing, like `USA`, `U.S.A.`, and `United States`, or `Female`, `F`, and `Woman`.
- Missing Values Entered Incorrectly: Sometimes, instead of leaving a field truly empty (which systems often represent as `NULL` or `NaN`), users type placeholders like `N/A`, `Missing`, `?`, `999`, or even just a space. These values aren't missing from the system's perspective, but they represent missing information that needs special handling (see the sketch after this list).
- Transcription Errors: Mistakes made when copying data from one source (like a paper form) to another (like a spreadsheet).
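As a concrete illustration of the placeholder problem, here is a minimal pandas sketch (the file contents, column names, and placeholder strings are invented for this example) that tells the parser which placeholders to treat as true missing values at load time:

```python
import io

import pandas as pd

# Hypothetical raw export with placeholder "missing" values.
raw = io.StringIO(
    "city,population,state\n"
    "New Yrok,8400000,NY\n"
    "Los Angeles,N/A,california\n"
    "Chicago,999,?\n"
)

# na_values lists the placeholder strings pandas should read as NaN.
df = pd.read_csv(raw, na_values=["N/A", "Missing", "?", "999"])

# The placeholders are now recognized as genuinely missing.
print(df.isna().sum())
```

Note that mapping a sentinel like `999` to missing is only safe after confirming it never appears as a legitimate value in that column.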
Data Integration Issues
Organizations often combine data from multiple sources, such as different databases, departments, or external partners. This integration process can introduce several problems:
- Schema Mismatches: Different sources might store similar information using different column names (`CustomerID` vs. `CustID`), data types (numeric vs. string), or structures. Merging them requires careful mapping and transformation.
- Conflicting Values: The same entity (e.g., a customer) might have different information in different source systems (like an old address in one and a new address in another). Deciding which value is correct (or how to represent both) is a challenge.
- Duplicate Records: Merging datasets can easily introduce duplicate entries if there isn't a reliable, unique identifier shared across all systems. For example, merging customer lists based on names might duplicate entries for people with common names.
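To make these integration issues concrete, the following sketch (with invented column names and records) harmonizes a schema mismatch by renaming columns, then removes the duplicate that a naive concatenation introduces:

```python
import pandas as pd

# Two hypothetical customer lists with mismatched schemas.
crm = pd.DataFrame({"CustomerID": [1, 2], "email": ["a@example.com", "b@example.com"]})
billing = pd.DataFrame({"CustID": [2, 3], "email": ["b@example.com", "c@example.com"]})

# Map the differing column names onto a single schema before combining.
billing = billing.rename(columns={"CustID": "CustomerID"})
combined = pd.concat([crm, billing], ignore_index=True)

# Customer 2 now appears twice; a shared identifier lets us deduplicate.
combined = combined.drop_duplicates(subset="CustomerID", keep="last")
print(combined)
```

Here `keep="last"` is one arbitrary way to resolve conflicting values; real pipelines usually encode an explicit rule, such as trusting the most recently updated source.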
System and Software Errors
Automated systems are not immune to creating data issues:
- Software Bugs: Errors in the code used for data collection, transformation, or storage can corrupt data or record incorrect values. For instance, a faulty script might truncate text fields or miscalculate values.
- Data Transmission Errors: Data can become corrupted as it moves between systems, especially over unstable networks. This might manifest as garbled text or incorrect numeric values.
- Sensor Malfunctions: In fields like IoT or scientific research, physical sensors can fail or provide inaccurate readings due to calibration issues, environmental factors, or hardware defects.
- Character Encoding Problems: Different systems might use different standards (like UTF-8, ASCII, Latin-1) to represent characters. Mixing encodings without proper handling can lead to unreadable text (e.g., `â€œ` instead of a quote).
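You can reproduce this kind of mojibake in a few lines; this sketch encodes a curly quote as UTF-8 and then decodes the same bytes with the wrong codec:

```python
# A curly left double quote, encoded correctly as UTF-8 bytes.
original = "\u201cHello"
utf8_bytes = original.encode("utf-8")

# Decoding those bytes with the wrong codec produces garbled text.
print(utf8_bytes.decode("cp1252"))  # â€œHello
print(utf8_bytes.decode("utf-8"))   # “Hello (the correct round trip)
```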
Data Decay
Information can become outdated over time. This is particularly relevant for data involving people, organizations, or real-world states:
- Changes in Real-World Attributes: People move, change jobs, change phone numbers or email addresses. Businesses change names or locations. Products get discontinued. Data collected previously might no longer reflect the current reality.
- Shifting Relevance: The importance or meaning of certain data points can change. For example, historical purchasing data might become less relevant for predicting future behavior as customer preferences evolve.
Lack of Defined Standards
Without clear guidelines established before data collection begins, inconsistencies are almost guaranteed:
- Undefined Data Dictionaries: No central reference exists to define what each field means, what format it should be in, what values are allowed (e.g., for categorical data), or how missing values should be indicated.
- Inconsistent Units: Different teams or systems might record measurements using different units (e.g., pounds vs. kilograms, Fahrenheit vs. Celsius) without explicitly stating the unit used.
- Varying Categorical Labels: Different codes or text strings might be used for the same category (e.g., `1`, `M`, `Male` for gender; `CA`, `Calif.`, `California` for state).
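A small mapping table is often enough to collapse such variants into one canonical form. This sketch (labels, columns, and values invented for illustration) standardizes categorical labels with pandas and makes a unit explicit by converting it once:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["1", "M", "Male", "F"],
    "state": ["CA", "Calif.", "California", "CA"],
    "weight_lb": [150.0, 200.0, 180.0, 130.0],  # assuming pounds were recorded
})

# Collapse the label variants onto one canonical value per category.
df["gender"] = df["gender"].replace({"1": "Male", "M": "Male", "F": "Female"})
df["state"] = df["state"].replace({"Calif.": "California", "CA": "California"})

# Name the unit in the column itself so it can't be misread later.
df["weight_kg"] = df["weight_lb"] * 0.453592
print(df)
```

A data dictionary agreed on before collection would make these canonical labels, and the expected units, explicit from the start.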
Recognizing these potential sources helps you anticipate the types of problems you might encounter in a new dataset. This awareness guides your initial exploration and informs the specific cleaning techniques you'll need to apply, which we will cover in subsequent chapters.