Let's put the concepts from this chapter into action. You've learned about standardizing text case, removing extra whitespace, performing simple replacements, and handling basic unit conversions. Now, we'll work through a practical example using a hypothetical dataset to solidify these skills. Applying these techniques is a common part of preparing data for analysis.
Imagine we have a small dataset representing product information, perhaps collected from different sources or through manual entry. As is often the case, it contains inconsistencies that need addressing.
Here's a simplified view of our initial data:
| ID | ProductName | Origin | WeightStr |
|---|---|---|---|
| 1 | ' Super Widget ' | 'usa' | '5 kg' |
| 2 | ' Power Gadget ' | 'USA' | '11 lbs' |
| 3 | 'Flexo Thing' | ' U.S.A. ' | '2 kg' |
| 4 | 'Super Widget' | 'Canada' | '7 kg' |
| 5 | 'Power Gadget' | 'usa' | '15 lbs' |
| 6 | 'Flexo Thing ' | 'United States' | '2.1 kg' |
We can see several issues:

- `ProductName` has leading/trailing whitespace.
- `Origin` uses inconsistent capitalization ('usa', 'USA', 'U.S.A.', 'United States') and has surrounding whitespace.
- `WeightStr` contains both kilograms ('kg') and pounds ('lbs'), and the values are stored as strings combined with units.

Our goal is to clean this data to make it uniform and ready for analysis. We'll assume we are using a common data manipulation tool like the pandas library in Python, but the concepts apply broadly.
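To follow along, the sample table above can be loaded into a pandas DataFrame named `df` (the name used in the snippets below); one minimal way to construct it:

```python
import pandas as pd

# Recreate the sample dataset shown in the table above
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'ProductName': [' Super Widget ', ' Power Gadget ', 'Flexo Thing',
                    'Super Widget', 'Power Gadget', 'Flexo Thing '],
    'Origin': ['usa', 'USA', ' U.S.A. ', 'Canada', 'usa', 'United States'],
    'WeightStr': ['5 kg', '11 lbs', '2 kg', '7 kg', '15 lbs', '2.1 kg'],
})
```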
First, let's tackle the `Origin` and `ProductName` columns. We want consistent casing for `Origin` (let's choose lowercase) and no unnecessary spaces in either column.
Targeting `Origin`:

```python
# Example using pandas
# Convert to lowercase
df['Origin'] = df['Origin'].str.lower()
# Remove leading/trailing whitespace
df['Origin'] = df['Origin'].str.strip()
```
Targeting `ProductName`:

```python
# Example using pandas
# Remove leading/trailing whitespace (keep the original capitalization)
df['ProductName'] = df['ProductName'].str.strip()
```
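Stripping matters beyond appearance: whitespace variants of the same name count as distinct values when grouping or deduplicating. A small self-contained check (values taken from the table above):

```python
import pandas as pd

# ' Super Widget ' and 'Super Widget' are different strings until stripped
names = pd.Series([' Super Widget ', 'Super Widget'])
print(names.nunique())               # 2 distinct values before stripping
print(names.str.strip().nunique())   # 1 after stripping
```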
After these steps, our `Origin` and `ProductName` columns look like this:
| ID | ProductName | Origin | WeightStr |
|---|---|---|---|
| 1 | 'Super Widget' | 'usa' | '5 kg' |
| 2 | 'Power Gadget' | 'usa' | '11 lbs' |
| 3 | 'Flexo Thing' | 'u.s.a.' | '2 kg' |
| 4 | 'Super Widget' | 'canada' | '7 kg' |
| 5 | 'Power Gadget' | 'usa' | '15 lbs' |
| 6 | 'Flexo Thing' | 'united states' | '2.1 kg' |
The `Origin` column is better, but we still have variations like 'usa', 'u.s.a.', and 'united states'. We should standardize these to a single format, for example, 'usa'. We can use string replacement for this.
```python
# Example using pandas
# Map the remaining variants to a single canonical value
replace_map = {
    'u.s.a.': 'usa',
    'united states': 'usa'
}
df['Origin'] = df['Origin'].replace(replace_map)
```
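Note that `replace` with a dictionary matches entire cell values, not substrings, which is why the keys must be the already-lowercased, stripped variants. A quick standalone check (values from the intermediate table above):

```python
import pandas as pd

# Origin values after lowercasing and stripping
origins = pd.Series(['usa', 'usa', 'u.s.a.', 'canada', 'usa', 'united states'])
replace_map = {'u.s.a.': 'usa', 'united states': 'usa'}
cleaned = origins.replace(replace_map)
print(sorted(cleaned.unique()))  # ['canada', 'usa']
```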
Now, the `Origin` column is consistent:
| ID | ProductName | Origin | WeightStr |
|---|---|---|---|
| 1 | 'Super Widget' | 'usa' | '5 kg' |
| 2 | 'Power Gadget' | 'usa' | '11 lbs' |
| 3 | 'Flexo Thing' | 'usa' | '2 kg' |
| 4 | 'Super Widget' | 'canada' | '7 kg' |
| 5 | 'Power Gadget' | 'usa' | '15 lbs' |
| 6 | 'Flexo Thing' | 'usa' | '2.1 kg' |
The `WeightStr` column mixes units ('kg' and 'lbs') and stores the data as text. To perform calculations, we need a numeric column with consistent units. Let's convert everything to kilograms (kg).
The conversion factor is approximately 1 lb = 0.453592 kg.
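As a quick sanity check of this factor, converting the 11 lbs entry from the sample data:

```python
lbs_to_kg = 0.453592  # approximate lbs-to-kg conversion factor

# 11 lbs in kilograms
print(round(11 * lbs_to_kg, 6))  # 4.989512
```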
This often involves several sub-steps: extracting the numeric value and the unit from each string, defining the conversion factor, and applying the conversion to produce a new numeric column, `Weight_kg`.

```python
# Example using pandas (conceptual steps)

# 1. Extract numeric value and unit (e.g., using string splitting or regex)
#    Example: '5 kg' -> (5.0, 'kg'), '11 lbs' -> (11.0, 'lbs')
#    Store these in temporary columns, e.g., 'WeightValue' and 'WeightUnit'
df[['WeightValue', 'WeightUnit']] = df['WeightStr'].str.extract(r'(\d+\.?\d*)\s*(\w+)')
df['WeightValue'] = df['WeightValue'].astype(float)

# 2. Define the conversion factor
lbs_to_kg = 0.453592

# 3. Apply conversion where needed
# Create the new column, initially copying kg values
df['Weight_kg'] = df['WeightValue']
# Find rows where the unit is 'lbs' and apply the conversion
lbs_mask = df['WeightUnit'].str.lower() == 'lbs'
df.loc[lbs_mask, 'Weight_kg'] = df.loc[lbs_mask, 'WeightValue'] * lbs_to_kg

# 4. (Optional) Drop the temporary/original columns
# df = df.drop(columns=['WeightStr', 'WeightValue', 'WeightUnit'])
```
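The regular expression `(\d+\.?\d*)\s*(\w+)` captures the number and the unit as separate groups; `str.extract` returns one column per group. A standalone check on a few sample values:

```python
import pandas as pd

# Each capture group becomes a column: 0 holds the number, 1 holds the unit
weights = pd.Series(['5 kg', '11 lbs', '2.1 kg'])
parts = weights.str.extract(r'(\d+\.?\d*)\s*(\w+)')
print(parts[0].tolist())  # ['5', '11', '2.1']  (still strings at this point)
print(parts[1].tolist())  # ['kg', 'lbs', 'kg']
```

Note the extracted numbers are still strings, which is why the main example follows up with `astype(float)`.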
After applying the conversion, our dataset looks much cleaner, with a dedicated numeric column for weight in kilograms:
| ID | ProductName | Origin | WeightStr | WeightValue | WeightUnit | Weight_kg |
|---|---|---|---|---|---|---|
| 1 | 'Super Widget' | 'usa' | '5 kg' | 5.0 | 'kg' | 5.000000 |
| 2 | 'Power Gadget' | 'usa' | '11 lbs' | 11.0 | 'lbs' | 4.989512 |
| 3 | 'Flexo Thing' | 'usa' | '2 kg' | 2.0 | 'kg' | 2.000000 |
| 4 | 'Super Widget' | 'canada' | '7 kg' | 7.0 | 'kg' | 7.000000 |
| 5 | 'Power Gadget' | 'usa' | '15 lbs' | 15.0 | 'lbs' | 6.803880 |
| 6 | 'Flexo Thing' | 'usa' | '2.1 kg' | 2.1 | 'kg' | 2.100000 |
We can round the `Weight_kg` column to a reasonable number of decimal places if desired and then potentially drop the intermediate columns (`WeightStr`, `WeightValue`, `WeightUnit`) if they are no longer needed.
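A minimal sketch of this final tidy-up, using a single converted row from the example (the frame here is a hypothetical stand-in for the full dataset):

```python
import pandas as pd

# One converted row, with the intermediate columns still present
df = pd.DataFrame({
    'WeightStr': ['11 lbs'],
    'WeightValue': [11.0],
    'WeightUnit': ['lbs'],
    'Weight_kg': [4.989512],
})

# Round to two decimal places, then drop the intermediate columns
df['Weight_kg'] = df['Weight_kg'].round(2)
df = df.drop(columns=['WeightStr', 'WeightValue', 'WeightUnit'])
print(list(df.columns))  # ['Weight_kg']
```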
Through these steps, we applied the formatting techniques discussed in this chapter:

- Standardized the text case of the `Origin` column.
- Removed extra whitespace from `ProductName` and `Origin`.
- Performed simple replacements to unify values in the `Origin` column.
- Converted units and extracted numeric values to create the `Weight_kg` column.

Our dataset is now significantly more consistent and easier to work with for analysis or merging with other data. This practice exercise demonstrates how these seemingly simple formatting steps are fundamental for creating reliable datasets.
© 2025 ApX Machine Learning