Let's put the concepts from this chapter into action. You've learned about standardizing text case, removing extra whitespace, performing simple replacements, and handling basic unit conversions. Now, we'll work through an example using a dataset to solidify these skills. Applying these techniques is a common part of preparing data for analysis.Imagine we have a small dataset representing product information, perhaps collected from different sources or through manual entry. As is often the case, it contains inconsistencies that need addressing.Here's a simplified view of our initial data:IDProductNameOriginWeightStr1' Super Widget ''usa''5 kg'2' Power Gadget ''USA''11 lbs'3'Flexo Thing'' U.S.A. ''2 kg'4'Super Widget''Canada''7 kg'5'Power Gadget''usa''15 lbs'6'Flexo Thing ''United States''2.1 kg'We can see several issues:ProductName has leading/trailing whitespace.Origin uses inconsistent capitalization ('usa', 'USA', 'U.S.A.', 'United States') and has surrounding whitespace.WeightStr contains both kilograms ('kg') and pounds ('lbs'), and the values are stored as strings combined with units.Our goal is to clean this data to make it uniform and ready for analysis. We'll assume we are using a common data manipulation tool like the pandas library in Python, but the concepts apply broadly.Step 1: Standardizing Text Case and Removing WhitespaceFirst, let's tackle the Origin and ProductName columns. We want consistent casing (let's choose lowercase) and no unnecessary spaces.Targeting Origin:Convert all entries to lowercase.Remove leading and trailing whitespace.# Example using pandas # Convert to lowercase df['Origin'] = df['Origin'].str.lower() # Remove leading/trailing whitespace df['Origin'] = df['Origin'].str.strip()Targeting ProductName:Remove leading and trailing whitespace.# Example using pandas df['ProductName'] = df['ProductName'].str.strip()After these steps, our Origin and ProductName columns look like this:IDProductNameOriginWeightStr1'Super Widget''usa''5 kg'2'Power Gadget''usa''11 lbs'3'Flexo Thing''u.s.a.''2 kg'4'Super Widget''canada''7 kg'5'Power Gadget''usa''15 lbs'6'Flexo Thing''united states''2.1 kg'Step 2: Simple String ReplacementsThe Origin column is better, but we still have variations like 'usa', 'u.s.a.', and 'united states'. We should standardize these to a single format, for example, 'usa'.We can use string replacement for this.# Example using pandas replace_map = { 'u.s.a.': 'usa', 'united states': 'usa' } df['Origin'] = df['Origin'].replace(replace_map)Now, the Origin column is consistent:IDProductNameOriginWeightStr1'Super Widget''usa''5 kg'2'Power Gadget''usa''11 lbs'3'Flexo Thing''usa''2 kg'4'Super Widget''canada''7 kg'5'Power Gadget''usa''15 lbs'6'Flexo Thing''usa''2.1 kg'Step 3: Basic Unit ConversionThe WeightStr column mixes units ('kg' and 'lbs') and stores the data as text. To perform calculations, we need a numeric column with consistent units. Let's convert everything to kilograms (kg).The conversion factor is approximately $1 \text{ lb} = 0.453592 \text{ kg}$.This often involves several sub-steps:Separate the numeric value from the unit ('kg' or 'lbs').Convert the numeric part to a floating-point number.Identify rows with 'lbs' and apply the conversion factor.Store the result in a new numeric column, perhaps named Weight_kg.# Example using pandas (steps) # 1. Extract numeric value and unit (e.g., using string splitting or regex) # Example: '5 kg' -> (5.0, 'kg'), '11 lbs' -> (11.0, 'lbs') # Store these in temporary columns, e.g., 'WeightValue' and 'WeightUnit' df[['WeightValue', 'WeightUnit']] = df['WeightStr'].str.extract(r'(\d+\.?\d*)\s*(\w+)') df['WeightValue'] = df['WeightValue'].astype(float) # 2. Define the conversion factor lbs_to_kg = 0.453592 # 3. Apply conversion where needed # Create the new column, initially copying kg values df['Weight_kg'] = df['WeightValue'] # Find rows where the unit is 'lbs' and apply the conversion lbs_mask = df['WeightUnit'].str.lower() == 'lbs' df.loc[lbs_mask, 'Weight_kg'] = df.loc[lbs_mask, 'WeightValue'] * lbs_to_kg # 4. (Optional) Drop the temporary/original columns # df = df.drop(columns=['WeightStr', 'WeightValue', 'WeightUnit'])After applying the conversion, our dataset looks much cleaner, with a dedicated numeric column for weight in kilograms:IDProductNameOriginWeightStrWeightValueWeightUnitWeight_kg1'Super Widget''usa''5 kg'5.0'kg'5.0000002'Power Gadget''usa''11 lbs'11.0'lbs'4.9895123'Flexo Thing''usa''2 kg'2.0'kg'2.0000004'Super Widget''canada''7 kg'7.0'kg'7.0000005'Power Gadget''usa''15 lbs'15.0'lbs'6.8038806'Flexo Thing''usa''2.1 kg'2.1'kg'2.100000We can round the Weight_kg column to a reasonable number of decimal places if desired and then potentially drop the intermediate columns (WeightStr, WeightValue, WeightUnit) if they are no longer needed.Summary of PracticeThrough these steps, we applied the formatting techniques discussed in this chapter:Standardized text case in the Origin column.Trimmed leading/trailing whitespace from ProductName and Origin.Used string replacement to unify variations in the Origin column.Performed a basic unit conversion on the weight data, creating a consistent, numeric Weight_kg column.Our dataset is now significantly more consistent and easier to work with for analysis or merging with other data. This practice exercise demonstrates how these seemingly simple formatting steps are fundamental for creating reliable datasets.