Let's put the concepts from this chapter into practice. We've learned about different data types and why ensuring columns have the correct type matters for analysis. Now we'll walk through a common scenario where you receive data with mixed or incorrect types and apply the techniques discussed to fix them.

We'll use a small dataset representing product orders. Imagine this data came from combining different sources or manual entry, leading to inconsistencies. We will use the pandas library, a standard tool for data manipulation in Python. If you haven't used it before, pandas provides structures called DataFrames, which are like tables, and Series, which are like columns, along with functions to work with them.

## Setting Up the Example

First, let's create our sample DataFrame. In a real project you would load this from a file (like a CSV), but here we'll define it directly for clarity.

```python
import pandas as pd
import numpy as np  # numpy provides np.nan, the marker pandas uses for missing values

data = {
    'OrderID': [1, 2, 3, 4, 5],
    'Product': ['Apple', 'Banana', 'Milk', 'Bread', 'Apple'],
    'Category': ['Fruit', 'Fruit', 'Dairy', 'Bakery', 'Fruit'],
    'Quantity': ['10', '5', '2', '1', '5'],  # Stored as strings
    'Unit_Price': ['$0.50', '$0.30', '$3.5O', '$2.50', '$0.50'],  # Strings with symbols and an error
    'Order_Date': ['01/15/2023', '01/16/2023', '01/16/2023', '2023-01-17', '01/18/2023']  # Mixed formats stored as strings
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
```

## Inspecting Initial Data Types

Before making changes, let's check the data types pandas assigned to each column. The `info()` method is excellent for this, providing types and non-null counts.

```python
print("\nInitial Data Types:")
df.info()
```

You'll likely see output similar to this:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   OrderID     5 non-null      int64
 1   Product     5 non-null      object
 2   Category    5 non-null      object
 3   Quantity    5 non-null      object
 4   Unit_Price  5 non-null      object
 5   Order_Date  5 non-null      object
dtypes: int64(1), object(5)
memory usage: 368.0+ bytes
```

Notice that `OrderID` is correctly identified as an integer (`int64`), but `Product`, `Category`, `Quantity`, `Unit_Price`, and `Order_Date` are all listed as `object`. While `Product` and `Category` being objects (which usually means strings in pandas) might be acceptable initially, `Quantity`, `Unit_Price`, and `Order_Date` definitely need correction for proper analysis (e.g., calculating total cost or analyzing sales over time).

## Converting to Numeric Types

Let's tackle `Quantity` and `Unit_Price`.

**1. Quantity:** This column contains numbers stored as strings, and string arithmetic misbehaves: summing the column as-is would concatenate the digits into `'105215'` rather than add them. We can convert it directly using `pd.to_numeric`.

```python
df['Quantity'] = pd.to_numeric(df['Quantity'])
print("\nData types after converting Quantity:")
print(df.dtypes)
```

The output should now show `Quantity` as `int64` or `float64`.

**2. Unit_Price:** This column is trickier. It contains dollar signs (`$`) and potentially other non-numeric characters (like the letter 'O' instead of zero in `'$3.5O'`).

First, we need to clean the strings by removing the `$` sign. We can use the `.str.replace()` method available for pandas Series with string data.

```python
df['Unit_Price'] = df['Unit_Price'].str.replace('$', '', regex=False)
print("\nUnit_Price after removing '$':")
print(df['Unit_Price'])
```

Now, try converting to numeric. What happens with the `'$3.5O'` entry?

```python
# Attempting direct conversion would raise a ValueError,
# because '3.5O' is not a valid number:
# df['Unit_Price'] = pd.to_numeric(df['Unit_Price'])

# Use errors='coerce' to turn invalid values into NaN instead
df['Unit_Price'] = pd.to_numeric(df['Unit_Price'], errors='coerce')
print("\nUnit_Price after conversion with errors='coerce':")
print(df['Unit_Price'])
print("\nData types after converting Unit_Price:")
print(df.dtypes)
```

Using `errors='coerce'` tells pandas to replace any value that cannot be converted to a number with `NaN` (Not a Number), which represents missing data. You'll see `NaN` where `'$3.5O'` used to be. This is useful because it prevents the entire conversion from failing. You would then need to decide how to handle this new missing value, perhaps by investigating the source or using an imputation technique discussed in Chapter 2. For now, we'll leave it as `NaN`. The `Unit_Price` column should now be of type `float64`.
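Before moving on, it's worth confirming exactly which rows the coercion affected. The walkthrough above doesn't include this check, but a short boolean-mask filter (a minimal sketch using the columns defined in this example) makes the damage visible:

```python
# Show the rows where coercion introduced a missing price, so the
# problem can be traced back to the source data (the '$3.5O' typo).
coerced = df[df['Unit_Price'].isna()]
print(coerced[['OrderID', 'Product', 'Unit_Price']])
```

On a larger dataset, comparing the non-null counts from `df.info()` before and after the conversion serves the same purpose at a glance.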
## Converting to Datetime Types

The `Order_Date` column contains dates stored as strings, and even mixes two formats. pandas has a powerful `pd.to_datetime` function that can often figure out the format automatically, though newer versions ask you to be explicit when formats are inconsistent.

```python
# From pandas 2.0 onward, a column with inconsistent date formats raises an
# error by default; format='mixed' asks pandas to infer the format for each
# value individually (older versions did this inference automatically).
df['Order_Date'] = pd.to_datetime(df['Order_Date'], format='mixed')
print("\nData types after converting Order_Date:")
print(df.dtypes)
```

If all dates shared one known format, you should specify it with the `format` argument, which is both faster and safer. For example, if every date looked like `'01/15/2023'`, you could use `pd.to_datetime(df['Order_Date'], format='%m/%d/%Y')`. The `Order_Date` column should now be `datetime64[ns]`.
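To see why this conversion matters, here is a brief sketch (not part of the original example) of the date-aware operations a true datetime column supports via the `.dt` accessor and ordinary comparisons:

```python
# Date components, date filtering, and date arithmetic all work now.
print(df['Order_Date'].dt.day_name())                    # weekday of each order
print(df[df['Order_Date'] >= '2023-01-16'])              # filter by a date threshold
print(df['Order_Date'].max() - df['Order_Date'].min())   # span of the orders (a Timedelta)
```

None of this works reliably on the original strings, which would be compared character by character rather than as dates.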
## Converting to Categorical Type

The `Category` column contains text representing distinct groups (`'Fruit'`, `'Dairy'`, `'Bakery'`). While leaving it as `object` (string) is fine, converting it to a `category` data type can be more memory-efficient and sometimes speeds up operations, especially when values repeat often (like `'Fruit'` appearing multiple times).

```python
df['Category'] = df['Category'].astype('category')
print("\nFinal Data Types after all conversions:")
df.info()
```

The `info()` output now reflects the changes: `Quantity` is numeric, `Unit_Price` is numeric (float) with one missing value introduced by coercion, `Order_Date` is datetime, and `Category` is categorical.

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   OrderID     5 non-null      int64
 1   Product     5 non-null      object          # still object (string), which is fine
 2   Category    5 non-null      category
 3   Quantity    5 non-null      int64
 4   Unit_Price  4 non-null      float64         # note: 4 non-null due to coercion
 5   Order_Date  5 non-null      datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 593.0+ bytes
```

Note that on a tiny five-row frame like this, the `category` dtype's bookkeeping can actually add a little memory; the savings become substantial when a column holds many rows with few distinct values.

*Figure: Data Type Distribution Before and After Correction — count of columns for each data type in the DataFrame before (1 `int64`, 5 `object`) and after (2 `int64`, 1 `object`, 1 `category`, 1 `float64`, 1 `datetime64[ns]`) applying the type corrections.*

## Summary of Practical Steps

In this hands-on section, we:

1. Inspected the initial data types using `df.info()`.
2. Converted a string column (`Quantity`) to numeric using `pd.to_numeric()`.
3. Cleaned a string column (`Unit_Price`) by removing unwanted characters (`$`) using `.str.replace()`.
4. Converted the cleaned column to numeric using `pd.to_numeric(errors='coerce')` to handle invalid entries gracefully by turning them into `NaN`.
5. Converted a string column with date information (`Order_Date`) into a proper datetime format using `pd.to_datetime()`.
6. Converted a string column with repetitive values (`Category`) into a memory-efficient `category` type using `.astype('category')`.
7. Verified the final data types.

Correcting data types is a fundamental step in data preprocessing. It ensures that your data is stored appropriately, preventing errors and enabling accurate calculations, comparisons, and analysis in subsequent steps. By applying these techniques, you make your data ready for visualization, statistical analysis, or machine learning model training.
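As a closing illustration (not part of the original walkthrough, and using only the columns defined above), here is a minimal sketch of the kind of analysis the corrected types now support, including the total-cost calculation mentioned earlier:

```python
# Numeric * numeric gives a real total per order; the coerced NaN price
# simply propagates, and sum() skips NaN values by default.
df['Total'] = df['Quantity'] * df['Unit_Price']

# Spend per category; observed=True limits output to categories actually present.
print(df.groupby('Category', observed=True)['Total'].sum())

# Datetime ordering lets us view the orders chronologically.
print(df.sort_values('Order_Date')[['OrderID', 'Order_Date', 'Total']])
```

Every line here would either fail or return nonsense on the original all-string columns, which is exactly why the type corrections come first.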