Let's put the concepts from this chapter into practice. We'll work through common scenarios of reading data from files, making a minor adjustment, and then saving the results. This simulates a frequent pattern in data analysis: load, process, save.
First, ensure you have Pandas imported. Typically, we use the alias pd
.
import pandas as pd
print(f"Pandas version: {pd.__version__}")
For these exercises, we need some data files. Let's imagine we have two files:
students.csv
: A Comma Separated Value file.grades.xlsx
: A Microsoft Excel file.If you are running this code locally, you can create these files yourself.
Create students.csv
:
Save the following text into a file named students.csv
in the same directory as your script or notebook:
StudentID,Name,Major
101,Alice,Physics
102,Bob,Chemistry
103,Charlie,Mathematics
104,David,Physics
Create grades.xlsx
:
You'll need a library like openpyxl
installed (pip install openpyxl
) to work with .xlsx
files. Create an Excel file named grades.xlsx
with the following data in a sheet named Sheet1
:
StudentID | Course | Grade |
---|---|---|
101 | Mechanics | 85 |
102 | Org Chem | 92 |
101 | E&M | 78 |
103 | Calculus | 95 |
102 | Thermo | 88 |
104 | Mechanics | 81 |
Now, let's read the students.csv
file into a Pandas DataFrame using pd.read_csv()
.
# Read the CSV file
students_df = pd.read_csv('students.csv')
# Display the DataFrame
print("Students DataFrame:")
print(students_df)
You'll notice Pandas correctly identified the first row as the header and used a default integer index (0, 1, 2, 3).
What if our file used a different separator, like a tab? Or what if we wanted the StudentID
column to be the DataFrame's index? We can use parameters in read_csv
.
# Example: Reading assuming StudentID should be the index
students_df_indexed = pd.read_csv('students.csv', index_col='StudentID')
print("\nStudents DataFrame with StudentID as index:")
print(students_df_indexed)
Setting index_col='StudentID'
tells Pandas to use the values from that column as the row labels (the index) for the DataFrame.
Next, let's load the data from the grades.xlsx
file using pd.read_excel()
.
# Read the Excel file (ensure openpyxl or another engine is installed)
try:
grades_df = pd.read_excel('grades.xlsx')
print("\nGrades DataFrame (from Excel Sheet1):")
print(grades_df)
except ImportError:
print("\nPlease install 'openpyxl' to read Excel files: pip install openpyxl")
except FileNotFoundError:
print("\nMake sure 'grades.xlsx' is in the correct directory.")
By default, read_excel
reads the first sheet. If our data was on a different sheet, say "CourseGrades", we would specify it:
# Example: Reading a specific sheet (if it existed)
# grades_df_specific_sheet = pd.read_excel('grades.xlsx', sheet_name='CourseGrades')
You can use the sheet name (string) or its zero-based index (integer) for sheet_name
.
Let's take the grades_df
we loaded and add a new column. Suppose we want a 'Status' column indicating 'Pass' if the grade is 70 or above, and 'Fail' otherwise. We can achieve this using boolean indexing and assignment.
# Check if grades_df was loaded successfully before proceeding
if 'grades_df' in locals():
# Add a 'Status' column, default to 'Fail'
grades_df['Status'] = 'Fail'
# Update 'Status' to 'Pass' where Grade >= 70
grades_df.loc[grades_df['Grade'] >= 70, 'Status'] = 'Pass'
print("\nGrades DataFrame with Status column added:")
print(grades_df)
else:
print("\nSkipping modification as grades_df was not loaded.")
We first created the column and assigned a default value ('Fail'). Then, we used .loc
with a boolean condition (grades_df['Grade'] >= 70
) to select the rows where the grade met the passing threshold and updated only their 'Status' value to 'Pass'.
Now that we have modified grades_df
, let's save it back to a new CSV file named processed_grades.csv
.
A common requirement is to not write the DataFrame's index (0, 1, 2...) into the file, as it often doesn't represent meaningful data. We use the index=False
parameter for this.
# Check if grades_df exists before saving
if 'grades_df' in locals():
# Save the modified DataFrame to CSV without the index
grades_df.to_csv('processed_grades.csv', index=False)
print("\nSaved modified grades data to processed_grades.csv (index excluded).")
# You can verify by reading it back or opening the file
# check_csv = pd.read_csv('processed_grades.csv')
# print("\nVerification read from processed_grades.csv:")
# print(check_csv)
else:
print("\nSkipping CSV export as grades_df is not available.")
If you open processed_grades.csv
, you'll see the columns StudentID
, Course
, Grade
, and Status
, but not the default 0-based index column.
Similarly, we can save the modified DataFrame to an Excel file. Let's call it processed_grades.xlsx
and put the data in a sheet named 'Final Grades'. Again, we'll typically want index=False
.
# Check if grades_df exists before saving
if 'grades_df' in locals():
try:
# Save the modified DataFrame to Excel without the index, specifying sheet name
grades_df.to_excel('processed_grades.xlsx', sheet_name='Final Grades', index=False)
print("\nSaved modified grades data to processed_grades.xlsx (Sheet: 'Final Grades', index excluded).")
# Verification (optional)
# check_excel = pd.read_excel('processed_grades.xlsx', sheet_name='Final Grades')
# print("\nVerification read from processed_grades.xlsx:")
# print(check_excel)
except ImportError:
print("\nPlease install 'openpyxl' to write Excel files: pip install openpyxl")
else:
print("\nSkipping Excel export as grades_df is not available.")
This creates an Excel file with our processed data neatly organized in the specified sheet.
In this practice session, you used the fundamental Pandas IO functions:
pd.read_csv()
to load data from text files, exploring index_col
.pd.read_excel()
to load data from spreadsheets, exploring sheet_name
..to_csv()
to save DataFrames, emphasizing the use of index=False
..to_excel()
to save DataFrames, using index=False
and sheet_name
.You also performed a simple data modification step between reading and writing, demonstrating a basic data processing workflow. Mastering these input and output operations is essential for integrating Pandas into your data analysis tasks.
© 2025 ApX Machine Learning