Before you can analyze data, you need to bring it into your analysis environment. As outlined in the previous chapter, Python, particularly the Pandas library, provides robust tools for this task. This section focuses on the practical steps of loading data from some of the most common file formats you'll encounter: Comma Separated Values (CSV), Microsoft Excel spreadsheets, and JavaScript Object Notation (JSON). Mastering these initial steps is fundamental to any data analysis project.
We assume you have your Python environment set up with Pandas installed, as discussed in Chapter 1.
CSV files are perhaps the most common format for storing tabular data. They are plain text files where values are typically separated by commas. Pandas provides the read_csv() function, a versatile tool for reading these files into a DataFrame.

The most basic usage requires just the path to the file:
import pandas as pd

# Assuming 'data.csv' is in the same directory as your script/notebook
try:
    df_csv = pd.read_csv('data.csv')
    print("CSV file loaded successfully.")
    # Proceed with inspecting df_csv (covered in the next section)
except FileNotFoundError:
    print("Error: data.csv not found. Please check the file path.")
except Exception as e:
    print(f"An error occurred: {e}")
While this works for standard, comma-delimited files, read_csv() offers many parameters to handle variations:

- filepath_or_buffer: The primary argument, accepting a file path (string), a URL, or any object with a read() method.
- sep (or delimiter): Specifies the character used to separate values. If your file uses tabs ('\t'), semicolons (';'), or other delimiters, you must set this parameter accordingly. For example: pd.read_csv('data.tsv', sep='\t'). Pandas assumes a comma by default (it can attempt to sniff the delimiter if you pass sep=None with engine='python'), so an explicit value is safest.
- header: Indicates which row contains the column names. By default it's header=0 (the first row). If your file has no header, use header=None; Pandas will then assign default integer column names (0, 1, 2, ...). You can also specify a different row index if the header isn't on the first line.
- index_col: Designates one or more columns from the CSV file as the DataFrame's index upon loading. For instance, index_col='ID' uses the column named 'ID' as the index.
- usecols: If you only need a subset of columns, provide a list of column names or indices, like usecols=['Name', 'Age', 'City']. This can save memory and loading time for wide datasets.
- dtype: Specifies the data type for particular columns (or all of them) during loading via a dictionary, e.g., dtype={'CustomerID': str, 'Amount': float}. This can prevent incorrect type inference (like reading numerical IDs as integers when they should be strings).
- nrows: Useful for very large files. Set nrows=1000 to read only the first 1000 rows (excluding the header). This is helpful for quick inspection before loading the entire dataset. A sketch combining usecols, dtype, and nrows follows the example below.
- encoding: Specifies the file encoding if it's not the default (often UTF-8). Common alternatives include 'latin1', 'iso-8859-1', or 'cp1252', especially with older files or those from different operating systems. If you encounter a UnicodeDecodeError, try specifying the encoding.

# Example: Loading a semicolon-separated file with no header, using specific columns
try:
    columns_to_load = ['Timestamp', 'SensorValue']
    df_sensor = pd.read_csv(
        'sensor_data.log',
        sep=';',
        header=None,
        names=columns_to_load,  # Provide names if header=None
        usecols=[0, 2],         # Load only the 1st and 3rd columns
        encoding='latin1'
    )
    print("Sensor log loaded successfully.")
except FileNotFoundError:
    print("Error: sensor_data.log not found.")
except Exception as e:
    print(f"An error occurred: {e}")
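As promised above, here is a minimal sketch combining usecols, dtype, and nrows to preview a large file. The file name 'transactions.csv' and its column names are hypothetical, chosen purely for illustration:

import pandas as pd

# Hypothetical file and column names, used only to illustrate the parameters
try:
    df_preview = pd.read_csv(
        'transactions.csv',
        usecols=['CustomerID', 'Amount'],  # Load only the columns we need
        dtype={'CustomerID': str},         # Keep IDs as strings (preserves leading zeros)
        nrows=1000                         # Peek at the first 1000 rows only
    )
    print(df_preview.dtypes)
except FileNotFoundError:
    print("Error: transactions.csv not found.")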
Excel files (.xls or .xlsx) are also frequently used for storing data, often containing multiple sheets. Pandas uses the read_excel() function for this. Note that read_excel() requires an additional library: openpyxl for .xlsx files or xlrd for older .xls files. If you haven't installed them, you can do so via pip:

pip install openpyxl xlrd

The basic usage is similar to read_csv():
import pandas as pd

try:
    # Load the first sheet by default
    df_excel = pd.read_excel('spreadsheet.xlsx')
    print("Excel file loaded successfully (first sheet).")
except FileNotFoundError:
    print("Error: spreadsheet.xlsx not found.")
except Exception as e:
    print(f"An error occurred: {e}")
Key parameters for read_excel() include:

- io: The file path, URL, or file-like object.
- sheet_name: This is important for Excel files. You can specify:
  - a sheet name as a string (e.g., sheet_name='Sales Data'),
  - a sheet position as an integer (e.g., sheet_name=0 for the first sheet),
  - a list to load multiple sheets at once (e.g., sheet_name=['Orders', 'Customers']), or
  - None to load all sheets into a dictionary of DataFrames.
- header, index_col, usecols, dtype, nrows: These function similarly to their read_csv() counterparts.

# Example: Loading a specific sheet ('Inventory') and specifying the header row
try:
    df_inventory = pd.read_excel(
        'company_data.xlsx',
        sheet_name='Inventory',
        header=2  # Header is on the 3rd row (index 2)
    )
    print("Inventory sheet loaded successfully.")

    # Example: Loading all sheets
    all_sheets = pd.read_excel('company_data.xlsx', sheet_name=None)
    print(f"Loaded sheets: {list(all_sheets.keys())}")
    # all_sheets is now a dictionary where keys are sheet names
    # and values are the corresponding DataFrames.
    # df_orders = all_sheets['Orders']
except FileNotFoundError:
    print("Error: company_data.xlsx not found.")
except ImportError:
    print("Error: Please install 'openpyxl' or 'xlrd' to read Excel files.")
    print("Run: pip install openpyxl xlrd")
except Exception as e:
    print(f"An error occurred: {e}")
JSON (JavaScript Object Notation) is a lightweight data-interchange format common in web APIs and configuration files. Its structure can be more complex than flat CSV or Excel files, often involving nested objects and arrays. Pandas provides read_json() to handle this format.

The effectiveness of read_json() depends heavily on the structure of the JSON file. A common challenge is that JSON doesn't enforce a single tabular structure.
import pandas as pd

# Example assuming a simple JSON structure (a list of records):
# [{'col1': 'a', 'col2': 1}, {'col1': 'b', 'col2': 2}]
try:
    df_json = pd.read_json('data.json')
    print("JSON file loaded successfully.")
except FileNotFoundError:
    print("Error: data.json not found.")
except ValueError as ve:
    # This often happens if the JSON structure is not directly table-like
    print(f"ValueError loading JSON: {ve}")
    print("Check the JSON structure and consider the 'orient' parameter.")
except Exception as e:
    print(f"An error occurred: {e}")
The most significant parameter for read_json() is orient, which tells Pandas how to interpret the JSON structure:

- 'records': The JSON is an array of objects, e.g., [{'col A': 1, 'col B': 0.5}, {'col A': 2, 'col B': 0.7}]. Each object becomes a row. This is the orientation used when lines=True.
- 'columns': The JSON is a dictionary where keys are column names and values are dictionaries mapping index labels to cell values, e.g., {'col A': {'row1': 1, 'row2': 2}, 'col B': {'row1': 0.5, 'row2': 0.7}}. This is the documented default when reading into a DataFrame.
- 'index': The outer keys are index labels, and the inner dictionaries map column names to cell values, e.g., {'row1': {'col A': 1, 'col B': 0.5}, 'row2': {'col A': 2, 'col B': 0.7}}.
- 'split': The JSON is a dictionary with keys 'index', 'columns', and 'data', e.g., {'columns': ['col A', 'col B'], 'index': ['row1', 'row2'], 'data': [[1, 0.5], [2, 0.7]]}.
- 'values': The JSON is just an array of arrays containing the data values, e.g., [[1, 0.5], [2, 0.7]]. Column names will be numerical (0, 1, ...).

You often need to inspect the JSON file's structure first to determine the correct orient value.
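If you're unsure which orientation a file uses, a quick look with Python's built-in json module can settle it. A minimal sketch ('data.json' is just a placeholder file name):

import json

# Peek at the top-level structure to help choose an 'orient' value
try:
    with open('data.json', encoding='utf-8') as f:
        raw = json.load(f)
    if isinstance(raw, list):
        print("Top level is a list -> likely orient='records' (or 'values').")
    elif isinstance(raw, dict):
        print(f"Top level is a dict with keys: {list(raw.keys())[:5]}")
except FileNotFoundError:
    print("Error: data.json not found.")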
# Example: Loading JSON with 'columns' orientation:
# {"col1": {"0": "a", "1": "b"}, "col2": {"0": 1, "1": 2}}
try:
    df_json_cols = pd.read_json('data_columns.json', orient='columns')
    print("JSON ('columns' orient) loaded successfully.")
except FileNotFoundError:
    print("Error: data_columns.json not found.")
except Exception as e:
    print(f"An error occurred: {e}")
# Example: Loading line-delimited JSON (each line is a valid JSON object):
# {"col1": "a", "col2": 1}
# {"col1": "b", "col2": 2}
try:
    # For line-delimited JSON, use lines=True (each line is parsed as a record)
    df_json_lines = pd.read_json('data_lines.jsonl', lines=True)
    print("Line-delimited JSON loaded successfully.")
except FileNotFoundError:
    print("Error: data_lines.jsonl not found.")
except Exception as e:
    print(f"An error occurred: {e}")
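A handy way to build intuition for the orient values is to round-trip a small DataFrame through to_json() and read_json(). A minimal, self-contained sketch (no external file needed):

import pandas as pd
from io import StringIO

df = pd.DataFrame({'col A': [1, 2], 'col B': [0.5, 0.7]}, index=['row1', 'row2'])

# Serialize and re-load with matching orientations to see each format
for orient in ['records', 'columns', 'index', 'split', 'values']:
    payload = df.to_json(orient=orient)
    print(f"{orient}: {payload}")
    # Wrap the string in StringIO, since read_json expects a path or file-like object
    df_back = pd.read_json(StringIO(payload), orient=orient)
    # Note: 'records' and 'values' do not preserve the index labels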
For deeply nested JSON data, read_json() might produce columns containing dictionaries or lists. While Pandas can handle these, you might need more advanced techniques like pandas.json_normalize() or custom parsing logic to flatten the structure into a standard tabular format suitable for analysis. These techniques are beyond this initial loading section but are useful tools to be aware of.
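As a brief preview, here is a minimal sketch of pandas.json_normalize() flattening a small nested structure; the records are invented for illustration:

import pandas as pd

# Invented nested records, the kind a web API might return
records = [
    {'id': 1, 'user': {'name': 'Ada', 'city': 'London'}},
    {'id': 2, 'user': {'name': 'Grace', 'city': 'Arlington'}},
]

# Nested keys become dotted column names: 'user.name', 'user.city'
df_flat = pd.json_normalize(records)
print(df_flat.columns.tolist())  # ['id', 'user.name', 'user.city']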
A few practical points apply to every loader covered here. Relative file paths (like 'data.csv') are resolved against the current working directory of your script or notebook. Absolute paths (like '/Users/you/Documents/data.csv' or 'C:\\Users\\you\\Documents\\data.csv') are also valid. Using libraries like os or pathlib can help manage paths reliably across different operating systems, as the sketch below illustrates.
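A minimal sketch using pathlib; the directory layout and file name are placeholders:

from pathlib import Path

import pandas as pd

# Resolve the path relative to this script's location rather than the current
# working directory, so it works no matter where the script is launched from.
# In a notebook, where __file__ is undefined, use Path.cwd() instead.
data_path = Path(__file__).parent / 'data' / 'data.csv'

if data_path.exists():
    df = pd.read_csv(data_path)
else:
    print(f"File not found: {data_path}")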
Also, wrap your loading calls in try...except blocks to gracefully handle FileNotFoundError and other issues such as incorrect formatting or insufficient permissions, as the examples throughout this section do.

Successfully loading your data is the critical first step. Once the data is in a Pandas DataFrame, you can proceed to the next stage: inspecting its basic properties, structure, and types, which is the focus of the upcoming sections.