While CSV and Excel files are extremely common for storing tabular data, Pandas offers flexibility to read data from various other sources. Let's look at two important examples: JSON files and SQL databases.
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It uses human-readable text to transmit data objects consisting of attribute-value pairs and array data types. It's frequently used in web applications and APIs.
Imagine you have data stored in a JSON file named data.json. The structure of the JSON file significantly influences how Pandas reads it. A common structure is a list of records (dictionaries), like this:
[
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 24, "city": "Los Angeles"},
    {"name": "Charlie", "age": 35, "city": "Chicago"}
]
To read this into a DataFrame, you can use the pd.read_json() function:
import pandas as pd
from io import StringIO  # used below to read JSON from an in-memory string

# Assuming 'data.json' contains the JSON structure shown above
try:
    df_json = pd.read_json('data.json')
    print(df_json)
except FileNotFoundError:
    print("Error: data.json not found. Please create it with the example content.")

# Example creating a DataFrame from a JSON string directly.
# Wrapping the string in StringIO keeps newer Pandas versions from warning
# about passing a literal JSON string to read_json.
json_string = """
[
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 24, "city": "Los Angeles"},
    {"name": "Charlie", "age": 35, "city": "Chicago"}
]
"""
df_from_string = pd.read_json(StringIO(json_string))
# print(df_from_string)  # This will produce the same output as reading from the file
Output:
name age city
0 Alice 30 New York
1 Bob 24 Los Angeles
2 Charlie 35 Chicago
pd.read_json() has parameters to handle different JSON structures. For instance, if your JSON is structured with keys as indices or columns, you might need to specify the orient parameter (e.g., orient='index', orient='columns', orient='records'). The default orient=None tries to infer the structure, which often works well for the common list-of-records format shown above.
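As a small, illustrative sketch (the JSON string and variable names below are made up for this example), orient='index' treats the outer keys as row labels:

from io import StringIO
import pandas as pd

# JSON keyed by row labels rather than stored as a list of records
indexed_json = """
{
    "r1": {"name": "Alice", "age": 30},
    "r2": {"name": "Bob", "age": 24}
}
"""

# orient='index' turns the outer keys ("r1", "r2") into the DataFrame index
df_indexed = pd.read_json(StringIO(indexed_json), orient='index')
print(df_indexed)
# With orient='columns', the same outer keys would become column labels instead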
Pandas can interact directly with SQL databases, allowing you to execute queries and load the results into a DataFrame. This requires an additional library called SQLAlchemy
to handle the database connection details across different SQL database types (like PostgreSQL, MySQL, SQLite, etc.).
First, you typically establish a connection engine using SQLAlchemy:
# Note: You'll need to install SQLAlchemy: pip install sqlalchemy
# You might also need a database-specific driver, e.g., psycopg2 for PostgreSQL
import pandas as pd

# Example for connecting to a SQLite database file named 'my_database.db'
# Replace with your actual database connection string
# Format: 'dialect+driver://username:password@host:port/database'
try:
    # Import inside the try block so the ImportError handler below can catch a missing SQLAlchemy
    from sqlalchemy import create_engine, text

    # Create an engine object. SQLite connects to a file.
    # Other databases require different connection strings.
    engine = create_engine('sqlite:///my_database.db')

    # For demonstration, create a dummy 'users' table with a little data.
    # engine.begin() opens a transaction and commits it when the block exits.
    with engine.begin() as connection:
        connection.execute(text("DROP TABLE IF EXISTS users"))  # Start fresh for the example
        connection.execute(text("CREATE TABLE users (name TEXT, age INTEGER, city TEXT)"))
        connection.execute(text("INSERT INTO users (name, age, city) VALUES ('David', 42, 'Boston')"))
        connection.execute(text("INSERT INTO users (name, age, city) VALUES ('Eve', 29, 'Miami')"))

    # Now, read data using Pandas
    # Option 1: Read an entire table
    df_sql_table = pd.read_sql_table('users', con=engine)
    print("--- Reading entire table ---")
    print(df_sql_table)

    # Option 2: Execute a specific SQL query
    query = "SELECT name, city FROM users WHERE age > 30"
    df_sql_query = pd.read_sql_query(query, con=engine)
    print("\n--- Reading specific query results ---")
    print(df_sql_query)

except ImportError:
    print("SQLAlchemy is required but not installed. Run: pip install sqlalchemy")
except Exception as e:
    # Catch other potential errors like database not found or connection issues
    print(f"An error occurred: {e}")
    print("Ensure 'my_database.db' exists or adjust the connection string.")
Output (assuming successful connection and table creation):
--- Reading entire table ---
name age city
0 David 42 Boston
1 Eve 29 Miami
--- Reading specific query results ---
name city
0 David Boston
In these examples:
- create_engine from SQLAlchemy creates an engine object representing the database connection. The connection string ('sqlite:///my_database.db') tells SQLAlchemy how to connect (using the SQLite dialect, connecting to the file my_database.db). For other databases like PostgreSQL or MySQL, the string would look different and include hostname, username, password, etc.
- pd.read_sql_table('users', con=engine) reads the entire table named users into a DataFrame.
- pd.read_sql_query(query, con=engine) executes the specified SQL query against the database via the engine and returns the results as a DataFrame.

Reading from SQL is powerful because it allows you to leverage the database's querying capabilities to filter, aggregate, and select data before loading it into memory as a Pandas DataFrame, which can be very efficient for large datasets. Remember that setting up the database connection (engine) is a prerequisite and depends on the specific database system you are using.
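For instance, here is a minimal sketch of pushing an aggregation down to the database, assuming the engine and the users table from the example above are still available:

# Let the database do the grouping; only the aggregated rows are loaded into memory
agg_query = """
    SELECT city, COUNT(*) AS user_count, AVG(age) AS avg_age
    FROM users
    GROUP BY city
"""
df_agg = pd.read_sql_query(agg_query, con=engine)
print(df_agg)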
Pandas includes functions for other formats as well, though perhaps less common than CSV, Excel, JSON, or SQL:
- pd.read_html(): Reads tables directly from HTML web pages. Useful for web scraping (a short sketch follows this list).
- pd.read_fwf(): Reads data from fixed-width formatted lines.
- pd.read_clipboard(): Reads text from the system clipboard and attempts to parse it into a DataFrame (often surprisingly useful for quick data grabs).
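As a minimal sketch of pd.read_html() (the HTML snippet below is made up for this example, and an HTML parser such as lxml or html5lib must be installed):

from io import StringIO
import pandas as pd

# A tiny, made-up HTML snippet containing one table
html = """
<table>
    <tr><th>name</th><th>city</th></tr>
    <tr><td>Alice</td><td>New York</td></tr>
    <tr><td>Bob</td><td>Los Angeles</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found
tables = pd.read_html(StringIO(html))
print(tables[0])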
While we focused on JSON and SQL here, knowing that Pandas provides a consistent interface (pd.read_...) for many data types makes it a versatile tool for data loading. The specific parameters for each function vary depending on the format's characteristics. Always check the Pandas documentation for the specific reader function you intend to use.