DataFrames are one of the core data structures in Pandas and are essential for working with structured data. Think of a DataFrame as a table akin to a spreadsheet or SQL table, but designed for more flexible and powerful data operations. Each column in a DataFrame can be of a different data type, making it a versatile tool for data manipulation.
To begin working with DataFrames, you can create them using various methods. One of the simplest ways is by utilizing a Python dictionary. Each key in the dictionary represents a column name, and the corresponding value is a list of column values.
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Occupation': ['Engineer', 'Doctor', 'Artist']
}
df = pd.DataFrame(data)
print(df)
This will output:
      Name  Age Occupation
0    Alice   25   Engineer
1      Bob   30     Doctor
2  Charlie   35     Artist
You can also create a DataFrame from a list of dictionaries, where each dictionary represents a row of data.
data = [
    {'Name': 'Alice', 'Age': 25, 'Occupation': 'Engineer'},
    {'Name': 'Bob', 'Age': 30, 'Occupation': 'Doctor'},
    {'Name': 'Charlie', 'Age': 35, 'Occupation': 'Artist'}
]
df = pd.DataFrame(data)
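As a small sketch of how the row-oriented form behaves when the records are uneven: a key missing from one dictionary shows up as a missing value (NaN) in that row. The records and the name `df_rows` below are made up for illustration.

```python
import pandas as pd

# Hypothetical records; the second one has no 'Occupation' key.
records = [
    {'Name': 'Alice', 'Age': 25, 'Occupation': 'Engineer'},
    {'Name': 'Bob', 'Age': 30},
]

df_rows = pd.DataFrame(records)

# The columns are the union of all keys, in order of first appearance.
print(df_rows.columns.tolist())               # ['Name', 'Age', 'Occupation']

# The value Bob's record did not supply is marked as missing.
print(df_rows['Occupation'].isna().tolist())  # [False, True]
```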
In practical scenarios, data is often stored in files. Pandas provides functions to easily load data from CSV, Excel, SQL databases, and more. Here's how to load a CSV file:
df = pd.read_csv('data.csv')
This command reads the content of data.csv into a DataFrame called df. Pandas automatically infers the data types of each column, making it straightforward to start analyzing the data immediately.
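To see the type inference without needing a file on disk, the sketch below feeds read_csv an in-memory string through io.StringIO. The CSV text and the name `df_csv` are invented sample data standing in for a real file.

```python
import io
import pandas as pd

# Stand-in for the contents of a file such as data.csv (made-up sample data).
csv_text = """Name,Age,Salary
Alice,25,70000.5
Bob,30,80000.0
"""

# read_csv accepts a file path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

# Inspect the type pandas inferred for each column:
# whole numbers become an integer type, decimals a float type.
print(df_csv.dtypes)
```

If an inferred type is not the one you want, read_csv also takes a dtype argument that sets column types explicitly.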
Once you have a DataFrame, you can explore its structure and content. Use the head() method to view the first few rows:
print(df.head())
The info() method provides a concise summary of the DataFrame:
df.info()
This will display the number of entries, column names, data types, and memory usage, helping you understand the dataset at a glance.
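A few related inspection tools are worth knowing alongside these two. The following is a minimal sketch on a small made-up DataFrame (`df_demo` is a name chosen for this example).

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

print(df_demo.head(2))            # first two rows; head() defaults to five
print(df_demo.tail(1))            # last row
print(df_demo.shape)              # (3, 2): rows, then columns
print(df_demo.columns.tolist())   # ['Name', 'Age']
```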
Selecting specific data from a DataFrame is a common task. You can select columns by passing their names as strings:
ages = df['Age']
To select multiple columns, pass a list of column names:
subset = df[['Name', 'Occupation']]
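The two forms return different kinds of objects, which matters for what you can do next. A short sketch, using an illustrative DataFrame named `df_sel`:

```python
import pandas as pd

df_sel = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
    'Occupation': ['Engineer', 'Doctor'],
})

ages = df_sel['Age']           # a single name returns a Series
ages_frame = df_sel[['Age']]   # a list of names returns a DataFrame, even with one name

print(type(ages).__name__)        # Series
print(type(ages_frame).__name__)  # DataFrame
print(ages_frame.shape)           # (2, 1)
```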
Rows can be selected using the loc[] and iloc[] indexers. Use loc[] for label-based indexing and iloc[] for positional indexing:
# Select rows by label
row = df.loc[0]
# Select rows by position
row = df.iloc[0]
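With the default index (0, 1, 2, ...), labels and positions coincide, so both calls return the same row. The difference becomes visible once the index holds other labels. The sketch below assumes a hypothetical index of 10, 20, 30.

```python
import pandas as pd

df_idx = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=[10, 20, 30],  # hypothetical row labels
)

by_label = df_idx.loc[20]      # the row whose label is 20
by_position = df_idx.iloc[0]   # the first row, whatever its label

print(by_label['Name'])        # Bob
print(by_position['Name'])     # Alice

# Slices differ too: loc includes the end label, iloc excludes the end position.
label_slice = df_idx.loc[10:20]     # rows labeled 10 and 20
position_slice = df_idx.iloc[0:1]   # only the first row
print(len(label_slice), len(position_slice))  # 2 1
```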
You can easily add, modify, or delete columns in a DataFrame. To add a new column, simply assign the data to a new column name:
df['Salary'] = [70000, 80000, 75000]
To modify an existing column, assign new values to it:
df['Age'] = df['Age'] + 1
Deleting a column is just as straightforward:
df.drop('Salary', axis=1, inplace=True)
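Putting the three operations together, here is a minimal sketch on a fresh three-row DataFrame. The column `MonthlySalary` and the names `df_cols` and `trimmed` are made up for the example. It also shows an alternative to inplace=True: drop() returns a new DataFrame by default, leaving the original untouched.

```python
import pandas as pd

df_cols = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

# Add: a list must supply exactly one value per row.
df_cols['Salary'] = [70000, 80000, 75000]

# Derive a new column from an existing one (element-wise arithmetic).
df_cols['MonthlySalary'] = df_cols['Salary'] / 12

# Delete: without inplace=True, drop() returns a modified copy.
trimmed = df_cols.drop(columns=['Salary', 'MonthlySalary'])

print(df_cols.columns.tolist())  # ['Name', 'Age', 'Salary', 'MonthlySalary']
print(trimmed.columns.tolist())  # ['Name', 'Age']
```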
DataFrames provide several methods to quickly perform summary statistics:
print(df.describe())
The describe() method returns summary statistics for numerical columns: count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum. For more specific operations, methods like mean(), sum(), and count() are available:
average_age = df['Age'].mean()
total_entries = df['Age'].count()
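One detail worth keeping in mind is how these methods treat missing values. The sketch below uses a made-up `Age` column with one missing entry: count() reports only non-missing values, and mean() and sum() skip missing values by default.

```python
import pandas as pd

# Made-up data with one missing age.
df_stats = pd.DataFrame({'Age': [25, 30, None, 35]})

print(len(df_stats))             # 4 rows in total
print(df_stats['Age'].count())   # 3 non-missing values
print(df_stats['Age'].mean())    # 30.0, computed over the 3 present values
print(df_stats['Age'].sum())     # 90.0
```

So count() on a column is not always the number of rows; use len(df) or df.shape[0] when you need the row count.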
DataFrames are powerful yet intuitive structures that simplify the process of data analysis. By mastering DataFrames, you gain a versatile tool for handling data in a variety of forms, paving the way for efficient data manipulation and preparation. As you grow more familiar with these operations, you'll be able to tackle more complex data science tasks and analyses with confidence.
© 2025 ApX Machine Learning