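Hands-On Practice: Feature Creation and Selection

This hands-on exercise applies techniques for turning raw data into more informative features and for selecting the most influential ones, using Python libraries such as Pandas and Scikit-learn. The goal is not just to run the code, but to understand why each transformation and selection method is applied.

We will use a dataset containing customer information and a label indicating whether each customer purchased a particular product. We assume the initial data loading and basic cleaning steps covered in Chapter 1 are already done.

Setup and Data Preparation

First, let's import the necessary libraries and create a sample DataFrame. In a real project, you would load the data with pd.read_csv or a similar function.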

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (
    KBinsDiscretizer, PolynomialFeatures, StandardScaler, OneHotEncoder, OrdinalEncoder
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier  # used for feature importances
import matplotlib.pyplot as plt
import seaborn as sns

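# Sample data (replace with your actual data loading)
np.random.seed(42)  # fix the seed so the random sample data is reproducible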
data = {
    'CustomerID': range(1, 101),
    'Age': np.random.randint(18, 70, 100),
    'Income': np.random.normal(50000, 15000, 100).clip(10000),
    'AccountBalance': np.random.normal(10000, 5000, 100).clip(0),
    'NumTransactions': np.random.randint(0, 50, 100),
    'EducationLevel': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], 100, p=[0.3, 0.4, 0.2, 0.1]),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'LastFeedback': np.random.choice(['Positive experience', 'Neutral', 'Issue resolved', 'Complaint filed', 'No feedback'], 100, p=[0.3, 0.2, 0.2, 0.1, 0.2]),
    'Purchased': np.random.randint(0, 2, 100)  # target variable
}
df = pd.DataFrame(data)

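# Separate features (X) and target (y)
X = df.drop(['CustomerID', 'Purchased'], axis=1)
y = df['Purchased']

# Split the data for a realistic evaluation (important for steps such as target encoding).
# We fit each transformation on the training set, then apply the *same*
# fitted transformation to the test set. Keep this train/test discipline
# in mind whenever you build a real model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# Work on explicit copies so that adding columns later does not trigger
# pandas' SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()

print("Original training data shape:", X_train.shape)
print(X_train.head())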

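Generating Features from Numerical Data

Let's apply some of the techniques we have learned to the numerical columns: Age, Income, AccountBalance, and NumTransactions.

1. Binning Numerical Data

Binning can help capture non-linear effects or group a continuous variable into meaningful categories. Let's bin Age.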

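# Bin Age into 4 quantile-based bins
# subsample=None uses all rows, which gives exact quantiles on small data
binner = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile', subsample=None)

# Fit on the training data and transform it
# (.ravel() flattens the (n, 1) output into a 1-D column)
X_train['Age_Binned'] = binner.fit_transform(X_train[['Age']]).ravel()
# Transform the test data with the *already fitted* binner
X_test['Age_Binned'] = binner.transform(X_test[['Age']]).ravel()

print("\nBinned Age (training data):")
print(X_train[['Age', 'Age_Binned']].head())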

Here we used quantile-based binning, which creates bins with roughly equal numbers of samples. encode='ordinal' assigns each bin a numeric label (0, 1, 2, 3).

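2. Polynomial Features

Generating polynomial features can help a model capture interaction effects and non-linear relationships. Let's create degree-2 terms from Income and NumTransactions, including their interaction term.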

poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)

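# Select the columns used for the polynomial features
X_train_poly_subset = X_train[['Income', 'NumTransactions']]
X_test_poly_subset = X_test[['Income', 'NumTransactions']]

# Fit on the training data and transform it
poly_features_train = poly.fit_transform(X_train_poly_subset)
# Transform the test data
poly_features_test = poly.transform(X_test_poly_subset)

# Get the feature names
poly_feature_names = poly.get_feature_names_out(['Income', 'NumTransactions'])

# Create DataFrames for the new features
poly_df_train = pd.DataFrame(poly_features_train, columns=poly_feature_names, index=X_train.index)
poly_df_test = pd.DataFrame(poly_features_test, columns=poly_feature_names, index=X_test.index)

# Add only the newly created columns back to the main DataFrames. The degree-2
# output also repeats the original 'Income' and 'NumTransactions' columns, and
# concatenating those again would create duplicate column names.
new_poly_cols = [c for c in poly_feature_names if c not in X_train.columns]
X_train = pd.concat([X_train, poly_df_train[new_poly_cols]], axis=1)
X_test = pd.concat([X_test, poly_df_test[new_poly_cols]], axis=1)

print("\nPolynomial features added (training data snippet):")
print(X_train[poly_feature_names].head())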

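This creates features such as Income^2 and NumTransactions^2, plus the interaction term, which scikit-learn names "Income NumTransactions" (the product of the two columns). Note that this can increase the number of features significantly, especially at higher degrees.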

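3. Scaling

Many algorithms (for example, distance-based models and models trained with gradient descent) perform better when numerical features are on similar scales. Let's apply StandardScaler.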

numerical_cols = ['Income', 'AccountBalance', 'NumTransactions']  # Age is excluded because we already binned it

scaler = StandardScaler()

# Fit on the training data only
scaler.fit(X_train[numerical_cols])

# Transform both the training set and the test set
X_train[numerical_cols] = scaler.transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

print("\nScaled numerical features (training data snippet):")
print(X_train[numerical_cols].head())
print("\nMean after scaling (should be close to 0):")
print(X_train[numerical_cols].mean())
print("\nStandard deviation after scaling (should be close to 1):")
print(X_train[numerical_cols].std())

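Remember: fit the scaler on the training data only, so that no information from the test set leaks into the scaling parameters (the mean and standard deviation).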
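
The sketch below illustrates the fit-on-train-only rule with made-up income values. It also shows why the standard deviation printed above is close to, but not exactly, 1: StandardScaler normalizes by the population standard deviation (ddof=0), while the pandas .std() method defaults to the sample standard deviation (ddof=1).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up incomes, for illustration only
train = rng.normal(50000, 15000, size=(70, 1))
test = rng.normal(50000, 15000, size=(30, 1))

scaler_demo = StandardScaler().fit(train)    # parameters come from train only
train_scaled = scaler_demo.transform(train)
test_scaled = scaler_demo.transform(test)    # reuses the train mean and std

print("Learned mean:", scaler_demo.mean_[0], "learned std:", scaler_demo.scale_[0])
print("Train std (ddof=0):", train_scaled.std(ddof=0))  # what the scaler normalizes to 1
print("Train std (ddof=1):", train_scaled.std(ddof=1))  # sample std, the pandas default
print("Test mean after scaling:", test_scaled.mean())   # generally not exactly 0
```

The test set is shifted and scaled with the training statistics, so its mean and standard deviation will usually deviate a little from 0 and 1. That is expected and is not a sign of a bug.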

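Encoding Categorical Variables

Now let's handle the categorical columns: EducationLevel, Region, and LastFeedback.

1. One-Hot Encoding

Region is a nominal variable (it has no inherent order), so one-hot encoding is appropriate.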

# handle_unknown='ignore' is safer when the test set contains categories not seen during fitting.
# sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False instead).
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Select the column
X_train_region = X_train[['Region']]
X_test_region = X_test[['Region']]

# Fit on the training data and transform it
ohe_features_train = ohe.fit_transform(X_train_region)
# Transform the test data
ohe_features_test = ohe.transform(X_test_region)

# Get the feature names
ohe_feature_names = ohe.get_feature_names_out(['Region'])

# Create DataFrames
ohe_df_train = pd.DataFrame(ohe_features_train, columns=ohe_feature_names, index=X_train.index)
ohe_df_test = pd.DataFrame(ohe_features_test, columns=ohe_feature_names, index=X_test.index)

# Add the encoded columns back and drop the original 'Region' column
X_train = pd.concat([X_train.drop('Region', axis=1), ohe_df_train], axis=1)
X_test = pd.concat([X_test.drop('Region', axis=1), ohe_df_test], axis=1)

print("\nOne-hot encoded Region (training data snippet):")
print(X_train[ohe_feature_names].head())

2. Ordinal Encoding

EducationLevel has a clear order, so we can use ordinal encoding.

# Define the order explicitly
education_order = ['High School', 'Bachelor', 'Master', 'PhD']

ordinal_encoder = OrdinalEncoder(categories=[education_order])  # pass the order

# Fit on the training data and transform it
# (.ravel() flattens the (n, 1) output into a 1-D column)
X_train['EducationLevel_Encoded'] = ordinal_encoder.fit_transform(X_train[['EducationLevel']]).ravel()
# Transform the test data
X_test['EducationLevel_Encoded'] = ordinal_encoder.transform(X_test[['EducationLevel']]).ravel()

# Drop the original column
X_train = X_train.drop('EducationLevel', axis=1)
X_test = X_test.drop('EducationLevel', axis=1)

print("\nOrdinal-encoded EducationLevel (training data):")
print(X_train[['EducationLevel_Encoded']].head())
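
Passing the order explicitly matters. As a small sketch, the comparison below encodes the four education levels once with the default settings and once with an explicit category list. By default OrdinalEncoder sorts the categories, which for strings means alphabetical order rather than the real-world ranking.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

levels = np.array([['High School'], ['Bachelor'], ['Master'], ['PhD']])

# Default: categories are sorted, so strings end up in alphabetical order
default_codes = OrdinalEncoder().fit_transform(levels).ravel()

# Explicit: codes follow the order we pass in
explicit_codes = OrdinalEncoder(
    categories=[['High School', 'Bachelor', 'Master', 'PhD']]
).fit_transform(levels).ravel()

for level, d, e in zip(levels.ravel(), default_codes, explicit_codes):
    print(f"{level:12s} default={int(d)} explicit={int(e)}")
```

With the default settings, Bachelor is coded 0 and High School 1, which reverses their true order. The explicit list restores High School < Bachelor < Master < PhD.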

Creating Features from Text Data

The LastFeedback column contains short text. Let's convert it into numerical features using TF-IDF.

tfidf_vectorizer = TfidfVectorizer(max_features=5)  # limit the number of features for simplicity

# Fit on the training data and transform it
tfidf_features_train = tfidf_vectorizer.fit_transform(X_train['LastFeedback'])
# Transform the test data
tfidf_features_test = tfidf_vectorizer.transform(X_test['LastFeedback'])

# Get the feature names
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_feature_names = [f"feedback_{name}" for name in tfidf_feature_names]  # add a prefix

# Create DataFrames (TF-IDF returns a sparse matrix by default, so convert it to dense)
tfidf_df_train = pd.DataFrame(tfidf_features_train.toarray(), columns=tfidf_feature_names, index=X_train.index)
tfidf_df_test = pd.DataFrame(tfidf_features_test.toarray(), columns=tfidf_feature_names, index=X_test.index)

# Add the TF-IDF columns back and drop the original 'LastFeedback' column
X_train = pd.concat([X_train.drop('LastFeedback', axis=1), tfidf_df_train], axis=1)
X_test = pd.concat([X_test.drop('LastFeedback', axis=1), tfidf_df_test], axis=1)

print("\nTF-IDF features from LastFeedback (training data snippet):")
print(X_train[tfidf_feature_names].head())

In this example we limited max_features to 5. In practice, you would likely allow more features or use techniques such as n-grams.

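Feature Selection

We now have a larger feature set. Let's select the most relevant features.

1. Filter Method: SelectKBest

We can use a statistical test to score the features and keep the top k. Since our target Purchased is binary, f_classif (the ANOVA F-value) is a suitable score function for numerical inputs.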

# Make sure all data is numeric and handle any NaNs that may have been introduced
X_train_numeric = X_train.select_dtypes(include=np.number).fillna(0)  # simple fill, for demonstration only
X_test_numeric = X_test.select_dtypes(include=np.number).fillna(0)

# Make sure the columns match after any drops or additions.
# A list comprehension keeps the column order stable; a set intersection would not.
common_cols = [c for c in X_train_numeric.columns if c in X_test_numeric.columns]
X_train_numeric = X_train_numeric[common_cols]
X_test_numeric = X_test_numeric[common_cols]

k_best = 10  # select the top 10 features
selector_kbest = SelectKBest(score_func=f_classif, k=k_best)

# Fit on the training data
selector_kbest.fit(X_train_numeric, y_train)

# Get the names of the selected features
selected_features_mask = selector_kbest.get_support()
selected_features_kbest = X_train_numeric.columns[selected_features_mask]

print(f"\nTop {k_best} features selected by SelectKBest:")
print(selected_features_kbest.tolist())

# You can then filter your DataFrames:
# X_train_kbest = X_train_numeric[selected_features_kbest]
# X_test_kbest = X_test_numeric[selected_features_kbest]
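
To see what the selector bases its decision on, the sketch below builds a small synthetic dataset (made up for illustration) with one column that shifts with the class and two pure-noise columns, then prints the F-score and p-value that f_classif assigns to each.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 100)
X_demo = pd.DataFrame({
    'informative': y_demo * 2.0 + rng.normal(0, 1, 200),  # mean shifts with the class
    'noise_a': rng.normal(0, 1, 200),                      # unrelated to the class
    'noise_b': rng.normal(0, 1, 200),
})

selector_demo = SelectKBest(score_func=f_classif, k=1).fit(X_demo, y_demo)

summary = pd.DataFrame({
    'F_score': selector_demo.scores_,
    'p_value': selector_demo.pvalues_,
    'selected': selector_demo.get_support(),
}, index=X_demo.columns)
print(summary.sort_values('F_score', ascending=False))
```

Inspecting scores_ and pvalues_ alongside the selection mask is a useful habit. If the top k scores are all small and the p-values large, the selector is ranking noise, and k should probably be reduced.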

2. Embedded Method: Feature Importance (Random Forest)

Tree-based models compute feature importances as a by-product of training.

# Use a simple random forest to obtain importances
rf = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
rf.fit(X_train_numeric, y_train)

importances = pd.Series(rf.feature_importances_, index=X_train_numeric.columns)
importances_sorted = importances.sort_values(ascending=False)

print("\nRandom forest feature importances:")
print(importances_sorted.head(10))  # show the top 10

# Plot the feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x=importances_sorted.head(10), y=importances_sorted.head(10).index, palette="viridis")
plt.title('Top 10 Feature Importances (Random Forest)')
plt.xlabel('Importance score')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

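The bar chart shows the relative importance of the top 10 features, as determined by a random forest classifier trained on the engineered features.

Feature importances usually give a good indication of which engineered features contribute most to the model's predictions. Because the Purchased target in this sample data is randomly generated, the ranking here carries no real signal; on real data it would.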

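Dimensionality Reduction with PCA

Let's apply PCA to the scaled numerical features and the polynomial features to see whether we can reduce the dimensionality while retaining most of the variance.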

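# Select the features for PCA (scaled numerical + polynomial).
# get_feature_names_out returns a NumPy array, so convert it to a list before
# concatenating; dict.fromkeys drops repeated names while preserving order.
pca_cols = list(dict.fromkeys(numerical_cols + list(poly_feature_names)))
# Note: the squared and interaction columns were generated before scaling, so they
# are not standardized. PCA is scale-sensitive; in practice, scale them first.
X_train_pca_subset = X_train[pca_cols].fillna(0)  # make sure there are no NaNs
X_test_pca_subset = X_test[pca_cols].fillna(0)

pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance

# Fit on the training data only
pca.fit(X_train_pca_subset)

# Transform both the training set and the test set
X_train_pca = pca.transform(X_train_pca_subset)
X_test_pca = pca.transform(X_test_pca_subset)

print(f"\nNumber of original features given to PCA: {X_train_pca_subset.shape[1]}")
print(f"Number of PCA components retaining 95% of the variance: {pca.n_components_}")

# Optional: create a DataFrame for the PCA components
pca_comp_names = [f"PCA_{i+1}" for i in range(pca.n_components_)]
pca_df_train = pd.DataFrame(X_train_pca, columns=pca_comp_names, index=X_train.index)
# You can add these back to X_train, possibly replacing the original columns used in PCA

# Plot the cumulative explained variance ratio (x-axis starts at 1 component)
plt.figure(figsize=(8, 5))
plt.plot(range(1, pca.n_components_ + 1), np.cumsum(pca.explained_variance_ratio_), marker='o', linestyle='--')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.title('PCA Explained Variance')
plt.grid(True)
plt.axhline(y=0.95, color='r', linestyle='-', label='95% variance threshold')
plt.legend()
plt.show()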

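The line chart shows the cumulative explained variance as the number of principal components increases. The horizontal line marks the 95% variance threshold.

PCA transforms the selected features into a smaller set of orthogonal components. This can help with visualization and noise reduction, or provide input for models that are sensitive to high dimensionality.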
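
PCA is sensitive to feature scale, which is why scaling normally comes first. The sketch below uses two independent, made-up features on very different scales and compares the explained variance ratios with and without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent made-up features on very different scales
X_demo = np.column_stack([
    rng.normal(50000, 15000, 500),  # income-like
    rng.normal(25, 10, 500),        # transaction-count-like
])

ratio_raw = PCA(n_components=2).fit(X_demo).explained_variance_ratio_
ratio_scaled = PCA(n_components=2).fit(
    StandardScaler().fit_transform(X_demo)
).explained_variance_ratio_

print("Unscaled:", ratio_raw)
print("Scaled:  ", ratio_scaled)
```

Without scaling, the first component simply follows the large-valued column and appears to explain almost all of the variance. After standardization, the two independent features share the variance roughly equally, which is the more honest picture.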

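Final Feature Set

After this process, X_train and X_test contain a mix of original (scaled) features and engineered ones (binned, polynomial, encoded, TF-IDF), and optionally PCA components if you choose to add them. Which features you keep depends on the outcome of your feature selection process (for example, keeping the top k features from SelectKBest, or the features above an importance threshold from the random forest).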

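# Example: assemble the selected features (replace with your actual selection)
# Suppose we keep the features with random forest importance > 0.01
important_features = importances_sorted[importances_sorted > 0.01].index.tolist()

# Keep only the selected important features (plus PCA components, if used).
# Combining them requires careful index-based merging.
X_train_final = X_train_numeric[important_features]  # example using the RF-based selection
X_test_final = X_test_numeric[important_features]

# Alternatively, to use the PCA components instead of the original features:
# X_train_combined = pd.concat([X_train.drop(pca_cols, axis=1), pca_df_train], axis=1)

print("\nFinal training data shape (example selection):", X_train_final.shape)
print(X_train_final.head())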

This refined feature set (X_train_final, X_test_final) is now ready to be fed into the machine learning models we will cover in the next chapter.

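Conclusion

This hands-on exercise put the feature engineering and selection techniques from this chapter into practice. You have seen how to:

Generate new numerical features using binning and polynomial terms.
Encode categorical data appropriately with one-hot and ordinal methods.
Create basic text features using TF-IDF.
Select relevant features using statistical tests and model-based importances.
Reduce dimensionality using PCA.

Remember that feature engineering is usually an iterative process. You will likely try different techniques, evaluate their effect on model performance (using the methods discussed in Chapter 3), and refine your feature set accordingly. The transformations and selections performed here substantially change the data representation, with the aim of giving the predictive model a clearer signal.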
