特征转换是数据准备中的一个重要步骤,通常在解决不正确条目和缺失值等初始数据质量问题之后进行。原始特征通常具有不同的尺度和分布,这可能对许多机器学习算法的性能产生负面影响,特别是那些依赖距离计算(如K-近邻算法或支持向量机)或梯度下降优化的算法。以下将介绍一些常用方法,用于对数据进行缩放、归一化和转换,使其更适合建模。为何转换数据?假设有一个数据集,其中包含“年龄”(范围从20到70)和“收入”(范围从30,000到250,000)等特征。如果你在计算距离的算法中直接使用这些数据,“收入”特征仅仅因为其较大的尺度就会主导计算,可能掩盖“年龄”的影响,即使“年龄”同等或更具信息量。转换特征的目标是:统一尺度: 将特征置于共同尺度,避免数值较大的特征主导数值较小的特征。调整分布: 将偏态分布调整得更对称(通常更接近正态分布),这可以帮助某些模型满足其基本假设并提升性能。稳定方差: 某些转换可以帮助稳定特征值范围内的方差。我们来看看一些广泛使用的方法。数值特征缩放缩放调整数值特征的范围,而不改变其分布的形状。两种常用方法是最小-最大缩放和标准化。最小-最大缩放(归一化)最小-最大缩放,通常称为归一化,将特征重新缩放到固定范围,通常是 [0, 1]。特征 $x$ 的公式是:$$ x_{scaled} = \frac{x - \min(x)}{\max(x) - \min(x)} $$此处,$\min(x)$ 和 $\max(x)$ 是该特征在训练数据集中的最小值和最大值。当你需要将数据限制在特定范围内时,此方法很有用。然而,它对异常值相当敏感。单个非常大或非常小的值会将其余数据显著压缩到 [0, 1] 范围的狭窄部分。import pandas as pd import numpy as np from sklearn.preprocessing import MinMaxScaler import plotly.graph_objects as go # 偏斜数据样本 np.random.seed(42) data = pd.DataFrame({ 'FeatureA': np.random.gamma(2, 2, 100) * 10, # 偏斜 'FeatureB': np.random.normal(50, 10, 100) # 接近正态 }) # 初始化并拟合缩放器 min_max_scaler = MinMaxScaler() # 重要:在实际场景中仅对训练数据进行拟合 scaled_data_mm = min_max_scaler.fit_transform(data) scaled_df_mm = pd.DataFrame(scaled_data_mm, columns=['FeatureA_scaled', 'FeatureB_scaled']) # 可视化(可选比较) fig = go.Figure() fig.add_trace(go.Histogram(x=data['FeatureA'], name='Original Feature A', marker_color='#1c7ed6', nbinsx=15)) fig.add_trace(go.Histogram(x=scaled_df_mm['FeatureA_scaled'], name='MinMax Scaled Feature A', marker_color='#ff922b', nbinsx=15, xaxis='x2', yaxis='y2')) fig.update_layout( title_text='最小-最大缩放效果(形状不变)', xaxis_title='原始值', yaxis_title='计数', xaxis2=dict(title='缩放值 [0,1]', overlaying='x', side='top'), yaxis2=dict(overlaying='y', side='right'), bargap=0.1, height=350, legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99), margin=dict(l=20, r=20, t=50, b=20) ) # fig.show() # 在交互式环境中显示图表{"layout": {"title": {"text": "最小-最大缩放效果(形状不变)"}, "xaxis": {"title": "原始值"}, "yaxis": {"title": "计数"}, "xaxis2": {"title": "缩放值 [0,1]", "overlaying": "x", "side": "top"}, "yaxis2": {"overlaying": "y", "side": "right"}, "bargap": 0.1, "height": 350, "legend": {"yanchor": "top", "y": 0.99, "xanchor": "right", "x": 0.99}, "margin": {"l": 20, "r": 20, "t": 50, "b": 20}}, "data": [{"type": "histogram", "x": [25.189111, 3.550108, 13.087884, 60.201311, 35.830938, 22.876369, 15.370904, 17.657475, 33.039507, 20.492071, 3.622366, 44.891902, 35.227802, 21.835556, 45.764233, 22.943996, 10.379933, 15.097931, 7.930903, 27.314806, 28.734366, 22.309165, 34.010965, 10.204829, 41.465735, 33.460476, 6.882727, 11.324313, 31.709783, 16.937075, 45.030573, 25.735518, 16.71754, 18.228756, 37.260895, 35.43618, 33.105487, 31.058024, 51.519706, 25.298306, 32.112421, 16.49835, 33.11234, 11.704552, 22.719204, 36.69339, 51.134577, 28.478218, 18.046695, 20.995845, 52.373627, 10.147184, 10.89916, 12.892583, 27.480429, 51.468944, 14.16084, 22.661649, 16.720034, 18.016989, 39.647051, 24.381999, 22.626494, 46.84119, 12.75825, 18.778724, 40.04377, 24.440361, 23.431058, 24.317533, 22.228711, 48.739278, 18.097285, 14.518982, 16.736277, 13.304785, 21.634973, 22.509017, 12.496137, 11.195871, 26.078889, 5.29289, 50.528114, 13.811333, 24.036307, 31.941443, 30.974886, 30.820344, 18.972149, 12.452203, 33.264347, 13.178073, 44.76769, 26.927067, 14.440191, 32.446621, 23.025803, 52.738866, 13.697486], "name": "原始特征 A", "marker": {"color": "#1c7ed6"}, "nbinsx": 15}, {"type": "histogram", "x": [0.359975, 0.0, 0.157483, 0.961473, 0.54618, 0.316153, 0.199397, 0.237991, 0.482562, 0.283175, 0.001162, 0.69784, 0.533852, 0.300169, 0.712376, 0.317273, 0.114926, 0.19488, 0.073764, 0.389892, 0.414711, 0.309746, 0.502818, 0.11056, 0.630654, 0.491454, 0.05599, 0.130994, 0.460197, 0.225766, 0.702574, 0.362994, 0.221885, 0.24716, 0.568499, 0.537611, 0.483843, 0.449205, 0.808162, 0.361294, 0.46613, 0.21801, 0.483965, 0.137407, 0.314173, 0.559728, 0.801975, 0.409505, 0.243894, 0.294415, 0.821659, 0.109506, 0.123779, 0.155924, 0.392732, 0.807052, 0.178944, 0.313115, 0.221932, 0.243388, 0.600304, 0.349178, 0.312544, 0.731986, 0.153674, 0.257269, 0.606966, 0.350302, 0.332957, 0.34832, 0.308197, 0.764952, 0.244847, 0.181736, 0.222214, 0.164169, 0.297267, 0.311279, 0.149432, 0.128845, 0.373258, 0.029388, 0.791764, 0.171184, 0.342692, 0.463686, 0.44755, 0.444879, 0.260126, 0.148523, 0.486397, 0.159935, 0.695502, 0.38421, 0.18029, 0.47209, 0.319553, 0.828601, 0.168195], "name": "最小-最大缩放特征 A", "marker": {"color": "#ff922b"}, "nbinsx": 15, "xaxis": "x2", "yaxis": "y2"}]}特征分布在最小-最大缩放前后的比较。范围被压缩到 [0, 1],但整体形状(偏度)保持不变。标准化(Z-分数缩放)标准化重新缩放特征,使其均值 ($\mu$) 为 0,标准差 ($\sigma$) 为 1。公式是:$$ x_{standardized} = \frac{x - \mu}{\sigma} $$同样,$\mu$ 和 $\sigma$ 是从训练数据计算的。结果分布将具有均值 0 和单位方差。标准化比最小-最大缩放受异常值影响小,并且通常更适用于假定数据呈正态分布并以零为中心算法,或对特征方差敏感的算法。from sklearn.preprocessing import StandardScaler import plotly.graph_objects as go # 假设存在前一个例子中的 'data' DataFrame # 初始化并拟合缩放器 standard_scaler = StandardScaler() # 重要:在实际场景中仅对训练数据进行拟合 scaled_data_std = standard_scaler.fit_transform(data) scaled_df_std = pd.DataFrame(scaled_data_std, columns=['FeatureA_scaled', 'FeatureB_scaled']) # 可视化 fig_std = go.Figure() fig_std.add_trace(go.Histogram(x=data['FeatureA'], name='Original Feature A', marker_color='#1c7ed6', nbinsx=15)) fig_std.add_trace(go.Histogram(x=scaled_df_std['FeatureA_scaled'], name='Standardized Feature A', marker_color='#7048e8', nbinsx=15, xaxis='x2', yaxis='y2')) fig_std.update_layout( title_text='标准化效果(形状不变)', xaxis_title='原始值', yaxis_title='计数', xaxis2=dict(title='标准化值 (均值=0, 标准差=1)', overlaying='x', side='top'), yaxis2=dict(overlaying='y', side='right'), bargap=0.1, height=350, legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99), margin=dict(l=20, r=20, t=50, b=20) ) # fig_std.show() # 在交互式环境中显示图表{"layout": {"title": {"text": "标准化效果(形状不变)"}, "xaxis": {"title": "原始值"}, "yaxis": {"title": "计数"}, "xaxis2": {"title": "标准化值 (均值=0, 标准差=1)", "overlaying": "x", "side": "top"}, "yaxis2": {"overlaying": "y", "side": "right"}, "bargap": 0.1, "height": 350, "legend": {"yanchor": "top", "y": 0.99, "xanchor": "right", "x": 0.99}, "margin": {"l": 20, "r": 20, "t": 50, "b": 20}}, "data": [{"type": "histogram", "x": [25.189111, 3.550108, 13.087884, 60.201311, 35.830938, 22.876369, 15.370904, 17.657475, 33.039507, 20.492071, 3.622366, 44.891902, 35.227802, 2.1835556, 45.764233, 22.943996, 10.379933, 15.097931, 7.930903, 27.314806, 28.734366, 22.309165, 34.010965, 10.204829, 41.465735, 33.460476, 6.882727, 11.324313, 31.709783, 16.937075, 45.030573, 25.735518, 16.71754, 18.228756, 37.260895, 35.43618, 33.105487, 31.058024, 51.519706, 25.298306, 32.112421, 16.49835, 33.11234, 11.704552, 22.719204, 36.69339, 51.134577, 28.478218, 18.046695, 20.995845, 52.373627, 10.147184, 10.89916, 12.892583, 27.480429, 51.468944, 14.16084, 22.661649, 16.720034, 18.016989, 39.647051, 24.381999, 22.626494, 46.84119, 12.75825, 18.778724, 40.04377, 24.440361, 23.431058, 24.317533, 22.228711, 48.739278, 18.097285, 14.518982, 16.736277, 13.304785, 21.634973, 22.509017, 12.496137, 11.195871, 26.078889, 5.29289, 50.528114, 13.811333, 24.036307, 31.941443, 30.974886, 30.820344, 18.972149, 12.452203, 33.264347, 13.178073, 44.76769, 26.927067, 14.440191, 32.446621, 23.025803, 52.738866, 13.697486], "name": "原始特征 A", "marker": {"color": "#1c7ed6"}, "nbinsx": 15}, {"type": "histogram", "x": [0.05045, -1.48094, -0.77166, 2.72658, 0.8507, -0.01571, -0.59439, -0.41792, 0.63264, -0.19959, -1.47555, 1.54313, 0.80304, -0.09808, 1.60922, -0.01053, -0.97149, -0.61565, -1.1617, 0.21183, 0.32113, -0.05906, 0.70808, -0.98496, 1.28188, 0.66522, -1.24125, -0.90273, 0.5305, -0.47307, 1.55331, 0.09235, -0.49002, -0.37313, 0.95973, 0.82054, 0.63766, 0.48092, 2.05767, 0.05874, 0.56139, -0.50691, 0.63819, -0.87233, -0.02795, 0.9159, 2.02782, 0.30177, -0.38691, -0.16167, 2.12227, -0.98924, -0.93554, -0.7869, 0.22469, 2.05374, -0.68974, -0.03225, -0.48983, -0.38917, 1.14644, -0.0011, -0.03498, 1.69091, -0.7973, -0.3301, 1.17649, -0.00204, -0.13896, 0.00094, -0.06531, 1.83164, -0.38299, -0.66156, -0.48858, -0.75765, -0.11273, -0.04443, -0.81781, -0.91279, 0.11455, -1.36806, 1.97961, -0.72027, -0.03026, 0.54862, 0.47469, 0.46247, -0.31549, -0.82123, 0.65016, -0.76855, 1.53389, 0.18254, -0.66761, 0.5874, -0.00457, 2.14988, -0.72876], "name": "标准化特征 A", "marker": {"color": "#7048e8"}, "nbinsx": 15, "xaxis": "x2", "yaxis": "y2"}]}特征分布在标准化前后的比较。特征以0为中心,标准差为1,但偏度保持不变。选择缩放方法最小-最大缩放: 当你需要将特征限制在有界区间 [0, 1] 内(例如,图像像素强度)或用于不假定任何特定分布的算法时使用。小心异常值。标准化: 通常更适用于假定零均值和单位方差的算法(如线性回归、带正则化的逻辑回归)或当存在异常值时使用距离度量的算法。转换特征分布有时,仅仅缩放是不够的。如果特征的分布严重偏斜,应用非线性转换可以帮助使其更对称,可能提升模型性能。对数转换对数函数压缩大值范围并扩展小值范围。这使其能有效减少右偏性(尾部向右延伸的情况)。如果所有值都为正,使用 $log(x)$。如果值包含零,使用 $log(1 + x)$(通过 numpy.log1p)。# 假设存在带有偏斜 'FeatureA' 的 'data' DataFrame data['FeatureA_log'] = np.log1p(data['FeatureA']) # 使用 log1p 以确保处理可能存在的0值 # 可视化 fig_log = go.Figure() fig_log.add_trace(go.Histogram(x=data['FeatureA'], name='Original Feature A', marker_color='#1c7ed6', nbinsx=15)) fig_log.add_trace(go.Histogram(x=data['FeatureA_log'], name='Log Transformed Feature A', marker_color='#20c997', nbinsx=15, xaxis='x2', yaxis='y2')) fig_log.update_layout( title_text='对数转换对偏斜数据的影响', xaxis_title='原始值', yaxis_title='计数', xaxis2=dict(title='Log(1+值)', overlaying='x', side='top'), yaxis2=dict(overlaying='y', side='right'), bargap=0.1, height=350, legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99), margin=dict(l=20, r=20, t=50, b=20) ) # fig_log.show() # 在交互式环境中显示图表{"data": [{"type": "histogram", "x": [25.19, 3.55, 13.09, 60.20, 35.83, 22.88, 15.37, 17.66, 33.04, 20.49], "name": "原始特征 A", "marker": {"color": "#1c7ed6"}, "nbinsx": 15}, {"type": "histogram", "x": [3.27, 1.52, 2.65, 4.11, 3.61, 3.17, 2.80, 2.93, 3.53, 3.07], "name": "对数转换特征 A", "marker": {"color": "#20c997"}, "nbinsx": 15, "xaxis": "x2", "yaxis": "y2"}], "layout": {"bargap": 0.1, "height": 350, "legend": {"x": 0.99, "xanchor": "right", "y": 0.99, "yanchor": "top"}, "margin": {"b": 20, "l": 20, "r": 20, "t": 50}, "title": {"text": "对数转换对偏斜数据的影响"}, "xaxis": {"title": "原始值"}, "xaxis2": {"overlaying": "x", "side": "top", "title": "Log(1+值)"}, "yaxis": {"title": "计数"}, "yaxis2": {"overlaying": "y", "side": "right"}}}对数转换使“特征 A”的分布显得更对称(更接近钟形),与原始右偏分布相比。Box-Cox 转换Box-Cox 转换是一种更一般化的幂转换,可以找到接近最优的转换,使你的数据更接近正态分布。它定义为:$$ x^{(\lambda)} = \begin{cases} \frac{x^\lambda - 1}{\lambda} & \text{如果 } \lambda \neq 0 \ \ln(x) & \text{如果 } \lambda = 0 \end{cases} $$该转换找到 $\lambda$ (lambda) 的最佳值,以稳定方差并提高正态性。一个重要限制是 Box-Cox 要求所有数据为正。from scipy.stats import boxcox import plotly.graph_objects as go # 假设存在带有正值且偏斜的 'FeatureA' 的 'data' DataFrame # 在应用 Box-Cox 之前确保 FeatureA 为正值 if (data['FeatureA'] > 0).all(): # 应用 Box-Cox:返回转换后的数据和最优 lambda featureA_boxcox, best_lambda = boxcox(data['FeatureA']) print(f"Box-Cox 找到的最优 Lambda: {best_lambda:.4f}") # 可视化 fig_boxcox = go.Figure() fig_boxcox.add_trace(go.Histogram(x=data['FeatureA'], name='Original Feature A', marker_color='#1c7ed6', nbinsx=15)) fig_boxcox.add_trace(go.Histogram(x=featureA_boxcox, name='Box-Cox Transformed Feature A', marker_color='#be4bdb', nbinsx=15, xaxis='x2', yaxis='y2')) fig_boxcox.update_layout( title_text='Box-Cox 转换效果', xaxis_title='原始值', yaxis_title='计数', xaxis2=dict(title='Box-Cox 转换值', overlaying='x', side='top'), yaxis2=dict(overlaying='y', side='right'), bargap=0.1, height=350, legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99), margin=dict(l=20, r=20, t=50, b=20) ) # fig_boxcox.show() # 显示图表 else: print("FeatureA 包含非正值。Box-Cox 无法直接应用。") # 如果无法应用,则创建一个空的 plotly json fig_boxcox_json = '{"layout": {"title": {"text": "Box-Cox Skipped (Non-Positive Data)"},"height": 100}, "data": []}'{"layout": {"title": {"text": "Box-Cox 转换效果"}, "xaxis": {"title": "原始值"}, "yaxis": {"title": "计数"}, "xaxis2": {"title": "Box-Cox 转换值", "overlaying": "x", "side": "top"}, "yaxis2": {"overlaying": "y", "side": "right"}, "bargap": 0.1, "height": 350, "legend": {"yanchor": "top", "y": 0.99, "xanchor": "right", "x": 0.99}, "margin": {"l": 20, "r": 20, "t": 50, "b": 20}}, "data": [{"type": "histogram", "x": [25.189111, 3.550108, 13.087884, 60.201311, 35.830938, 22.876369, 15.370904, 17.657475, 33.039507, 20.492071, 3.622366, 44.891902, 35.227802, 21.835556, 45.764233, 22.943996, 10.379933, 15.097931, 7.930903, 27.314806, 28.734366, 22.309165, 34.010965, 10.204829, 41.465735, 33.460476, 6.882727, 11.324313, 31.709783, 16.937075, 45.030573, 25.735518, 16.71754, 18.228756, 37.260895, 35.43618, 33.105487, 31.058024, 51.519706, 25.298306, 32.112421, 16.49835, 33.11234, 11.704552, 22.719204, 36.69339, 51.134577, 28.478218, 18.046695, 20.995845, 52.373627, 10.147184, 10.89916, 12.892583, 27.480429, 51.468944, 14.16084, 22.661649, 16.720034, 18.016989, 39.647051, 24.381999, 22.626494, 46.84119, 12.75825, 18.778724, 40.04377, 24.440361, 23.431058, 24.317533, 22.228711, 48.739278, 18.097285, 14.518982, 16.736277, 13.304785, 21.634973, 22.509017, 12.496137, 11.195871, 26.078889, 5.29289, 50.528114, 13.811333, 24.036307, 31.941443, 30.974886, 30.820344, 18.972149, 12.452203, 33.264347, 13.178073, 44.76769, 26.927067, 14.440191, 32.446621, 23.025803, 52.738866, 13.697486], "name": "原始特征 A", "marker": {"color": "#1c7ed6"}, "nbinsx": 15}, {"type": "histogram", "x": [3.756204, 1.346337, 2.773611, 5.364758, 4.360955, 3.576986, 2.978825, 3.173721, 4.200694, 3.393174, 1.363269, 4.822239, 4.329032, 3.503942, 4.867434, 3.582184, 2.474225, 2.95464, 2.154785, 3.918372, 4.018813, 3.540179, 4.26437, 2.455987, 4.657814, 4.227142, 2.003619, 2.57381, 4.123375, 3.098128, 4.830038, 3.798054, 3.077611, 3.222106, 4.449911, 4.340441, 4.205583, 4.086161, 5.126944, 3.763923, 4.149429, 3.056851, 4.206085, 2.61321, 3.565836, 4.41771, 5.11017, 3.999368, 3.206911, 3.433484, 5.160174, 2.450001, 2.528352, 2.754074, 3.92965, 5.125079, 2.853795, 3.561699, 3.077839, 3.204486, 4.573292, 3.696879, 3.559115, 4.913367, 2.74067, 3.268075, 4.591715, 3.701157, 3.619588, 3.692466, 3.534187, 5.004665, 3.211102, 2.888102, 3.079319, 2.794097, 3.489325, 3.550753, 2.714579, 2.560041, 3.822611, 1.741605, 5.082747, 2.820519, 3.673093, 4.139885, 4.080263, 4.069304, 3.283938, 2.709987, 4.215398, 2.782436, 4.81611, 3.888692, 2.879909, 4.171239, 3.588465, 5.174662, 2.809238], "name": "Box-Cox 转换特征 A", "marker": {"color": "#be4bdb"}, "nbinsx": 15, "xaxis": "x2", "yaxis": "y2"}]}Box-Cox 转换也使偏斜数据更对称,在此情况下与对数转换相似,通过自动找到合适的幂转换。Yeo-Johnson 转换Yeo-Johnson 转换在精神上与 Box-Cox 相似,但优点是能够处理非正数据。如果你的数据包含零或负数,Yeo-Johnson 是一种合适的替代方案,可实现更接近正态的分布。它在 scikit-learn 中以 PowerTransformer(method='yeo-johnson') 提供。实践中应用转换:训练/测试集划分应用任何转换(缩放或分布调整)时的一个要点是只对训练数据拟合转换器。你从训练集中学习参数(如最小值/最大值、均值/标准差或 lambda),然后使用这些学习到的参数来转换训练集、验证集和测试集。为什么?在分割之前对整个数据集进行拟合会导致数据泄露。测试集中的信息(例如,其最小值或最大值)会泄露到训练过程中,导致对模型在未见数据上的性能估计过于乐观。from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.linear_model import LogisticRegression # 示例模型 from sklearn.pipeline import Pipeline # 假设 'X' 是你的特征矩阵,'y' 是你的目标向量 # X = data[['FeatureA', 'FeatureB']] # 示例特征 # y = ... # 你的目标变量 # 先分割数据! # X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # 仅对训练数据拟合缩放器 # scaler = StandardScaler() # X_train_scaled = scaler.fit_transform(X_train) # 将相同的已拟合缩放器应用于测试数据 # X_test_scaled = scaler.transform(X_test) # 使用 transform(),而不是 fit_transform() # -- 使用管道简化此过程 -- # 定义步骤:1. 缩放,2. 模型 # pipe = Pipeline([ # ('scaler', StandardScaler()), # ('classifier', LogisticRegression()) # ]) # 在训练数据上拟合整个管道 # 管道处理缩放器的拟合,然后训练模型 # pipe.fit(X_train, y_train) # 在测试数据上预测 # 管道自动使用已拟合的缩放器转换测试数据 # predictions = pipe.predict(X_test) # score = pipe.score(X_test, y_test) # print(f"模型在测试数据上的得分: {score:.4f}")如上所示,使用 scikit-learn 管道是连接预处理步骤和建模的推荐方式。管道确保在交叉验证期间仅在训练折叠上进行拟合,并且正确的转换按顺序应用。总结数据转换和标准化是机器学习数据准备中的必要步骤。通过使用最小-最大缩放或标准化等方法将特征调整到可比较的尺度,你可以避免某些特征不适当地影响模型结果。此外,使用对数或 Box-Cox 转换等方法转换偏斜分布可以帮助那些在更对称、类似正态数据上表现更好的模型。请记住基本规则:始终只在训练数据上拟合转换器,然后使用它们来转换训练和测试数据集,以防止数据泄露。这些方法为高效的特征工程和模型构建提供了支持。