High-dimensional datasets are common in data science, and they are hard to visualize. Humans perceive a three-dimensional world, so data with tens, hundreds, or even thousands of features cannot be plotted or interpreted directly. Yet visualization is a powerful analysis tool: it helps us identify patterns, spot clusters, detect outliers, and get an overall sense of the inherent structure of unlabeled data.

Dimensionality reduction methods designed for visualization map high-dimensional data into a low-dimensional space (usually 2D or 3D) while preserving meaningful relationships between data points. This section examines two widely used methods: Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE). PCA was introduced earlier as a general-purpose dimensionality reduction method; here we revisit it specifically from a visualization perspective. t-SNE is designed primarily for visualizing high-dimensional data.

Visualizing Data with Principal Component Analysis (PCA)

PCA reduces dimensionality by finding a new set of orthogonal axes, called principal components, that capture the maximum variance in the data. The first principal component explains the most variance, the second (orthogonal to the first) explains the next most, and so on.

For visualization, we usually project the data onto the first two principal components (PC1 and PC2). This 2D representation captures the directions of greatest spread in the original data. Discarding the remaining components loses information, but the projection often gives a useful overview of the data's structure and may reveal clusters or trends.

Keep in mind that PCA is sensitive to feature scale. Scaling the data before applying PCA (for example, with scikit-learn's StandardScaler) is standard practice.

The following example uses scikit-learn to project data onto the first two principal components:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import plotly.express as px
import plotly.io as pio

# Load sample data (here, the Iris dataset)
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# 1. Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Apply PCA to reduce to 2 dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Build a DataFrame for plotting
pca_df = pd.DataFrame(data=X_pca,
                      columns=['Principal Component 1', 'Principal Component 2'])
pca_df['target'] = y
pca_df['species'] = pca_df['target'].apply(lambda i: target_names[i])

# 3. Visualize the result
fig = px.scatter(pca_df,
                 x='Principal Component 1',
                 y='Principal Component 2',
                 color='species',
                 title='PCA of Iris Dataset (2 Components)',
                 labels={'species': 'Species'},
                 color_discrete_map={  # course palette colors
                     'setosa': '#228be6',      # blue
                     'versicolor': '#51cf66',  # green
                     'virginica': '#be4bdb'    # purple
                 })

fig.update_layout(
    xaxis_title="Principal Component 1",
    yaxis_title="Principal Component 2",
    legend_title="Species",
    width=700,   # width suited to web display
    height=500   # height suited to web display
)

# Optional: show the explained variance ratio
print(f"Explained variance ratio by component: {pca.explained_variance_ratio_}")
print(f"Total explained variance by 2 components: {np.sum(pca.explained_variance_ratio_):.4f}")

# Show the figure (or generate JSON for web embedding)
# fig.show()
# To generate JSON for embedding:
# print(pio.to_json(fig))
```
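To make the projection step concrete, here is a minimal NumPy sketch of what a two-component PCA computes, assuming the standard SVD formulation. The helper name `pca_project` is hypothetical (it is not part of scikit-learn), and the comparison against scikit-learn's `PCA` is included only as a sanity check.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_project(X, n_components=2):
    """Hypothetical helper: project X onto its top principal components via SVD.

    Returns the projected data and the explained variance ratio of each
    kept component.
    """
    # PCA operates on mean-centered data
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    projected = X_centered @ components.T
    # Variance along each axis is proportional to the squared singular value
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return projected, explained_ratio[:n_components]


X_scaled = StandardScaler().fit_transform(load_iris().data)
X_manual, ratio_manual = pca_project(X_scaled, n_components=2)

# Sanity check against scikit-learn
pca = PCA(n_components=2)
X_sklearn = pca.fit_transform(X_scaled)
print("Projections match up to sign:",
      np.allclose(np.abs(X_manual), np.abs(X_sklearn)))
print("Explained variance ratios match:",
      np.allclose(ratio_manual, pca.explained_variance_ratio_))
```

Each principal axis is only defined up to a sign flip, which is why the comparison uses absolute values. A plot produced from either projection shows the same structure, possibly mirrored.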
Figure: PCA projection of the Iris dataset onto the first two principal components. Colors indicate the true species labels, showing how PCA separates the groups along directions of high variance. The total explained variance (about 95.8% in this example) indicates how much of the original data's variability these two components capture.

PCA provides a linear projection that is computationally efficient and easy to interpret in terms of variance. However, it may fail to separate clusters that are defined by nonlinear relationships or local density.

Visualizing Data with t-distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a nonlinear dimensionality reduction method used mainly for visualization. Whereas PCA maximizes variance (preserving global structure), t-SNE aims to preserve the local structure of the data. It models the similarities between high-dimensional points as conditional probabilities, then searches for a low-dimensional embedding (usually 2D or 3D) whose pairwise similarities closely match the high-dimensional ones.

t-SNE is particularly effective at revealing clusters. Points that are close in the high-dimensional space tend to be mapped close together in the low-dimensional space.

Important aspects of t-SNE:

- Probabilistic: it models similarities with probability distributions (a Gaussian in the high-dimensional space, a t-distribution in the low-dimensional space).
- Nonlinear: it can capture complex nonlinear structure that PCA may miss.
- Computationally intensive: compared with PCA, t-SNE needs considerably more computation time, especially on large datasets.
- Sensitive to hyperparameters: the result depends strongly on the following settings.
  - perplexity: roughly related to the number of nearest neighbors considered for each point. Typical values are between 5 and 50. It affects the balance between local and global aspects of the data.
  - max_iter (called n_iter before scikit-learn 1.5): the number of optimization iterations. Convergence usually takes several hundred iterations (for example, 1000).
  - learning_rate: controls the step size during optimization.

Note that a t-SNE plot is meant mainly for visual analysis. Distances between apparent clusters may not be meaningful, and the global arrangement can change between runs or with different perplexity values. Focus on which points group together, not on the clusters' relative positions or sizes.

The following example applies t-SNE with scikit-learn, again using the scaled Iris data:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE  # note: a different module from PCA
import plotly.express as px
import plotly.io as pio

# Load and scale the data (as before)
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply t-SNE
# Common settings: perplexity=30, 1000 iterations.
# The iteration parameter is max_iter in scikit-learn 1.5 and later;
# older versions call it n_iter.
tsne = TSNE(n_components=2, perplexity=30, max_iter=1000, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# Build a DataFrame for plotting
tsne_df = pd.DataFrame(data=X_tsne,
                       columns=['TSNE Component 1', 'TSNE Component 2'])
tsne_df['target'] = y
tsne_df['species'] = tsne_df['target'].apply(lambda i: target_names[i])

# Visualize the result
fig_tsne = px.scatter(tsne_df,
                      x='TSNE Component 1',
                      y='TSNE Component 2',
                      color='species',
                      title='t-SNE Visualization of Iris Dataset',
                      labels={'species': 'Species'},
                      color_discrete_map={  # course palette colors
                          'setosa': '#228be6',      # blue
                          'versicolor': '#51cf66',  # green
                          'virginica': '#be4bdb'    # purple
                      })

fig_tsne.update_layout(
    xaxis_title="t-SNE Component 1",
    yaxis_title="t-SNE Component 2",
    legend_title="Species",
    width=700,   # width suited to web display
    height=500   # height suited to web display
)

# Show the figure (or generate JSON for web embedding)
# fig_tsne.show()
# To generate JSON for embedding:
# print(pio.to_json(fig_tsne))
```
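The similarity model behind t-SNE can be written out in a few lines. The following is a simplified sketch, not the scikit-learn implementation: it assumes a single shared Gaussian bandwidth `sigma` for every point, whereas t-SNE tunes one bandwidth per point to match the requested perplexity. All function names here are hypothetical, and the toy data is generated only for illustration.

```python
import numpy as np


def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all rows of Z."""
    sq_norms = np.sum(Z ** 2, axis=1)
    D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Z @ Z.T
    return np.maximum(D, 0.0)  # clip tiny negatives caused by round-off


def high_dim_affinities(X, sigma=1.0):
    """Symmetrized Gaussian affinities p_ij (assumption: one shared sigma)."""
    n = X.shape[0]
    logits = -pairwise_sq_dists(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)            # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    P_cond = np.exp(logits)
    P_cond /= P_cond.sum(axis=1, keepdims=True)  # row i holds p_{j|i}
    return (P_cond + P_cond.T) / (2.0 * n)


def low_dim_affinities(Y):
    """Student-t (one degree of freedom) affinities q_ij."""
    W = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()


def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the mismatch that t-SNE minimizes over the embedding."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))


rng = np.random.default_rng(0)
# Toy data: two well-separated groups of 20 points in 5 dimensions
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 5)),
               rng.normal(6.0, 1.0, size=(20, 5))])
P = high_dim_affinities(X, sigma=1.0)

# Candidate layout that keeps each group together: the first two coordinates
Y_good = X[:, :2]
# Candidate layout with the same positions assigned to the wrong points
Y_bad = rng.permutation(Y_good)

kl_good = kl_divergence(P, low_dim_affinities(Y_good))
kl_bad = kl_divergence(P, low_dim_affinities(Y_bad))
print(f"KL for group-preserving layout: {kl_good:.3f}")
print(f"KL for scrambled layout:        {kl_bad:.3f}")
```

The group-preserving layout yields the lower divergence because neighbors in the original space stay close in the layout. t-SNE itself starts from an initial layout and moves the points by gradient descent to reduce this divergence.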
Figure: t-SNE projection of the Iris dataset. Compared with PCA, t-SNE often produces clearer, better-separated clusters, effectively capturing the local similarity among data points of the same species.

PCA vs. t-SNE for Visualization: A Quick Comparison

- Goal: PCA focuses on preserving global variance; t-SNE focuses on preserving local similarities.
- Structure: PCA captures linear structure; t-SNE captures nonlinear structure and is better suited to revealing clusters.
- Computation: PCA is much faster. t-SNE is slower and uses more memory.
- Output: PCA results are deterministic. t-SNE results can vary slightly between runs because of its probabilistic nature and its optimization process, unless the random seed is fixed. The global arrangement of clusters and the distances between them in a t-SNE plot are less reliable than the groupings themselves.
- Use cases: use PCA when you need a quick first overview or when global structure matters. Use t-SNE when the main goal is to visualize potential clusters and local relationships. A common approach is to apply PCA first to reduce the dimensionality substantially (for example, to 50 components) and then apply t-SNE, which can improve t-SNE's performance and reduce noise.

Considerations

- Scaling: both PCA and t-SNE (especially when t-SNE is applied directly to the data without a prior PCA step) usually require feature scaling.
- Interpretation: interpret the plots with care. PCA axes represent directions of maximum variance. t-SNE axes have no direct global interpretation; focus on the relative grouping of points.
- Information loss: any dimensionality reduction loses information. A visualization provides a compressed view, not the full picture.

By applying methods such as PCA and t-SNE, you can turn complex high-dimensional data into interpretable 2D or 3D plots that help surface patterns and structure that would otherwise be hard to find. This kind of visual analysis is an integral part of the unsupervised learning toolkit for understanding unlabeled data.