A box plot (also called a box-and-whisker plot) provides a compact visual summary that highlights specific statistics of a dataset. While a histogram effectively shows the overall shape and frequencies of the data, a box plot offers a standard way to present a distribution based on the five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. This kind of chart is especially useful for comparing distributions across groups.

## Components of a Box Plot

A standard box plot has several main parts:

- **Box:** The central box represents the middle 50% of the data. Its bottom edge marks the first quartile (Q1, the 25th percentile) and its top edge marks the third quartile (Q3, the 75th percentile). The length of the box is therefore the interquartile range (IQR), computed as $IQR = Q3 - Q1$. This range contains the middle half of your data points.
- **Median line:** The line inside the box marks the median (Q2, the 50th percentile). Its position within the box hints at the symmetry of the data. If the median sits closer to Q1, the quarter of the data between Q1 and the median is packed more tightly than the quarter between the median and Q3, and vice versa.
- **Whiskers:** The lines extending outward from the box. By the usual convention, the lower whisker extends to the lowest data point that lies within 1.5 times the IQR below Q1, and the upper whisker extends to the highest data point that lies within 1.5 times the IQR above Q3. Put differently, the rule defines a lower fence at $Q1 - 1.5 \times IQR$ and an upper fence at $Q3 + 1.5 \times IQR$. Each whisker ends at the most extreme observation that is still inside its fence, so when no point lies beyond the fences the whiskers simply reach the data minimum and maximum. Any data point outside the fences is treated as a potential outlier.
- **Outliers:** Data points that fall outside the range defined by the whiskers are plotted individually, usually as dots or asterisks. They are flagged for further inspection because they deviate unusually far from the central part of the data.

## Why Use Box Plots?

Box plots have several advantages for summarizing data:

- **Concise summary:** They show the center (median), spread (IQR), and range (whiskers) of the data at a glance.
- **Outlier identification:** They provide a standard visual method for spotting potential outliers.
- **Comparison:** Placing box plots side by side is an effective way to compare the distributions of different datasets or of subgroups within one dataset. You can quickly compare their medians, IQRs, and the presence of outliers.
- **Skewness indication:** The position of the median within the box and the relative lengths of the whiskers give visual cues about skewness. A median close to Q1 with a longer upper whisker suggests positive skew, while a median close to Q3 with a longer lower whisker suggests negative skew.

## Creating Box Plots in Python

Python libraries such as Matplotlib, Seaborn, and Plotly make it straightforward to create box plots, especially when working with Pandas DataFrames. The example below uses Plotly Express. Let's generate some sample data representing daily temperatures in two cities (City A and City B) and plot it.

```python
import numpy as np
import pandas as pd
import plotly.express as px

# Generate sample temperature data
np.random.seed(42)  # for reproducibility
city_a_temps = np.random.normal(loc=20, scale=5, size=100)  # mean 20°C, std 5°C
city_b_temps = np.random.normal(loc=25, scale=8, size=100)  # mean 25°C, std 8°C

# Add a couple of outliers for City A
city_a_temps = np.append(city_a_temps, [3, 45])

# Build a Pandas DataFrame in long format: one row per observation
df = pd.DataFrame({
    'Temperature': np.concatenate([city_a_temps, city_b_temps]),
    'City': ['City A'] * len(city_a_temps) + ['City B'] * len(city_b_temps)
})

# Create the box plot with Plotly Express
fig = px.box(
    df, x='City', y='Temperature',
    color='City',        # color each box by city
    points="outliers",   # draw only the outlier points
    title="Daily Temperature Distribution by City",
    labels={'Temperature': 'Temperature (°C)', 'City': 'City'},
    color_discrete_map={'City A': '#1f77b4', 'City B': '#ff7f0e'}  # optional custom colors
)

# To display the chart in an environment such as Jupyter:
# fig.show()
```

Figure: Side-by-side box plots comparing daily temperatures in City A and City B. Note City A's median line, the extent of the box (IQR), the whiskers, and the individually marked outliers.

## Interpreting Box Plots

Looking at the plot generated above:

- **Median comparison:** City B's median temperature (the line inside the orange box) is clearly higher than City A's median temperature (the line inside the blue box).
- **Spread (IQR) comparison:** City B's box is taller than City A's, which means the middle 50% of City B's temperatures is more spread out (a larger IQR). City B has more variability within its central range.
- **Whiskers and overall range:** The whiskers show the range of typical data points, excluding outliers. City B's whiskers cover a wider span overall, again indicating more variability.
- **Outliers:** City A shows individually plotted points far above and far below its whiskers. These include the two anomalous temperatures we added (3°C and 45°C) and may warrant further inspection. In this sample, City B shows no outliers under the 1.5 × IQR rule.
- **Skewness:** In City A's plot, the median line sits roughly in the middle of the box and the whiskers are roughly symmetric (ignoring outliers), which suggests that the bulk of the data is fairly symmetric. City B's median also appears relatively centered.

Box plots are an effective way to grasp the main features of a distribution quickly, and they are an essential tool in exploratory data analysis, particularly when comparing groups. They complement the insights gained from statistics such as the mean and standard deviation and from visualizations such as histograms.