Let's put theory into practice. Understanding the formulas behind probability distributions is one thing, but simulating them and plotting the results gives you a much stronger intuition for how they behave. This is especially useful when you need to judge whether a particular distribution is a good fit for data you observe in a machine learning problem.

We will use Python, specifically the `scipy.stats` module, along with `numpy` for numerical computation and `matplotlib` or `seaborn` for plotting (`plotly` is a common companion for interactive web visualizations).

First, make sure the required libraries are installed, then import them:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Optional: set the seaborn plotting style
sns.set_style("whitegrid")
```

Simulating and Visualizing a Discrete Distribution: Binomial

We start with the binomial distribution, which models the number of successes $k$ in a fixed number $n$ of independent Bernoulli trials, each with success probability $p$.

Suppose we flip a biased coin ($p=0.6$) 20 times ($n=20$) and repeat this experiment many times (say, 1000 times). We can use `scipy.stats.binom.rvs` (random variates) to simulate the number of heads obtained in each experiment.

```python
# Parameters of the binomial distribution
n_binom = 20       # Number of trials
p_binom = 0.6      # Probability of success
size_binom = 1000  # Number of simulations (experiments)

# Draw random samples from the binomial distribution
binomial_samples = stats.binom.rvs(n=n_binom, p=p_binom, size=size_binom)

# Print the first 10 simulated results
print(f"First 10 simulated results (number of successes in {n_binom} trials):")
print(binomial_samples[:10])
```

Now we visualize the simulated results with a histogram and compare it with the theoretical probability mass function (PMF). The PMF gives the probability of obtaining exactly $k$ successes.

```python
# Compute the theoretical PMF
k_values_binom = np.arange(0, n_binom + 1)
pmf_binom = stats.binom.pmf(k=k_values_binom, n=n_binom, p=p_binom)

# Bin edges centered on each integer k, so every outcome gets its own bar
bins_binom = np.arange(0, n_binom + 2) - 0.5

# Create the figure with Matplotlib (the same data can be adapted for Plotly)
plt.figure(figsize=(10, 6))

# Histogram of the simulated data
# density=True normalizes the histogram so it is comparable to the PMF
plt.hist(binomial_samples, bins=bins_binom, density=True, alpha=0.6,
         color='#4dabf7', label='Simulated data (histogram)')

# Theoretical PMF, drawn as points joined by lines ('o-')
plt.plot(k_values_binom, pmf_binom, 'o-', color='#f03e3e', label='Theoretical PMF')

plt.xlabel(f"Number of successes in {n_binom} trials (k)")
plt.ylabel("Probability / Density")
plt.title(f"Binomial Distribution (n={n_binom}, p={p_binom}): Simulation vs. Theory")
plt.legend()
plt.grid(True)
plt.show()
```

Figure: Comparison of 1000 simulated binomial experiments (n=20, p=0.6) with the theoretical probability mass function. The histogram closely follows the shape predicted by the PMF.

Notice how closely the histogram of our simulated results matches the shape of the theoretical PMF. As the number of simulations (`size_binom`) increases, the approximation generally improves.

Simulating and Visualizing a Continuous Distribution: Normal

Next, we look at the normal (Gaussian) distribution, which is characterized by its mean $\mu$ and standard deviation $\sigma$ (or variance $\sigma^2$). It is probably the most common distribution in statistics and machine learning, and is often used to model natural phenomena or errors.

We simulate data from the standard normal distribution ($\mu=0, \sigma=1$) and compare the histogram of the samples with the theoretical probability density function (PDF).

```python
# Parameters of the normal distribution
mu_norm = 0       # Mean
sigma_norm = 1    # Standard deviation
size_norm = 1000  # Number of samples

# Draw random samples from the normal distribution
normal_samples = stats.norm.rvs(loc=mu_norm, scale=sigma_norm, size=size_norm)

# Print the first 10 simulated values
print(f"First 10 simulated values from N({mu_norm}, {sigma_norm**2}):")
print(normal_samples[:10])

# Generate points for the theoretical PDF curve
x_values_norm = np.linspace(mu_norm - 4*sigma_norm, mu_norm + 4*sigma_norm, 200)
pdf_norm = stats.norm.pdf(x_values_norm, loc=mu_norm, scale=sigma_norm)

# Create the figure (again, adapt for Plotly when deploying to the web)
plt.figure(figsize=(10, 6))

# Histogram of the simulated data
plt.hist(normal_samples, bins=30, density=True, alpha=0.6,
         color='#74c0fc', label='Simulated data (histogram)')

# Theoretical PDF
plt.plot(x_values_norm, pdf_norm, color='#f76707', linewidth=2, label='Theoretical PDF')

plt.xlabel("Value")
plt.ylabel("Density")
plt.title(f"Normal Distribution (μ={mu_norm}, σ={sigma_norm}): Simulation vs. Theory")
plt.legend()
plt.grid(True)
plt.show()
```

Figure: Histogram of 1000 samples drawn from the standard normal distribution N(0, 1), overlaid with the theoretical probability density function (PDF). The simulated data closely follows the characteristic bell curve.

Again, the histogram of the simulated data is a good approximation of the bell curve defined by the theoretical PDF. This kind of visualization helps confirm that our random samples behave as expected for a normal distribution.

Simulating and Visualizing Another Distribution: Poisson

Let's look at one more: the Poisson distribution. It gives the probability of a given number of events occurring in a fixed interval of time or space, given an average rate of occurrence $\lambda$. An example is the number of emails arriving per hour.

Suppose emails arrive at an average rate of $\lambda=5$ per hour. We will simulate the number of emails received in 1000 different hours.

```python
# Parameters of the Poisson distribution
lambda_poisson = 5   # Average rate (e.g., emails per hour)
size_poisson = 1000  # Number of simulations (hours observed)

# Draw random samples from the Poisson distribution
poisson_samples = stats.poisson.rvs(mu=lambda_poisson, size=size_poisson)

# Print the first 10 simulated counts
print(f"First 10 simulated counts (events per interval, λ={lambda_poisson}):")
print(poisson_samples[:10])

# Compute the theoretical PMF
# Choose a reasonable upper bound for k: the mean plus four standard deviations
k_max_poisson = int(lambda_poisson + 4 * np.sqrt(lambda_poisson))
k_values_poisson = np.arange(0, k_max_poisson + 1)
pmf_poisson = stats.poisson.pmf(k=k_values_poisson, mu=lambda_poisson)

# Create the figure
plt.figure(figsize=(10, 6))

# Histogram of the simulated data
# Center the bins on integers so they align correctly for discrete data
bins_poisson = np.arange(poisson_samples.min(), poisson_samples.max() + 2) - 0.5
plt.hist(poisson_samples, bins=bins_poisson, density=True, alpha=0.6,
         color='#69db7c', label='Simulated data (histogram)')

# Theoretical PMF
plt.plot(k_values_poisson, pmf_poisson, 'o-', color='#be4bdb', label='Theoretical PMF')

plt.xlabel("Number of events per interval (k)")
plt.ylabel("Probability / Density")
plt.title(f"Poisson Distribution (λ={lambda_poisson}): Simulation vs. Theory")
plt.legend()
plt.grid(True)
plt.xticks(k_values_poisson)  # Make sure ticks sit at integer values
plt.xlim(bins_poisson.min(), bins_poisson.max())
plt.show()
```

Figure: Comparison of 1000 simulated Poisson experiments (λ=5) with the theoretical probability mass function. The histogram resembles the PMF and peaks near the average rate λ.

This hands-on exercise shows how simulation helps bridge the gap between abstract distribution formulas and concrete data. By generating samples with libraries such as SciPy and plotting them against the theoretical functions with Matplotlib or Plotly, you get a better feel for the shape, spread, and characteristics of common probability distributions. This skill is valuable for exploratory data analysis, for model selection, and for understanding the probabilistic foundations of many machine learning algorithms.
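
The visual comparisons can be complemented with a simple numerical check. The sketch below is an illustration under stated assumptions, not part of the walkthrough itself: it reuses the binomial parameters (n=20, p=0.6), fixes a seed through `random_state` so the run is reproducible, and uses a larger sample size than the plots do. The names `empirical_pmf`, `theoretical_pmf`, and `max_gap` are introduced here purely for illustration.

```python
import numpy as np
from scipy import stats

# Assumed parameters, matching the binomial example (n=20, p=0.6)
n_trials, p_success = 20, 0.6
n_samples = 5000  # assumption: more samples than the plots use, for a tighter check
seed = 42         # assumption: any fixed integer makes the run reproducible

samples = stats.binom.rvs(n=n_trials, p=p_success, size=n_samples, random_state=seed)

# Empirical PMF: fraction of experiments that produced each k = 0..n_trials
empirical_pmf = np.bincount(samples, minlength=n_trials + 1) / n_samples

# Theoretical PMF on the same support
k_values = np.arange(n_trials + 1)
theoretical_pmf = stats.binom.pmf(k_values, n_trials, p_success)

# Largest pointwise gap between the empirical and theoretical PMF
max_gap = np.max(np.abs(empirical_pmf - theoretical_pmf))

# Sample moments next to their theoretical values, n*p and n*p*(1-p)
print(f"Largest PMF gap: {max_gap:.4f}")
print(f"Mean: sample {samples.mean():.3f}, theory {n_trials * p_success:.3f}")
print(f"Variance: sample {samples.var(ddof=1):.3f}, "
      f"theory {n_trials * p_success * (1 - p_success):.3f}")
```

A small gap and closely matching moments say in numbers what the histogram overlay shows visually. Rerunning with a larger `n_samples` should generally shrink the gap.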