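The sound we hear is a continuous wave of pressure travelling through a medium such as air. Computers, however, work with discrete digital data. To bridge that gap, we must convert the analog sound wave into digital form. This conversion underlies all digital audio processing and consists of two main steps: sampling and quantization.

Understanding Analog Signals

An analog signal is continuous in both time and amplitude. Picture the sound wave produced when a person speaks. At any given instant the wave has a specific amplitude (related to its loudness), and it flows smoothly from one instant to the next without interruption. A computer cannot use this smooth, infinite stream of information directly; we need a way to obtain a finite approximation of it.

Step 1: Sampling

The first step in the conversion is sampling: measuring the amplitude of the analog signal at fixed, discrete time intervals. It is like making a flip book of the sound wave. Each page of the flip book is a "sample", a snapshot of the wave's amplitude at a particular point in time.

The rate at which these snapshots are taken is called the sampling rate or sampling frequency, measured in hertz (Hz). A sampling rate of 16,000 Hz (16 kHz) means we measure the wave's amplitude 16,000 times per second.

The choice of sampling rate matters. The Nyquist-Shannon sampling theorem states that, to reconstruct a signal accurately, the sampling rate must be at least twice the highest frequency present in the signal. Because the frequency content of human speech usually lies below 8 kHz, a 16 kHz sampling rate is common in speech recognition: it carries enough information to capture the important features of spoken language. For music, which contains higher frequencies, rates such as 44.1 kHz (used by CDs) are standard.

Step 2: Quantization

After sampling we have a sequence of measurements at discrete time intervals, but each sample's amplitude is still a real number with potentially unlimited precision. Quantization maps these continuous amplitude values onto a finite set of discrete levels. It is essentially rounding: we define a fixed number of allowed amplitude values, and each sample's true amplitude is rounded to the nearest available level.

The number of levels is determined by the bit depth. A higher bit depth provides more levels and therefore a closer approximation of the original amplitude. An 8-bit audio signal uses $2^8 = 256$ discrete levels. A 16-bit audio signal uses $2^{16} = 65{,}536$ discrete levels.

For most ASR applications, a bit depth of 16 is standard; it offers a good balance between audio fidelity and file size. A lower bit depth saves space but increases the quantization error, the difference between a sample's actual amplitude and its rounded, quantized value. When this error is large enough it becomes audible distortion, known as quantization noise.

The figure below shows a continuous analog wave being converted to a digital signal through sampling and quantization.

{ "layout":{ "xaxis":{"title":"Time","range":[-0.05, 1.05]}, "yaxis":{"title":"Amplitude","range":[-1.1, 1.1]}, "showlegend":true, "margin":{"l":50,"r":20,"t":20,"b":40} }, "data":[
{"x":[0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1],"y":[-0.75,-0.75,-0.75,-0.75,-0.75,-0.75,-0.75,-0.75,-0.75,-0.75,-0.75],"mode":"lines","type":"scatter","name":"Quantization levels","line":{"color":"#ced4da","dash":"dot"},"hoverinfo":"none"},
{"x":[0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1],"y":[-0.25,-0.25,-0.25,-0.25,-0.25,-0.25,-0.25,-0.25,-0.25,-0.25,-0.25],"mode":"lines","type":"scatter","name":"Quantization levels","line":{"color":"#ced4da","dash":"dot"},"showlegend":false,"hoverinfo":"none"},
{"x":[0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1],"y":[0.25,0.25,0.25,0.25,0.25,0.25,0.25,0.25,0.25,0.25,0.25],"mode":"lines","type":"scatter","name":"Quantization levels","line":{"color":"#ced4da","dash":"dot"},"showlegend":false,"hoverinfo":"none"},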
{"x":[0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1],"y":[0.75,0.75,0.75,0.75,0.75,0.75,0.75,0.75,0.75,0.75,0.75],"mode":"lines","type":"scatter","name":"Quantization levels","line":{"color":"#ced4da","dash":"dot"},"showlegend":false,"hoverinfo":"none"},
{"x":[0,0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1,0.11,0.12,0.13,0.14,0.15,0.16,0.17,0.18,0.19,0.2,0.21,0.22,0.23,0.24,0.25,0.26,0.27,0.28,0.29,0.3,0.31,0.32,0.33,0.34,0.35,0.36,0.37,0.38,0.39,0.4,0.41,0.42,0.43,0.44,0.45,0.46,0.47,0.48,0.49,0.5,0.51,0.52,0.53,0.54,0.55,0.56,0.57,0.58,0.59,0.6,0.61,0.62,0.63,0.64,0.65,0.66,0.67,0.68,0.69,0.7,0.71,0.72,0.73,0.74,0.75,0.76,0.77,0.78,0.79,0.8,0.81,0.82,0.83,0.84,0.85,0.86,0.87,0.88,0.89,0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,1],"y":[0,-0.309,-0.588,-0.809,-0.951,-1,-0.951,-0.809,-0.588,-0.309,0,0.309,0.588,0.809,0.951,1,0.951,0.809,0.588,0.309,0,-0.309,-0.588,-0.809,-0.951,-1,-0.951,-0.809,-0.588,-0.309,0,0.309,0.588,0.809,0.951,1,0.951,0.809,0.588,0.309,0,-0.309,-0.588,-0.809,-0.951,-1,-0.951,-0.809,-0.588,-0.309,0,0.309,0.588,0.809,0.951,1,0.951,0.809,0.588,0.309,0,-0.309,-0.588,-0.809,-0.951,-1,-0.951,-0.809,-0.588,-0.309,0,0.309,0.588,0.809,0.951,1,0.951,0.809,0.588,0.309,0,-0.309,-0.588,-0.809,-0.951,-1,-0.951,-0.809,-0.588,-0.309,0,0.309,0.588,0.809,0.951,1,0.951,0.809,0.588,0.309,0],"mode":"lines","name":"Analog signal","line":{"color":"#4263eb","width":3}},
{"x":[0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1],"y":[0,0,0,0.25,0.25,0,-0.25,-0.75,-0.75,0,0],"type":"scatter","mode":"markers","name":"Quantized samples","marker":{"color":"#be4bdb","size":10,"symbol":"square"}}
] }

The analog wave (blue) is measured at fixed time intervals (sampling). Each measured amplitude is then rounded to the nearest discrete level (gray lines), yielding the final digital points (purple squares).

The Result: Digital Audio Data

Together, sampling and quantization turn a continuous analog wave into a sequence of discrete numbers. This sequence is the digital representation of the audio, a format that a computer can easily store, manipulate, and analyze.

digraph G {
  rankdir=TB;
  node [shape=box, style="rounded,filled", fontname="sans-serif"];
  edge [fontname="sans-serif"];
  // Define the nodes
  analog [label="Analog sound wave", fillcolor="#a5d8ff"];
  sampled [label="Sampled signal\n(discrete time, continuous amplitude)", fillcolor="#91a7ff"];
  digital [label="Digital signal\n(discrete time, discrete amplitude)", fillcolor="#748ffc"];
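  // Define the connections between them
  analog -> sampled [label=" Sampling "];
  sampled -> digital [label=" Quantization "];
}

The process of converting an analog sound wave into a computer-readable digital signal.

This sequence of numbers is the input to the later stages of our ASR pipeline. In the following sections, you will learn how to turn this raw digital audio into features that are more useful to machine learning models.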
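
As a recap of the two steps described above, here is a minimal sketch in plain Python. It is an illustration under stated assumptions rather than a reference implementation: the helper names (`sample_sine`, `quantize`, `dequantize`, `max_quantization_error`) are made up for this example, a 440 Hz sine tone stands in for a real analog signal, and amplitudes are assumed to lie in [-1.0, 1.0]. Scaling by $2^{b-1} - 1$ for a bit depth of $b$ is one common convention for signed integer samples, not the only one.

```python
import math


def sample_sine(freq_hz, sample_rate_hz, duration_s):
    """Sampling: measure a sine wave's amplitude at evenly spaced instants."""
    n_samples = int(round(sample_rate_hz * duration_s))
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]


def quantize(samples, bit_depth):
    """Quantization: round amplitudes in [-1.0, 1.0] to signed integer codes."""
    max_code = 2 ** (bit_depth - 1) - 1    # e.g. 32767 for 16-bit
    min_code = -(2 ** (bit_depth - 1))     # e.g. -32768 for 16-bit
    codes = []
    for s in samples:
        code = int(round(s * max_code))    # round to the nearest level
        codes.append(max(min_code, min(max_code, code)))  # clamp out-of-range input
    return codes


def dequantize(codes, bit_depth):
    """Map integer codes back to approximate amplitudes in [-1.0, 1.0]."""
    max_code = 2 ** (bit_depth - 1) - 1
    return [c / max_code for c in codes]


def max_quantization_error(samples, bit_depth):
    """Largest gap between a sample and its quantized approximation."""
    restored = dequantize(quantize(samples, bit_depth), bit_depth)
    return max(abs(a - b) for a, b in zip(samples, restored))


# Assumed test signal: 10 ms of a 440 Hz tone sampled at 16 kHz.
samples = sample_sine(freq_hz=440.0, sample_rate_hz=16_000, duration_s=0.01)
print(len(samples))  # 160 samples: 16,000 per second for 0.01 s

for bits in (8, 16):
    print(bits, "bits:", 2 ** bits, "levels, max error",
          max_quantization_error(samples, bits))
```

Running the loop should show the maximum error shrinking sharply when moving from 8 to 16 bits, which is the fidelity versus size trade-off discussed in the quantization step. With this scaling, the rounding error of any in-range sample is bounded by half a quantization step, that is $0.5 / (2^{b-1} - 1)$.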