admin 管理员组文章数量: 888526
librosa 语音库(三) librosa.feature. 中的 spectrogram 与 melspectrogram
窗口的长度与 n_fft 需要匹配大小长度;
1. Mel 语谱图的函数定义
librosa.feature.melspectrogram()
函数在spectral.py
中,实现过程为:
def melspectrogram(y=None, sr=22050, S=None, n_fft=2048, hop_length=512,power=2.0, **kwargs):S, n_fft = _spectrogram(y=y, S=S, n_fft=n_fft, hop_length=hop_length, power=power)# Build a Mel filtermel_basis = filters.mel(sr, n_fft, **kwargs)return np.dot(mel_basis, S)
可以看出 Mel_ 语谱图的计算主要有两个函数构成
- 计算出信号的语谱图(功率谱形式构成的), 由
_spectrogram()
函数实现 - 构造Mel 滤波器, 由
filters.mel
函数实现; - 将Mel 滤波器组与语谱图做矩阵乘法, 得到 mel 语谱图;
1.1 _spectrogram()
函数实现
def _spectrogram(y=None, S=None, n_fft=2048, hop_length=512, power=1):if S is not None:# Infer n_fft from spectrogram shapen_fft = 2 * (S.shape[0] - 1)else:# Otherwise, compute a magnitude spectrogram from inputS = np.abs(stft(y, n_fft=n_fft, hop_length=hop_length))**powerreturn S, n_fft
其中的核心代码部分为stft()函数,短时傅里叶变化stft函数的实现过程为:
def stft(y, n_fft=2048, hop_length=None, win_length=None, window='hann',center=True, dtype=np.complex64, pad_mode='reflect'):# By default, use the entire frameif win_length is None:win_length = n_fft# Set the default hop, if it's not already specifiedif hop_length is None:hop_length = int(win_length // 4)fft_window = get_window(window, win_length, fftbins=True)# Pad the window out to n_fft sizefft_window = util.pad_center(fft_window, n_fft)# Reshape so that the window can be broadcastfft_window = fft_window.reshape((-1, 1))# Check audio is validutil.valid_audio(y)# Pad the time series so that frames are centeredif center:y = np.pad(y, int(n_fft // 2), mode=pad_mode)# Window the time series.y_frames = util.frame(y, frame_length=n_fft, hop_length=hop_length)# Pre-allocate the STFT matrixstft_matrix = np.empty((int(1 + n_fft // 2), y_frames.shape[1]),dtype=dtype,order='F')# how many columns can we fit within MAX_MEM_BLOCK?n_columns = int(util.MAX_MEM_BLOCK / (stft_matrix.shape[0] *stft_matrix.itemsize))for bl_s in range(0, stft_matrix.shape[1], n_columns):bl_t = min(bl_s + n_columns, stft_matrix.shape[1])stft_matrix[:, bl_s:bl_t] = fft.fft(fft_window *y_frames[:, bl_s:bl_t],axis=0)[:stft_matrix.shape[0]]return stft_matrix
1.2 filters.mel
函数实现
输出是 [ n_mels, 1 + (n_fft / 2) ]
这其中需要的注意的是,不同的librosa版本的mel()函数的归一化参数传递的也是不一样的,比如python3.6中的norm可以传递1 或是 None,但是python3.7中就需要传递norm=‘slaney’.
产生一个线性变换矩阵, 将FFT 频率下的bin 的 转化到 Mel 尺度下的bin
@cache(level=10)
def mel(sr,n_fft,n_mels=128,fmin=0.0,fmax=None,htk=False,norm="slaney",dtype=np.float32,
):"""Create a Mel filter-bank.This produces a linear transformation matrix to projectFFT bins onto Mel-frequency bins.Parameters----------sr : number > 0 [scalar]sampling rate of the incoming signaln_fft : int > 0 [scalar]number of FFT componentsn_mels : int > 0 [scalar]number of Mel bands to generatefmin : float >= 0 [scalar]lowest frequency (in Hz)fmax : float >= 0 [scalar]highest frequency (in Hz).If `None`, use ``fmax = sr / 2.0``htk : bool [scalar]use HTK formula instead of Slaneynorm : {None, 'slaney', or number} [scalar]If 'slaney', divide the triangular mel weights by the width of the mel band(area normalization).If numeric, use `librosa.util.normalize` to normalize each filter by to unit l_p norm.See `librosa.util.normalize` for a full description of supported norm values(including `+-np.inf`).Otherwise, leave all the triangles aiming for a peak value of 1.0dtype : np.dtypeThe data type of the output basis.By default, uses 32-bit (single-precision) floating point.Returns-------M : np.ndarray [shape=(n_mels, 1 + n_fft/2)]Mel transform matrixSee also--------librosa.util.normalizeNotes-----This function caches at level 10.Examples-------->>> melfb = librosa.filters.mel(22050, 2048)>>> melfbarray([[ 0. , 0.016, ..., 0. , 0. ],[ 0. , 0. , ..., 0. , 0. ],...,[ 0. , 0. , ..., 0. , 0. ],[ 0. , 0. , ..., 0. , 0. ]])Clip the maximum frequency to 8KHz>>> librosa.filters.mel(22050, 2048, fmax=8000)array([[ 0. , 0.02, ..., 0. , 0. ],[ 0. , 0. , ..., 0. , 0. ],...,[ 0. , 0. , ..., 0. , 0. ],[ 0. , 0. , ..., 0. , 0. ]])>>> import matplotlib.pyplot as plt>>> fig, ax = plt.subplots()>>> img = librosa.display.specshow(melfb, x_axis='linear', ax=ax)>>> ax.set(ylabel='Mel filter', title='Mel filter bank')>>> fig.colorbar(img, ax=ax)"""if fmax is None:fmax = float(sr) / 2# Initialize the weightsn_mels = int(n_mels)weights = np.zeros((n_mels, int(1 + n_fft // 2)), dtype=dtype)# Center freqs of each FFT binfftfreqs = fft_frequencies(sr=sr, n_fft=n_fft)# 'Center freqs' of mel bands - uniformly spaced between limitsmel_f = mel_frequencies(n_mels + 2, fmin=fmin, fmax=fmax, htk=htk)fdiff = np.diff(mel_f)ramps = np.subtract.outer(mel_f, fftfreqs)for i in range(n_mels):# lower and upper slopes for all binslower = -ramps[i] / fdiff[i]upper = ramps[i + 2] / fdiff[i + 1]# .. then intersect them with each other and zeroweights[i] = np.maximum(0, np.minimum(lower, upper))if norm == "slaney":# Slaney-style mel is scaled to be approx constant energy per channelenorm = 2.0 / (mel_f[2 : n_mels + 2] - mel_f[:n_mels])weights *= enorm[:, np.newaxis]else:weights = util.normalize(weights, norm=norm, axis=-1)# Only check weights if f_mel[0] is positiveif not np.all((mel_f[:-2] == 0) | (weights.max(axis=1) > 0)):# This means we have an empty channel somewherewarnings.warn("Empty filters detected in mel frequency basis. ""Some channels will produce empty responses. ""Try increasing your sampling rate (and fmax) or ""reducing n_mels.")return weights
2. 实例分析
假设 调用形式如下 :
librosa.feature.melspectrogram(y=current_window ,sr=sample_rate, n_mels=n_mels, fmin=f_min, fmax=f_max, n_fft=nfft, hop_length=hop)
def melspectrogram(y=None,sr=22050,S=None,n_fft=2048,hop_length=512,win_length=None,window="hann",center=True,pad_mode="reflect",power=2.0,**kwargs,
):S, n_fft = _spectrogram(y=y,S=S,n_fft=n_fft,hop_length=hop_length,power=power,win_length=win_length,window=window,center=center,pad_mode=pad_mode,)# Build a Mel filtermel_basis = filters.mel(sr, n_fft, **kwargs)return np.dot(mel_basis, S)
输入参数:
y = 32000 个点的音频数据,
sr = 4000;
n_mels = 256;
fmin = 50 , fmax = 2000
n_fft: fft 点数800;
hop_length = win_length / 4 = n_fft / 4
返回:
mel 语谱图的输出大小为 [n_mels, n_frames ],
Mel 语谱图的输出结果是, Mel 滤波器组 和 功率谱格式的语谱图 两者共同作用的结果。
[ n_mels, 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft ] ×[ 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft, n_frames ]
= [n_mels, n_frames ]
注意:
Mel 语谱图函数的输入, 其中的
S
这个参数量, 代表了输入中是否提供了语谱图的输入。
S
如果提供了语谱图输入, 此时S
直接使用输入的语谱图;S
如果没有提供语谱图的输入, 需要根据输入的一维数据y
, 先经过STFT 得到频谱, 然后在取模值的平方, 得到频谱的功率谱形式, 此时的功率谱便作为语谱图。
最终 mel 语谱图的输出大小为 [n_mels, frames ], 由 Mel 滤波器组和短时傅里叶变换STFT 两个 输出矩阵,相乘共同作用到的结果。
该函数中调用了两个函数实现:
2.1 STFT 的输出帧数
根据以上参数
输入参数:
y = 32000 个点的音频数据,
sr = 4000;
n_mels = 256;
fmin = 50 , fmax = 2000
n_fft: fft 点数800;
hop_length = win_length / 4 = n_fft / 4
调用 STFT ,得到的输出 [ 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft, n_frames ]
-
行数: 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft = 1+ (800/2) = 401
-
列数:等于帧数, 帧数
n_frames
求法:
n f r a m e s = a u d i o l e n g t h / / h o p l e n g t h + 1 , n_{frames} = audio_{length} // hop_{length} + 1, nframes=audiolength//hoplength+1,
带入上式中的, 可以求得:
n_frams = (32000 // 200) + 1 = 161,
故此时的STFT 输出 [ 401, 161]
2.2 filter.mels(sr, n_fft, **kwargs)
调用 filter.mels(sr, n_fft, n_mels=128, fmin=0.0, fmax=None, htk=False, norm=1 )
函数, 生成一个 Mel 滤波器组;
输入参数:
y = 32000 个点的音频数据,
sr = 4000;
n_mels = 256;
fmin = 50 , fmax = 2000
n_fft: fft 点数800;
hop_length = win_length / 4 = n_fft / 4
其中有几个重要的参数,需要关注以下
- n f f t n_{fft} nfft 的点数选取, n f f t n_{fft} nfft 的点数越多, CPU需要的运算量也增加,可以分别的频率也精细;
该参数与与频率分辨率有关:
sr
采样率为 16 KHZ;
假设将音频的频率分辨率设置为 10 HZ,
频率分辨率 bin = F s N f f t \frac{F_{s}}{N_{fft}} NfftFs = 采 样 率 N f f t 的 点 数 \frac{采样率}{N_{fft}的点数} Nfft的点数采样率
N f f t N_{fft} Nfft = F s 10 \frac{F_{s}}{10} 10Fs = 16000 10 \frac{16000}{10} 1016000 =1600 点;
2.2.1 Mel 频率
本文标签: librosa 语音库(三) librosafeature 中的 spectrogram 与 melspectrogram
版权声明:本文标题:librosa 语音库(三) librosa.feature. 中的 spectrogram 与 melspectrogram 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.freenas.com.cn/jishu/1686637053h20226.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
发表评论