Traceback (most recent call last):
  File "/Users/pan/voice2text/main.py", line 30, in <module>
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
  File "/Users/pan/voice2text/.venv/lib/python3.10/site-packages/whisper/audio.py", line 138, in log_mel_spectrogram
    stft = torch.stft(audio, N_FFT, HOP_LENGTH, window=window, return_complex=True)
  File "/Users/pan/voice2text/.venv/lib/python3.10/site-packages/torch/functional.py", line 639, in stft
    input = F.pad(input.view(extended_shape), [pad, pad], pad_mode)
RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (200, 200) at dimension 2 of input [1, 441000, 1]
This error means that torch.stft is trying to apply reflection padding of (200, 200) samples (half of Whisper's N_FFT of 400) to dimension 2 of a tensor shaped [1, 441000, 1], and that dimension has size 1, smaller than the padding. In other words, the audio passed to whisper.log_mel_spectrogram still has a trailing channel axis: it is a 2-D array of shape (441000, 1) rather than the 1-D mono waveform Whisper expects, so the padding lands on the size-1 channel dimension instead of the time dimension. The 441000 samples also suggest the file was decoded at 44.1 kHz (10 seconds of audio), whereas Whisper works on 16 kHz audio.
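A quick way to confirm this is to print the tensor's shape before the call. The snippet below is only a stand-in reproduction, since the loading code is not shown in the traceback; the synthetic tensor just mimics the shape reported by the error:

import torch

# Stand-in for an audio clip loaded with a trailing channel axis, i.e. (samples, channels).
audio = torch.randn(441000, 1)

print(audio.shape)               # torch.Size([441000, 1]) -> 2-D, triggers the padding error
print(audio.squeeze(-1).shape)   # torch.Size([441000])    -> 1-D, the shape Whisper expects

# whisper.log_mel_spectrogram(audio) would raise the RuntimeError above, while
# whisper.log_mel_spectrogram(audio.squeeze(-1)) runs (though the audio still
# needs to be resampled to 16 kHz for the result to be meaningful).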
To fix this error, rather than reducing the padding or changing the STFT parameters (the 200-sample pad is N_FFT // 2 and both N_FFT = 400 and HOP_LENGTH = 160 are fixed by Whisper), pass log_mel_spectrogram what it expects: a 1-D mono waveform sampled at 16 kHz. The easiest way is to let Whisper load the file itself with whisper.load_audio, which decodes with ffmpeg, downmixes to mono, and resamples to 16 kHz; otherwise, squeeze or average away the channel dimension and resample the tensor yourself before computing the spectrogram.
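A minimal sketch of that fix, assuming the rest of main.py stays the same (the model size and file path below are placeholders, not taken from the original script):

import whisper

model = whisper.load_model("base")            # placeholder model size

# load_audio decodes the file with ffmpeg, downmixes to mono, and resamples it
# to 16 kHz, returning a 1-D float32 NumPy array, which is exactly the shape
# log_mel_spectrogram expects.
audio = whisper.load_audio("recording.wav")   # placeholder path
audio = whisper.pad_or_trim(audio)            # pad/trim to the 30-second window Whisper uses

mel = whisper.log_mel_spectrogram(audio).to(model.device)

If the audio is already in memory as a (samples, 1) tensor at 44.1 kHz, an alternative is to flatten and resample it first, for example with torchaudio.functional.resample(audio.squeeze(-1), 44100, 16000), and then call pad_or_trim and log_mel_spectrogram on the result.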