AI: Keras, Librosa, WebRTC Voice Activity Detector, and Speech Recognition

바람사탕 2024. 5. 23. 23:58

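Keras is a high-level neural network API written in Python. Older multi-backend releases (up to Keras 2.3) ran on top of deep learning frameworks such as TensorFlow, Theano, and CNTK; Keras 3 runs on TensorFlow, JAX, or PyTorch. Keras offers a user-friendly, modular interface designed for building and experimenting with deep learning models quickly. The main features and concepts are outlined below.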

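Key features

1. User-friendly:
   - An intuitive, consistent API makes it quick to build models and run experiments.
   - The simple structure keeps code clean and easy to read.

2. Modular:
   - Models, layers, activation functions, optimizers, and other components exist independently and can be combined to build a model.
   - Each component can be swapped or tuned as needed.

3. Extensible:
   - New modules are easy to add, which suits research and experimentation.
   - Custom layers, loss functions, and metrics can be implemented by subclassing the Keras base classes.

4. Backend flexibility:
   - Older multi-backend Keras ran on TensorFlow, Theano, or Microsoft Cognitive Toolkit (CNTK); Keras 3 runs on TensorFlow, JAX, or PyTorch.
   - Switching backends requires only minimal changes to Keras code.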

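Key concepts

1. Model:
   - In Keras, a model is the object that defines a neural network. The two main ways to build one are the Sequential model and the Functional API.

2. Layer:
   - Layers are the basic building blocks of a neural network. Keras provides many layer types (for example Dense, Convolutional, and LSTM).

3. Compile:
   - Before training, the model is configured with an optimizer, a loss function, and evaluation metrics.

4. Training:
   - The process of fitting the model to data, performed with the `fit` method.

5. Evaluation:
   - The process of measuring the trained model's performance on test data.

6. Prediction:
   - The process of producing model outputs for new data.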

A simple example

The example below trains a simple Sequential model on the MNIST dataset.

from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.datasets import mnist
from keras.utils import to_categorical

# Load and preprocess the dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.astype('float32') / 255  # shape (60000, 28, 28), scaled to [0, 1]
test_images = test_images.astype('float32') / 255    # shape (10000, 28, 28)
train_labels = to_categorical(train_labels)          # one-hot encode the digit labels
test_labels = to_categorical(test_labels)

# Define the model
model = Sequential([
    Flatten(input_shape=(28, 28)),  # flatten each 28x28 image into a 784-element vector
    Dense(512, activation='relu'),
    Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(train_images, train_labels, epochs=5, batch_size=128)

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f'Test accuracy: {test_acc}')

This example uses the Keras Sequential API to define a simple neural network, trains it on the MNIST dataset, and then evaluates it on the test data.
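To make the label encoding and the loss in the example less opaque, here is a minimal pure-Python sketch of what one-hot encoding, softmax, and categorical cross-entropy compute. The helper names `one_hot`, `softmax`, and `categorical_crossentropy` are hypothetical and written only for illustration; they are not the Keras implementations, which operate on batched arrays.

```python
import math

def one_hot(label, num_classes):
    """Return a one-hot list for a single integer label."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(target, probs, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))

target = one_hot(3, 10)             # the digit 3 out of 10 classes
probs = softmax([0.0] * 10)         # an untrained model guessing uniformly
loss = categorical_crossentropy(target, probs)
print(target)
print(round(loss, 4))               # ln(10), about 2.3026
```

A uniform guess over ten classes gives a loss of ln(10); training should push the loss below that value.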

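Keras is a powerful tool for building and experimenting with deep learning models quickly. Its high level of abstraction and ease of use make it a popular choice among researchers and developers. It supports multiple backends, and its active community means documentation and tutorials are easy to find.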

Librosa

Librosa is a Python library for music and audio analysis. It provides functions useful for audio signal processing and music information retrieval (MIR): loading audio data, time-frequency transforms, pitch extraction, beat tracking, tempo estimation, and more.

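Key features

1. Audio loading:
   - Loads audio files in various formats (WAV, MP3, and others). Recent versions no longer write audio files themselves (`librosa.output` was removed in 0.8); saving is usually done with the soundfile package.

2. Time-frequency transforms:
   - Computes the short-time Fourier transform (STFT), mel spectrograms, chroma features, and similar representations.

3. Music information extraction:
   - Provides tempo estimation, beat tracking, harmonic-percussive source separation (HPSS), pitch extraction, and other music analysis functions.

4. Audio feature extraction:
   - Extracts features such as MFCCs (mel-frequency cepstral coefficients), Tonnetz features, and delta features.

5. Effects:
   - Applies effects such as time stretching and pitch shifting. Volume can be adjusted by scaling the NumPy sample array directly.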

Installation

Librosa can be installed with pip.

pip install librosa

A simple example

The example below loads an audio file with librosa, computes a spectrogram, and visualizes the result.

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load the audio file
filename = 'your_audio_file.wav'
y, sr = librosa.load(filename)

# Compute the STFT of the audio signal
D = librosa.stft(y)

# Convert the STFT magnitude to decibels (this is a plain spectrogram, not a mel spectrogram)
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)

# Plot the spectrogram on a logarithmic frequency axis
plt.figure(figsize=(10, 4))
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Log-frequency spectrogram')
plt.tight_layout()
plt.show()
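The shape of the STFT matrix and the decibel scale used in the plot follow from simple arithmetic. The sketch below reproduces that arithmetic in pure Python, assuming librosa's default parameters (`n_fft=2048`, `hop_length=512`, centered frames, 22050 Hz resampling). The helpers `stft_shape` and `amplitude_to_db` are hypothetical simplifications written for illustration, not the library functions.

```python
import math

def stft_shape(num_samples, n_fft=2048, hop_length=512):
    """Shape (freq_bins, frames) of a centered STFT with the given parameters."""
    freq_bins = 1 + n_fft // 2
    frames = 1 + num_samples // hop_length
    return freq_bins, frames

def amplitude_to_db(amplitude, ref=1.0, floor=1e-5):
    """Decibels relative to ref: 20 * log10(amplitude / ref), with a small floor."""
    return 20.0 * math.log10(max(amplitude, floor) / ref)

sr = 22050                      # librosa.load resamples to 22050 Hz by default
shape = stft_shape(3 * sr)      # three seconds of audio
print(shape)                    # (1025, 130)
print(amplitude_to_db(0.5))     # about -6.02 dB
```

With `ref=np.max`, as in the example above, the loudest bin maps to 0 dB and everything else is negative.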

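Additional features

Librosa offers many more features; see the [official documentation](https://librosa.org/doc/latest/index.html) for details. A few examples follow, reusing `y` and `sr` from the example above.

Tempo estimation

import numpy as np

tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
# Recent librosa versions may return tempo as a NumPy array, so take a scalar before formatting
tempo_bpm = float(np.atleast_1d(tempo)[0])
print('Estimated tempo: {:.2f} BPM'.format(tempo_bpm))

Pitch extraction

pitches, magnitudes = librosa.piptrack(y=y, sr=sr)

Harmonic-percussive source separation

y_harmonic, y_percussive = librosa.effects.hpss(y)

Through a powerful, intuitive API, Librosa provides a range of tools for music and audio signal processing, letting researchers and developers analyze and process audio data efficiently.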

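Code that uses Librosa to check whether 사운드파일.wav contains a human voice

Checking whether a sound file contains a human voice is a job for voice activity detection (VAD), which detects human speech in an audio signal. Librosa itself does not provide VAD, but there are several ways to approximate it. The example below uses Librosa together with the PyDub library for basic detection. Both methods measure signal energy, so they detect non-silent audio rather than speech specifically.

First, install the required libraries:

pip install librosa pydub numpy scipy

Now write the Python code that performs the detection: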

import librosa
import numpy as np
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

def detect_voice(audio_path, frame_length=2048, hop_length=512, threshold=1.5):
    # Load audio file at its native sample rate
    y, sr = librosa.load(audio_path, sr=None)

    # Compute the short-time Fourier transform (STFT)
    stft = librosa.stft(y, n_fft=frame_length, hop_length=hop_length)

    # Convert the STFT to magnitude
    magnitude = np.abs(stft)

    # Compute the root mean square (RMS) energy per frame
    rms = librosa.feature.rms(S=magnitude, frame_length=frame_length)

    # Find frames whose RMS energy exceeds threshold times the mean RMS.
    # The multiplier is a tunable assumption; a large value such as 20 is exceeded
    # only when a short burst stands out from a mostly silent file.
    voice_frames = np.where(rms[0] > (threshold * np.mean(rms[0])))[0]

    return len(voice_frames) > 0

def detect_voice_pydub(audio_path, min_silence_len=1000, silence_thresh=-30):
    # Load audio file using PyDub
    audio = AudioSegment.from_wav(audio_path)

    # Detect non-silent chunks (silence_thresh is in dBFS, min_silence_len in ms)
    nonsilent_chunks = detect_nonsilent(audio, min_silence_len=min_silence_len, silence_thresh=silence_thresh)

    return len(nonsilent_chunks) > 0

audio_path = '사운드파일.wav'

# Method 1: Using Librosa
voice_detected_librosa = detect_voice(audio_path)
print("Voice detected (Librosa):", voice_detected_librosa)

# Method 2: Using PyDub
voice_detected_pydub = detect_voice_pydub(audio_path)
print("Voice detected (PyDub):", voice_detected_pydub)

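Explanation:

1. The Librosa method:
   - Loads the audio file with `librosa.load`.
   - Computes the STFT and derives the RMS energy from it.
   - Checks whether any frame's RMS energy exceeds the threshold.

2. The PyDub method:
   - Loads the audio file with PyDub's `AudioSegment`.
   - Detects non-silent intervals with the `detect_nonsilent` function.
   - Treats any non-silent interval as a sign that a voice is present, although music or noise would pass this test as well.

These two methods show only basic, energy-based detection. For more precise results, use a dedicated VAD algorithm or a deep learning based speech model.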
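The limitation of energy-based detection can be seen without any audio file. The following pure-Python sketch builds one second of silence followed by one second of a 440 Hz sine tone and applies a relative RMS threshold. The helper `frame_rms`, the frame sizes, and the 1.5 multiplier are all assumptions chosen for illustration.

```python
import math

def frame_rms(samples, frame_length=400, hop_length=200):
    """Root-mean-square energy of each frame of a list of float samples."""
    values = []
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frame = samples[start:start + frame_length]
        values.append(math.sqrt(sum(s * s for s in frame) / frame_length))
    return values

sr = 8000
silence = [0.0] * sr                                   # 1 s of digital silence
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr)     # 1 s of a 440 Hz sine
        for n in range(sr)]
rms = frame_rms(silence + tone)

threshold = 1.5 * (sum(rms) / len(rms))                # hypothetical: 1.5x the mean RMS
active = [value > threshold for value in rms]
print(any(active))                                     # True, yet nothing here is speech
```

The tone is flagged as active even though it contains no speech, which is why a dedicated VAD is the better choice when the question is specifically about voices.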

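Tools, software, and libraries suited to voice activity detection (VAD)

Several tools, software packages, and libraries are suited to voice activity detection (VAD). Each offers different features and characteristics, so choose according to your needs.

1. WebRTC Voice Activity Detector
The VAD provided by the WebRTC project is popular and widely used.

- Features: good accuracy, tested in a variety of environments.
- Language: C (part of the WebRTC codebase), with Python bindings
- Usage: mainly used in real-time communication; also available through Python bindings.
- Link: [WebRTC VAD](https://webrtc.org/)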

2. PyDub & SpeechRecognition
The PyDub and SpeechRecognition libraries can be used for basic voice activity detection. The snippet below uses only PyDub's silence detection.

- Features: simple to use, basic detection.
- Language: Python
- Usage:

pip install pydub speechrecognition

from pydub import AudioSegment
from pydub.silence import detect_nonsilent

def detect_voice_pydub(audio_path, min_silence_len=1000, silence_thresh=-30):
    audio = AudioSegment.from_file(audio_path)
    nonsilent_chunks = detect_nonsilent(audio, min_silence_len=min_silence_len, silence_thresh=silence_thresh)
    return len(nonsilent_chunks) > 0

3. librosa
librosa is a Python package for audio and music analysis. It has no built-in VAD, but a basic energy-based detector can be implemented with it.

- Features: signal processing functions, well suited to music and audio analysis.
- Language: Python
- Usage:

pip install librosa

import librosa
import numpy as np

def detect_voice_librosa(audio_path, frame_length=2048, hop_length=512, threshold=1.5):
    y, sr = librosa.load(audio_path, sr=None)
    stft = librosa.stft(y, n_fft=frame_length, hop_length=hop_length)
    magnitude = np.abs(stft)
    rms = librosa.feature.rms(S=magnitude, frame_length=frame_length)
    # threshold is a multiplier on the mean RMS; tune it for your recordings
    voice_frames = np.where(rms[0] > (threshold * np.mean(rms[0])))[0]
    return len(voice_frames) > 0

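4. openSMILE
openSMILE (open-source Speech and Music Interpretation by Large-space Extraction) provides a wide range of audio analysis features, including voice activity detection.

- Features: powerful, broad audio analysis.
- Language: C++, with Python bindings
- Usage: mainly used as a command-line tool; Python bindings also exist.
- Link: [OpenSMILE](https://www.audeering.com/opensmile/)

5. TensorFlow & PyTorch
TensorFlow or PyTorch can be used to build a deep learning based VAD model.

- Features: high flexibility, access to recent deep learning models.
- Language: Python
- Usage: use a pretrained model or train a custom one.

6. Google Cloud Speech-to-Text
Google Cloud's Speech-to-Text API is primarily a speech recognition service; its streaming recognition can also report voice activity.

- Features: cloud-based, high accuracy.
- Language: Python, Java, Node.js, and others
- Usage: enable the API in the Google Cloud console, then use a client library for speech recognition and voice activity detection.
- Link: [Google Cloud Speech-to-Text](https://cloud.google.com/speech-to-text)

With these tools and libraries, voice activity detection can be carried out effectively in a variety of environments.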

WebRTC Voice Activity Detector

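The WebRTC Voice Activity Detector (VAD) is a voice activity detection library developed as part of the WebRTC project. It is mainly used in real-time communication applications, where it separates speech from non-speech segments to support speech recognition and processing.

Key features

1. Accuracy: it has been tested in a variety of environments and is designed to keep working in noisy conditions.
2. Lightweight: the VAD uses little CPU and memory, an important advantage for real-time communication applications.
3. Multiple modes: the VAD offers several aggressiveness modes that can be tuned to the application. A higher mode filters out non-speech more aggressively, which reduces the chance of noise being mistaken for speech but raises the chance of missing quiet speech.
4. Cross-platform: WebRTC VAD is designed to run on many platforms, so it can be used in mobile apps as well as browser-based applications.

Usage

WebRTC VAD is written in C and can be used from Python through bindings. The following example uses WebRTC VAD from Python.

Installing the Python bindings
The Python bindings for WebRTC VAD are distributed as the `webrtcvad` package.

pip install webrtcvad

Usage example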

import collections
import wave

import webrtcvad

# Read a WAV file and return its raw PCM bytes and sample rate
def read_wave(path):
    with wave.open(path, 'rb') as wf:
        num_channels = wf.getnchannels()
        assert num_channels == 1          # mono only
        sample_width = wf.getsampwidth()
        assert sample_width == 2          # 16-bit PCM only
        sample_rate = wf.getframerate()
        assert sample_rate in (8000, 16000, 32000, 48000)
        frames = wf.readframes(wf.getnframes())
        return frames, sample_rate

# Voice activity detection: yield byte strings of contiguous voiced audio
def vad_collector(sample_rate, frame_duration_ms, padding_duration_ms, vad, frames):
    num_padding_frames = int(padding_duration_ms / frame_duration_ms)
    ring_buffer = collections.deque(maxlen=num_padding_frames)
    triggered = False

    voiced_frames = []
    for frame in frames:
        is_speech = vad.is_speech(frame, sample_rate)

        if not triggered:
            # Start a segment once more than 90% of the buffered frames are speech
            ring_buffer.append((frame, is_speech))
            num_voiced = len([f for f, speech in ring_buffer if speech])
            if num_voiced > 0.9 * ring_buffer.maxlen:
                triggered = True
                voiced_frames.extend([f for f, s in ring_buffer])
                ring_buffer.clear()
        else:
            # End the segment once more than 90% of the buffered frames are non-speech
            voiced_frames.append(frame)
            ring_buffer.append((frame, is_speech))
            num_unvoiced = len([f for f, speech in ring_buffer if not speech])
            if num_unvoiced > 0.9 * ring_buffer.maxlen:
                triggered = False
                yield b''.join(voiced_frames)
                ring_buffer.clear()
                voiced_frames = []
    if voiced_frames:
        yield b''.join(voiced_frames)

# Create the VAD object and run detection
vad = webrtcvad.Vad()
vad.set_mode(1)  # 0 to 3; higher values filter out non-speech more aggressively

pcm, sample_rate = read_wave('your_audio_file.wav')
frame_duration_ms = 30  # must be 10, 20, or 30
# Each 16-bit sample takes 2 bytes, so convert the frame length from samples to bytes
frame_bytes = int(sample_rate * frame_duration_ms / 1000) * 2
# Split into whole frames; a trailing partial frame is dropped because the VAD rejects it
frames = [pcm[i:i + frame_bytes] for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)]

segments = vad_collector(sample_rate, frame_duration_ms, 300, vad, frames)

for i, segment in enumerate(segments):
    path = 'chunk-%02d.wav' % (i,)
    print(' Writing %s' % (path,))
    with wave.open(path, 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(segment)
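The trigger logic inside `vad_collector` is easier to follow when it is run on plain booleans instead of audio. The sketch below is an illustrative re-implementation under that assumption: `speech_segments` is a hypothetical helper, and `flags` stands in for the per-frame results of `vad.is_speech`. It returns frame index ranges rather than audio bytes.

```python
import collections

def speech_segments(flags, window=10, ratio=0.9):
    """Return (start, end) frame index pairs using the same trigger rule.

    flags is a list of per-frame booleans standing in for VAD decisions.
    """
    ring = collections.deque(maxlen=window)
    triggered = False
    start = None
    segments = []
    for index, is_speech in enumerate(flags):
        ring.append(is_speech)
        if not triggered:
            # Open a segment when more than ratio of the window is speech
            if sum(ring) > ratio * window:
                triggered = True
                start = index - len(ring) + 1   # include the buffered lead-in
                ring.clear()
        else:
            # Close the segment when more than ratio of the window is non-speech
            if len(ring) - sum(ring) > ratio * window:
                triggered = False
                segments.append((start, index + 1))
                ring.clear()
    if triggered:
        segments.append((start, len(flags)))
    return segments

flags = [False] * 5 + [True] * 30 + [False] * 20
print(speech_segments(flags))   # [(5, 45)]
```

The segment starts at the first speech frame and runs ten frames past the last one, because the window has to fill with non-speech before the segment closes. That trailing padding is what keeps short pauses from splitting an utterance.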

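Key settings

- Mode: `vad.set_mode(mode)` sets the aggressiveness mode. Modes range from 0 to 3; a higher number filters out non-speech more aggressively, so fewer frames are reported as speech.
- Frame length: WebRTC VAD supports frame lengths of 10 ms, 20 ms, and 30 ms. Longer frames can make detection more reliable, but shorter frames are preferable for real-time processing.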
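Because the VAD accepts only whole frames of these lengths, it helps to work out the frame size in bytes before slicing raw PCM data. The sketch below assumes 16-bit mono audio; `frame_bytes` and `split_frames` are hypothetical helper names used for illustration.

```python
def frame_bytes(sample_rate, frame_duration_ms, sample_width=2):
    """Bytes in one mono PCM frame of the given duration."""
    samples_per_frame = sample_rate * frame_duration_ms // 1000
    return samples_per_frame * sample_width

def split_frames(pcm, sample_rate, frame_duration_ms):
    """Split raw PCM bytes into whole frames, dropping a trailing partial frame."""
    size = frame_bytes(sample_rate, frame_duration_ms)
    return [pcm[i:i + size] for i in range(0, len(pcm) - size + 1, size)]

pcm = bytes(16000 * 2)                    # one second of 16 kHz, 16-bit silence
frames = split_frames(pcm, 16000, 30)
print(frame_bytes(16000, 30))             # 960
print(len(frames), len(frames[-1]))       # 33 960
```

A 30 ms frame at 16 kHz holds 480 samples, which is 960 bytes because each 16-bit sample takes two bytes. Slicing by the sample count alone would produce frames of half the intended duration.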

WebRTC VAD is a very useful tool for real-time voice activity detection. Its accuracy and small footprint make it suitable for many applications, and the Python bindings make it easy to integrate.

WebRTC Voice Activity Detector: C example code

The following is an example of using the WebRTC Voice Activity Detector (VAD) from C. WebRTC VAD detects voice activity in audio frames. This example reads an audio file, runs the VAD on it, and prints whether each frame contains speech.

Prerequisites

1. Download and build the WebRTC VAD library. The WebRTC source code includes the VAD; the related code is under the `common_audio/vad` directory of the repository:

https://chromium.googlesource.com/external/webrtc

2. This example code needs the library file produced by building WebRTC VAD.

C example code

The following C example uses WebRTC VAD to detect voice activity in an audio file.

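#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include "webrtc_vad.h"

// Function to read a WAV file.
// Assumes a canonical 44-byte header with 16-bit mono PCM on a little-endian host.
int16_t* read_wav_file(const char* filename, int* sample_rate, int* num_samples) {
    FILE* file = fopen(filename, "rb");
    if (!file) {
        perror("Failed to open file");
        return NULL;
    }

    // Read and sanity-check the WAV header
    unsigned char header[44];
    if (fread(header, 1, sizeof(header), file) != sizeof(header) ||
        memcmp(header, "RIFF", 4) != 0 || memcmp(header + 8, "WAVE", 4) != 0) {
        fprintf(stderr, "Not a valid WAV file\n");
        fclose(file);
        return NULL;
    }

    // Get sample rate and number of samples (memcpy avoids unaligned reads)
    uint32_t rate, data_bytes;
    memcpy(&rate, header + 24, sizeof(rate));
    memcpy(&data_bytes, header + 40, sizeof(data_bytes));
    *sample_rate = (int)rate;
    *num_samples = (int)(data_bytes / sizeof(int16_t));

    // Read samples
    int16_t* samples = (int16_t*)malloc(*num_samples * sizeof(int16_t));
    if (!samples) {
        fclose(file);
        return NULL;
    }
    // Keep only the samples actually read in case the file is truncated
    *num_samples = (int)fread(samples, sizeof(int16_t), *num_samples, file);

    fclose(file);
    return samples;
}

int main(void) {
    const char* input_file = "input.wav";
    int sample_rate, num_samples;
    int16_t* samples = read_wav_file(input_file, &sample_rate, &num_samples);
    if (!samples) {
        return 1;
    }

    // The VAD accepts only these sample rates
    if (sample_rate != 8000 && sample_rate != 16000 &&
        sample_rate != 32000 && sample_rate != 48000) {
        fprintf(stderr, "Unsupported sample rate: %d\n", sample_rate);
        free(samples);
        return 1;
    }

    // 10 ms of audio at the file's sample rate (160 samples at 16 kHz)
    size_t frame_size = (size_t)sample_rate / 100;

    VadInst* vad = WebRtcVad_Create();
    if (!vad || WebRtcVad_Init(vad) != 0) {
        fprintf(stderr, "Failed to initialize VAD\n");
        free(samples);
        return 1;
    }

    int status = 0;

    // Set VAD operating mode (0, 1, 2, 3)
    if (WebRtcVad_set_mode(vad, 3) != 0) {
        fprintf(stderr, "Failed to set VAD mode\n");
        status = 1;
    }

    int num_frames = (int)((size_t)num_samples / frame_size);
    for (int i = 0; status == 0 && i < num_frames; i++) {
        const int16_t* frame = samples + (size_t)i * frame_size;
        int vad_result = WebRtcVad_Process(vad, sample_rate, frame, frame_size);
        if (vad_result < 0) {
            fprintf(stderr, "VAD processing error\n");
            status = 1;
            break;
        }
        printf("Frame %d: %s\n", i, vad_result ? "Speech" : "Non-Speech");
    }

    // Release resources on both the success and the error path
    WebRtcVad_Free(vad);
    free(samples);
    return status;
}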

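Code explanation

1. The read_wav_file function: reads a WAV file and returns the sample data, passing the sample rate and sample count back through its pointer arguments.
2. The main function:
   - Reads the input file.
   - Creates and initializes a VAD instance.
   - Sets the VAD mode. (Modes range from 0 to 3; a higher number filters out non-speech more aggressively.)
   - Splits the audio data into 10 ms frames and runs the VAD on each frame.
   - Prints the VAD result for each frame.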
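Reading the sample rate and data size from fixed offsets works only for WAV files with a canonical 44-byte header. As a hedged illustration of a sturdier approach, the Python sketch below walks the RIFF chunks to locate the `fmt ` and `data` chunks wherever they sit. `find_wav_chunks` is a hypothetical helper, and the test file is built in memory with the standard `wave` module so no input file is needed.

```python
import io
import struct
import wave

def find_wav_chunks(data):
    """Walk RIFF chunks and return (sample_rate, channels, bits, data_offset, data_size)."""
    if data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos, fmt, payload = 12, None, None
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        body = pos + 8
        if chunk_id == b"fmt ":
            _tag, channels, rate, _bps, _align, bits = struct.unpack_from("<HHIIHH", data, body)
            fmt = (rate, channels, bits)
        elif chunk_id == b"data":
            payload = (body, size)
        pos = body + size + (size & 1)     # chunks are padded to even length
    if fmt is None or payload is None:
        raise ValueError("missing fmt or data chunk")
    return fmt + payload

# Build a small WAV in memory so the sketch runs without any input file.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(bytes(320))             # 160 samples = 10 ms at 16 kHz
info = find_wav_chunks(buffer.getvalue())
print(info)                                # (16000, 1, 16, 44, 320)
```

For this file the data does start at byte 44, but files that carry extra chunks (for example metadata written by audio editors) place it elsewhere, which is where the fixed-offset approach breaks.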

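Compiling

To compile this code, the WebRTC VAD library must be linked in. For example, a command like the following can be used.

gcc -o vad_example vad_example.c -lwebrtc_vad -lm

This command compiles `vad_example.c` into the `vad_example` executable. `-lwebrtc_vad` links the WebRTC VAD library; the actual library name depends on how the VAD was built, and `-I` and `-L` options may be needed to point at the header and library directories.

The example above shows only basic VAD usage. A real application may also need more thorough error handling, memory management, and support for WAV files whose headers are not in the canonical 44-byte layout.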