Audio Classifier API Architecture

This document explains the data flow and internal processing logic of the Voice Classifier API.

High-Level Architecture

The API follows a modular layered architecture to separate concerns between web handling, feature engineering, and model inference.

graph TD
User((User)) -->|POST /predict-anomaly| API[FastAPI Entry Point]

subgraph "1. Pre-processing"
API --> Load[Librosa Load - 16kHz Mono]
Load --> Trim[Trim Silence - 20dB top_db]
Trim --> PreEmp[Pre-emphasis Filter]
end

subgraph "2. Feature Extraction Pipeline"
PreEmp --> STFT[Short-Time Fourier Transform]
STFT --> MFCC[MFCC - 12 Coefs]
PreEmp --> Yin[Optimized Yin - Pitch/F0]
STFT --> Spectral[Spectral Centroid/Rolloff]
STFT --> Energy[RMS & ZCR]
end

subgraph "3. Feature Engineering"
MFCC --> Vector[Feature Vector Assembly]
Yin --> Vector
Spectral --> Vector
Energy --> Vector
Vector --> Selection[Feature Selection - Correlated Drops]
Selection --> Scaler[Standard Scaler]
end

subgraph "4. Inference Engine"
Scaler --> Ensemble[Voting Ensemble Model]
Ensemble --> Result[Label Encoder Mapping]
end

Result -->|JSON Response| User

style API fill:#f9f,stroke:#333,stroke-width:2px
style Ensemble fill:#69f,stroke:#333,stroke-width:2px

Performance Optimization (V7.2)

  • Yin Substitution: Replaced pyin with yin, cutting pitch-estimation time from ~15 s to < 1 s.
  • Duration Capping: Feature extraction processes only the first 10 seconds of each audio file.
  • Pitch Range Restriction: Narrowed the F0 search to 80-800 Hz, which covers the human voice and speeds up the search.
  • In-Memory Buffer: Uses BytesIO to keep I/O overhead minimal.
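
The effect of duration capping and pitch range restriction can be seen with a little arithmetic. The sketch below is illustrative only and is not taken from the project code: the helper names `cap_duration` and `pitch_lag_range` are hypothetical, and the 16 kHz sample rate comes from the loading step in the diagram. A time-domain pitch estimator such as Yin compares the signal with shifted copies of itself, so a narrower frequency range means fewer candidate shifts (lags) to evaluate.

```python
def cap_duration(samples, sample_rate, max_seconds=10.0):
    """Keep only the first max_seconds of audio (hypothetical helper)."""
    max_samples = int(sample_rate * max_seconds)
    return samples[:max_samples]


def pitch_lag_range(sample_rate, fmin=80.0, fmax=800.0):
    """Convert a frequency search range into a lag range, in samples.

    A period of N samples corresponds to a frequency of sample_rate / N,
    so the highest frequency gives the shortest lag and vice versa.
    """
    min_lag = int(sample_rate // fmax)
    max_lag = int(sample_rate // fmin)
    return min_lag, max_lag


# 12 seconds of silence at 16 kHz, capped to 10 seconds.
audio = [0.0] * (16000 * 12)
capped = cap_duration(audio, 16000)
print(len(capped))             # number of samples kept
print(pitch_lag_range(16000))  # (shortest lag, longest lag)
```

At 16 kHz, the 80-800 Hz range limits the search to lags between 20 and 200 samples, and the 10-second cap bounds the input at 160,000 samples regardless of the upload length.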

Data Processing Pipeline

1. Request Handling (main.py)

  • Receives the audio file via a POST request.
  • Validates file format.
  • Passes the audio buffer to the processing layer in memory, without writing it to disk.
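
The validation and buffering steps above can be sketched without any web framework. This is a minimal illustration, not the code in main.py: the function name `to_audio_buffer` and the set of accepted extensions are assumptions, since the document does not say which formats the API accepts or how it checks them.

```python
import io
from pathlib import Path

# Assumption: the real API may accept a different set of formats.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg"}


def to_audio_buffer(filename: str, data: bytes) -> io.BytesIO:
    """Validate an upload and wrap its bytes in an in-memory buffer."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported audio format: {suffix or 'none'}")
    if not data:
        raise ValueError("Uploaded file is empty")
    # BytesIO behaves like an open file, so no temporary file is needed.
    return io.BytesIO(data)


buffer = to_audio_buffer("sample.wav", b"RIFF\x00\x00\x00\x00WAVE")
print(buffer.read(4))
```

In a FastAPI handler, the bytes read from the uploaded file would be passed to a function like this one, and a `ValueError` would typically be translated into a 4xx response.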

2. Feature Extraction (features/feature_extractor.py)

  • Trimming: Removes leading and trailing silence so that silent frames do not skew the extracted features.
  • Pre-emphasis: Amplifies high frequencies to balance the spectrum before analysis.
  • Pitch Estimation: Uses the Yin algorithm (optimized for speed) to determine fundamental frequency (F0).
  • Redundancy Reduction: Compares extracted features against dropped_features.joblib to remove highly correlated columns that the model doesn't use.
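
Two of the steps above are simple enough to show in plain Python. The following is a sketch under stated assumptions, not the contents of feature_extractor.py: the coefficient 0.97 is a commonly used pre-emphasis value rather than one confirmed by this document, the feature names are made up for illustration, and the dropped list stands in for whatever dropped_features.joblib actually contains.

```python
def pre_emphasis(signal, coef=0.97):
    """First-order high-pass filter: y[n] = x[n] - coef * x[n-1].

    The first sample is passed through unchanged (an assumption; libraries
    differ in how they initialise the filter).
    """
    if not signal:
        return []
    out = [signal[0]]
    for prev, cur in zip(signal, signal[1:]):
        out.append(cur - coef * prev)
    return out


def drop_features(features, dropped):
    """Remove the columns listed in `dropped` from a name -> value mapping."""
    dropped_set = set(dropped)
    return {name: value for name, value in features.items()
            if name not in dropped_set}


# A constant signal has no high-frequency content, so it is attenuated.
print(pre_emphasis([1.0, 1.0, 1.0], coef=0.5))

# Hypothetical feature names; "rolloff" plays the role of a correlated column.
features = {"mfcc_1": 0.12, "centroid": 1800.0, "rolloff": 3500.0}
print(drop_features(features, ["rolloff"]))
```

Slowly varying (low-frequency) content is reduced because consecutive samples nearly cancel, while rapid changes pass through, which is what boosts the high end of the spectrum relative to the low end.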

3. Inference Engine (utils/model_handler.py)

  • Scaling: Standardizes the features using the pre-fitted scaler.joblib.
  • Ensemble Prediction: Passes the scaled data through the Voting Classifier. The ensemble combines:
    • XGBoost
    • CatBoost
    • LightGBM
    • Random Forest
  • Result Mapping: Converts model indices back to human-readable labels (Adult/Child) using label_encoder.joblib.
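
The three inference steps can be illustrated with plain Python. This sketch makes several assumptions that the document does not confirm: that the ensemble uses soft voting (averaging class probabilities) rather than hard voting, that index 0 maps to "Adult" and index 1 to "Child", and that the probabilities shown are invented stand-ins for the four models' outputs.

```python
def standardize(values, means, stds):
    """Apply pre-fitted standard scaling: (x - mean) / std per feature."""
    return [(v - m) / s for v, m, s in zip(values, means, stds)]


def soft_vote(probabilities):
    """Average per-class probabilities across models.

    Returns the winning class index and the averaged probabilities.
    """
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    averaged = [sum(p[c] for p in probabilities) / n_models
                for c in range(n_classes)]
    winner = max(range(n_classes), key=averaged.__getitem__)
    return winner, averaged


LABELS = ["Adult", "Child"]  # assumption: the encoder's index order

scaled = standardize([3.0, 10.0], means=[1.0, 10.0], stds=[2.0, 5.0])

# Invented outputs standing in for XGBoost, CatBoost, LightGBM, Random Forest.
model_outputs = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3], [0.8, 0.2]]
index, averaged = soft_vote(model_outputs)
print(scaled, LABELS[index])
```

Note that one model disagreeing (the second one here) does not change the result, because the averaged probability for the first class remains the largest.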

Performance Optimization

  • Startup Loading: All model artifacts (.joblib files) are loaded into memory once during API startup.
  • Memory Efficiency: Audio is processed directly from memory (via io.BytesIO) without temporary disk writes.
  • Selective Calculation: Only the features required by the ensemble are calculated or kept after the selection phase.
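
The startup-loading pattern can be sketched as a small store that reads every artifact once and then serves it from memory. The class name `ArtifactStore` and the injected `load_fn` are hypothetical; in the real service the loader would presumably be `joblib.load`, and the artifact list would also include the ensemble model file, whose name this document does not give.

```python
class ArtifactStore:
    """Load all artifacts once, then serve them from memory."""

    def __init__(self, names, load_fn):
        # Every file is read exactly once, when the store is created.
        self._artifacts = {name: load_fn(name) for name in names}

    def get(self, name):
        return self._artifacts[name]


load_calls = []


def fake_load(name):
    """Stand-in for a real loader; records each call for demonstration."""
    load_calls.append(name)
    return {"artifact": name}


store = ArtifactStore(
    ["scaler.joblib", "label_encoder.joblib", "dropped_features.joblib"],
    fake_load,
)

# Repeated lookups during request handling do not trigger new loads.
store.get("scaler.joblib")
store.get("scaler.joblib")
print(len(load_calls))
```

Creating the store during application startup keeps deserialization cost out of the request path, so each prediction pays only for feature extraction and inference.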