Audio Classifier API Architecture

This document explains the data flow and internal processing logic of the Voice Classifier API.

High-Level Architecture

The API follows a modular layered architecture to separate concerns between web handling, feature engineering, and model inference.

graph TD
    User((User)) -->|POST /predict-anomaly| API[FastAPI Entry Point]
    
    subgraph "1. Pre-processing"
    API --> Load[Librosa Load - 16kHz Mono]
    Load --> Trim[Trim Silence - 20dB top_db]
    Trim --> PreEmp[Pre-emphasis Filter]
    end
    
    subgraph "2. Feature Extraction Pipeline"
    PreEmp --> STFT[Short-Time Fourier Transform]
    STFT --> MFCC[MFCC - 12 Coefs]
    PreEmp --> Yin[Optimized Yin - Pitch/F0]
    STFT --> Spectral[Spectral Centroid/Rolloff]
    STFT --> Energy[RMS & ZCR]
    end
    
    subgraph "3. Feature Engineering"
    MFCC --> Vector[Feature Vector Assembly]
    Yin --> Vector
    Spectral --> Vector
    Energy --> Vector
    Vector --> Selection[Feature Selection - Correlated Drops]
    Selection --> Scaler[Standard Scaler]
    end
    
    subgraph "4. Inference Engine"
    Scaler --> Ensemble[Voting Ensemble Model]
    Ensemble --> Result[Label Encoder Mapping]
    end
    
    Result -->|JSON Response| User

    style API fill:#f9f,stroke:#333,stroke-width:2px
    style Ensemble fill:#69f,stroke:#333,stroke-width:2px

Performance Optimization (V7.2)

Yin Substitution: Replaced pyin with yin, cutting pitch-estimation time from ~15 s to under 1 s.
Duration Capping: Feature extraction uses only the first 10 seconds of audio.
Pitch Range Restriction: Narrowed the F0 search to 80-800 Hz, which covers the human voice and speeds up the search.
In-Memory Buffer: Audio is read from a BytesIO buffer to minimize I/O overhead.
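
The sketch below shows how these optimizations could fit together. It is not the project's actual code: the function name `estimate_f0` and the constant names are assumptions made for illustration. Only the values (16 kHz mono, 10 seconds, 80-800 Hz) and the use of yin and BytesIO come from this document.

```python
import io

TARGET_SR = 16000    # sample rate shown in the diagram (16 kHz mono)
MAX_SECONDS = 10.0   # duration cap
F0_MIN_HZ = 80.0     # lower bound of the pitch search
F0_MAX_HZ = 800.0    # upper bound of the pitch search


def estimate_f0(audio_bytes: bytes):
    """Decode at most MAX_SECONDS from memory and run yin on the result."""
    import librosa  # third-party; imported here so the module loads without it

    # librosa.load accepts file-like objects, so no temporary file is needed.
    y, sr = librosa.load(
        io.BytesIO(audio_bytes), sr=TARGET_SR, mono=True, duration=MAX_SECONDS
    )
    # yin returns one F0 estimate (in Hz) per analysis frame.
    f0 = librosa.yin(y, fmin=F0_MIN_HZ, fmax=F0_MAX_HZ, sr=sr)
    return y, sr, f0
```

Passing `duration` to the loader means the extra audio is never decoded, which is cheaper than loading the full file and slicing afterwards.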

Data Processing Pipeline

1. Request Handling (`main.py`)

Receives the audio file via a POST request.
Validates file format.
Passes the audio buffer to the in-memory processing layer, so nothing is written to disk.
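
This document does not say which formats `main.py` accepts or how it checks them. As a minimal sketch, assuming WAV uploads only, the validation and hand-off could look like this. The helpers `looks_like_wav`, `to_buffer`, and `make_silent_wav` are hypothetical names, not part of the project.

```python
import io
import wave


def looks_like_wav(data: bytes) -> bool:
    """Return True if the bytes start with a RIFF/WAVE container header."""
    # A WAV file starts with b"RIFF", a 4-byte chunk size, then b"WAVE".
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"


def to_buffer(data: bytes) -> io.BytesIO:
    """Validate the upload and wrap it in an in-memory buffer."""
    if not looks_like_wav(data):
        raise ValueError("unsupported audio format: expected a WAV file")
    return io.BytesIO(data)


def make_silent_wav(seconds: float = 0.1, sample_rate: int = 16000) -> bytes:
    """Build a tiny 16-bit mono WAV in memory, for demonstration only."""
    out = io.BytesIO()
    with wave.open(out, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return out.getvalue()


buffer = to_buffer(make_silent_wav())
print(buffer.read(4))
```

In a FastAPI handler, the `ValueError` would typically be translated into a 4xx response before any feature extraction starts.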

2. Feature Extraction (`features/feature_extractor.py`)

Trimming: Removes leading and trailing silence (20 dB top_db threshold) so that silent frames do not skew the features.
Pre-emphasis: Amplifies high frequencies to balance the spectrum before analysis.
Pitch Estimation: Uses the Yin algorithm (optimized for speed) to determine fundamental frequency (F0).
Redundancy Reduction: Removes the highly correlated columns recorded in dropped_features.joblib, which the model does not use.
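
The redundancy-reduction step can be illustrated with a small sketch. It assumes that dropped_features.joblib holds a list of column names, which this document does not confirm. The feature names, the values, and the helper `drop_redundant` are hypothetical.

```python
from typing import Dict, Iterable


def drop_redundant(features: Dict[str, float], dropped: Iterable[str]) -> Dict[str, float]:
    """Return a copy of the feature mapping without the dropped columns."""
    dropped_set = set(dropped)
    # Dict order is preserved, so surviving columns keep their original order.
    return {name: value for name, value in features.items() if name not in dropped_set}


# Hypothetical feature names and values, for illustration only.
extracted = {
    "mfcc_1": -210.4,
    "mfcc_2": 95.1,
    "f0_mean": 182.0,
    "spectral_rolloff": 3100.0,
    "spectral_centroid": 1450.0,
}
# In the API this list would come from dropped_features.joblib (format assumed).
dropped_features = ["spectral_rolloff"]

kept = drop_redundant(extracted, dropped_features)
print(list(kept))
```

Keeping column order stable matters because the scaler and the model expect features in the same order they saw during training.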

3. Inference Engine (`utils/model_handler.py`)

Scaling: Standardizes the features using the pre-fitted scaler.joblib.
Ensemble Prediction: Passes the scaled data through the Voting Classifier. The ensemble combines:
- XGBoost
- CatBoost
- LightGBM
- Random Forest
Result Mapping: Converts model indices back to human-readable labels (Adult/Child) using label_encoder.joblib.
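
The three steps above can be sketched as one function. This is an illustration, not the contents of `utils/model_handler.py`: the name `predict_label` is hypothetical. The objects passed in are assumed to follow the scikit-learn interface (`transform`, `predict`, `inverse_transform`), as a fitted StandardScaler, VotingClassifier, and LabelEncoder would.

```python
from typing import Any, Sequence


def predict_label(
    feature_row: Sequence[float], scaler: Any, model: Any, label_encoder: Any
) -> str:
    """Scale one feature row, classify it, and map the class index to a label."""
    # scikit-learn estimators expect a 2D input: one row per sample.
    scaled = scaler.transform([list(feature_row)])
    class_indices = model.predict(scaled)
    # inverse_transform maps encoded indices back to the original label strings.
    return str(label_encoder.inverse_transform(class_indices)[0])
```

Because the function only relies on those three methods, it can be unit-tested with small stand-in objects instead of the real model artifacts.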

Performance Optimization

Startup Loading: All model artifacts (.joblib files) are loaded into memory once during API startup.
Memory Efficiency: Audio is processed directly from memory (via io.BytesIO) without temporary disk writes.
Selective Calculation: Only the features required by the ensemble are calculated or kept after the selection phase.
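
Startup loading could be organized as in the sketch below. The class `ArtifactStore` and the file name `model.joblib` are hypothetical. The other three file names are the ones mentioned in this document. The loader is injectable so the sketch can run without joblib installed.

```python
from typing import Any, Callable, Dict, Optional

# The scaler, label encoder, and dropped-feature files are named in this
# document; "model.joblib" is a hypothetical name for the ensemble artifact.
ARTIFACT_FILES = {
    "scaler": "scaler.joblib",
    "label_encoder": "label_encoder.joblib",
    "dropped_features": "dropped_features.joblib",
    "model": "model.joblib",
}


def _joblib_load(path: str) -> Any:
    import joblib  # third-party; imported lazily so the sketch runs without it

    return joblib.load(path)


class ArtifactStore:
    """Loads every artifact once and serves it from memory afterwards."""

    def __init__(self, loader: Callable[[str], Any] = _joblib_load) -> None:
        self._loader = loader
        self._artifacts: Optional[Dict[str, Any]] = None

    def load_all(self) -> None:
        """Call once at startup, e.g. from the web framework's startup hook."""
        self._artifacts = {
            name: self._loader(path) for name, path in ARTIFACT_FILES.items()
        }

    def get(self, name: str) -> Any:
        """Return an already-loaded artifact without touching the disk."""
        if self._artifacts is None:
            raise RuntimeError("artifacts not loaded; call load_all() at startup")
        return self._artifacts[name]
```

Request handlers then call `get("scaler")` and similar, so each request avoids the cost of deserializing the model files again.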
