Audio Classifier API Architecture
This document explains the data flow and internal processing logic of the Voice Classifier API.
High-Level Architecture
The API follows a modular layered architecture to separate concerns between web handling, feature engineering, and model inference.
graph TD
User((User)) -->|POST /predict-anomaly| API[FastAPI Entry Point]
subgraph "1. Pre-processing"
API --> Load[Librosa Load - 16kHz Mono]
Load --> Trim[Trim Silence - 20dB top_db]
Trim --> PreEmp[Pre-emphasis Filter]
end
subgraph "2. Feature Extraction Pipeline"
PreEmp --> STFT[Short-Time Fourier Transform]
STFT --> MFCC[MFCC - 12 Coefs]
PreEmp --> Yin[Optimized Yin - Pitch/F0]
STFT --> Spectral[Spectral Centroid/Rolloff]
STFT --> Energy[RMS & ZCR]
end
subgraph "3. Feature Engineering"
MFCC --> Vector[Feature Vector Assembly]
Yin --> Vector
Spectral --> Vector
Energy --> Vector
Vector --> Selection[Feature Selection - Correlated Drops]
Selection --> Scaler[Standard Scaler]
end
subgraph "4. Inference Engine"
Scaler --> Ensemble[Voting Ensemble Model]
Ensemble --> Result[Label Encoder Mapping]
end
Result -->|JSON Response| User
style API fill:#f9f,stroke:#333,stroke-width:2px
style Ensemble fill:#69f,stroke:#333,stroke-width:2px
Performance Optimization (V7.2)
- Yin Substitution: Replaced pyin with yin, reducing the pitch-estimation latency bottleneck from ~15s to < 1s.
- Duration Capping: Optimized extraction to focus on the first 10 seconds of audio.
- Pitch Range Restriction: Narrowed search to 80Hz - 800Hz for human compatibility and speed.
- In-Memory Buffer: Minimal I/O overhead using BytesIO.
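To illustrate why a restricted pitch range keeps the search cheap, here is a minimal single-frame sketch of the YIN idea in plain Python. It is not the library routine the service calls; the function name yin_f0, the 0.1 threshold, and the frame handling are assumptions made for illustration. The 80 Hz and 800 Hz bounds come from the list above and translate directly into the smallest and largest lag that has to be examined.

```python
import math


def yin_f0(frame, sr, fmin=80.0, fmax=800.0, threshold=0.1):
    """Estimate the fundamental frequency (Hz) of one frame with a simplified YIN."""
    tau_min = int(sr / fmax)       # shortest lag to test (highest pitch)
    tau_max = int(sr / fmin)       # longest lag to test (lowest pitch)
    window = len(frame) - tau_max  # number of samples compared at every lag
    if window <= 0:
        raise ValueError("frame is too short for the requested fmin")

    # Step 1: squared-difference function d(tau) for tau = 1..tau_max.
    diff = [0.0] * (tau_max + 1)
    for tau in range(1, tau_max + 1):
        diff[tau] = sum((frame[j] - frame[j + tau]) ** 2 for j in range(window))

    # Step 2: cumulative-mean-normalised difference d'(tau).
    cmnd = [1.0] * (tau_max + 1)
    running = 0.0
    for tau in range(1, tau_max + 1):
        running += diff[tau]
        cmnd[tau] = diff[tau] * tau / running if running > 0 else 1.0

    # Step 3: take the first lag under the threshold, then slide to its local minimum.
    for tau in range(tau_min, tau_max + 1):
        if cmnd[tau] < threshold:
            while tau + 1 <= tau_max and cmnd[tau + 1] < cmnd[tau]:
                tau += 1
            return sr / tau

    # Fallback: no lag passed the threshold, so use the global minimum in range.
    best = min(range(tau_min, tau_max + 1), key=lambda t: cmnd[t])
    return sr / best
```

At 16 kHz the 80 Hz to 800 Hz range limits the search to lags 20 through 200, so a narrower range means fewer lags to evaluate per frame.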
Data Processing Pipeline
1. Request Handling (main.py)
- Receives the audio file via a POST request.
- Validates file format.
- Passes the audio buffer to the processing layer in memory, avoiding temporary files.
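The request-handling steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in main.py: it handles only PCM WAV input through the wave module (the service itself loads audio with Librosa, per the diagram), and the names read_wav_capped, make_test_wav, and MAX_SECONDS are hypothetical. It shows format validation and an in-memory read that stops after the first 10 seconds.

```python
import io
import wave

MAX_SECONDS = 10.0  # assumed constant mirroring the 10-second cap described in this document


def read_wav_capped(data: bytes, max_seconds: float = MAX_SECONDS):
    """Validate WAV bytes and return (sample_rate, channels, pcm_bytes) for the leading audio."""
    try:
        # io.BytesIO lets wave parse the upload without touching the disk.
        with wave.open(io.BytesIO(data), "rb") as wav:
            sr = wav.getframerate()
            channels = wav.getnchannels()
            n_frames = min(wav.getnframes(), int(sr * max_seconds))
            pcm = wav.readframes(n_frames)
    except (wave.Error, EOFError) as exc:
        raise ValueError(f"not a readable WAV file: {exc}") from exc
    return sr, channels, pcm


def make_test_wav(seconds: float, sr: int = 16000) -> bytes:
    """Build a silent mono 16-bit WAV entirely in memory (demo helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sr)
        wav.writeframes(b"\x00\x00" * int(sr * seconds))
    return buf.getvalue()
```

A handler could call read_wav_capped on the uploaded bytes and turn the ValueError into a client error response; how the real endpoint reports invalid files is not specified here.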
2. Feature Extraction (features/feature_extractor.py)
- Trimming: Removes leading and trailing silence to avoid bias.
- Pre-emphasis: Amplifies high frequencies to balance the spectrum before analysis.
- Pitch Estimation: Uses the Yin algorithm (optimized for speed) to determine fundamental frequency (F0).
- Redundancy Reduction: Compares extracted features against dropped_features.joblib to remove highly correlated columns that the model doesn't use.
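Two of the steps above are small enough to show in a few lines. The sketch below is an assumption-laden illustration rather than the contents of feature_extractor.py: the coefficient 0.97 is a common pre-emphasis choice that this document does not confirm, and the function names and feature names are hypothetical.

```python
def pre_emphasis(signal, coef=0.97):
    """First-order high-pass: y[n] = x[n] - coef * x[n-1]; the first sample passes through."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - coef * signal[n - 1] for n in range(1, len(signal))]


def drop_features(features: dict, dropped: list) -> dict:
    """Return a copy of the feature mapping without the columns named in `dropped`."""
    blocked = set(dropped)
    return {name: value for name, value in features.items() if name not in blocked}
```

In the service, the list passed as dropped would come from dropped_features.joblib, so the same columns removed during training are removed at inference time.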
3. Inference Engine (utils/model_handler.py)
- Scaling: Standardizes the features using the pre-fitted scaler.joblib.
- Ensemble Prediction: Passes the scaled data through the Voting Classifier. The ensemble combines:
- XGBoost
- CatBoost
- LightGBM
- Random Forest
- Result Mapping: Converts model indices back to human-readable labels (Adult/Child) using label_encoder.joblib.
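The three inference steps can be traced by hand with a dependency-free sketch. Several things here are assumptions: the document does not say whether the ensemble votes on labels or on averaged probabilities (soft voting is shown), the index order Adult = 0, Child = 1 is what alphabetical encoding would give but is not confirmed, and the function names are hypothetical. The real service delegates these steps to the fitted objects stored in the .joblib files.

```python
LABELS = ["Adult", "Child"]  # assumed index order: 0 -> Adult, 1 -> Child


def standardize(row, means, stds):
    """Apply z = (x - mean) / std per feature, using statistics fitted beforehand."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


def soft_vote(probabilities):
    """Average per-class probabilities across models; return (winning index, averages)."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    avg = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
    winner = max(range(n_classes), key=lambda c: avg[c])
    return winner, avg


def to_label(index):
    """Map a class index back to its human-readable label."""
    return LABELS[index]
```

With four models, averaging probabilities also sidesteps the 2-2 ties that a plain majority vote could produce.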
Performance Optimization
- Startup Loading: All model artifacts (.joblib files) are loaded into memory once during API startup.
- Memory Efficiency: Audio is processed directly from memory (via io.BytesIO) without temporary disk writes.
- Selective Calculation: Only the features required by the ensemble are calculated or kept after the selection phase.
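The startup-loading pattern can be sketched as a small container that reads every artifact exactly once and serves it from memory afterwards. The class name ArtifactStore and its interface are hypothetical; the loader is passed in as a callable so that, in the service, joblib.load could plausibly fill that role (an assumption about model_handler.py, not a description of it).

```python
class ArtifactStore:
    """Load every artifact once at construction and serve it from memory afterwards."""

    def __init__(self, loader, paths):
        # `loader` is any callable that takes a path and returns the loaded object.
        # `paths` maps a short name to the artifact file it should be loaded from.
        self._artifacts = {name: loader(path) for name, path in paths.items()}

    def get(self, name):
        """Return the already-loaded artifact; no disk access happens here."""
        return self._artifacts[name]
```

Creating one store when the application starts and sharing it across requests keeps per-request work limited to feature extraction and prediction.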