FedEmoNet: Privacy-preserving federated learning with TCN-Transformer fusion for cross-corpus speech emotion recognition

Source: PubMed "swarm"
PLoS One. 2026 May 7;21(5):e0342953. doi: 10.1371/journal.pone.0342953. eCollection 2026.

ABSTRACT

Federated learning offers a promising path toward privacy-preserving speech emotion recognition, yet existing approaches remain confined to single-corpus evaluation, lack formal differential privacy guarantees, and provide no mechanism for model interpretability. Meanwhile, cross-corpus generalization continues to challenge even centralized systems, with typical accuracy drops of 20-40% on unseen datasets due to domain shift in recording conditions, speaker demographics, and cultural expression norms. This paper introduces FedEmoNet, a unified framework that jointly addresses these open problems by combining FedProx-based distributed optimization, a hybrid Temporal Convolutional Network-Transformer (TCN-Transformer) architecture, Particle Swarm Optimization (PSO) feature selection, and calibrated (ε, δ)-differential privacy. Five heterogeneous clients, two serving German speech (EmoDB), two English speech (RAVDESS), and one mixed, collaborate under non-IID conditions (Dirichlet-partitioned labels) without exchanging raw audio. Each client extracts multi-scale phase space reconstructions at micro (25 ms), meso (250 ms), and macro (2.5 s) temporal resolutions alongside spectral and handcrafted features, which are fused through multi-head attention across the TCN-Transformer branches. On held-out, speaker-independent test sets the framework achieves 99.07% ± 0.35% accuracy on EmoDB (107 samples) and 98.96% ± 0.42% on RAVDESS (288 samples). Zero-shot cross-corpus evaluation on CREMA-D (1,488 samples) yields 68.15% ± 1.23% overall, with a clear arousal-dependent pattern: high-arousal emotions (angry, happy, sad) transfer at 71.9%, versus 62.1% for low-arousal categories (neutral, disgust, fear).
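The FedProx protocol mentioned above differs from plain FedAvg by adding a proximal term, (μ/2)·||w − w_global||², to each client's local loss, which limits client drift under non-IID data. A minimal sketch of the resulting local update rule on a toy quadratic loss (the learning rate, μ value, and loss function here are illustrative assumptions, not the paper's actual settings):

```python
import numpy as np

def fedprox_local_step(w, w_global, grad, lr=0.1, mu=0.01):
    """One FedProx local update: a standard gradient step plus the
    proximal gradient mu*(w - w_global), which pulls the client's
    parameters back toward the current global model."""
    return w - lr * (grad + mu * (w - w_global))

# Toy client objective: 0.5 * ||w - target||^2, so grad = w - target.
w_global = np.zeros(3)
target = np.array([1.0, -2.0, 0.5])  # hypothetical client optimum
w = w_global.copy()
for _ in range(200):
    grad = w - target
    w = fedprox_local_step(w, w_global, grad, lr=0.1, mu=0.01)

# With the proximal term, the client converges not to its own optimum
# but to target / (1 + mu), a compromise between local and global models.
```

The fixed point target / (1 + mu) makes the drift-control effect concrete: larger μ pulls the local solution closer to the global model, which is exactly the behavior that helps under the heterogeneous client setup described in the abstract.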
Ablation experiments confirm that PSO selection (+2.80%), the Transformer blocks (+2.10%), and the FedProx protocol (+2.62%) each contribute significantly, and a monotonic reduced-data curve rules out memorization. Under differential privacy, membership inference attacks drop to near-chance performance (AUC = 0.52) while the model retains 98.5% accuracy. A dual SHAP-LIME explainability analysis shows high inter-method agreement (r = 0.997) and confirms that prosodic features, particularly fundamental frequency statistics, serve as language-invariant emotion indicators across all three corpora (r = 0.94 cross-corpus consistency).

PMID:42096484 | PMC:PMC13152135 | DOI:10.1371/journal.pone.0342953