In this episode, we discuss Hymba: A Hybrid-head Architecture for Small Language Models by Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov. The paper introduces Hymba, a new family of small language models that combines transformer attention with state space models (SSMs) for improved efficiency and performance. Each layer runs attention heads and SSM heads in parallel, using attention for high-resolution recall and SSMs for efficient context summarization, and adds optimizations such as learnable meta tokens, cross-layer key-value (KV) sharing, and partial sliding window attention to shrink the cache. Experiments show that Hymba-1.5B-Base outperforms other public models under 2B parameters, with gains in accuracy, cache size, and throughput.
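To make the parallel hybrid-head idea concrete, here is a minimal NumPy sketch, not the authors' implementation: one causal attention head and one simple diagonal-recurrence SSM head process the same input, and their normalized outputs are averaged. The specific fusion rule, the SSM parameterization (`A`, `B`, `C`), and all dimensions are simplified assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(x, Wq, Wk, Wv):
    # Standard scaled dot-product attention with a causal mask,
    # so each token attends only to itself and earlier tokens.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    mask = np.triu(np.ones_like(scores), k=1)
    scores = np.where(mask == 1, -1e9, scores)
    return softmax(scores) @ v

def ssm_head(x, A, B, C):
    # Toy diagonal linear state-space recurrence (an assumption, not
    # Hymba's exact SSM): h_t = A * h_{t-1} + B x_t, y_t = C h_t.
    T, _ = x.shape
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(T):
        h = A * h + B @ x[t]
        ys.append(C @ h)
    return np.stack(ys)

def hybrid_head(x, attn_params, ssm_params):
    # Run both paths on the same input, normalize each output,
    # then average them (hypothetical fusion rule for this sketch).
    ya = attention_head(x, *attn_params)
    ys = ssm_head(x, *ssm_params)
    norm = lambda y: y / (np.linalg.norm(y, axis=-1, keepdims=True) + 1e-6)
    return 0.5 * (norm(ya) + norm(ys))

rng = np.random.default_rng(0)
d, n, T = 4, 3, 5                      # model dim, SSM state dim, sequence length
x = rng.normal(size=(T, d))
attn_params = tuple(rng.normal(size=(d, d)) for _ in range(3))
ssm_params = (rng.uniform(0, 1, size=n),   # decay per state channel
              rng.normal(size=(n, d)),
              rng.normal(size=(d, n)))
out = hybrid_head(x, attn_params, ssm_params)
print(out.shape)  # one fused output vector per token
```

The intuition the sketch captures is that the attention path can retrieve exact earlier tokens while the recurrent path compresses the whole prefix into a fixed-size state, which is what keeps the cache small.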