- 2025.08: 📢📢📢 The survey paper has been covered by many technical media outlets, including Zhihu (知乎), Synced (机器之心), PaperWeekly, 52CV (我爱计算机视觉), Deep Learning and NLP (深度学习自然语言处理), Emergent Clustering Points (涌现聚点), Global Economic Forum (全球经济论坛), and more. People ❤️ "Speed Always Wins"!
- 2025.08: 🎉🎉🎉 We have released our survey paper Speed Always Wins: A Survey on Efficient Architectures for Large Language Models, which covers 449 papers. Please feel free to open a PR to add your Awesome-Efficient-Arch work.
- 1 Introduction
- 1.1 Background
- 1.2 Position and Contributions
- 2 Linear Sequence Modeling
- 3 Sparse Sequence Modeling
- 4 Efficient Full Attention
- 4.1 IO-Aware Attention
- 4.2 Grouped Attention
- 4.3 Mixture of Attention
- 4.4 Quantized Attention
- 5 Sparse Mixture-of-Experts
- 5.1 Routing Mechanisms
- 5.2 Expert Architectures
- 5.3 MoE Conversion
- 6 Hybrid Architectures
- 6.1 Inter-layer Hybrid
- 6.2 Intra-layer Hybrid
- 7 Diffusion Large Language Models
- 8 Applications to Other Modalities
- 8.1 Vision
- 8.2 Audio
- 8.3 Multimodality
- 9 Conclusion and Future Directions
- Log-Linear Attention
- EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices
- MoM: Linear Sequence Modeling with Mixture-of-Memories
- ReGLA: Refining Gated Linear Attention
- MetaLA: Unified Optimal Linear Approximation to Softmax Attention Map
- Gated Slot Attention for Efficient Linear-Time Sequence Modeling
- Scaling Laws for Linear Complexity Language Models
- Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention
- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
- Linear Transformers with Learnable Kernel Functions are Better In-Context Models
- Simple linear attention language models balance the recall-throughput tradeoff
- The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry
- Transformers are Multi-State RNNs
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
- Gated Linear Attention Transformers with Hardware-efficient Training
- GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling
- AGaLiTe: Approximate Gated Linear Transformers for Online Reinforcement Learning
- Recurrent Linear Transformers
- TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer
- Retentive Network: A Successor to Transformer for Large Language Models
- The Devil in Linear Transformer
- Transformer Quality in Linear Time
- ABC: Attention with Bounded-memory Control
- Going Beyond Linear Transformers with Recurrent Fast Weight Programmers
- Random Feature Attention
- Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
- RWKV-7" Goose" with Expressive Dynamic State Evolution
- xLSTM: Extended Long Short-Term Memory
- HGRN2: Gated Linear RNNs with State Expansion
- Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
- GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling
- Hierarchically Gated Recurrent Neural Network for Sequence Modeling
- RWKV: Reinventing RNNs for the Transformer Era
- Resurrecting Recurrent Neural Networks for Long Sequences
- Mamba: Linear‑Time Sequence Modeling with Selective State Spaces
- Transformers are SSMs: Generalized Models and Efficient Algorithms through Structured State Space Duality
- Comba: Improving Nonlinear RNNs with Closed‑loop Control
- Efficiently Modeling Long Sequences with Structured State Spaces (S4)
- S5: Simplified State Space Layers for Sequence Modeling
- HiPPO: Recurrent Memory with Optimal Polynomial Projections
- Diagonal State Spaces are as Effective as Structured State Spaces
- On the parameterization and initialization of diagonal state space models
- Liquid‑S4: Liquid Structural State‑Space Models
- Longhorn: State Space Models Are Amortized Online Learners
- Time‑SSM: Simplifying and Unifying State Space Models for Time Series Forecasting
- Effectively Modeling Time Series with Simple Discrete State Spaces
- Attractor Memory for Long‑Term Time Series Forecasting: A Chaos Perspective
- How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections
- LSSL: Combining Recurrent, Convolutional, and Continuous‑Time Models with Linear State‑Space Layers
- Test‑Time Regression: A Unifying Framework for Designing Sequence Models with Associative Memory
- Titans: Learning to Memorize at Test Time
- MesaNet: Sequence Modeling by Locally Optimal Test‑Time Training
- Atlas: Learning to Optimally Memorize the Context at Test Time
- Test‑Time Training Done Right
- It's All Connected: A Journey Through Test‑Time Memorization, Attentional Bias, Retention, and Online Optimization
- Learning to (Learn at Test Time): RNNs with Expressive Hidden States
- A Brief History of Linear Attention: From Imitation and Innovation to Feeding Back (线性注意力简史:从模仿、创新到反哺)
- Comba: Improving Nonlinear RNNs with Closed‑loop Control
- Understanding Transformer from the Perspective of Associative Memory
- Test‑Time Regression: A Unifying Framework for Designing Sequence Models with Associative Memory
- MetaLA: Unified Optimal Linear Approximation to Softmax Attention Map
- It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
- On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention
- LoLA: Low-Rank Linear Attention With Sparse Caching
- M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
- Liger: Linearizing Large Language Models to Gated Recurrent Structures
- Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners
- Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing
- LoLCATs: On Low-Rank Linearizing of Large Language Models
- The Mamba in the Llama: Distilling and Accelerating Hybrid Models
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models
- Linearizing Large Language Models
- DiJiang: Efficient Large Language Models through Compact Kernelization
- Fine-Tuning Pre-trained Transformers into Decaying Fast Weights
- Finetuning Pretrained Transformers into RNNs
- FLA: A Triton‑Based Library for Hardware‑Efficient Implementations of Linear Attention Mechanism
- Transformers are SSMs: Generalized Models and Efficient Algorithms through Structured State Space Duality
- Comba: Improving Nonlinear RNNs with Closed‑loop Control
- Gated Delta Networks: Improving Mamba2 with Delta Rule
- DeltaNet: Parallelizing Linear Transformers with the Delta Rule over Sequence Length
- Transformers are SSMs: Generalized Models and Efficient Algorithms through Structured State Space Duality
- Gated Linear Attention Transformers with Hardware-efficient Training
- Gated Slot Attention for Efficient Linear‑Time Sequence Modeling
- LongNet: Scaling Transformers to 1,000,000,000 Tokens
- Open-Sora: Democratizing Efficient Video Production for All
- LongT5: Efficient Text-to-Text Transformer for Long Sequences
- Blockwise Self-Attention for Long Document Understanding
- Longformer: The Long-Document Transformer
- GMAT: Global Memory Augmentation for Transformers
- ETC: Encoding Long and Structured Inputs in Transformers
- BigBird: Transformers for Longer Sequences
- Axial Attention in Multidimensional Transformers
- Generating Long Sequences with Sparse Transformers
- Star-Transformer
- Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity
- Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
- The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- Unlimiformer: Long-Range Transformers with Unlimited Length Input
- ColT5: Faster Long-Range Transformers with Conditional Computation
- Memorizing Transformers
- Efficient Content-Based Sparse Attention with Routing Transformers
- Sparse Sinkhorn Attention
- Reformer: The Efficient Transformer
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- Compressive Transformers for Long-Range Sequence Modelling
- Adaptive Attention Span in Transformers
- ABC: Attention with Bounded-Memory Control
- LServe: Efficient Long-Sequence LLM Serving with Unified Sparse Attention
- XAttention: Block Sparse Attention with Antidiagonal Scoring
- MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
- MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
- SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
- Transformers are Multi-State RNNs
- RetrievalAttention: Accelerating Long-context LLM Inference via Vector Retrieval
- LongHeads: Multi-Head Attention is Secretly a Long Context Processor
- ShadowKV: KV Cache in Shadows for High-throughput Long-context LLM Inference
- PQCache: Product Quantization-based KVCache for Long Context LLM Inference
- Transformers are Multi-State RNNs
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
- Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
- Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
- DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
- LongLoRA: Efficient Fine-tuning of Long-context Large Language Models
- SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
- Efficient Streaming Language Models with Attention Sinks
- Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity
- SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- MoBA: Mixture of Block Attention for Long-Context LLMs
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- Longformer: The Long-Document Transformer
- FlashAttention: Fast and Memory‑Efficient Exact Attention with IO‑Awareness
- FlashAttention‑2: Faster Attention with Better Parallelism and Work Partitioning
- FlashAttention‑3: Fast and Accurate Attention with Asynchrony and Low‑Precision
- Fast Transformer Decoding: One Write-Head is All You Need
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
- DeepSeek-V3 Technical Report
- Hardware-Efficient Attention for Fast Decoding
- Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity
- MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression
- SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
- MoH: Multi-Head Attention as Mixture-of-Head Attention
- LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training
- MoBA: Mixture of Block Attention for Long-Context LLMs
- MoM: Linear Sequence Modeling with Mixture-of-Memories
- Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
- SageAttention: Accurate 8‑Bit Attention for Plug‑and‑Play Inference Acceleration
- SageAttention 2 Technical Report: Accurate 4‑Bit Attention for Plug‑and‑Play Inference Acceleration
- SageAttention 3: Microscaling FP4 Attention for Inference and an Exploration of 8‑Bit Training
- INT‑FlashAttention: Enabling Flash Attention for INT8 Quantization
- Q‑BERT: Hessian‑Based Ultra‑Low‑Precision Quantization of BERT
- Q8BERT: Quantized 8‑Bit BERT
- Fully Quantized Transformer for Machine Translation
- I‑BERT: Integer‑Only BERT Quantization
- BitDistiller: Unleashing the Potential of Sub‑4‑Bit LLMs via Self‑Distillation
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- Mixture-of-Experts with Expert Choice Routing
- BASE Layers: Simplifying Training of Large, Sparse Models
- Hash Layers For Large Sparse Models
- Harder Tasks Need More Experts: Dynamic Routing in MoE Models
- Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models
- Ada-K Routing: Boosting the Efficiency of MoE-based LLMs
- AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models
- MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts
- ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
- BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
- Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
- Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
- Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity
- From Sparse to Soft Mixtures of Experts
- OLMoE: Open Mixture-of-Experts Language Models
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
- Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters
- Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
- Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
- ModuleFormer: Modularity Emerges from Mixture-of-Experts
- LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin
- MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models
- Mixture of LoRA Experts
- Mixture of A Million Experts
- MoEfication: Transformer Feed-forward Layers are Mixtures of Experts
- Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation
- LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training
- LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training
- DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs
- MoDification: Mixture of Depths Made Easy
- Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
- Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
- Zamba: A Compact 7B SSM Hybrid Model
- Zamba2 Suite: Technical Report
- Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
- Jamba: A Hybrid Transformer-Mamba Language Model
- RWKV-X: A Linear Complexity Hybrid Language Model
- Minimax-01: Scaling Foundation Models with Lightning Attention
- The Mamba in the Llama: Distilling and Accelerating Hybrid Models
- HunYuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought
- Zebra-Llama: Towards Extremely Efficient Hybrid Models
- YOCO: You Only Cache Once: Decoder-Decoder Architectures for Language Models
- RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
- LaCT: Test-time Training Done Right
- Hymba: A Hybrid-head Architecture for Small Language Models
- WuNeng: Hybrid State with Attention
- TransMamba: Flexibly Switching between Transformer and Mamba
- Liger: Linearizing Large Language Models to Gated Recurrent Structures
- LoLCATs: On Low-Rank Linearizing of Large Language Models
- LoLA: Low-Rank Linear Attention With Sparse Caching
- Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance
- Large Language Diffusion Models
- Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
- Likelihood-Based Diffusion Language Models
- DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models
- Diffusion-LM Improves Controllable Text Generation
- Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
- Scaling Diffusion Language Models via Adaptation from Autoregressive Models
- LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
- Unified Multimodal Discrete Diffusion
- LaViDa: A Large Diffusion Language Model for Multimodal Understanding
- MMaDA: Multimodal Large Diffusion Language Models
- Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding
- AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding
- Res-VMamba: Fine-Grained Food Category Visual Classification Using Selective State Space Models with Deep Residual Learning
- RSMamba: Remote Sensing Image Classification with State Space Model
- InsectMamba: State Space Model with Adaptive Composite Features for Insect Recognition
- SpectralMamba: Efficient Mamba for Hyperspectral Image Classification
- MedMamba: Vision Mamba for Medical Image Classification
- MamMIL: Multiple Instance Learning for Whole Slide Images with State Space Models
- MemoryMamba: Memory-Augmented State Space Model for Defect Recognition
- Scaling Vision with Sparse Mixture of Experts
- Robust Mixture-of-Expert Training for Convolutional Neural Networks
- Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks
- Fusion-Mamba for Cross-Modality Object Detection
- SOAR: Advancements in Small Body Object Detection for Aerial Imagery Using State Space Models and Programmable Gradients
- Mamba YOLO: SSMs-based YOLO for Object Detection
- MiM-ISTD: Mamba-in-Mamba for Efficient Infrared Small Target Detection
- Voxel Mamba: Group-Free State Space Models for Point Cloud Based 3D Object Detection
- HTD-Mamba: Efficient Hyperspectral Target Detection with Pyramid State Space Model
- ViG: Linear-Complexity Visual Sequence Learning with Gated Linear Attention
- Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
- Tutel: Adaptive Mixture-of-Experts at Scale
- Mamba or RWKV: Exploring High-Quality and High-Efficiency Segment Anything Model
- Vision Mamba-based Autonomous Crack Segmentation on Concrete, Asphalt, and Masonry Surfaces
- SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation
- PyramidMamba: Rethinking Pyramid Feature Fusion with Selective Space State Model for Semantic Segmentation of Remote Sensing Imagery
- PixMamba: Leveraging State Space Models in a Dual-Level Architecture for Underwater Image Enhancement
- WaterMamba: Visual State Space Model for Underwater Image Enhancement
- MambaUIE&SR: Unraveling the Ocean's Secrets with Only 2.8 GFLOPs
- RetinexMamba: Retinex-Based Mamba for Low-Light Image Enhancement
- LLEMamba: Low-Light Enhancement via Relighting-Guided Mamba with Deep Unfolding Network
- FourierMamba: Fourier Learning Integration with State Space Models for Image Deraining
- Mamba-based Light Field Super-Resolution with Efficient Subspace Scanning
- LFMamba: Light Field Image Super-Resolution with State Space Model
- HDMba: Hyperspectral Remote Sensing Imagery Dehazing with State Space Model
- BVI-RLV: A Fully Registered Dataset and Benchmarks for Low-Light Video Enhancement
- FD-Vision Mamba for Endoscopic Exposure Correction
- StyleRWKV: High-Quality and High-Efficiency Style Transfer with RWKV-like Architecture
- U-Shaped Vision Mamba for Single Image Dehazing
- DVMSR: Distillated Vision Mamba for Efficient Super-Resolution
- Sparse Reconstruction of Optical Doppler Tomography Based on State Space Model
- VmambaIR: Visual State Space Model for Image Restoration
- CU-Mamba: Selective State Space Models with Channel Learning for Image Restoration
- Serpent: Scalable and Efficient Image Restoration via Multi-Scale Structured State Space Models
- GMSR: Gradient-Guided Mamba for Spectral Reconstruction from RGB Images
- Q-MambaIR: Accurate Quantized Mamba for Efficient Image Restoration
- MatIR: A Hybrid Mamba-Transformer Image Restoration Model
- Exploring Real & Synthetic Dataset and Linear Attention in Image Restoration
- Restore-RWKV: Efficient and Effective Medical Image Restoration with RWKV
- RainRWKV: A Deep RWKV Model for Video Deraining
- Multiple Span Bidirectional RWKV Network for Infrared Image Super-Resolution
- Multi-View Learning with Context-Guided Receptance for Image Denoising
- ID-RWKV: Image Deraining RWKV
- Scalable Diffusion Models with State Space Backbone
- Gamba: Marry Gaussian Splatting With Mamba for Single-View 3D Reconstruction
- Zigma: A DiT-Style Zigzag Mamba Diffusion Model
- DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis
- Efficient 3D Shape Generation via Diffusion Mamba with Bidirectional SSMs
- Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation
- Dimba: Transformer-Mamba Diffusion Models
- Scalable Autoregressive Image Generation with Mamba
- MaskMamba: A Hybrid Mamba-Transformer Model for Masked Image Generation
- SDiT: Spiking Diffusion Model with Transformer
- Diffusion-RWKV: Scaling RWKV-like Architectures for Diffusion Models
- Vision Mamba for Classification of Breast Ultrasound Images
- MambaMIL: Enhancing Long Sequence Modeling with Sequence Reordering in Computational Pathology
- U-Mamba: Enhancing Long-Range Dependency for Biomedical Image Segmentation
- VM-UNet: Vision Mamba UNet for Medical Image Segmentation
- SegMamba: Long-Range Sequential Modeling Mamba for 3D Medical Image Segmentation
- MambaMIR: An Arbitrary-Masked Mamba for Joint Medical Image Reconstruction and Uncertainty Estimation
- MMR-Mamba: Multi-Contrast MRI Reconstruction with Mamba and Spatial-Frequency Information Fusion
- VMambaMorph: A Visual Mamba-Based Framework with Cross-Scan Module for Deformable 3D Image Registration
- I2I-Mamba: Multi-Modal Medical Image Synthesis via Selective State Space Modeling
- Motion-Guided Dual-Camera Tracker for Endoscope Tracking and Motion Analysis in a Mechanical Gastric Simulator
- BSBP-RWKV: Background Suppression with Boundary Preservation for Efficient Medical Image Segmentation
- Restore-RWKV: Efficient and Effective Medical Image Restoration with RWKV
- Zig-RiR: Zigzag RWKV-in-RWKV for Efficient Medical Image Segmentation
- RWKV-UNet: Improving UNet with Long-Range Cooperation for Effective Medical Image Segmentation
- RNN-Based Multiple Instance Learning for the Classification of Histopathology Whole Slide Images
- Delta-WKV: A Novel Meta-in-Context Learner for MRI Super-Resolution
- H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving
- MambaBEV: An Efficient 3D Detection Model with Mamba2
- Trajectory Mamba: Efficient Attention-Mamba Forecasting Model Based on Selective SSM
- DRaMa: An Efficient End-to-End Motion Planner for Autonomous Driving with Mamba
- Enhancing Autonomous Driving Perception With Mamba-Based Dual-Branch Depth Estimation
- SalM²: An Extremely Lightweight Saliency Mamba Model for Real-Time Cognitive Awareness of Driver Attention
- OccRWKV: Rethinking Efficient 3D Semantic Occupancy Prediction with Linear Complexity
- RS-Mamba: Large Remote Sensing Image Dense Prediction with State Space Models
- HSIMamba: Hyperspectral Imaging Efficient Feature Learning with Bidirectional State Space for Classification
- RS3Mamba: Visual State Space Model for Remote Sensing Image Semantic Segmentation
- Samba: Semantic Segmentation of Remotely Sensed Images with State Space Model
- ChangeMamba: Remote Sensing Change Detection with Spatio-Temporal State Space Model
- Pan-Mamba: Effective Pan-Sharpening with State Space Model
- RSCaMa: Remote Sensing Image Change Captioning with State Space Model
- RSDehamba: Lightweight Vision Mamba for Remote Sensing Satellite Image Dehazing
- Frequency-Assisted Mamba for Remote Sensing Image Super-Resolution
- A Novel State Space Model with Local Enhancement and State Sharing for Image Fusion
- RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection
- Audio Mamba: Pretrained Audio State Space Model for Audio Tagging
- Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations
- Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
- SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model
- It’s Raw! Audio Generation with State-Space Models
- Multichannel Long-Term Streaming Neural Speech Enhancement for Static and Moving Speakers
- Dual-path Mamba: Short and Long-term Bidirectional Selective Structured State Space Models for Speech Separation
- SPMamba: State-Space Model is All You Need in Speech Separation
- TRAMBA: A Hybrid Transformer and Mamba Architecture for Practical Audio and Bone Conduction Speech Super Resolution and Enhancement on Mobile and Wearable Platforms
- An Investigation of Incorporating Mamba for Speech Enhancement
- MAMCA: Optimal on Accuracy and Efficiency for Automatic Modulation Classification with Extended Signal Length
- Mamba in Speech: Towards an Alternative to Self-Attention
- Why Perturbing Symbolic Music is Necessary: Fitting the Distribution of Never-used Notes through a Joint Probabilistic Diffusion Model
- Exploring RWKV for Memory Efficient and Low Latency Streaming ASR
- Advancing VAD Systems Based on Multi-Task Learning with Improved Model Structures
- Mamba-Enhanced Text-Audio-Video Alignment Network for Emotion Recognition in Conversations
- AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation
- AV-Mamba: Cross-Modality Selective State Space Models for Audio-Visual Question Answering
- LaViDa: A Large Diffusion Language Model for Multimodal Understanding
- MMaDA: Multimodal Large Diffusion Language Models
- LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
- Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding
- Unified Multimodal Discrete Diffusion
- VisualRWKV-HD and UHD: Advancing High-Resolution Processing for Visual Language Models
- RWKV-CLIP: A Robust Vision-Language Representation Learner
- Scaling Vision-Language Models with Sparse Mixture of Experts
- PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts
- Multimodal Contrastive Learning with LIMoE: The Language-Image Mixture of Experts
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
- Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
- Mixture of Cluster-conditional LoRA Experts for Vision-Language Instruction Tuning
- LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs
⭐ Join us to improve this repo! ⭐ If you know of any Awesome-Efficient-Arch work we've missed, please contribute via a PR or raise an issue. Your contributions are very welcome!
If you find this survey useful, please consider citing our paper:
@article{sun2025survey,
title={Speed Always Wins: A Survey on Efficient Architectures for Large Language Models},
author={Sun, Weigao and Hu, Jiaxi and Zhou, Yucheng and Du, Jusen and Lan, Disen and Wang, Kexin and Zhu, Tong and Qu, Xiaoye and Zhang, Yu and Mo, Xiaoyu and Liu, Daizong and Liang, Yuxuan and Chen, Wenliang and Li, Guoqi and Cheng, Yu},
journal={arXiv preprint arXiv:2508.09834},
year={2025}
}

