Is your feature request related to a problem? Please describe.
DeepSpeed/ZeRO provides meaningful acceleration and memory reduction for large-scale training. However, timm’s training script currently does not include an official or built-in way to enable DeepSpeed/ZeRO. I noticed there was an earlier attempt in issue #490, but it appears to have been discontinued. I have implemented DeepSpeed/ZeRO support (with a few compromises) in my own training code built on top of the timm training script. If you think this would be a useful addition, I’d be happy to prepare a PR integrating it into the official timm codebase.
Describe the solution you'd like
The current implementation integrates DeepSpeed as an optional dependency and follows timm’s existing training structure. Specifically, it:
- Introduces a set of new CLI flags that match the current argument parsing style while enabling DeepSpeed and passing configuration options, such as:
  - `--deepspeed`
  - `--ds-zero-stage {0,1,2,3}`
  - `--ds-offload-optimizer {none,cpu,nvme}`
  - `--ds-offload-param {none,cpu,nvme}`
  - …
- Uses a small helper function, `build_ds_config`, to construct a DeepSpeed config dict/JSON directly from existing timm arguments (batch size, gradient accumulation, AMP dtype, clipping, etc.).
- Wraps the model and parameters with `deepspeed.initialize` only when `--deepspeed` is enabled, keeping the non-DeepSpeed training path completely unchanged.

This ensures full backward compatibility with existing scripts while providing users an opt-in path for ZeRO acceleration. Rough sketches of the argument parsing, the config builder, and the opt-in wrapping follow.
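As an illustration, here is a minimal argparse sketch for the proposed flags, following the argument-group style already used in `train.py` (defaults and help strings below are placeholders, not final choices):

```python
# Sketch only: proposed DeepSpeed arguments added to timm's existing train.py parser.
group = parser.add_argument_group('DeepSpeed parameters')
group.add_argument('--deepspeed', action='store_true', default=False,
                   help='Enable the DeepSpeed/ZeRO training path (requires the deepspeed package).')
group.add_argument('--ds-zero-stage', type=int, default=2, choices=[0, 1, 2, 3],
                   help='ZeRO optimization stage.')
group.add_argument('--ds-offload-optimizer', type=str, default='none', choices=['none', 'cpu', 'nvme'],
                   help='Device for ZeRO optimizer state offload.')
group.add_argument('--ds-offload-param', type=str, default='none', choices=['none', 'cpu', 'nvme'],
                   help='Device for parameter offload (ZeRO stage 3 only).')
```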
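A possible shape for the `build_ds_config` helper; the attribute names on `args` (`batch_size`, `grad_accum_steps`, `clip_grad`, `amp`, `amp_dtype`) follow the current `train.py` arguments, and the exact mapping shown is an assumption about how the translation could look:

```python
def build_ds_config(args):
    """Sketch: build a DeepSpeed config dict from existing timm CLI arguments."""
    ds_config = {
        # Per-GPU micro batch size and accumulation mirror --batch-size / --grad-accum-steps.
        'train_micro_batch_size_per_gpu': args.batch_size,
        'gradient_accumulation_steps': getattr(args, 'grad_accum_steps', 1),
        # Map timm's --clip-grad onto DeepSpeed's gradient clipping (0.0 disables it).
        'gradient_clipping': args.clip_grad if args.clip_grad is not None else 0.0,
        # AMP dtype: enable the bf16 or fp16 block to match --amp / --amp-dtype.
        'bf16': {'enabled': bool(args.amp and args.amp_dtype == 'bfloat16')},
        'fp16': {'enabled': bool(args.amp and args.amp_dtype == 'float16')},
        'zero_optimization': {
            'stage': args.ds_zero_stage,
        },
    }
    if args.ds_offload_optimizer != 'none':
        ds_config['zero_optimization']['offload_optimizer'] = {'device': args.ds_offload_optimizer}
    if args.ds_offload_param != 'none':
        ds_config['zero_optimization']['offload_param'] = {'device': args.ds_offload_param}
    return ds_config
```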
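Finally, a sketch of the conditional wrapping and the corresponding training-step changes. Names like `model`, `optimizer`, `loss_fn`, `input`, and `target` stand in for the objects in `train.py`; in this sketch timm's own LR scheduler keeps stepping outside the DeepSpeed engine:

```python
import deepspeed  # optional dependency; in the real code the import is guarded by --deepspeed

if args.deepspeed:
    ds_config = build_ds_config(args)
    # The returned engine takes over gradient accumulation, clipping, and optimizer stepping.
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        optimizer=optimizer,
        config=ds_config,
    )

# Training step: the DeepSpeed branch replaces the GradScaler / optimizer logic,
# while the else branch is timm's existing path, untouched.
if args.deepspeed:
    output = model_engine(input)
    loss = loss_fn(output, target)
    model_engine.backward(loss)   # instead of loss.backward() / scaler.scale(loss).backward()
    model_engine.step()           # instead of optimizer.step() + optimizer.zero_grad()
else:
    ...  # existing timm AMP / optimizer path
```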
Additional context
The current handling of model EMA, logging, and checkpointing in the DeepSpeed path is not yet elegant. I can refine these components and better align them with timm's existing utilities if this feature is accepted. In practice, the acceleration from DeepSpeed/ZeRO has been quite decent in my own training runs.
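For context on the checkpointing point: with ZeRO stages 2/3 the optimizer (and possibly parameter) state is sharded, so saving would likely need to go through DeepSpeed's own checkpoint API rather than timm's `CheckpointSaver`. A rough sketch of what that bridge could look like (directory layout, tag format, and `client_state` keys are placeholders; how EMA weights fit in is still open):

```python
# Saving: DeepSpeed writes the sharded model/optimizer state itself; extra metadata
# (epoch, best metric, possibly EMA state) can be carried in client_state.
model_engine.save_checkpoint(
    save_dir=output_dir,
    tag=f'epoch_{epoch}',
    client_state={'epoch': epoch, 'metric': best_metric},
)

# Resuming: load_checkpoint restores the engine and returns the saved client_state.
load_path, client_state = model_engine.load_checkpoint(output_dir, tag=f'epoch_{epoch}')
start_epoch = client_state['epoch'] + 1
```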