
[CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 1 #29511

@AndreasKaratzas

Description

Name of failing test

```
pip install git+https://github.com/TIGER-AI-Lab/Mantis.git && pytest -v -s models/multimodal -m 'not core_model' --ignore models/multimodal/generation/test_common.py --ignore models/multimodal/processing
```

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

Failing Tests Summary - Multimodal Models

1. test_granite_speech.py::test_models

Tests IBM Granite Speech model with audio LoRA adapter.

Failure: Test execution failure (specific error not shown in truncated output)
Configuration: Model: ibm-granite/granite-speech-3.3-2b, bfloat16, max_model_len=2048, num_logprobs=10
Likely cause: Audio encoder or LoRA adapter implementation incompatible with ROCm; audio processing pipeline may use CUDA-specific kernels not properly translated for ROCm.
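
For local triage, a minimal vLLM repro of this configuration might look like the sketch below. The silent audio clip and the `<|audio|>` prompt placeholder are stand-ins for the suite's real fixtures, and the exact wiring of the audio LoRA adapter is an assumption.

```python
import numpy as np
from vllm import LLM, SamplingParams

# Config values taken from the test above; everything else is illustrative.
llm = LLM(
    model="ibm-granite/granite-speech-3.3-2b",
    dtype="bfloat16",
    max_model_len=2048,
    enable_lora=True,  # the test exercises the model's audio LoRA adapter
)
audio = (np.zeros(16000, dtype=np.float32), 16000)  # 1 s of silence at 16 kHz
out = llm.generate(
    {
        "prompt": "<|audio|>can you transcribe the speech?",  # placeholder prompt
        "multi_modal_data": {"audio": audio},
    },
    SamplingParams(max_tokens=32, logprobs=10),  # num_logprobs=10 per the config
)
print(out[0].outputs[0].text)
```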

2. test_pixtral.py::test_chat (2 failures - different max_model_len)

Tests Mistral Pixtral/Small multimodal chat with vision.

Failure: Logprobs mismatch - numerical differences in output probabilities
Configuration: Models: Pixtral-12B, Mistral-Small-3.1-24B; max_model_len: 8192, 65536
Likely cause: Vision encoder attention or image processing yields different numerical results. Output shows token probability differences (e.g., "up" vs "directly" ranking flipped), suggesting vision feature extraction divergence.
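
The "ranking flipped" symptom is exactly the drift that a `check_logprobs_close`-style comparison tolerates only up to a point: a token may be demoted within the other run's top-k, but the check fails once it drops out entirely. A simplified stand-in for that logic (all names illustrative):

```python
# One top-k logprob dict per generated position, per run.
def logprobs_close(sampled_tokens: list[str],
                   other_topk: list[dict[str, float]]) -> bool:
    """Pass if every sampled token still appears in the other run's top-k."""
    return all(tok in topk for tok, topk in zip(sampled_tokens, other_topk))

# The position where "up" vs "directly" flipped between runs:
run_a_sampled = ["up"]
run_b_topk = [{"directly": -0.8, "up": -1.0}]  # "up" demoted, still in top-k
print(logprobs_close(run_a_sampled, run_b_topk))  # True -> within tolerance
```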

3. test_clip.py::test_models_* (3 failures)

Tests OpenAI CLIP vision-text embedding model.

Failure: Test failures for text-only, image-only, and text-image pooling
Configuration: Model: openai/clip-vit-base-patch32, float dtype
Likely cause: CLIP vision transformer or pooling operations not ROCm-compatible; dual-encoder architecture may have backend-specific implementations.
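
A minimal way to exercise the text-only and image-only pooling paths separately, assuming vLLM's `task="embed"` pooling runner (kwarg names vary across versions, and the empty prompt on the image path is an assumption):

```python
from PIL import Image
from vllm import LLM

llm = LLM(model="openai/clip-vit-base-patch32", task="embed", dtype="float32")

text_out = llm.embed("a photo of a cat")       # text-only pooling path
print(len(text_out[0].outputs.embedding))      # shared embedding dimensionality

img = Image.new("RGB", (224, 224))             # blank stand-in image
image_out = llm.embed({"prompt": "", "multi_modal_data": {"image": img}})
print(len(image_out[0].outputs.embedding))     # image-only pooling path
```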

4. test_dse_qwen2_vl.py::test_models_image

Tests DSE Qwen2-VL multimodal reranker.

Failure: Image processing test failure
Configuration: Model: MrLight/dse-qwen2-2b-mrl-v1, bfloat16
Likely cause: Qwen2-VL vision encoder not properly supported on ROCm in pooling mode.

5. test_jinavl_reranker.py::test_model_* (3 failures)

Tests Jina VL reranker for text-image, image-text, image-image comparisons.

Failure: All three modality combination tests fail
Configuration: Model: jinaai/jina-reranker-m0, half precision
Likely cause: Jina VL reranker uses custom cross-attention/pooling for multimodal comparison; likely CUDA-specific implementation.
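
A hedged sketch of driving the reranker through vLLM's scoring API; how jina-reranker-m0 consumes image inputs via `llm.score` is an assumption here, so only the text-to-text baseline is shown:

```python
from vllm import LLM

# task="score" selects the cross-encoder/reranker path.
llm = LLM(model="jinaai/jina-reranker-m0", task="score", dtype="half")

# Text -> text baseline; the failing tests additionally pair text with images.
outputs = llm.score("slm markdown", ["raw markdown text", "an unrelated sentence"])
for o in outputs:
    print(o.outputs.score)  # higher score = more relevant pairing
```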

6. test_radio.py::test_radio (2 failures)

Tests NVIDIA C-RADIO vision model.

Failure: Both half and bfloat16 precision tests fail
Configuration: Model: nvidia/C-RADIOv2-H

7. test_siglip.py::test_models_* (6 failures)

Tests Google SigLIP vision-text models.

Failure: Text-only, image-only, and combined tests fail for both the siglip and siglip2 variants
Configuration: Models: google/siglip-base-patch16-224, google/siglip2-base-patch16-224, float dtype
Likely cause: SigLIP's contrastive dual-encoder architecture uses attention patterns that may take backend-specific kernel paths on ROCm.


Common Theme: All 18 failures are multimodal model tests (vision/audio encoders, pooling/embedding models). The specific issues likely lie in the following areas (a triage sketch follows the list):

  • Vision encoder attention mechanisms (FlashAttention, custom attention kernels)
  • Audio feature extraction pipelines
  • Cross-modal pooling/reranking operations
  • Numerical precision differences in multimodal fusion layers
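
One cheap bisection consistent with this theme is forcing a different attention backend before a failing model loads: if the numerics change or stabilize, the kernel path is implicated rather than the model code. Whether the vision-encoder path honors `VLLM_ATTENTION_BACKEND` depends on the vLLM version, so treat this as a sketch (backend names also vary by version and platform):

```python
import os

# Must be set before vllm is imported; e.g. "TORCH_SDPA", "ROCM_FLASH".
os.environ["VLLM_ATTENTION_BACKEND"] = "TORCH_SDPA"

from vllm import LLM

# Re-run a small failing model under the alternate backend and diff outputs.
llm = LLM(model="google/siglip-base-patch16-224", task="embed", dtype="float32")
print(llm.embed("smoke test")[0].outputs.embedding[:4])
```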

📝 History of failing test

AMD CI Buildkite build references:

  • 1041
  • 1077
  • 1088
  • 1109
  • 1111

CC List.

No response
