Name of failing test
pip install git+https://github.com/TIGER-AI-Lab/Mantis.git && pytest -v -s models/multimodal -m 'not core_model' --ignore models/multimodal/generation/test_common.py --ignore models/multimodal/processing
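For local triage, a minimal sketch that runs each suspect test file in isolation so one failure cannot mask another. The file paths are assumptions inferred from the test names in this report and may not match the actual vLLM tests/ tree:

```python
# Triage sketch: run each suspect multimodal test file separately.
# Paths below are assumptions inferred from the test names in this
# report; adjust to the actual layout under vLLM's tests/ directory.
import subprocess
import sys

SUSPECT_FILES = [
    "models/multimodal/generation/test_granite_speech.py",
    "models/multimodal/generation/test_pixtral.py",
    "models/multimodal/pooling/test_clip.py",
    "models/multimodal/pooling/test_siglip.py",
]

for path in SUSPECT_FILES:
    # Mirror the CI invocation's marker filter.
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "-v", "-s", path, "-m", "not core_model"],
        check=False,
    )
    print(f"{path}: exit code {proc.returncode}")
```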
Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in transformers)
🧪 Describe the failing test
Failing Tests Summary - Multimodal Models
1. test_granite_speech.py::test_models
Tests IBM Granite Speech model with audio LoRA adapter.
Failure: Error during test execution (the specific error is not visible in the truncated CI output)
Configuration: Model: ibm-granite/granite-speech-3.3-2b, bfloat16, max_model_len=2048, num_logprobs=10
Likely cause: Audio encoder or LoRA adapter implementation incompatible with ROCm; audio processing pipeline may use CUDA-specific kernels not properly translated for ROCm.
2. test_pixtral.py::test_chat (2 failures, one per max_model_len value)
Tests Mistral Pixtral/Small multimodal chat with vision.
Failure: Logprobs mismatch - numerical differences in output probabilities
Configuration: Models: Pixtral-12B, Mistral-Small-3.1-24B; max_model_len: 8192, 65536
Likely cause: Vision encoder attention or image processing yields different numerical results. Output shows token probability differences (e.g., "up" vs "directly" ranking flipped), suggesting vision feature extraction divergence.
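For context on what "logprobs mismatch" means here: these generation tests tolerate a token disagreement as long as each run's token still appears in the other run's top-k candidates (the idea behind vLLM's check_logprobs_close test helper). A simplified sketch of that tolerance, with illustrative rather than actual signatures:

```python
# Illustrative sketch of the logprobs-closeness tolerance these tests
# apply; simplified, not vLLM's actual check_logprobs_close signature.
def logprobs_close(tokens_a, topk_a, tokens_b, topk_b) -> bool:
    """topk_a/topk_b are per-position dicts mapping token_id -> logprob
    over the top-k candidates (k = num_logprobs, 10 in this config)."""
    for tok_a, tok_b, cand_a, cand_b in zip(tokens_a, topk_a, tokens_b, topk_b):
        if tok_a == tok_b:
            continue
        # Tokens disagree: accept only if each run's choice is still
        # among the other run's top-k candidates.
        if tok_a not in cand_b or tok_b not in cand_a:
            return False
    return True
```

With num_logprobs=10, a failure therefore implies that at some position a generated token fell outside the other run's top-10 candidates, i.e. more than a marginal rank flip.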
3. test_clip.py::test_models_* (3 failures)
Tests OpenAI CLIP vision-text embedding model.
Failure: Text-only, image-only, and combined text-image pooling tests all fail
Configuration: Model: openai/clip-vit-base-patch32, float dtype
Likely cause: CLIP vision transformer or pooling operations not ROCm-compatible; dual-encoder architecture may have backend-specific implementations.
4. test_dse_qwen2_vl.py::test_models_image
Tests DSE Qwen2-VL multimodal reranker.
Failure: Image processing test failure
Configuration: Model: MrLight/dse-qwen2-2b-mrl-v1, bfloat16
Likely cause: Qwen2-VL vision encoder not properly supported on ROCm in pooling mode.
5. test_jinavl_reranker.py::test_model_* (3 failures)
Tests Jina VL reranker for text-image, image-text, image-image comparisons.
Failure: All three modality combination tests fail
Configuration: Model: jinaai/jina-reranker-m0, half precision
Likely cause: Jina VL reranker uses custom cross-attention/pooling for multimodal comparison; likely CUDA-specific implementation.
6. test_radio.py::test_radio (2 failures)
Tests NVIDIA C-RADIO vision model.
Failure: Both half and bfloat16 precision tests fail
Configuration: Model: nvidia/C-RADIOv2-H
7. test_siglip.py::test_models_* (6 failures)
Tests Google SigLIP vision-text models.
Failure: Text-only, image-only, and combined tests fail for both the siglip and siglip2 variants
Configuration: Models: google/siglip-base-patch16-224, google/siglip2-base-patch16-224, float dtype
Likely cause: SigLIP's dual-encoder contrastive architecture relies on attention patterns that may take different kernel paths on ROCm, mirroring the CLIP failures above.
Common Theme: All 18 failures are multimodal model tests (vision/audio encoders, pooling/embedding models). Specific issues likely in:
- Vision encoder attention mechanisms (FlashAttention, custom attention kernels)
- Audio feature extraction pipelines
- Cross-modal pooling/reranking operations
- Numerical precision differences in multimodal fusion layers
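Until the kernel-level divergence is isolated, a minimal sketch of one mitigation: mark the affected tests as expected failures on ROCm using PyTorch's HIP marker (torch.version.hip is non-None on ROCm builds). The test body and reason string below are placeholders:

```python
# Mitigation sketch: gate the affected tests on ROCm while the numerical
# divergence is investigated. torch.version.hip is a version string on
# ROCm builds of PyTorch and None otherwise; the decorator is standard
# pytest usage.
import pytest
import torch

IS_ROCM = torch.version.hip is not None

@pytest.mark.xfail(
    IS_ROCM,
    reason="Multimodal encoder numerics diverge on ROCm; see this issue.",
    strict=False,  # keep running, so an unexpected pass shows up in reports
)
def test_example_multimodal_model():  # placeholder test body
    ...
```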
📝 History of failing test
AMD CI Buildkite build references:
- 1041
- 1077
- 1088
- 1109
- 1111
CC List.
No response