[CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 2

### Name of failing test

`pip install git+https://github.com/TIGER-AI-Lab/Mantis.git; pytest -v -s models/multimodal/generation/test_common.py -m 'split(group=0) and not core_model'`

### Basic information

- [ ] Flaky test
- [x] Can reproduce locally
- [ ] Caused by external libraries (e.g. bug in `transformers`)

### 🧪 Describe the failing test

**Failing Tests Summary:**

**ovis2_5 model (test_case8-11)**
- Tests: `test_single_image_models`, `test_multi_image_models`, `test_video_models`
- Failure type: Logprob comparison mismatch between vLLM and HuggingFace outputs
- Configuration: dtype=half, model=AIDC-AI/Ovis2.5-2B with image/multi-image/video inputs
- Likely cause: Numerical precision differences in the visual encoder or attention computations between the vLLM and HF implementations
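
To illustrate why `dtype=half` makes this kind of mismatch plausible, here is a minimal standalone sketch (not taken from the test suite, and not a diagnosis of Ovis2.5 itself). The logit values are made up for illustration. It shows two distinct candidate logits collapsing to the same float16 value, so the chosen token can depend on tiny implementation differences.

```python
import math
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 (half) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


# Hypothetical logits for three candidate tokens; the first two are nearly tied.
logits = [8.1234, 8.1251, 2.0]
half_logits = [to_half(v) for v in logits]

# In full precision the second candidate wins by a small margin.
full_logprobs = log_softmax(logits)
# After rounding to half precision the two leading candidates are exactly tied.
half_logprobs = log_softmax(half_logits)

print("full:", full_logprobs)
print("half:", half_logprobs)
```

Near 8.0 the spacing between adjacent float16 values is 2^-7, so differences smaller than that between leading candidates can vanish entirely.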

**smolvlm model (test_case21-24)**
- Tests: `test_single_image_models`, `test_multi_image_models`
- Failure type: Logprob comparison mismatch
- Configuration: dtype varies, model=HuggingFaceTB/SmolVLM2-2.2B-Instruct with image inputs
- Likely cause: SmolVLM2's Idefics3-based architecture may have numerical stability issues in the vision-text fusion layers when compiled

**minimax_vl_01 model (test_case29-32)**
- Tests: `test_single_image_models`, `test_multi_image_models`
- Failure type: Logprob comparison mismatch
- Configuration: dtype=bfloat16, model=MiniMaxAI/MiniMax-VL-01 with a large GPU memory requirement (80 GB)
- Likely cause: This large model, with its custom HF processor and output post-processing, may diverge numerically in the visual feature extraction or token generation paths

**paddleocr_vl model (test_case45-46)**
- Tests: `test_single_image_models`, `test_multi_image_models`
- Failure type: Logprob comparison mismatch
- Configuration: model=PaddlePaddle/PaddleOCR-VL with OCR-specific visual processing
- Likely cause: The OCR-focused visual encoder may differ between implementations in its convolution or text detection layers, affecting logprob alignment

**qwen2_vl model (test_case54)**
- Tests: `test_single_image_models`
- Failure type: Logprob comparison mismatch
- Configuration: model=Qwen/Qwen2-VL-2B-Instruct, marked as cpu_model
- Likely cause: The test is marked for CPU; running it on ROCm may expose inconsistencies in the vision encoder's dynamic resolution handling or in the vision_start/vision_end token processing
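
For readers unfamiliar with the failure type shared by all of the groups above, the following is a simplified sketch of what a "logprob comparison" check between two backends typically does. It is an assumption-laden illustration, not the helper used in `test_common.py`: the function name `first_logprob_mismatch` and the data layout are hypothetical. The idea is that the two token sequences may diverge, but at the first point of divergence each backend's chosen token must appear among the other backend's top-k candidates.

```python
from typing import Dict, Sequence

# Hypothetical layout: one dict per decoding step, mapping token id -> logprob
# for the top-k candidates reported at that step.
TopLogprobs = Dict[int, float]


def first_logprob_mismatch(
    tokens_a: Sequence[int],
    topk_a: Sequence[TopLogprobs],
    tokens_b: Sequence[int],
    topk_b: Sequence[TopLogprobs],
) -> int:
    """Return the step index of an unacceptable divergence, or -1 if none."""
    for idx, (tok_a, tok_b) in enumerate(zip(tokens_a, tokens_b)):
        if tok_a == tok_b:
            continue
        # First divergence: acceptable only if each backend's pick is among
        # the other backend's top-k candidates at this step.
        if tok_a in topk_b[idx] and tok_b in topk_a[idx]:
            # Later steps are conditioned on different prefixes, so stop here.
            return -1
        return idx
    return -1


# Made-up example data: the backends disagree at step 1, but each pick is a
# top-k candidate of the other, so the divergence is tolerated.
hf_tokens = [11, 42, 7]
hf_topk = [{11: -0.1, 12: -2.5}, {42: -0.70, 43: -0.80}, {7: -0.2, 8: -1.9}]
vllm_tokens = [11, 43, 9]
vllm_topk = [{11: -0.1, 12: -2.4}, {43: -0.69, 42: -0.72}, {9: -0.3, 7: -1.5}]
ok_result = first_logprob_mismatch(hf_tokens, hf_topk, vllm_tokens, vllm_topk)

# Here the second backend picks token 99, which the first never considered.
bad_tokens = [11, 99, 9]
bad_topk = [{11: -0.1, 12: -2.4}, {99: -0.5, 42: -1.0}, {9: -0.3, 7: -1.5}]
bad_result = first_logprob_mismatch(hf_tokens, hf_topk, bad_tokens, bad_topk)

print(ok_result, bad_result)
```

A check of this shape tolerates small numerical drift that only reorders near-tied candidates, so a reported mismatch suggests the two backends disagree by more than a rounding-level amount at some step.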

### 📝 History of failing test

AMD-CI Buildkite build references:
- 1041
- 1077
- 1088
- 1109
- 1111

### CC List.

_No response_

[CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 2 #29536

Name of failing test

Basic information

🧪 Describe the failing test

📝 History of failing test

CC List.

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 2 #29536

Description

Name of failing test

Basic information

🧪 Describe the failing test

📝 History of failing test

CC List.

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions