Name of failing test
pip install git+https://github.com/TIGER-AI-Lab/Mantis.git && pytest -v -s models/multimodal -m 'not core_model' --ignore models/multimodal/generation/test_common.py --ignore models/multimodal/processing
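For local triage, a minimal sketch that runs each suspect test file in isolation so one failure cannot mask another. The file paths are assumptions inferred from the test names in this report and may not match the actual vLLM tests/ tree:

```python
# Triage sketch: run each suspect multimodal test file separately.
# Paths below are assumptions inferred from the test names in this
# report; adjust to the actual layout under vLLM's tests/ directory.
import subprocess
import sys

SUSPECT_FILES = [
    "models/multimodal/generation/test_granite_speech.py",
    "models/multimodal/generation/test_pixtral.py",
    "models/multimodal/pooling/test_clip.py",
    "models/multimodal/pooling/test_siglip.py",
]

for path in SUSPECT_FILES:
    # Mirror the CI invocation's marker filter.
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "-v", "-s", path, "-m", "not core_model"],
        check=False,
    )
    print(f"{path}: exit code {proc.returncode}")
```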
Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in transformers)
🧪 Describe the failing test
Failing Tests Summary - Multimodal Models
1. test_granite_speech.py::test_models
Tests IBM Granite Speech model with audio LoRA adapter.
Failure: Error during test execution (the specific error is not visible in the truncated CI output)
Configuration: Model: ibm-granite/granite-speech-3.3-2b, bfloat16, max_model_len=2048, num_logprobs=10
Likely cause: Audio encoder or LoRA adapter implementation incompatible with ROCm; audio processing pipeline may use CUDA-specific kernels not properly translated for ROCm.
2. test_pixtral.py::test_chat (2 failures, one per max_model_len value)
Tests Mistral Pixtral/Small multimodal chat with vision.
Failure: Logprobs mismatch - numerical differences in output probabilities
Configuration: Models: Pixtral-12B, Mistral-Small-3.1-24B; max_model_len: 8192, 65536
Likely cause: Vision encoder attention or image processing yields different numerical results. Output shows token probability differences (e.g., "up" vs "directly" ranking flipped), suggesting vision feature extraction divergence.
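For context on what "logprobs mismatch" means here: these generation tests tolerate a token disagreement as long as each run's token still appears in the other run's top-k candidates (the idea behind vLLM's check_logprobs_close test helper). A simplified sketch of that tolerance, with illustrative rather than actual signatures:

```python
# Illustrative sketch of the logprobs-closeness tolerance these tests
# apply; simplified, not vLLM's actual check_logprobs_close signature.
def logprobs_close(tokens_a, topk_a, tokens_b, topk_b) -> bool:
    """topk_a/topk_b are per-position dicts mapping token_id -> logprob
    over the top-k candidates (k = num_logprobs, 10 in this config)."""
    for tok_a, tok_b, cand_a, cand_b in zip(tokens_a, topk_a, tokens_b, topk_b):
        if tok_a == tok_b:
            continue
        # Tokens disagree: accept only if each run's choice is still
        # among the other run's top-k candidates.
        if tok_a not in cand_b or tok_b not in cand_a:
            return False
    return True
```

With num_logprobs=10, a failure therefore implies that at some position a generated token fell outside the other run's top-10 candidates, i.e. more than a marginal rank flip.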
3. test_clip.py::test_models_* (3 failures)
Tests OpenAI CLIP vision-text embedding model.
Failure: Text-only, image-only, and combined text-image pooling tests all fail
Configuration: Model: openai/clip-vit-base-patch32, float dtype
Likely cause: CLIP vision transformer or pooling operations not ROCm-compatible; dual-encoder architecture may have backend-specific implementations.
4. test_dse_qwen2_vl.py::test_models_image
Tests DSE Qwen2-VL multimodal reranker.
Failure: Image processing test failure
Configuration: Model: MrLight/dse-qwen2-2b-mrl-v1, bfloat16
Likely cause: Qwen2-VL vision encoder not properly supported on ROCm in pooling mode.
5. test_jinavl_reranker.py::test_model_* (3 failures)
Tests Jina VL reranker for text-image, image-text, image-image comparisons.
Failure: All three modality combination tests fail
Configuration: Model: jinaai/jina-reranker-m0, half precision
Likely cause: Jina VL reranker uses custom cross-attention/pooling for multimodal comparison; likely CUDA-specific implementation.
6. test_radio.py::test_radio (2 failures)
Tests NVIDIA C-RADIO vision model.
Failure: Both half and bfloat16 precision tests fail
Configuration: Model: nvidia/C-RADIOv2-H
7. test_siglip.py::test_models_* (6 failures)
Tests Google SigLIP vision-text models.
Failure: Text-only, image-only, and combined tests fail for both the siglip and siglip2 variants
Configuration: Models: google/siglip-base-patch16-224, google/siglip2-base-patch16-224, float dtype
Likely cause: SigLIP's dual-encoder contrastive architecture relies on attention patterns that may take different kernel paths on ROCm, mirroring the CLIP failures above.
Common Theme: All 18 failures are multimodal model tests (vision/audio encoders, pooling/embedding models). Specific issues likely in:
- Vision encoder attention mechanisms (FlashAttention, custom attention kernels)
- Audio feature extraction pipelines
- Cross-modal pooling/reranking operations
- Numerical precision differences in multimodal fusion layers
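Until the kernel-level divergence is isolated, a minimal sketch of one mitigation: mark the affected tests as expected failures on ROCm using PyTorch's HIP marker (torch.version.hip is non-None on ROCm builds). The test body and reason string below are placeholders:

```python
# Mitigation sketch: gate the affected tests on ROCm while the numerical
# divergence is investigated. torch.version.hip is a version string on
# ROCm builds of PyTorch and None otherwise; the decorator is standard
# pytest usage.
import pytest
import torch

IS_ROCM = torch.version.hip is not None

@pytest.mark.xfail(
    IS_ROCM,
    reason="Multimodal encoder numerics diverge on ROCm; see this issue.",
    strict=False,  # keep running, so an unexpected pass shows up in reports
)
def test_example_multimodal_model():  # placeholder test body
    ...
```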
📝 History of failing test
AMD CI Buildkite build references:
- 1041
- 1077
- 1088
- 1109
- 1111
CC List.
No response