
[Bug]: It takes a long time to recover when working with multiple replicas #46182

@chyezh

Description


Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 25cee2bf188255d75f5d83de92ba61bb14564bfe
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

A search operation on a multi-replica Milvus cluster took over 50s when a replica was down.

[2025/12/04 07:51:06.534 +00:00] [WARN] [grpcclient/client.go:385] ["fail to get session"] [clientRequestUnixmsec=1764834666223] [traceID=62b418b4118f551ec30fe91acdd265a8] [clientRole=querynode-29] [error="context canceled"]

I found that the shard client on the proxy keeps retrying to verify the session and gets an unexpected "context canceled" error.
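
To illustrate the failure mode, here is a minimal, self-contained Go sketch (not Milvus code; the `verifySession` helper, timings, and messages are made up for illustration) of how a retry loop that keeps hitting the same dead node makes no progress and only stops once the caller's context is canceled, which is exactly the kind of "fail to get session ... context canceled" warning shown above:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// verifySession stands in for the session check the proxy's shard client
// performs before reusing a cached query-node client. The name and behavior
// here are hypothetical, for illustration only.
func verifySession(ctx context.Context) error {
	select {
	case <-ctx.Done():
		// Once the caller's context is canceled, every further retry fails
		// immediately with "context canceled" instead of making progress.
		return ctx.Err()
	case <-time.After(100 * time.Millisecond):
		// The target node has crashed, so the session lookup keeps failing.
		return errors.New("session not found")
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	// Simulate the search RPC giving up and canceling its context after a while.
	go func() {
		time.Sleep(350 * time.Millisecond)
		cancel()
	}()

	// Naive retry loop: it keeps retrying the same dead node until the
	// context is canceled, rather than failing over to another replica.
	for attempt := 1; ; attempt++ {
		err := verifySession(ctx)
		if err == nil {
			fmt.Println("session verified")
			return
		}
		fmt.Printf("attempt %d: fail to get session: %v\n", attempt, err)
		if errors.Is(err, context.Canceled) {
			return
		}
	}
}
```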

Expected Behavior

A multi-replica cluster should retry the operation on another replica if the current replica is down.
So the request should be retried and succeed once the session of the crashed replica is gone.
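
For contrast, here is a minimal Go sketch of the expected failover behavior, assuming a hypothetical `Replica` type and `Search` method (not the actual Milvus shard-client API): the proxy should skip the crashed replica and let a healthy one serve the request, instead of retrying the dead node until the context is canceled.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Replica is a toy stand-in for one replica's shard leader; the type and
// Search signature are hypothetical, for illustration only.
type Replica struct {
	Name  string
	Alive bool
}

func (r Replica) Search(ctx context.Context) error {
	if !r.Alive {
		return errors.New("session not found: node is down")
	}
	return nil
}

// searchWithFailover tries each replica in turn and returns on the first
// success, which is the behavior this issue expects from the proxy.
func searchWithFailover(ctx context.Context, replicas []Replica) error {
	var lastErr error
	for _, r := range replicas {
		if err := r.Search(ctx); err != nil {
			fmt.Printf("replica %s failed: %v, trying next\n", r.Name, err)
			lastErr = err
			continue
		}
		fmt.Printf("replica %s served the request\n", r.Name)
		return nil
	}
	return fmt.Errorf("all replicas failed: %w", lastErr)
}

func main() {
	replicas := []Replica{
		{Name: "replica-1", Alive: false}, // the crashed replica
		{Name: "replica-2", Alive: true},  // a healthy replica
	}
	if err := searchWithFailover(context.Background(), replicas); err != nil {
		fmt.Println("search failed:", err)
	}
}
```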

Steps To Reproduce

Milvus Log

No response

Anything else?

No response

Labels: kind/bug, needs-triage