
Dragonfly latency spikes during full sync replication #6131

@jojoxhsieh

Describe the bug
When a DragonflyDB replica starts syncing from a primary instance holding 32 million keys, we observe significant latency spikes on the primary for the duration of the full sync: p99 response time jumps from 4ms to 50ms. The spikes occur even under moderate load (around 132K ops/sec) and hurt our application's performance and user experience.

To Reproduce
Steps to reproduce the behavior:

  1. Start a DragonflyDB primary instance with the following command:
    docker run --rm --name dfly0 -d -p 16379:6379 --ulimit memlock=-1 --cpus=64 docker.dragonflydb.io/dragonflydb/dragonfly:v1.35.1 --cache_mode=true --maxmemory=64g --point_in_time_snapshot=false --background_snapshotting=false --dbfilename=
    
  2. Populate the primary with 32 million keys using a data population script.
  3. Start a DragonflyDB replica instance with the following command:
    docker run --rm --name dfly1 -d -p 26379:6379 --ulimit memlock=-1 --cpus=64 docker.dragonflydb.io/dragonflydb/dragonfly:v1.35.1 --cache_mode=true --maxmemory=64g --point_in_time_snapshot=false --background_snapshotting=false --dbfilename=
    
  4. Use memtier_benchmark to generate a load of around 132K ops/sec on the primary with the following command:
    docker run --rm -d -v /$(pwd):/home redislabs/memtier_benchmark -s <primary_ip> -p 16379 --data-size=256  --command="MGET __key__ __key__ __key__ __key__ __key__" --command-ratio=10 --command="SET __key__ __data__" --command-ratio=1 --test-time=60 --rate-limiting=660 --json-out-file=/home/memtier_out.json
    
  5. Configure the replica to sync from the primary:
    redis-cli -h <replica_ip> -p 26379 replicaof <primary_ip> 16379
    
  6. Monitor the p99 latency on the primary using memtier_benchmark output.
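
As a sanity check on the load in step 4: memtier_benchmark applies --rate-limiting per connection, so the target aggregate rate is threads x clients per thread x rate. The sketch below assumes the memtier_benchmark defaults of 4 threads and 50 clients per thread, since the command in step 4 sets neither; the function names are illustrative only. Under those assumptions the target works out to 132K requests/sec, split 10:1 between MGET and SET.

```python
# Assumed memtier_benchmark defaults; the command in step 4 does not override them.
THREADS = 4
CLIENTS_PER_THREAD = 50
RATE_PER_CONNECTION = 660  # from --rate-limiting=660


def aggregate_rate(threads: int, clients_per_thread: int, rate_per_connection: int) -> int:
    """Target requests/sec across all connections (an upper bound, not a measurement)."""
    return threads * clients_per_thread * rate_per_connection


def split_by_ratio(total: int, ratios: dict) -> dict:
    """Split a total request rate according to the --command-ratio weights."""
    weight_sum = sum(ratios.values())
    return {name: total * weight // weight_sum for name, weight in ratios.items()}


total = aggregate_rate(THREADS, CLIENTS_PER_THREAD, RATE_PER_CONNECTION)
per_command = split_by_ratio(total, {"MGET": 10, "SET": 1})
print(total)        # 132000
print(per_command)  # {'MGET': 120000, 'SET': 12000}
```

This is the rate memtier_benchmark aims for; the achieved rate can be lower if the primary stalls, which is exactly what the latency spikes suggest.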

Expected behavior
The p99 latency on the primary should remain stable or show minimal increase during the replica's full sync process, ideally staying below 10ms.
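
To check the 10ms expectation independently of memtier_benchmark, a small probe could be run against the primary while the replica syncs. This is a minimal sketch, not the method used to produce the numbers in this report: it assumes redis-py is installed, uses PING round trips as a stand-in for real commands, and the names `percentile` and `probe` are hypothetical helpers.

```python
import math
import time


def percentile(samples, pct: int):
    """Nearest-rank percentile of a non-empty list of samples (pct is an integer 1..100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def probe(host: str, port: int, duration_s: float = 60.0, interval_s: float = 0.001):
    """Send PING in a loop and return the round-trip times in milliseconds."""
    import redis  # redis-py; imported here so percentile() works without it

    client = redis.Redis(host=host, port=port)
    samples_ms = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        start = time.perf_counter()
        client.ping()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
        time.sleep(interval_s)
    return samples_ms
```

For example, `percentile(probe("<primary_ip>", 16379), 99)` run once before and once during `replicaof` gives two p99 values to compare against the 10ms target. A single probe connection sees less queuing than 200 benchmark connections, so it will likely understate the spike.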

Environment (please complete the following information):

  • OS: Ubuntu 22.04
  • Kernel: Linux as-fscache-3043 5.15.0-157-generic #167-Ubuntu SMP Wed Sep 17 21:35:53 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
  • Containerized?: Yes, Docker 28.5.1
  • Dragonfly Version: v1.35.1

Reproducible Code Snippet
Script to populate data (populate_data.py):

import redis
import os
import base64

# ---------- Config ----------
REDIS_HOST = '10.143.160.122'
REDIS_PORT = 16379
DB = 0

TOTAL_KEYS = 32_000_000
BATCH_SIZE = 10_000          # number of SETs sent per pipeline flush
KEY_PREFIX = b"rand:"        # use bytes directly


# ---------- Helpers ----------
# Faster random generator using os.urandom + base64
def random_bytes(n: int) -> bytes:
    # base64 expands by ~4/3; trim to exactly n
    return base64.urlsafe_b64encode(os.urandom(int(n * 0.8)))[:n]


def random_key_bytes(length: int = 16) -> bytes:
    return KEY_PREFIX + random_bytes(length)


def random_value_bytes(length: int = 1024) -> bytes:
    return random_bytes(length)


# ---------- Main ----------
def main():
    # decode_responses=False to avoid encoding/decoding overhead
    r = redis.Redis(
        host=REDIS_HOST,
        port=REDIS_PORT,
        db=DB,
        decode_responses=False,
    )

    # transaction=False to avoid MULTI/EXEC
    pipe = r.pipeline(transaction=False)

    for i in range(1, TOTAL_KEYS + 1):
        k = random_key_bytes()
        v = random_value_bytes()
        pipe.set(k, v)

        if i % BATCH_SIZE == 0:
            pipe.execute()
            # print(f"Inserted {i} keys...")

    # flush remaining
    pipe.execute()
    print("Done.")


if __name__ == "__main__":
    main()
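
For a sense of how much data the full sync has to transfer, the raw payload written by the script can be estimated from its constants. This is a rough sketch that counts only key and value bytes, assumes no random key collisions, and ignores Dragonfly's per-entry overhead; `payload_bytes` is an illustrative helper, not part of the script.

```python
# Constants mirrored from populate_data.py.
KEY_PREFIX = b"rand:"
KEY_RANDOM_LEN = 16      # default length in random_key_bytes()
VALUE_LEN = 1024         # default length in random_value_bytes()
TOTAL_KEYS = 32_000_000


def payload_bytes(total_keys: int, key_len: int, value_len: int) -> int:
    """Raw key + value bytes, excluding any server-side overhead."""
    return total_keys * (key_len + value_len)


key_len = len(KEY_PREFIX) + KEY_RANDOM_LEN  # 21 bytes per key
total = payload_bytes(TOTAL_KEYS, key_len, VALUE_LEN)
print(total)                       # 33440000000
print(round(total / 2**30, 1))     # size in GiB
```

That is roughly 31 GiB of payload, comfortably under the 64g maxmemory limit, so eviction under cache_mode should not be a factor during the sync.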

Additional context
I found another issue report that seems related: #4787, which describes DragonflyDB becoming unresponsive during full sync replication. That issue appears to be fixed already, but the latency spike problem still persists in our case.

Metadata

    Labels

    bug (Something isn't working)
