Performance issue: allreduce_benchmark slower than ncclAllReduce #45

@haolinyan

Description


First of all, I'd like to express my sincere gratitude to all the contributors of this repository!
I'm able to run the allreduce_benchmark smoothly, but unfortunately, its performance is significantly inferior to that of NCCL. I'm reaching out in the hope of getting some assistance from you.

Experimental Setup

I launched 4 Docker containers on a DGX-1 machine. The Docker image was built from rdma.dockerfile. The command to start the containers is as follows:

# Each container can only use one V100 GPU
sudo docker run -dit --gpus device=$id \
            --net=host --cap-add=IPC_LOCK \
            --shm-size=32768m \
            --cap-add SYS_ADMIN \
            --cap-add SYS_RESOURCE  \
            --device=/dev/infiniband/uverbs$id \
            --name switchml-rdma$id \
            my_image_name:tag
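For reference, a minimal sketch of how the four containers can be started in a loop (assuming $id runs over 0–3 and maps one GPU to one uverbs device; adjust to the actual device numbering):

for id in 0 1 2 3; do   # $id selects both the V100 GPU and the uverbs device for this container
    sudo docker run -dit --gpus device=$id \
                --net=host --cap-add=IPC_LOCK \
                --shm-size=32768m \
                --cap-add SYS_ADMIN \
                --cap-add SYS_RESOURCE \
                --device=/dev/infiniband/uverbs$id \
                --name switchml-rdma$id \
                my_image_name:tag
done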

I compiled all the files using the command:

make RDMA=1 TIMEOUTS=0 VCL=1 DEBUG=0

I modified the configuration file as follows:

num_workers = 4
num_worker_threads = 8
max_outstanding_packets = 256
packet_numel = 64
backend = rdma
prepostprocessor = bypass 
msg_numel = 64
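If packet_numel is the number of tensor elements carried per packet, these settings keep each packet's payload very small; a quick back-of-the-envelope check (assuming 4 bytes per int32 element):

echo $((64 * 4))        # packet_numel = 64 int32 elements -> 256 bytes of tensor data per packet
echo $((256 * 64 * 4))  # if max_outstanding_packets bounds the in-flight window: at most 64 KiB outstanding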

After that, I ran the allreduce_benchmark test:

./bin/allreduce_benchmark --tensor-numel=1048576 --tensor-type=int32

The output is shown below:

Signal handler thread started. Waiting for any signals.
Submitting 5 warmup jobs.
Warmup finished.
Submitting 10 jobs.
Job(s) #0# finished. Duration: #531972764# ns Goodput: #0.0630755# Gbps.
Job(s) #1# finished. Duration: #539886204# ns Goodput: #0.0621509# Gbps.
Job(s) #2# finished. Duration: #575978341# ns Goodput: #0.0582564# Gbps.
Job(s) #3# finished. Duration: #567985377# ns Goodput: #0.0590762# Gbps.
Job(s) #4# finished. Duration: #587951282# ns Goodput: #0.0570701# Gbps.
Job(s) #5# finished. Duration: #572039624# ns Goodput: #0.0586575# Gbps.
Job(s) #6# finished. Duration: #560061712# ns Goodput: #0.059912# Gbps.
Job(s) #7# finished. Duration: #575842023# ns Goodput: #0.0582702# Gbps.
Job(s) #8# finished. Duration: #583945036# ns Goodput: #0.0574616# Gbps.
Job(s) #9# finished. Duration: #556026482# ns Goodput: #0.0603468# Gbps.
All jobs finished.


Min 531972764 ns 0.0630755 Gbps
Max 587951282 ns 0.0570701 Gbps
Median 572039624 ns 0.0586575 Gbps
Mean 565168884 ns 0.0593706 Gbps
Std dev 1.73446e+07 ns
Cleaning up.
Signal handler thread is exiting 

As you can see, the results are quite poor (I'm using four mlx5 NICs, each with a 100Gbps port).
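For what it's worth, the reported goodput is consistent with simply dividing the tensor payload by the job duration (assuming goodput = tensor bits / duration; 1048576 int32 elements = 4 MiB):

awk 'BEGIN { printf "%.7f Gbps\n", (1048576 * 4 * 8) / 531972764 }'   # bits per ns == Gbit/s; prints 0.0630755, matching job #0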

[Screenshot: port status; ports 3/0, 5/0, 7/0, and 9/0 are used in our experiments.]

NCCL Test

To keep the comparison as fair as possible, I used 4 containers with a similar configuration. I specifically set --net=none and manually moved each MLX5 NIC into its container's network namespace, so that each container could only use one NIC (a sketch of that step follows the launch command below).

# Each container can only use one V100 GPU
sudo docker run --name $container_name  \
                --rm \
                --gpus $gpu_device \
                --net=none \
                --cap-add IPC_LOCK \
                --cap-add NET_ADMIN \
                --shm-size=32768m \
                --cap-add SYS_ADMIN \
                --cap-add SYS_RESOURCE  \
                --device /dev/infiniband/rdma_cm \
                --device /dev/infiniband/issm$id \
                --device /dev/infiniband/umad$id \
                --device /dev/infiniband/uverbs$id \
                --hostname worker-$id \
                -v $share_dir:/shared \
                -dit $image /bin/bash
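For reference, this is roughly the namespace step mentioned above; the netdev name and addresses below are placeholders, not the exact values on the DGX-1:

PID=$(sudo docker inspect -f '{{.State.Pid}}' $container_name)        # network namespace of the container
sudo ip link set dev enp5s0 netns $PID                                # enp5s0: placeholder name of the mlx5 netdev
sudo nsenter -t $PID -n ip addr add 192.168.100.1$id/24 dev enp5s0    # placeholder address, one per container
sudo nsenter -t $PID -n ip link set dev enp5s0 up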

Then I conducted the NCCL allreduce experiment. Each run processed 4 MB of data. I first performed 5 warmup runs and then the timed run:

mpirun -np 4 -hostfile hostfile.txt -x NCCL_IB_GID_INDEX=3 ./nccl_allreduce_test int32
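hostfile.txt follows the usual Open MPI format, one slot per container (the addresses below are placeholders matching the namespace sketch above):

192.168.100.10 slots=1
192.168.100.11 slots=1
192.168.100.12 slots=1
192.168.100.13 slots=1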

The results are as follows:

[MPI Rank 0] Success with data type int32, Time taken: 0.002519 seconds, Throughput: 13.320537 Gbps
[MPI Rank 2] Success with data type int32, Time taken: 0.002435 seconds, Throughput: 13.780054 Gbps
[MPI Rank 3] Success with data type int32, Time taken: 0.002456 seconds, Throughput: 13.662228 Gbps
[MPI Rank 1] Success with data type int32, Time taken: 0.002563 seconds, Throughput: 13.091858 Gbps

I've ensured that all containers communicate through the switch:

[Screenshot: per-container traffic counters showing bytes received.]

In the figure, each container received a total of 40156402 bytes. In a 4-rank NCCL ring allreduce, each rank needs to receive 1 MB * 2 * 3 = 6 MB of data per allreduce, so about 36 MB over the 6 runs (5 warmup + 1 timed). This is close to the captured total, which shows that the 4 ranks are indeed communicating over the NICs and the switch rather than through intra-host paths.
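The expected ring traffic can be checked quickly (assuming 1 MB here means 2^20 bytes and the standard 2*(N-1) ring steps):

awk 'BEGIN { chunk = 4 * 2^20 / 4; per_run = chunk * 2 * 3; printf "%d bytes per run, %d bytes over 6 runs\n", per_run, per_run * 6 }'   # 6291456 and 37748736 bytes

The gap to the captured 40156402 bytes is plausibly packet headers plus NCCL bootstrap/control traffic.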

I'm really looking forward to your help in resolving the performance issue of allreduce_benchmark. Thank you!
