Description
First of all, I'd like to express my sincere gratitude to all the contributors of this repository!
I'm able to run the allreduce_benchmark smoothly, but unfortunately, its performance is significantly inferior to that of NCCL. I'm reaching out in the hope of getting some assistance from you.
Experimental Setup
I launched 4 Docker containers on a DGX-1 machine. The Docker image was built from rdma.dockerfile. The command to start the containers is as follows:
# Each container can only use one V100 GPU
sudo docker run -dit --gpus device=$id \
--net=host --cap-add=IPC_LOCK \
--shm-size=32768m \
--cap-add SYS_ADMIN \
--cap-add SYS_RESOURCE \
--device=/dev/infiniband/uverbs$id \
--name switchml-rdma$id \
my_image_name:tag
I compiled all the files using the command:
make RDMA=1 TIMEOUTS=0 VCL=1 DEBUG=0
I modified the configuration file as follows:
num_workers = 4
num_worker_threads = 8
max_outstanding_packets = 256
packet_numel = 64
backend = rdma
prepostprocessor = bypass
msg_numel = 64
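To make the numbers concrete, here is my own back-of-the-envelope reading of these settings (the variable names are mine, and the in-flight estimate is only my interpretation of max_outstanding_packets):
# My rough arithmetic for the config above, not from the SwitchML docs.
packet_numel = 64                 # elements per packet, from the config
elem_bytes = 4                    # int32
tensor_numel = 1048576            # tensor size used in the benchmark below
max_outstanding = 256             # from the config
payload_per_packet = packet_numel * elem_bytes        # 256 B of payload per packet
packets_per_allreduce = tensor_numel // packet_numel  # 16384 packets per allreduce
window_bytes = max_outstanding * payload_per_packet   # 64 KiB in flight, if I read the knob correctly
print(payload_per_packet, packets_per_allreduce, window_bytes)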
After that, I ran the allreduce_benchmark test:
./bin/allreduce_benchmark --tensor-numel=1048576 --tensor-type=int32
The output is shown below:
Signal handler thread started. Waiting for any signals.
Submitting 5 warmup jobs.
Warmup finished.
Submitting 10 jobs.
Job(s) #0# finished. Duration: #531972764# ns Goodput: #0.0630755# Gbps.
Job(s) #1# finished. Duration: #539886204# ns Goodput: #0.0621509# Gbps.
Job(s) #2# finished. Duration: #575978341# ns Goodput: #0.0582564# Gbps.
Job(s) #3# finished. Duration: #567985377# ns Goodput: #0.0590762# Gbps.
Job(s) #4# finished. Duration: #587951282# ns Goodput: #0.0570701# Gbps.
Job(s) #5# finished. Duration: #572039624# ns Goodput: #0.0586575# Gbps.
Job(s) #6# finished. Duration: #560061712# ns Goodput: #0.059912# Gbps.
Job(s) #7# finished. Duration: #575842023# ns Goodput: #0.0582702# Gbps.
Job(s) #8# finished. Duration: #583945036# ns Goodput: #0.0574616# Gbps.
Job(s) #9# finished. Duration: #556026482# ns Goodput: #0.0603468# Gbps.
All jobs finished.
Min 531972764 ns 0.0630755 Gbps
Max 587951282 ns 0.0570701 Gbps
Median 572039624 ns 0.0586575 Gbps
Mean 565168884 ns 0.0593706 Gbps
Std dev 1.73446e+07 ns
Cleaning up.
Signal handler thread is exiting
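As a quick sanity check on the reported goodput (my own arithmetic, assuming goodput = tensor bits / duration and 1 Gbps = 1e9 bit/s), the numbers are internally consistent:
# Sanity check of job #0 above (assumption: goodput = tensor bits / duration).
tensor_bytes = 1048576 * 4                     # 1048576 int32 elements = 4 MiB
duration_ns = 531972764                        # job #0 duration from the output
goodput_gbps = tensor_bytes * 8 / duration_ns  # bits per nanosecond == Gbit/s
print(goodput_gbps)                            # ~0.0631, matching the printed 0.0630755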
As you can see, the results are quite poor, even though I'm using four mlx5 NICs, each with a 100 Gbps port. Ports 3/0, 5/0, 7/0, and 9/0 are used in our experiments.
NCCL Test
To make the comparison as fair as possible, I used 4 containers with a similar configuration. I specifically set --net=none and manually moved each NIC into its container's network namespace, so that each container could only use one mlx5 NIC.
# Each container can only use one V100 GPU
sudo docker run --name $container_name \
--rm \
--gpus $gpu_device \
--net=none \
--cap-add IPC_LOCK \
--cap-add NET_ADMIN \
--shm-size=32768m \
--cap-add SYS_ADMIN \
--cap-add SYS_RESOURCE \
--device /dev/infiniband/rdma_cm \
--device /dev/infiniband/issm$id \
--device /dev/infiniband/umad$id \
--device /dev/infiniband/uverbs$id \
--hostname worker-$id \
-v $share_dir:/shared \
-dit $image /bin/bash
Then I ran the NCCL allreduce experiment. Each run processed 4 MB of data (the same 1048576 int32 elements as above). I first performed 5 warmup runs and then the timed test:
mpirun -np 4 -hostfile hostfile.txt -x NCCL_IB_GID_INDEX=3 ./nccl_allreduce_test int32
The results are as follows:
[MPI Rank 0] Success with data type int32, Time taken: 0.002519 seconds, Throughput: 13.320537 Gbps
[MPI Rank 2] Success with data type int32, Time taken: 0.002435 seconds, Throughput: 13.780054 Gbps
[MPI Rank 3] Success with data type int32, Time taken: 0.002456 seconds, Throughput: 13.662228 Gbps
[MPI Rank 1] Success with data type int32, Time taken: 0.002563 seconds, Throughput: 13.091858 Gbps
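The NCCL numbers appear to be computed the same way (this is my assumption about how nccl_allreduce_test reports throughput, but the arithmetic matches):
# Same sanity check for NCCL rank 0 (assumption: throughput = tensor bits / time).
tensor_bits = 1048576 * 4 * 8      # the 4 MiB tensor
time_s = 0.002519                  # rank 0 time from the output
print(tensor_bits / time_s / 1e9)  # ~13.32 Gbps, matching the printed 13.320537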
I've also verified that all containers communicate through the switch:
In the figure, each container received a total of 40156402 bytes. In a 4-rank NCCL ring allreduce, each rank has to receive 1 MB * 2 * (4 - 1) = 6 MB per allreduce, so 36 MB over the 6 allreduces (5 warmup + 1 timed). This is close to the measured counter, which confirms that the NCCL traffic really goes over the NICs and through the switch across the 4 ranks, rather than over an intra-host path.
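Spelling out that estimate (my own arithmetic; the 6 allreduces are the 5 warmup runs plus the timed one):
# Expected receive volume per rank for a 4-rank ring allreduce of a 4 MiB tensor.
ranks = 4
tensor_bytes = 1048576 * 4                          # 4 MiB
chunk_bytes = tensor_bytes // ranks                 # 1 MiB chunks
recv_per_allreduce = 2 * (ranks - 1) * chunk_bytes  # 6 MiB: reduce-scatter + allgather
expected_total = 6 * recv_per_allreduce             # ~36 MiB over 6 allreduces
print(expected_total, 40156402 - expected_total)    # observed counter is ~2.4 MB higher (headers, etc.)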
I'm really looking forward to your help in resolving the performance issue of allreduce_benchmark. Thank you!
