Description
First of all, I'd like to express my sincere gratitude to all the contributors of this repository!
I'm able to run the allreduce_benchmark smoothly, but unfortunately, its performance is significantly inferior to that of NCCL. I'm reaching out in the hope of getting some assistance from you.
Experimental Setup
I launched 4 Docker containers on a DGX-1 machine. The Docker image was built from rdma.dockerfile. The command to start the containers is as follows:
# Each container can only use one V100 GPU
sudo docker run -dit --gpus device=$id \
--net=host --cap-add=IPC_LOCK \
--shm-size=32768m \
--cap-add SYS_ADMIN \
--cap-add SYS_RESOURCE \
--device=/dev/infiniband/uverbs$id \
--name switchml-rdma$id \
my_image_name:tag
I compiled all the files using the command:
make RDMA=1 TIMEOUTS=0 VCL=1 DEBUG=0
I modified the configuration file as follows:
num_workers = 4
num_worker_threads = 8
max_outstanding_packets = 256
packet_numel = 64
backend = rdma
prepostprocessor = bypass
msg_numel = 64
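To make the numbers concrete, here is my own back-of-the-envelope reading of these settings (the variable names are mine, and the in-flight estimate is only my interpretation of max_outstanding_packets):
# My rough arithmetic for the config above, not from the SwitchML docs.
packet_numel = 64                 # elements per packet, from the config
elem_bytes = 4                    # int32
tensor_numel = 1048576            # tensor size used in the benchmark below
max_outstanding = 256             # from the config
payload_per_packet = packet_numel * elem_bytes        # 256 B of payload per packet
packets_per_allreduce = tensor_numel // packet_numel  # 16384 packets per allreduce
window_bytes = max_outstanding * payload_per_packet   # 64 KiB in flight, if I read the knob correctly
print(payload_per_packet, packets_per_allreduce, window_bytes)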
After that, I ran the allreduce_benchmark test:
./bin/allreduce_benchmark --tensor-numel=1048576 --tensor-type=int32
The output is shown below:
Signal handler thread started. Waiting for any signals.
Submitting 5 warmup jobs.
Warmup finished.
Submitting 10 jobs.
Job(s) #0# finished. Duration: #531972764# ns Goodput: #0.0630755# Gbps.
Job(s) #1# finished. Duration: #539886204# ns Goodput: #0.0621509# Gbps.
Job(s) #2# finished. Duration: #575978341# ns Goodput: #0.0582564# Gbps.
Job(s) #3# finished. Duration: #567985377# ns Goodput: #0.0590762# Gbps.
Job(s) #4# finished. Duration: #587951282# ns Goodput: #0.0570701# Gbps.
Job(s) #5# finished. Duration: #572039624# ns Goodput: #0.0586575# Gbps.
Job(s) #6# finished. Duration: #560061712# ns Goodput: #0.059912# Gbps.
Job(s) #7# finished. Duration: #575842023# ns Goodput: #0.0582702# Gbps.
Job(s) #8# finished. Duration: #583945036# ns Goodput: #0.0574616# Gbps.
Job(s) #9# finished. Duration: #556026482# ns Goodput: #0.0603468# Gbps.
All jobs finished.
Min 531972764 ns 0.0630755 Gbps
Max 587951282 ns 0.0570701 Gbps
Median 572039624 ns 0.0586575 Gbps
Mean 565168884 ns 0.0593706 Gbps
Std dev 1.73446e+07 ns
Cleaning up.
Signal handler thread is exiting
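As a quick sanity check on the reported goodput (my own arithmetic, assuming goodput = tensor bits / duration and 1 Gbps = 1e9 bit/s), the numbers are internally consistent:
# Sanity check of job #0 above (assumption: goodput = tensor bits / duration).
tensor_bytes = 1048576 * 4                     # 1048576 int32 elements = 4 MiB
duration_ns = 531972764                        # job #0 duration from the output
goodput_gbps = tensor_bytes * 8 / duration_ns  # bits per nanosecond == Gbit/s
print(goodput_gbps)                            # ~0.0631, matching the printed 0.0630755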
As you can see, the results are quite poor, even though I'm using four mlx5 NICs, each with a 100 Gbps port. Ports 3/0, 5/0, 7/0, and 9/0 are used in our experiments.
NCCL Test
To make the comparison as fair as possible, I used 4 containers with a similar configuration. I specifically set --net=none and manually moved each NIC into its container's network namespace, so that each container could only use one mlx5 NIC.
# Each container can only use one V100 GPU
sudo docker run --name $container_name \
--rm \
--gpus $gpu_device \
--net=none \
--cap-add IPC_LOCK \
--cap-add NET_ADMIN \
--shm-size=32768m \
--cap-add SYS_ADMIN \
--cap-add SYS_RESOURCE \
--device /dev/infiniband/rdma_cm \
--device /dev/infiniband/issm$id \
--device /dev/infiniband/umad$id \
--device /dev/infiniband/uverbs$id \
--hostname worker-$id \
-v $share_dir:/shared \
-dit $image /bin/bash
Then I ran the NCCL allreduce experiment. Each run processed 4 MB of data (the same 1048576 int32 elements as above). I first performed 5 warmup runs and then the timed test:
mpirun -np 4 -hostfile hostfile.txt -x NCCL_IB_GID_INDEX=3 ./nccl_allreduce_test int32
The results are as follows:
[MPI Rank 0] Success with data type int32, Time taken: 0.002519 seconds, Throughput: 13.320537 Gbps
[MPI Rank 2] Success with data type int32, Time taken: 0.002435 seconds, Throughput: 13.780054 Gbps
[MPI Rank 3] Success with data type int32, Time taken: 0.002456 seconds, Throughput: 13.662228 Gbps
[MPI Rank 1] Success with data type int32, Time taken: 0.002563 seconds, Throughput: 13.091858 Gbps
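The NCCL numbers appear to be computed the same way (this is my assumption about how nccl_allreduce_test reports throughput, but the arithmetic matches):
# Same sanity check for NCCL rank 0 (assumption: throughput = tensor bits / time).
tensor_bits = 1048576 * 4 * 8      # the 4 MiB tensor
time_s = 0.002519                  # rank 0 time from the output
print(tensor_bits / time_s / 1e9)  # ~13.32 Gbps, matching the printed 13.320537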
I've also verified that all containers communicate through the switch:
In the figure, each container received a total of 40156402 bytes. In a 4-rank NCCL ring allreduce, each rank has to receive 1 MB * 2 * (4 - 1) = 6 MB per allreduce, so 36 MB over the 6 allreduces (5 warmup + 1 timed). This is close to the measured counter, which confirms that the NCCL traffic really goes over the NICs and through the switch across the 4 ranks, rather than over an intra-host path.
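Spelling out that estimate (my own arithmetic; the 6 allreduces are the 5 warmup runs plus the timed one):
# Expected receive volume per rank for a 4-rank ring allreduce of a 4 MiB tensor.
ranks = 4
tensor_bytes = 1048576 * 4                          # 4 MiB
chunk_bytes = tensor_bytes // ranks                 # 1 MiB chunks
recv_per_allreduce = 2 * (ranks - 1) * chunk_bytes  # 6 MiB: reduce-scatter + allgather
expected_total = 6 * recv_per_allreduce             # ~36 MiB over 6 allreduces
print(expected_total, 40156402 - expected_total)    # observed counter is ~2.4 MB higher (headers, etc.)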
I'm really looking forward to your help in resolving the performance issue of allreduce_benchmark. Thank you!
