## Distributed Data Parallel

distributed.py contains the source code for `apex.parallel.DistributedDataParallel`, a module wrapper that enables multi-process, multi-GPU data parallel training optimized for NVIDIA's NCCL communication library.

`apex.parallel.DistributedDataParallel` achieves high performance by overlapping communication with computation in the backward pass and bucketing smaller transfers to reduce the total number of transfers required.

multiproc.py contains the source code for `apex.parallel.multiproc`, a launch utility that places one process on each of the node's available GPUs.

#### [API Documentation](https://nvidia.github.io/apex/parallel.html)

#### [Example/Walkthrough](https://github.com/NVIDIA/apex/tree/master/examples/distributed)

#### [Imagenet example with Mixed Precision](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)

#### [Simple example with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/FP16_Optimizer_simple/distributed_apex)

### Synchronized Batch Normalization

`apex.parallel.SyncBatchNorm` has an API similar to `torch.nn.BatchNorm*N*d`. It reduces statistics over the first (channel) dimension of the tensor and accepts arbitrary spatial dimensions.

#### Installation

Apex provides two sync BN implementations:

1. A Python-only implementation, which is the default when Apex is installed with `python setup.py install`. It uses PyTorch primitive operations and the distributed communication package from `torch.distributed`.
   - _The Python-only implementation requires the input tensor to be of the same data type as the layer._
2. An implementation using custom kernels built as a CUDA/C++ extension, with improved performance. We are experimenting with Welford's algorithm and Kahan summation for the reduction, hoping to get better accuracy. To use the kernel implementation, install Apex with the CUDA extension enabled: `python setup.py install --cuda_ext`.
   - _The custom kernel implementation supports fp16 input with an fp32 layer, as cuDNN does. This is required to run the ImageNet example in fp16._
   - _Currently the kernel implementation only supports GPU._

#### HowTo

1. Users can use `apex.parallel.SyncBatchNorm` by building their module with the layer explicitly:
   ```
   import torch
   import apex

   input_t = torch.randn(3, 5, 20).cuda()
   sbn = apex.parallel.SyncBatchNorm(5).cuda()
   output_t = sbn(input_t)
   ```
2. Users can also take a constructed `torch.nn.Module` and replace all of its `torch.nn.BatchNorm*N*d` modules with `apex.parallel.SyncBatchNorm` through the utility function `apex.parallel.convert_syncbn_model`:
   ```
   # model is an instance of torch.nn.Module
   import apex
   sync_bn_model = apex.parallel.convert_syncbn_model(model)
   ```
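
In a full training script, the two utilities in this README are typically combined: the model's batch norm layers are converted with `convert_syncbn_model` first, and the result is then wrapped with `apex.parallel.DistributedDataParallel`. The sketch below is a minimal, hypothetical illustration of that pattern; the `--local_rank` argument supplied by the launcher, the toy model, and the single dummy training step are placeholders and not part of this README.

```
# Minimal sketch: SyncBatchNorm conversion + DistributedDataParallel in one process.
# Assumes one process per GPU, started by a launcher that provides the rank and
# world size to torch.distributed (e.g. via environment variables for "env://").
import argparse

import torch
import apex

parser = argparse.ArgumentParser()
# --local_rank is a common launcher convention; treat it as a placeholder here.
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend="nccl", init_method="env://")

# Toy model with a BatchNorm layer to convert.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.BatchNorm2d(16),
    torch.nn.ReLU(),
).cuda()

# Replace every torch.nn.BatchNorm*N*d with apex.parallel.SyncBatchNorm ...
model = apex.parallel.convert_syncbn_model(model)
# ... then wrap the model for multi-process data parallel training.
model = apex.parallel.DistributedDataParallel(model)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One dummy training step; a real script would iterate over a data loader
# backed by a distributed sampler.
inputs = torch.randn(8, 3, 32, 32).cuda()
targets = torch.randint(0, 16, (8,)).cuda()
logits = model(inputs).mean(dim=[2, 3])
loss = torch.nn.functional.cross_entropy(logits, targets)
loss.backward()
optimizer.step()
```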
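
The sketch assumes a launcher that starts one process per GPU and makes the rank and world size available to `torch.distributed`. `apex.parallel.multiproc`, described above, and PyTorch's `torch.distributed.launch` are launchers of that kind; see the linked Example/Walkthrough for the exact command line used with this module.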