o оЯ?e┌у@s@dZddlZddlZddlmZddlmZddlmZddlm Z ddlm Z dd ДZd dДZdd ДZ ddДZddДZddДZ dvєz!_padded_split..r ) rr r┌dims┌valuerrr┌concatZzerosr┌split┌range) r┌piecesr┌ tensor_len┌pad_lenZextended_whole┌parts┌last_chunk_size┌ piece_lensrrr┌ _padded_splitJsD Ї э х$уr.c CsЮ|stdГВ|dj}t|ГdkrtdГВt|d|Г}|dkr%tdГВg}|D]#}tа|бП|аtа|dg|gббWdГn1sGwYq)|S)aГStrip the suffix padding added by _padded_split. Args: tensors: list of `tf.Tensor` of identical length 1D tensors. pad_len: number of elements to be stripped from the end of each tensor. Returns: list of `tf.Tensor` which are the stripped inputs. Raises: ValueError: tensors must be a non-empty list of 1D tensors, and each must be longer than pad_len. rrr ztensors must be 1Dzpad_len longer than tensorN) rrr ┌intrrrr┌slice)rr*r┌ prefix_len┌strippedrrrr┌_strip_paddings Аr3cs╓|j}dt|Гkr tdГВ|jdj}||Йtа|бПE||ИkrT|dks)JВ||dИ}|dks7JВЗfddДt|dГDГ|g}tа ||бWdГStа ||бWdГS1sdwYdS)a▐Like split for 1D tensors but allows case where len % pieces != 0. Args: tensor: `tf.Tensor` that must be 1D. pieces: a positive integer specifying the number of pieces into which tensor should be split. Returns: list of `tf.Tensor` of length pieces, which hold the values of the input tensor, in order. The final tensor may be shorter than the others, which will all be of equal length. Raises: ValueError: input tensor must be 1D. r rrcrrrrrrrr!╖r"z!_ragged_split..N) rr rr#r$rrr'rr&)rr(rr)r,r-rrr┌ _ragged_splitЬs ∙ $ўr4cs`t|ГЙ|ИЙИdkrggfS|Иkrtd|ИfГВtdtИ|ГГ}g}td|ГD]/}g}||}td|ГD]ЙЗЗfddД|DГ}||dЕ|d|Е} || 7}q9|а|бq,ЗfddДtd|ГDГ} ЗfddДtd|ГDГ}td|ГD]2}tdИГD]*}tdИГD]"} |||| krй| |||<||| ИdИ| ||<qкqЗqАqy| |fS) a╜"Generate an array of device index arrays, one for each subchunk. In the basic ring reduction algorithm there are size(T)/num_devices data chunks and each device process one chunk per tick, i.e. sending one chunk and receiving one chunk. The idea of subchunking is that each device processes num_subchunks smaller data regions per tick, and the ring rank permutation is different for each subchunk index so that a device is potentially sending to and receiving from num_subchunks different other devices at each tick. Where multiple independent data channels exist between devices, this strategy supplies a method of using them in parallel. Args: num_workers: number of worker tasks num_subchunks: number of subchunks into which to divide each per-GPU chunk. gpu_perm: an array of integers in [0, num_gpus-1] giving the default ring order of GPUs at each worker. Other permutations will be generated by rotating this array and splicing together per-worker instances. Raises: ValueError: the number of subchunks may not exceed the number of GPUs. Returns: pred_by_s_d: list of lists that maps (by index) from (subchunk, dev) to preceding device in the permutation for that subchunk. The device index of GPU i at worker j is i + (j * num_gpus). rank_by_s_d: list of lists that maps (by index) from (subchunk, dev) to local rank of device d in the permutation for that subchunk. rz'num_subchunks %d must be <= num_gpus %dr csg|]}ИИ|СqSrr)r┌i)┌num_gpus┌wrrr!шsz&_ring_permutations..Ncє g|]}ddДtdИГDГСqS)cSєg|]}dСqSйr rйr┌drrrr!ьr"·1_ring_permutations...rйr'йr┌sй┌devicesrrr!ьє cr8)cSr9r:rr;rrrr!юr"r=rr>r?rArrr!юrC)r r┌maxr/r'r)┌num_workers┌ num_subchunks┌gpu_permZrotation_intervalZ perms_by_sr@Z full_order┌offsetZ default_orderZ dev_order┌pred_by_s_d┌rank_by_s_dr<rr)rBr6r7r┌_ring_permutations╜sF ¤А■rKc CsЪt|Гdkr tdГВt|Г\}}ddД|DГ}t|||Г\}} t||||| |Г\} }|r1t|| Г} t|| | Г}|dkr@t||Г}t|ГdkrKt||Г}|S)avConstruct a subgraph performing a ring-style all-reduce of input_tensors. Args: input_tensors: a list of `tf.Tensor` objects, which must all have the same shape and type. num_workers: number of worker tasks spanned by input_tensors. num_subchunks: number of subchunks each device should process in one tick. gpu_perm: a list of ints giving a ring-wise rank ordering of GPUs at each worker. All workers must have the same number of GPUs with the same rank ordering. If NVLINK is available, this should be a ring order supported by NVLINK edges. red_op: a binary operator for elementwise reduction. un_op: an optional unary operator to apply to fully reduced values. Raises: ValueError: empty input_tensors or they don't all have same size. Returns: a list of `tf.Tensor` identical sum-reductions of input_tensors. щz(input_tensors must be length 2 or longercSєg|]}|jСqSrrйrrrrrr!єz)build_ring_all_reduce..rr ) r rrrK┌_build_ring_gather┌_apply_unary_to_chunks┌_build_ring_scatterr3r) ┌ input_tensorsrErFrG┌red_op┌un_oprrBrIrJ┌ chunks_by_devr*┌output_tensorsrrr┌build_ring_all_reduce·s* ■ rXc Cs▄t|Г}|dkr gS|dkr|S|dj}dt|ГkrtdГВ||}|d} g} d}td|ГD]'}tа||бПt|||Г\} }| а| бWdГn1sRwYq0td| ГD]М}ddДtd|ГDГ}td|ГD]L}tа||бП;td|ГD]-}|||}||d||}|||}|||}|| ||| ||Г||<q}WdГn1s╡wYqntd|ГD](}td|ГD] }|||}||d||}|||}||| ||<q╟q└q]| |fS)aХConstruct a subgraph for the first (reduction) pass of ring all-reduce. Args: input_tensors: a list of `tf.Tensor` 1D input tensors of same shape and type. devices: array of device name strings num_subchunks: number of subchunks each device should process in one tick. pred_by_s_d: as produced by _ring_permutations rank_by_s_d: as produced by _ring_permutations red_op: a binary operator for elementwise reduction Raises: ValueError: tensors must all be one dimensional. Returns: list of list of `tf.Tensor` of (partially) reduced values where exactly num_subchunks chunks at each device are fully reduced. rr zinput tensors must be 1DNcSr9йNrrrrrr!Mr"z&_build_ring_gather..rL)r rrr'rrr.r)rSrBrFrIrJrT┌num_devicesr┌ num_chunks┌ num_ticksrVZ split_pad_lenr<Zsplits┌tickZnew_partial_reductionsr@┌rank┌ seg_index┌pred_dev┌chunk_indexrrrrP%sV ■А ■√ А № rPc sXg}|D]%}tа|dбП|аЗfddД|DГбWdГn1s$wYq|S)a&Apply a unary op to each tensor in chunks_by_dev, on same device. Args: f: a unary function over `tf.Tensor`. chunks_by_dev: list of lists of `tf.Tensor`. Returns: new list of lists of `tf.Tensor` with the same structure as chunks_by_dev containing the derived tensors. rcєg|]}И|ГСqSrrrNй┌frrr!qєz*_apply_unary_to_chunks..N)rrr)rdrV┌output┌xrrcrrQcs АrQc Csоt|Г}t|dГ}d||krtdГВt||Г}|d}td|ГD]К}ddДtd|ГDГ}td|ГD]J} tа|| dбП7td|ГD])} || | }||d||}|| | } ||| }tа|| |б||<qEWdГn1sywYq4td|ГD](} td|ГD] } || | }||d||}||| }|||| |<qЛqДq#g}|D]"}tа|dбП|аtа |dббWdГn1s╧wYq▓|S)a┌Construct subgraph for second (scatter) pass of ring all-reduce. Args: pred_by_s_d: as produced by _ring_permutations rank_by_s_d: as produced by _ring_permutations chunks_by_dev: list of list of `tf.Tensor` indexed by ints (device, chunk) Raises: ValueError: chunks_by_dev is not well-formed Returns: list of `tf.Tensor` which are the fully reduced tensors, one at each device corresponding to the outer dimension of chunks_by_dev. rzAExpect number of chunks per device to be divisible by num_devicesr cSr9rYrrrrrr!Оr"z'_build_ring_scatter..N) r rr/r'rrr┌identityrr%)rIrJrVrZr[rFr\r]Z passed_valuesr<r@r^r_r`rarfrgrrrrRusL √ А№ АrRcs`ddД|DГ}t|Г\}}t|||Г}ИrЗfddД|DГ}t||Г}t|Гdkr.t||Г}|S)aConstruct a subgraph for recursive halving-doubling all-reduce. The recursive halving-doubling algorithm is described in (Thakur et al., 2015). The concept is to arrange the participating n devices in a linear sequence where devices exchange data pairwise with one other device in each round. During the gather phase there are lg(n) rounds where devices exchange increasingly smaller sub-tensors with another device at increasingly greater distances, until at the top each device has 1/n of the fully reduced values. During the scatter phase each device exchanges its fully reduced sub-tensor (which doubles in length at each round) with one other device at increasingly smaller distances until each device has all of the fully reduced values. Note: this preliminary version requires that len(input_tensors) be a power of 2. TODO(tucker): relax this restriction. Also, the number of elements in each tensor must be divisible by 2^h where h is the number of hops in each phase. This will also be relaxed in the future with edge-case specific logic. Args: input_tensors: list of `tf.Tensor` to be elementwise reduced. red_op: a binary elementwise reduction Op. un_op: an optional unary elementwise Op to apply to reduced values. Returns: list of `tf.Tensor` which are the fully reduced tensors, one at each device of input_tensors. Raises: ValueError: num_devices not a power of 2, or tensor len not divisible by 2 the proper number of times. References: Optimization of Collective Communication Operations in MPICH: [Thakur et al., 2005] (https://journals.sagepub.com/doi/abs/10.1177/1094342005051521) ([pdf](http://wwwi10.lrr.in.tum.de/~gerndt/home/Teaching/HPCSeminar/mpich_multi_coll.pdf)) cSrMrrrNrrrr!╤rOz1build_recursive_hd_all_reduce..crbrrrNйrUrrr!╒rer )r┌_build_recursive_hd_gather┌_build_recursive_hd_scatterr r)rSrTrUrBr┌reduced_shardsrWrrir┌build_recursive_hd_all_reduceжs+ rmc CsDt|Г}ttа|dбГ}|d|krtdГВ|}td|ГD]В}d|}|d}ddД|DГ} td|ГD]i} | ||dkr>q3|| }|| |}tа|| dб} tа|| |dб}tа |бП|| d|dГ| | <WdГn1suwYtа |бП|| d|dГ| | |<WdГn1sЧwYq3| }q|S)aConstruct the gather phase of recursive halving-doubling all-reduce. Args: input_tensors: list of `tf.Tensor` to be elementwise reduced. devices: a list of strings naming the devices hosting input_tensors, which will also be used to host the (partial) reduction values. red_op: a binary elementwise reduction Op. Returns: list of `tf.Tensor` which are the fully reduced tensor shards. Raises: ValueError: num_devices not a power of 2, or tensor len not divisible by 2 the proper number of times. rL· num_devices must be a power of 2rcSєg|]}gСqSrrrrrrr!Їr"z._build_recursive_hd_gather..Nr ) r r/┌math┌logrr'rr&rr)rSrBrTrZ┌num_hops┌chunks┌h┌span┌ group_size┌ new_chunksr<┌left_dev┌ right_dev┌ left_split┌right_splitrrrrj▄s2 Аrjc Cs4t|Г}ttа|dбГ}|d|ksJdГВ|}ttd|ГГD]x}d|}|d}ddД|DГ}td|ГD]_} | ||dkr@q5| } | |}|| }||} tа|бПtа || ||gdб|| <WdГn1slwYtа| бПtа || ||gdб||<WdГn1sПwYq5|}q|S)aRConstruct the scatter phase of recursive halving-doubling all-reduce. Args: input_tensors: list of `tf.Tensor` that are fully-reduced shards. devices: a list of strings naming the devices on which the reconstituted full tensors should be placed. Returns: list of `tf.Tensor` which are the fully reduced tensors. rLrnrcSrorrrrrrr!r"z/_build_recursive_hd_scatter..N) r r/rprq┌reversedr'rrrr%)rSrBrZrrrsrtrurvrwr<Zleft_idxZ right_idxrxryrrrrks@ АrkcCsLt|Г\}}ddД|DГ}t||||Г}t||Г}t|Гdkr$t||Г}|S)aConstruct a subgraph for shuffle all-reduce. Shuffle reduce is essentially the algorithm implemented when using parameter servers. Suppose tensor length is n, there are d devices and g gather shards. Each device sends a n/g length sub-tensor to each gather shard. The gather shards perform a reduction across d fragments, then broadcast the result back to each device. The devices then join the g fully reduced fragments they receive from the shards. The gather shards could perform d-1 pairwise reductions, or one d-way reduction. The first is better where reduction Op time is low compared to transmission time, the second better in the other case. Args: input_tensors: list of `tf.Tensor` values to be reduced. gather_devices: list of names of devices on which reduction shards should be placed. red_op: an n-array elementwise reduction Op un_op: optional elementwise unary Op to be applied to fully-reduced values. Returns: list of `tf.Tensor` which are the fully reduced tensors. cSrMrrrNrrrr!CrOz,build_shuffle_all_reduce..r )r┌_build_shuffle_gather┌_build_shuffle_scatterr r)rS┌gather_devicesrTrUr┌dst_devicesrlrWrrr┌build_shuffle_all_reduce*s rБc s·t|Г}t|Г}|dj}t|ГdkrtdГВg}td|ГD]#Йtа|ИбП|аt|И|ГбWdГn1s.) r rrr'rrrr4r)rSrrTrUZnum_source_devicesZnum_gather_devicesrZshards_by_sourcerl┌valuesZ red_shardrrВrr}Ls0 А√Аr}c Cs`t|Г}g}td|ГD]"}tа||бП|аtа|dббWdГn1s(wYq|S)aBuild the scatter phase of shuffle all-reduce. Args: reduced_shards: list of `tf.Tensor` fully reduced shards dst_devices: list of names of devices at which the fully-reduced value should be reconstituted. Returns: list of `tf.Tensor` scattered tensors. rN)r r'rrrrr%)rlrАrZZout_tensorsr<rrrr~qs Аr~cCs┌t|Г}|t|ГkrtdГВtаб}tаб}t|ГD]F}tjа||б}t|dГr.|j dur6Jd||ГВ|j p:d|jp>d|j f}||vrNg||<g||<||а||б||а||бqt |абГt |абГfS)a]Partition devices and values by common task. Args: devices: list of device name strings values: list of `tf.Tensor` of same length as devices. Returns: (per_task_devices, per_task_values) where both values are lists of lists with isomorphic structure: the outer list is indexed by task, and the inner list has length of the number of values belonging to that task. per_task_devices contains the specific devices to which the values are local, and per_task_values contains the corresponding values. Raises: ValueError: devices must be same length as values. z#len(devices) must equal len(values)┌taskNFzfailed to parse device %s┌ localhostr)r r┌collections┌OrderedDictr'┌ device_libZ DeviceSpecZfrom_string┌hasattrrДZjobZreplicar┌listrГ)rBrГrZZper_task_devicesZper_task_valuesr<Zd_spec┌indexrrr┌_split_by_taskДs rМc Csr|tjkrtа|б}ntd|ГВ|r7g}|D]}tа|бП|а||ГбWdГn1s/wYq|}|S)aдBuild a subgraph that does one full all-reduce, using NCCL. Args: input_tensors: list of `tf.Tensor` of same-shape and type values to be reduced. red_op: binary elementwise reduction operator. Must be one of {tf.add} un_op: optional unary elementwise Op to apply to fully-reduce values. Returns: list of `tf.Tensor` of reduced values. Raises: ValueError: red_op not supported. z)red_op not supported by NCCL all-reduce: N)r┌addrZall_sumrrrr)rSrTrUrWZ un_op_wrappedrrrr┌build_nccl_all_reduceйs АrОc Cs╨t|Г\}}ddД|DГ}t||Г\}}t|Г}ddДtd|ГDГ}|ddЕ} |ddЕ} td|ГD]G}t|||Г}tа|бП1tа|djбПtа |dб||<WdГn1s^wY||d| |<WdГn1suwYq3||Г} td|ГD]O}g}tа||dбПt аtа | |бб}WdГn1sжwY||D]}tа|бП|аtа |ббWdГn1s╔wYqп|| |<qДddД| DГ}t|Гdkrцt ||Г}|S)aИConstruct a subgraph for NCCL hybrid all-reduce. Args: input_tensors: list of `tf.Tensor` of same-shape and type values to be reduced. red_op: binary elementwise reduction operator. upper_level_f: function for reducing one value per worker, across workers. Returns: list of `tf.Tensor` of reduced values. Raises: ValueError: inputs not well-formed. cSrMrrrNrrrr!╫rOz&_build_nccl_hybrid..cSr9rYr)rr7rrrr!┌r"rNcSsg|] }|D]}|СqqSrr)rZsublist┌vrrrr!єsr )rrМr r'rОrZcontrol_dependenciesrrrhr┌ broadcastrr)rSrT┌ upper_level_frrB┌per_worker_devices┌per_worker_valuesrE┌ up_valuesZ up_devicesZdown_valuesr7Z worker_values┌level_2_outputZdst_tensorsZ broadcast_srcr<rWrrr┌_build_nccl_hybrid╞s@ ¤А А rЦc Csft|Гdkr ||ГS|s|Sg}|D]}tа|бП|а||ГбWdГn1s+wYq|S)z9If len(input_tensors) > 1, apply red_f, else apply un_op.r N)r rrr)rSZred_frUrWrrrr┌_reduce_non_singleton∙s АrЧcs*ЗЗЗfddДЙЗЗfddД}t|И|ГS)z=Construct hybrid of NCCL within workers, Ring across workers.cєt|t|ГИdgИИГSйNrйrXr )┌yйrT┌subdivrUrr┌ upper_builder sz+build_nccl_then_ring..upper_buildercєt|ИИГSrYйrЧйrgйrUrЮrrrСєz+build_nccl_then_ring..upper_level_fйrЦ)rSrЭrTrUrСrйrTrЭrUrЮr┌build_nccl_then_ringsrжcsЗЗfddД}t|И|ГS)zEConstruct hybrid of NCCL within workers, Recursive-HD across workers.cst|ИИГSrY)rmrбйrTrUrr┌sz.build_nccl_then_recursive_hd..rд)rSrTrUrСrrзr┌build_nccl_then_recursive_hdsrйcsЗЗЗfddД}t|||ГS)z@Construct hybrid of NCCL within workers, Shuffle across workers.cst|ИИИГSrYйrБrбйr┌shuffle_red_oprUrrrСsz.build_nccl_then_shuffle..upper_level_frд)rSrZnccl_red_oprмrUrСrrлr┌build_nccl_then_shufflesrнcCs╩t|Г\}}ddД|DГ}t||Г\}}t|Г}g} t|Г|kr$tdГВtd|ГD]} t|| || g|Г}| а|dбq)|| Г}g} td|ГD]} | t|| g|| Г7} qIt|Гdkrct| |Г} | S)a╘Construct a subgraph for Shuffle hybrid all-reduce. Args: input_tensors: list of `tf.Tensor` of same-shape and type values to be reduced. gather_devices: list of device names on which to host gather shards. red_op: binary elementwise reduction operator. upper_level_f: function for reducing one value per worker, across workers. Returns: list of `tf.Tensor` of reduced values. Raises: ValueError: inputs not well-formed. cSrMrrrNrrrr!2rOz)_build_shuffle_hybrid..zGFor shuffle hybrid, gather_devices must contain one device per worker. rr ) rrМr rr'r}rr~r)rSrrTrСrrBrТrУrErФr7rlrХrWrrr┌_build_shuffle_hybrids* rоcs,ЗЗЗfddДЙЗЗfddД}t||||ГS)z@Construct hybrid of Shuffle within workers, Ring across workers.crШrЩrЪйrrЬrrrЮLs z.build_shuffle_then_ring..upper_buildercrЯrYrаrпrвrrrСOrгz.build_shuffle_then_ring..upper_level_fйrо)rSrrЭZred_n_oprTrUrСrrеr┌build_shuffle_then_ringIє r▒cs,ЗЗЗfddДЙЗЗfddД}t||И|ГS)zCConstruct hybrid of Shuffle within workers, Shuffle across workers.cst|ИИИГSrYrкrп)rT┌second_gather_devicesrUrrrЮXs z1build_shuffle_then_shuffle..upper_buildercrЯrYrаrпrвrrrС[rгz1build_shuffle_then_shuffle..upper_level_fr░)rSZfirst_gather_devicesr│rTrUrСr)rTr│rUrЮr┌build_shuffle_then_shuffleUr▓r┤rY)%┌__doc__rЖrpZtensorflow.python.frameworkrrИrZtensorflow.python.opsrrrrrr.r3r4rKrXrPrQrRrmrjrkrБr}r~rМrОrЦrЧrжrйrнrоr▒r┤rrrr┌sL5!> +> 16) % "% %3 +