In this section, we first describe the architecture of PDLB, presenting several challenges that need to be addressed. Then, we present the details how to estimate path asymmetry at the end-host. Moreover, we theoretically analyze how to obtain the granularity of the flowcell according to the network congestion state. Finally, we evaluate the accuracy of the model.
Path asymmetry estimation
Load balancing schemes that require path congestion information, naturally, are much more complex [
22]. Modern data center networks are usually organized in multi-rooted tree topologies, in which the load balancing schemes split traffic across multiple paths between the source leaf switch and the destination one. To obtain the path congestion state, PDLB should measure the end-to-end delay of the transmission path.
The path asymmetry can be estimated in advance at the sender based on the measurement of RTT [
23]. Since PDLB mainly adjusts the granularity of flowcell adaptively according to the difference of the path, in order to realize the purpose of transferring the traffic from the congested path to the uncongested path. But the sender is unable to directly obtain the exact RTT for each path. Here, we utilize the TCP congestion control mechanism [
15]. Specifically, when the sender receives the ACK packets, based on the corresponding RTT of each ACK, the bad path probability is calculated as the ratio of the number of ACK packets with large RTT to the total number of received ACK packets. For example, assuming that 10 flowcells are sent at the sender, the number of flowcells should generally be consistent with the number of paths. Since the switch uses polling scattering to route the flowcells, the sender calculates the probability of the bad path according to the path delay. If the transmission times of two flowcells are obviously longer than that of other 8 flowcells, it is possible that these two flowcells are taking the bad path. Then, the probability of a bad path
\(P_{b}\)=0.2. To limit the computing and memory overhead, the sender measures the path delay every 100
\(\mu\)s.
PDLB brings about limited overhead, since it only measures the one-way-delay by using source and destination hosts. Moreover, to reduce overhead and enhance scalability, PDLB periodically uses a few data and ACK packets to carry the delay information in the option field of TCP header. Thus, the path delay is collected with very small traffic overhead and deployment overhead.
Flowcell size adjustment
The flowcell size affects both the TCP reordering probability and network utilization under different asymmetric degrees. We give the analysis on how to get the optimal value of flowcell size as follows.
Let
\(F_{size}\) and
gran denote the size of a TCP flow and flowcell granularity, respectively. Because the smallest granularity is a packet, and the largest granularity is 64KB (44 packets) [
8,
18], the value of
gran ranges from 1 to 44. Then the flow is cut into
\({\frac{F_{size}}{gran}}\) flowcells according to
gran granularity.
When the flowcell are transferred on multiple paths, a flowcell is out-of-order only when at least one flowcell sent later arrives at the destination earlier. We assume that the n flowcells may select m parallel paths, which consist of \(N_{b}\) congested paths and \(N_{g}\) uncongested paths with the propagation delay \(D_{b}\) and \(D_{g}\), respectively. Assuming that the ratio of the number of bad paths to the number of good paths is R, then R is equal to \(\frac{N_{b}}{N_{g}}\). Let X denote the ratio of bad paths to good paths, then X is equal to \(\frac{D_{b}}{D_{g}}\).
Here, we classify the path types according to the path delay with the following considerations. Firstly, since RTT asymmetry is caused by dynamic traffic and hardware failures such as frame checksum errors and high CPU utilization, the delay suddenly jumps up when an incident occurs [
24]. Since the uncongested paths between host pair have the same numbers of hops, they have similar RTT [
25]. Secondly, in the model analysis, references [
18,
23,
26] divide the paths into congested paths and non-congested paths. For the convenience of modeling, we also divide the paths into two types. Thirdly, the transmission of the flowcell should be considered, because the flowcell is composed of multiple packets which are transmitted on each path. Since the size of the flowcell is not large, the variance of experienced delay is also not large. Moreover, to track the real-time delay information, the sender will update the path delay once receiving the ACK packets.
Therefore, for simplicity, we classify the paths into congested and uncongested ones according to their delay. The congested paths are defined as the ones with a latency larger than 2x the average RTT of all paths. If \(P_{g}\) and \(P_{b}\) are the probabilities that the flowcell selects the uncongested and congested paths, respectively. Specifically, \(P_{g}\) and \(P_{b}\) may be various under different load balancing schemes. In our design, in order to avoid synchronization effect, each flowcell is randomly assigned to one of the available paths. Thus, the probability \(P_{b}\) that a congested path is selected is calculated as \(\frac{N_{b}}{m}\).
Then, we get the probability
\(P_{g}\) that an uncongested path is selected as
$$\begin{aligned} P_{g}=1-\frac{N_{b}}{m}. \end{aligned}$$
(1)
At the receiving host, we can use the monotonically increasing sequence number method to determine whether a out of order occurs. That is, the sequence number of the received data packet is monotonically increasing, then the data packet arrives in order, otherwise, a out of order occurs.
The out-of-order event occurs when one packet train is transmitted on congested path and at least one flowcell sent later is transmitted on uncongested path. Therefore, the reordering probability
P of
n flowcells is calculated as
$$\begin{aligned} P&=1-(P_{b}^{n}+P_{b}^{n-1}\times P_{g}+P_{b}^{n-2}\times P_{g}^2+...\nonumber \\&\quad +P_{b}^{n-k}\times P_{g}^{k}+P_{b}\times P_{g}^{n-1}+P_{g}^{n})\nonumber \\&= 1-\sum _{k=0}^{n}P_{b}^{n-k} \times P_{g}^{k}. \end{aligned}$$
(2)
Substitute
\(P_{b}\) and
\(P_{g}\) into Eq. (
2), we get the reordering probability
P as
$$\begin{aligned} P=1- \sum _{k=0}^{n}(1-\frac{N_{b}}{m})^{k} \times (\frac{N_{b}}{m})^{n-k}. \end{aligned}$$
(3)
Assume that the largest window in each round of transmission on a path is
MaxW, and the initial value of the network congestion window is
\(W_{0}\). When detecting the out-of-order packet, the TCP sender reduces its congestion window by half. Thus, in each round of data transmission, if
\(n_{i}\) represents the number of flowcells in
i-th round, the
i-th round window
\(W_{i}\) and the out-of-order ratio of
i-th round
\(P_{i}\) are
$$\begin{aligned} \left\{ \begin{array}{l} n_{1}=f(W_{0})=min(m,\frac{W_{0}}{gran}) \\ P_{1}=g(n_{1})=1-\sum _{k=0}^{n_{1}}P_{b}^{n_{1}-k} \times P_{g}^{k} \\ W_{1}=\delta (P_{1})\\ =min(n_1 \times MaxW,(W_{0}+1)\times (1-P_{1})+\frac{W_{0}}{2}\times P_{1}) \end{array} \right. \end{aligned}$$
\(\cdot \cdot \cdot\)$$\begin{aligned} \left\{ \begin{array}{l} n_{i}=f(W_{i})=min(m,\frac{W_{i-1}}{gran})\\ P_{i}=g(n_{i})=1-\sum _{i=0}^{n_{i}}P_{b}^{n_{i}-i} \times P_{g}^{i}\\ W_{i}=\delta (P_{i})\\ =min(n_i\times MaxW,(W_{i-1}+1) \times (1-P_{i})+\frac{W_{i-1}}{2}\times P_{i}) \\ \end{array}\right. \end{aligned}$$
In the slow start phase of the protocol, the congestion window increases exponentially. After the sending rate reaching the link capacity, the slow start phase is switched to the congestion avoidance phase. Then we assume that at the switch point (i.e.,
i=0), the total maximum window is
\(m \times MaxW\) for
m paths. Therefore, the congestion window for each round
\(W_{i}\) is given by:
$$\begin{aligned} W_{i}=\left\{ \begin{array}{l} m \times MaxW \,\,\,\,\,\,\,\,\, i=0;\\ min(n_i\times MaxW,(W_{i-1}+1)\times (1-P_{i})\\ +\frac{W_{i-1}}{2}\times P_{i}) \,\,\,\,\,\,\,\,i \&{}gt;0.\\ \end{array}\right. \end{aligned}$$
(4)
Let
r and
\(W_{s}\) represent the number of rounds required to transmit
n flowcells and the sum of congested windows respectively, then
\(W_{s}\) can be expressed as
$$\begin{aligned} W_{s}=\sum _{i=0}^{r}W_{i}. \end{aligned}$$
(5)
When
\(W_{s}\) is firstly greater than or equal to
\(F_{size}\), the subscript of
\(W_{i}\) corresponds to the number of transmission rounds
r, then we get the average congestion window
\(\overline{W}\) as
$$\begin{aligned} \overline{W}=\frac{W_{s}}{r}. \end{aligned}$$
(6)
Though the small size of flowcell leads to large packet reordering probability, the small flowcell could utilize more paths, increasing the total utilized bandwidth. Given the link bandwidth C for each path, since each flowcell randomly picks its transmission path, the total bandwidth for n flowcell is \(n \times C\).
Typically, the end-to-end latency mainly consists of the queueing and propagation delay. Given the average congestion window
\(\overline{W}\), we obtain the average end-to-end round-trip time
\(\overline{RTT}\) as
$$\begin{aligned} \overline{RTT}=(1-(\frac{N_{b}}{m})^{\overline{W}})\times D_{g} + (\frac{N_{b}}{m})^{\overline{W}} \times {D_{b}}+\frac{\overline{W}}{n \times C}. \end{aligned}$$
(7)
Substitute
\(D_{b}\) and
\(\overline{W}\) into Eq. (
7), we can get the
\(\overline{RTT}\) as
$$\begin{aligned} \overline{RTT}=((X-1)\times (\frac{N_{b}}{m})^{\overline{W}}+1)\times D_{g} + \frac{W_s}{r\times n \times C}. \end{aligned}$$
(8)
After introducing the end-to-end delay of each round of transmission, we only need to calculate the number of rounds of transmission, and use the product of both the end-to-end delay of each round and the number of rounds to deduce the flow completion time. So
FCT can be expressed as
$$\begin{aligned} FCT&=\frac{F_{size}}{\overline{W}} \times \overline{RTT}\nonumber \\&=\frac{F_{size}}{\overline{W}} \times ((X-1)\times (\frac{N_{b}}{m})^{\overline{W}}+1)\times D_{g} + \frac{F_{size}}{n \times C}\nonumber \\&=\frac{F_{size}}{\overline{W}} \times ((X-1)\times (\frac{N_{b}}{m})^{\overline{W}}+1)\times D_{g}+ \frac{gran}{C} \end{aligned}$$
(9)
wherein
\(F_{size}\),
C,
\(N_{b}\),
m,
\(D_{g}\),
X are constant coefficients. There is a quadratic function relationship between the variable
\(\overline{W}\) and the independent variable
gran. Finally, according to Eq.
9 we can get the optimal granularity
gran to obtain the minimum value of average flow completion time FCT as
$$\begin{aligned} gran=&\;\;\arg\;\min\;\;\;\;\left\|\mathrm{FCT}\left({\mathrm{gran}}_{\mathrm i}\right)\right\|.\\\,&gran_{i\in}\lbrack1,44\rbrack \end{aligned}$$
(10)