Rewiring 2 Links Is Enough: Accelerating Failure Recovery in Production Data Center Networks

As a repository that holds computing, storage, network and other facilities, the Software Defined Data Center (SDDC) provides computing and storage resources for users. For an SDDC, it is important to provide continuous service. Hence, to achieve high reliability in Software Defined Data Center Networks (SDDCNs), a network failure recovery method for software defined data center networks (REVERT) is proposed. In REVERT, the network failures that occur in SDDCNs are classified into three types: switch failures, failures of links among switches, and failures of links between switches and servers. In particular, besides recovering switch failures and failures of links among switches, REVERT can also recover failures of links between switches and servers. To achieve that, REVERT includes a failure preprocessing method that classifies the network failures, a data structure for storing and finding the affected flows, a server cluster agent for communicating with the server clustering algorithm, and a routing path calculation method. REVERT has been implemented and evaluated on the RYU controller and Mininet using three routing algorithms. Compared with the link usage before the failures are recovered, when there are more than 200 flows in the network, the mean link usage increases only slightly, by about 1.83 percent. More importantly, the evaluation results also demonstrate that, besides recovering switch failures and intra-topology link failures, REVERT can successfully recover failures of links between servers and edge switches.
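The abstract does not describe how the failure preprocessing step or the affected-flow data structure are built, so the following is only a minimal sketch of the two ideas under stated assumptions. The names `FailureType`, `classify_failure` and `FlowIndex` are hypothetical, a failed switch is assumed to be reported by name and a failed link as a pair of node names, and each flow is assumed to carry its full node path.

```python
from collections import defaultdict
from enum import Enum


class FailureType(Enum):
    # The three failure classes named in the abstract.
    SWITCH = "switch failure"
    SWITCH_LINK = "switch-to-switch link failure"
    SERVER_LINK = "switch-to-server link failure"


def classify_failure(element, switches):
    """Classify a failed element (assumed encoding, not from the paper).

    element: a switch name (str) or a link given as a (node, node) tuple.
    switches: set of all switch names; any other node is treated as a server.
    """
    if isinstance(element, str):
        if element not in switches:
            raise ValueError(f"unknown switch: {element}")
        return FailureType.SWITCH
    a, b = element
    if a in switches and b in switches:
        return FailureType.SWITCH_LINK
    if a in switches or b in switches:
        return FailureType.SERVER_LINK
    raise ValueError(f"link {element} does not touch a switch")


class FlowIndex:
    """Inverted index from switches and links to the flows that cross them."""

    def __init__(self):
        self._by_switch = defaultdict(set)
        self._by_link = defaultdict(set)

    def add_flow(self, flow_id, path, switches):
        # Record every switch on the path.
        for node in path:
            if node in switches:
                self._by_switch[node].add(flow_id)
        # Record every hop as an undirected link.
        for a, b in zip(path, path[1:]):
            self._by_link[frozenset((a, b))].add(flow_id)

    def affected(self, element):
        # Return the flows whose path crosses the failed switch or link.
        if isinstance(element, str):
            return set(self._by_switch.get(element, ()))
        return set(self._by_link.get(frozenset(element), ()))
```

With such an index, the flows that need rerouting after a failure can be found with one lookup instead of a scan over every installed flow, which is the kind of role the abstract assigns to its data structure.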

An adaptive failure recovery mechanism based on asymmetric routing for data center networks

The Journal of Supercomputing ◽ 10.1007/s11227-020-03337-4 ◽ 2020

Author(s): Yong Liu ◽ Huaxi Gu ◽ Kun Wang ◽ Xiaoshan Yu ◽ Yunhao Wang

Keyword(s): Data Center ◽ Failure Recovery ◽ Data Center Networks ◽ Recovery Mechanism

Analysis on Buffer Occupancy of Quantized Congestion Notification in Data Center Networks

IEICE Transactions on Communications ◽ 10.1587/transcom.2016ebp3052 ◽ 2016 ◽ Vol E99.B (11) ◽ pp. 2361-2372 ◽ Cited By ~ 1

Author(s): Chang RUAN ◽ Jianxin WANG ◽ Jiawei HUANG ◽ Wanchun JIANG

Keyword(s): Data Center ◽ Data Center Networks ◽ Buffer Occupancy

HTPC: heterogeneous traffic-aware partition coding for random packet spraying in data center networks

Journal of Cloud Computing Advances Systems and Applications ◽ 10.1186/s13677-021-00248-4 ◽ 2021 ◽ Vol 10 (1)

Author(s): Jiawei Huang ◽ Shiqi Wang ◽ Shuping Li ◽ Shaojun Zou ◽ Jinbin Hu ◽ ...

Keyword(s): Data Center ◽ Large Scale ◽ Network Performance ◽ Rooted Tree ◽ Heterogeneous Traffic ◽ Data Center Networks ◽ Packet Reordering ◽ Traffic Characteristics ◽ Network Utilization ◽ The Impact

Abstract: Modern data center networks typically adopt multi-rooted tree topologies such as leaf-spine and fat-tree to provide high bisection bandwidth. Load balancing is critical to achieve low latency and high throughput. Although per-packet schemes such as Random Packet Spraying (RPS) can achieve high network utilization and near-optimal tail latency in symmetric topologies, they are prone to cause significant packet reordering and degrade network performance. Moreover, some coding-based schemes have been proposed to alleviate the problem of packet reordering and loss. Unfortunately, these schemes ignore the traffic characteristics of data center networks and cannot achieve good network performance. In this paper, we propose a Heterogeneous Traffic-aware Partition Coding scheme named HTPC to eliminate the impact of packet reordering and improve the performance of short and long flows. HTPC smoothly adjusts the number of redundant packets based on the multi-path congestion information and the traffic characteristics so that the tailing probability of short flows and the timeout probability of long flows can be reduced. Through a series of large-scale NS2 simulations, we demonstrate that HTPC reduces average flow completion time by up to 60% compared with the state-of-the-art mechanisms.
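The reordering problem that motivates HTPC can be illustrated with a toy model. The sketch below is not HTPC and does not reproduce the NS2 setup; it only assumes that each packet is sent at a fixed gap on a uniformly random path with a fixed one-way delay, and counts how many packets arrive behind a later-numbered one. The function names `spray` and `count_reordered` and all parameters are hypothetical.

```python
import random


def spray(num_packets, path_delays, send_gap=1.0, seed=0):
    """Send packets at fixed gaps, each on a uniformly random path.

    Returns the sequence numbers in arrival order. Ties in arrival time
    are broken by sequence number.
    """
    rng = random.Random(seed)
    arrivals = []
    for seq in range(num_packets):
        delay = rng.choice(path_delays)  # per-packet path choice
        arrivals.append((seq * send_gap + delay, seq))
    arrivals.sort()
    return [seq for _, seq in arrivals]


def count_reordered(order):
    """Count packets that arrive after a packet with a higher sequence number."""
    highest = -1
    late = 0
    for seq in order:
        if seq < highest:
            late += 1
        else:
            highest = seq
    return late


if __name__ == "__main__":
    equal = spray(1000, [5.0, 5.0])      # paths with equal delay
    unequal = spray(1000, [1.0, 10.0])   # one path much slower (e.g. congested)
    print("reordered, equal delays:  ", count_reordered(equal))
    print("reordered, unequal delays:", count_reordered(unequal))
```

In this model, equal path delays preserve order, while unequal delays produce out-of-order arrivals whenever a packet on the fast path overtakes an earlier packet on the slow path.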
