As LLMs scale, their computational and bandwidth demands grow significantly, raising challenges for AI training infrastructure. Following scaling laws, LLMs improve understanding, reasoning, and generation by expanding parameters and datasets, which requires robust compute systems. Large-scale AI clusters now require tens of thousands of GPUs or NPUs, as seen in the Llama-3 training configuration of 16K GPUs, which took 54 days. With AI data centers deploying more than 100K GPUs, scalable infrastructure is essential. In addition, interconnect bandwidth requirements exceed 3.2 Tbps per node, far surpassing traditional CPU-based systems. The growing cost of symmetric Clos network architectures makes cost-effective solutions critical, alongside the optimization of operating expenses such as power and maintenance. Moreover, high availability is a key concern, since massive training clusters experience frequent hardware failures, demanding fault-tolerant network designs.
Addressing these challenges requires rethinking AI data center architecture. First, network topologies must align with the structured traffic patterns of LLM training, which differ from those of traditional workloads: tensor parallelism, responsible for most data transfers, operates within small groups, while data parallelism involves minimal but long-range communication. Second, compute systems and network topologies must be co-designed, ensuring effective parallelism and resource-allocation strategies that avoid congestion and underutilization. Finally, AI clusters must feature self-healing mechanisms for fault tolerance, automatically rerouting traffic or activating backup NPUs when failures occur. These principles (localized network architectures, topology-aware computation, and self-healing systems) are essential for building efficient and resilient training infrastructure.
Huawei researchers introduced UB-Mesh, an AI data center network architecture designed for scalability, efficiency, and reliability. Unlike traditional symmetric networks, UB-Mesh uses a hierarchically localized nD-FullMesh topology, favoring short-range interconnects to minimize dependence on switches. Built on a 4D-FullMesh design, its UB-Mesh-Pod integrates specialized hardware and a Unified Bus (UB) technique for flexible bandwidth allocation. The All-Path Routing (APR) mechanism improves data traffic management, while a 64+1 backup system guarantees fault tolerance, as sketched below. Compared to Clos networks, UB-Mesh reduces switch usage by 98% and optical module dependence by 93%, achieving 2.04× cost efficiency with minimal performance trade-offs in LLM training.
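To make the 64+1 backup idea concrete, here is a minimal Python sketch: one spare NPU per rack takes over the rank of a failed worker so the job continues without re-planning the whole cluster. The class name, remapping policy, and IDs are illustrative assumptions, not the paper's actual implementation.

```python
# Toy sketch of the 64+1 backup concept (assumed policy, not Huawei's code):
# a rack of 64 worker NPUs keeps one spare; on failure, the failed NPU's
# logical rank is remapped onto the spare.
class RackFailover:
    def __init__(self, n_workers=64):
        self.rank_to_npu = {r: r for r in range(n_workers)}  # rank -> physical NPU
        self.spare = n_workers                               # the "+1" backup NPU

    def on_failure(self, failed_npu):
        """Remap the failed NPU's rank onto the spare, if one remains."""
        if self.spare is None:
            raise RuntimeError("no spare NPU left in this rack")
        for rank, npu in self.rank_to_npu.items():
            if npu == failed_npu:
                self.rank_to_npu[rank] = self.spare
                self.spare = None
                return rank
        raise KeyError(f"NPU {failed_npu} is not an active worker")

rack = RackFailover()
print(rack.on_failure(17))   # rank 17 now runs on the backup NPU (id 64)
```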
UB-Mesh is a high-dimensional full-mesh interconnect architecture designed to improve efficiency in large-scale AI training. It uses an nD-FullMesh topology that minimizes dependence on expensive switches and optical modules by maximizing direct electrical connections. The system is built from modular hardware components linked through the UB interconnect, streamlining communication across CPUs, NPUs, and switches. A 2D mesh structure connects 64 NPUs within a rack, extending to a 4D mesh at the Pod level. For scalability, a SuperPod structure integrates multiple Pods using a hybrid Clos topology, balancing performance, flexibility, and cost-efficiency in AI data centers.
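As a rough illustration of the nD-FullMesh idea, the following Python sketch treats nodes as coordinates in a grid and directly links any two nodes that differ in exactly one dimension, i.e., a full mesh along each dimension. The 8×8 layout for the 64-NPU rack is an assumed example, not Huawei's exact board arrangement.

```python
# Minimal sketch of an nD full mesh: nodes sharing all but one coordinate
# form a full mesh along that dimension. Dimension sizes are illustrative.
import math

def neighbors(coord, dims):
    """Direct links of a node: all nodes differing in exactly one dimension."""
    for d, size in enumerate(dims):
        for v in range(size):
            if v != coord[d]:
                yield coord[:d] + (v,) + coord[d + 1:]

dims = (8, 8)                                # assumed 8x8 grid = 64 NPUs per rack
print(len(list(neighbors((0, 0), dims))))    # 14 direct links: 7 per dimension

# Total bidirectional links: each dimension d forms a full mesh of size s
# within every fixed setting of the other coordinates.
total = sum(
    (s * (s - 1) // 2) * math.prod(t for j, t in enumerate(dims) if j != d)
    for d, s in enumerate(dims)
)
print(total)                                 # 448 links for the 8x8 example
```

The same enumeration extends to 4D by passing, say, `dims = (8, 8, 4, 4)`, which is how the rack-level mesh composes into a Pod-level mesh.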
To improve UB-Mesh's efficiency in large-scale AI training, the researchers use topology-aware strategies to optimize collective communication and parallelization. For AllReduce, a multi-ring algorithm minimizes congestion by efficiently mapping rings onto the mesh and exploiting idle links to improve bandwidth (see the sketch below). For all-to-all communication, a multi-path approach increases data transmission rates, while hierarchical methods optimize bandwidth for broadcast and reduce operations. In addition, the study refines parallelism configurations through a systematic search, prioritizing high-bandwidth setups. Comparisons with Clos architectures reveal that UB-Mesh maintains competitive performance while significantly reducing hardware costs, making it a cost-effective alternative for large-scale model training.
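A back-of-envelope model hints at why multi-ring AllReduce helps: if r edge-disjoint rings each carry 1/r of the payload over distinct links, the classic ring-AllReduce time shrinks by roughly a factor of r. The payload size and link bandwidth below are hypothetical illustration values, not figures from the paper.

```python
# Idealized cost model for ring AllReduce split across multiple rings that
# are mapped onto disjoint mesh links (illustrative numbers, not measurements).
def ring_allreduce_time(n_bytes, p, link_bw, n_rings=1):
    """Classic ring-AllReduce time, with the payload split across rings."""
    return 2 * (p - 1) / p * (n_bytes / n_rings) / link_bw

N = 1 << 30      # 1 GiB of gradients (assumed)
P = 64           # NPUs in one rack-level 2D mesh
BW = 50e9        # hypothetical 50 GB/s per direct link

print(f"1 ring : {ring_allreduce_time(N, P, BW) * 1e3:.1f} ms")    # ~42.3 ms
print(f"4 rings: {ring_allreduce_time(N, P, BW, 4) * 1e3:.1f} ms") # ~10.6 ms
```

The 4× speedup only materializes if the rings really traverse disjoint links, which is exactly what the topology-aware ring mapping aims to guarantee.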

In conclusion, the UB IO controller incorporates a specialized coprocessor, the Collective Communication Unit (CCU), to optimize collective communication tasks. The CCU manages data transfers, inter-NPU transmissions, and in-line data reduction using an on-chip SRAM buffer, minimizing redundant memory copies and reducing HBM bandwidth consumption. It also improves computation-communication overlap. In addition, UB-Mesh efficiently supports massive-expert MoE models by leveraging hierarchical all-to-all optimization and load/store-based data transfer. Overall, the study introduces UB-Mesh, an nD-FullMesh network architecture for LLM training that offers cost-effective, high-performance networking with 95%+ linearity, 7.2% improved availability, and 2.04× better cost efficiency than Clos networks.
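As a conceptual sketch of the CCU's on-chip staging, the snippet below reduces incoming peer tensors chunk by chunk through a small buffer, so only the final result is written back rather than every peer's full tensor being copied through main memory first. The chunk size and NumPy-based simulation are assumptions for illustration, not the hardware's actual datapath.

```python
# Conceptual model of in-line reduction through a small SRAM-sized staging
# buffer: accumulate peer chunks in the buffer, write back once per chunk.
import numpy as np

SRAM_CHUNK = 4096                        # hypothetical staging buffer (elements)

def ccu_style_reduce(local, incoming_streams):
    """Reduce peer tensors chunk-by-chunk through a small staging buffer."""
    out = local.copy()
    for start in range(0, local.size, SRAM_CHUNK):
        stage = out[start:start + SRAM_CHUNK].copy()   # load chunk into "SRAM"
        for stream in incoming_streams:                # accumulate in place,
            stage += stream[start:start + SRAM_CHUNK]  # no full-tensor copies
        out[start:start + SRAM_CHUNK] = stage          # single write-back
    return out

peers = [np.ones(16384, dtype=np.float32) for _ in range(3)]
result = ccu_style_reduce(np.ones(16384, dtype=np.float32), peers)
print(result[:4])                        # [4. 4. 4. 4.]
```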
Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on <a target="_blank" href="https://x.com/intent/follow?screen_name=marktechpost" rel="noreferrer noopener">Twitter</a> and don't forget to join our 85k+ ML SubReddit.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.