2024
Entrant Company
Category
Client's Name
Country / Region
The One-stop Large Model Service Platform offers robust functionality, powerful performance, and reliable stability. It relies on the 'Skydome Computational Network Brain' to orchestrate and schedule resources across domains, refines and optimizes models with hundreds of billions of parameters, and deploys massive domestic computing power resources.
Slow provisioning: for computing power supply, the delivery cycle for thousand-card clusters is typically measured in months.
Frequent interruptions: in terms of training stability, thousand-card clusters sustain uninterrupted training for an average of only 2-3 days.
High barrier to entry: in practice, scattered computing power, heterogeneous chips, and differing computing frameworks continue to trouble users.
We developed and open-sourced KOSMOS, an elastic resource management architecture that brings ten-thousand-card computing resources online within minutes, an industry-leading result. Data preheating and cache acceleration improved read/write I/O by 20% and cut costs by one third, shortening the training cycle for models with hundreds of billions of parameters from 45 days to 30 days.
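As a rough illustration of the data-preheating idea (a minimal sketch only, not the KOSMOS implementation; the paths, shard naming, and thread-pool size below are assumptions), the Python snippet pre-stages training shards from slow remote storage into a fast local cache so that subsequent reads hit the local tier:

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

REMOTE_DATA = Path("/mnt/remote/dataset")   # hypothetical slow remote mount
LOCAL_CACHE = Path("/nvme/cache/dataset")   # hypothetical fast local NVMe cache

def prewarm(shard: Path) -> Path:
    """Copy one training shard into the local cache if it is not already there."""
    target = LOCAL_CACHE / shard.relative_to(REMOTE_DATA)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(shard, target)
    return target

def prewarm_dataset(max_workers: int = 16) -> list:
    """Pre-stage all shards in parallel before the training job starts."""
    shards = sorted(REMOTE_DATA.glob("*.bin"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(prewarm, shards))

if __name__ == "__main__":
    cached = prewarm_dataset()
    print(f"prewarmed {len(cached)} shards into {LOCAL_CACHE}")
```

Pre-staging in parallel before the first training step turns many small remote reads during training into one overlapped bulk transfer up front.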
We innovatively introduced a 'training continues through failures' technique: when a GPU fails during training, the remaining computing resources carry on the training process, and the lost capacity is merged back in and recovered later. This cuts the interruption from the several hours required by traditional checkpoint-based resumption to just 5 minutes, a reduction of over 90%. Leveraging this technique, we sustained ultra-stable training on a thousand-card cluster for 20 days.
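To make the idea concrete, here is a deliberately simplified simulation (not the platform's real mechanism; the worker names, shard list, and round-robin rebalancing are assumptions): when one worker drops out, its shards are re-spread over the survivors on the next step instead of halting the job for a checkpoint restore, and the recovered worker is merged back in later.

```python
class ElasticTrainer:
    """Toy coordinator: training keeps going even if a worker is lost mid-run."""

    def __init__(self, workers, shards):
        self.active = list(workers)
        self.shards = list(shards)

    def assignment(self):
        # Round-robin the shards over whichever workers are currently alive.
        plan = {w: [] for w in self.active}
        for i, shard in enumerate(self.shards):
            plan[self.active[i % len(self.active)]].append(shard)
        return plan

    def drop(self, worker):
        # A failed GPU leaves the group; its shards are re-spread on the next
        # step instead of halting the whole job for a checkpoint restore.
        self.active.remove(worker)

    def rejoin(self, worker):
        # Recovered capacity is merged back into the group later.
        self.active.append(worker)

    def step(self, step_id):
        print(f"step {step_id}: {len(self.active)} workers -> {self.assignment()}")

if __name__ == "__main__":
    trainer = ElasticTrainer(workers=["gpu0", "gpu1", "gpu2", "gpu3"],
                             shards=list(range(8)))
    trainer.step(1)
    trainer.drop("gpu2")    # simulated GPU failure: rebalance, no rollback
    trainer.step(2)
    trainer.rejoin("gpu2")  # lost capacity merged back in later
    trainer.step(3)
```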
The platform offers an integrated toolchain with unified design, unified scheduling, and unified operations. Relying on the Computational Network Brain, it orchestrates edge resources intelligently and preprocesses data with distributed computing; a data courier service then expedites the data across domains to central nodes for training. After one-click model framework conversion, trained models are automatically deployed to heterogeneous chips at the edge for inference.
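As a sketch of that edge-to-center workflow (the stage names, the Job container, and the ONNX conversion target are illustrative placeholders, not the platform's actual APIs), the stages can be read as a simple chain: preprocess at the edge, courier the data to a central node, train centrally, then convert and deploy back to heterogeneous edge chips.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    """State carried through the edge-to-center pipeline (fields are illustrative)."""
    name: str
    artifacts: dict = field(default_factory=dict)

def preprocess_at_edge(job: Job) -> Job:
    # Distributed preprocessing close to where the data is produced.
    job.artifacts["clean_data"] = f"{job.name}-cleaned"
    return job

def courier_to_center(job: Job) -> Job:
    # "Data courier": move the prepared data across domains to a central node.
    job.artifacts["central_copy"] = f"central:/{job.artifacts['clean_data']}"
    return job

def train_at_center(job: Job) -> Job:
    # Large-scale training on the centrally pooled compute.
    job.artifacts["model"] = f"{job.name}-model.ckpt"
    return job

def convert_and_deploy(job: Job, targets) -> Job:
    # One-click framework conversion (ONNX here is just a stand-in format),
    # then push the converted model to heterogeneous edge chips.
    job.artifacts["converted"] = job.artifacts["model"].replace(".ckpt", ".onnx")
    job.artifacts["deployed_to"] = targets
    return job

if __name__ == "__main__":
    job = Job(name="demo")
    for stage in (preprocess_at_edge, courier_to_center, train_at_center):
        job = stage(job)
    job = convert_and_deploy(job, targets=["edge-npu-a", "edge-gpu-b"])
    print(job.artifacts)
```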
Entrant Company: Parnaso
Category: Video - Branding
Country / Region: Spain

Entrant Company: Engine Shop
Category: Event - Roadshow
Country / Region: United States

Entrant Company: Weining Yan, Linpei Zhang, Yizheng Wang, Jingyi Xiao
Category: Student Submission - Student Conceptual Design
Country / Region: China

Entrant Company: Link Logistics
Category: Publication - Annual Report
Country / Region: United States