MUSE Silver

2024

One-stop Large Model Service Platform

Entrant Company

China Mobile (Suzhou) Software Technology Co., Ltd.

Category

Strategic Program - Internal Comm Campaign

Client's Name

Country / Region

China

The One-stop Large Model Service Platform offers robust functionality, strong performance, and reliable stability. It relies on the 'Skydome Computational Network Brain' to orchestrate and schedule resources across the entire domain, refines and optimizes models with hundreds of billions of parameters, and deploys large-scale domestic computing resources.
Slow Provisioning: Supplying computing power for a cluster with thousands of cards typically takes months.
Interruptions: Clusters with thousands of cards sustain stable training for only 2-3 days on average.
High Threshold: In practice, scattered computing power, heterogeneous chips, and differing computing frameworks continue to trouble users.
We developed and open-sourced KOSMOS, an elastic resource management architecture that brings up ten-thousand-card computing resources within minutes, an industry-leading result. Data preheating and cache acceleration improved read-write I/O performance by 20% and cut costs by one-third. As a result, the training cycle for large models with hundreds of billions of parameters was shortened from 45 days to 30 days.
We introduced the 'Training Continues Despite Checkpoints' technology. When a GPU fails during training, the remaining computing resources continue the training process, and the lost computing power is recovered and merged back in at a later point. This reduces the interruption from the several hours required by traditional checkpoint resumption to just 5 minutes, a reduction of over 90%. With this technology, we achieved 20 days of ultra-stable training on a thousand-card cluster.
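The idea of continuing on the remaining resources and merging recovered capacity back in later can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not the platform's implementation: the class `ElasticJob`, its methods, and the worker and batch counts are all hypothetical names and values chosen for illustration.

```python
class ElasticJob:
    """Toy model (hypothetical) of a data-parallel job that keeps
    stepping on the remaining workers when one of them fails."""

    def __init__(self, total_workers):
        self.total_workers = total_workers
        self.healthy = set(range(total_workers))  # ids of usable workers
        self.samples_done = 0

    def fail(self, worker_id):
        # A failed worker is dropped; the job itself is not stopped.
        self.healthy.discard(worker_id)

    def rejoin(self, worker_id):
        # A recovered worker is merged back into the running job.
        if 0 <= worker_id < self.total_workers:
            self.healthy.add(worker_id)

    def step(self, per_worker_batch):
        # Each healthy worker processes one batch in this step.
        if not self.healthy:
            raise RuntimeError("no healthy workers left")
        processed = len(self.healthy) * per_worker_batch
        self.samples_done += processed
        return processed


def run_demo():
    job = ElasticJob(total_workers=8)
    throughput = []
    for step in range(6):
        if step == 2:
            job.fail(3)      # simulated GPU failure
        if step == 4:
            job.rejoin(3)    # simulated recovery
        throughput.append(job.step(per_worker_batch=4))
    return throughput


if __name__ == "__main__":
    print(run_demo())
```

In this sketch, per-step throughput dips while worker 3 is out and returns to the full rate once it rejoins, without the job ever stopping to reload a checkpoint.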
The platform offers an integrated toolchain that provides unified design, unified scheduling, and unified operations. Relying on the Computational Network Brain, it orchestrates edge resources intelligently and processes data through distributed computing. A data courier service then transfers the data across domains to central nodes for training. After one-click conversion of the model framework, the models are automatically deployed to heterogeneous chips at the edge for inference.

