MUSE Silver

2024

One-stop Large Model Service Platform

Entrant Company

China Mobile (Suzhou) Software Technology Co., Ltd.

Category

Strategic Program - Internal Comm Campaign

Client's Name

Country / Region

China

The One-stop Large Model Service Platform offers robust functionality, strong performance, and reliable stability. It relies on the 'Skydome Computational Network Brain' to orchestrate and schedule resources across domains, refines and optimizes models with hundreds of billions of parameters, and deploys domestic computing resources at massive scale.
Slow Provisioning: The supply cycle for computing clusters with thousands of cards is typically measured in months.
Interruptions: Clusters with thousands of cards sustain stable training for only 2-3 days on average.
High Threshold: In practice, scattered computing power, heterogeneous chipsets, and differing computing frameworks continue to trouble users.
We developed and open-sourced KOSMOS, an elastic resource management architecture that brings ten-thousand-card computing resources online within minutes, an industry-leading provisioning time. Data preheating and cache acceleration increased read-write I/O by 20% and reduced costs by one-third. As a result, the training cycle for large models with hundreds of billions of parameters was shortened from 45 days to 30 days.
We introduced the 'Training Continues Despite Checkpoints' technology. When a GPU fails during training, the remaining computing resources continue the training process, and the lost computing power is merged back in and recovered at a later point. This reduces the interruption from several hours with traditional checkpoint resumption to just 5 minutes, a reduction of over 90%. With this technology, we achieved 20 days of ultra-stable training on a thousand-card cluster.
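The idea of continuing through a failure and catching up later can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the platform's actual mechanism: the function `simulate_training`, its parameters, and the "defer the failed worker's batches, then replay them once it rejoins" policy are all hypothetical and chosen purely for illustration.

```python
from collections import deque


def simulate_training(num_workers, num_steps, failed_worker, fail_step, recover_step):
    """Toy model (hypothetical): training keeps stepping while one worker is down.

    Returns (processed, per_step_counts): the batches processed, in order,
    and how many batches were processed during each step.
    """
    active = set(range(num_workers))
    backlog = deque()  # batches the failed worker could not process
    processed = []
    per_step_counts = []

    for step in range(num_steps):
        if step == fail_step:
            active.discard(failed_worker)  # simulated hardware failure
        if step == recover_step:
            active.add(failed_worker)      # replacement capacity rejoins

        done_before = len(processed)
        for worker in range(num_workers):
            batch = (step, worker)
            if worker in active:
                processed.append(batch)
            else:
                backlog.append(batch)      # defer the work instead of halting the job

        # Once every worker is active again, replay the deferred batches.
        if len(active) == num_workers:
            while backlog:
                processed.append(backlog.popleft())

        per_step_counts.append(len(processed) - done_before)

    return processed, per_step_counts


if __name__ == "__main__":
    _, counts = simulate_training(4, 10, failed_worker=2, fail_step=3, recover_step=6)
    print(counts)
```

In this sketch every step still processes some batches during the outage, and the deferred batches are replayed in the step where the worker rejoins, so no work is lost and the job never stops to reload a checkpoint.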
The platform offers an integrated toolchain with unified design, unified scheduling, and unified operations. Relying on the Computational Network Brain, it orchestrates edge resources intelligently and processes data through distributed computing. A data courier service expedites the transfer of data across domains to central nodes for training. After one-click conversion of the model framework, models are automatically deployed to heterogeneous chips at the edge for inference.

