MUSE Silver

2024

One-stop Large Model Service Platform

Entrant Company

China Mobile (Suzhou) Software Technology Co., Ltd.

Category

Strategic Program - Internal Comm Campaign

Client's Name

Country / Region

China

The One-stop Large Model Service Platform offers robust functionality, strong performance, and reliable stability. It relies on the 'Skydome Computational Network Brain' to orchestrate and schedule resources across the domain, refines and optimizes models with hundreds of billions of parameters, and deploys massive domestic computing power resources.
Slow Provisioning: The supply cycle for computing clusters with thousands of cards is typically measured in months.
Interruptions: Clusters with thousands of cards can sustain stable training for an average of only 2-3 days.
High Threshold: In practical use, scattered computing power, heterogeneous chipsets, and differing computing frameworks continue to trouble users.
We developed and open-sourced the Elastic Resource Management Architecture KOSMOS, which launches ten-thousand-card computing resources within minutes, an industry-leading speed. Through data preheating and cache acceleration, read-write I/O performance improved by 20% and costs fell by one-third. As a result, the training cycle for large models with hundreds of billions of parameters was shortened from 45 days to 30 days.
We introduced the 'Training Continues Despite Checkpoints' technology. When a GPU fails during training, the remaining computing resources continue the training process, and the lost computing power is recovered and merged back in at a later point. This reduces the interruption from several hours under traditional checkpoint resumption to just 5 minutes, a reduction of over 90%. Using this technology, we achieved 20 days of ultra-stable training on a thousand-card cluster.
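The "over 90%" figure can be checked with simple arithmetic. The sketch below is illustrative only: the entry does not say how long "several hours" is, so the 60-minute baseline is an assumed lower bound, and the helper name `reduction_pct` is hypothetical. Any baseline longer than 50 minutes yields a reduction above 90% when the new interruption is 5 minutes.

```python
def reduction_pct(before: float, after: float) -> float:
    """Return the percentage reduction going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Interruption per failure, in minutes.
# ASSUMPTION: 60 minutes is a conservative stand-in for "several hours";
# the source gives no exact baseline.
assumed_baseline_min = 60.0
new_interruption_min = 5.0  # figure stated in the entry

print(f"Interruption reduction: "
      f"{reduction_pct(assumed_baseline_min, new_interruption_min):.1f}%")

# The same helper applied to the training cycle quoted in the entry
# (45 days shortened to 30 days).
print(f"Training cycle reduction: {reduction_pct(45.0, 30.0):.1f}%")
```

With a longer assumed baseline, such as 180 minutes, the computed reduction is larger still, so the stated claim holds under any reasonable reading of "several hours".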
The platform offers an integrated toolchain that provides unified design, unified scheduling, and unified operations. Relying on the Computational Network Brain, it enables intelligent edge orchestration and processes data through distributed computing. Data courier services expedite the data, transferring it across domains to central nodes for training. After one-click model framework conversion, the models are automatically deployed to heterogeneous chips at the edge for inference.


More Silver Winners
Experiential & Immersive
2024
MUSE Advertising Awards - AgeTech CES 2024
AgeTech Collaborative From AARP

Entrant Company

AARP Brand Creative Services

Category

Experiential & Immersive - Exhibition Experience

Country / Region

United States

Audio
2024
MUSE Advertising Awards - Road Trips

Entrant Company

Jon Whiting Has A Podcast

Category

Audio - Podcast

Country / Region

United States

Video
2024
MUSE Advertising Awards - Memories of Bethesda, Composition and Artwork
Mark Andersen and Lynn Noble

Entrant Company

International Artists Foundation

Category

Video - Art & Design

Country / Region

United States

Strategic Program
2024
MUSE Advertising Awards - Golf Future Rebrand
Golf Future

Entrant Company

Erickson Group Inc

Category

Strategic Program - Branding Refresh

Country / Region

Canada