MUSE Silver

2024

One-stop Large Model Service Platform

Entrant Company

China Mobile (Suzhou) Software Technology Co., Ltd.

Category

Strategic Program - Internal Comm Campaign

Client's Name

Country / Region

China

The One-stop Large Model Service Platform offers robust functionality, strong performance, and reliable stability. It relies on the 'Skydome Computational Network Brain' to orchestrate and schedule resources across the domain, refines and optimizes models with hundreds of billions of parameters, and deploys massive domestic computing power resources.
Slow Provisioning: Supplying computing power for a cluster with thousands of cards typically takes months.
Interruptions: Clusters with thousands of cards sustain stable training for only 2-3 days on average.
High Threshold: In practice, scattered computing power, heterogeneous chipsets, and differing computing frameworks continue to trouble users.
We developed and open-sourced KOSMOS, an elastic resource management architecture that brings ten-thousand-card computing resources online within minutes, an industry-leading result. Through data preheating and cache acceleration, read-write I/O increased by 20% and costs fell by one-third. As a result, the training cycle for large models with hundreds of billions of parameters was shortened from 45 days to 30 days.
We introduced the 'Training Continues Despite Checkpoints' technology. When a GPU fails during training, the remaining computing resources continue the training process, and the lost computing power is recovered and merged back in later. This reduces the interruption from several hours under traditional checkpoint resumption to just 5 minutes, a reduction of over 90%. Using this technology, we achieved 20 days of ultra-stable training on a thousand-card cluster.
The platform offers an integrated toolchain that provides unified design, unified scheduling, and unified operations. Relying on the Computational Network Brain, it enables intelligent edge orchestration and processes data through distributed computing. Data courier services expedite the data across domains to central nodes for training. After one-click conversion of the model framework, the models are automatically deployed to heterogeneous chips at the edge for inference.

