Microsoft builds a supercomputer to train large AI models

The supercomputer developed for OpenAI is a single system with more than 285,000 CPU cores, 10,000 GPUs and 400 gigabits per second of network connectivity for each GPU server

Microsoft has built a new powerful supercomputer. This is in collaboration with Artificial Intelligence (AI) startup OpenAI, to make new infrastructure available in Azure to train extremely large AI models.

Microsoft announced a multi-year supercomputer partnership with OpenAI in 2019, including a $1 billion investment by the tech giant.

The supercomputer developed for OpenAI is a single system with more than 285,000 CPU cores, 10,000 GPUs and 400 gigabits per second of network connectivity for each GPU server.

Compared with other machines listed on the TOP500 supercomputers in the world, it ranks in the top five.

"Built-in collaboration with and exclusively for OpenAI, the supercomputer hosted in Azure was designed specifically to train that company's AI models," the company announced at its virtual 'Build 2020' conference on Tuesday.

Hosted in Azure, the supercomputer also benefits from all the capabilities of robust modern cloud infrastructure, including rapid deployment, sustainable data centres and access to Azure services.

"This is about being able to do a hundred exciting things in natural language processing at once and a hundred exciting things in computer vision, and when you start to see combinations of these perceptual domains, you're going to have new applications that are hard to even imagine right now," explained Microsoft Chief Technical Officer Kevin Scott.

As part of its 'AI at Scale' initiative, Microsoft has developed its own family of large AI models, the Microsoft Turing models, which it has used to improve many different language understanding tasks across Bing, Office, Dynamics and other productivity products.

Earlier this year, it also released to researchers the largest publicly available AI language model in the world, the Microsoft Turing model for natural language generation.

The goal, Microsoft said, is to make its large AI models, training optimization tools and supercomputing resources available through Azure AI services and GitHub so developers, data scientists and business customers can easily leverage the power of AI at Scale.

"As we've learned more and more about what we need and the different limits of all the components that make up a supercomputer, we were really able to say, 'If we could design our dream system, what would it look like?'" said OpenAI CEO Sam Altman.

And then Microsoft was able to build it.

"We are seeing that larger-scale systems are an important component in training more powerful models," Altman added.

Microsoft has also unveiled a new version of DeepSpeed, an open-source deep learning library for PyTorch that reduces the amount of computing power needed for large distributed model training.

The update is significantly more efficient than the version released just three months ago and now allows people to train models more than 15 times larger and 10 times faster than they could without DeepSpeed on the same infrastructure.

*Edited from an IANS report

Microsoft builds a supercomputer to train large AI models

Related Stories