AI News spoke with Damian Bogunowicz, a machine learning engineer at Neural Magic, to shed light on the company’s innovative approach to deep learning model optimisation and inference on CPUs.
One of the key challenges in developing and deploying deep learning models lies in their size and computational requirements. However, Neural Magic tackles this issue head-on through a concept called compound sparsity.
Compound sparsity combines techniques such as unstructured pruning, quantisation, and distillation to significantly reduce the size of neural networks while maintaining their accuracy.
“We have developed our own sparsity-aware runtime that leverages CPU architecture to accelerate sparse models. This approach challenges the notion that GPUs are necessary for efficient deep learning,” explains Bogunowicz.
Bogunowicz emphasised the benefits of their approach, highlighting that more compact models lead to faster deployments and can be run on ubiquitous CPU-based machines. The ability to optimise and run specified networks efficiently without relying on specialised hardware is a game-changer for machine learning practitioners, empowering them to overcome the limitations and costs associated with GPU usage.
When asked about the suitability of sparse neural networks for enterprises, Bogunowicz explained that the vast majority of companies can benefit from using sparse models.
By removing up to 90 percent of parameters without impacting accuracy, enterprises can achieve more efficient deployments. While extremely critical domains like autonomous driving or autonomous aeroplanes may require maximum accuracy and minimal sparsity, the advantages of sparse models outweigh the limitations for the majority of businesses.
Looking ahead, Bogunowicz expressed his excitement about the future of large language models (LLMs) and their applications.
“I’m particularly excited about the future of large language models LLMs. Mark Zuckerberg discussed enabling AI agents, acting as personal assistants or salespeople, on platforms like WhatsApp,” says Bogunowicz.
One example that caught his attention was a chatbot used by Khan Academy—an AI tutor that guides students to solve problems by providing hints rather than revealing solutions outright. This application demonstrates the value that LLMs can bring to the education sector, facilitating the learning process while empowering students to develop problem-solving skills.
“Our research has shown that you can optimise LLMs efficiently for CPU deployment. We have published a research paper on SparseGPT that demonstrates the removal of around 100 billion parameters using one-shot pruning without compromising model quality,” explains Bogunowicz.
“This means there may not be a need for GPU clusters in the future of AI inference. Our goal is to soon provide open-source LLMs to the community and empower enterprises to have control over their products and models, rather than relying on big tech companies.”
As for Neural Magic’s future, Bogunowicz revealed two exciting developments they will be sharing at the upcoming AI & Big Data Expo Europe.
Firstly, they will showcase their support for running AI models on edge devices, specifically x86 and ARM architectures. This expands the possibilities for AI applications in various industries.
Secondly, they will unveil their model optimisation platform, Sparsify, which enables the seamless application of state-of-the-art pruning, quantisation, and distillation algorithms through a user-friendly web app and simple API calls. Sparsify aims to accelerate inference without sacrificing accuracy, providing enterprises with an elegant and intuitive solution.
Neural Magic’s commitment to democratising machine learning infrastructure by leveraging CPUs is impressive. Their focus on compound sparsity and their upcoming advancements in edge computing demonstrate their dedication to empowering businesses and researchers alike.
As we eagerly await the developments presented at AI & Big Data Expo Europe, it’s clear that Neural Magic is poised to make a significant impact in the field of deep learning.