Microsoft brings distilled DeepSeek R1 models to Copilot+ PCs
DeepSeek conquered the mobile world and it is now expanding to Windows – with the full support of Microsoft, surprisingly. Yesterday, the software giant added the DeepSeek R1 model to its Azure AI Foundry to allow developers to test and build cloud-based apps and services with it. Today, Microsoft announced that it is bringing distilled versions of R1 to Copilot+ PCs.
The distilled models will first be available on devices powered by Snapdragon X chips, followed by those with Intel Core Ultra 200V processors and then by AMD Ryzen AI 9-based PCs.
The first model will be DeepSeek-R1-Distill-Qwen-1.5B (i.e. a 1.5 billion parameter model), with larger and more capable 7B and 14B models coming soon. These will be available for download from Microsoft’s AI Toolkit.
Microsoft had to tweak these models to optimize them to run on devices with NPUs. Operations that rely heavily on memory access run on the CPU, while computationally intensive operations like the transformer block run on the NPU. With these optimizations, Microsoft managed to achieve a fast time to first token (130ms) and a throughput of 16 tokens per second for short prompts (under 64 tokens). Note that a “token” is similar to a syllable (importantly, one token is usually more than one character long).
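To make those two numbers concrete, here is a minimal sketch of how time to first token and tokens per second are typically measured for a streaming model. The `generate_stream` function is a hypothetical stand-in for whatever local inference API actually serves the model.

```python
import time

def measure_latency(generate_stream, prompt):
    """Measure time to first token (TTFT) and throughput for a streaming generator.

    `generate_stream` is a hypothetical function that yields tokens one at a time.
    """
    start = time.perf_counter()
    first_token_time = None
    token_count = 0

    for _ in generate_stream(prompt):
        if first_token_time is None:
            first_token_time = time.perf_counter()  # first token arrived
        token_count += 1

    end = time.perf_counter()
    ttft_ms = (first_token_time - start) * 1000
    tokens_per_second = token_count / (end - start)
    return ttft_ms, tokens_per_second

# A result of ~130 ms TTFT and ~16 tokens/s would match Microsoft's reported
# figures for short prompts (under 64 tokens).
```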
Microsoft is a strong supporter of and deeply invested in OpenAI (the maker of ChatGPT and GPT-4o), but it seems that it doesn’t play favorites – its Azure AI Foundry playground has GPT models (OpenAI), Llama (Meta), Mistral (an AI company) and now DeepSeek too.
DeepSeek R1 in the Azure AI Foundry playground

Anyway, if you’re more into local AI, download the AI Toolkit for VS Code first. From there, you should be able to download the model locally (e.g. “deepseek_r1_1_5” is the 1.5B model). Finally, hit Try in Playground and see how smart this distilled version of R1 is.
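If you would rather script against the model than use the playground UI, the AI Toolkit can serve locally downloaded models through an OpenAI-compatible REST endpoint. The sketch below assumes a local URL of `http://127.0.0.1:5272/v1/chat/completions` and the `deepseek_r1_1_5` model name mentioned above; check the Toolkit’s documentation for the actual port and model identifier on your machine.

```python
import requests

# Assumed local endpoint exposed by the AI Toolkit; verify the port on your setup.
URL = "http://127.0.0.1:5272/v1/chat/completions"

payload = {
    "model": "deepseek_r1_1_5",  # model name as shown in the AI Toolkit (assumption)
    "messages": [
        {"role": "user", "content": "Explain what a distilled model is in one sentence."}
    ],
    "max_tokens": 256,
}

response = requests.post(URL, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```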
“Model distillation”, sometimes called “knowledge distillation”, is the process of taking a large AI model (the full DeepSeek R1 has 671 billion parameters) and transferring as much of its knowledge as possible to a smaller model (e.g. 1.5 billion parameters). It’s not a perfect process and the distilled model is less capable than the full model – but its smaller size allows it to run directly on consumer hardware (instead of dedicated AI hardware that costs tens of thousands of dollars).
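As a rough illustration of what knowledge distillation looks like in practice, here is a minimal, generic training-loss sketch in PyTorch (not Microsoft’s or DeepSeek’s actual recipe): the small student model is trained to match the softened output distribution of the large teacher, in addition to the usual loss on the ground-truth labels. In DeepSeek’s case, the full 671B-parameter R1 would play the teacher role and the 1.5B distilled model the student.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Generic knowledge-distillation loss (illustrative only).

    The student mimics the teacher's softened probabilities (KL term) while
    still learning from the hard labels (cross-entropy term).
    """
    # Soften both distributions with temperature T, then match them via KL divergence.
    soft_teacher = F.log_softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kl = F.kl_div(soft_student, soft_teacher, log_target=True,
                  reduction="batchmean") * (T * T)

    # Standard cross-entropy against the ground-truth targets.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kl + (1 - alpha) * ce
```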