Local AI in the mobile NPU: what it actually does and how far it goes

  • The mobile SoC's NPU is a specialized neural network accelerator that complements the CPU and GPU, offering more performance per watt in AI tasks.
  • Local AI reduces latency and improves privacy by processing data on the device, but it is limited by RAM, heat, battery, and the size of the models it can handle.
  • Manufacturers are integrating increasingly powerful NPUs into mobile phones, PCs, and cars, but many apps still don't take full advantage of them, so the CPU and GPU continue to do most of the work.
  • The immediate future involves a hybrid model: part of the AI runs locally on the NPU and part in the cloud, balancing speed, model quality, and power consumption.

The idea of having a powerful AI model running directly on your phone, with no cloud involved, sounds great… until you actually try it. If you have a Galaxy S24 Ultra, download models like Qwen 3.5 4B, and run them with apps like PocketPal, Offgrid, or ChatterUI, you'll run into a less glamorous reality: 4 tokens per second, an endless wait for the first token, a phone that overheats, and the feeling that your super SoC is nowhere near using its NPU the way the marketing promised.

At the same time, the industry never stops talking about NPUs, local AI, Copilot+ PCs, the Apple Neural Engine, and so on. Manufacturers have been packing AI accelerators into their SoCs for years, both in phones and laptops, assuring us that they are the future of personal computing. The problem is that with so many acronyms and promises, it's easy to get lost: what exactly does the phone's NPU do? Why does the CPU sometimes seem to perform better? When does it make sense to use cloud-based AI, and when is it worth relying on local AI?

What exactly is the NPU in a mobile SoC and what role does it play in local AI?

In a modern smartphone, the so-called “processor” is actually a SoC (System on Chip). On the same piece of silicon you'll find the CPU, GPU, ISP, modem, security units… and, for some years now, an NPU or neural engine dedicated to AI. It doesn't replace the CPU or GPU: it complements them for a very specific type of work.

An NPU (Neural Processing Unit) is a hardware block designed to run neural networks at a massive pace: thousands of multiply and add operations in parallel, on low-precision data (INT8, FP16, even INT4), with memory placed very close to the compute units to avoid wasting time moving weights and activations. It can't "do a little bit of everything" like a CPU, but what it can do, it does with brutal efficiency.

That specialization fits like a glove with almost everything we understand today as AI: computer vision, speech recognition, image classification, translation, language modeling and, in general, any modern neural network. Instead of overloading the CPU or waking up the GPU for each AI task, the system sends those operations to the NPU, which performs them with less energy and less heat.

In fact, most major manufacturers describe their NPUs in those terms. Qualcomm talks about more performance per watt for AI workloads; Huawei sells it as the key to doing more in less time without draining the battery; Apple defines it as a GPU-like engine that accelerates convolutions and matrix multiplications; AMD and Intel integrate it into their CPUs to offload low-power AI tasks; and Samsung insists that its NPU is optimized for simultaneous matrix operations and continuous learning from accumulated data.

NPUs: neither new nor exclusive to mobile

It may seem that NPUs appeared out of nowhere with the hype surrounding generative AI, but the reality is that we've been carrying them in our pockets for almost a decade without even realizing it. In 2017, Apple released the iPhone X with Face ID and Animoji thanks to its A11 Bionic chip, which already featured a dedicated "neural engine," although few paid attention to the name at the time.

Since then, Apple has been beefing up the Apple Neural Engine generation after generation. The ANE of the iPhone X delivered around 0.6 TOPS (trillions of operations per second) in FP16. Today, the A17 Pro in an iPhone 15 Pro is around 35 TOPS, and the M4 chip for iPad and Mac goes up to about 38 TOPS. That is, in a few years we have gone from a "token" neural engine to one capable of running models that we previously only saw in data centers.

Google has done something similar on its side with the TPU (Tensor Processing Unit): first in its data centers, with giant chips for training neural networks, and then in Pixel phones with the Google Tensor family (Pixel 6, 7, 8…). There it integrates a TPU/NPU into the SoC to bring camera, voice and, increasingly, generative AI functions onto the device itself.

In the PC world, Intel and AMD have had to step up their game. Intel is including NPUs in its Core Ultra (Meteor Lake) processors, with around 8-12 TOPS, while AMD debuted Ryzen AI in its Ryzen 7040 laptop processors, with up to 10 TOPS, and briefly brought it to a small batch of Ryzen 8000 desktop processors quoted at up to 39 TOPS (AMD's figure for the whole chip; the NPU itself accounts for about 16 TOPS). The idea is the same: take AI to the edge and depend less on the cloud for everything.

How an NPU works: why it's so good for AI… and so bad for everything else

If we mentally open up the chip, an NPU looks more like a matrix multiplication factory than a classic CPU. Instead of a few highly versatile cores, it has tens of thousands of simple ALUs arranged in a grid, capable of performing "multiply-accumulate" (MAC) operations in parallel, often at low precision.
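
To make the MAC idea concrete, here is a minimal sketch in plain Python. The function names and the sample values are illustrative assumptions, not taken from any NPU SDK; real hardware would run every row (and many MACs within a row) in parallel instead of looping.

```python
def mac_dot(weights, activations):
    """Dot product written as a chain of multiply-accumulate (MAC) steps."""
    acc = 0  # wide accumulator; NPUs typically accumulate INT8 products in 32 bits
    for w, a in zip(weights, activations):
        acc += w * a  # one MAC: multiply, then add into the accumulator
    return acc


def matvec(matrix, vector):
    """Each output element is an independent dot product, which is why
    a grid of MAC units can compute all of them at the same time."""
    return [mac_dot(row, vector) for row in matrix]


weights = [[1, -2, 3], [4, 0, -1]]  # small integer weights (INT8 range)
activations = [10, 20, 30]
print(matvec(weights, activations))  # [60, 10]
```

A neural network layer is essentially many of these matrix-vector products, so the number of MAC units and how well they are kept busy largely determine inference speed.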

The trick is to organize these units as a kind of systolic array: data enters on one side and passes from cell to cell, and each cell performs its small operation before passing the data on to the next. This minimizes accesses to main memory and maximizes the use of the MAC units, which is precisely what a neural network needs during inference.
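
The data flow can be sketched with a toy, cycle-by-cycle simulation. This is an illustrative model of one common variant (an output-stationary systolic array), not a description of any specific NPU; the function name and register layout are assumptions made for the example.

```python
def systolic_matmul(A, B):
    """Toy simulation of an output-stationary systolic array computing A x B.

    Rows of A enter from the left (row i delayed by i cycles) and columns
    of B enter from the top (column j delayed by j cycles). Each cell
    multiplies the two values it currently holds, adds the product to its
    own accumulator, and hands the values to its right and lower neighbours.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per cell
    a_reg = [[0] * m for _ in range(n)]  # values travelling to the right
    b_reg = [[0] * m for _ in range(n)]  # values travelling downwards

    for t in range(n + m + k - 2):       # cycles needed to drain the array
        for i in range(n):               # shift A values one cell to the right
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            s = t - i                    # skewed injection at the left edge
            a_reg[i][0] = A[i][s] if 0 <= s < k else 0
        for j in range(m):               # shift B values one cell down
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            s = t - j                    # skewed injection at the top edge
            b_reg[0][j] = B[s][j] if 0 <= s < k else 0
        for i in range(n):               # every cell performs one MAC per cycle
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Note that each input value is read from "memory" once, at the edge of the grid, and is then reused by every cell it passes through. That reuse is what keeps main-memory traffic low.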

To achieve this efficiency, the NPU forgoes many of the features that make a CPU or GPU more expensive: it lacks complex branch prediction logic, an elaborate cache system, and support for the full range of general-purpose instructions. Its ISA is typically minimal: DMA for moving data, dot products, additions, activations, and little else.

It also plays with numerical precision. While a traditional CPU or GPU operates comfortably in 32-bit or 64-bit floating point, an NPU typically works in INT8, FP16, and even INT4. For a trained neural network, this level of precision is usually sufficient to deliver excellent results, and it allows significantly more operations per cycle with much less energy per operation.
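
As a rough illustration of what "working in INT8" means, the sketch below applies symmetric per-tensor quantization to a handful of made-up weights. The scheme and the helper names are assumptions chosen for simplicity; production toolchains use more elaborate variants (per-channel scales, zero points, calibration data).

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [x * scale for x in q]


weights = [0.5, -1.27, 0.0, 1.0]  # illustrative FP32 weights
q, scale = quantize_int8(weights)
print(q)  # [50, -127, 0, 100]

restored = dequantize(q, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.4f}, worst rounding error={worst_error:.6f}")
```

Each weight now occupies one byte instead of four, and the rounding error per weight is bounded by half the scale, which a trained network usually tolerates well.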

CPU, GPU, NPU and TPU: who does what in AI

The CPU remains the "general brain": it manages the operating system, coordinates tasks, and executes control logic. It's capable of running small models, but when you ask it to handle a large network or maintain sustained text generation, it becomes a bottleneck in latency and power consumption.

The GPU is the workhorse of deep learning. The work of rendering graphics (many similar operations on large vectors) translates very well to training and running neural networks. Modern GPUs also incorporate dedicated tensor cores that, in practice, behave like small NPUs within the GPU itself.

The NPU, on the other hand, is designed solely for AI inference. It's not suitable for gaming, rendering interfaces, or compiling code, but it is ideal for running vision, voice, or language networks with an energy efficiency that the GPU can't match in a mobile phone or ultralight laptop.

Google's TPUs are a close cousin: ASICs focused on tensor operations to accelerate AI models, especially in its data centers. The Edge TPU on the Coral Dev Board, for example, offers about 4 TOPS at only a few watts, ideal for cameras and IoT devices that need real-time computer vision without overheating or consuming too much power.

In summary, the ideal combination in a modern device is: CPU for general logic, GPU for graphics workloads and flexible parallel computing, and NPU/TPU for neural networks. Each one does its own thing, and when the software is well written, the system distributes work quite intelligently.

Cloud AI vs. local AI: speed, privacy, and cost

Until very recently, almost everything we associated with "powerful AI" happened in the cloud: ChatGPT, Gemini, Stable Diffusion, advanced assistants… Mobile phones only acted as a dumb terminal that sent data and received a processed response on a server full of GPUs or TPUs.

This architecture has an obvious advantage: you can run gigantic models without worrying about how powerful the end user's device is. A cheap low-end phone and a top-of-the-range flagship receive the same result, because the heavy lifting is done by a data center with dedicated hardware.

But it also has significant drawbacks. Latency depends entirely on the connection: if you have poor coverage, are on a plane, or live in a town with unreliable ADSL, many features cease to be "magical" and become downright useless. Furthermore, each request requires sending data to third parties and trusting that it will be handled correctly.

Local AI plays precisely the opposite game: it brings the model to the device and runs inference on the device's own CPU, GPU, or NPU. This eliminates network latency, enables offline AI and, most importantly, means your data doesn't have to leave the phone, the laptop, or the car unless you want it to.

However, local AI is limited by what the hardware can handle: RAM, VRAM, thermal headroom, and battery. A model with 70 billion parameters doesn't fit comfortably on a phone today; we have to resort to reduced, quantized, and highly optimized versions if we want something fluid and sustainable.
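
A quick back-of-the-envelope calculation shows why. The helper below is a hypothetical sketch that counts only the weights (parameters times bits per weight); it ignores the KV cache, activations, and runtime overhead, so real memory use is higher.

```python
def weights_size_gb(params_billion, bits_per_weight):
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


for name, params, bits in [
    ("70B at FP16", 70, 16),
    ("70B at 4-bit", 70, 4),
    ("4B at 8-bit", 4, 8),
    ("4B at 4-bit", 4, 4),
]:
    print(f"{name}: ~{weights_size_gb(params, bits):.1f} GB")
# 70B at FP16: ~140.0 GB
# 70B at 4-bit: ~35.0 GB
# 4B at 8-bit: ~4.0 GB
# 4B at 4-bit: ~2.0 GB
```

Even aggressively quantized, a 70B model needs several times the RAM of a typical flagship phone, whereas a 4B model at 4 bits leaves room for the operating system and other apps.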

Mobile NPUs: from the camera to the assistant, including local LLMs

In the smartphone world, NPUs have been working quietly for years on everything related to mobile photography and video, facial recognition, voice, and translation. Manufacturers have been adding features on top of that.

In the Apple ecosystem, the Neural Engine handles Face ID, face and object detection in the gallery, dictation, live translation, text recognition in images, AR, and a whole host of other tasks we take for granted. With the A16, A17, and the M3/M4 family, Apple is starting to make moves so that Siri and other generative AI features work on the device itself without so much dependence on the cloud, taking advantage of those 30-40 TOPS of neural engine.

Google, with its Tensor G2 and G3, does something similar in the Pixel. The Pixel 8, with its integrated TPU, can run reduced versions of models such as PaLM 2 or Gemini Nano on the device for tasks such as translation, reading websites aloud, local summaries, smoother voice typing, or camera tricks like Best Take and Audio Magic Eraser, all without the chip constantly needing to send data to Google's servers.

Qualcomm, for its part, has used Hexagon NPU engines in the Snapdragon series for several generations. The Snapdragon 8 Gen 3 boasts an NPU that is 98% faster than the Gen 2's and capable of running LLMs of up to 10 billion parameters on the mobile device itself, with public demonstrations of Stable Diffusion generating images at high speed and Llama 2 or Llama 3 running completely offline.

MediaTek is not far behind with the APUs (AI Processing Units) in its Dimensity series: the sixth-generation APU handles tasks such as real-time AI photo remastering in phones like the Oppo Find X8, and the company has signaled that the same NPU technology will be coming to televisions, IoT, and even cars.

What's happening in PCs and cars with NPUs

In the PC arena, Microsoft has launched the “AI PC” category, relying on NPUs integrated into Intel, AMD, and Qualcomm SoCs. Intel Core Ultra (Meteor Lake) incorporates an NPU of around 8-12 TOPS to accelerate Windows 11 features such as background blur, synthetic eye contact, noise reduction and, in the future, parts of Copilot.

AMD debuted Ryzen AI in the Ryzen 7040 series for laptops and, briefly, in the Ryzen 8000 series desktops, quoted at up to 39 TOPS (AMD's figure for the whole chip; the NPU itself accounts for about 16 TOPS). Although that approach has since been readjusted, the message is clear: the PC of the future will always have a dedicated AI block, just as it has had an integrated GPU for years.

In the automotive industry, the numbers get much bigger. Tesla has two generations of Full Self-Driving hardware with dual NPUs: HW3 was around 144 TOPS and HW4 is around 200-250 TOPS, all to process the signals from many cameras and sensors in real time and run neural networks that make driving decisions in a matter of milliseconds.

NVIDIA, with its Drive Thor platform, takes another leap: a single chip can reach up to 1000 TOPS, or 2000 TOPS with two linked together. It's designed to centralize both autonomous driving and in-cabin AI (voice assistants, driver monitoring, entertainment, etc.). The philosophy is the same: the more real-time AI you want to integrate into the car, the more sense a dedicated accelerator in the vehicle makes.

Beyond private cars, NPUs also reign supreme in security cameras, drones, and robots: devices like the Hailo-8 (26 TOPS at low power), Intel's Myriad, and Google's Edge TPU enable computer vision at the edge without overloading networks or data centers.

Local AI on the "real" mobile: PocketPal, MNN Chat and others

Beyond the functions decided by the manufacturer, more and more users want to run their own language models locally on their phones, without relying on ChatGPT, Gemini, or similar apps. That's where apps like PocketPal, Offgrid, ChatterUI, or MNN Chat come in.

PocketPal is one of the most accessible. It lets you download open-source models (Llama, Gemma, Phi, Qwen, Mistral…) in compact formats like GGUF and run them directly on your phone, offline and with total privacy: the prompts and responses never leave the device. All you need is a relatively modern Android or iOS phone, about 6-8 GB of RAM, and several gigabytes of free storage for the models.

In practice, models with between 1B and 4B parameters (such as Qwen2.5-1.5B, Llama 3.2 3B, or Qwen3-4B-Instruct) work reasonably well on mid-range phones. However, typical performance is usually between 5 and 20 tokens per second on high-end phones, and even less on lower-end ones, far from what can be achieved on a server with a professional GPU.

To squeeze out extra performance, on iPhone it's advisable to use Metal and increase the number of GPU layers; on Android, some apps are starting to take advantage of Vulkan, the GPU and, on rare occasions, the NPU via NNAPI. Even so, in many of these solutions the real burden still falls on the CPU and GPU, and the NPU remains underutilized because the software layer is not mature.

The case of MNN Chat is illustrative: it is one of the fastest apps that many users have tried on an S24 Ultra, but at the cost of using highly quantized models, with some sacrifice in quality, and without it being clear whether it is fully utilizing the Snapdragon's NPU or "only" optimizing the CPU/GPU route very well.

Why your S24 Ultra isn't getting 100% out of its NPU with Qwen 3.5 4B

Although on paper the SoC of an S24 Ultra or S25 Ultra can handle models with up to 10 billion parameters and more than 40 TOPS of AI compute, when you install an LLM like Qwen 3.5 4B in a generic app, the same thing usually happens: it starts quickly, then heats up, performance drops, and it stabilizes well below expectations.

The main reason is that, in most third-party apps, the model runs on the CPU or GPU using general-purpose libraries (BLAS, Vulkan, Metal) without direct, fine-grained access to the SoC's NPU. On mobile devices, the NPU is typically exposed through APIs like NNAPI on Android or Core ML on iOS, but not all local LLM frameworks are well-integrated with them, and manufacturer support varies.

The result is that a simple test, such as the one Nexa AI showed with a high-end Galaxy generating continuous text, clearly demonstrates the behavior: if everything relies on the CPU, the tokens per second are initially very high. But within minutes the temperature rises, the system lowers frequencies to avoid exceeding the thermal limit, and performance drops to a much more modest but sustainable level.

When the workload truly shifts to the NPU, the profile changes: you don't see such a spectacular spike at the beginning, but you do see token production that is much flatter and more stable over time, with a lower temperature and less impact on battery life. The problem, as of today, is getting a local LLM app to communicate with that NPU seamlessly.

Furthermore, there are other physical limitations that cannot be addressed with software: the amount of available RAM, the SoC's memory bandwidth, and the size of the model itself. On mobile devices, the "comfort zone" for LLMs is usually quantized models of about 3-4 GB in size. Above that, loading times, power consumption, and throttling almost always increase.
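
If you want to check which of these two profiles your own setup follows, one simple approach is to log the arrival time of each token and look at the rate per time window. The helper below is a hypothetical sketch that works on plain timestamps (in seconds); how you collect them depends on the app or library you use, and the sample timestamps are invented for the example.

```python
def throughput_profile(start, token_times, window=10.0):
    """Return (time_to_first_token, tokens_per_second_for_each_window).

    `start` is when the prompt was sent; `token_times` are the arrival
    times of each generated token. The last window may be only partly
    filled, so its rate can look lower than it really is.
    """
    if not token_times:
        return None, []
    ttft = token_times[0] - start
    counts = {}
    for t in token_times:
        idx = int((t - start) // window)  # which window this token falls in
        counts[idx] = counts.get(idx, 0) + 1
    rates = [counts.get(i, 0) / window for i in range(max(counts) + 1)]
    return ttft, rates


# Invented timestamps: a fast start followed by a slower, throttled phase.
times = [2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18]
print(throughput_profile(0, times))  # (2, [0.8, 0.5])
```

A CPU-bound run that throttles shows a high first window followed by clearly lower ones, while a run that stays within its thermal budget produces roughly equal values across windows.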
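
Memory bandwidth puts a hard ceiling on generation speed. For a dense model, producing each new token requires reading roughly all the weights once, so a common rule of thumb is that tokens per second cannot exceed bandwidth divided by model size. The sketch below applies that rule; the 60 GB/s figure is an illustrative assumption, not the specification of any particular phone.

```python
def max_decode_tokens_per_s(bandwidth_gb_s, model_size_gb):
    """Upper bound on decoding speed when every token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


ASSUMED_BANDWIDTH_GB_S = 60  # hypothetical memory bandwidth, for illustration only

for size_gb in (2, 3, 4, 8):
    limit = max_decode_tokens_per_s(ASSUMED_BANDWIDTH_GB_S, size_gb)
    print(f"{size_gb} GB model: at most ~{limit:.1f} tokens/s")
```

Real throughput lands below this bound because of compute limits, the KV cache, and thermal throttling, but the estimate helps explain why doubling the model size roughly halves the achievable speed regardless of how many TOPS the NPU advertises.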

Therefore, although the marketing of chips like Snapdragon 8 Gen 3 or 8 Gen 4 talks about "10B LLMs on the device", in practice the user experience with heavy open source models remains delicate, especially if the app is not designed from scratch to squeeze the most out of the NPU using the manufacturer's official SDKs.

Advantages and drawbacks of local AI on mobile

Running AI locally on mobile devices has enormous appeal. To begin with, there's privacy. If the model is on the phone and there are no calls to external servers, everything you tell it stays there. This is invaluable for sensitive uses (personal notes, medical data, internal company documents, etc.).

Latency also works in your favor: you're not dependent on the network, so a text summary, a quick translation, or a bit of reasoning arrives as fast as the chip allows, wherever you are. Even on the subway with no signal or on a trip without data, you still have a functional assistant.

Furthermore, at scale, offloading work from the cloud reduces costs. Having millions of users send every query to a cluster of paid GPUs is not the same as moving some of those requests to NPUs that users have already paid for when buying the phone. That's why companies like Qualcomm, MediaTek, and Apple are pushing on-device AI so hard.

The price is paid elsewhere. Battery and temperature suffer if you overuse heavy models, the quality of the smaller models doesn't yet reach the level of GPT-4 or Gemini Ultra, and the experience can be inconsistent if the software is still in its early stages: crashes, models that won't load, frustratingly long times to the first token…

That's why many brands are betting on a hybrid model. Simple, quick tasks (basic translations, text correction, certain photo edits, shortcuts) are handled directly on the mobile device, while more complex requests or those that need more compute are sent to the cloud. This creates a seamless and private experience without giving up the capabilities of more powerful hardware when needed.
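
The routing decision behind a hybrid setup can be sketched in a few lines. Everything here is hypothetical: the `Request` fields, the `route` function, and the 2048-token limit are assumptions made for illustration, not how any vendor actually implements it.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int               # size of the input
    needs_large_model: bool = False  # e.g. complex reasoning or long generation
    sensitive: bool = False          # data the user does not want to leave the device


def route(req, online, local_context_limit=2048):
    """Decide where a request runs: 'local' (NPU/GPU/CPU) or 'cloud'."""
    if req.sensitive or not online:
        return "local"               # privacy and offline use always stay on-device
    if req.needs_large_model or req.prompt_tokens > local_context_limit:
        return "cloud"               # beyond what the small local model handles well
    return "local"                   # default: fast, private, no network round trip


print(route(Request(prompt_tokens=120), online=True))                          # local
print(route(Request(prompt_tokens=120, needs_large_model=True), online=True))  # cloud
print(route(Request(prompt_tokens=5000, sensitive=True), online=True))         # local
```

Putting the privacy and offline checks first is a deliberate design choice in this sketch: it guarantees that sensitive data never leaves the device, even if that means a slower or lower-quality answer.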

Ultimately, the NPU's role is to make all of this work: without a highly efficient AI core in the SoC, local AI would be an occasional luxury that would drain the battery in minutes. With a mature NPU and good software, it becomes a seamless feature working in the background on your phone, computer, or car while you simply see everything respond faster and more intelligently.

Given this scenario, the picture is clear: AI no longer lives only in the cloud or on the servers of large technology companies, but is landing directly in your pocket and on your desk. The mobile SoC's NPU is not just for show: it's the silent engine that makes local AI reasonably fast, useful, and private, although we still need a leap in software and ecosystem so that anyone can get the most out of it without racking their brains or settling for 4 tokens per second.

