
Apple Unveils M4 Neural Engine Training Capabilities in Third Inside Look

Published by
SectorHQ Editorial

Photo by Growtika (unsplash.com/@growtika) on Unsplash

Maderix reports 19 TFLOPS of FP16 throughput at just 2.8 W (6.6 TFLOPS/W) on Apple's M4 Neural Engine, and now demonstrates the first full forward-and-backward transformer training run on the ANE without Core ML.

Key Facts

  • Key company: Apple

Apple’s M4 Neural Engine (ANE) can now run full‑stack transformer training without relying on Core ML, according to a detailed walkthrough published by Maderix. The team first exposed the ANE’s private compile‑load‑evaluate API chain in Part 1, then measured raw throughput in Part 2 (19 TFLOPS FP16 at 2.8 W, 6.6 TFLOPS/W). In Part 3 they pushed the silicon beyond inference, executing a forward pass, backward pass, gradient calculation and Adam updates for a 109‑million‑parameter model built from scratch on hardware originally designed for inference‑only workloads.
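The hybrid shape of such a training step, forward matmuls dispatched to the accelerator while the weight gradient and Adam update stay on the host, can be sketched in miniature. This is an illustrative NumPy toy (one linear layer, mean-squared-error loss), not Maderix's code: `np.matmul` stands in for the ANE dispatch, and `adam_update` is a standard Adam step written out by hand.

```python
import numpy as np

def adam_update(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """CPU-side Adam step: the in-place weight mutation the ANE cannot do."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)        # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)        # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy linear regression standing in for the 109 M-parameter model.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8))
w_true = rng.standard_normal((8, 1))
y = x @ w_true

w = np.zeros((8, 1))
m, v = np.zeros_like(w), np.zeros_like(w)
losses = []
for t in range(1, 201):
    pred = x @ w                     # forward matmul (ANE stand-in)
    err = pred - y
    losses.append(float((err ** 2).mean()))
    grad = 2 * x.T @ err / len(x)    # weight gradient: outer-product shape, kept on CPU
    w, m, v = adam_update(w, grad, m, v, t, lr=0.05)
```

In the real pipeline the forward and activation-gradient matmuls run as compiled ANE programs, while only the outer-product weight gradients and the optimizer state live on the CPU.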

The training pipeline hinges on the ANE’s ability to execute a compiled MIL program atomically, which imposes strict limits on control flow and in‑place mutation. To work around the lack of a causal attention mask, Maderix split the scaled‑dot‑product attention (SDPA) into three dispatches: Q × Kᵀ on the ANE, mask + softmax on the CPU, and the resulting scores × V back on the ANE. Weight gradients and the Adam optimizer remain on the CPU because they involve large outer‑product updates and in‑place weight mutation that the ANE cannot handle directly. This hybrid split allows a single training step for the 12‑layer “Stories110M” model (dim = 768, hidden = 2048, seq = 256) to consume roughly 1.7 GFLOP of matmuls, with the backward pass costing about twice the compute of the forward pass.
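The three-dispatch attention split can be sketched with NumPy standing in for the hardware: the two matmuls play the role of the ANE dispatches, and the mask + softmax between them is the CPU hop. The shapes here are tiny for illustration (the article's model uses dim = 768, seq = 256), and the dispatch boundaries are only marked by comments, not real device placement.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sdpa_three_dispatch(q, k, v):
    """Causal SDPA split into three dispatches, as described in the article:
    matmuls on the ANE, mask + softmax on the CPU."""
    seq, head_dim = q.shape[-2], q.shape[-1]
    # Dispatch 1 (ANE): Q x K^T, scaled
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    # Dispatch 2 (CPU): apply causal mask, then softmax
    mask = np.triu(np.full((seq, seq), -np.inf), k=1)
    probs = softmax(scores + mask)
    # Dispatch 3 (ANE): attention scores x V
    return probs @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((2, 8, 16)) for _ in range(3))
out = sdpa_three_dispatch(q, k, v)
```

A quick correctness check: with a causal mask, position 0 can attend only to itself, so the first output row must equal the first row of V.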

Maderix experimented with two distinct pipelines to manage the compile‑time overhead inherent in the ANE’s workflow. The first “static‑weights” pipeline baked every weight matrix into the MIL program as a compile‑time constant. Each Adam update forced a full recompilation of all kernels—60 per batch for the forward pass plus 12 for the backward SDPA, totaling 72 compilations per step. This approach achieved 106.7 ms per training step (≈1.6 TFLOPS combined) but quickly hit a resource leak in the ANE’s compile subsystem after roughly 119 compilations, forcing the process to checkpoint and restart after every ten steps. The second pipeline moved additional operations onto the ANE—specifically the 32 k‑output classifier, softmax, and RMSNorm backward kernels—via a C‑callable bridge API contributed by Vipul Divyanshu (PR #19). Offloading these kernels cut step time to 91.8 ms, a 14 % speedup, but increased the compile count to 86 per batch, aggravating the same leak issue.
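The headline numbers for the two pipelines check out arithmetically. A quick verification of the reported speedup and the per-step compile count of the static-weights pipeline:

```python
# Figures as reported in the article.
step_static_ms = 106.7     # static-weights pipeline, per training step
step_offload_ms = 91.8     # with classifier/softmax/RMSNorm-backward on the ANE

speedup_pct = (step_static_ms - step_offload_ms) / step_static_ms * 100
compiles_static = 60 + 12  # forward kernels + backward-SDPA kernels per batch

print(f"{speedup_pct:.1f}% faster, {compiles_static} compiles per step")
```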

Scaling the experiment, the team trained a larger Qwen‑3‑0.6B model (596 M parameters, grouped‑query attention) on the same hardware. Each scaling iteration exposed a new bottleneck: the first was limited by compile latency, the second by CPU‑side mask + softmax throughput, and the third by memory pressure at the ANE’s 32 MB SRAM “cliff.” Despite these constraints, the M4 ANE sustained a forward‑and‑backward throughput of roughly 6.6 TFLOPS/W, confirming the efficiency numbers reported in Part 2. The authors also note that the ANE’s native SDPA op silently ignores the causal mask argument, a quirk that forced the CPU‑side mask + softmax dispatch and underscores the need for CPU assistance in certain attention patterns.
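A back-of-envelope estimate makes both the 596 M figure and the SRAM pressure plausible. The configuration values below are assumptions taken from the public Qwen3-0.6B release (hidden size 1024, 28 layers, 16 query / 8 key-value heads of dim 128, FFN width 3072, 151,936-token vocabulary with tied embeddings), not from the article itself:

```python
# Assumed Qwen3-0.6B configuration (public model release; not from the article).
hidden, layers, ffn = 1024, 28, 3072
q_heads, kv_heads, head_dim = 16, 8, 128
vocab = 151_936                       # embedding tied with the output head

attn = (hidden * q_heads * head_dim * 2      # Q and O projections
        + hidden * kv_heads * head_dim * 2)  # K and V projections (GQA)
mlp = hidden * ffn * 3                # gate, up, down (SwiGLU)
per_layer = attn + mlp                # norms omitted, negligible
total = layers * per_layer + vocab * hidden

print(f"total ~= {total / 1e6:.0f} M params")
print(f"one decoder layer in FP16: {per_layer * 2 / 2**20:.1f} MiB vs 32 MiB SRAM")
```

A single decoder layer's FP16 weights alone land around 30 MiB, right at the edge of the 32 MB on-chip SRAM the article identifies as the third bottleneck.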

The findings suggest that Apple’s inference‑focused silicon can be repurposed for modest‑scale training workloads, provided developers accept a hybrid CPU‑ANE execution model and manage the compile‑time overhead. Maderix’s work demonstrates that, with careful pipeline design, the M4 ANE can deliver training performance comparable to low‑power GPUs while consuming under 3 W. However, the reliance on frequent recompilation and the need to offload key ops to the CPU indicate that the ANE is not yet a drop‑in replacement for dedicated training accelerators. Future firmware updates that address the compile‑resource leak could unlock more aggressive on‑chip training pipelines, potentially expanding the ANE’s role in on‑device AI development.

