Meta Releases Quantized Llama 3.2 With Up to 4x Faster Inference on Android Phones

Meta has introduced quantized versions of its Llama 3.2 models that improve on-device AI performance: inference runs up to four times faster, model size shrinks by 56% on average, and memory usage drops by 41% on average.

These models are built to run efficiently on mobile devices and are available from both Meta and Hugging Face. Meta worked with Arm, MediaTek, and Qualcomm to broaden deployment across mobile CPUs.

The quantized 1B and 3B Llama models are designed to match the quality and safety standards of the original versions while running 2 to 4 times faster. Compared with the original BF16 format, they reduce model size by 56% and memory usage by 41% on average.
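To put those percentages in concrete terms, here is a minimal arithmetic sketch. The baseline figures (a round 1 billion parameters stored as 2-byte BF16 values, 3,000 MB of runtime memory, 100 ms per token) are hypothetical placeholders for illustration, not Meta's measurements; only the 56%, 41%, and 2-4x figures come from the announcement.

```python
def after_reduction(baseline: float, reduction_pct: float) -> float:
    """Return what remains of `baseline` after a percentage reduction."""
    return baseline * (100 - reduction_pct) / 100


def latency_after_speedup(baseline_ms: float, speedup: float) -> float:
    """Return per-token latency after an N-times speedup."""
    return baseline_ms / speedup


# Hypothetical baseline: 1 billion parameters at 2 bytes each (BF16).
bf16_size_mb = 1_000_000_000 * 2 / 1_000_000   # 2000 MB on disk
baseline_memory_mb = 3000.0                    # assumed runtime memory
baseline_latency_ms = 100.0                    # assumed time per token

# Reported averages: 56% smaller model, 41% less memory, 2-4x faster.
quantized_size_mb = after_reduction(bf16_size_mb, 56)
quantized_memory_mb = after_reduction(baseline_memory_mb, 41)
slowest_ms = latency_after_speedup(baseline_latency_ms, 2)
fastest_ms = latency_after_speedup(baseline_latency_ms, 4)

print(f"Model size: {bf16_size_mb:.0f} MB -> {quantized_size_mb:.0f} MB")
print(f"Memory use: {baseline_memory_mb:.0f} MB -> {quantized_memory_mb:.0f} MB")
print(f"Latency:    {baseline_latency_ms:.0f} ms -> {fastest_ms:.0f}-{slowest_ms:.0f} ms per token")
```

Because the reported reductions are averages across models and devices, the results for any particular phone or model size may differ from this back-of-the-envelope estimate.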

The Llama Stack reference implementation, through PyTorch’s ExecuTorch framework, supports inference for both quantization techniques Meta used: Quantization-Aware Training with LoRA adaptors (QLoRA) and SpinQuant. Developed in partnership with industry leaders, these optimized models are now available for Qualcomm and MediaTek …
