SmolLM 135M Instruct
A lightweight instruction fine-tuned language model optimized for mobile deployment
Downloads 131
Release Date: 4/30/2025
Model Overview
This model is a variant of HuggingFaceTB/SmolLM-135M-Instruct, packaged to run efficiently on Android devices via the LiteRT framework and the MediaPipe LLM Inference API.
Model Features
Mobile optimization
Optimized for deployment on Android, iOS, and Web platforms
Quantization support
Provides dynamic_int8 and dynamic_int4 quantized variants that significantly reduce model size
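The savings from these quantization levels can be sketched with back-of-the-envelope arithmetic. The figures below cover weight storage only (real file sizes also include metadata and any non-quantized tensors, so treat them as rough estimates):

```python
# Rough weight-storage estimate for a 135M-parameter model at
# different precisions. Actual exported model sizes will differ
# somewhat, since not every tensor is quantized.

PARAMS = 135_000_000

def size_mb(bits_per_weight: float) -> float:
    """Approximate weight storage in megabytes."""
    return PARAMS * bits_per_weight / 8 / 1e6

fp32 = size_mb(32)  # ~540 MB
int8 = size_mb(8)   # ~135 MB (4x smaller than fp32)
int4 = size_mb(4)   # ~67.5 MB (8x smaller than fp32)

print(f"fp32: {fp32:.0f} MB, int8: {int8:.0f} MB, int4: {int4:.1f} MB")
```

This is why int4 variants are attractive for phones: the weight payload drops to roughly an eighth of the fp32 size, at some cost in output quality.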
Efficient inference
Accelerated by the LiteRT XNNPACK delegate, with support for 4-thread CPU inference
Low memory usage
Quantized variants use significantly less memory, making the model suitable for memory-constrained mobile devices
Model Capabilities
Instruction following
Text generation
Mobile inference
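Instruction following depends on formatting requests with the model's chat template. The sketch below builds a ChatML-style prompt by hand; the exact special tokens are an assumption here (in practice, check the model's tokenizer configuration or use the tokenizer's `apply_chat_template` method rather than hand-rolling the string):

```python
# Minimal ChatML-style prompt builder (assumed template -- verify
# against the model's tokenizer_config.json before relying on it).

def build_prompt(messages: list) -> str:
    """Render a list of {'role', 'content'} dicts into one prompt string."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Open the assistant turn so the model continues from here.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_prompt([
    {"role": "user", "content": "Summarize what LiteRT is in one sentence."},
])
print(prompt)
```

Getting this template wrong is a common cause of poor instruction following with small instruct-tuned models, so prefer the tokenizer-provided template whenever it is available.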
Use Cases
Mobile applications
Device-side chat assistant
Deploy a fully on-device chat application on Android devices
The quantized versions significantly reduce resource usage while maintaining response quality
© 2025 AIbase