MLC LLM: A Deployment Engine for ML Compilation

What is MLC LLM?

MLC LLM (Machine Learning Compilation for Large Language Models) is an approach for running and optimizing large language models (LLMs) more efficiently. It uses compilation techniques and optimization methods to make language models faster and more cost-effective. MLC LLM is built on top of Apache TVM (Tensor Virtual Machine) and optimizes LLMs to run on a variety of platforms, such as browsers, mobile devices (Android and iOS), CPUs, and GPUs.

Figure: Workflow in MLC LLM

Apache TVM

Compiling models with MLC LLM requires the latest development branch of Apache TVM, known as TVM Unity. TVM Unity is a compiler stack designed to make large language models (LLMs) faster and more efficient. It improves speed and efficiency through hardware-specific graph and memory optimizations, and it further boosts performance with kernel fusion techniques.

Key features of TVM Unity include:

High-Performance CPU/GPU Code Generation: Provides instant high-performance code generation without the need for additional tuning or adjustments.
Support for Both Inference and Training: Supports both inference and training phases of the model.
Efficient Python-First Compiler Implementation: The compilation process can be entirely done in Python, offering a user-friendly API.

For example, let’s consider a simple MLP model. Let’s choose PyTorch FX as the frontend. PyTorch FX allows us to trace the execution graph of PyTorch-based models.

import torch.nn as nn

class MLPModel(nn.Module):
    def __init__(self):
        super(MLPModel, self).__init__()
        self.fc1 = nn.Linear(784, 256)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu1(x)
        x = self.fc2(x)
        return x

Unlike ONNX models, PyTorch models do not contain information about the shape and data type of the input data. Therefore, when providing the model to the compiler or converter, we need to manually define this information.

The input structure is defined as follows:

input_info = [((1, 784), "float32")]

Then, using the TVM Unity API, the PyTorch FX model is converted to a Relax model, as sketched below. This gives us a TVM IRModule, which is used in the subsequent transformation and optimization steps.
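A minimal sketch of this conversion, assuming a recent TVM Unity build that provides the `relax.frontend.torch.from_fx` importer:

import torch
import tvm
from tvm import relax
from tvm.relax.frontend.torch import from_fx

# Trace the PyTorch model with FX to capture its execution graph.
model = MLPModel()
fx_graph = torch.fx.symbolic_trace(model)

# Import the traced graph into a Relax IRModule, using the input
# shape and dtype information defined above.
mod = from_fx(fx_graph, input_info)
mod.show()  # print the resulting IRModule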

Some of the optimization passes are as follows (a combined sketch follows the list):

LegalizeOps: Lowers the high-level operations in the model into TensorIR functions that can be executed on the target hardware.
AnnotateTIROpPattern: Adds pattern tags to the TensorIR functions so that related operations can be grouped together, improving efficiency and speed.
FoldConstant: Pre-computes operations whose inputs are constant. If the model contains operations on constant values, they are evaluated before the model runs and the results are stored.
FuseOps and FuseTIR: Fuses sequences of operations in the model so that they can be executed as a single kernel.
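As a rough sketch, assuming `mod` is the Relax IRModule obtained from the conversion above and a recent TVM Unity build where these passes live under `tvm.relax.transform`, they can be applied as a pipeline:

import tvm
from tvm import relax

# Apply the optimization passes to the IRModule in sequence.
seq = tvm.transform.Sequential([
    relax.transform.LegalizeOps(),           # lower high-level ops to TensorIR functions
    relax.transform.AnnotateTIROpPattern(),  # tag TensorIR functions with fusion patterns
    relax.transform.FoldConstant(),          # pre-compute constant subexpressions
    relax.transform.FuseOps(),               # fuse compatible operator patterns
    relax.transform.FuseTIR(),               # merge fused groups into single TensorIR functions
])
mod = seq(mod)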

After the optimizations, we can compile the model into a TVM runtime module. Apache TVM Unity uses the Relax Virtual Machine to run the model.

target = tvm.target.Target("llvm")     # example target; choose the backend that matches your hardware
exec = relax.build(mod, target=target)  # compile the optimized IRModule into a runtime executable
dev = tvm.device(str(target.kind), 0)   # device corresponding to the chosen target
vm = relax.VirtualMachine(exec, dev)    # Relax Virtual Machine that runs the executable

Now we can run the model through the Relax Virtual Machine, for example as sketched below. The same workflow lets you compile any Hugging Face model to run on your own hardware. For more information, visit the MLC LLM site.
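A small usage sketch, assuming the module imported from PyTorch FX exposes the default `main` entry function and reusing `vm` and `dev` from the previous snippet:

import numpy as np

# Random input matching the (1, 784) float32 shape declared earlier.
data = tvm.nd.array(np.random.rand(1, 784).astype("float32"), dev)

# Invoke the module's entry function on the Relax Virtual Machine.
out = vm["main"](data)
print(out.numpy().shape)  # expected: (1, 10)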

How to run an LLM with MLC LLM?

To run a model with MLC LLM, you need to follow three steps: convert the weights, generate `mlc-chat-config.json`, and compile the model.

First, download an already-quantized model, or use the MLC LLM Python package to quantize and convert the weights and to generate the `mlc-chat-config.json` file (its `gen_config` step also processes the tokenizers). Then, install the TVM Unity compiler and compile the model. To specify the target hardware, pass the `--device` parameter with one of the following options:

  • cuda
  • metal
  • vulkan
  • iphone
  • android
  • webgpu

This way, LLM models can be run on different hardware platforms.
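Once a model has been converted and compiled, it can also be served from Python. Below is a minimal sketch based on the `MLCEngine` API of the MLC LLM Python package; the model identifier is only a placeholder for whichever converted model you use:

from mlc_llm import MLCEngine

# Placeholder identifier for a pre-converted MLC model (e.g. from the mlc-ai Hugging Face organization).
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# OpenAI-style chat completion with streaming output.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is MLC LLM?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print()

engine.terminate()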

Figure: Workflow in MLC LLM

MLC LLM is also integrated with an interface called WebLLM. WebLLM is a modern, user-friendly platform that lets users leverage the capabilities of MLC LLM directly in the browser, targeting a wide range of use cases for both developers and end users. The main features of WebLLM are:

In-Browser Inference: High-performance language model operations are performed directly in the browser with WebGPU support, eliminating the need for a server.
OpenAI API Compatibility: Provides full compatibility with JSON mode, function calling, and streaming support.
Wide Model Support: Supports models such as Llama, Phi, Gemma, RedPajama, Mistral, Qwen, and more.
Custom Model Integration: Easily integrate custom models in MLC format.
Easy Integration: Quick setup with NPM, Yarn, or CDN and easy connection with UI components.
Real-Time Output: Provides instant output for interactive applications with streaming chat completion.
Web Worker and Service Worker Support: Optimizes interface performance by performing computations in the background.

Conclusion

MLC LLM is a powerful tool for running large language models efficiently. By building on Apache TVM Unity, it offers high-performance model compilation and execution on various hardware. Its optimization techniques and broad hardware support make language models faster and more cost-effective to deploy. Additionally, with WebLLM integration, it offers high-performance inference directly in the browser. For more information and detailed guidance, visit the official MLC LLM documentation.