IQ*_K/IQ*_KS Quantization Merged into llama.cpp — ik_llama.cpp Contributions Go Mainline

IQ-series quantization methods developed in ik_llama.cpp are being merged into llama.cpp mainline. Learn about IQ2_K through IQ4_KS precision improvements and local LLM inference optimization.

Overview

llama.cpp’s quantization methods have reached a major turning point: the IQ_K / IQ_KS series quantization, developed independently in ik_llama.cpp (a fork of llama.cpp), is being merged into the llama.cpp mainline through PR #19726. The announcement gathered 125 points on Reddit’s r/LocalLLaMA, a sign of how much attention it has drawn in the local LLM community.

This article covers the technical background of IQ-series quantization, how it differs from existing methods, and its impact on local LLM inference.

What Is IQ Quantization?

Limitations of Conventional Quantization

In llama.cpp, k-quant series quantization like Q4_K_M and Q5_K_S has been the mainstream approach. These use uniform quantization grids that don’t fully exploit the distribution characteristics of model weights.

The IQ Approach

IQ (Importance-aware Quantization) series quantization adopts non-uniform quantization based on weight importance. Specifically:

  • Lattice-based quantization: Uses information-theoretically optimal lattice points instead of uniform grids
  • Importance weighting: Adjusts quantization precision based on each weight’s contribution to the loss function
  • Block-level optimization: Computes optimal scale and offset for each weight block

```text
Traditional Q4_K:  Uniform 16-level quantization grid
IQ4_K:             Non-uniform lattice points adapted to weight distribution
Result:            Higher precision at the same bit count
```
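
The effect can be sketched numerically. The toy below quantizes a bell-shaped sample of "weights" with a uniform 4-level grid versus a grid adapted to the distribution (quantile-cell means — a deliberately simplified stand-in for the actual lattice construction, not llama.cpp code):

```python
import random

def quantize(values, levels):
    # map each value to the nearest representative level
    return [min(levels, key=lambda l: abs(x - l)) for x in values]

def mse(values, recon):
    return sum((x - r) ** 2 for x, r in zip(values, recon)) / len(values)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(2000)]  # bell-shaped sample

# uniform 4-level grid spanning the observed range (toy stand-in for Q*-style grids)
lo, hi = min(weights), max(weights)
uniform = [lo + i * (hi - lo) / 3 for i in range(4)]

# distribution-adapted 4-level grid: the mean of each quartile cell,
# so levels crowd where values are frequent and thin out in the tails
w = sorted(weights)
n = len(w)
adapted = [sum(w[i * n // 4:(i + 1) * n // 4]) / (n // 4) for i in range(4)]

err_uniform = mse(weights, quantize(weights, uniform))
err_adapted = mse(weights, quantize(weights, adapted))
# the adapted grid gives lower reconstruction error at the same level count
```

The adapted grid wins because the uniform grid spends two of its four levels near the rarely occupied extremes of the range.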

What’s in PR #19726

Ported Quantization Types

GitHub PR #19726 ports the following quantization types from ik_llama.cpp, contributed by AesSedai:

| Quant Type | Bits/Weight | Use Case |
|------------|-------------|----------|
| IQ2_K  | ~2.5 bpw | Ultra-low bit, memory-constrained environments |
| IQ2_KS | ~2.5 bpw | IQ2_K variant for smaller models |
| IQ3_K  | ~3.5 bpw | Balanced, optimal for many use cases |
| IQ3_KS | ~3.5 bpw | IQ3_K variant for smaller models |
| IQ4_K  | ~4.5 bpw | High precision when sufficient memory available |
| IQ4_KS | ~4.5 bpw | IQ4_K variant for smaller models |
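
The bits-per-weight figures translate directly into memory estimates (weight size ≈ parameters × bpw / 8). The helper below is purely illustrative — `best_fit` and the `IQ_BPW` mapping are hypothetical, with bpw values taken from the approximate figures above, not from llama.cpp itself:

```python
# Approximate bpw figures from the table above (illustrative only)
IQ_BPW = {"IQ2_K": 2.5, "IQ2_KS": 2.5, "IQ3_K": 3.5,
          "IQ3_KS": 3.5, "IQ4_K": 4.5, "IQ4_KS": 4.5}

def weight_size_gb(n_params, bpw):
    # quantized weight size ~= parameter count * bits-per-weight / 8 bytes
    return n_params * bpw / 8 / 1e9

def best_fit(n_params, budget_gb):
    # hypothetical helper: highest-bpw type whose weights fit the budget
    fits = [t for t, b in IQ_BPW.items()
            if weight_size_gb(n_params, b) <= budget_gb]
    return max(fits, key=lambda t: (IQ_BPW[t], -len(t))) if fits else None

# e.g. a 7B model at ~3.5 bpw needs roughly 3.06 GB for the weights alone
```

Note this ignores KV cache, activations, and runtime overhead, which also compete for the same memory budget.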

The ik_llama.cpp Relationship

The backstory of this PR is noteworthy. Iwan Kawrakow, the developer of ik_llama.cpp, clarified on the PR:

  • The port in its current form (with copyright attribution) is perfectly fine
  • Proper credit attribution in the spirit of the MIT license is important
  • It should be recognized as a “copy” rather than a “rewrite”

This represents a model example of contributing fork innovations back to the upstream project in the open source ecosystem.

Technical Deep Dive

How Lattice Quantization Works

The core of IQ-series quantization lies in lattice-based quantization.

```mermaid
graph TD
    A[Original Weight Matrix] --> B[Block Partitioning]
    B --> C[Importance Score Calculation]
    C --> D{Quant Type Selection}
    D -->|IQ2_K| E[2-bit Lattice Quantization]
    D -->|IQ3_K| F[3-bit Lattice Quantization]
    D -->|IQ4_K| G[4-bit Lattice Quantization]
    E --> H[Scale & Offset Optimization]
    F --> H
    G --> H
    H --> I[Quantized Weights]
```

In traditional k-quants, quantization grid points are evenly spaced. In IQ-series, lattice points are placed based on the probability distribution of weights, providing higher resolution in frequently occurring weight value ranges and lower resolution in rare ranges.
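
The per-block scale-and-offset optimization step can also be illustrated with a toy least-squares refit — a sketch of the general idea only, not llama.cpp's actual kernel. Integer codes are chosen on a naive min/max grid, then scale `s` and offset `z` are refit so that `w ≈ s*q + z` with minimal squared error:

```python
# Toy sketch of per-block scale/offset optimization (illustrative,
# not llama.cpp's actual implementation).
def fit_block(w, levels=16):
    lo, hi = min(w), max(w)
    step = (hi - lo) / (levels - 1) or 1.0
    q = [round((x - lo) / step) for x in w]   # integer codes 0..levels-1
    # least-squares refit of s and z for w ~= s*q + z
    n = len(w)
    mq, mw = sum(q) / n, sum(w) / n
    var_q = sum((c - mq) ** 2 for c in q)
    s = (sum((c - mq) * (x - mw) for c, x in zip(q, w)) / var_q) if var_q else 0.0
    z = mw - s * mq                            # offset from the two means
    err = sum((x - (s * c + z)) ** 2 for x, c in zip(w, q))
    return q, s, z, err

block = [0.11, -0.42, 0.07, 0.93, -0.15, 0.28, -0.61, 0.05]
codes, scale, offset, err = fit_block(block)
```

Because the naive `(step, lo)` pair is itself one affine map over the same codes, the least-squares refit can never reconstruct worse than the naive grid.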

K vs KS Variants

Each quantization type has two variants — K and KS:

  • K (Standard): Optimized for large models (7B+)
  • KS (Small): Parameters optimized for smaller models (3B and below)

Since weight distributions in smaller models differ from larger ones, KS variants have adjusted lattice placement and scaling.

Benchmark Comparison

Comparison between existing Q-series and IQ-series quantization (reference values):

| Quantization | Perplexity | Model Size | Inference Speed |
|--------------|------------|------------|-----------------|
| Q2_K   | Baseline | Baseline | Baseline |
| IQ2_K  | 5-10% better | Same | Same to slight decrease |
| Q3_K_M | Baseline | Baseline | Baseline |
| IQ3_K  | 3-7% better | Same | Same to slight decrease |
| Q4_K_M | Baseline | Baseline | Baseline |
| IQ4_K  | 2-5% better | Same | Same to slight decrease |

The biggest advantage is improved perplexity at the same bit count. The improvement is especially pronounced at low-bit quantization (2-3 bits).
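
Perplexity, the quality metric in the comparison, is the exponential of the average negative log-likelihood over the evaluation tokens — lower is better, so "better" in the table means lower:

```python
import math

def perplexity(token_logprobs):
    # perplexity = exp(-mean log p(token)); lower means the model is
    # less "surprised" by the evaluation text
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# a model assigning uniform probability over a 4-symbol vocabulary
# has perplexity exactly 4
uniform = [math.log(0.25)] * 100
```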

Impact on Local LLM Inference

Improved Memory Efficiency

The integration of IQ-series quantization brings benefits in these scenarios:

  • 8GB VRAM environments: with IQ3_K, 7B models that previously lost noticeable quality under Q3_K_M can run at higher fidelity
  • Apple Silicon Macs: run larger models at higher quality within unified memory constraints
  • Edge devices: IQ2_K/IQ2_KS makes LLM inference practical with just 2-3GB of memory

Quantization Ecosystem Evolution

```mermaid
graph LR
    A[ik_llama.cpp<br/>Research & Experiment] -->|Port Results| B[llama.cpp<br/>Mainline]
    B -->|Broad Adoption| C[Quantized Model<br/>Distribution]
    C -->|Feedback| A
    B -->|Integration| D[ollama / LM Studio<br/>End-user Tools]
```

Integration into llama.cpp mainline means propagation to end-user tools like ollama and LM Studio. Users will be able to use higher-quality quantized models without special configuration.

Practical Usage: IQ Quantization

After the merge is complete, you can use it as follows:

```bash
# Quantize a model (llama-quantize)
./llama-quantize model-f16.gguf model-iq3k.gguf IQ3_K

# Use KS variant for smaller models
./llama-quantize small-model-f16.gguf small-model-iq3ks.gguf IQ3_KS

# Run inference
./llama-cli -m model-iq3k.gguf -p "Hello, world"
```

Future Outlook

The mainline integration of IQ-series quantization is a significant milestone in the local LLM inference efficiency trend:

  1. Even lower-bit quantization: Potential for sub-2-bit research with IQ1_K series
  2. Model-specific optimization: Automatic tuning of quantization parameters per architecture
  3. Hardware optimization: IQ-series kernel optimization for ARM NEON, AVX-512, etc.

Conclusion

The integration of IQ_K/IQ_KS quantization from ik_llama.cpp to llama.cpp is an exemplary case of contributing fork innovations back to upstream in the open source ecosystem. This technology, achieving higher precision at the same bit count, significantly improves LLM inference quality in memory-constrained environments.

For local LLM users, the day when simply selecting IQ3_K or IQ4_K in llama-quantize yields higher quality models than existing Q-series quantization is fast approaching.


About the Author

Kim Jangwook

Full-Stack Developer specializing in AI/LLM

Building AI agent systems, LLM applications, and automation solutions with 10+ years of web development experience. Sharing practical insights on Claude Code, MCP, and RAG systems.