ASIC Inference Chip Runs Llama 3.1 8B at 16,000 tok/s — The GPU-Free AI Inference Era
Startup Taalas achieves 16,000 tok/s on Llama 3.1 8B using custom ASIC chips without GPUs. We analyze the shift away from GPU dependency and the inference cost revolution.
Overview
AI inference cost and speed have long depended on GPU hardware. Now startup Taalas has achieved 16,000 tok/s on Llama 3.1 8B using custom ASIC chips, and is offering free access through a demo and an API. The announcement garnered 77 points and 70+ comments on Reddit's r/LocalLLaMA.
Achieving this speed without GPUs signals a paradigm shift in AI inference infrastructure.
What Is Taalas and Their ASIC Inference Chip?
The Limitations of GPU Inference
Current LLM inference relies heavily on NVIDIA GPUs (A100, H100, etc.). The problems are clear:
- High cost: A single H100 costs over $30,000
- High power consumption: GPU clusters consume hundreds of kilowatts
- Complex infrastructure: Requires liquid cooling, HBM stacks, high-speed I/O
- Inefficiency of general-purpose design: GPUs were designed for graphics processing
Taalas’s Approach: Total Specialization
Founded 2.5 years ago, Taalas developed a platform for creating model-specific custom silicon. Their three core principles:
- Total Specialization: Produce optimal silicon for each individual AI model
- Merged Storage and Computation: Unify memory and compute on a single chip at DRAM-level density
- Radical Simplification: No HBM, advanced packaging, 3D stacking, or liquid cooling needed
```mermaid
graph LR
A[Receive AI Model] --> B[Custom Silicon Design]
B --> C[ASIC Manufacturing]
C --> D[16,000 tok/s Inference]
style D fill:#00E5FF,color:#000
```
They claim the process takes just two months from model receipt to hardware realization.
Performance Comparison: GPU vs ASIC
| Metric | GPU (H100) | Taalas ASIC |
|---|---|---|
| Llama 3.1 8B Speed | ~1,500-2,000 tok/s | 16,000+ tok/s |
| Speed Multiplier | 1x | ~10x |
| Power Efficiency | Low (700W/chip) | High (significantly reduced) |
| Cooling Method | Liquid cooling required | Air cooling possible |
| Infrastructure Complexity | High | Low |
A ~10x speed improvement over conventional GPUs, with dramatically simpler infrastructure.
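To make the table concrete, here is a quick sketch of what these rates mean for response latency. The 1,750 tok/s figure is the midpoint of the H100 range above; the 2,000-token response length is an illustrative assumption.

```python
# Rough latency comparison at the rates from the table above.
# All figures are approximate; rates assume steady-state generation.

def generation_time(tokens: int, rate_tok_s: float) -> float:
    """Wall-clock seconds to generate `tokens` at a steady rate."""
    return tokens / rate_tok_s

response_tokens = 2_000  # illustrative long-form answer

gpu_s = generation_time(response_tokens, 1_750)    # midpoint of H100 range
asic_s = generation_time(response_tokens, 16_000)  # Taalas's claimed rate

print(f"GPU:  {gpu_s:.2f} s")             # ≈ 1.14 s
print(f"ASIC: {asic_s:.3f} s")            # ≈ 0.125 s
print(f"Speedup: {gpu_s / asic_s:.1f}x")  # ≈ 9.1x
```

At these rates, a long answer that takes over a second on a GPU returns in roughly an eighth of a second, which is why the demo feels qualitatively different rather than just faster.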
The Trend Away from GPU Dependency
This movement isn’t unique to Taalas. GPU alternatives are emerging rapidly in the AI inference hardware market:
- Groq: Ultra-fast inference with LPU (Language Processing Unit)
- Cerebras: Wafer-scale chips for large model processing
- Etched: Transformer-specific ASIC development
- Taalas: Model-specific custom ASICs
```mermaid
graph TD
GPU[GPU-Centric Era] --> |Cost/Speed Limits| Alt[Alternative Hardware Emerges]
Alt --> Groq[Groq LPU]
Alt --> Cerebras[Cerebras WSE]
Alt --> Etched[Etched Sohu]
Alt --> Taalas[Taalas ASIC]
Taalas --> Future[Model-Specific Custom Silicon Era]
style Future fill:#FF6D00,color:#fff
```
Taalas CEO Ljubisa Bajic draws an analogy to the transition from ENIAC to transistors, emphasizing that AI must evolve to become “easy to build, fast, and cheap.”
The Inference Cost Revolution
Current Cost Structure
Most LLM inference costs come from hardware and power:
- GPU hardware: 40-50%
- Power and cooling: 20-30%
- Network/storage: 10-15%
- Personnel/operations: 10-15%
How ASICs Will Transform Costs
As ASIC-specific chips become mainstream:
- Dramatic hardware cost reduction: No HBM or advanced packaging needed
- Plummeting power costs: 10x+ efficiency improvement
- Infrastructure simplification: Reduced data center complexity
- Cost per token drops to 1/10 or less
This means price disruption for current per-API-call pricing models. If inference cost falls toward zero, AI adoption could expand explosively.
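A back-of-the-envelope check shows that much of the "1/10 or less" claim follows from throughput alone: cost per token is cost per device-hour divided by tokens per hour. The hourly rates below are illustrative assumptions, not Taalas or NVIDIA figures.

```python
# Per-token cost = (cost per device-hour) / (tokens generated per hour).
# Device-hour prices here are illustrative assumptions, not quotes.

def cost_per_million_tokens(device_hour_usd: float, rate_tok_s: float) -> float:
    """USD per 1M generated tokens for one device running flat-out."""
    tokens_per_hour = rate_tok_s * 3600
    return device_hour_usd / tokens_per_hour * 1_000_000

gpu = cost_per_million_tokens(2.50, 1_750)    # assumed GPU rental, USD/hr
asic = cost_per_million_tokens(2.50, 16_000)  # same hourly cost assumed

print(f"GPU:  ${gpu:.3f} per 1M tokens")   # ≈ $0.397
print(f"ASIC: ${asic:.4f} per 1M tokens")  # ≈ $0.0434
print(f"Reduction: {gpu / asic:.1f}x")     # ≈ 9.1x
```

Even holding the device-hour cost equal, the ~9x throughput gap alone delivers nearly the claimed 10x; if the ASIC is also cheaper to build and power, as the article argues, the reduction would be larger.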
Limitations and Caveats
There are important caveats at this stage:
- Model limitation: Currently supports only Llama 3.1 8B (a small model)
- Lack of flexibility: Model changes require new chips
- Unproven at scale: Large-scale commercialization still needs time
- No large model support: 70B, 405B models are still on the roadmap
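A rough sizing sketch helps explain why larger models are harder for a weights-on-chip design: the entire parameter set must fit in on-die memory. The 4-bit weight width below is an illustrative assumption; Taalas's actual quantization is not stated here.

```python
# Approximate on-chip storage needed just for model weights.
# Bits-per-weight is an illustrative assumption, not a Taalas spec.

def weight_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for name, params in [("Llama 3.1 8B", 8), ("70B", 70), ("405B", 405)]:
    print(f"{name}: ~{weight_gb(params, 4):.1f} GB at 4-bit")
# 8B  -> ~4.0 GB
# 70B -> ~35.0 GB
# 405B -> ~202.5 GB
```

An 8B model at 4-bit is a few gigabytes, plausibly within reach of DRAM-density on-chip memory; 70B and 405B are an order of magnitude or two larger, which is consistent with those models being "still on the roadmap."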
The Reddit community was divided between “8B is too small” and “it’s sufficient as a proof of concept.”
Try It Yourself
Taalas currently offers two free options:
- Chatbot demo: Experience 16,000 tok/s speed firsthand at ChatJimmy
- Inference API: Apply for free access via the API request form
As Reddit users noted, the sheer speed is an experience worth trying.
Conclusion
Taalas’s ASIC inference chip is a significant milestone showing the future of AI inference hardware. While currently limited to an 8B model, if this technology scales to larger models, it could fundamentally transform GPU-dependent AI infrastructure.
Key takeaways:
- 10x+ inference speed compared to GPUs
- Dramatic reduction in power, cooling, and infrastructure costs
- A new paradigm of model-specific custom silicon
- The potential for fundamental changes in inference cost structures
For AI to become truly ubiquitous, inference infrastructure must be democratized first. ASIC-specific chips mark the beginning of that journey.