At its core, AI inference refers to using a pre-trained model to perform forward-pass computations on new, previously unseen data in order to generate prediction results. This process can be analyzed in depth from several key perspectives:
Unlike training, which requires extensive backpropagation and gradient calculations (often involving high-precision formats such as FP32, FP64, or mixed precision), inference focuses exclusively on forward propagation. This leads to several important characteristics: the computation graph is fixed and known before deployment, no gradients, optimizer states, or intermediate activations need to be retained, and numerical precision requirements are far less strict.
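To make the forward-only nature concrete, here is a minimal PyTorch sketch; the torchvision ResNet-18 and the random input tensor are merely placeholders for a deployed model and real data:

```python
import torch
from torchvision import models

# Load a pre-trained model and switch to evaluation mode
# (disables dropout, uses stored BatchNorm statistics).
model = models.resnet18(weights="IMAGENET1K_V1")
model.eval()

# Stand-in for new, previously unseen data: one 224x224 RGB image.
batch = torch.randn(1, 3, 224, 224)

# inference_mode() disables gradient tracking entirely: no autograd
# graph is built and no intermediate activations are retained.
with torch.inference_mode():
    logits = model(batch)

prediction = logits.argmax(dim=1)  # the forward-pass prediction result
```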
These properties allow hardware to be optimized to an extreme degree, for example:
Operator Fusion
Multiple consecutive neural network layers (such as Conv–BN–ReLU) can be fused into a single, highly optimized kernel, significantly reducing off-chip memory accesses. Since memory access is one of the primary bottlenecks for inference latency and power consumption, operator fusion delivers substantial performance and efficiency gains.
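As a rough illustration (the real fusion happens inside a vendor's compiler and hardware, but the arithmetic is the same), PyTorch's quantization utilities can fold a Conv–BN–ReLU sequence into a single module:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import fuse_modules

# A toy block with the classic Conv -> BN -> ReLU pattern.
class ConvBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(16)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

model = ConvBlock().eval()  # fusion requires eval mode

# Fold the BatchNorm statistics into the conv weights and attach ReLU,
# so the three layers run as one operator and the intermediate results
# never travel back out to off-chip memory.
fused = fuse_modules(model, [["conv", "bn", "relu"]])

x = torch.randn(1, 3, 32, 32)
assert torch.allclose(model(x), fused(x), atol=1e-5)  # same result, one kernel
```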

Static Scheduling
Because the computation graph is known before deployment, the compiler can predefine the optimal execution schedule, eliminating runtime dynamic scheduling overhead entirely.
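A small software sketch of the same idea, using TorchScript tracing as a stand-in for a vendor's ahead-of-time compiler:

```python
import torch
import torch.nn as nn

# A small model whose computation graph is fully known before deployment.
model = nn.Sequential(
    nn.Linear(256, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
).eval()

example_input = torch.randn(1, 256)

# Tracing records a single, fixed forward-pass graph; freezing inlines
# the weights. A deployment compiler (TensorRT, an ASIC toolchain, etc.)
# can then decide operator order and memory placement ahead of time
# instead of scheduling work dynamically at runtime.
frozen = torch.jit.freeze(torch.jit.trace(model, example_input))

print(frozen.graph)  # the static execution graph handed to the backend
```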
Lower Numerical Precision
Training requires high numerical precision to ensure the stability and correctness of gradient descent. Inference, by contrast, is far more tolerant of lower precision. As a result, INT8, INT4, and even binary or ternary networks are widely adopted in inference workloads.
Lower precision delivers two major advantages: it shrinks memory footprint and bandwidth requirements, and it raises compute throughput and energy efficiency for the same silicon area.
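A minimal PyTorch sketch of post-training INT8 quantization (dynamic quantization of the linear layers; production deployments would typically use calibration-based static quantization or the chip vendor's own toolchain):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

model = nn.Sequential(
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Linear(1024, 10),
).eval()

# Convert the Linear weights from FP32 to INT8; activations are
# quantized on the fly during the forward pass.
int8_model = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(4, 1024)
fp32_out = model(x)
int8_out = int8_model(x)

# Weights take roughly 4x less memory and bandwidth, while the outputs
# stay close to the FP32 reference.
print((fp32_out - int8_out).abs().max().item())
```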
From an application perspective, AI chips can be broadly divided into two categories: training chips, whose design priorities are compute density and interconnect bandwidth, and inference chips, whose design priorities are power efficiency and on-chip memory capacity.
From an architectural perspective, AI chips can also be divided into general-purpose GPUs and application-specific integrated circuits (ASICs).
(For a detailed comparison, see my previous article “A Comparison of GPU and ASIC Strengths and Weaknesses in AI”.)
Beyond use case (training vs. inference) and chip type (GPU vs. ASIC), AI chips can also be classified by their compute architecture:
Represented by NVIDIA GPUs, SIMT (Single Instruction, Multiple Threads) architectures excel at highly parallel, homogeneous workloads. Their strong programmability and general-purpose flexibility form the foundation of their dominance in AI training.
Represented by Google TPU, Domain-Specific Architectures customize hardware units for specific computation patterns such as matrix multiplication and convolution. While extremely efficient, they are less flexible than GPUs.
Represented by Graphcore IPU, dataflow architectures map the entire computation graph onto the chip, allowing data to flow directly between processing elements. This greatly reduces global memory access and is especially well suited for graph-based models.
The AI training chip market has very few serious competitors. NVIDIA alone holds over 90% market share, with its Blackwell architecture supporting training of models with up to 1.8 trillion parameters, and fifth-generation NVLink enabling seamless interconnection of 72-GPU clusters.
Huawei and AMD are the only vendors besides NVIDIA with a meaningful presence in the AI training market, but their market shares are on an entirely different scale from NVIDIA's and are not directly comparable.
Intel’s Gaudi series has almost no visibility in the market, with a share of less than 1%.
The MI300X is close to—or even exceeds—the H100 in certain hardware metrics, but the ROCm software ecosystem remains its most significant weakness.

The Ascend 910, paired with the CANN software stack, has formed a viable alternative in the domestic Chinese market under policy support. However, it faces challenges globally in terms of software ecosystem maturity and access to advanced manufacturing nodes.
Gaudi 3 emphasizes cost-effectiveness, but still lags behind leading players in absolute performance and ecosystem maturity.
AI inference inherently involves vendor-specific algorithms, making customization essential. Customized chips are, by definition, ASICs, which is why AI inference chips are predominantly ASIC-based.
According to Verified Market Research, the global AI inference chip market was valued at USD 15.8 billion in 2023 and is expected to reach USD 90.6 billion by 2030, with a compound annual growth rate (CAGR) of 22.6% from 2024 to 2030.
As discussed earlier, AI inference requires customization to fully exploit each vendor’s proprietary algorithms and unique functional requirements. Only ASICs can deliver this level of specialization.
This explains why, beyond purchasing large volumes of general-purpose GPUs, major companies still invest in developing their own inference ASICs to achieve the exact inference capabilities they need.
“Fixed-functionality” is the core advantage of ASICs. By tailoring the hardware architecture to a single task, inference logic and data paths can be permanently embedded into the chip.
All unnecessary general-purpose components—such as dynamic scheduling units and generic memory controllers found in GPUs—are removed, allowing 100% of hardware resources to serve inference workloads.
Inference workloads are far more sensitive to energy efficiency (performance per watt) and cost than training workloads, and ASICs offer overwhelming advantages in both areas.
A large model may require only tens to hundreds of training chips, but inference at scale can demand tens or even hundreds of thousands of chips. For example, ChatGPT’s inference cluster is more than 10× larger than its training cluster. This makes ASIC customization highly effective at reducing per-chip cost.
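The underlying economics can be sketched with a tiny amortization calculation; all of the dollar figures below are made-up placeholders chosen only to show the shape of the trade-off, not actual prices:

```python
# Illustrative back-of-the-envelope amortization with hypothetical numbers.
asic_nre = 150e6          # one-time design/tape-out cost (assumed)
asic_unit_cost = 3_000    # per-chip manufacturing cost (assumed)
gpu_unit_cost = 25_000    # per-unit price of a general-purpose GPU (assumed)

for volume in (1_000, 10_000, 100_000, 500_000):
    asic_per_chip = asic_nre / volume + asic_unit_cost
    print(f"{volume:>7,} chips: ASIC ≈ ${asic_per_chip:,.0f}/chip "
          f"vs GPU ${gpu_unit_cost:,}/chip")

# At training-scale volumes (thousands of chips) the one-time NRE dominates;
# at inference-scale volumes (hundreds of thousands) it amortizes away and
# the lower per-unit cost wins.
```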
ASIC design cycles typically span 1–2 years, while AI models evolve at an extremely rapid pace. For example, large language models progressed from GPT-3 to GPT-4 in just one year.
If an ASIC is designed around a model architecture that becomes obsolete (such as CNNs being displaced by Transformers), the chip may effectively become unusable.
For the same reasons, ASICs are relatively weak for training workloads. Training requires flexibility and rapid algorithm iteration, and using ASICs for training introduces a high risk of obsolescence, resulting in poor cost-effectiveness.
Virtually every major global technology company you are familiar with—including Apple, Amazon, Google, Meta, Microsoft, Huawei, Tencent, ByteDance, Alibaba, and OpenAI—has already deployed, is deploying, or has commissioned the development of inference chips.
Most AI giants are software companies and do not maintain large internal chip design teams. As a result, ASIC design is typically outsourced.
Currently, Broadcom leads the ASIC market with 55%–60% share, followed by Marvell with 13%–15% share.
Below is a list of well-known inference chips that have already been deployed (excluding those still under development):
| Company | Representative Products | Architecture | Application Scenarios |
| --- | --- | --- | --- |
| Google | TPU v6 series | ASIC | Cloud inference, training |
| Amazon | Inferentia, Trainium | ASIC | Inferentia for inference, Trainium for training |
| Microsoft | Maia 100 series | ASIC | Cloud inference, training |
| Meta | MTIA series | ASIC | Cloud inference, training |
| Huawei HiSilicon | Ascend 910 series | ASIC | Cloud inference, training |
| Cambricon | Siyuan 590 series | ASIC | Cloud inference, training |
| Other vendors | — | — | — |
Please note that NVIDIA, AMD, and Intel AI chips can also be used for inference, but their strengths are far more pronounced in training scenarios.
In addition, several smaller startups—such as SambaNova, Cerebras Systems, Graphcore, Groq, Tenstorrent, Hailo, Mythic, and KAIST’s C-Transformer—have introduced AI chips capable of inference. However, shipment volumes remain relatively small and are not comparable to inference chips designed in-house by major technology companies.