Dr. Long Cheng · Senior Architect at NVIDIA

LLM infrastructure, GPU kernels, and compiler optimization.

I build TensorRT-LLM infrastructure for NVIDIA's latest GPUs. My work spans sparse attention, lossless Top-K, and the use of LLMs for system and algorithm design and optimization, from GPU kernels to compiler workflows such as LLM4Compiler.

China · TensorRT-LLM · GPU kernels · LLM-native algorithms · ML systems · Compiler optimization
TensorRT-LLM · 1.88x avg kernel speedup · +9.3% end-to-end generation gain · LLM4Compiler / VecTrans / PASA

About

My work sits at the boundary between algorithms, systems, and hardware. I focus on building efficient AI infrastructure that turns research ideas into production-quality implementations with measurable impact.

Before NVIDIA, I worked at Huawei Beijing Research Center on low-precision LLM operators, heterogeneous acceleration, and LLM-native algorithm and compiler workflows. My background also includes high-performance scientific computing, numerical methods, and mixed-precision algorithms, developed through research at Beihang University and the University of Cambridge.

TensorRT-LLM · Sparse attention · LLM-native algorithms · GPU kernel optimization · Mixed precision · LLM4Compiler · ARM NEON/SVE · HPC

Experience

Industry roles and research engineering experience.

Nov 2025 - Present · China

NVIDIA

Senior Architect, Compute DL Architecture

Building TensorRT-LLM inference infrastructure for the latest NVIDIA GPUs, with a focus on sparse attention, LLM-native algorithm design, GPU kernel optimization, long-context decoding, and system-algorithm co-design.

Apr 2021 - Nov 2025 · Beijing, China

Huawei Beijing Research Center

Researcher to Principal Research Engineer, Compilers and Programming Languages Lab

Led work on low-precision LLM operators, heterogeneous acceleration, and LLM-native compiler workflows, including PASA, LLM4Compiler, VecTrans, and mixed-precision solver optimization across Ascend and Kunpeng.

Jan 2020 - Apr 2020 · Cambridge, UK

Huawei Cambridge Research Center / HiSilicon

PhD Intern Researcher

Researched scientific computing and mixed-precision methods for PDE workloads in collaboration with UK-based teams.

Education

Academic training in engineering, mathematics, and scientific computing.

2014 - 2021 · Beijing, China

Beihang University (BUAA)

PhD, Aerospace Engineering · Supervisor: Prof. Xiaofeng Sun

PhD research in computational aeroacoustics, immersed boundary methods, mixed precision, and massively parallel scientific computing.

2019 - 2020 · Cambridge, UK

University of Cambridge

CSC-Sponsored Full-time Visiting PhD Student, Computational and Applied Mathematics · Host Supervisor: Prof. Paul Tucker

Worked on high-performance scientific computing as a full-time visiting PhD student during the doctoral program.

2010 - 2014 · Xi'an, China

Northwestern Polytechnical University (NWPU)

BEng, Aerospace Engineering

Selected Work

Representative projects and technical artifacts.

GVR Top-K for TensorRT-LLM

NVIDIA · Sparse attention · Production infrastructure

A four-stage, lossless Top-K algorithm and GPU kernel for DSA sparse attention: search-based, sort-free, and nearly atomic-free, designed for long-context inference and integrated into TensorRT-LLM.

Developed through AI-assisted, human-in-the-loop kernel co-design, with AI accelerating algorithm exploration and optimization while human judgment drove problem framing, correctness, and final performance decisions.

LLM4Compiler

openEuler AI4C · Open source

An LLM-native compiler and code optimization initiative, including VecTrans and G2CTrans, for code transformation and performance-oriented workflows.

PASA

Huawei · Low-precision LLM inference

A robust low-precision attention algorithm for large-model inference on Ascend NPUs, combining algorithmic, numerical, and systems considerations.

CartCAaS

PhD research · Scientific computing

A massively parallel CFD/CAA solver for flow-sound interaction in turbomachinery, which formed the foundation for my later work in numerical methods and performance engineering.

Selected Publications

Full list on Google Scholar

VecTrans: LLM Transformation Framework for Better Auto-vectorization on High-Performance CPUs

arXiv 2503.19449 · CGO 2026 submission

Read paper

Online Pseudo-average Shifting Attention (PASA)

arXiv 2503.01873 · NeurIPS 2025 submission

Read paper

Inviscid Nonlinear Modeling of Vibration-Induced Acoustic Resonance of a Linear Cascade

AIAA Journal · 2021

Publisher page

A semi-implicit immersed boundary method for simulating viscous flow-induced sound with moving boundaries

Computer Methods in Applied Mechanics and Engineering · 2021

Publisher page

Talks & Recognition

Selected external talks and organizational recognition.

Heterogeneous Computing -- A Bidirectional Journey between Algorithms and Computing Power

15th C-Talk Public Welfare Science and Technology Speech Conference · 2023

Public report

The principle of Mixed-precision numerical algorithms and the application in computational physics

HKRC Open Webinar · 2022

Huawei Golden Medal Team Award

Huawei's highest team honor · 2024

Selected Awards

  • Silver Award for Innovation Pioneer · 2024
  • Software Elite Award · 2023
  • Best New Employee · 2022
  • Meritorious Winner, Mathematical Contest in Modeling (MCM) · 2013
  • National Scholarship · top 2.2% · 2013

Open to research, systems, and infrastructure conversations

The best way to reach me is by email. You can also find my code, publications, and professional profile through the links below.