Dr. Long Cheng · Senior Architect at NVIDIA

From Mathematical Insight to Production LLM Infrastructure.

I work across algorithms, systems, mathematics, and hardware, turning research ideas into production AI infrastructure.

I build TensorRT-LLM inference systems through algorithm/kernel co-design, sparse attention, and LLM-guided optimization.

Before NVIDIA, I led work on low-precision LLM operators, LLM-native compiler workflows, and mixed-precision solvers on a Huawei research team, building on a background in HPC and numerical methods.

TensorRT-LLM · Sparse Attention · GPU Kernels · Mixed Precision · LLM4Compiler

Email · Resume · GitHub · LinkedIn · Scholar

Production: Default TensorRT-LLM Top-K path for DeepSeek-V4 Flash / Pro

Original IP: Patents filed at NVIDIA, plus a Huawei high-value patent related to PASA

Research Depth: Numerical error analysis, mixed precision, and LLM-guided compilers

Selected Work

Three representative projects: innovation, impact, evidence.

arXiv · Apr 2026 · NVIDIA · TensorRT-LLM · Sparse Attention

GVR Top-K

Innovation: Temporal-correlation threshold search predicts compact candidate sets losslessly.

Impact: 1.88x average Top-K speedup; up to 9.3% generation gain; default DeepSeek-V4 (DSv4) indexer path.

Evidence: TensorRT-LLM PRs, NVIDIA tech blog, arXiv paper, three filed patents.

Technical formula

\[ f(T)=\left|\{\,i\mid x_i\ge T\,\}\right|, \qquad K\le f(T^\ast)\le C \]

\[ T_{\mathrm{new}} =T_{\mathrm{lo}}+ \frac{f(T_{\mathrm{lo}})-f_{\mathrm{target}}} {f(T_{\mathrm{lo}})-f(T_{\mathrm{hi}})} \left(T_{\mathrm{hi}}-T_{\mathrm{lo}}\right) \]
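
The two formulas above can be read as a counting function f(T) and an interpolation step that moves the threshold toward a target count. The following is a minimal Python sketch under that reading, not the TensorRT-LLM kernel: the names `count_ge`, `find_threshold`, and `topk_via_threshold`, the midpoint choice of `f_target`, the bisection fallback, the use of a previous threshold as `t_guess`, and the synthetic scores are all assumptions made for illustration.

```python
import heapq
import random


def count_ge(scores, t):
    """f(T): how many scores are at or above threshold T."""
    return sum(1 for s in scores if s >= t)


def find_threshold(scores, k, c, t_guess=None, max_iters=64):
    """Return T with k <= f(T) <= c, or None if the search gives up."""
    t_lo, t_hi = min(scores), max(scores)
    f_lo, f_hi = count_ge(scores, t_lo), count_ge(scores, t_hi)
    if k <= f_lo <= c:
        return t_lo
    if k <= f_hi <= c:
        return t_hi
    if f_lo < k or f_hi > c:
        return None  # fewer than k scores, or too many ties at the maximum
    # From here on f_lo > c and f_hi < k, so [t_lo, t_hi] brackets the target.
    f_target = 0.5 * (k + c)  # assumption: aim for the middle of [k, c]
    t_new = t_guess  # assumption: reuse the previous step's threshold
    for _ in range(max_iters):
        if t_new is None or not (t_lo < t_new < t_hi):
            # Interpolation step from the second formula above.
            t_new = t_lo + (f_lo - f_target) / (f_lo - f_hi) * (t_hi - t_lo)
            if not (t_lo < t_new < t_hi):
                t_new = 0.5 * (t_lo + t_hi)  # bisection fallback
        f_new = count_ge(scores, t_new)
        if k <= f_new <= c:
            return t_new
        if f_new > c:
            t_lo, f_lo = t_new, f_new  # too many candidates: raise the floor
        else:
            t_hi, f_hi = t_new, f_new  # too few candidates: lower the ceiling
        t_new = None
    return None


def topk_via_threshold(scores, k, c, t_guess=None):
    """Exact top-k indices, selected from the thresholded candidate set."""
    t = find_threshold(scores, k, c, t_guess)
    if t is None:
        candidates = range(len(scores))  # fall back to scanning everything
    else:
        candidates = [i for i, s in enumerate(scores) if s >= t]
    return heapq.nlargest(k, candidates, key=scores.__getitem__), t


# Synthetic scores standing in for one decoding step (illustrative only).
rng = random.Random(0)
scores = [rng.random() for _ in range(4096)]
top_idx, t_star = topk_via_threshold(scores, k=64, c=128)
```

Every score at or above the accepted threshold is kept, so the candidate set always contains the true Top-K, and the final selection over at most C candidates returns the same values as a full Top-K. Passing the previous step's `t_star` as `t_guess` lets the search skip the interpolation entirely when the guess still lands inside the [K, C] window.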

Lossless Top-K · 1.88x avg Top-K speedup · Default DSv4 indexer Top-K

arXiv paper · Merged PR · Technical blog

arXiv · Mar 2025 · Huawei · CANN SoftmaxFlashV3 · Full Low-Precision Attention

PASA

Innovation: Equivalent shifting plus finite-precision parameter tuning enables full-FP16 attention.

Impact: Near-FP32 error, up to 1.65x E2E speedup, Atlas A3/A2 support.

Evidence: CANN SoftmaxFlashV3 API, arXiv paper, Huawei high-value patent.

Technical formula

\[ \mathrm{softmax}(\mathbf Q\mathbf K^{\mathsf T}) = \mathrm{softmax}\!\left(\mathbf Q(\mathbf K^{\mathsf T}-\mathbf K_0^{\mathsf T})\right), \qquad \mathbf M=\frac{1}{\alpha}\left(\mathbf I-\frac{\beta}{s_2}\mathbf J\right) \]

\[ \frac{\beta}{1-\beta}=f(\beta), \qquad f(\beta)=\frac{bn}{a(a-bn)}+\frac{1-a}{a}, \qquad \beta=0.984497 \]
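
The first identity above holds because subtracting one shared vector k_0 from every key shifts each row of QK^T by a constant, and softmax is invariant to a per-row constant. A small Python check of that property follows, assuming k_0 is β times the mean key (one possible reading of M with α = 1). The helper names, the synthetic data, and that choice of k_0 are assumptions of this sketch, not the CANN implementation.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def logits(q, keys):
    """One row of Q K^T: the dot product of q with every key."""
    return [sum(qi * ki for qi, ki in zip(q, key)) for key in keys]


def shift_keys(keys, beta):
    """Subtract beta times the mean key from every key (assumed k_0)."""
    n, d = len(keys), len(keys[0])
    mean = [sum(key[j] for key in keys) / n for j in range(d)]
    return [[key[j] - beta * mean[j] for j in range(d)] for key in keys]


# Synthetic keys with a large shared offset, so the raw logits are large.
rng = random.Random(0)
d, n = 8, 16
keys = [[5.0 + rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
q = [1.0 + 0.1 * rng.gauss(0.0, 1.0) for _ in range(d)]

beta = 0.984497  # value quoted in the formula above
raw = logits(q, keys)
shifted = logits(q, shift_keys(keys, beta))

p_raw = softmax(raw)
p_shifted = softmax(shifted)
max_diff = max(abs(a - b) for a, b in zip(p_raw, p_shifted))
```

In this toy setup the two probability rows agree to floating-point rounding, while the shifted logits are far smaller in magnitude than the raw ones, which is the property that makes a narrow format such as FP16 easier to use safely.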

Full FP16 attention path · CANN SoftmaxFlashV3 API · Near-FP32 operator error

Read paper · CANN API

arXiv · Mar 2025 · openEuler AI4C · Auto-vectorization · LLM4Compiler

VecTrans

Innovation: LLM agents propose, test, and refine vectorization-friendly source transforms.

Impact: Improved auto-vectorization workflows for ARM NEON/SVE and LLM4Compiler optimization loops.

Evidence: arXiv paper, openEuler AI4C code, EuroLLVM 2025 invited talk.

LLM-guided auto-vectorization · ARM NEON / SVE · EuroLLVM 2025 invited talk

VecTrans paper · EuroLLVM program · GitHub mirror

Technical Throughline

A compact workflow behind my AI systems work.

Mathematical structure + LLM-agent search + hardware-aware kernels for production AI infrastructure.

Model the Structure

Error bounds, equivalent transformations, sparsity, temporal correlation, data movement.

Search with LLM Agents

Generate, critique, and refine algorithms, rewrites, and schedules.

Optimize the Kernel

Use roofline analysis and scheduling for CUDA / CUTE DSL / Triton, Ascend kernels, and vectorization.

Validate in Production

Close the loop with accuracy checks, profiling, TensorRT-LLM PRs, patents, and papers.

GVR: temporal correlation to TensorRT-LLM Top-K · PASA: full-FP16 attention with near-FP32 error · Roofline-guided kernel optimization · VecTrans: LLM-guided auto-vectorization

Experience

Industry roles and research engineering experience.

Nov 2025 - Present · China

NVIDIA

Senior Architect, Compute DL Architecture

Building TensorRT-LLM inference infrastructure for sparse attention, long-context decoding, GPU kernels, and system-algorithm co-design.

Apr 2021 - Nov 2025 · Beijing, China

Huawei Beijing Research Center

Researcher to Principal Research Engineer, Compilers and Programming Languages Lab

Led PASA, LLM4Compiler / VecTrans, low-precision LLM operators, and mixed-precision solver optimization across Ascend and Kunpeng.

Jan 2020 - Apr 2020 · Cambridge, UK

Huawei Cambridge Research Center / HiSilicon

PhD Intern Researcher

Researched mixed-precision scientific computing for PDE workloads.

Education

Academic training in engineering, mathematics, and scientific computing.

2014 - 2021 · Beijing, China

Beihang University (BUAA)

PhD, Aerospace Engineering · Supervisor: Prof. Xiaofeng Sun

Computational aeroacoustics, immersed boundary methods, mixed precision, and parallel scientific computing.

2019 - 2020 · Cambridge, UK

University of Cambridge

CSC-Sponsored Joint PhD Student · Host Supervisor: Prof. Paul Tucker

High-performance scientific computing under a CSC-sponsored joint PhD arrangement.

2010 - 2014 · Xi'an, China

Northwestern Polytechnical University (NWPU)

BEng, Aerospace Engineering

Selected Publications

Full list on Google Scholar

Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation

arXiv 2604.22312 · TensorRT-LLM / Blackwell sparse-attention decoding

Lossless temporally correlated sparse-attention Top-K with production TensorRT-LLM integration.

Read paper

VecTrans: LLM Transformation Framework for Better Auto-vectorization on High-Performance CPUs

arXiv 2503.19449 · CGO 2026 submission

Read paper · EuroLLVM program · EuroLLVM: Beyond Pattern-based Optimization

Online Pseudo-average Shifting Attention (PASA)

arXiv 2503.01873 · NeurIPS 2025 submission

Full-FP16 attention with near-FP32 error, productized as Huawei CANN SoftmaxFlashV3.

Read paper

Inviscid Nonlinear Modeling of Vibration-Induced Acoustic Resonance of a Linear Cascade

AIAA Journal · 2021

Publisher page

A semi-implicit immersed boundary method for simulating viscous flow-induced sound with moving boundaries

Computer Methods in Applied Mechanics and Engineering · 2021

Publisher page

Talks & Recognition

Selected external talks and organizational recognition.

Heterogeneous Computing -- A Bidirectional Journey between Algorithms and Computing Power

15th C-Talk Public Welfare Science and Technology Speech Conference · 2023

Public talk on algorithm-hardware co-design, heterogeneous computing, and AI infrastructure.

Public report · Open video

Beyond Pattern-based Optimization: What Can LLM Reshape Auto-vectorization?

EuroLLVM 2025 Developer Meeting · Invited lightning talk · Berlin

Invited talk on VecTrans and LLM-guided compiler optimization.

EuroLLVM program · Watch on YouTube

Huawei Golden Medal Team Awards

Huawei's highest team honor · Two team awards · 2024

Critical Fundamental Software: ODML across OS, databases, middleware, languages, and compilers.
High-performance LLM operators on Huawei Ascend NPUs.

Compiler System Design Competition Service

National Undergraduate Computer System Ability Competition · 2024-2025

On-site judge and problem setter for the Compiler System Design / Challenge Track.

Competition website

Selected Awards

Silver Award for Innovation Pioneer · 2024
Software Elite Award · 2023
Best New Employee · 2022
Meritorious Winner, Mathematical Contest in Modeling (MCM) · 2013
National Scholarship · top 2.2% · 2013

Contact

Open to research, systems, and infrastructure conversations

Reach me by email, or use the links below for code, publications, and profile.

buaalongcheng@gmail.com · Resume PDF · github.com/chenglong92 · linkedin.com/in/long-cheng92 · Google Scholar