publications

Since I don’t find the typical CV-style dump of publication references with no elaboration particularly helpful for actually getting to know someone, I’ve instead provided tl;drs and topic identifiers for a selection of my works below; this should give you a proper, more self-contained overview of me.

For a full publication list, please refer to my Google Scholar — be warned that a large share of my citations comes from surveys I didn’t lead.

Selected Publications

2025

  1. Preprint
    Position: Want Better ML Reviews? Stop Asking Nicely and Start Incentivizing with a Credit System
    Shaochen (Henry) Zhong
  2. Preprint
    Sweeping Promptable Spoofs under the DirtyRAG: A Practical, Query-Blind RAG Attack Done Right
    Shaochen (Henry) Zhong*, Jiamu Zhang*, Hoang Anh Duy Le, and 15 more authors
  3. Preprint
    FAFO: Lossless KV Cache Compression with Draftless Fumble Decoding
    Hoang Anh Duy Le*, Shaochen (Henry) Zhong*, Yifan Lu, and 7 more authors
  4. EMNLP 2025 Main Oral
    Word Salad Chopper: Reasoning Models Waste A Ton Of Decoding Budget On Useless Repetitions, Self-Knowingly
    Wenya Xie*, Shaochen (Henry) Zhong*, Hoang Anh Duy Le, and 3 more authors
  5. EMNLP 2025 Findings
    LoRATK: LoRA Once, Backdoor Everywhere in the Share-and-Play Ecosystem
    Hongyi Liu*, Shaochen (Henry) Zhong*, Xintong Sun*, and 12 more authors
  6. NeurIPS 2025
    70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
    Tianyi Zhang, Mohsen Hariri, Shaochen (Henry) Zhong, and 4 more authors

2024

  1. ICLR 2025 Spotlight
    MQuAKE-Remastered: Multi-Hop Knowledge Editing Can Only Be Advanced with Reliable Evaluations
    Shaochen (Henry) Zhong*, Yifan Lu*, Lize Shao, and 13 more authors
  2. EMNLP 2025 Findings
    KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches
    Jiayi Yuan*, Hongyi Liu*, Shaochen (Henry) Zhong*, and 9 more authors
  3. ICML 2024
    KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
    Zirui Liu*, Jiayi Yuan*, Hongye Jin, and 5 more authors
  4. EMNLP 2024 Main
    Taylor Unswift: Secured Weight Release for Large Language Models via Taylor Expansion
    Guanchu Wang*, Yu-Neng Chuang*, Ruixiang Tang, and 8 more authors
  5. ICML 2024
    GNNs Also Deserve Editing, and They Need It More Than Once
    Shaochen (Henry) Zhong*, Hoang Anh Duy Le*, Zirui Liu, and 10 more authors

2023

  1. NeurIPS 2023
    One Less Reason for Filter Pruning: Gaining Free Adversarial Robustness with Structured Grouped Kernel Pruning
    Shaochen (Henry) Zhong, Zaichuan You, Jiamu Zhang, and 7 more authors

2021

  1. ICLR 2022
    Revisit Kernel Pruning with Lottery Regulated Grouped Convolutions
    Shaochen (Henry) Zhong, Guanqun Zhang, Ningjia Huang, and 1 more author