Since I don’t find the typical CV dump of publication references with no elaboration particularly helpful for actually getting to know someone, I’ve instead provided TL;DRs and topic identifiers for a selected portion of my work below, which should give a more self-contained overview of me.
For a full publication list, please refer to my Google Scholar — be warned that a large share of my citations comes from surveys I didn’t lead.
Selected Publications
2025
-
Sweeping Promptable Spoofs under the DirtyRAG: A Practical, Query-Blind RAG Attack Done Right
May 2025
TL;DR Security + Evaluation RAG is a powerful attack surface, yet existing attacks are often query-dependent and impractical. We propose a query-blind RAG attack that is effective under realistic threat models, while advancing metrics and evaluation practices in this field.
Given an arbitrary benign passage and a 128-passage injection budget, DirtyRAG can, on average, cause the top 50+ retrieval results for any reasonable query to be filled with adversarial passages reflecting the attacker’s arbitrary intent.
An evaluation suite is also built to better reflect the realistic threat model and intricacies of RAG attacks.
-
Word Salad Chopper: Reasoning Models Waste A Ton Of Decoding Budget On Useless Repetitions, Self-Knowingly
May 2025
TL;DR Efficiency + Interpretability LRMs frequently generate non-verbatim repetitions. We find that LRMs are surprisingly self-aware of this behavior: the hidden states of trailing tokens like “\n\n” exhibit distinguishable patterns once the model is trapped in repetition. We thereby develop an on-the-fly detector (with practically negligible overhead) that triggers chopping interventions.
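For a concrete sense of the mechanism, here is a minimal sketch assuming a simple linear probe over hidden states; the actual detector architecture, layer choice, trigger tokens, and threshold in the paper may differ.

```python
import torch

HIDDEN = 4096                         # hidden size of the assumed LRM
probe = torch.nn.Linear(HIDDEN, 1)    # trained offline on (hidden state, is-repeating) labels

def should_chop(hidden_state: torch.Tensor, threshold: float = 0.9) -> bool:
    """Return True if the probe believes the model is trapped in repetition."""
    with torch.no_grad():
        p = torch.sigmoid(probe(hidden_state)).item()
    return p > threshold

# Inside the generation loop (schematically):
#   if decoded_token == "\n\n":                # a trailing token of interest
#       if should_chop(last_hidden_state):     # hidden state at that token
#           stop_decoding()                    # chop the word salad here
```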
-
LoRATK: LoRA Once, Backdoor Everywhere in the Share-and-Play Ecosystem
May 2025
We demonstrate that, by adopting a specific data and merging recipe, a backdoor LoRA can be trained once and then merged with multiple task LoRAs while retaining both benign and adversarial capabilities. Such scalability makes it particularly dangerous and infectious in the share-and-play ecosystem of LoRA, where an unfortunate real-life tragedy has already happened in a more victim-favorable setting.
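For context, the snippet below only sketches the generic additive LoRA-merging step that the share-and-play ecosystem relies on; the actual data and merging recipe from the paper is not shown, and all shapes and scaling factors here are illustrative.

```python
import torch

d, r = 4096, 16
W0 = torch.randn(d, d)                                   # frozen base weight
A_task, B_task = torch.randn(r, d), torch.randn(d, r)    # benign task LoRA
A_bd, B_bd = torch.randn(r, d), torch.randn(d, r)        # backdoor LoRA, trained once

scale = 2.0 / r                                          # lora_alpha / r (illustrative)
# Merging is just a sum of low-rank deltas; nothing in the workflow flags the extra adapter.
W_merged = W0 + scale * (B_task @ A_task) + scale * (B_bd @ A_bd)
print(W_merged.shape)
```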
-
70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
May 2025
We compress BF16 models with Huffman encoding by exploiting the low entropy of their exponent bits. By introducing a new data format, DFloat11/DF11, we can reduce any BF16 model to roughly 70% of its original size.
Because DF11 ensures lossless output quality while still running on typical GPUs, it has gained strong interest from local deployment-focused communities like r/LocalLLaMA, with over 20k monthly downloads on Hugging Face.
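As a rough illustration of why this works (using Gaussian weights as a stand-in for real model weights, so the exact numbers will differ), one can measure how few bits of information the BF16 exponent field actually carries:

```python
import numpy as np
import torch

w = torch.randn(1_000_000).to(torch.bfloat16)         # stand-in for real BF16 weights
bits = w.view(torch.int16).numpy().view(np.uint16)    # raw 16-bit patterns
exponent = (bits >> 7) & 0xFF                         # BF16 layout: 1 sign, 8 exponent, 7 mantissa

counts = np.bincount(exponent.astype(np.int64), minlength=256)
p = counts[counts > 0] / counts.sum()
entropy = float(-(p * np.log2(p)).sum())              # bits actually needed per exponent value

print(f"exponent entropy ~ {entropy:.2f} bits (vs. 8 stored)")
print(f"implied size     ~ {(1 + entropy + 7) / 16:.0%} of BF16")  # sign + exponent + mantissa
```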
2024
-
MQuAKE-Remastered: Multi-Hop Knowledge Editing Can Only Be Advanced with Reliable Evaluations
Oct 2024
TL;DR Robustness + Evaluation Updating one knowledge fact will produce a ripple effect, making multi-hop knowledge editing (MHKE) a desired capability for reliable LLMs. We reveal many previously unknown errors in MQuAKE, the most popular and only realistic MHKE dataset, through a comprehensive dataset audit, and fix all of them.
We further propose a simple RAG-based method that leverages graph topology, which does not take advantage of any dataset idiosyncrasies yet massively surpasses all SOTAs.
-
KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches
Jun 2024
TL;DR Efficiency + Evaluation Plenty of efficiency approaches claim to be long context (LC)-capable, but which can stand up to comprehensive scrutiny, what are the trade-offs, and what are some promising future directions? We attempt to answer these questions by evaluating many exemplar methods from different schools of thought against a range of LC tasks.
I also designed and led its open-source release, which serves as a representation of my code quality, attention to detail, and project ownership.
-
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Feb 2024
The KV cache is huge and bottlenecks LLM inference. We are the first to study the outlier patterns of the KV cache and leverage their channel-wise structure to quantize it to 2 bits in a finetuning-free, plug-and-play fashion.
KIVI is the first KV cache-specific quantization work; many of its ingredients have since been adopted as the field’s standards.
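A minimal sketch of the core operation, asymmetric 2-bit quantization with per-channel statistics (as used for the Key cache); group size, packing, and other details here are illustrative rather than KIVI’s actual implementation.

```python
import torch

def quantize_2bit(x: torch.Tensor, dim: int):
    """Asymmetric 2-bit quantization along `dim` (4 levels: 0..3)."""
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-6) / 3
    codes = ((x - xmin) / scale).round().clamp(0, 3).to(torch.uint8)
    return codes, scale, xmin

def dequantize_2bit(codes, scale, xmin):
    return codes.float() * scale + xmin

K = torch.randn(1024, 128)                     # a (tokens, head_dim) slice of the Key cache
codes, scale, zero = quantize_2bit(K, dim=0)   # per-channel: statistics shared across tokens
K_hat = dequantize_2bit(codes, scale, zero)
print("mean abs error:", (K - K_hat).abs().mean().item())
```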
-
Taylor Unswift: Secured Weight Release for Large Language Models via Taylor Expansion
Jun 2024
Model owners often face the dilemma of either fully open-sourcing their models and losing ownership, or offering API-only access and effectively deterring privacy-sensitive users. Taylor Unswift replaces the MLP modules with Taylor expansion approximations, allowing owners to “semi-release” a special version of their model that remains true to the original yet is purposely slow during inference and almost impossible to finetune (hence, “unswift”).
Yes, I picked this name :)
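As a toy illustration of the general idea only (not the actual Taylor Unswift construction), here is a nonlinearity inside an MLP replaced by its truncated Taylor series:

```python
import torch

def silu_taylor(x: torch.Tensor, order: int = 7) -> torch.Tensor:
    """SiLU(x) = x * sigmoid(x), with sigmoid replaced by its Taylor series at 0."""
    # sigmoid(x) ~= 1/2 + x/4 - x^3/48 + x^5/480 - 17x^7/80640 + ...
    coeffs = {0: 0.5, 1: 0.25, 3: -1 / 48, 5: 1 / 480, 7: -17 / 80640}
    sigmoid_approx = sum(c * x**k for k, c in coeffs.items() if k <= order)
    return x * sigmoid_approx

x = torch.linspace(-2, 2, 5)
print(torch.nn.functional.silu(x))   # exact activation
print(silu_taylor(x))                # polynomial stand-in
```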
-
GNNs Also Deserve Editing, and They Need It More Than Once
Feb 2024
Editing (fixing a model’s mistakes postmortem) is popular in almost every domain except graphs, because graphs are theoretically unfriendly to edits. We reveal that the root cause is over-accommodation of the editing targets and present the first graph editing work that lives up to the real-life scrutiny of sequential edits.
2023
-
One Less Reason for Filter Pruning: Gaining Free Adversarial Robustness with Structured Grouped Kernel Pruning
May 2023
TL;DR Efficiency + Robustness Pruned models with conv operators can match the benign performance of their unpruned counterparts yet suffer drastic drops on perturbed inputs. We mitigate this issue with a theoretically motivated grouped kernel pruning (GKP) procedure that utilizes kernel smoothness, a mathematical property known to promote frequency robustness, at no extra cost.
2021
-
Revisit Kernel Pruning with Lottery Regulated Grouped Convolutions
Oct 2021
We present a novel pruning granularity called Grouped Kernel Pruning, or GKP, which delivers a higher degree of pruning freedom for models with conv operators while keeping the pruned model densely structured for efficiency benefits.
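A schematic of this granularity, with the grouping and kernel-ranking criteria heavily simplified: within each group of output filters, whole kernels at the same input-channel positions are removed, so the survivors still form a dense computation that maps onto a grouped convolution.

```python
import torch

C_out, C_in, k, groups, keep = 64, 32, 3, 4, 16   # keep 16 of 32 input kernels per group
W = torch.randn(C_out, C_in, k, k)                # dense conv weight

pruned = []
for g in W.chunk(groups, dim=0):                  # each g: (C_out // groups, C_in, k, k)
    scores = g.pow(2).sum(dim=(0, 2, 3))          # one importance score per input-channel kernel
    kept = scores.topk(keep).indices              # keep the strongest kernel positions ...
    pruned.append(g[:, kept])                     # ... so each group stays dense: (16, 16, k, k)

print([p.shape for p in pruned])                  # each slice maps onto a grouped convolution
```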