Since I don’t find the typical CV dump of publication references with no elaboration particularly helpful for actually getting to know someone, I’ve instead provided TL;DRs and topic identifiers for a selected portion of my work below, which should give a more self-contained overview of me.
For a full publication list, please refer to my Google Scholar — be warned that a large share of my citations comes from surveys I didn’t lead.
Selected Publications
2025
-
Sweeping Promptable Spoofs under the DirtyRAG: A Practical, Query-Blind RAG Attack Done Right
May 2025
TL;DR Security + Evaluation RAG is a powerful attack surface, yet existing attacks are often query-dependent and impractical. We propose a query-blind RAG attack that is effective under realistic threat models, while advancing metrics and evaluation practices in this field.
Given an arbitrary benign passage and a 128-passage injection budget, DirtyRAG can, on average, cause the top 50+ retrieval results for any reasonable query to be filled with adversarial passages reflecting the attacker’s arbitrary intent.
An evaluation suite is also built to better reflect the realistic threat model and intricacies of RAG attacks.
-
Word Salad Chopper: Reasoning Models Waste A Ton Of Decoding Budget On Useless Repetitions, Self-Knowingly
May 2025
TL;DR Efficiency + Interpretability LRMs frequently generate non-verbatim repetitions. We find that LRMs are surprisingly self-aware of this behavior: the hidden states of trailing tokens like “\n\n” exhibit distinguishable patterns once the model is trapped in repetition. We thereby develop an on-the-fly detector (with practically negligible overhead) that triggers chopping interventions.
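For a concrete sense of the mechanism, here is a minimal sketch assuming a simple linear probe over hidden states; the actual detector architecture, layer choice, trigger tokens, and threshold in the paper may differ.

```python
import torch

HIDDEN = 4096                         # hidden size of the assumed LRM
probe = torch.nn.Linear(HIDDEN, 1)    # trained offline on (hidden state, is-repeating) labels

def should_chop(hidden_state: torch.Tensor, threshold: float = 0.9) -> bool:
    """Return True if the probe believes the model is trapped in repetition."""
    with torch.no_grad():
        p = torch.sigmoid(probe(hidden_state)).item()
    return p > threshold

# Inside the generation loop (schematically):
#   if decoded_token == "\n\n":                # a trailing token of interest
#       if should_chop(last_hidden_state):     # hidden state at that token
#           stop_decoding()                    # chop the word salad here
```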
-
LoRATK: LoRA Once, Backdoor Everywhere in the Share-and-Play Ecosystem
May 2025
We demonstrate that, by adopting a specific data and merging recipe, a backdoor LoRA can be trained once and then merged with multiple task LoRAs while retaining both benign and adversarial capabilities. Such scalability makes it particularly dangerous and infectious in the share-and-play ecosystem of LoRA, where an unfortunate real-life tragedy has already happened in a more victim-favorable setting.
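For context, the snippet below only sketches the generic additive LoRA-merging step that the share-and-play ecosystem relies on; the actual data and merging recipe from the paper is not shown, and all shapes and scaling factors here are illustrative.

```python
import torch

d, r = 4096, 16
W0 = torch.randn(d, d)                                   # frozen base weight
A_task, B_task = torch.randn(r, d), torch.randn(d, r)    # benign task LoRA
A_bd, B_bd = torch.randn(r, d), torch.randn(d, r)        # backdoor LoRA, trained once

scale = 2.0 / r                                          # lora_alpha / r (illustrative)
# Merging is just a sum of low-rank deltas; nothing in the workflow flags the extra adapter.
W_merged = W0 + scale * (B_task @ A_task) + scale * (B_bd @ A_bd)
print(W_merged.shape)
```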
-
70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
May 2025
We compress BF16 models with Huffman encoding by exploiting the low entropy of their exponent bits. By introducing a new data format, DFloat11/DF11, we can reduce any BF16 model to roughly 70% of its original size.
Because DF11 ensures lossless output quality while still running on typical GPUs, it has gained strong interest from local deployment-focused communities like r/LocalLLaMA, with over 20k monthly downloads on Hugging Face.
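As a rough illustration of why this works (using Gaussian weights as a stand-in for real model weights, so the exact numbers will differ), one can measure how few bits of information the BF16 exponent field actually carries:

```python
import numpy as np
import torch

w = torch.randn(1_000_000).to(torch.bfloat16)         # stand-in for real BF16 weights
bits = w.view(torch.int16).numpy().view(np.uint16)    # raw 16-bit patterns
exponent = (bits >> 7) & 0xFF                         # BF16 layout: 1 sign, 8 exponent, 7 mantissa

counts = np.bincount(exponent.astype(np.int64), minlength=256)
p = counts[counts > 0] / counts.sum()
entropy = float(-(p * np.log2(p)).sum())              # bits actually needed per exponent value

print(f"exponent entropy ~ {entropy:.2f} bits (vs. 8 stored)")
print(f"implied size     ~ {(1 + entropy + 7) / 16:.0%} of BF16")  # sign + exponent + mantissa
```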
2024
-
MQuAKE-Remastered: Multi-Hop Knowledge Editing Can Only Be Advanced with Reliable Evaluations
Oct 2024
TL;DR Robustness + Evaluation Updating one knowledge fact will produce a ripple effect, making multi-hop knowledge editing (MHKE) a desired capability for reliable LLMs. We reveal many previously unknown errors in MQuAKE, the most popular and only realistic MHKE dataset, through a comprehensive dataset audit, and fix all of them.
We further propose a simple RAG-based method that leverages graph topology, which does not take advantage of any dataset idiosyncrasies yet massively surpasses all SOTAs.
-
KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches
Jun 2024
TL;DR Efficiency + Evaluation Plenty of efficiency approaches claim to be long context (LC)-capable, but which can stand up to comprehensive scrutiny, what are the trade-offs, and what are some promising future directions? We attempt to answer these questions by evaluating many exemplar methods from different schools of thought against a range of LC tasks.
I also designed and led its open-source release, which serves as a representation of my code quality, attention to detail, and project ownership.
-
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Feb 2024
The KV cache is huge and bottlenecks LLM inference. We are the first to study the outlier patterns of the KV cache and leverage their channel-wise structure to quantize it to 2 bits in a finetuning-free, plug-and-play fashion.
KIVI is the first KV cache-specific quantization work; many of its ingredients have since been adopted as the field’s standards.
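A minimal sketch of the core operation, asymmetric 2-bit quantization with per-channel statistics (as used for the Key cache); group size, packing, and other details here are illustrative rather than KIVI’s actual implementation.

```python
import torch

def quantize_2bit(x: torch.Tensor, dim: int):
    """Asymmetric 2-bit quantization along `dim` (4 levels: 0..3)."""
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-6) / 3
    codes = ((x - xmin) / scale).round().clamp(0, 3).to(torch.uint8)
    return codes, scale, xmin

def dequantize_2bit(codes, scale, xmin):
    return codes.float() * scale + xmin

K = torch.randn(1024, 128)                     # a (tokens, head_dim) slice of the Key cache
codes, scale, zero = quantize_2bit(K, dim=0)   # per-channel: statistics shared across tokens
K_hat = dequantize_2bit(codes, scale, zero)
print("mean abs error:", (K - K_hat).abs().mean().item())
```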
-
Taylor Unswift: Secured Weight Release for Large Language Models via Taylor Expansion
Jun 2024
Model owners often face the dilemma of either fully open-sourcing their models and losing ownership, or offering API-only access and effectively deterring privacy-sensitive users. Taylor Unswift replaces the MLP modules with Taylor expansion approximations, allowing owners to “semi-release” a special version of their model that remains true to the original yet is purposely slow during inference and almost impossible to finetune (hence, “unswift”).
Yes, I picked this name :)
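As a toy illustration of the general idea only (not the actual Taylor Unswift construction), here is a nonlinearity inside an MLP replaced by its truncated Taylor series:

```python
import torch

def silu_taylor(x: torch.Tensor, order: int = 7) -> torch.Tensor:
    """SiLU(x) = x * sigmoid(x), with sigmoid replaced by its Taylor series at 0."""
    # sigmoid(x) ~= 1/2 + x/4 - x^3/48 + x^5/480 - 17x^7/80640 + ...
    coeffs = {0: 0.5, 1: 0.25, 3: -1 / 48, 5: 1 / 480, 7: -17 / 80640}
    sigmoid_approx = sum(c * x**k for k, c in coeffs.items() if k <= order)
    return x * sigmoid_approx

x = torch.linspace(-2, 2, 5)
print(torch.nn.functional.silu(x))   # exact activation
print(silu_taylor(x))                # polynomial stand-in
```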
-
GNNs Also Deserve Editing, and They Need It More Than Once
Feb 2024
Editing (fixing a model’s mistakes postmortem) is popular in almost every domain except graphs, because graphs are theoretically unfriendly to edits. We reveal that the root cause is over-accommodation of the editing targets and present the first graph editing work that lives up to the real-life scrutiny of sequential edits.
2023
-
One Less Reason for Filter Pruning: Gaining Free Adversarial Robustness with Structured Grouped Kernel Pruning
May 2023
TL;DR Efficiency + Robustness Pruned models with conv operators can match the benign performance of their unpruned counterparts yet suffer drastic drops on perturbed inputs. We mitigate this issue with a theoretically motivated grouped kernel pruning (GKP) procedure that utilizes kernel smoothness, a mathematical property known to promote frequency robustness, at no extra cost.
2021
-
Revisit Kernel Pruning with Lottery Regulated Grouped Convolutions
Oct 2021
We present a novel pruning granularity called Grouped Kernel Pruning, or GKP, which delivers a higher degree of pruning freedom for models with conv operators while keeping the pruned model densely structured for efficiency benefits.
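A schematic of this granularity, with the grouping and kernel-ranking criteria heavily simplified: within each group of output filters, whole kernels at the same input-channel positions are removed, so the survivors still form a dense computation that maps onto a grouped convolution.

```python
import torch

C_out, C_in, k, groups, keep = 64, 32, 3, 4, 16   # keep 16 of 32 input kernels per group
W = torch.randn(C_out, C_in, k, k)                # dense conv weight

pruned = []
for g in W.chunk(groups, dim=0):                  # each g: (C_out // groups, C_in, k, k)
    scores = g.pow(2).sum(dim=(0, 2, 3))          # one importance score per input-channel kernel
    kept = scores.topk(keep).indices              # keep the strongest kernel positions ...
    pruned.append(g[:, kept])                     # ... so each group stays dense: (16, 16, k, k)

print([p.shape for p in pruned])                  # each slice maps onto a grouped convolution
```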