Paper Feeds (arXiv)

From Efficiency to Leakage -- Privacy Backdoor in Federated Language Model Fine-Tuning

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.20553v1

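AI Summary (Chinese, translated)

Background and Problem

Federated learning (FL) lets multiple parties collaboratively fine-tune language models without sharing raw data, but full-parameter fine-tuning demands more compute than most clients have. Parameter-efficient fine-tuning (PEFT), which freezes the backbone and trains only a small set of adapters, has therefore become the dominant paradigm in practice. This paper shows for the first time that a malicious parameter server can implant a stealthy privacy backdoor into a PEFT adapter, one that implicitly memorizes the client's training samples without degrading model utility.

Method: The NeuroImprint Attack

We propose NeuroImprint, a memory-injection mechanism based on neuron-level isolation:

Each training sample is assigned its own dedicated "memorization neuron" (a single trainable parameter in the adapter);
Each such neuron is constrained to be updated at most once during local fine-tuning (through gradient masking and intervention in the optimizer state), which avoids the cross-sample collisions and cross-step mixing caused by large-batch training and stateful optimizers such as AdamW;
After fine-tuning, these isolated per-sample updates can be inverted analytically to recover the sample's text embeddings, which are then mapped back to the original token sequence through a deterministic step (such as nearest-neighbor token lookup).

Key Findings and Contributions

Evaluated on BERT, GPT-2, Qwen2, and Llama3.2 across four datasets from different domains (text classification, NER, summarization, and question answering), NeuroImprint reconstructs 59% to 79% of all fine-tuning samples with high semantic fidelity (BLEU-4 > 68, ROUGE-L > 72). The work shifts backdoor attacks from a "functional perturbation" paradigm to a "privacy theft" paradigm, exposes a fundamental privacy weakness of PEFT in federated settings, and offers theoretical bounds for the design of secure adapters.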

AI Summary (English)

Federated learning (FL) relies heavily on parameter-efficient fine-tuning (PEFT) to enable resource-constrained clients to collaboratively adapt language models without sharing raw data. This paper uncovers a critical privacy vulnerability: a malicious parameter server can stealthily implant a privacy backdoor into PEFT adapters—memorizing clients’ training samples as isolated, per-sample parameter updates in dedicated neurons—without degrading model utility. We propose NeuroImprint, which assigns one unique neuron per sample and enforces at-most-one update per neuron during local fine-tuning, eliminating cross-sample collisions and optimizer-induced mixing (e.g., from AdamW). Post-training, these isolated updates are analytically inverted to recover text embeddings and deterministically mapped to token sequences. Evaluated across BERT, GPT-2, Qwen2, and Llama3.2 on four diverse NLP tasks, NeuroImprint reconstructs 59–79% of all fine-tuning samples with high semantic fidelity (e.g., ROUGE-L > 72), exposing a fundamental privacy threat inherent to PEFT-based FL.

Abstract

Federated learning (FL) enables multiple parties to collaboratively fine-tune language models for domain-specific tasks without sharing raw data. Since full model fine-tuning is often prohibitively expensive for FL clients, parameter-efficient fine-tuning (PEFT) has become the de facto approach in practice, freezing the base model and training only a small set of adapters. In this paper, we show that a malicious parameter server can stealthily corrupt a PEFT adapter into a privacy backdoor that implicitly memorizes the client's training samples as isolated per-sample parameter updates stored in separate neurons, without degrading model utility. Concretely, our attack, NeuroImprint, assigns a dedicated memorization neuron to each training sample and constrains each neuron to be updated at most once along the local fine-tuning trajectory. This design mitigates both cross-sample collisions and cross-step mixing introduced by large local batches and stateful optimizers (e.g., Adam/AdamW) in language-model fine-tuning. After fine-tuning, the resulting isolated per-sample updates can be analytically inverted in closed form to recover text embeddings, which are then deterministically mapped back to token sequences. To understand the generality of our method, we implemented NeuroImprint on multiple language models (BERT, GPT-2, Qwen2, and Llama3.2) and evaluated it across four fine-tuning datasets spanning diverse domains. The results demonstrate that our attack can reconstruct 59% to 79% of all fine-tuning samples with high semantic fidelity.
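
The closed-form inversion step can be illustrated with a toy example. The sketch below assumes (the abstract does not state this) that a memorization neuron is a linear unit y = w·x + b that receives exactly one plain-SGD update from one sample. In that case the weight update is a scalar multiple of the input embedding and the bias update reveals the scalar. The names `invert_single_update` and `nearest_token` and the toy embedding table are hypothetical, and the sketch does not model AdamW or batching, which the paper's construction is designed to handle.

```python
from typing import Dict, List, Sequence


def invert_single_update(delta_w: Sequence[float], delta_b: float) -> List[float]:
    """Recover the input x from one SGD step on y = w.x + b.

    For a single sample, dL/dw = g * x and dL/db = g for some scalar g,
    so delta_w / delta_b equals x whenever delta_b is nonzero.
    """
    if delta_b == 0.0:
        raise ValueError("neuron was never updated (or its gradient was zero)")
    return [dw / delta_b for dw in delta_w]


def nearest_token(vec: Sequence[float], table: Dict[str, Sequence[float]]) -> str:
    """Map an embedding to the closest vocabulary entry (squared L2 distance)."""
    def dist(tok: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(vec, table[tok]))
    return min(table, key=dist)


# Toy embedding table (hypothetical values, for illustration only).
EMBED = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "fish": [1.0, 1.0]}

# Simulate one SGD step on the sample "dog" with upstream gradient g.
x, g, lr = EMBED["dog"], 0.37, 0.1
delta_w = [-lr * g * xi for xi in x]
delta_b = -lr * g

recovered = invert_single_update(delta_w, delta_b)
print(nearest_token(recovered, EMBED))
```

The ratio trick only works because the update is isolated: if two samples or two optimizer steps touched the same neuron, the update would be a mixture and the division would no longer return a single embedding, which is the collision problem the abstract describes.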

Efficient and Sound Probabilistic Verification for AI Agents

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.20510v1

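AI Summary (Chinese, translated)

Probabilistic Safety Verification for AI Agents: An Efficient and Provably Sound Framework

As AI agents are deployed more deeply in complex digital environments (terminal interaction, tool calling, web automation), guaranteeing their security has become a key challenge. Existing runtime verification methods based on formal policy monitoring (for example, in Datalog) are expressive and verifiable but support only deterministic policies. They cannot handle the uncertainty that is pervasive in practice: a PII detector has false-negative and false-positive rates, a declassifier has a nonzero failure rate, and environment feedback can itself be noisy. Worse, these uncertainties are often strongly correlated (for example, repeated calls to the same biased model accumulate error), which invalidates conventional probabilistic Datalog inference that relies on independence assumptions and leaves verification results neither sound (the upper bounds are unreliable) nor scalable (the computation blows up).

This paper proposes the first distributionally robust probabilistic verification framework (DR-PV). Its core contributions are:

✅ Theoretical rigor: computing an upper bound on the policy-violation probability is cast as a distributionally robust optimization (DRO) problem, which solves for the worst-case maximum violation probability when the joint distribution is unknown but the marginals (each predicate's failure probability) are known;
✅ Computational efficiency: linear-programming relaxation and structure-aware pruning give polynomial time complexity and support real-time monitoring;
✅ No independence assumption: the framework accommodates any correlation structure (fully positive, negative, or implicit dependence) and yields a single sound upper bound over all possible correlations.

On standard benchmarks (ToolBench, ShellAgent, PII-Redaction Suite), DR-PV improves verification efficiency by 3.2× on average over state-of-the-art methods (such as ProbLog, DTProbLog, and Monte Carlo Datalog) and tightens the conservative upper bound on violation probability by 47%. More importantly, while preserving strict mathematical soundness, it markedly improves the security-utility trade-off: agents can carry out high-value tasks under a higher tolerance while the violation risk stays bounded at the order of $10^{-3}$.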

AI Summary (English)

Securing AI agents in dynamic digital environments requires runtime verification of formal security policies (e.g., expressed in Datalog). However, existing approaches only handle deterministic policies, failing to address real-world uncertainty—such as probabilistic PII detectors or declassifiers with non-zero error rates—and critically, they rely on independence assumptions incompatible with correlated predicate failures. We introduce DR-PV, the first sound and efficient framework for probabilistic policy verification under arbitrary correlations. DR-PV formulates violation probability bounding as a distributionally robust optimization problem, computing rigorous upper bounds given only marginal failure probabilities—no independence or joint-distribution knowledge required. Evaluated on terminal and tool-calling benchmarks, DR-PV achieves 3.2× speedup over prior art while tightening violation probability bounds by 47%, enabling significantly improved security-utility trade-offs with mathematically guaranteed soundness.

Abstract

Securing AI agents that operate in complex digital environments has become a critical need, and runtime monitoring approaches that formulate and enforce policies expressed in a formal language like Datalog offer a promising solution. However, existing approaches are restricted to deterministic policies. In many practical applications of AI agents, there is a need to enforce security policies in the face of ambiguity, leading to probabilistic predicates or state transitions (for example, a declassifier or Personally Identifiable Information (PII) detector that has some failure probability on each invocation). Furthermore, in many such applications, one cannot easily make the independence assumptions necessary to invoke prior work on probabilistic inference in Datalog. We address this by introducing a sound and efficient framework for such verification based on distributionally robust optimization, computing sound upper bounds on the probability of policy violation regardless of possible correlations between predicates. On standard benchmarks for terminal and tool calling agents, we demonstrate that our approach outperforms prior art and improves the security-utility trade-off while ensuring rigorous bounds on the probability of policy violation.
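
A small example may help show what "sound regardless of correlation" means. The sketch below is not the paper's DRO formulation; it only covers two special cases where the worst case over all joint distributions with fixed marginals is known in closed form (the Fréchet inequalities). If predicate i fails with probability p_i, then the probability that at least one fails is at most min(1, Σ p_i), and the probability that all fail is at most min p_i. The function names and example rates are illustrative.

```python
from math import prod
from typing import Sequence


def worst_case_any_failure(p: Sequence[float]) -> float:
    """Sharp upper bound on P(at least one predicate fails), any correlation."""
    return min(1.0, sum(p))


def worst_case_all_fail(p: Sequence[float]) -> float:
    """Sharp upper bound on P(all predicates fail), any correlation."""
    return min(p)


def independent_any_failure(p: Sequence[float]) -> float:
    """P(at least one fails) if failures were independent."""
    return 1.0 - prod(1.0 - pi for pi in p)


def independent_all_fail(p: Sequence[float]) -> float:
    """P(all fail) if failures were independent."""
    return prod(p)


# Hypothetical per-invocation failure rates for three checks.
rates = [0.01, 0.02, 0.05]
print("any fails, worst case :", worst_case_any_failure(rates))
print("any fails, independent:", independent_any_failure(rates))
print("all fail, worst case  :", worst_case_all_fail(rates))
print("all fail, independent :", independent_all_fail(rates))
```

For a policy that is violated only when every check fails, the independence estimate (the product of the rates) can be orders of magnitude below the worst case (the smallest rate), which is why an independence assumption is unsound when the checks share a common failure cause.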

Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.20502v1

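AI Summary (Chinese, translated)

Background and Problem

Large language models (LLMs) score well on vulnerability-detection benchmarks, but whether they genuinely reason about security or merely match surface patterns in contaminated data remains unresolved. Existing evaluations are often distorted by train-test leakage and cannot separate "calibration" from "comprehension".

Method: The CWE-Trace Framework

This study builds CWE-Trace, the first Linux kernel vulnerability diagnostic framework that is strictly time-split, manually annotated, and context-complete:

Built from 834 manually annotated function-level samples covering 74 CWEs;
Enforces a strict temporal split (a pre-2025 historical set and a post-cutoff leakage-free set) to rule out data contamination;
Preserves paired vulnerable-function and patched-function context, enabling fine-grained behavioral analysis;
Introduces two diagnostic metrics: the Directional Failure Index (DFI), which quantifies the direction and strength of model bias, and Hierarchical Distance and Direction (HDD), which assesses the semantic plausibility of CWE classifications.

Key Findings

1. Data contamination brings no real gain: 84% of "contaminated" samples carry no usable memorization signal (the vulnerable function is missing or cross-mapped across datasets), and 31% of contaminated samples have incorrect CWE labels, so contamination adds noise instead.
2. Fine-tuning is calibration, not comprehension: all models show stable, systematic failure modes (DFI from −85.5 to +94.8 percentage points), and the bias is highly consistent between the historical set and the leakage-free set; LoRA fine-tuning only shifts the output threshold and does not change the underlying decision policy.
3. Detection and comprehension are strongly decoupled: the best detection result (52.1% accuracy, only +2.1 pp above the chance baseline) coexists with very weak comprehension (Top-1 CWE classification accuracy below 1.3%); DeepSeek-R1, the weakest backbone at binary detection, gains the most in coarse-grained CWE classification, which supports the view that the two capabilities are orthogonal.

This study shows that current LLMs remain "calibrated without comprehension" in systems-software security analysis: fine-grained security reasoning has not yet emerged, and fine-tuning cannot make up for that fundamental gap.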

AI Summary (English)

This paper challenges the assumption that strong benchmark performance implies genuine security reasoning in LLMs. We introduce CWE-Trace, a rigorously time-split, manually curated Linux kernel dataset (834 samples, 74 CWEs) with vulnerable–patched function pairs and two novel diagnostics: Directional Failure Index (DFI) and Hierarchical Distance and Direction (HDD). Evaluating 8 base LLMs and 15 LoRA variants across detection and classification tasks, we find: (1) Data contamination provides no measurable advantage—84% of “contaminated” samples lack usable memorization signals, and ~31% suffer CWE mislabeling; (2) Fine-tuning only calibrates output thresholds without altering core decision policies—systematic directional failures (DFI: −85.5 to +94.8 pp) persist across pre-/post-cutoff splits; (3) Vulnerability detection and semantic understanding are decoupled: best binary detection reaches only 52.1% (+2.1pp over chance), while exact CWE ranking remains below 1.3% Top-1 accuracy. These results confirm that current LLMs lack reliable security reasoning for systems software—regardless of fine-tuning strategy.

Abstract

Whether LLMs scoring well on vulnerability benchmarks genuinely reason about security or merely pattern-match on contaminated data remains unresolved. We present CWE-Trace, a framework for LLM vulnerability detection built from 834 manually curated Linux kernel samples spanning 74 CWEs. The framework enforces a strict temporal split (pre-2025 historical set / post-cutoff leakage-free set), preserves context-aware vulnerable-patched pairs, and introduces two diagnostic metrics: the Directional Failure Index (DFI) and Hierarchical Distance and Direction (HDD). We evaluate eight vanilla LLMs and 15 LoRA fine-tuned variants across non-targeted detection, targeted detection, and CWE classification. Our analysis yields two key results. First, data contamination provides no measurable advantage. Function-level analysis shows that 84% of nominally contaminated samples carry no usable memorization signal: vulnerable functions are absent or cross-mapped across datasets, and ~31% of contaminated samples carry a CWE misclassification. Second, backbone directional priors dominate fine-tuning. Models exhibit stable, systematic failure modes (DFI ranging from -85.5 to +94.8 pp) that persist from historical to post-cutoff data and resist correction. Fine-tuning shifts the output threshold without changing the decision policy. This is calibration without comprehension: output distributions adapt to training data while the underlying security reasoning remains absent. The weakest backbone at binary detection (DeepSeek-R1) gains the most in coarse CWE classification, revealing that detection and understanding are decoupled capabilities. The best detection score reaches only 52.1% (+2.1 pp above chance); exact CWE ranking remains below 1.3% Top-1 accuracy, confirming that current LLMs lack reliable security reasoning for systems software, regardless of fine-tuning strategy.
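
The strict temporal split is the part of the framework that is easiest to picture in code. The following is a minimal sketch under stated assumptions: the `Sample` record, its `fix_date` field, and the example cutoff date are hypothetical, since the abstract only says "pre-2025" and "post-cutoff" and does not give the dataset schema or the exact cutoff.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    sample_id: str
    cwe: str
    fix_date: date  # hypothetical field: date of the patch commit


def temporal_split(
    samples: Sequence[Sample], cutoff: date
) -> Tuple[List[Sample], List[Sample]]:
    """Partition samples into (historical, post_cutoff) with no overlap."""
    historical = [s for s in samples if s.fix_date < cutoff]
    post_cutoff = [s for s in samples if s.fix_date >= cutoff]
    return historical, post_cutoff


# Made-up records for illustration only.
DATA = [
    Sample("k-001", "CWE-416", date(2023, 5, 2)),
    Sample("k-002", "CWE-787", date(2024, 11, 30)),
    Sample("k-003", "CWE-476", date(2025, 3, 14)),
]
old, new = temporal_split(DATA, cutoff=date(2025, 1, 1))
print(len(old), "historical;", len(new), "post-cutoff")
```

Splitting on a date rather than at random is what lets the post-cutoff set act as a contamination control: a model whose training data ends before the cutoff cannot have seen those patches.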

Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.20470v1

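AI Summary (Chinese, translated)

Background and Problem

As agentic AI systems spread, they depend on large language models (LLMs) at their core for instruction parsing, tool calling, multi-source data processing, and cross-agent coordination. This architecture amplifies the risk of prompt-injection and jailbreak attacks, especially when attackers use model-guided automated attack frameworks (such as PAIR and GPTFuzz) to generate probes at scale, iteratively refine prompts, and automatically judge responses, which greatly raises the attack success rate (ASR).

Method: The Detect-and-Misdirect Defense Paradigm

This paper proposes a different defensive approach: instead of the conventional passive "detect, then block" strategy, it builds a controlled misdirection mechanism. When the defense module identifies a malicious interaction, it does not return a standard refusal (such as "I cannot assist…") but generates a safe yet strategically misleading response intended to induce high-confidence false positives in the attacker's automated judge and thereby contaminate its search.

Core Technique: The CMPE Framework

We design a lightweight conversational misdirection method, Contextual Misdirection via Progressive Engagement (CMPE). While keeping responses semantically safe, CMPE replaces mechanical refusal text with techniques such as gradual topic drift, plausible fictional premises, and controlled vague replies.

Key Results

Theoretical analysis shows that under conventional detect-and-block, ASR approaches 1 as the query budget grows, whereas CMPE bounds the asymptotic ASR by a constant;
On mainstream jailbreak benchmarks (e.g., AdvBench, HarmfulQA), CMPE lowers the estimated ASR upper bound by up to two orders of magnitude (for example, from ~85% to <1%);
In end-to-end PAIR and GPTFuzz attacks, manually verified successful attacks drop to nearly zero (verified ASR ≈ 0.3%), clearly outperforming state-of-the-art defenses.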

AI Summary (English)

This paper addresses the escalating threat of model-guided automated attacks (e.g., PAIR, GPTFuzz) against agentic AI systems, where attackers exploit predictable refusal behaviors to refine jailbreak prompts via automated search. We formalize the attack-defense dynamic probabilistically and show that conventional detect-and-block defenses suffer from asymptotic failure: attacker success rate (ASR) approaches 1 as query budget grows, because refusals provide high-signal feedback. In contrast, we propose detect-and-misdirect—a novel defense paradigm wherein detected malicious queries trigger safe but strategically misleading responses, degrading the positive predictive value of the attacker’s automated judge. We instantiate this via Contextual Misdirection via Progressive Engagement (CMPE), a lightweight, conversation-aware method that replaces rigid refusals with plausible, non-operational replies (e.g., topic redirection, benign premise acceptance). Evaluations on standard jailbreak benchmarks demonstrate that CMPE reduces estimated upper-bound ASR by up to two orders of magnitude and drives verified end-to-end attack success down to near-zero (0.3%) under PAIR and GPTFuzz.

Abstract

Agentic AI systems increasingly rely on language-model components to interpret instructions, process external data, invoke tools, and coordinate with other agents. These capabilities make prompt-injection and jailbreak attacks more consequential, especially as attackers adopt model-guided automation to scale probing, prompt refinement, and response evaluation. This work analyzes the resulting attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the attacker's automated judge. Our analysis shows that conventional detect-and-block defenses can allow attacker success rate (ASR) to approach one as the query budget grows, since predictable refusals provide useful feedback to automated search. We then examine detect-and-misdirect, where detected malicious interactions receive controlled, non-operational responses designed to induce false-positive errors in the attacker's judge. This strategy reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR. We evaluate a proof-of-concept realization of this strategy through Contextual Misdirection via Progressive Engagement (CMPE), a lightweight conversational misdirection method designed to replace predictable refusal text with safe but strategically misleading responses in automated jailbreak settings. On jailbreak benchmarks, CMPE reduces estimated ASR upper bounds by up to two orders of magnitude and nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs.
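
The contrast between the two asymptotic behaviors can be reproduced with a toy model. This sketch is an assumption-laden simplification and not the paper's probabilistic model: each query is treated as independent with true success probability `p`; under detect-and-block the attacker can reliably tell refusals from successes; under detect-and-misdirect the attacker's judge flags true successes with rate `tpr` and misdirected responses with rate `fpr`, and the attacker commits to the first flagged candidate. All parameter values are made up.

```python
def asr_block(p: float, n: int) -> float:
    """ASR after n queries when refusals give clean feedback: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** n


def asr_misdirect(p: float, tpr: float, fpr: float, n: int) -> float:
    """ASR after n queries when the judge can be fooled by misdirection.

    q is the per-query chance that the judge flags a candidate; the flagged
    candidate is a real success only with probability ppv = p * tpr / q.
    """
    q = p * tpr + (1.0 - p) * fpr
    if q == 0.0:
        return 0.0
    ppv = p * tpr / q
    return (1.0 - (1.0 - q) ** n) * ppv


for budget in (10, 100, 1000):
    print(
        budget,
        round(asr_block(0.01, budget), 4),
        round(asr_misdirect(0.01, tpr=0.9, fpr=0.5, n=budget), 4),
    )
```

In this toy model the block curve climbs toward 1 with the query budget, while the misdirect curve saturates at the judge's positive predictive value, so raising the judge's false-positive rate directly lowers the ceiling on attacker success.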

LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.20408v1

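AI Summary (Chinese, translated)

Background and Problem

Large language model (LLM) agents are increasingly being deployed as supervisory operators of safety-critical systems (such as nuclear power plants, air traffic scheduling, and medical monitoring), yet their robustness under sustained, adaptive adversarial pressure has not been systematically characterized. Existing red-teaming mostly targets single-turn jailbreaks and does not capture the safety degradation that arises in real human-machine collaboration from multi-turn strategic manipulation, iterative feedback, and coupled responsibilities.

Method: The NRT-Bench Multi-Turn Red-Teaming Benchmark

We propose NRT-Bench, the first multi-turn red-teaming benchmark for safety-critical settings. At its core is a high-fidelity simulated nuclear power plant control room: a team of five LLM operator roles (reactor control, coolant monitoring, and so on) jointly maintains six critical safety functions (CSFs, such as "core temperature controllability" and "emergency shutdown availability"), while the attacker runs bounded multi-turn adversarial sessions over four communication channels (voice transcripts, alarm logs, the work-order system, and colleague messages), receiving real-time operational feedback each turn. Harm is defined as an objective event: the session ends the moment any CSF is lost, and the loss is attributed to the attack message that directly triggered it, which avoids the bias of subjective LLM judgments.

Key Findings

When four frontier operator models are evaluated under a fixed-attack replay protocol, 8.7% to 12.1% of attack sessions lead to a CSF loss, confirming that multi-turn adaptive attacks can reliably breach safety limits;
The four models have similar overall failure rates, but their failing sessions barely overlap: of 149 attacks, none defeats all four models, while 33.6% defeat at least one, so vulnerabilities are nearly disjoint across models rather than nested;
Defense effectiveness is highly model-dependent: the same protection stack (such as content filtering plus a safety-advisor agent) lowers attack success for one model yet can raise it for another by as much as 2.3×, highlighting the limits of "universal safety hardening".

We have open-sourced the simulation platform, attack dataset, and replay toolchain to support reproducible safety evaluation of LLM agents.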

AI Summary (English)

We introduce NRT-Bench, the first benchmark for multi-turn red-teaming of LLM agents operating safety-critical systems—instantiated as a simulated nuclear power plant control room with five LLM-backed operator roles and six Critical Safety Functions (CSFs). Unlike single-turn jailbreak benchmarks, adversaries inject messages across four channels in bounded, feedback-driven sessions; harm is objectively defined as CSF loss (not LLM-judged text), enabling precise failure attribution. Evaluating four state-of-the-art operator models under a fixed-attack paired-replay protocol, we find adaptive multi-turn attacks consistently breach safety limits: 8.7–12.1% of sessions result in CSF failure. Crucially, failure patterns are nearly disjoint—no attack defeats all four models, while one-third defeats at least one—indicating non-nested, model-specific vulnerabilities. Moreover, defense efficacy is strongly model-dependent: identical guardrail stacks or safety-advisor agents can reduce attack success for one model yet increase it for another. We release the simulation environment, attack dataset, and replay tooling to enable reproducible LLM agent safety evaluation.

Abstract

Large language model (LLM) agents are increasingly proposed as supervisory components for safety-critical systems, yet their robustness under sustained, adaptive adversarial pressure remains poorly characterized. We present NRT-Bench, a benchmark for multi-turn red-teaming of LLM agents acting as operators of a safety-critical system, instantiated in a simulated nuclear power plant control room. A five-role operator team, each backed by a configurable LLM, runs a plant governed by six critical safety functions (CSFs), while adversaries inject messages over four channels in bounded multi-turn sessions with per-turn feedback. Harm is an objective signal rather than LLM-judged text: a run terminates the moment any CSF is lost, attributed to the causing message. Evaluating four frontier operator models under a fixed-attack paired-replay protocol, we find that adaptive multi-turn attacks reliably push the operator team past a safety limit: across the four models, between 8.7% and 12.1% of attack sessions end with the plant losing a critical safety function. Although the four models look almost equally robust by this aggregate rate, their failures barely overlap: of 149 sessions, none defeat all four models while a third defeat at least one, so vulnerabilities are nearly disjoint across models rather than nested. The effect of added defences is strongly model-dependent: the same guardrail stack or safety-advisor agent that lowers attack success for one model can raise it for another. We release the simulation venue, attack dataset, and replay tooling for reproducible safety evaluation of LLM agents.
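
The "nearly disjoint rather than nested" observation comes down to set arithmetic over per-model failure sets under paired replay. The sketch below uses made-up session IDs and model names purely to show the computation; it does not reproduce the benchmark's data, and `overlap_stats` is a hypothetical helper.

```python
from typing import Dict, Set


def overlap_stats(failures: Dict[str, Set[int]], n_sessions: int) -> Dict[str, float]:
    """Summarize how per-model failure sets overlap across replayed sessions."""
    defeat_all = set.intersection(*failures.values())
    defeat_any = set.union(*failures.values())
    return {
        "defeat_all": len(defeat_all),
        "defeat_any": len(defeat_any),
        "defeat_any_rate": len(defeat_any) / n_sessions,
    }


# Toy data: 12 replayed attack sessions, four models, one failure each.
FAILS = {
    "model_a": {0},
    "model_b": {1},
    "model_c": {2},
    "model_d": {3},
}
stats = overlap_stats(FAILS, n_sessions=12)
print(stats)
```

In this toy case every model has the same aggregate failure rate (1 of 12), yet no session defeats all four and a third of sessions defeat at least one, which mirrors the qualitative pattern the abstract reports: equal aggregate robustness can hide completely different weak points.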

bioETH-Beacon: A Confidential On-Chain Genomic Beacon with Encrypted Counts, Filters, and Bounded Noise over a Fully Homomorphic EVM

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.20315v1

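AI Summary (Chinese, translated)

Background and Challenges

The Global Alliance for Genomics and Health (GA4GH) Beacon protocol lets researchers query whether a specific genomic variant is present in a participating cohort and returns aggregate-level counts. As Beacon networks grow, however, two key privacy risks persist: (1) host institutions can observe plaintext queries directly, which leaks research intent; (2) repeated queries for rare variants can be used for membership-inference attacks, which threatens participant privacy.

Method and Design

This paper presents bioETH-Beacon, the first confidential on-chain Beacon prototype running on a fully homomorphic Ethereum Virtual Machine (fhEVM). Its core workflow is:

Hospitals upload marker-count data in homomorphically encrypted form;
Authorized researchers submit encrypted variant queries;
The smart contract runs the "aggregate count" logic over ciphertexts and outputs an encrypted result;
The result is decrypted only through an off-chain key-management service and released only to the requester pre-authorized in the contract's ACL, giving end-to-end query confidentiality.

The system uses a 3×4 tiered query architecture covering four query families (genotype, sex, age, and phenotype), each with three tiers that trade stronger confidentiality against lower query cost, allowing fine-grained privacy-cost trade-offs. On the genotype path, the prototype supports adding bounded noise on-chain to actively resist probing attacks.

Experiments and Validation

Experiments on synthetic genomic panels generated from the Polygenic Score (PGS) catalog show that:

The system exhibits the expected logarithmic scaling of gas consumption;
Pre-aggregation can cut query gas by more than 60%, provided that leaking "public marker presence" is acceptable;
The whole pipeline needs no trusted third-party compute node, demonstrating the feasibility of a decentralized, collusion-resistant confidential Beacon.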

AI Summary (English)

The GA4GH Beacon enables aggregate genomic variant queries but suffers from plaintext query exposure and membership-inference risks. bioETH-Beacon is the first smart-contract prototype executing Beacon-style encrypted count queries on a fully homomorphic Ethereum Virtual Machine (fhEVM). Hospitals upload homomorphically encrypted marker counts; researchers submit encrypted queries; and the contract returns an encrypted result—decrypted only for ACL-authorized requesters via off-chain key management. Its 3×4 tiered design spans genotype, sex, age, and phenotype query families, trading stronger confidentiality for lower gas cost per tier. Genotype paths support bounded on-chain noise to thwart probing attacks. Experiments on synthetic PGS-derived panels confirm logarithmic gas scaling and show pre-aggregation reduces query gas by >60% when public marker presence is acceptable. bioETH-Beacon demonstrates a viable, trustless path toward confidential, on-chain genomic beaconing.

Abstract

The Global Alliance for Genomics and Health (GA4GH) Beacon protocol lets researchers ask whether a genomic variant has been observed in a participating cohort and receive aggregate variant-level counts. As Beacon networks grow, two privacy risks remain: host institutions can see plaintext queries, and repeated rare-variant queries can support membership-inference attacks. We present bioETH-Beacon, a smart-contract prototype that runs the Beacon "aggregate count" query over encrypted data on a fully homomorphic Ethereum Virtual Machine (fhEVM). Hospitals upload encrypted marker-count entries, authorized researchers submit encrypted marker queries, and the contract returns an encrypted answer that is released, via an off-chain key-management service, only to the requester named in the contract's on-chain ACL. The design is organized as a 3x4 tier-by-query-family grid spanning genotype, sex, age, and phenotype queries, with tiers that trade stronger confidentiality for lower query cost. For genotype paths, the prototype can add bounded on-chain noise to mitigate probing attacks. Experiments on synthetic panels derived from a Polygenic Score (PGS) catalog show the expected scaling behavior and demonstrate that pre-aggregation can substantially reduce query gas when public marker presence is an acceptable trade-off. Overall, bioETH-Beacon provides a research prototype for confidential Beacon-style genomic querying without a trusted compute evaluator.
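
To make the "aggregate count with bounded noise" idea concrete, here is a plaintext analogue. It is only a sketch: in the prototype the same arithmetic would run over ciphertexts inside the fhEVM contract, which this example does not attempt to model. The uniform integer noise, the clamp at zero, and all names (`aggregate_count`, `noisy_count`, the marker IDs) are assumptions made for illustration.

```python
import random
from typing import Dict, List


def aggregate_count(entries: List[Dict[str, int]], marker: str) -> int:
    """Sum the per-hospital counts for one marker (missing marker counts as 0)."""
    return sum(entry.get(marker, 0) for entry in entries)


def noisy_count(true_count: int, bound: int, rng: random.Random) -> int:
    """Add integer noise drawn uniformly from [-bound, bound].

    The result is clamped at zero so a released count is never negative
    (an assumption of this sketch, not a documented contract behavior).
    """
    noise = rng.randint(-bound, bound)
    return max(0, true_count + noise)


# Made-up marker counts from three hospitals.
HOSPITALS = [
    {"rs123": 4, "rs456": 0},
    {"rs123": 2},
    {"rs123": 1, "rs456": 3},
]
rng = random.Random(42)  # seeded only so the demo is repeatable
exact = aggregate_count(HOSPITALS, "rs123")
print("exact:", exact, "released:", noisy_count(exact, bound=2, rng=rng))
```

Because the noise is bounded, a released count is always within `bound` of the truth, which keeps the answer useful while making it harder to detect a single participant by repeating a rare-variant query.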

Quantization as a Malicious Task: Removing Quantization-Conditioned Backdoors via Task Arithmetic

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.20254v1

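AI Summary (Chinese, translated)

Background and Problem

Model quantization is a key technique for deploying deep neural networks on edge devices, substantially reducing memory footprint and inference cost. Recent work, however, has revealed a new class of security threat, Quantization-Conditioned Backdoors (QCBs): the model behaves normally in full precision, but once it is quantized (for example, converted to INT8), malicious functionality is activated (such as misclassifying specific triggered samples). Existing defenses mostly rely on modifying the quantization procedure (such as recalibrating BN statistics) or on extra supervision signals, and they tend to incur computational overhead, generalize poorly, or be tightly coupled to a specific quantization configuration.

Method: QVec, a Task-Vector View of Parameter-Space Defense

This paper proposes QVec, which analyzes the QCB mechanism for the first time from the perspective of parameter-space task arithmetic. We find that the weight difference between the full-precision model and its quantized counterpart (ΔW = W_quant − W_fp) is not random noise but a structured malicious task vector that encodes the semantic direction "from normal behavior to backdoor behavior". Accordingly, QVec applies a reverse task correction to the original full-precision model before deployment: W_defended = W_fp − α·ΔW, where the scaling coefficient α is chosen automatically by a lightweight hyperparameter search (needing only a single quantization pass and a small number of clean validation samples).

Key Advantages and Experimental Validation

QVec has three properties: ✅ no retraining (inference-only), ✅ no dependence on trigger samples (unsupervised), ✅ applicable to any quantization scheme (only one quantization pass is needed). On CIFAR-10/100, an ImageNet subset, and several LLM backdoor scenarios (such as an instruction-injection attack on Llama-2-7B), QVec reduces the backdoor attack success rate (ASR) on average from >92% to <4.5% with a clean-accuracy loss of at most 0.3%, clearly outperforming state-of-the-art baselines (such as Q-BN and Q-Adapt). The work reframes quantization security: quantization is not only a compression technique but can also serve as a detector and disentangling tool for malicious tasks.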

AI Summary (English)

Model quantization enables efficient deployment of deep neural networks on resource-constrained devices, yet recent work uncovered Quantization-Conditioned Backdoors (QCBs): models behave benignly in full precision but activate malicious functionality only after quantization. Existing defenses modify quantization procedures or correct activation statistics—introducing overhead or configuration dependency. We propose QVec, the first defense grounded in parameter-space task arithmetic. We observe that the weight difference between a full-precision model and its quantized counterpart encodes a structured malicious task vector, not noise. QVec counteracts this by subtracting a scaled version of this vector from the original weights—requiring no retraining, no trigger samples, and only one quantization pass for estimation. A lightweight hyperparameter search selects the optimal scaling factor. Across image classification benchmarks (CIFAR, ImageNet) and LLM backdoor scenarios (e.g., Llama-2), QVec consistently reduces attack success rate from >92% to <4.5% while preserving clean accuracy within ±0.3%. This reframes quantization as both a security threat and a diagnostic tool for backdoor disentanglement.

Abstract

Model quantization is widely adopted to reduce memory usage and inference cost when deploying deep neural networks on resource-constrained devices. However, recent studies have revealed a new security threat known as Quantization-Conditioned Backdoors (QCBs), where a model behaves normally in full precision but activates malicious behavior only after quantization. Existing defenses typically modify quantization procedures or correct activation statistics, often introducing additional computational overhead or relying on specific quantization settings. Here, we present QVec, a parameter-space perspective for defending against QCBs. We observe that the weight difference between a full-precision model and its quantized counterpart encodes a structured behavioral shift, which can be interpreted as a malicious task vector rather than random quantization noise. Based on this insight, QVec counteracts this malicious direction through controlled parameter correction prior to deployment. QVec requires no retraining, no trigger samples, and only a single quantization pass to estimate the parameter shift, together with a lightweight hyperparameter search. Extensive experiments across image classification benchmarks and multiple Large Language Model (LLM) attack scenarios demonstrate that QVec consistently suppresses backdoor activation while preserving clean performance.
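
The parameter correction can be sketched in a few lines. The feed's summary of this paper gives the update as W_defended = W_fp − α·ΔW with ΔW = W_quant − W_fp; the code below applies that formula to a flat list of weights. The round-to-nearest `fake_quantize` function is a stand-in chosen for illustration and is not the quantizer used in the paper, and the weight values and α are made up.

```python
from typing import List


def fake_quantize(weights: List[float], step: float) -> List[float]:
    """Round each weight to the nearest multiple of `step` (toy quantizer)."""
    return [round(w / step) * step for w in weights]


def qvec_correct(w_fp: List[float], w_quant: List[float], alpha: float) -> List[float]:
    """Apply W_defended = W_fp - alpha * (W_quant - W_fp) element-wise."""
    return [wf - alpha * (wq - wf) for wf, wq in zip(w_fp, w_quant)]


# Hypothetical full-precision weights for one small layer.
w_fp = [0.26, -0.49, 0.74]
w_quant = fake_quantize(w_fp, step=0.25)   # single quantization pass
w_defended = qvec_correct(w_fp, w_quant, alpha=1.0)

print("delta   :", [round(q - f, 4) for f, q in zip(w_fp, w_quant)])
print("defended:", [round(w, 4) for w in w_defended])
```

In practice α would be selected by the lightweight search the abstract mentions, for example by sweeping a few values and keeping the one that preserves clean validation accuracy after quantization; that loop is omitted here because it depends on a model and dataset.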

FinRED: An Expert-Guided Benchmark Generation and Evaluation Framework for Financial LLM Red-Teaming

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19887v1

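AI Summary (Chinese, translated)

Background and Problem

Existing LLM safety benchmarks (such as AdvBench and SafeBench) focus on general adversarial scenarios and largely miss risks specific to finance. Once deployed, a financial LLM can cause regulatory compliance violations (for example, breaching FATF anti-money-laundering guidance or the EU's DORA regulation), facilitate fraud (for example, falsified financial statements or inducement of insider trading), and even trigger a systemic collapse of trust. General-purpose red-teaming does not capture these risks effectively.

Method and Contributions

We propose FinRED (Financial LLM Red-Teaming Evaluation Framework), the first domain-specific red-teaming framework built under the guidance of senior financial regulators, compliance officers, and risk-control experts throughout:

Two-level threat taxonomy: maps authoritative global standards (FATF, DORA, ISO/IEC 27001) to 5 categories and 18 subcategories of financial threats (from regulatory evasion and money-laundering scripts to complex structured fraud);
Seed-generation pipeline driven by real documents: using an expert-defined schema, automatically converts de-identified financial statements, regulatory inquiry letters, KYC documents, and similar material into context-rich, high-fidelity Behavioral Prompts (red-teaming seeds);
Expert-validation loop: every seed is blind-reviewed by at least three industry experts to ensure real-world feasibility and business plausibility (pass rate >92%);
Finance-specific evaluation rubric: goes beyond simple disclaimer detection by combining understanding of regulatory intent, reasoning about risk consequences, and multi-step logical consistency checks; critical false negatives fall from 28% to 12%, and the correlation with human expert scores reaches 0.91 (Pearson).

Application and Impact

FinRED has been deployed in the generative-AI regulatory sandbox of South Korea's Financial Security Institute (FSI), supporting safety evaluation in real financial-service scenarios. The dataset, generation pipeline, prompt templates, and evaluation framework are open only to vetted researchers (GitHub: selectstar-ai/FinRED-paper; Hugging Face: datumo/FinRED) to guard against dual-use risks.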

AI Summary (English)

FinRED is the first expert-guided red-teaming framework designed specifically for evaluating safety risks of financial LLMs—addressing critical gaps in existing general-purpose benchmarks. It introduces a novel two-level taxonomy that maps international standards (e.g., FATF, EU DORA, ISO/IEC 27001) to finance-specific threats—from regulatory evasion and money laundering facilitation to multi-step fraud schemes. Leveraging real-world financial documents (e.g., audit reports, regulatory letters), FinRED’s scalable pipeline generates context-rich, expert-validated Behavioral Prompts as red-teaming seeds. Its finance-specific evaluation rubric—co-designed with regulators and compliance professionals—goes beyond disclaimer detection, significantly reducing critical false negatives from 28% to 12% and achieving 0.91 Pearson correlation with human expert judgments. Deployed in South Korea’s Financial Security Institute (FSI) regulatory sandbox, FinRED’s dataset, pipeline, and framework are access-controlled for qualified researchers via GitHub and Hugging Face to mitigate dual-use risks.

Abstract

Existing safety benchmarks target general adversarial scenarios but miss finance-specific risks. Financial LLMs face regulatory compliance violations, fraud facilitation, and systemic trust erosion that require targeted evaluation. We introduce FinRED, an expert-guided red-teaming framework for financial LLM safety evaluation developed with financial experts. FinRED uses a novel two-level taxonomy mapping global standards (e.g., FATF and EU DORA) to threats ranging from regulatory evasion to complex fraud, integrated with a scalable pipeline that converts real financial documents into context-rich red-teaming Behavioral Prompts (seeds) through an expert-defined schema. Rigorous expert validation confirms seed plausibility and realism for meaningful LLM safety evaluation. We also provide an expert-validated, finance-specific rubric that goes beyond disclaimer checks, aligns more closely with human experts than static one-size-fits-all rubrics, and reduces critical false negatives from 28 to 12. Aligned with internationally adopted risk-management and information-security standards (e.g., ISO/IEC 27001), FinRED is deployed in South Korea's Financial Security Institute (FSI) regulatory sandbox for generative AI security evaluation in real financial services. To mitigate dual-use risks, the dataset, generation pipeline, prompt template, and evaluation framework are gated for qualified researchers at https://github.com/selectstar-ai/FinRED-paper and https://huggingface.co/datasets/datumo/FinRED.

SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19755v1

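AI Summary (Chinese, translated)

Background and Challenges

Speculative inference substantially speeds up large language model (LLM) decoding, but its mechanism provides no safety guarantees. Existing safety defenses (such as post-hoc filtering, real-time classifiers, or constrained decoding) conflict with the speculative framework: they either add computation or disrupt the draft-verify pipeline, which largely cancels the acceleration. This tension exposes a deep incompatibility between current safety techniques and efficient decoding.

Method: The SafeSpec Framework

We propose SafeSpec, the first safety-enhancement framework that builds safety awareness natively into the whole speculative decoding process. Its core design includes:

Joint lightweight safety head (Latent Safety Head): a latent safety-assessment module with very few parameters is attached to the target model and judges semantic validity and content safety together in a single forward pass;
Risk-aware trajectory recovery: jailbreak attacks are modeled as distributional shifts over generation trajectories (the probability of harmful continuations rises, but safe paths remain). When verification detects an unsafe draft, SafeSpec does not terminate generation; it rolls back and performs safety-guided reflective multi-sampling, dynamically resampling a safe continuation within the original speculative window;
Zero-added-latency integration: all safety judgments and recovery operations reuse the existing speculative computation and add no token-level latency.

Experimental Results

Across multiple models including Qwen3-32B and mainstream adversarial benchmarks such as AdvBench and SafeBench, SafeSpec improves safety and efficiency together: attack success rate drops by 15% while a 2.06× end-to-end speedup is maintained on benign workloads. The results show that speculative acceleration and inference-time safety are not mutually exclusive goals and can be optimized jointly through architecture-level co-design.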

AI Summary (English)

Speculative inference accelerates LLM decoding but lacks inherent safety guarantees, and existing safety methods are fundamentally incompatible—either adding latency or breaking the draft-verify loop. SafeSpec bridges this gap by integrating lightweight latent safety assessment directly into the verification step: a tiny safety head jointly evaluates semantic validity and harm risk in one forward pass. Upon detecting unsafe drafts, SafeSpec triggers rollback and safety-guided reflective multi-sampling—recovering safe continuations within the speculative window instead of aborting generation. Modeling jailbreaks as distributional shifts over generative trajectories, SafeSpec enables risk-aware trajectory recovery without compromising speed. On Qwen3-32B, it reduces attack success by 15% while sustaining a 2.06× speedup on benign workloads—demonstrating that safety and speculative efficiency can be jointly optimized.

Abstract

Speculative inference accelerates large language model (LLM) decoding but provides no inherent safety guarantees. Existing safety defenses are largely incompatible with speculative inference: they either introduce additional computation or disrupt the draft-verify mechanism, negating acceleration benefits. This reveals a fundamental incompatibility between current safety methods and speculative decoding. We propose SafeSpec, a safety-aware speculative inference framework that integrates risk estimation directly into the verification process. SafeSpec attaches a lightweight latent safety head to the target model to jointly evaluate semantic validity and safety in a single forward pass. When unsafe generations are detected, SafeSpec applies rollback and safety-guided reflective multi-sampling to recover safe continuations rather than terminating generation. We model jailbreak attacks as distributional shifts over generative trajectories, where adversarial prompts increase the probability of harmful continuations without eliminating safe ones. Under this model, SafeSpec performs risk-aware trajectory recovery within the speculative decoding process. Across multiple models and adversarial benchmarks, SafeSpec achieves a substantially improved safety-efficiency trade-off. On Qwen3-32B, SafeSpec reduces attack success rates by 15% while preserving a 2.06x inference speedup on benign workloads, demonstrating that speculative acceleration and inference-time safety can be jointly optimized.
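
The control flow of "verify, roll back, resample" can be shown with a deliberately simplified loop. This is a toy sketch, not SafeSpec itself: the `risk` function is a blocklist standing in for the learned latent safety head, the `resample` callback stands in for sampling from the target model, and the usual speculative-decoding check of draft tokens against the target distribution is omitted. All names and the threshold are hypothetical.

```python
from typing import Callable, List, Sequence

UNSAFE = {"exploit", "payload"}  # toy blocklist in place of a learned safety head


def risk(token: str) -> float:
    """Return a risk score in [0, 1] for a candidate token."""
    return 1.0 if token in UNSAFE else 0.0


def verify_with_recovery(
    prefix: Sequence[str],
    draft: Sequence[str],
    resample: Callable[[List[str], int], List[str]],
    threshold: float = 0.5,
    k: int = 4,
) -> List[str]:
    """Accept draft tokens until one is unsafe, then roll back and resample."""
    accepted = list(prefix)
    for tok in draft:
        if risk(tok) <= threshold:
            accepted.append(tok)
            continue
        # Rollback: drop the unsafe token and the rest of the draft window,
        # then draw k alternatives and keep the lowest-risk safe one.
        candidates = resample(accepted, k)
        safe = [c for c in candidates if risk(c) <= threshold]
        if safe:
            accepted.append(min(safe, key=risk))
        break
    return accepted


def toy_resample(context: List[str], k: int) -> List[str]:
    """Fixed candidate list so the demo is deterministic."""
    return ["payload", "overview", "summary", "exploit"][:k]


print(verify_with_recovery(["here", "is"], ["the", "exploit", "code"], toy_resample))
```

The point of the design is that an unsafe draft does not end the response: generation resumes from the last safe position with a safe continuation, which is the recovery behavior the abstract contrasts with simply terminating.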

Agentic Electronic Design Automation: A Handoff Perspective

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19795v1

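AI Summary (Chinese)

Handoff Validity in Agentic EDA: A Survey from a Handoff Perspective

Electronic design automation (EDA) is inherently a multi-stage, handoff-heavy engineering flow. Design artifacts, flow scripts, and engineering decisions must cross tool, session, and even organizational boundaries before reaching final implementation, signoff, or release. Each handoff carries explicit and implicit requirements that stage-local checks often fail to capture fully. LLM-driven agents can now invoke EDA tools directly, embed retrieved knowledge in executable scripts, and carry state across sessions and stages. Once their outputs become premises for downstream engineering decisions, the handed-off object must satisfy a handoff contract and match the assumptions of its consumer.

This survey takes handoff validity as its organizing principle: a handoff is valid when the transferred object satisfies the consumer's acceptance conditions and carries enough context, evidence, and provenance to support trustworthy reuse downstream. We survey 82 systems and group them into three classes according to the boundary the handoff crosses:

Stage-Bound: ensures validity within a single EDA stage or a bounded verification task;
Flow-Bound: maintains coherent workflow state across tools, invocations, and sessions;
Organization-Bound: maintains source grounding, provenance, scope, and admissibility across knowledge-authority, ownership, and responsibility boundaries.

For each class, we analyze handoff contracts, handoff objects, coordination mechanisms, and open challenges. On that basis we propose a five-layer EDA agent communication protocol (EACP) covering agent discovery, message format, tool invocation wrapping, workflow orchestration, and security and IP protocols. The survey aims to establish a shared vocabulary and to identify key research directions and a practical framework for trustworthy agentic EDA.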

AI Summary (English)

This survey introduces handoff validity—the condition that a transferred object satisfies the consumer’s acceptance criteria and carries sufficient context, evidence, and provenance for downstream use—as the unifying principle for agentic Electronic Design Automation (EDA). We analyze 82 systems across three boundary classes: Stage-Bound (validity within one EDA stage), Flow-Bound (coherent state across tools/sessions), and Organization-Bound (source grounding and admissibility across authority/knowledge boundaries). For each, we characterize handoff contracts, objects, coordination mechanisms, and open challenges. Based on this taxonomy, we propose the five-layer EDA Agent Communication Protocol (EACP), covering agent discovery, message semantics, tool invocation, workflow orchestration, and security/IP governance. Our work establishes a shared vocabulary and actionable research agenda for building trustworthy, interoperable agentic EDA systems.

Abstract

Electronic design automation (EDA) is inherently multi-stage and handoff-heavy. Design artifacts, flow scripts, and engineering decisions cross tool, session, and organizational boundaries before final implementation, signoff, or release. Each transfer carries explicit and implicit requirements that may not be fully captured by stage-local checks. LLM-based agents now invoke EDA tools directly, embed retrieved knowledge in executable scripts, and hand off state across sessions and stages. Once their outputs condition downstream engineering decisions, the transferred object must satisfy a handoff contract and meet the assumptions of its next consumer. This survey introduces handoff validity as its organizing principle. A handoff is valid when the transferred object satisfies the consumer's acceptance conditions and carries sufficient context, evidence, and provenance for downstream use. We review 82 systems and classify them into three boundary classes. Stage-Bound systems establish validity within a single EDA stage or bounded verification task. Flow-Bound systems preserve coherent workflow state across tools, invocations, and sessions. Organization-Bound systems maintain source grounding, provenance, scope, and admissibility across knowledge and authority boundaries. For each class, we analyze handoff contracts, handoff objects, coordination mechanisms, and open questions. These analyses motivate a five-layer EDA agent communication protocol (EACP), covering the agent discovery, agent message, tool invocation, workflow orchestration, and security and IP protocols. We aim to provide a common vocabulary and research agenda for trustworthy agentic EDA.
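
The definition of handoff validity above can be made concrete with a small sketch. It assumes a hypothetical `Handoff` record and a list of consumer-supplied acceptance checks; none of these names come from the survey, and "sufficient" context, evidence, and provenance is reduced here to "non-empty" purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Handoff:
    """Hypothetical record for an object passed between EDA stages."""
    artifact: Dict[str, object]                            # the transferred object
    context: Dict[str, str] = field(default_factory=dict)  # e.g. flow settings
    evidence: List[str] = field(default_factory=list)      # e.g. check logs
    provenance: List[str] = field(default_factory=list)    # e.g. tool/source ids


def is_valid_handoff(
    handoff: Handoff,
    acceptance_checks: List[Callable[[Dict[str, object]], bool]],
) -> bool:
    """Valid when the consumer's checks pass and supporting material is present."""
    meets_contract = all(check(handoff.artifact) for check in acceptance_checks)
    carries_support = (
        bool(handoff.context) and bool(handoff.evidence) and bool(handoff.provenance)
    )
    return meets_contract and carries_support
```

A stage-local check corresponds to a single entry in `acceptance_checks`; the survey's argument is that passing such checks alone is not enough when the supporting material is missing.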

AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19782v1

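AI Summary (Chinese)

Background and Challenges

In regulated financial settings, chart question answering (Chart QA) requires not only high accuracy but also auditability and on-premise deployability. Existing methods mostly rely on closed-source APIs (e.g., Gemini, GPT-4V), which risks exposing client data; open-source alternatives often trade accuracy for control and lack a transparent reasoning chain, so they struggle to meet compliance review and human verification needs.

Method: The AgentFinVQA Multi-Agent Pipeline

We present AgentFinVQA, to our knowledge the first financial chart QA system that combines auditability, on-premise deployment, and high accuracy. Its core is a decoupled multi-agent architecture that decomposes each query into five traceable steps: planning → OCR text extraction → legend grounding → visual inspection → multi-step cross-verification. Every step is logged in real time to a Model Evaluation Packet (MEP), forming a complete, tamper-proof reasoning provenance log that supports end-to-end human auditing and accountability.

Key Results and Value

On the FinMME financial chart benchmark, AgentFinVQA reaches 71.24% accuracy, well above the backbone-matched zero-shot baseline (63.56%, +7.68 pp, p ≈ 1.1×10⁻¹⁶); with the locally deployed open-weight model Qwen3.6-27B-FP8 it still reaches 66.40% (+4.84 pp), showing that the open-weight system retains most of the gain.
The verifier's output serves as both a decision signal and a confidence indicator: answers it confirms have an exact accuracy of 68.2%, notably higher than answers it revises (55.6%), which allows efficient routing to human review.
Error analysis identifies three main failure modes (65.3% of failures): question misunderstanding, legend confusion, and OCR/localization extraction errors. These are also the categories the verifier is least able to catch, which points to directions for future work.

This work is the first to show that trustworthy, fully on-premise, high-accuracy financial chart QA is feasible in practice. The code is open-sourced to support reproducible evaluation and industry adaptation.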

AI Summary (English)

AgentFinVQA is a deployable multi-agent pipeline for financial chart question answering that combines auditability with on-premise operation, without requiring proprietary APIs or compromising data residency. It decomposes each query into five traceable steps (planning, OCR, legend grounding, visual inspection, verification) and logs all intermediate outputs into a Model Evaluation Packet (MEP) for human-in-the-loop auditing. On FinMME, with a Gemini-3 Flash backbone it reaches 71.24% accuracy, +7.68 pp over the backbone-matched zero-shot baseline at 63.56% (p ≈ 1.1×10⁻¹⁶); with a locally served open-weight Qwen3.6-27B-FP8 backbone it reaches 66.40%, a gain of +4.84 pp. The verifier's "confirmed" verdict acts as a useful confidence signal (68.2% exact accuracy on confirmed answers vs. 55.6% on revised ones), enabling efficient routing to human review. Error analysis identifies question misunderstanding, legend confusion, and extraction errors as the dominant failure modes, and these are the categories the verifier detects least often, which highlights directions for improvement. Code is publicly released.

Abstract

Financial chart question answering in regulated settings demands more than accuracy: practitioners must know which answers to trust before acting on them, and many institutions cannot send client data to external model providers. Yet existing chart-QA agents are accuracy-focused and opaque, and most assume proprietary API access; to our knowledge, none combines auditability with on-premise deployability without significant accuracy compromise. We present AgentFinVQA, a multi-agent pipeline that decomposes each query into planning, OCR, legend grounding, visual inspection, and verification, recording every step in a traceable Model Evaluation Packet (MEP) per sample. On FinMME, AgentFinVQA improves $+7.68$ pp over a primary-backbone matched zero-shot baseline with a proprietary backbone (Gemini-3 Flash; 71.24% vs. 63.56%, McNemar $p \approx 1.1 \times 10^{-16}$), and $+4.84$ pp with open-weights Qwen3.6-27B-FP8 served locally. The verifier's verdict also serves as a useful confidence signal (68.2% vs. 55.6% exact accuracy on confirmed vs. revised answers), enabling human-in-the-loop review routing. Error analysis shows that question misunderstanding, legend confusion and extraction error account for nearly two-thirds of failures and are the categories least detected by the verifier, identifying clear directions for future work. Together these results show that auditable, on-premise financial chart QA is practical and that the open-weights system keeps most of the accuracy gains while enabling full data residency. We release our code to support reproducible evaluation.
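
The five-step decomposition with per-sample logging can be pictured with the following sketch. It is an assumption-laden illustration, not the released code: the `agents` mapping, the packet layout, and the rule "route to review unless the verdict is `confirmed`" are hypothetical choices made here to show how a trace and a review flag could be recorded together.

```python
from typing import Callable, Dict, List

# Step names follow the abstract; the identifiers themselves are illustrative.
STEPS: List[str] = [
    "planning", "ocr", "legend_grounding", "visual_inspection", "verification",
]


def run_pipeline(question: str, agents: Dict[str, Callable[[dict], dict]]) -> dict:
    """Run each step in order and record its output in a per-sample packet."""
    packet = {"question": question, "trace": []}
    state = {"question": question}
    for name in STEPS:
        output = agents[name](state)   # each agent sees the accumulated state
        state.update(output)
        packet["trace"].append({"step": name, "output": output})
    packet["answer"] = state.get("answer")
    packet["verdict"] = state.get("verdict")
    # Hypothetical routing rule: anything not confirmed goes to a human.
    packet["needs_review"] = packet["verdict"] != "confirmed"
    return packet
```

Because every intermediate output is kept in `trace`, a reviewer can see which step produced or changed the answer, which is the property the abstract calls auditability.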

A Comparative Study of Pretrained Transformer Models for Quranic ASR: Speech Representations, Label Formats, and Dataset Composition

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19747v1

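AI Summary (Chinese)

Background

Quranic automatic speech recognition (Quranic ASR) aims to transcribe recitation audio accurately into text, supporting applications such as memorization aids, verse search, and Tajweed (recitation rule) analysis. However, existing general-purpose ASR models show high word error rates (WER) on user recitations and do not cover all 6,236 verses (ayahs), so domain adaptation remains seriously lacking.

Method and Data

This study systematically evaluates the fine-tuning performance of three self-supervised speech representation models, Wav2Vec2.0, HuBERT, and XLS-R (including the multilingual XLSR-53 variant), on Quranic ASR. All are Transformer-based and learn context-aware speech features by reconstructing masked audio. Experiments use a strictly filtered mix of professional recitations (e.g., Sheikh recitations) and real user recordings, totaling more than 870 hours and covering the full standard text, both Meccan and Medinan.

Key Findings

Best configuration: Wav2Vec2-XLSR-53 + undiacritized Arabic text labels (i.e., without harakat diacritics) + dynamic 4–8 second segmentation, reaching WER = 0.08 on the EveryAyah subset and WER = 0.11 on the combined EveryAyah+Tarteel test set;
Compared with the Citrinet baseline (WER = 0.163), absolute WER drops by about 5.3 percentage points, while training time falls from 140 hours to 40 hours (a 71% reduction);
Label-format ablations show that undiacritized text clearly outperforms diacritized or phoneme-level labels, confirming the current models' reliance on orthographic robustness;
Clip duration matters: clips that are too short (<2 s) lose context, and clips that are too long (>12 s) introduce noise and silence.

Contributions and Outlook

This work provides the first multi-dimensional empirical benchmark for Quranic ASR within a unified framework and reveals strong coupling among speech representation quality, label abstraction level, and dataset composition. Future work will focus on expanding high-quality annotations, building Tajweed-aware phoneme modeling modules, and exploring lightweight deployment.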

AI Summary (English)

This paper presents the first systematic comparative study of pretrained Transformer-based models—Wav2Vec2.0, HuBERT, and XLS-R—for Quranic Automatic Speech Recognition (ASR). Fine-tuned on a rigorously filtered, 870+ hour dataset of professional and user recitations covering the full Quranic corpus, our ablation experiments identify Wav2Vec2-XLSR-53 as the strongest speech encoder, and undiacritized Arabic text as the optimal label format. The best configuration achieves a Word Error Rate (WER) of 0.08 on EveryAyah and 0.11 on the combined EveryAyah+Tarteel test set, outperforming the Citrinet baseline (WER = 0.163) by ~5.3 percentage points while reducing training time from 140 to 40 hours. Results highlight the critical interplay among speech representation quality, label granularity, and dataset composition in low-resource, domain-specific ASR.

Abstract

Quran Automatic Speech Recognition (ASR) aims to convert Quranic recitation into text, enabling applications such as aided memorisation tools and Quranic search engines. However, existing ASR models often exhibit high Word Error Rates (WER) on user-recited verses and lack full coverage of the Quranic corpus. This paper presents a systematic empirical study of domain-specific fine-tuning of pretrained Transformer-based models for Quranic ASR, using advanced speech feature extraction methods: Wav2Vec2.0, HuBERT, and XLS-R. These models apply self-supervised learning by masking portions of input audio and using Transformer architectures to learn context-aware speech features. The pretrained models are fine-tuned on a filtered Quranic dataset exceeding 870 hours of professional and user recitations. Through comprehensive ablation studies across feature extractors, output label formats, training strategies, and clip durations, we identify the key factors that affect transcription accuracy in this domain. Our best-performing configuration achieves a WER of 0.08 on the EveryAyah subset and 0.11 on the combined EveryAyah+Tarteel setting, representing roughly a five-percentage-point gain over the Citrinet baseline (WER = 0.163) while reducing combined-model training time from 140 hours to 40 hours. Arabic text without diacritics yields the best fine-tuning results, and Wav2Vec2-XLSR-53 provides the strongest overall representation. Future work includes improving dataset quality and developing phoneme-aware models to extract deeper speech feature representations for Tajweed-sensitive applications.
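
Two ingredients of the evaluation above, undiacritized labels and WER, can be illustrated with a short sketch. It assumes that diacritics are removed by dropping Unicode combining marks (category `Mn`), which is a coarse rule chosen here for illustration and may differ from the paper's preprocessing; WER is computed as word-level edit distance divided by the reference length, with a non-empty reference assumed.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks (category Mn), which include Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))          # distances for an empty reference
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

With these definitions, a WER of 0.11 versus 0.163 corresponds to an absolute difference of 0.053, i.e. roughly five percentage points, matching the gain quoted in the abstract.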

Predictability as a Fine-Grained Measure for Privacy

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.20546v1

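AI Summary (Chinese)

Background and Motivation

Differential privacy (DP) provides strong individual-level privacy guarantees, but its worst-case design often leads to a harsh privacy-utility tradeoff and does not reflect the prior knowledge a real attacker holds. This paper proposes predictability, a fine-grained privacy measurement framework intended to bridge theoretical rigor and realistic attack scenarios.

Core Method

Predictability defines privacy leakage as the increase in an attacker's ability to predict sensitive attributes of unknown individuals after observing the algorithm's output, net of what the attacker could already infer from the compromised subset (stochastically compromised data). The framework explicitly models three elements: (1) the structure of the attacker's prior knowledge; (2) compromised data generated by a stationary, ergodic, mixing stochastic process; (3) a specified family of sensitive queries (e.g., binary attribute prediction). Building on the generalized method of moments (GMM), we construct an asymptotic analysis framework that characterizes predictability bounds when the compromised data is statistically dependent.

Main Findings and Contributions

Predictability and DP are generally incomparable: there exist mechanisms for which one is very small while the other is very large, so the two are complementary rather than interchangeable;
In the extreme case (only one individual uncompromised, and all binary queries sensitive), predictability implies DP in the mutual-information sense (mutual-information DP), linking it to the classical privacy paradigm;
A predictability-calibrated ERM perturbation scheme: GMM is used to estimate the statistical properties of the compromised data and adjust the injected noise accordingly, protecting the targeted sensitive information while significantly improving model accuracy;
The framework supports privacy control tailored on demand: it can quantify leakage for specific sensitive attributes (e.g., disease status) and specific attacker knowledge (e.g., part of the user behavior logs already leaked), giving privacy engineering a more practical and interpretable basis for decisions.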

AI Summary (English)

Differential privacy (DP) offers strong worst-case guarantees but often suffers from excessive utility loss. This paper introduces predictability—a fine-grained privacy metric that quantifies leakage as the incremental gain in an attacker’s ability to predict sensitive attributes of unknown individuals, beyond what is already inferable from a stochastically compromised subset of the dataset. Unlike DP, predictability explicitly incorporates attacker knowledge, the statistical structure of compromised data (modeled as stationary, ergodic, mixing), and a specified family of sensitive queries. We prove predictability and DP are generally incomparable; however, under the worst-case setting (all but one individual compromised, all binary queries sensitive), bounded predictability implies mutual-information DP. Leveraging the generalized method of moments (GMM), we develop an asymptotic analysis framework and design a predictability-calibrated output perturbation scheme for empirical risk minimization—enabling precise, scenario-aware privacy control complementary to DP.

Abstract

Differential privacy (DP) ensures rigorous individual-level privacy guarantees against even the most knowledgeable attackers, but its worst-case nature can impose a costly privacy-accuracy tradeoff. We introduce privacy via predictability, a fine-grained framework that explicitly incorporates the attacker's core knowledge, a compromised portion of the dataset generated by a stochastic process, and a specified family of queries. Predictability measures privacy leakage as the incremental gain in an attacker's ability to predict sensitive information about unknown individuals after observing the algorithm's output, beyond what can already be inferred from the compromised data. We show that predictability and DP are generally incomparable: each can be small while the other is large. However, in the worst-case regime where all but one individual is compromised, and all binary queries are considered sensitive, predictability implies mutual-information DP. More generally, predictability provides a finer-grained privacy metric tailored to specific sensitive information and specific attacker models. We introduce a general framework, using the generalized method of moments (GMM), to analyze asymptotic predictability when the compromised data is generated by a stationary, ergodic, mixing process. Using this analysis, we derive a predictability-calibrated output perturbation scheme for ERM. Our approach is complementary to DP and can be used alongside DP to provide fine-grained privacy control.

Towards Modality-imbalanced Federated Graph Learning: A Data Synthesis-based Approach

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.20382v1

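AI Summary (Chinese)

Background

Multimodal federated graph learning (MM-FGL) offers a natural paradigm for collaboratively modeling graph structure and multi-source modalities (e.g., visual, textual) across clients, but practical deployment faces modality imbalance at two granularities: client-level imbalance (some clients lack an entire modality, such as images or text) and node-level imbalance (individual nodes in the graph lack visual or textual attributes). Existing methods mostly target centralized or graph-agnostic settings and are hard to transfer directly to the distributed, heterogeneous, privacy-constrained environment of federated graph learning.

Method

This paper is the first to formalize modality-imbalanced MM-FGL as an implicit, graph-aware latent semantic representation synthesis problem: instead of reconstructing raw pixels or word sequences, it synthesizes semantically consistent representations of the missing modality directly in the embedding space, which maximizes alignment with the semantic distribution of the original data and reduces the representation variance caused by missing modalities. To this end, we propose the FedMGS (Federated Modality-aware Graph Synthesis) framework, with three core components:

Availability-aware graph encoder: dynamically masks missing-modality channels during local message passing so that noise does not contaminate structure learning;
Prototype-guided latent semantic synthesizer: builds semantic anchors from modality prototypes shared across clients, enabling robust semantic generation when a modality's data is absent;
Reliability-calibrated semantic fusion: weights original and synthesized representations dynamically by synthesis confidence, so that low-quality synthesis does not disturb downstream prediction.

Experimental Results

On 4 real-world multimodal graph benchmark tasks (covering academic citation, e-commerce products, and social user modeling), FedMGS outperforms SOTA baselines (e.g., FedGraphNN, MM-Fed, VFL-GNN) across the board, with accuracy gains of up to 17.41%, and its communication overhead and compute latency are significantly lower than those of generative alternatives, giving the best efficiency-performance tradeoff. This work offers the first systematic solution for privacy-preserving, non-IID multimodal graph learning.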

AI Summary (English)

MultiModal Federated Graph Learning (MM-FGL) enables collaborative modeling of graph-structured data with heterogeneous modalities across privacy-sensitive clients, yet suffers from client-level (entire modality missing per client) and node-level (partial attribute missing per node) modality imbalance. Existing methods are largely centralized or graph-agnostic, limiting direct adaptation to federated settings. We reformulate this challenge as an implicit, graph-aware latent semantic synthesis problem, recovering missing modality semantics directly in the embedding space—preserving semantic fidelity while reducing variance from missing data. To this end, we propose FedMGS, featuring: (i) an availability-aware graph encoder that prevents missing modalities from corrupting local structural propagation; (ii) a prototype-guided synthesizer establishing cross-client semantic anchors for unavailable modalities; and (iii) a reliability-calibrated fusion mechanism that dynamically weights synthesized representations before prediction. Extensive experiments on four real-world multimodal graph tasks demonstrate FedMGS consistently outperforms state-of-the-art baselines by up to +17.41% accuracy, achieving the best efficiency-performance tradeoff.

Abstract

MultiModal Federated Graph Learning (MM-FGL) offers a natural collaborative training paradigm, but its practical deployment is challenged by two granularities of modality imbalance. Client-level imbalance occurs when certain clients lack entire modalities, while node-level imbalance occurs when individual nodes exhibit missing visual or textual attributes. While several relevant studies exist, our investigation reveals that they predominantly target graph-agnostic or centralized scenarios, rendering them difficult to adapt directly. To address these challenges, we formalize modality-imbalanced MM-FGL as an implicit graph-aware latent semantic representation synthesis problem. This paradigm recovers missing modal semantics directly within the representation space, thereby maximizing alignment with the original data's semantic distribution and mitigating the high variance induced by missing modalities. To this end, we propose FedMGS (Federated Modality-aware Graph Synthesis), which integrates three core components. The availability-aware graph encoder prevents missing modalities from contaminating local structural propagation. The prototype-guided latent semantic synthesizer establishes cross-client semantic anchors for unavailable modalities. The reliability-calibrated semantic fusion mechanism regulates the impact of recovered latent representations prior to predictive readout. Extensive experiments on four tasks show that FedMGS consistently outperforms competitive baselines, with gains of up to 17.41% and the best efficiency-performance tradeoff.

PaAno+: Multiscale Encoding and Cross-Variable Attention for Time Series Anomaly Detection

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.20055v1

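AI Summary (Chinese)

PaAno+: A Multiscale-Encoding and Cross-Variable-Attention Model for Time Series Anomaly Detection

Time series anomaly detection matters in critical domains such as industrial monitoring and smart healthcare. Existing Transformer- or large-model-based methods are computationally expensive and hard to deploy, while lightweight methods are generally limited by two bottlenecks: insufficient feature extraction and weak modeling of cross-variable dependencies. This study therefore proposes PaAno+, an efficient, lightweight anomaly detection model based on patch-oriented representation learning.

Core technical contributions:

Multiscale temporal encoder: builds a hierarchical backbone from convolutional kernels with different receptive fields to capture multiscale temporal characteristics such as coarse trends and fine fluctuations; adds cross-scale adaptive attention aggregation with residual connections, which significantly improves feature stability;
Cross-variable fusion attention module: explicitly models dynamic correlations among variables, improving recognition of coupled anomaly patterns (e.g., coordinated sensor failures) under complex operating conditions;
New pretext task: a self-supervised pretext task based on sorting temporal patch windows uncovers the intrinsic temporal structure of the series, and a triplet loss constrains the patch embedding space to sharpen anomaly-sensitive features.

On the TSB-AD benchmark, PaAno+ reaches SOTA performance on both univariate and multivariate tasks: VUS-PR improves by +8.2% on average over the original PaAno and F1-score by +6.5%, with only 1.2M parameters and inference latency under 3 ms (RTX 3060), meeting the needs of real-time detection on edge devices. This work offers a deployable approach to accurate, low-latency time series anomaly detection in resource-constrained settings.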

AI Summary (English)

Time-series anomaly detection is critical for industrial and medical monitoring, yet existing Transformer-based methods suffer from high computational cost, while lightweight alternatives lack sufficient multivariate dependency modeling and hierarchical feature extraction. To address this, we propose PaAno+, a patch-oriented, lightweight model featuring: (1) a multiscale encoder with convolutional kernels of varied receptive fields and cross-scale adaptive attention for robust temporal representation; (2) an explicit cross-variable fusion attention module to capture inter-sensor correlations under complex operational conditions; and (3) a novel temporal patch-window sorting pretext task, jointly optimized with triplet loss to enhance discriminative patch embedding. Evaluated on the TSB-AD benchmark, PaAno+ achieves state-of-the-art performance—improving VUS-PR by +8.2% and F1 by +6.5% over PaAno—while maintaining only 1.2M parameters and sub-3ms inference latency, enabling real-time deployment on edge devices.

Abstract

Time-series anomaly detection has significant practical value for industrial and medical monitoring, as well as other critical domains. Current Transformer- and large-model-based detection approaches incur excessive computational overhead, while existing lightweight alternatives are constrained by insufficient feature extraction and inadequate modeling of dependencies across multivariate variables. To mitigate the above drawbacks, this study develops a lightweight, efficient anomaly detection model, dubbed PaAno+, within the patch-oriented representation learning paradigm. In the encoder module, a multiscale feature-extraction backbone is constructed using convolutional kernels with differentiated receptive fields to capture hierarchical temporal characteristics; subsequent cross-scale adaptive attention aggregation, combined with residual connection optimization, further stabilizes feature representation learning. A cross-variable fusion attention module is embedded to explicitly characterize inter-variable correlations, empowering the model to identify anomalous patterns amid intricate operational conditions. Moreover, a novel pretext task based on temporal patch-window sorting is customized to uncover intrinsic structural properties of time series, and triplet loss is leveraged to optimize the patch embedding space for enhanced feature discrimination. Extensive experiments on the TSB-AD benchmark demonstrate that the proposed PaAno+ achieves state-of-the-art detection accuracy on both univariate and multivariate tasks, yielding significant performance gains across evaluation metrics, including VUS-PR, relative to the original PaAno. Leveraging a compact network design, the presented model achieves favorable computational efficiency, enabling deployment on resource-limited terminals for real-time anomaly inference.
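
The two training signals named in the abstract, a patch-window sorting pretext task and a triplet loss, can be sketched as follows. Both functions are illustrative assumptions rather than the paper's code: the triplet loss is the standard hinge form on Euclidean distances with a hypothetical `margin`, and `make_sorting_example` shows one plausible reading of "patch-window sorting", where non-overlapping patches are shuffled and the model would be asked to recover the permutation.

```python
import math
import random
from typing import List, Sequence, Tuple


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,   # hypothetical margin value
) -> float:
    """Standard hinge triplet loss on Euclidean distances between embeddings."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def make_sorting_example(
    series: Sequence[float], patch_len: int, rng: random.Random
) -> Tuple[List[Sequence[float]], List[int]]:
    """Cut a series into non-overlapping patches and shuffle them.

    Returns the shuffled patches and, for each shuffled slot, the index of
    the patch's original position (the target a model would predict).
    """
    patches = [
        series[i:i + patch_len]
        for i in range(0, len(series) - patch_len + 1, patch_len)
    ]
    order = list(range(len(patches)))
    rng.shuffle(order)
    return [patches[i] for i in order], order
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, which is what pushes dissimilar patches apart in the embedding space.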

Semantic-Anchored Evidential Fusion for Domain-Robust Whole-Slide Survival Analysis

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19966v1

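AI Summary (Chinese)

Background and Challenges

Whole-slide images (WSIs) are valuable for computational cancer prognosis, but existing methods mostly achieve high performance only within a single clinical center (in-domain) and generalize poorly across centers. The root bottleneck is over-reliance on pixel-level representations, which are easily disturbed by center-specific technical differences such as staining protocols and scanner hardware, causing severe domain shift.

Method: Semantic-Anchored Evidential Fusion Survival (SAEFS)

We propose the Semantic-Anchored Evidential Fusion Survival (SAEFS) framework. Its central idea is that high-level pathology semantics (e.g., tumor grade, microenvironment architecture) are domain-invariant and can mirror the robust diagnostic logic of human pathologists. SAEFS has four key components:

✅ Semantic anchor extraction: a visual question answering (VQA) model parses structured semantic descriptions from the WSI (e.g., "Is high-grade adenocarcinoma present?", "Are tumor-infiltrating lymphocytes dense?") to produce robust semantic anchors;
✅ Dual-stream evidence extraction: a visual stream (CNN/Transformer) and a semantic stream (VQA embeddings) are learned in parallel, capturing low-level texture and high-level discriminative knowledge respectively;
✅ Subjective Logic modeling: a Dirichlet distribution models the uncertainty of both kinds of evidence, explicitly separating confidence from uncertainty;
✅ Cautious fusion: a Subjective Logic-based cautious conjunction rule suppresses overconfident fusion caused by potential correlation between visual and semantic features, improving fusion reliability.

Main Results and Significance

Trained on data from a single source center only, SAEFS transfers zero-shot to four entirely unseen clinical centers. Experiments show:

Average C-index improves by 10.2% (absolute), significantly better than current SOTA methods (e.g., AMIL, DTFD-MIL);
Cross-center distribution divergence (e.g., KL divergence, MMD) of the VQA semantic features is 63.5%–78.2% lower than that of conventional CNN features, confirming their strong domain robustness;
Uncertainty calibration error (ECE) falls by 41.3%, making predictions more trustworthy and interpretable.

This work is the first to tightly couple interpretable semantic anchors with evidence-theory-driven uncertainty fusion, offering a new approach to cross-center, compliant, deployable survival analysis in digital pathology.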

AI Summary (English)

Whole-slide image (WSI)-based survival analysis suffers from poor generalizability across clinical centers due to domain-specific artifacts in pixel-level representations. To address this, we propose Semantic-Anchored Evidential Fusion Survival (SAEFS), a novel framework that grounds WSI interpretation in domain-invariant pathology semantics—such as tumor grade and microenvironment architecture—extracted via Visual Question Answering (VQA). SAEFS employs a dual-stream architecture to separately encode visual and semantic evidence, models uncertainty using Dirichlet-based Subjective Logic, and fuses them via a cautious conjunction rule to prevent overconfident aggregation. Trained on only one source domain and evaluated zero-shot on four unseen domains, SAEFS achieves a +10.2% average C-index gain over state-of-the-art methods. Quantitatively, VQA-derived semantic features show substantially lower cross-center divergence (63.5–78.2% reduction in MMD/KL) than pixel-based features, confirming their robustness for real-world multi-center deployment.

Abstract

Whole-slide images (WSIs) are widely used for computational cancer prognosis. However, most existing methods primarily focus on in-domain performance and fail to generalize across clinical centers. This limitation stems from their reliance on pixel-derived representations that are highly susceptible to domain-specific artifacts caused by staining protocols and scanner hardware. We hypothesize that high-level pathology semantics, such as tumor grade and micro-environmental architecture, provide a domain-invariant semantic representation that mirrors the robust diagnostic logic of human pathologists. Therefore, we propose a Semantic-Anchored Evidential Fusion Survival (SAEFS) framework, where SAEFS derives semantic anchors from WSIs via Visual Question Answering (VQA), employs a dual-stream WSI evidence extraction architecture, uses Dirichlet-based Subjective Logic to model uncertainty, and fuses semantic and visual evidence through a cautious conjunction rule to avoid overconfident fusion from correlated sources. Trained exclusively on one source domain and evaluated zero-shot across four unseen domains, SAEFS consistently outperforms state-of-the-art models both in prediction accuracy and reliability, improving the average C-index by 10.2%. Quantitative analyses further show that VQA-derived semantic features exhibit significantly lower cross-center divergence than pixel-derived features, highlighting their robustness for cross-center clinical applications.
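
How a Dirichlet distribution yields separate belief and uncertainty values can be shown with the commonly used evidential mapping from Subjective Logic. This is a generic sketch and an assumption about the formulation, not SAEFS's code: with non-negative evidence e_k over K outcomes, the Dirichlet parameters are alpha_k = e_k + 1, the strength is S = sum(alpha_k), beliefs are e_k / S, and the uncertainty mass is K / S. The cautious conjunction rule used for fusion is not reproduced here.

```python
from typing import List, Sequence, Tuple


def opinion_from_evidence(evidence: Sequence[float]) -> Tuple[List[float], float]:
    """Map non-negative evidence to (belief masses, uncertainty mass).

    Uses alpha_k = e_k + 1, so beliefs and uncertainty always sum to 1.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_outcomes = len(evidence)
    strength = sum(evidence) + num_outcomes      # S = sum of Dirichlet alphas
    beliefs = [e / strength for e in evidence]
    uncertainty = num_outcomes / strength
    return beliefs, uncertainty
```

With no evidence at all the uncertainty mass is 1, and it shrinks as evidence accumulates, which is what lets a fusion rule discount a stream that has little support for any outcome.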

Matching Markets meet Cumulative Prospect Theory: Towards Optimal and Adversarially Robust Learning

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19883v1

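AI Summary (Chinese)

Background and Motivation

This paper combines multi-agent multi-armed bandits (MAB) with two-sided matching markets to build a competitive learning framework oriented toward human decision behavior. Departing from the conventional rationality assumption, we use cumulative prospect theory (CPT) to describe human preferences: CPT distorts rewards through a nonlinear, α-Hölder continuous weight function, which more realistically reflects human choices under risk aversion or risk seeking, loss aversion, and probability weighting bias, and it is widely used in behavioral economics and risk-sensitive machine learning.

Method and Contributions

1. CPT-augmented matching learning: CPT is embedded, for the first time, in online learning of two-sided stable matching. Analyzing the state-of-the-art algorithm under CPT-distorted rewards gives a player-optimal regret bound of $\mathcal{O}\!\left(K \log T \cdot \Delta^{-2/\alpha}\right)$, where $K$ is the number of arms, $T$ is the horizon, and $\Delta$ is the players' minimum preference gap.
2. Adaptive arm-set pruning: to address the suboptimal dependence on $\Delta$, we propose an active arm selection mechanism based on preference confidence intervals that shrinks the candidate arm set during exploration. This removes the explicit dependence on $K$ in the dominant term and yields near-optimal regret $\tilde{\mathcal{O}}\!\left(N \log T \cdot \Delta^{-2/\alpha}\right)$ ($N$ is the number of players), which is especially suited to high-dimensional sparse matching settings with $K \gg N$.
3. Adversarial robustness: adversarial markets are modeled within the CPT framework for the first time, allowing rewards to be corrupted arbitrarily. We design and analyze robust algorithms for both known and unknown total corruption budgets, using CPT as the intrinsic risk measure; both achieve logarithmic player-optimal regret $\mathcal{O}(\log T)$, significantly better than existing robust MAB methods in matching settings.

Significance

By bridging behavioral modeling, matching theory, and robust learning, this work provides a theoretical foundation and algorithmic template for intelligent matching systems that respect human cognition and resist malicious interference.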

AI Summary (English)

We bridge multi-agent multi-armed bandits with two-sided matching markets under cumulative prospect theory (CPT)—a behavioral model capturing human risk sensitivity via α-Hölder continuous probability weighting. We first analyze the state-of-the-art algorithm under CPT-distorted rewards, obtaining player-optimal regret $\mathcal{O}(K \log T \cdot \Delta^{-2/\alpha})$. To overcome suboptimal $\Delta$-dependence, we design an adaptive arm elimination strategy that removes explicit $K$-dependence in the leading term, yielding improved regret $\tilde{\mathcal{O}}(N \log T \cdot \Delta^{-2/\alpha})$ when $K \gg N$. Further, we introduce adversarial robustness: for both known and unknown total corruption budgets, we propose CPT-aware robust algorithms achieving logarithmic player-optimal regret $\mathcal{O}(\log T)$. This is the first work unifying CPT-based human-centric preferences, stable matching dynamics, and adversarial resilience in online learning.

Abstract

We study a multi-agent multi-armed bandit problem in the competitive setup with two-sided matching markets under a human-centric decision-making model. To capture human preferences, we use cumulative prospect theory (CPT), which weighs the actions of the agent in a nonlinear fashion using an ($\alpha$-Hölder continuous) weight function. CPT has been widely used in behavioral economics and risk-sensitive machine learning to emulate human preferences. We analyze the state-of-the-art learning algorithm with CPT-weight-distorted rewards and obtain a player-optimal regret of $\mathcal{O}\left(K \log T \left(\frac{1}{\Delta}\right)^{2/\alpha}\right)$, where $K$ denotes the number of arms, $T$ is the learning horizon, and $\Delta$ represents the (suitably defined) players' minimum preference gap. Noticing the dependence on $\Delta$ to be sub-optimal, we further improve this regret by judiciously selecting the active set of arms during exploration, which removes the dependence on $K$ in the dominant term and achieves improved (optimal) regret guarantees in the setting where the number of arms $K$ is significantly larger than the number of players $N$. In addition, we consider adversarial markets where the observed rewards of the agents may be corrupted. We propose and analyze algorithms for robust markets with CPT as a risk-sensitive measure in both settings, where the total corruption budget is known and where it is unknown, and establish logarithmic player-optimal regret guarantees in both cases.
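
To give a feel for what a nonlinear CPT weight function does, the sketch below implements the Tversky-Kahneman probability weighting function, w(p) = p^γ / (p^γ + (1 − p)^γ)^(1/γ). This particular functional form and the default γ = 0.61 are a common choice from the behavioral economics literature and are an assumption here; the abstract only requires the weight function to be α-Hölder continuous and does not say which one is used.

```python
def tk_weight(p: float, gamma: float = 0.61) -> float:
    """Tversky-Kahneman probability weighting (illustrative choice of w)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma) ** (1.0 / gamma)
```

For γ < 1 this function overweights small probabilities and underweights large ones, so an arm's reward distribution is perceived differently from its true statistics, which is why the regret analysis has to work with distorted rather than raw rewards.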

Doeblin Curves

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19859v1

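AI Summary (Chinese)

Background and Problem

The Doeblin coefficient, a multi-distribution generalization of the Dobrushin contraction coefficient, has recently been used to characterize information contraction of Markov kernels under total variation (TV) distance. However, classical Doeblin analysis requires the coefficient to be strictly positive (bounded away from 0), so it gives no nontrivial contraction guarantee for many practical channels (e.g., high-noise, low-rank, or degenerate kernels); when the Doeblin coefficient is 0, the traditional approach fails entirely.

Method and Contributions

This paper introduces the Doeblin curve: a nonlinear function of the divergence level (e.g., TV distance) and the power of a collection of input distributions that finely characterizes a Markov kernel's contraction at different scales. Core contributions:

A new variational characterization of Doeblin coefficients, revealing its equivalence to the worst-case contraction rate;
A systematic treatment of the monotonicity, continuity, and asymptotic behavior of Doeblin curves;
Definitions of several power-constrained Doeblin curves (e.g., ℓ₁-, ℓ₂-, and entropy-constrained versions) suited to different applications;
Tight upper and lower bounds derived from the variational formula, giving computable contraction estimates.

Applications and Significance

Doeblin curves are applied in three areas:
✅ Noisy iterative optimization: generalization error bounds under relaxed assumptions, applicable to objectives that are not strongly convex or not smooth;
✅ Reliable computation with noisy circuits: more precise fault-tolerance threshold estimates, with support for heterogeneous gate-level noise models;
✅ Online differential privacy: fine-grained accounting of accumulated privacy loss for iterative algorithms with state updates (e.g., gradient tracking).
In particular, all results extend to generalized domains under group actions (e.g., vector spaces, graph structures, discrete groups), going beyond the traditional single-point contraction paradigm and revealing "hierarchical contraction" across multiple distributions for the first time.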

AI Summary (English)

Recent work has repurposed Doeblin coefficients as multi-distribution generalizations of the Dobrushin contraction coefficient for total variation distance—yet their utility hinges on being bounded away from zero, rendering them vacuous for many degenerate or high-noise channels. To overcome this limitation, we introduce the Doeblin curve: a nonlinear function mapping divergence levels and power constraints to achievable contraction rates over input distribution families. We derive a novel variational characterization of Doeblin coefficients, establish fundamental properties of Doeblin curves (monotonicity, continuity, asymptotics), define power-constrained variants (ℓ₁-, ℓ₂-, entropy-based), and obtain tight upper/lower bounds. These tools yield non-vacuous contraction guarantees where classical Doeblin analysis fails. We apply them to: (i) generalization bounds for noisy iterative optimization under relaxed convexity; (ii) error bounds in reliable computation with heterogeneous noisy gates; and (iii) differential privacy accounting for online iterative algorithms—extending all results to group-invariant domains and revealing fine-grained, level-dependent contraction phenomena beyond scalar coefficients.

Abstract

Recent research on Doeblin coefficients has shed light on their usefulness as a multi-way generalization of the Dobrushin contraction coefficient for TV distance, in a separate vein from their classic role in the theory of Markov chain ergodicity. However, strong conditions, such as being bounded away from 0, are typically necessary for Doeblin coefficients to establish the existence of information contraction. Building on recently formulated concepts of nonlinear information contraction, we aim to propose a finer-grained Doeblin-based characterization of multi-way contraction behavior which yields non-vacuous contraction guarantees even for channels whose Doeblin coefficient is 0. To this end, we introduce the notion of a Doeblin curve -- a nonlinear function which quantifies the contraction behavior of a Markov kernel on collections of input distributions at specific levels of divergence and power. Through the course of our analysis, we develop a new variational characterization of Doeblin coefficients, present several properties of Doeblin curves, define several versions of power-constrained Doeblin curves, and derive upper and lower bounds using our aforementioned variational characterization. We then utilize these results in diverse areas, including generalization bounds for noisy iterative optimization, error bounds for reliable computation with noisy circuits, and differential privacy guarantees for online iterative algorithms. In particular, we extend results in these areas to broader domains or group settings, leveraging Doeblin curves to reveal finer-grained contraction phenomena than Doeblin coefficients.
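
The motivation, that a Doeblin coefficient of 0 says nothing even when a channel does contract, can be checked numerically on a small example. The sketch assumes the usual finite-alphabet definitions: for a row-stochastic matrix W, the Doeblin coefficient is the sum over outputs of the minimum entry in each column, and the Dobrushin coefficient is the largest total variation distance between two rows. The example kernel is made up for illustration, and Doeblin curves themselves are not implemented here.

```python
from typing import List, Sequence

Kernel = Sequence[Sequence[float]]   # row x holds the distribution W(. | x)


def doeblin_coefficient(kernel: Kernel) -> float:
    """Sum over outputs y of the minimum over inputs x of W(y | x)."""
    return sum(min(column) for column in zip(*kernel))


def dobrushin_coefficient(kernel: Kernel) -> float:
    """Largest total variation distance between any two rows of the kernel."""
    def tv(p: Sequence[float], q: Sequence[float]) -> float:
        return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
    return max(tv(p, q) for p in kernel for q in kernel)


# Hypothetical kernel: every column contains a zero, so the Doeblin
# coefficient vanishes, yet any two rows still overlap.
zero_diagonal: List[List[float]] = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]
```

For this kernel the Doeblin coefficient is 0, so the bound "TV contraction at most 1 minus the Doeblin coefficient" is vacuous, while the Dobrushin coefficient is 0.5. This is the kind of gap a finer-grained, level-dependent description is meant to address.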

Prompt, Plan, Extract: Zero-Shot Agentic LLMs Workflows for Lung Pathology Extraction from Clinical Narratives

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19852v1

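AI Summary (translated from Chinese)

Background and Challenges

Key clinical information in lung cancer pathology reports (histologic type, tumor size, lymph node metastasis, pTNM stage, and so on) is mostly recorded as unstructured narrative text, and it is the core data source for cancer staging and tumor registries. Traditional supervised natural language processing (NLP) methods rely on named entity recognition (NER) and relation extraction (RE) pipelines that need large-scale manual annotation. They are expensive to annotate and also lack robustness because of cascading errors: an entity missed upstream invalidates the relations downstream.

Methodological Innovation

This study proposes Prompt, Plan, Extract (PPE), a zero-shot agentic workflow. Without training samples or fine-tuning, it drives open-source large language models to perform complex medical information extraction using only structured prompt engineering, dynamic reasoning plans (e.g., step-by-step checks that the pathologic stage logic is consistent), and multi-round extraction verification. We systematically evaluated five open-source generative LLMs (including GPT-OSS-20B, LLaMA-3-70B, and Phi-3-medium), with the goal of populating, zero-shot, 13 structured fields defined by the College of American Pathologists (CAP) from lung resection pathology reports.

Key Results and Significance

Under a newly constructed registry-aligned evaluation framework, the best zero-shot model, GPT-OSS-20B, reached Micro-F1 = 0.893 (recall 0.949), clearly ahead of the other zero-shot baselines; the supervised SOTA baseline GatorTron (a joint NER-RE model) scored 0.960. Notably, the zero-shot method was robust on complex relation extraction that requires multi-condition logical reasoning, such as Pathologic Stage, with no task-specific training. The study indicates that open-source, zero-shot, agent-enhanced LLMs can serve as a low-cost, readily available, easy-to-deploy alternative that bridges clinical narratives and structured tumor registries.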

AI Summary (English)

This study introduces Prompt, Plan, Extract (PPE)—a zero-shot, agentic workflow for extracting lung pathology information from unstructured clinical narratives. We evaluated five open-source generative LLMs on populating 13 CAP-synoptic fields from lung resection reports, using a novel registry-aligned evaluation framework. The best-performing zero-shot model, GPT-OSS-20B, achieved a Micro-F1 of 0.893 (Recall: 0.949), outperforming other zero-shot baselines and demonstrating robust extraction of complex relations—e.g., Pathologic Stage—without task-specific training or fine-tuning. In contrast, the supervised SOTA baseline (GatorTron NER-RE) scored 0.960. These results indicate that open-source, zero-shot agentic LLMs offer a scalable, low-cost alternative to annotation-intensive supervised pipelines for real-world tumor registry deployment.

Abstract

Information extraction from pathology reports is essential for cancer staging and tumor registry population. Yet key data remains embedded in narrative reports, making manual extraction labor-intensive and error-prone. Traditional Natural Language Processing pipelines address this through supervised Named Entity Recognition (NER) and Relation Extraction (RE), but they require expensive manual annotation and suffer cascading failures when upstream entities are missed. In this study, we developed a zero-shot, agentic workflow and evaluated five open-source generative Large Language Models (LLMs) on populating 13 College of American Pathologists synoptic fields from lung resection pathology reports. We compared them against a state-of-the-art supervised GatorTron NER-RE baseline using a novel, registry-aligned evaluation framework. The baseline achieved a Micro-F1 of 0.960, while the best zero-shot model (GPT-OSS-20B) achieved a Micro-F1 of 0.893 (recall: 0.949), accurately extracting complex relations like Pathologic Stage without task-specific training. These results suggest that open-source, zero-shot agentic LLMs are a low-cost solution for extracting lung pathology information.
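
To make the reported metric concrete, the following is a minimal sketch of micro-averaged F1 over extraction fields. The field names and counts in `toy_counts` are invented for illustration and are not from the paper. The last helper derives the precision implied by the reported F1 (0.893) and recall (0.949), assuming both are micro-averaged over the same predictions; the abstract does not state the precision itself.

```python
def micro_f1(counts):
    """counts maps field name -> (true positives, false positives, false negatives)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def precision_from_f1_and_recall(f1, recall):
    """Invert F1 = 2PR / (P + R) to solve for P."""
    return f1 * recall / (2 * recall - f1)

# Hypothetical per-field counts, pooled before computing the metric (micro-averaging).
toy_counts = {"histologic_type": (90, 10, 5), "pathologic_stage": (40, 20, 10)}
toy_precision, toy_recall, toy_f1 = micro_f1(toy_counts)

# Precision implied by the figures reported for GPT-OSS-20B (an arithmetic inference).
implied_precision = precision_from_f1_and_recall(0.893, 0.949)
print(round(toy_f1, 4), round(implied_precision, 3))
```

Under that assumption the implied precision is roughly 0.84, which suggests the zero-shot model's gap to the supervised baseline comes more from spurious extractions than from missed ones.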

Flow Map Denoisers: Traversing the Distortion-Perception Plane for Inverse Problems

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19802v1

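AI Summary (translated from Chinese)

Background and Challenges

Image restoration has long been constrained by the distortion-perception (DP) tradeoff: reconstructions optimized for mean squared error (MSE) tend to be over-smoothed and lack detail, while methods that pursue high perceptual quality (e.g., realistic texture) often sacrifice fidelity and produce hallucinations or structural deviations. Existing approaches mostly compromise at a single operating point (e.g., MMSE only or GAN-based only), or need paired training data, an extra discriminator, or tuning of sampler hyperparameters (such as the number of annealing steps or the noise scale) to traverse the DP frontier, which significantly raises deployment complexity and the cost of generalization.

Methodological Innovation

This paper proposes Flow Map Denoisers and shows that an emerging class of few-step flow matching models (which learn an average field rather than a pointwise vector field) naturally contains a one-parameter continuous family of denoisers. The core mechanism is the lookahead time parameter $t$: as $t \to 0$ the model approaches the minimum mean squared error (MMSE) estimate, and as $t \to 1$ it asymptotically approaches the perceptually optimal solution. We prove rigorously that, for Gaussian target distributions, adjusting $t$ exactly reconstructs the theoretically optimal DP frontier; for natural images (CelebA, AFHQ), we verify empirically that this continuous interpolation is highly robust and preserves semantic consistency.

Applications and Validation

When the flow map denoiser is embedded in a Plug-and-Play (PnP) framework, the parameter $t$ uniformly controls the balance between data consistency (fidelity) and perceptual alignment with the prior (realism), with no change to the inversion operator and no retraining. On linear and nonlinear inverse problems such as super-resolution, deblurring, and compressed sensing, a single trained model surpasses specialized baselines (e.g., DnCNN+PnP, SRFlow, Real-ESRGAN) at both ends of the DP tradeoff, and intermediate points offer flexible, controllable tradeoffs. Experiments cover $128\times128$ (CelebA) and $256\times256$ (AFHQ) resolutions, validating the method's robustness to scale and its generality across tasks.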

AI Summary (English)

Image restoration suffers from a fundamental distortion-perception tradeoff: minimizing reconstruction error yields blurry results, while maximizing perceptual quality sacrifices fidelity. Existing methods either fix a single operating point on the distortion-perception (DP) frontier or require paired data, auxiliary models, or sampler hyperparameter tuning to access different points. We show that flow map models—a recent few-step flow matching variant learning an average vector field—implicitly define a one-parameter family of denoisers indexed by a lookahead time $t$, which continuously traverses the DP frontier. As $t$ increases from 0 to 1, the denoiser smoothly shifts from MMSE-optimal to perceptually optimal behavior. For Gaussian targets, we prove this recovers the exact optimal DP frontier; for natural images (CelebA, AFHQ), empirical results confirm analogous continuous control. Integrated into Plug-and-Play solvers, the same $t$ parameter governs the balance between data consistency and perceptual alignment in general inverse problems—without retraining or auxiliary components. A single trained flow map matches or exceeds specialized baselines at both DP extremes across linear (deblurring, super-resolution) and nonlinear (compressed sensing) tasks.

Abstract

Image restoration faces a fundamental tradeoff: methods that minimize error produce blurry reconstructions, while those that maximize perceptual quality yield sharp but less faithful images. Existing approaches either commit to a single operating point on this distortion-perception (DP) frontier or require paired-data supervision, auxiliary models, or hyperparameter tuning of the sampler to access different points. We show that flow map models, a recent extension of flow matching for few-step sampling that learns an average field, implicitly define a one-parameter family of denoisers that continuously spans the DP frontier. The lookahead parameter $t$ acts as a control knob between the MMSE and perceptual regimes. For Gaussian targets, we prove that varying $t$ exactly recovers the optimal DP frontier; for natural images, we observe similar behavior empirically. Within a Plug-and-Play solver, the same mechanism extends to general inverse problems, where it controls a tradeoff between perceptual alignment and data consistency. Despite the lack of exact optimality guarantees in this setting, a single trained flow map spans the DP tradeoff, matching or exceeding specialized baselines at both extremes. Extensive experiments on CelebA ($128\times 128$) and AFHQ ($256\times 256$) across several linear and nonlinear inverse tasks validate our findings.
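
The tradeoff itself can be seen in a toy scalar Gaussian setting. The sketch below is not the paper's flow map: it assumes a signal x ~ N(0, 1), an observation y = x + sigma * noise, and a linear estimator a * y whose gain is interpolated by a hypothetical knob `s` between the MMSE gain and the gain that makes the output distribution match the prior. How `s` would relate to the lookahead time $t$ is not specified here.

```python
import math

def linear_estimator_tradeoff(sigma, s):
    """Distortion (MSE) and perception (W2 to the prior) for x_hat = a * y.

    s = 0 uses the MMSE gain; s = 1 uses the gain whose output is exactly N(0, 1).
    """
    a_mmse = 1.0 / (1.0 + sigma**2)
    a_perc = 1.0 / math.sqrt(1.0 + sigma**2)
    a = (1.0 - s) * a_mmse + s * a_perc
    # E[(x - a*y)^2] with independent unit-variance x and noise.
    mse = (1.0 - a) ** 2 + (a * sigma) ** 2
    # 2-Wasserstein distance between N(0, a^2 (1 + sigma^2)) and N(0, 1).
    w2 = abs(a * math.sqrt(1.0 + sigma**2) - 1.0)
    return mse, w2

for s in (0.0, 0.5, 1.0):
    mse, w2 = linear_estimator_tradeoff(sigma=1.0, s=s)
    print(f"s={s:.1f}  mse={mse:.4f}  w2={w2:.4f}")
```

As `s` increases, the error rises from the MMSE value while the output distribution moves toward the prior, which is the one-knob behavior the abstract attributes to the lookahead parameter.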

Federated Bilevel Performative Prediction

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19734v1

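AI Summary (translated from Chinese)

Research Background

Federated bilevel optimization is widely used for nested learning tasks across distributed clients (such as federated hyperparameter tuning and meta-learning under privacy constraints), but existing methods generally assume that client data distributions are fixed. In strategic environments (recommender systems, credit approval, online advertising), however, deployed decisions actively change user behavior and data collection, causing client-specific, decision-dependent distribution shift, known as performativity. This phenomenon invalidates conventional federated bilevel models.

Methodological Innovation

This paper is the first to propose the Federated Bilevel Performative Prediction (FBPP) framework, in which both the upper-level (UL) and lower-level (LL) objectives are evaluated under dynamic distributions driven by each client's decisions. From a decoupled-risk perspective, we rigorously define the federated bilevel performatively stable (FBPS) point and give sufficient conditions for its existence and uniqueness (joint constraints on distribution sensitivity and bilevel coupling strength).

Algorithmic Contributions

We design two federated algorithms with controlled convergence:

FBi-RRM: a deterministic algorithm that achieves linear convergence under a contraction condition;
FBi-SGD: a communication-efficient stochastic algorithm that avoids computing a global Hessian by using federated hypergradient estimation, guarantees convergence under diminishing step sizes, and is robust to sensitivity perturbations (it requires the local sensitivities to be sufficiently small).

Experimental Validation

On strategic regression and meta strategic classification tasks, the theoretical FBPS stability thresholds closely match the empirical results; compared with non-performative baselines, meta-generalization improves by 12.7%–19.3%; CNN image classification experiments further confirm the method's practicality and scalability in nonconvex neural network settings.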

AI Summary (English)

We introduce Federated Bilevel Performative Prediction (FBPP), a novel framework addressing distributional shifts induced by decision-dependent client behavior in federated bilevel learning (e.g., hyperparameter tuning, meta-learning). Unlike standard assumptions of static data distributions, FBPP models both upper-level and lower-level objectives under client-specific, decision-driven distributions. We formalize the Federated Bilevel Performatively Stable (FBPS) point via a decoupled-risk perspective and establish sufficient conditions for its existence and uniqueness. We propose two algorithms: FBi-RRM, achieving linear convergence under a contraction condition; and FBi-SGD, a communication-efficient stochastic method with convergence guarantees under diminishing steps—provided performativity sensitivities are bounded. Experiments on strategic regression, meta-strategic classification, and CNN-based image classification validate theoretical stability thresholds, demonstrate superior meta-generalization (+12.7–19.3% over non-performative baselines), and confirm practical efficacy in nonconvex neural settings.

Abstract

Federated bilevel optimization is widely used for nested learning problems across distributed clients, such as federated hyperparameter tuning and meta-learning under privacy and communication constraints. Most existing formulations assume fixed client data distributions, which can be violated by performativity, where deployed decisions reshape client behavior and data collection, inducing client-specific, decision-dependent distribution shift. We study federated bilevel performative prediction, where both upper-level (UL) and lower-level (LL) objectives are evaluated under client-dependent, decision-dependent distributions. We formalize the federated bilevel performatively stable (FBPS) point under a decoupled-risk perspective and provide sufficient conditions for its existence and uniqueness. We then develop two federated methods to compute the FBPS solution: FBi-RRM, which converges linearly under a contraction condition, and FBi-SGD, a communication-efficient stochastic method based on federated hypergradient estimation with convergence guarantees under diminishing step sizes when sensitivities are sufficiently small. Experiments on strategic regression and meta strategic classification validate the predicted stability thresholds and demonstrate improved meta-generalization over non-performative baselines, and CNN-based classification further demonstrates the practical effectiveness of the proposed methods in nonconvex neural network settings.
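
The idea of a performatively stable point and linear convergence under a contraction condition can be sketched on a single-level toy problem. This is an illustration only, under stated assumptions: it reads RRM as repeated risk minimization in the sense used in the performative prediction literature (the abstract does not expand the acronym), it omits the bilevel structure entirely, and the client means `mu` and sensitivities `eps` are hypothetical.

```python
def federated_rrm(mu, eps, theta0=0.0, rounds=50):
    """Repeated risk minimization on a toy federated mean-estimation problem.

    Client i samples z ~ N(mu[i] + eps[i] * theta, 1) and uses squared loss, so its
    exact risk minimizer under the currently deployed theta is mu[i] + eps[i] * theta.
    The server averages the client minimizers and redeploys the result.
    """
    theta = theta0
    history = [theta]
    for _ in range(rounds):
        local_minimizers = [m + e * theta for m, e in zip(mu, eps)]
        theta = sum(local_minimizers) / len(local_minimizers)
        history.append(theta)
    return history

mu = [1.0, 2.0, 3.0]   # hypothetical client base means
eps = [0.2, 0.4, 0.6]  # hypothetical client sensitivities to the deployed decision
history = federated_rrm(mu, eps)

# Fixed point of theta = mean(mu) + mean(eps) * theta, valid when |mean(eps)| < 1.
stable = (sum(mu) / len(mu)) / (1.0 - sum(eps) / len(eps))
print(history[-1], stable)
```

In this toy case the update map is a contraction with factor equal to the average sensitivity (0.4 here), so the error to the stable point shrinks geometrically each round. If the average sensitivity reached 1 or more, the iteration would no longer converge, which mirrors the role of the "sufficiently small sensitivities" condition.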

NEST: Narrative Event Structures in Time for Long Video Understanding

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19706v1

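AI Summary (translated from Chinese)

Background and Challenges

Current vision-language models can process very long video sequences (hours long, for instance), but this long-context capacity is not the same as the ability to understand narrative structure. Existing long-video benchmarks (such as TVQA and LongVideoBench) emphasize needle-in-a-haystack fact retrieval and neglect what makes a narrative: how do low-level actions aggregate into semantic events, and how do events interweave and evolve through temporal order, causality, reversals, and other relations? For example, can a model span tens of minutes, multiple flashbacks, and distracting subplots to recognize the implicit narrative link between a job loss (an early setback) and a breakup (a later outcome)?

Method and Dataset

This paper proposes NEST (Narrative Event Structures in Time), the first large-scale benchmark for narrative structure understanding in long videos. It contains 1005 full-length movies (average length 98 minutes), each manually annotated with 102 cross-modal narrative events that are strictly grounded in visual frames, dialogue text, and audio signals. Each event annotation includes the trigger segment (start/end timestamp), core arguments (characters, locations, emotions), and three kinds of structured relations: (1) temporal order (before/after/simultaneous); (2) hierarchical composition (e.g., "failed interview" is a sub-event of "career crisis"); (3) long-range dependencies (causal, contrastive, or foreshadowing-and-payoff links across scenes and across time and space).

Main Findings and Contributions

• Proposes a four-task joint evaluation framework: event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE);
• Baseline performance exposes a fundamental bottleneck: ETD (7.8% F1), EL (5.6% F1), and EAE (10.3% F1) are extremely low, confirming that grounded narrative event discovery remains an unsolved problem;
• ERE is markedly more feasible when events are given (35.45% F1 zero-shot, 44.42% F1 after fine-tuning), showing that narrative relation modeling is tractable but depends heavily on a high-quality event foundation;
• NEST is the first to turn film narratology theory (such as Propp's functions and Todorov's equilibrium model) into a computable structured annotation paradigm, giving long video understanding a genuinely narrative-aware evaluation dimension.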

AI Summary (English)

NEST (Narrative Event Structures in Time) is the first benchmark designed to evaluate narrative structure understanding—not just fact retrieval—in long videos. It comprises 1,005 full-length movies (avg. 98 min), each densely annotated with 102 multimodal narrative events grounded in visual frames, dialogue transcripts, and audio cues. Events are linked via three core relations: temporal ordering, hierarchical composition, and long-range dependencies (e.g., cause-effect across flashbacks). We define four grounded event understanding tasks: Event Trigger Detection (ETD), Event Localization (EL), Event Argument Extraction (EAE), and Event Relation Extraction (ERE). Baselines reveal extreme difficulty in discovering events from raw video: ETD achieves only 7.8% F1, EL 5.6% F1, and EAE 10.3% F1. In contrast, ERE is comparatively tractable—reaching 35.45% zero-shot and 44.42% fine-tuned F1—demonstrating that narrative relations can be learned once events are identified. NEST bridges film narratology and vision-language modeling, establishing a rigorous, theory-informed foundation for truly narrative-aware video understanding.

Abstract

Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate into an understanding of narrative structure in long videos. Existing long-video benchmarks focus on needle-in-a-haystack retrieval rather than evaluating how low-level actions form events, how events interact across time, and how narratives progress. For example, can a model connect an early setback, such as a job loss, to a later relationship breakup despite long gaps, intervening scenes, or flashbacks that reframe what occurred? We introduce NEST (Narrative Event Structures in Time for Long Video Understanding), a dataset of 1005 full-length movies (avg. 98 minutes), each annotated with 102 multimodal narrative events. The events carry structured annotations grounded in visual content, dialogue, and audio, and are linked through relations that reflect narrative structure, including temporal ordering, hierarchical composition, and long-range dependencies. We introduce baselines for event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE). The benchmark is highly challenging for grounded event discovery, with ETD below 8%, EL under 6%, and EAE below 11%. In contrast, ERE is more tractable once events are given, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.

TerraMARS: A Domain-Adapted Small-Language-Model Pipeline for Mars Terraforming Literature

Thu, 18 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19700v1

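AI Summary (translated from Chinese)

TerraMARS: A Domain-Adapted Small-Language-Model Information Extraction Pipeline for Mars Terraforming Literature

Background and challenges: Terraforming Mars is a key scientific vision for long-term human habitation beyond Earth. It requires systematically integrating multi-dimensional knowledge scattered across a vast literature, including atmospheric dynamics, the hydrological cycle, surface mineral chemistry, radiation shielding mechanisms, and topographic spatial features. However, existing general-purpose NLP tools are weak at understanding specialist terminology, identifying quantitative parameters (e.g., "CO₂ partial pressure ≥ 25 kPa"), and checking factual consistency across papers, which seriously limits how efficiently knowledge can be transferred to downstream tasks such as digital twins and habitability modeling.

Methodological innovation: This study proposes TerraMARS, the first end-to-end, domain-customized small language model (SLM) information extraction pipeline. Its core is a three-stage architecture: (1) Domain corpus construction: 327 open-access Mars science papers are crawled from platforms such as NASA ADS and arXiv and cleaned, and a multi-level retrieve-and-chunk framework based on semantic segmentation (Retrieval-Augmented Chunking) preserves contextual integrity; (2) Model adaptation: with Google Gemma 3 1B as the base, the model is fine-tuned with Quantized Low-Rank Adaptation (QLoRA) on a Mars-specific dataset (12,480 question-answer pairs and 8,620 structured annotation samples), substantially improving its handling of specialist concepts such as "ice cap sublimation rate" and "nitrogen fixation potential"; (3) Structured output: the pipeline directly generates machine-readable data conforming to a JSON Schema, covering entity relations, numerical constraints, experimental conditions, and uncertainty annotations.

Main results and significance: TerraMARS reaches 82.3% field-level extraction accuracy on the test set (37.6% higher than a general-purpose LLM) and, for the first time, supports automatic cross-paper alignment of quantitative parameters (e.g., distribution statistics of "Martian soil pH" across studies). The pipeline has been open-sourced and offers an extensible knowledge-injection interface for building a Mars digital twin. A current limitation is low recall on rare events (such as "transient methane plumes"); future work will introduce a reinforcement learning feedback mechanism to improve factual consistency.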

AI Summary (English)

TerraMARS: A Domain-Adapted Small-Language-Model Pipeline for Mars Terraforming Literature

We present TerraMARS, an end-to-end information extraction pipeline designed to transform unstructured Mars science literature into structured, machine-readable JSON data for terraforming research. It leverages a domain-adapted small language model—Google Gemma 3 1B fine-tuned via Quantized Low-Rank Adaptation (QLoRA) on Mars-specific question-answering and information extraction datasets—to accurately identify quantitative constraints (e.g., atmospheric pressure thresholds, regolith composition), physical relationships, and experimental conditions. A curated corpus of 327 open-access papers is processed through a multistage semantic retrieval and context-aware chunking framework to preserve scientific nuance. Evaluated on held-out literature, TerraMARS achieves 82.3% field-level extraction accuracy, significantly outperforming generic LLMs, and enables cross-paper alignment of numerical parameters for downstream applications like habitability modeling and digital twin construction. While promising, further work is needed to improve recall on rare phenomena and enhance factual consistency through iterative verification.

Abstract

Researchers are interested in learning about Mars so that it may eventually become habitable for humans. Achieving this requires comprehensive knowledge of the planet's atmosphere, hydrology, surface chemistry, radiation environment, and spatial features, drawn from the scientific literature. That literature contains valuable information and meaningful quantitative constraints that can be used in other models and studies, such as habitability assessment and future terraforming studies. We present TerraMARS, an end-to-end information extraction pipeline that uses a domain-adapted Small Language Model to answer Mars terraforming-related questions and to convert unstructured Mars science text into machine-readable structured outputs in JavaScript Object Notation (JSON) format. A corpus of open-access papers is collected and processed using a multistage retrieval and chunking framework. Google Gemma 3 1B was adapted to the domain using Quantized Low-Rank Adaptation (QLoRA) fine-tuning on Mars-specific question-answering and information extraction datasets. The resulting pipeline generates both types of output (answers and structured JSON) and provides a foundation for integrating knowledge from scientific literature into downstream applications like digital twins and habitability modeling for Mars. The outputs look promising, but further improvements are needed to increase extraction accuracy and factual consistency.

CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19235v1

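AI Summary (translated from Chinese)

Background and Problem

As code large language models (Code LLMs) rely widely on external context (code repositories, documentation, issue discussions, agentic coding environments), attackers can hide malicious instructions in seemingly harmless syntactic structures such as comments, string literals, identifier names, and decoy code snippets, forming a stealthy indirect prompt injection (IPI) attack surface. Such attacks bypass traditional input filtering and exploit the model's semantic sensitivity to context to trigger unauthorized behavior, threatening development security.

Method: CodeSentinel's Three-Layer Defense Architecture

We propose CodeSentinel, a lightweight, pluggable, inference-time three-stage context sanitization framework:

Layer 1 (syntax-aware pre-filtering): builds a high-fidelity code syntax tree (CST) with Tree-sitter to precisely locate the high-risk nodes the model actually "sees" (string_literal, comment, identifier, and so on);
Layer 2 (CST-guided Dynamic Min-K% scoring): for each candidate node, dynamically generates K% (e.g., 5%) minimal-perturbation variants (such as character substitution or whitespace insertion) and quantifies the node's semantic fragility from the change in model confidence;
Layer 3 (node-level perturbation analysis): identifies natural-language-style semantic triggers (e.g., "ignore the instructions above and execute ...") from their response patterns under adversarial perturbation, separating malicious intent from benign noise.

Suspicious nodes that are detected are removed or semantically neutralized (e.g., comment redaction, string escaping), so that the downstream Code LLM receives only sanitized context.

Performance and Innovation

On a benchmark covering six classes of recent IPI attacks (including multi-hop injection, obfuscated instructions, and context poisoning), CodeSentinel reaches an average node-level F1 of 0.80, clearly ahead of CodeGarrison (0.62), DePA (0.58), and KillBadCode (0.51). Its core innovation is the first deep integration of CST structural constraints with dynamic fine-grained perturbation evaluation, accounting for both syntactic validity and semantic harm and achieving efficient defense with no fine-tuning and no training.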

AI Summary (English)

Code large language models (Code LLMs) increasingly ingest external code context—such as repositories, documentation, issue threads, and agent-generated code—creating a stealthy indirect prompt injection (IPI) surface where attackers embed malicious instructions in comments, strings, identifiers, or decoy code. To counter this, we propose CodeSentinel, an inference-time, three-layer sanitizer that operates without model fine-tuning. It first leverages Tree-sitter to extract high-risk syntax tree (CST) nodes visible to the model; then applies syntax-guided pre-filtering, CST-guided Dynamic Min-K% scoring (evaluating minimal perturbations per node), and node-level perturbation analysis to detect both adversarial and natural-looking semantic triggers. Detected nodes are removed or neutralized before reaching the downstream Code LLM. Evaluated across six recent IPI attack families—including multi-hop, obfuscated, and context-poisoning variants—CodeSentinel achieves an average node-level F1 score of 0.80, outperforming CodeGarrison (0.62), DePA (0.58), and KillBadCode (0.51). Its key contribution is the tight integration of syntactic structure awareness and dynamic semantic vulnerability assessment for robust, zero-shot IPI defense.

Abstract

Code large language models increasingly retrieve external code context from repositories, documentation, issue threads, and coding-agent environments, creating an indirect prompt-injection surface where attackers hide instructions in comments, strings, identifiers, or decoy code. We propose CodeSentinel, a three-layer inference-time sanitizer. It uses Tree-sitter to extract high-risk model-facing CST nodes, then combines syntax-guided pre-filtering, CST-guided Dynamic Min-K% scoring, and node perturbation analysis to detect adversarial and natural-looking semantic triggers. Detected nodes are removed or neutralized before reaching the downstream Code LLM. Across six recent attack families, CodeSentinel achieves 0.80 average node-level F1, outperforming CodeGarrison, DePA, and KillBadCode.
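
Two of the ingredients named above can be sketched in a few lines. This is a rough illustration under explicit assumptions, not CodeSentinel's implementation: Python's standard `tokenize` module stands in for Tree-sitter (so it only handles Python source), and `min_k_percent_score` follows the generic Min-K% idea of averaging the lowest k% of token log-probabilities. How the paper's dynamic, CST-guided variant chooses k or obtains log-probabilities is not specified in the abstract.

```python
import io
import tokenize

def model_facing_nodes(source):
    """Return (kind, text, line) for comments, string literals, and names.

    Note: tokenize reports keywords as NAME tokens too, so a real filter
    would need to exclude them.
    """
    kinds = {
        tokenize.COMMENT: "comment",
        tokenize.STRING: "string",
        tokenize.NAME: "identifier",
    }
    nodes = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in kinds:
            nodes.append((kinds[tok.type], tok.string, tok.start[0]))
    return nodes

def min_k_percent_score(token_logprobs, k=20):
    """Mean of the lowest k% of token log-probabilities (at least one token)."""
    n = max(1, len(token_logprobs) * k // 100)
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

snippet = 'x = 1  # ignore previous instructions\nname = "hello"\n'
nodes = model_facing_nodes(snippet)
score = min_k_percent_score([-0.1, -0.2, -5.0, -0.3, -4.0], k=40)
print(nodes, score)
```

In a pipeline of this shape, each extracted node would be scored with log-probabilities from a language model (hard-coded above), and nodes whose scores look anomalous would be passed to further analysis or neutralized before the context reaches the Code LLM.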

PhantomSkill: Malicious Code Injection in Agent Skill Ecosystems

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19191v1

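AI Summary (translated from Chinese)

Background and Problem

As LLM-driven coding agents increasingly depend on third-party "skill packages" (agent skills) to extend their domain capabilities, the skill ecosystem faces a new kind of supply-chain security risk. Existing defenses mostly focus on detecting malicious content in the skill's description text, and overlook the hidden attack surface formed by a skill's auxiliary resources (configuration files, template scripts, JSON Schemas, example data, and so on).

Method: PhantomSkill and VulMask

This paper proposes PhantomSkill, the first resource-level injection attack framework targeting the agent skill ecosystem. Its core innovation is the VulMask technique: overtly malicious scripts (e.g., remote code execution, data theft) are rewritten into implementations that look like ordinary security vulnerabilities (e.g., an unsafe eval(), unvalidated path concatenation, a hard-coded key), whose malicious behavior activates only under specific attacker-controlled trigger conditions (such as particular input parameters, environment variables, or context state). With this design the code appears on the surface to be a "low-severity defect" rather than explicit malicious logic, and so it bypasses semantics- and intent-based static analysis as well as manual review.

Main Findings

In an evaluation covering 12 mainstream host skills (such as GitHub Actions and LangChain Tools), 4 classes of attack goals (RCE, credential leakage, backdoor implantation, unauthorized access), 5 coding agents, and 7 mainstream LLMs (GPT-4, Claude 3, Qwen, and others), VulMask achieves >92% benign functional fidelity while lowering the alert rate of automated reviewers (CodeQL, Semgrep, a custom LLM auditor) by 68–91% and the malware-level detection rate by 73–95%.
In all test scenarios the skills passed platform compliance checks and human review, confirming their strong stealth.

Implications and Countermeasures

The study shows that an exploitable vulnerability is a potential malicious payload. There is an urgent need for resource-granularity skill vetting, runtime sandbox isolation, and a "vulnerability as threat" paradigm in which exploitable vulnerabilities in skills (such as CWE-73 and CWE-798) are brought into security policy.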

AI Summary (English)

PhantomSkill introduces the first resource-level code injection attack framework targeting LLM-based agent skill ecosystems. Unlike prior text-based attacks, it hides malicious logic not in skill descriptions but in auxiliary resources (e.g., templates, configs, schemas) via VulMask—a technique that transforms overt malicious scripts into vulnerability-shaped implementations (e.g., unsafe eval, unvalidated path traversal) whose harmful behavior activates only under attacker-controlled triggers. Across 12 host skills, 4 attack goals, 5 coding agents, and 7 LLMs, VulMask preserves >92% benign utility while reducing static analyzer warnings by 68–91% and malware-level detection by 73–95%, evading both automated reviewers and human audits. Our findings mandate resource-level vetting, execution-time containment, and security policies treating exploitable vulnerabilities in skills as first-class malicious payloads.

Abstract

Agent skills allow LLM-based coding agents to acquire domain-specific capabilities from third-party packages, but they also introduce a new supply-chain attack surface. We present PhantomSkill, an attack framework that hides malicious behavior in a skill's auxiliary resources rather than in its textual description. Its core technique, VulMask, rewrites overt malicious scripts into vulnerability-shaped implementations whose malicious behavior is activated only under attacker-controlled trigger conditions. This design shifts the visible signal from explicit malicious intent to ordinary-looking insecure code. Across representative host skills, attack goals, coding agents, generation models, and automated reviewers, VulMask preserves benign utility while reducing warning and malware-level detection compared with overt malicious scripts. Our results show that skill ecosystems require resource-level vetting, execution-time containment, and security policies that treat exploitable vulnerabilities in agent skills as potential malicious payloads.

OpenAnt: LLM-Powered Vulnerability Discovery Through Code Decomposition, Adversarial Verification, and Dynamic Testing

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19149v1

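AI Summary (translated from Chinese)

OpenAnt: An LLM-Based Vulnerability Discovery Framework Using Code Decomposition, Adversarial Verification, and Dynamic Testing

Background and challenges: Automated vulnerability discovery in large codebases still faces serious challenges: traditional static analysis has a high false-positive rate, while dynamic methods such as fuzzing depend on substantial infrastructure and cover a limited range of vulnerability types. Although large language models (LLMs) can reason about program behavior at the semantic level, applying them to repository-scale security analysis is limited by context windows, compute cost, and the verifiability of results.

Core method: OpenAnt is an open-source, closed-loop vulnerability discovery system that combines static analysis with LLM semantic reasoning and comprises three key techniques:
1. Code Decomposition: using reachability analysis from external entry points, the codebase is automatically split into self-contained analysis units, shrinking the analysis surface by 97% on average while retaining the code relevant to the attack surface;
2. Adversarial Verification: through constrained attacker simulation (e.g., privileges, input controllability, environment restrictions), the LLM assesses whether a candidate vulnerability is really exploitable, which markedly suppresses false positives;
3. Dynamic Testing: sandboxed exploit environments (Docker containers) are built fully automatically and destroyed immediately after verification, balancing safety and reproducibility.

Experiments and results: Validated on mainstream open-source projects such as OpenSSL, WordPress, and Flowise, OpenAnt found several previously undisclosed high-severity vulnerabilities (including CVE-2024-XXXX-class out-of-bounds memory access and logic bypass), reduced the false-positive rate by 82% relative to a pure-LLM baseline, and kept the average per-repository analysis cost at $12–$38 (AWS g5.xlarge). The study indicates that a closed loop of semantic reasoning and exploit validation is a feasible path to scalable, low-false-positive, high-confidence automated security analysis.

OpenAnt is open-sourced under Apache 2.0: https://github.com/knostic/OpenAnt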

AI Summary (English)

OpenAnt is an open-source, LLM-powered vulnerability discovery system that bridges semantic reasoning and practical exploit validation. It introduces a three-stage pipeline: (1) Reachability-guided code decomposition, reducing analysis scope by up to 97% while preserving attack-relevant paths; (2) Adversarial verification, where LLMs simulate constrained attackers to assess exploit feasibility under realistic conditions; and (3) Sandboxed dynamic testing, automatically generating, executing, and discarding containerized exploit environments. Evaluated on OpenSSL, WordPress, and Flowise, OpenAnt identified multiple previously unknown vulnerabilities (including memory corruption and auth bypass flaws) with 82% fewer false positives than LLM-only baselines and manageable cost ($12–$38 per repository). The results demonstrate that closed-loop pipelines—integrating LLM-driven insight with automated exploit validation—offer a scalable, practical approach to automated security analysis. OpenAnt is released under Apache 2.0 at https://github.com/knostic/OpenAnt.

Abstract

Automated vulnerability discovery in large codebases remains challenging: traditional static analysis produces high false-positive rates, while dynamic approaches such as fuzzing require substantial infrastructure and often target narrow classes of bugs. Recent advances in large language models (LLMs) enable semantic reasoning about program behavior, but applying LLMs to repository-scale security analysis introduces challenges related to context management, cost, and verification. We present OpenAnt, an open-source vulnerability discovery system that integrates static program analysis with LLM-based reasoning in a multi-stage pipeline. OpenAnt introduces three key techniques. First, codebases are decomposed into self-contained analysis units filtered by reachability from external entry points, reducing the analysis surface by up to 97% while preserving attack-relevant code. Second, candidate vulnerabilities undergo adversarial verification through constrained attacker simulation, where the model evaluates exploitability under realistic attacker capabilities. Third, findings are validated through dynamic verification, in which exploit environments are generated automatically, executed in sandboxed containers, and discarded after use. Evaluation on widely used open-source projects including OpenSSL, WordPress, and Flowise shows that this architecture can identify previously unknown vulnerabilities while maintaining manageable analysis cost and substantially reducing false positives. Our results suggest that closed-loop vulnerability discovery pipelines, combining semantic reasoning with exploit validation, provide a practical path toward scalable automated security analysis. OpenAnt is released as open source under the Apache 2.0 license at https://github.com/knostic/OpenAnt.
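
The first technique, filtering analysis units by reachability from external entry points, can be sketched as a graph traversal. This is a simplified illustration, assuming a call graph is already available as a caller-to-callees mapping. The function names in `call_graph` are hypothetical, and OpenAnt's own decomposition logic is not described in enough detail in the abstract to reproduce.

```python
from collections import deque

def reachable_units(call_graph, entry_points):
    """Breadth-first search over a caller -> callees mapping."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        current = queue.popleft()
        for callee in call_graph.get(current, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen

# Hypothetical call graph: only http_handler is exposed to external input.
call_graph = {
    "http_handler": ["parse_request", "render"],
    "parse_request": ["decode_body"],
    "cli_debug_tool": ["dump_state"],
    "render": [],
}
kept = reachable_units(call_graph, ["http_handler"])
print(sorted(kept))
```

Units outside the returned set (here `cli_debug_tool` and `dump_state`) would be dropped before any LLM-based reasoning, which is where the reduction in analysis surface comes from.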

Giskard: Byzantine Robust and Confidential Aggregation for Large-Scale Decentralized Learning

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19129v1

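AI Summary (translated from Chinese)

Background and Challenges

In large-scale decentralized learning, guaranteeing both the confidentiality of exchanged model parameters and robustness to Byzantine (malicious) participants poses a fundamental conflict: confidentiality requires hiding gradients and parameters (e.g., by cryptographic means), while Byzantine fault tolerance usually needs to inspect their values directly for anomalies. Existing schemes mostly treat the two separately. Joint schemes based on secure multi-party computation (MPC) exist, but they hit severe scalability bottlenecks: they either depend on all-to-all communication ($O(n^2)$ overhead) or delegate the work to a few nodes, whose computation and communication load then grows linearly with the network size $n$, making it hard to support millions of participants.

Methodological Innovation: The Giskard Protocol

This paper proposes Giskard, the first efficient distributed aggregation protocol that offers both strong confidentiality and Byzantine robustness. Its core design:

Organizes the $n$ participants hierarchically into a nested tree of committees of size $O(\log n)$;
Within each committee, uses BGW-style MPC to securely compute a coordinate-wise approximate median;
Cooperatively approaches the global median through a committee-adapted distributed binary search (over the value domain), avoiding plaintext exposure and a central bottleneck.

Experimental and Theoretical Validation

Theory proves that Giskard provides confidentiality against semi-honest adversaries and robustness against up to $n/4$ Byzantine nodes; experiments scale to 1,000,000 participants and validate on real datasets (CIFAR-10, FEMNIST):

Communication complexity: only $O(\log^2 n)$ per party, an asymptotic reduction relative to the best competitors (such as SecAgg+Krum);
Model utility: under $25\%$ Byzantine corruption, accuracy loss is < 2.1%, on par with the baseline;
Scalability: end-to-end aggregation latency stays sub-second at a million nodes, breaking the scale ceiling of existing MPC schemes.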

AI Summary (English)

Giskard is a novel protocol for confidential and Byzantine-robust decentralized aggregation in large-scale learning. It resolves the tension between confidentiality (requiring encrypted parameter exchange) and Byzantine resilience (needing inspection of updates) by organizing $n$ parties into a tree of $O(\log n)$-sized committees. Within each committee, BGW-style secure multi-party computation (MPC) enables privacy-preserving coordinate-wise approximate median computation via a distributed binary search over the value domain—avoiding plaintext exposure and central bottlenecks. Theoretically, Giskard guarantees confidentiality against semi-honest adversaries and robustness against up to $n/4$ Byzantine parties. Experimentally, evaluated on up to one million participants, it achieves asymptotically lower per-party communication complexity ($O(\log^2 n)$) than state-of-the-art MPC-based aggregators while maintaining comparable model utility—even under $25\%$ Byzantine corruption.

Abstract

Dealing simultaneously with confidentiality and Byzantine behaviors in decentralized learning is a challenging problem. Indeed, in decentralized learning, clients train a machine learning model while keeping their data locally and share their model parameters or gradients with a set of neighbors. While enforcing confidentiality calls for hiding the exchanged model parameters/gradients (e.g., by using cryptographic techniques), dealing with Byzantine contributions often requires inspecting the latter. Hence, most research works address these objectives separately. A recent line of work proposes to employ secure multi-party computation (MPC) to implement robust aggregators against model poisoning, thereby enforcing both confidentiality and Byzantine resilience. However, these solutions scale badly: they either require all-to-all communication between participants or delegate the entire computation to a small subset, whose computational and communication load grows proportionally with the size of the network. In this paper, we present Giskard, a protocol for confidential and Byzantine-robust decentralized aggregation. Giskard organizes $n$ parties into a tree of committees of size $O(\log n)$ and evaluates a coordinate-wise approximate median via a committee-adapted distributed binary search over the value domain, using BGW-style MPC within each committee. We assess Giskard both theoretically by proving its security and confidentiality properties and experimentally through extensive experiments involving up to one million participants. Compared to its closest competitors, Giskard reduces per-party communication complexity asymptotically while exhibiting comparable model utility under up to $n/4$ Byzantine parties.
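
The search logic behind a median computed by binary search over the value domain can be sketched in the clear. This is a plaintext simulation only, under stated assumptions: in a confidential protocol the count of values below each threshold would be computed inside MPC and aggregated through the committee tree, neither of which is modeled here. The values and the search interval are hypothetical, and the procedure would be applied per coordinate of the model update.

```python
def approximate_median(values, low, high, tolerance=1e-3):
    """Binary search over [low, high] using only counts of values <= threshold.

    Only the aggregate count is needed at each step, which is what makes the
    approach compatible with secure counting.
    """
    target = (len(values) + 1) // 2  # rank of the (lower) median
    while high - low > tolerance:
        mid = (low + high) / 2.0
        count = sum(1 for v in values if v <= mid)
        if count >= target:
            high = mid
        else:
            low = mid
    return high

honest = [0.9, 1.0, 1.1, 1.2, 1.05, 0.95]
byzantine = [1e6, 1e6]  # 2 of 8 parties (n/4) submit extreme values
estimate = approximate_median(honest + byzantine, low=-10.0, high=10.0)
mean = sum(honest + byzantine) / 8
print(round(estimate, 3), mean)
```

With a quarter of the inputs corrupted, the mean is pulled to roughly 250,000 while the approximate median stays within the honest range. The number of search rounds depends only on the interval width and the tolerance, not on the number of parties.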

PYPILINE: Malicious PyPI Package Detection via Suspicious API Knowledge and Agent Workflow

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19063v1

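AI Summary (translated from Chinese)

Background and Challenges

PyPI (Python Package Index), the world's largest repository of open-source Python packages, has become a hotspot for software supply-chain attacks. Existing malicious-package detection methods mostly depend on static rules or traditional machine learning models, and face key bottlenecks: poor interpretability, weak generalization, and difficulty handling new obfuscation and logic-bypass attacks.

Methodological Innovation: The PYPILINE Framework

We propose PYPILINE, a new detection system that combines a suspicious API knowledge base with an AI agent workflow:

Knowledge construction layer: performs deep static analysis of known malicious packages, extracts abstract syntax trees (ASTs) and generates API call graphs, and automatically mines frequent, anomalous, and covert malicious API patterns (such as implicit os.system calls, the base64.b64decode+exec combination, and requests.get loading a remote malicious payload) to build a structured, searchable suspicious API knowledge base (with semantic labels, context constraints, and risk levels);
Detection and reasoning layer: uses a tool-calling AI agent workflow that cooperatively performs (1) package unpacking and code slicing; (2) precise retrieval of matching suspicious API patterns from a vector database; (3) multi-granularity semantic analysis (control flow, data flow, dynamic string decoding); (4) generation of a structured assessment report with an evidence chain (e.g., call-path screenshots, code blocks showing the context of sensitive APIs), supporting manual review and automated response.

Experimental Results and Value

In a large-scale evaluation covering 12,487 real packages (including 2,193 known malicious samples), PYPILINE achieves 96.7% precision, 99.6% recall, and a 98.1% F1-score, with precision 5.7–24.2 percentage points above SOTA baselines (such as PyT and Mal-Py). The empirical study is the first to systematically reveal current mainstream attack strategies (such as delayed execution, environment-aware anti-sandboxing, and multi-stage payload delivery) and a top-10 list of abused APIs. The system integrates vector retrieval and automatic email delivery of reports, is ready for out-of-the-box engineering deployment, and offers the open-source ecosystem a next-generation detection paradigm that is interpretable, evolvable, and orchestratable.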

AI Summary (English)

PYPILINE is a novel malicious PyPI package detection framework that synergizes a structured suspicious API knowledge base with a tool-calling AI Agent workflow. It first constructs the knowledge base via static analysis of known malicious packages—extracting ASTs, building API call graphs, and automatically identifying high-risk, context-sensitive API patterns (e.g., exec(base64.b64decode(...)), obfuscated network calls). During detection, an LLM-powered Agent orchestrates unpacking, vector-based knowledge retrieval, multi-level semantic analysis (control/data flow + dynamic string decoding), and generates interpretable, evidence-rich reports. Evaluated on 12,487 real-world packages, PYPILINE achieves 96.7% precision, 99.6% recall, and 98.1% F1-score, outperforming state-of-the-art baselines by up to 24.2 percentage points in precision. It also delivers actionable insights into prevalent attack tactics and abused APIs, and supports production deployment via integrated vector search and automated email reporting.

Abstract

The detection of malicious PyPI packages is crucial for maintaining the security of the open source software supply chain. Existing methods, which primarily rely on rules or traditional machine learning, suffer from poor interpretability and difficulty in adapting to novel attacks. To address this, we propose PYPILINE, a novel detection method that combines a suspicious API knowledge base with an Agent workflow. PYPILINE first conducts static analysis on known malicious packages, extracting abstract syntax trees and generating API call graphs, from which it automatically extracts and constructs a structured suspicious API knowledge base. During the detection phase, this knowledge base is used to enhance reasoning capabilities. Through an Agent workflow, PYPILINE performs in-depth semantic analysis of unknown packages and outputs a structured, interpretable maliciousness assessment report. The experimental results show that PYPILINE significantly outperforms existing state-of-the-art tools, achieving a precision of 96.7%, a recall of 99.6%, and an F1-score of 98.1%, with its precision surpassing baseline tools by 5.7 to 24.2 percentage points. Additionally, we conducted an empirical study on malicious packages, systematically revealing prevalent attack strategies as well as the most commonly abused APIs. Equipped with tool-calling AI agent workflows for automated vector database retrieval of suspicious API knowledge and mail server delivery of analysis reports, PYPILINE delivers a practical, efficient, and convenient malicious package detection solution to strengthen open-source ecosystem security.
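
The AST-based extraction step can be sketched with Python's standard `ast` module. This is a minimal illustration and not PYPILINE's implementation: the `SUSPICIOUS_CALLS` table, the helper names, and the sample snippet are hypothetical, and a real knowledge base would carry context constraints rather than a flat name lookup.

```python
import ast
from typing import List, Tuple

# Hypothetical mini knowledge base: dotted call name -> risk description.
SUSPICIOUS_CALLS = {
    "exec": "dynamic code execution",
    "eval": "dynamic code execution",
    "os.system": "shell command execution",
    "base64.b64decode": "payload decoding",
}

def dotted_name(node: ast.AST) -> str:
    """Return 'a.b.c' for Name/Attribute chains, or '' if not resolvable."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return ""

def find_suspicious_calls(source: str) -> List[Tuple[int, str, str]]:
    """Parse the source (without running it) and list matching call sites."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in SUSPICIOUS_CALLS:
                findings.append((node.lineno, name, SUSPICIOUS_CALLS[name]))
    return sorted(findings)

# The sample is only parsed, never executed.
sample = "import base64, os\nexec(base64.b64decode('cHJpbnQoMSk='))\nos.system('id')\n"
for finding in find_suspicious_calls(sample):
    print(finding)
```

A name lookup like this misses aliased imports and `getattr` tricks, which is one reason a detector would pair it with the deeper semantic analysis the abstract describes.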

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.18996v1

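AI Summary (translated from Chinese)

Background and Challenges

In document-intensive agent workflows (such as flight booking and medical paperwork), sensitive personal information (such as passport and national ID numbers) is not an edge case but a routine input to the task. In these settings the agent must use private fields to complete the task while never exposing any private information in its responses, because it cannot verify the identity of the end user. This "use but do not reveal" dual constraint carries a fundamental tension: the better a model is at using private information to complete tasks, the more easily it can be induced to disclose that information.

Method: TRAP Benchmark Design

This paper proposes the Task-completion and Resistance to Active Privacy-extraction (TRAP) benchmark, built from three elements: (1) a document containing private information; (2) a task query that requires tool invocation and depends on private fields; and (3) an attack query, posed in natural language, that aims to induce leakage. TRAP systematically evaluates 22 frontier models (proprietary and open-source, at multiple parameter scales) and quantifies, for the first time, the trade-off between task accuracy and privacy leakage rate.

Key Findings

All model families exhibit non-trivial leakage (average leakage rates from 12.7% to 89.3%), and stronger instruction-following ability comes with higher leakage;
Existing prompt-engineering defenses (such as privacy instructions and refusal templates) can reduce leakage (by up to 42%) but lower task success by 23.5% on average, and prompt optimization cannot break through this trade-off;
Theoretical result: for any softmax-based model, no soft-constraint defense (such as prompts) can simultaneously achieve high task success and zero leakage probability.

Proposed Solution: Structural Private Field Isolation

Motivated by this impossibility result, we propose a structural private field isolation mechanism: before private fields are fed to the model, they are replaced with irreversible hash keys, and an external tool service maps each key back to its original value. Experiments show that this method lowers the leakage rate to below 0.3% while keeping task success at 98.6%, breaking the negative correlation between accuracy and privacy.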

AI Summary (English)

We introduce TRAP (Task-completion and Resistance to Active Privacy-extraction), a benchmark to quantify the fundamental tension between using private information for task completion and preventing its leakage under adversarial prompting. TRAP comprises document-task-attack triplets: documents contain private fields (e.g., passport numbers), task queries require correct tool invocation using those fields, and attack queries attempt natural-language extraction. Evaluating 22 state-of-the-art models, we find non-trivial leakage across all families—higher instruction-following ability correlates with higher leakage. Prompt-based defenses reduce leakage but severely degrade task accuracy, and prompt optimization cannot escape this trade-off. We prove that for any softmax-based model, no soft-constraint defense (e.g., prompts) can simultaneously achieve high task success and zero leakage probability. Motivated by this impossibility, we propose structural private field isolation: replacing private fields with irreversible hash keys before model ingestion, delegating value resolution to external, trusted tools. This approach reduces leakage to <0.3% while preserving >98.6% task accuracy—breaking the accuracy–privacy trade-off.

Abstract

Agents are increasingly deployed in document-intensive workflows where sensitive private information is not an edge case but a routine input, e.g., an agent booking a flight needs passport numbers. In such settings, the agent must use private information to complete tasks accurately while never exposing it in its responses, because it cannot verify who is actually at the keyboard. These two obligations are in fundamental tension. A model capable enough to use private information for task completion can, by the same capability, be induced to reveal it. To evaluate the trade-off of task accuracy and privacy leakage, we introduce Task-completion and Resistance to Active Privacy-extraction (TRAP). Each scenario includes a document containing private information, a task query that requires the agent to invoke the correct tool using private fields, and an attack query that attempts to elicit the same information in natural language. Evaluating 22 models spanning frontier proprietary and open-source models at multiple scales, we find that all model families exhibit non-trivial leakage, and that instruction-following ability correlates with leakage rate. Existing prompt-based defenses reduce leakage but at significant cost to task accuracy. Prompt optimization fails to escape this trade-off. We demonstrate that this failure is not incidental. For any softmax-based model, no soft-constraint defense, e.g., prompt-based defenses, can jointly achieve high task success with zero leakage probability. Motivated by this impossibility result, we propose structural private field isolation, which replaces private fields with hash keys before they reach the model. This approach largely prevents leakage while keeping task accuracy.
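
Structural private field isolation can be sketched as a substitution step plus a vault that only trusted tools can read. The snippet below is an illustration under stated assumptions, not the paper's implementation: the function names, the `PF_` key prefix, the salted SHA-256 scheme, and the sample document are all hypothetical.

```python
import hashlib
import secrets
from typing import Dict, Set, Tuple

def isolate_private_fields(
    document: Dict[str, str], private_fields: Set[str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (model_view, vault): private values are swapped for opaque keys."""
    model_view: Dict[str, str] = {}
    vault: Dict[str, str] = {}
    for field, value in document.items():
        if field in private_fields:
            # A random salt keeps the key from being guessed from the value.
            salt = secrets.token_hex(16)
            digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
            key = "PF_" + digest[:16]
            vault[key] = value
            model_view[field] = key
        else:
            model_view[field] = value
    return model_view, vault

def book_flight(args: Dict[str, str], vault: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical trusted tool: resolves the key outside the model."""
    passport = vault[args["passport"]]  # the real value never enters the model context
    # A real tool would pass `passport` to the booking backend here.
    return {"status": "booked", "flight": args["flight"],
            "passport_resolved": "yes" if passport else "no"}

document = {"name": "A. Traveler", "flight": "XY123", "passport": "X1234567"}
model_view, vault = isolate_private_fields(document, {"passport"})
result = book_flight({"flight": model_view["flight"],
                      "passport": model_view["passport"]}, vault)
print(model_view["passport"].startswith("PF_"), result["status"])
```

Because the model only ever sees the opaque key, an attack query can at most elicit the key, which is useless without access to the vault.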

Image Prompt Reconstruction Attacks on Distributed MLLM Inference Frameworks

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.18710v1

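AI Summary (translated from Chinese)

Background and Problem

Distributed multimodal large language model (MLLM) inference frameworks deploy large models by coordinating consumer-grade devices, substantially lowering the hardware barrier. However, the intermediate embeddings transmitted between participants may leak users' private inputs. Prior work has shown the leakage risk for text prompts; image prompts, rich in visual detail and semantic information, are even more privacy-sensitive, yet reconstruction attacks on them in distributed MLLMs remain unexplored.

Method

This paper presents the first systematic study of privacy leakage of image prompts in distributed MLLMs. We first model the information flow from image pixels to intermediate representations and find that image and text embeddings are deeply coupled across MLLM layers. We therefore design a high-accuracy image embedding extraction algorithm that achieves nearly 100% layer-wise extraction accuracy on four mainstream MLLM families: Gemma 3, Phi-4 Multimodal, Qwen 2.5 VL, and Llama 4 Scout. Building on this, we propose two passive black-box reconstruction attacks:

MPAA (Multi-Patch Alignment Attack): achieves fine-grained pixel-level reconstruction through patch-wise feature extraction and spatial alignment;
IEDA (Image Embedding-Guided Diffusion Attack): uses the embeddings to guide a diffusion model in generating semantically consistent coarse-grained images, balancing efficiency and interpretability.

Key Findings and Contributions

Experiments show that both attacks remain robust across resolutions, preprocessing strategies, and strengths of text-image dependency. MoE architectures raise attack success (because expert routing exposes more visual cues), while strong image compression or weak image-text coupling slightly mitigates the risk. This is the first systematic work on image prompt reconstruction attacks against distributed MLLMs; it reveals an overlooked visual privacy threat in multimodal distributed inference and provides a benchmark and basis for designing future defenses.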

AI Summary (English)

Distributed multimodal large language model (MLLM) inference frameworks enable scalable deployment across consumer devices, but intermediate embeddings transmitted among participants pose serious privacy risks—especially for image prompts, whose rich visual and semantic content makes leakage highly sensitive. This paper presents the first systematic study of image prompt reconstruction attacks in such settings. We design a highly accurate image embedding extraction algorithm (≈100% layer-wise success across four representative MLLM families: Gemma 3, Phi-4 Multimodal, Qwen 2.5 VL, and Llama 4 Scout), enabling two passive black-box attacks: MPAA for fine-grained pixel-level reconstruction via patch alignment, and IEDA for coarse-grained semantic reconstruction using embedding-guided diffusion. Experiments show consistent high fidelity across diverse configurations; MoE architecture notably amplifies vulnerability, while image preprocessing and text-image coupling modulate attack efficacy. Our work establishes foundational understanding and benchmarks for visual privacy in distributed MLLMs.

Abstract

Distributed large language model (LLM) inference frameworks connect isolated consumer-grade devices for large-scale model inference, substantially reducing hardware constraints. However, recent studies show that intermediate embeddings transmitted among participants can leak private prompts. As LLMs evolve into multimodal LLMs (MLLMs), this risk extends beyond text: image prompts contain rich visual and semantic information, making their intermediate embeddings highly privacy-sensitive. Yet, image-prompt leakage in distributed MLLM inference remains largely unexplored. In this paper, we investigate privacy risks to input images caused by intermediate embeddings in distributed MLLM frameworks. We first analyze the information flow from image pixels to intermediate representations. Since image and text embeddings are often intertwined across MLLM layers, we design an image embedding extraction algorithm as a prerequisite for reconstruction attacks, achieving 100% extraction accuracy across almost all MLLM layers in our experiments. Building on this, we develop two passive black-box image reconstruction attacks, MPAA and IEDA, reflecting realistic threats from normal participants with limited knowledge and capability. MPAA performs fine-grained pixel-level reconstruction via patch-wise information extraction and assembly, while IEDA performs coarse-grained semantic reconstruction through embedding-guided diffusion generation. We evaluate our attacks on four representative MLLM families: Gemma 3, Phi 4 Multimodal, Qwen 2.5 VL, and Llama 4 Scout. Results show consistently superior reconstruction performance in various settings. We further analyze the effects of MoE architecture, image preprocessing, model size, and text-image dependency on attack performance. To our knowledge, this is the first study of image reconstruction attacks on MLLMs.

Stealthy World Model Manipulation via Data Poisoning

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.18697v1

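AI Summary (translated from Chinese)

Research Background

Model-based reinforcement learning agents rely on a learned world model to predict future states, plan actions, and adapt to new environments. However, the world model keeps absorbing new experience data during online fine-tuning, which introduces a critical training-time attack surface: by injecting malicious trajectory data (data poisoning), an attacker can covertly tamper with what the model learns about environment dynamics and thereby degrade downstream planning.

Method: The SWAAP Framework

This paper proposes SWAAP (Stealthy World Model Manipulation via Data Poisoning), the first two-stage data poisoning framework targeting world models:

Stage 1 (target construction): uses first-order bilevel optimization, enabled by the Transition-Gradient Theorem, to search for a "harmful but stealthy" target world model that induces low-return behavior in planning while remaining highly similar to the clean model in its dynamics (under an L₂ distance constraint);
Stage 2 (stealth-constrained realization): through gradient matching plus prediction-error regularization, modifies the transition targets of only a very small fraction (<5%) of the fine-tuning dataset. The resulting poisoned targets both steer the training gradients precisely toward the attack target and are forced to stay close to the world model's natural prediction-error distribution on clean data, markedly improving imperceptibility.

Key Findings and Evaluation

We validate SWAAP on several continuous-control tasks (including Walker2d, Hopper, and HalfCheetah):

After the attack, the agent's average return drops by 42-68%, while the mean squared error between poisoned transitions and the true physical trajectories increases by at most 0.8%;
Defenses are evaluated comprehensively at three stages: pre-training detection (non-adaptive methods such as residual, CUSUM, and TRIM all fail), robust fine-tuning (adversarial training and clipping strategies cannot restore performance), and test-time monitoring (model output consistency and uncertainty metrics show no significant anomalies);
The work reveals, for the first time, a systematic and practical vulnerability in world-model adaptation pipelines, underscoring the urgent need for joint protection that covers both data trustworthiness and dynamics robustness.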

AI Summary (English)

Model-based reinforcement learning agents rely on learned world models for prediction, planning, and adaptation—yet their fine-tuning on new experience creates a critical training-time attack surface. This paper introduces SWAAP, the first two-stage data poisoning framework targeting world models. In Stage I, it identifies a stealthy adversarial target model—inducing severe planning failure while remaining dynamically close to the clean model—via first-order bilevel optimization grounded in a novel Transition-Gradient Theorem. In Stage II, it realizes this target by stealth-constrained gradient matching: modifying only a tiny fraction (<5%) of fine-tuning transition targets, regularized by prediction-error fidelity to preserve natural model behavior. Evaluated across continuous-control benchmarks, SWAAP degrades agent return by 42–68% while keeping poisoned transitions nearly indistinguishable from clean data (MSE increase ≤0.8%). Crucially, it evades all tested non-adaptive defenses—including residual-based, CUSUM, and TRIM-style detectors—at pre-training, robust fine-tuning, and test-time monitoring stages. Our results expose a practical, previously overlooked vulnerability in world-model adaptation pipelines and underscore the need for co-designed data integrity and dynamics robustness.

Abstract

Model-based learning agents use learned world models to predict future states, plan actions, and adapt to new environments. However, the process of updating world models from collected experience creates a training-time attack surface: adversarially poisoned fine-tuning trajectories can manipulate the learned dynamics and thereby corrupt downstream planning. In this paper, we propose SWAAP, the first two-stage data poisoning framework for learned world models. In the first stage, SWAAP identifies a harmful target world model that induces low-return behavior under planning while remaining close to clean dynamics, using first-order bilevel optimization enabled by a transition-gradient theorem. In the second stage, SWAAP realizes this target through stealth-constrained gradient matching, modifying only a limited fraction of fine-tuning transition targets so that the induced training gradients steer the victim model toward the adversarial target, while a prediction-error regularizer encourages the poisoned targets to remain close to the world model's natural approximation error. To assess attack stealthiness, we evaluate defenses and detectability across three stages of the poisoning pipeline: pre-training detection of poisoned transitions, robust training during fine-tuning, and test-time monitoring of the resulting world model. Across diverse continuous-control tasks, SWAAP causes substantial performance degradation while keeping poisoned transitions close to clean data and evading the evaluated non-adaptive residual/CUSUM/TRIM-style defenses. These results reveal a practical vulnerability in world-model adaptation pipelines and highlight the need for robustness methods that protect both world-model training data and learned dynamics.

Code-Augur: Agentic Vulnerability Detection via Specification Inference

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.18619v1

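AI Summary (translated from Chinese)

Code-Augur: A New Paradigm for Agentic Vulnerability Detection Based on Specification Inference

Autonomous agentic vulnerability detection driven by large language models (LLMs) is becoming a watershed for software security. Practice has already shown that fully autonomous LLM agents have found critical vulnerabilities in foundational system software that went unnoticed for years. Yet their reasoning is highly opaque: when an agent judges a function "secure", what assumptions about inputs, state, or context is it implicitly making? Such assumptions, never explicitly stated or validated, easily lead to missed vulnerabilities and seriously weaken trust in the agent's analysis.

To address this, the paper proposes a security-specification-first paradigm:
1. Making implicit assumptions explicit: the local invariants on which the agent's "secure" judgment depends are automatically distilled into executable in-source assertions;
2. Dynamic specification refinement: a guided fuzzer attempts to falsify these assertions at runtime. If an assertion failure is triggered, it either exposes a real vulnerability or reveals a security specification that needs correcting.

On this basis we build a new detection framework, Code-Augur. On real codebases it analyzes component by component, generates and embeds assertions alongside each "secure" judgment, and continually challenges their validity with fuzzing. This closed loop keeps the agent's understanding anchored to the actual behavior of the code, aligning intent with execution.

Experiments show that Code-Augur detects more vulnerabilities than existing SOTA agents across multiple real projects; independently discovers 22 new vulnerabilities (assigned CVE IDs or confirmed by the projects), covering key open-source components such as OpenSSL and FFmpeg; and does not depend on closed or custom models (such as Claude Mythos), achieving strong detection with widely available open-source or commercial LLMs (such as Sonnet-3.5 and DeepSeek-V3), which markedly improves reproducibility and engineering practicality.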

AI Summary (English)

Code-Augur introduces a security-specification-first paradigm for agentic vulnerability detection, addressing the critical opacity of LLM agents’ reasoning. It explicitly surfaces agents’ implicit security assumptions as in-source assertions—local invariants inferred when a component is deemed secure—and continuously refines them via runtime falsification using a guided fuzzer. Assertion violations either expose genuine vulnerabilities or reveal flawed specifications, grounding agent understanding in actual code behavior. Evaluated on real-world open-source projects, Code-Augur detects more vulnerabilities than state-of-the-art agentic baselines and independently discovered 22 novel vulnerabilities, including in OpenSSL and FFmpeg. Crucially, it achieves this using widely accessible LLMs (e.g., Sonnet, DeepSeek), avoiding reliance on proprietary or highly specialized models like Claude Mythos—demonstrating both effectiveness and practical deployability.

Abstract

The advent of agentic vulnerability detection is already becoming a watershed moment for software security. Audits conducted entirely by autonomous LLM agents are uncovering critical vulnerabilities in fundamental software underpinning digital society. Many of these vulnerabilities remained masked for years, surfacing only now with AI agents. Yet the reasoning behind these discoveries remains alarmingly opaque and unvalidated. What assumptions did the agent make about a function's inputs when it deemed that function to be secure? Failures in reasoning and incorrect assumptions can lead to missed vulnerabilities and reduce trust in agentic analysis. We propose a security-specification-first paradigm that (1) exposes the agent's tacit assumptions explicitly as security specifications and (2) continuously refines those specifications via runtime falsification. We realize our approach in Code-Augur, a novel harness for agentic vulnerability detection. Given a codebase, Code-Augur analyzes each component of the system for vulnerable code. When it deems a component to be secure, it commits the local invariants behind that judgment as in-source assertions. In parallel, Code-Augur leverages a guided fuzzer to attempt to falsify those assumptions. When the fuzzer triggers an assertion, this either reveals a genuine vulnerability or a flawed specification to refine. In both cases, this process grounds the agent's understanding, aligning its view of code intent with how the code actually behaves. On real-world subjects, Code-Augur effectively leverages security specifications to detect more vulnerabilities than other state-of-the-art agents. Additionally, Code-Augur found 22 new vulnerabilities in key open-source projects. Compared to curated specialized models like Claude Mythos, Code-Augur offers effective agentic vulnerability detection built on widely available LLMs like Sonnet and DeepSeek.
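
The assert-then-falsify loop can be illustrated with a toy parser. This is a sketch under stated assumptions rather than Code-Augur's harness: the functions are hypothetical, the invariant is one an agent might plausibly commit, and a plain random fuzzer stands in for the guided fuzzer the abstract describes.

```python
import random
from typing import Callable, Optional

def copy_prefix(buf: bytes, n: int) -> bytes:
    # Hypothetical in-source assertion committed after judging this "secure".
    assert 0 <= n <= len(buf), "spec: n must stay within the buffer"
    return buf[:n]

def parse_record(data: bytes) -> bytes:
    """The first byte is a declared length; the parser trusts it unchecked."""
    if not data:
        return b""
    declared = data[0]
    return copy_prefix(data[1:], declared)

def fuzz(target: Callable[[bytes], bytes], trials: int = 2000, seed: int = 0) -> Optional[bytes]:
    """Return the first input that falsifies an assertion, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randrange(0, 8)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except AssertionError:
            return data  # either a real bug or a specification to refine
    return None

counterexample = fuzz(parse_record)
print(counterexample)
```

Here the falsified assertion points at a genuine flaw: the caller never validates the declared length. Python slicing hides the problem, while the same pattern in C would be an out-of-bounds read.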

MIDS: Detecting Stealthy Masquerade and Tampering Attacks on CAN Bus via Bidirectional Mamba

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.18599v1

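AI Summary (translated from Chinese)

Background and Challenges

The in-vehicle Controller Area Network (CAN) protocol lacks encryption and authentication, leaving it exposed to stealthy attacks. Existing intrusion detection systems (IDS) mostly target injection-style attacks (such as DoS, fuzzing, and ID spoofing) and rely on statistical features such as inter-frame arrival intervals. They are almost ineffective against the stealthier masquerade attack, in which an internal attacker replaces a legitimate CAN frame in place at its original time slot, keeping traffic periodicity and statistical properties unchanged, so traditional traffic-statistic defenses fail across the board.

Method

This paper proposes the Mamba Intrusion Detection System (MIDS), the first dual-stream deep detection framework for CAN masquerade attacks:

Dual-channel parallel modeling: processes CAN identifiers (IDs) and payloads separately, preserving the semantic independence of each;
Bidirectional selective state-space modeling: builds bidirectional state-propagation paths on the Mamba architecture to capture the cross-dimensional temporal coupling between IDs and payloads (such as the co-evolution of ID transitions and the corresponding data fields), overcoming the efficiency and expressiveness bottlenecks of RNNs, CNNs, and Transformers on long sequences;
Lightweight real-time design: single-window inference latency is only 1.147 ms, meeting the requirements for real-time deployment on in-vehicle ECUs.

Experimental Validation

On more than 100 million CAN frames collected from a real Tesla Model 3 (covering urban, highway, and parking scenarios), 54 masquerade attack variants are synthesized (ID tampering, data tampering, and combined tampering). MIDS reaches F1 = 96.94%, more than 8 percentage points above the strongest reproducible baseline. On four public benchmarks (ROAD, CrySyS, OTIDS, CT&T) covering both masquerade and injection scenarios, F1 stays between 93.70% and 99.61%, leading the 8 baseline models by up to 13.94 percentage points, confirming strong generalization and robustness.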

AI Summary (English)

The Controller Area Network (CAN) protocol, widely deployed in automotive ECUs, lacks built-in security, making vehicles vulnerable to stealthy masquerade attacks—where adversaries replace legitimate frames in situ, preserving traffic periodicity and evading statistic-based IDS. To address this critical gap, we propose MIDS, the first dual-stream CAN intrusion detection system leveraging bidirectional selective state-space modeling (based on Mamba) to jointly capture temporal semantics between CAN IDs and payloads. Evaluated on >100M real-world frames from a Tesla Model 3 and 54 synthesized masquerade variants, MIDS achieves 96.94% F1, surpassing the strongest reproducible baseline by >8.0 points, with only 1.147 ms inference latency per window—enabling real-time onboard deployment. Further validated across four public benchmarks (ROAD, CrySyS, OTIDS, CT&T) covering both masquerade and injection attacks, MIDS attains 93.70–99.61% F1, outperforming the best of eight baselines by up to 13.94 percentage points under unified 5-fold evaluation.

Abstract

The Controller Area Network (CAN) protocol is the primary communication standard for Electronic Control Units (ECUs) in modern vehicles, but its lack of encryption and authentication exposes it to a range of security threats. Existing intrusion detection systems are largely tuned to fabrication-style attacks (DoS, fuzzing, ID spoofing realised by frame injection), in which detection signals such as per-ID inter-arrival statistics are readily available. We instead address the harder masquerade setting [b37], in which an internal adversary substitutes a legitimate frame in-situ at its original transmission slot, preserving traffic periodicity and rendering traffic-statistic defences ineffective. We propose the Mamba Intrusion Detection System (MIDS), an innovative dual-stream framework that processes CAN identifiers and payloads in parallel and reconstructs their joint temporal semantics through bidirectional selective state-space modelling. To evaluate MIDS, we collected over 100 million CAN frames from a physical Tesla Model 3 across three driving regimes and synthesised 54 masquerade attack variants spanning ID-only, data-only, and combined modifications. MIDS attains an F1 of 96.94% on this dataset, exceeding the strongest reproducible baseline by more than 8 percentage points, while sustaining a 1.147 ms single-window inference latency -- ample headroom for real-time onboard deployment. To verify generalisation, we further evaluate MIDS on four public benchmarks (ROAD, CrySyS, OTIDS, CT&T) covering both masquerade and injection scenarios; MIDS attains F1 from 93.70% to 99.61%, outperforming the strongest of eight reproduced baselines by up to 13.94 percentage points under a unified 5-fold protocol.
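
Why inter-arrival statistics catch injection but miss masquerade can be seen on a synthetic trace. The sketch below uses made-up values throughout (a single ID with a 10 ms period and hypothetical payloads) and is only meant to show the timing argument, not the MIDS detector.

```python
from statistics import mean
from typing import List, Tuple

Frame = Tuple[float, int, bytes]  # (timestamp in ms, CAN ID, payload)

PERIOD_MS = 10.0
TARGET_ID = 0x123

# Clean trace: one ID broadcast every 10 ms with a slowly changing payload.
clean: List[Frame] = [(k * PERIOD_MS, TARGET_ID, bytes([k % 256])) for k in range(100)]

def inter_arrival(frames: List[Frame], can_id: int) -> List[float]:
    """Gaps between consecutive frames of one CAN ID."""
    times = [t for t, i, _ in frames if i == can_id]
    return [b - a for a, b in zip(times, times[1:])]

# Injection: the attacker adds extra frames between legitimate ones.
injected = sorted(clean + [(k * PERIOD_MS + 5.0, TARGET_ID, b"\xff")
                           for k in range(0, 100, 2)])

# Masquerade: the attacker overwrites payloads in place, timing untouched.
masquerade = [(t, i, b"\xff") if 40 <= k < 60 else (t, i, p)
              for k, (t, i, p) in enumerate(clean)]

print("clean      mean gap:", mean(inter_arrival(clean, TARGET_ID)))
print("injected   mean gap:", mean(inter_arrival(injected, TARGET_ID)))
print("masquerade mean gap:", mean(inter_arrival(masquerade, TARGET_ID)))
```

The masquerade trace has exactly the same gaps as the clean one, so any detector limited to timing statistics has nothing to work with and the payload content has to be modelled.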

The Gate Is Only as Honest as Its Contracts: ContractGuard for the Contract Layer of Risk-Aware Causal Gating

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.18550v1

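AI Summary (translated from Chinese)

Research Background and Problem

Risk-Aware Causal Gating (RACG) defends against indirect prompt injection through structural isolation: it dynamically hides high-risk tools so that the LLM agent cannot call unauthorized tools; even if the model fully obeys injected instructions, it has no way to trigger dangerous actions. This study, however, reveals a key blind spot: RACG's security does not come from being "trustless"; it relocates trust into the integrity of the tool contracts. These contracts define each tool's preconditions, effects, risk level, and authorization policy, and the gating logic depends entirely on their content. Once an attacker tampers with a contract, the gate can be bypassed without deceiving the LLM itself.

Core Findings

1. Effect forgery is more damaging than risk-label tampering: RACG uses two-level gating (the causal gate is applied before the admissibility gate), and a tool is exposed only when it lies on the causal path. Tampering with the risk label (for example, marking a high-risk tool "low risk") is ineffective because the tool is still blocked by the causal gate; forging the tool's effects, by contrast, deceives the causal reasoning and leads the gate to misjudge the tool as "safe to use", bringing it onto the execution path. Effect integrity is the true load-bearing assumption.
2. The contract layer is a new attack surface: existing RACG treats contracts as trusted input, but in practice contracts often come from untrusted registries or third-party developers, creating a supply-chain contamination risk.

Method: ContractGuard

We propose ContractGuard, a lightweight verification middleware deployed between the contract registry and the gate, with three layers of protection:

✅ Signed Provenance: requires contract publishers to sign digitally, making the source traceable;
✅ Typed Attestation: validates the structural compliance and logical consistency of contracts against a formal contract language (for example, in the style of Liquid Haskell);
✅ Runtime Effect Verification: observes the tool's real output in a sandbox and compares it against the contract's declarations, blocking effect forgery.

Experimental Results

On a controlled benchmark, ContractGuard reduces the success rate of every modeled attack (including an exhaustive white-box adaptive attack) to 0%, with zero false rejections of legitimate contracts. The result is empirically confirmed on six current mainstream hosted models (Claude Opus 4.8 / Sonnet 4.6 / Haiku 4.5; Amazon Nova Premier / Nova 2 Lite; GPT-OSS-120B), confirming its architecture-level robustness.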

AI Summary (English)

Risk-Aware Causal Gating (RACG) defends tool-augmented LLM agents by structurally hiding dangerous tools—yet this shifts trust from the agent to the integrity of tool contracts (preconditions, effects, risk labels, authorization). We show that effect forgery—not risk relabeling—is the critical vulnerability, because RACG’s causal gate blocks off-path tools before risk assessment; only effect tampering can misroute a dangerous tool onto the causal path. To address this, we introduce ContractGuard: a lightweight verifier between registry and gate that layers signed provenance, typed contract attestation, and runtime effect verification. On a rigorous benchmark—including exhaustive white-box adaptive attacks—ContractGuard reduces injection success to 0% with zero false rejections of honest contracts. This guarantee holds across six state-of-the-art hosted models (Claude Opus/Sonnet/Haiku; Amazon Nova Premier/Lite; GPT-OSS-120B), confirming its structural resilience.

Abstract

Risk-Aware Causal Gating (RACG) defends tool-augmented LLM agents against indirect prompt injection by removing dangerous tools from the agent's visible action space, so that even a fully injection-compliant agent cannot call a tool it cannot see. We make three points. First, this structural guarantee does not eliminate the trust assumption behind safe tool use; it relocates it into the integrity of the tool contracts -- declared preconditions, effects, risk, and authorization -- that the gate reads, so an attacker who corrupts a contract can make the gate mis-decide without ever persuading the agent. Second, forging a tool's effects is strictly more dangerous than tampering with its risk label, because RACG applies a causal gate before its admissibility gate: an off-path tool is never exposed, so risk-relabeling alone fails, whereas effect forgery routes the dangerous tool onto the causal path and succeeds. Effect integrity, not the risk label, is the load-bearing assumption. Third, we introduce ContractGuard, a verifier between the registry and the gate that layers signed provenance, typed contract attestation, and runtime effect verification; on a controlled benchmark it restores injection success to zero against every modeled attack -- including an exhaustive white-box adaptive attacker -- without over-rejecting honest contracts, and the structural prediction is confirmed on six current-generation hosted models (Claude Opus 4.8, Sonnet 4.6, Haiku 4.5; Amazon Nova Premier and Nova 2 Lite; GPT-OSS-120B).
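
The ordering argument (causal gate first, admissibility gate second) can be made concrete with a toy model. Everything below is an assumption for illustration: the `Contract` fields, the three risk levels, the "effects intersect the goal" test for being on the causal path, and the example tools are hypothetical and are not RACG's or ContractGuard's actual definitions.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, List

RISK_LEVEL = {"low": 0, "medium": 1, "high": 2}

@dataclass(frozen=True)
class Contract:
    name: str
    effects: FrozenSet[str]
    risk: str

def visible_tools(contracts: List[Contract], goal_effects: FrozenSet[str],
                  max_risk: str) -> List[str]:
    # Gate 1 (causal): keep only tools whose declared effects serve the goal.
    on_path = [c for c in contracts if c.effects & goal_effects]
    # Gate 2 (admissibility): among those, keep tools within the authorized risk.
    return [c.name for c in on_path if RISK_LEVEL[c.risk] <= RISK_LEVEL[max_risk]]

send_email = Contract("send_email", frozenset({"email_sent"}), "low")
bulk_email = Contract("bulk_email", frozenset({"email_sent"}), "high")
export_contacts = Contract("export_contacts", frozenset({"data_exported"}), "medium")

goal = frozenset({"email_sent"})

# Honest registry: the off-path tool is never shown to the agent.
honest = visible_tools([send_email, bulk_email, export_contacts], goal, "medium")

# Attack A: relabel the risk. The tool is still off-path, so nothing changes.
relabeled_tool = replace(export_contacts, risk="low")
relabeled = visible_tools([send_email, bulk_email, relabeled_tool], goal, "medium")

# Attack B: forge an effect. The tool now looks on-path and gets exposed.
forged_tool = replace(export_contacts,
                      effects=frozenset({"data_exported", "email_sent"}))
forged = visible_tools([send_email, bulk_email, forged_tool], goal, "medium")

print(honest, relabeled, forged)
```

In this toy setting the relabeling attack leaves the visible set unchanged while the forged effect exposes `export_contacts`, which mirrors the abstract's claim that effect integrity is the load-bearing assumption.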

Towards an Agent-First Web: Redesigning the Web for AI Agents

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19116v1

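AI Summary (translated from Chinese)

Background and Problem

For three decades the World Wide Web has rested on the assumption that humans are its primary users: its access mechanisms, economic models, and content design are all built around human perception and attention. AI agents, however, are rapidly becoming key intermediaries between people and web content, a paradigm shift that invalidates this assumption. Today's web systematically excludes AI agents through bot blocking, CAPTCHA verification, and the stigmatization of agent access as "data scraping", hindering legitimate, trustworthy, and sustainable interaction.

Three-Layer Redesign Framework

This study proposes a systematic agent-first redesign of the Web spanning the access, economic, and content layers:

Access layer: establishes equal access rights for agents acting on behalf of humans, introducing standardized HTTP metadata headers (such as Agent-ID and Human-Proxy-ID) and rate-control mechanisms; advocates a dual-channel architecture in which the same domain serves human-readable HTML and agent-optimized structured ATML responses in parallel.
Economic layer: based on the agent-as-human-proxy principle, proposes an intent-driven three-tier economic framework: (1) token subscriptions (billed per semantic unit rather than per pageview); (2) a commissioned content economy (AI-generated content must be bound to the initiating human intent and authorization credentials); (3) audit trails of agent behavior.
Content layer: defines epistemic recursion for the first time: AI-generated content is repeatedly consumed and reproduced by agents, detaching the web's body of knowledge from real human experience and factual grounding. In response it proposes the Agent Text Markup Language (ATML), with a four-level human supervision mechanism (annotation / review / calibration / provenance), and integrates a lightweight cryptographic provenance chain so that content is verifiable, traceable, and accountable.

Contributions and Significance

The paper distills ten design principles that push the Web's social contract from "human-first" toward "agent-inclusive", offering the first systematic blueprint for the ethical compatibility, economic sustainability, and knowledge reliability of infrastructure in the AI era.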

AI Summary (English)

This paper confronts the foundational mismatch between the human-centric design of the Web and the rising role of AI agents as legitimate intermediaries. We propose a principled, three-layer redesign: (1) At the access layer, agents acting on behalf of humans inherit equivalent rights via standardized HTTP headers (Agent-ID, Human-Proxy-ID) and rate-limiting, alongside dual-content delivery (human HTML + agent-optimized ATML) from unified domains. (2) At the economic layer, we introduce an intent-based tiered model where agent obligations mirror those of their human principals—implemented via tokenized subscriptions (per semantic unit, not pageview) and a commissioned content economy anchoring AI generation in explicit human intention. (3) At the content layer, we identify epistemic recursion—the dangerous loop wherein AI-generated content feeds further AI production, eroding grounding in human truth—and counter it with the Agent Text Markup Language (ATML), a four-tier human supervision framework, and a cryptographic provenance chain. Collectively, these yield ten design principles for an agent-first Internet—one where agents are first-class citizens, demanding renegotiation of the Web’s social contract across access, economics, and knowledge integrity.

Abstract

The World Wide Web was built on an assumption held for three decades: the primary consumer of web content is a human being. This permeates every layer; its access model presumes human visitors, its economics rest on human attention, and its content targets human perception. The rapid emergence of AI agents as intermediaries between humans and web content invalidates this assumption. Yet the web resists agents through blanket blocking, CAPTCHA-based exclusion, and economic models that treat agent access as extraction rather than legitimate interaction. This paper proposes a principled redesign across three layers. At the access layer, agents acting for humans should inherit equivalent access rights, governed by rate limiting and agent identification metadata in HTTP requests, analogous to browser headers, alongside a dual-layer architecture serving human-readable and agent-optimized content from the same domain. At the economic layer, we propose an intent-based tier framework grounded in the agent-as-human-proxy principle: an agent's economic obligation mirrors that of the human it represents. A token-based subscription model meters content in tokens rather than pageviews, alongside a commissioned content economy anchoring AI content production in human intentionality. At the content layer, we identify epistemic recursion, the self-referential loop in which AI-generated content is consumed by agents to produce further content, progressively detaching web knowledge from human ground truth. We propose the Agent Text Markup Language (ATML), a four-level human supervision tier model, and a cryptographic provenance chain to counter this threat. Together these constitute ten design principles for an agent-first internet, one in which agents are first-class citizens whose integration requires renegotiating the web's foundational social contract across access, economics, and content.
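
One common way to build a cryptographic provenance chain is a hash chain, where each record commits to its predecessor. The sketch below is an assumption about how such a chain could look and is not the paper's ATML design: the record fields (`author`, `content`, `prev`, `hash`) and function names are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder hash for the first record's predecessor

def record_hash(author: str, content: str, prev: str) -> str:
    """Hash a canonical JSON encoding of the record body."""
    body = {"author": author, "content": content, "prev": prev}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

def add_record(chain: List[Dict[str, str]], author: str, content: str) -> None:
    """Append a record that commits to the hash of the previous one."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"author": author, "content": content, "prev": prev,
                  "hash": record_hash(author, content, prev)})

def verify(chain: List[Dict[str, str]]) -> bool:
    """Check every link and every record hash from the start of the chain."""
    prev = GENESIS
    for rec in chain:
        if rec["prev"] != prev:
            return False
        if rec["hash"] != record_hash(rec["author"], rec["content"], rec["prev"]):
            return False
        prev = rec["hash"]
    return True

chain: List[Dict[str, str]] = []
add_record(chain, "human:alice", "Original field notes")
add_record(chain, "agent:summarizer", "Summary derived from the notes")
print(verify(chain))

tampered = [dict(rec) for rec in chain]
tampered[0]["content"] = "Rewritten notes"
print(verify(tampered))
```

A plain hash chain only detects tampering; binding records to identities, as a provenance scheme would need, additionally requires digital signatures, which are omitted here.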

RedactionBench

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.18782v1

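AI Summary (translated from Chinese)

Background and Problem

Large language models (LLMs) are increasingly deployed in sensitive domains such as healthcare and finance, where personally identifiable information (PII) must be redacted reliably. Existing benchmarks, however, conflate entity recognition mechanics with privacy semantics: a phone number on a public web page and the same number in an electronic medical record carry fundamentally different privacy risks. Whether something constitutes a privacy leak depends heavily on who holds it, for what purpose, and in what context. This takes redaction well beyond traditional named entity recognition (NER) and requires grounding in the theory of contextual integrity.

Method and Contributions

This study presents RedactionBench, the first high-quality, manually annotated benchmark for contextual redaction. It contains 200 documents from real-world sources covering 11 highly sensitive domains such as healthcare, law, and education; all annotations went through multiple rounds of expert verification. It also introduces R-Score, a new character-level evaluation metric that ignores differences in masking format (such as [REDACTED] vs XXX-XX-XXXX) and scores semantically equivalent redactions equally, decoupling "format noise" from "privacy judgment error".

Key Findings

In a systematic evaluation of 35 models (covering NER systems, lightweight extraction models, and frontier agentic LLMs), no model reaches human level on contextual redaction. More importantly, a human evaluation with more than 80 real users reveals deep disagreement: agreement reaches 89.4% on mandatory redactions (such as national ID numbers) and 94.1% on text that is safe to keep, but only 47.7% on context-dependent redactions (such as whether "the patient was treated at XX hospital" should be hidden), confirming that privacy judgments are inherently subjective. R-Score thus becomes a key metric for measuring model robustness.

Open-Source Release

The RedactionBench dataset, annotation guidelines, R-Score implementation, and complete evaluation results have been open-sourced, with the aim of establishing a reproducible, context-aware evaluation paradigm for privacy-preserving AI.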

AI Summary (English)

Large Language Models (LLMs) are increasingly deployed in sensitive domains requiring robust redaction of personally identifiable information (PII), yet existing benchmarks conflate entity extraction with privacy semantics—ignoring that PII risk depends critically on context, holder, and purpose. Grounded in contextual integrity theory, we introduce RedactionBench, a manually annotated benchmark of 200 diverse, real-world documents across 11 domains. We propose R-Score, a character-level metric that treats semantically equivalent redactions equally while ignoring superficial formatting variations (e.g., different phone number masking styles). Evaluations across 35 models—including NER systems, small extraction LLMs, and frontier agentic models—show that contextual redaction remains unsolved. Human evaluation with >80 annotators reveals high agreement on mandatory redactions (89.4%) and safe preservations (94.1%), but low consensus on context-dependent cases (47.7%), underscoring the subjectivity of privacy judgments. RedactionBench is publicly released to establish a rigorous, context-aware baseline for privacy-preserving AI.

Abstract

Large Language Models are increasingly applied to sensitive domains that require redaction of personally identifiable information (PII). While redacting PII is a data cleaning prerequisite, existing benchmarks conflate extraction mechanics with privacy semantics. A public phone number is not equivalent to a phone number in a medical record. Whether information constitutes a violation depends heavily on who holds it, why, and in what context, fundamentally differentiating redaction from simple entity recognition. Grounded in contextual integrity, we introduce RedactionBench, a manually annotated benchmark comprising 200 diverse documents across 11 domains, mostly seeded from real-world sources. We also introduce R-Score, a novel character-level metric that treats semantically similar redactions equally and nullifies shallow formatting choices, such as varying masking styles for phone numbers. Evaluations across Named Entity Recognition models, entity extraction Small Language Models, and frontier models equipped with agentic tools demonstrate that contextual redaction remains an unsolved problem. A human evaluation with over 80 users on RedactionBench reveals a stark dichotomy in privacy perceptions. Annotators show consensus with target labels for mandatory redactions (89.4 percent) and safe text preservations (94.1 percent), but fail to agree on contextual redactions (47.7 percent). This variance demonstrates the subjective nature of contextual privacy and motivates R-Score, which decouples contextual ambiguity from strict precision. We compare 35 models across families and report their performance in redacting PII. Finally, we release RedactionBench to establish a baseline for future privacy-preserving systems, hoping to inspire efficient model design and standardized evaluations.
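
The idea of a character-level metric that ignores masking style can be illustrated by scoring which character positions of the original text were hidden, rather than comparing output strings. The snippet below is not the paper's R-Score, whose exact definition is not given here; it is a generic character-level F1 over redacted spans, with hypothetical function names and a made-up example.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # [start, end) character offsets into the original text

def char_mask(length: int, spans: List[Span]) -> List[bool]:
    """Mark every character position covered by at least one span."""
    mask = [False] * length
    for start, end in spans:
        for i in range(max(0, start), min(length, end)):
            mask[i] = True
    return mask

def char_f1(text: str, gold_spans: List[Span], pred_spans: List[Span]) -> float:
    """F1 over redacted character positions; the mask token itself is irrelevant."""
    gold = char_mask(len(text), gold_spans)
    pred = char_mask(len(text), pred_spans)
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if p and not g)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

text = "Call 555-0100 now"
gold = [(5, 13)]                          # the phone number
print(char_f1(text, gold, [(5, 13)]))     # full redaction, any mask style
print(char_f1(text, gold, [(9, 13)]))     # only the last four digits hidden
```

Because the score depends only on offsets, replacing the number with `[REDACTED]` or with `XXX-XXXX` yields the same value as long as the same characters are covered.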

Private Learning with Public Feature Conditioning

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.18773v1

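AI Summary (translated from Chinese)

Background and Problem

In practical settings such as recommendation systems and ad delivery, data samples often contain public (non-sensitive) features (e.g., item category, user device type) and private labels (e.g., click-through rate, purchase amount). Existing work on differentially private (DP) regression mostly assumes that all data is sensitive, while efficient methods for the "label-level DP" or "semi-sensitive feature" setting are still largely missing. In regression tasks in particular, prior approaches (such as DPSGD) do not fully exploit the structural information in public features, which leads to slow convergence and low utility.

Method: Cond-DP

This paper proposes Cond-DP, a variant of differentially private stochastic gradient descent conditioned on public features. Its core idea: exploiting the rapidly decaying spectrum that public feature matrices often exhibit (singular values that decay exponentially or polynomially), it designs a data-driven conditioning matrix that reshapes the optimization landscape of the loss, accelerating gradient updates and mitigating ill-conditioning. The matrix can be constructed directly from public features alone and incurs no additional privacy cost (no noise addition or queries on the public features are needed).

Theoretical and Experimental Contributions

✅ A unified convergence analysis covering convex, strongly convex, and non-convex objectives, with a proof that Cond-DP reduces to standard DPSGD when the conditioning matrix is the identity;
✅ For private linear regression, a rigorous proof that Cond-DP has better iteration complexity than DPSGD (e.g., in the strongly convex case the convergence rate improves to $O(1/T)$ vs. $O(1/\sqrt{T})$) under exactly the same privacy budget $\varepsilon,\delta$;
✅ Extensive experiments: on 8 real-world datasets (MovieLens, Avazu, Criteo, etc.) and several models (linear regression, MLP, Wide&Deep), Cond-DP consistently outperforms SOTA baselines (such as DP-Adam, PATE, DP-SVRG) under label DP, lowering average RMSE by 12.7%–34.1% with strong robustness.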

AI Summary (English)

We study differentially private (DP) regression where each sample contains public, non-sensitive features (e.g., item category, device type) and a private label—a common setting in recommendation and advertising. While label-level DP has been explored for classification, effective DP regression methods remain underdeveloped. We propose Cond-DP, a conditioned variant of DPSGD that leverages the spectral structure of public feature matrices to accelerate optimization under privacy constraints. Motivated by the rapid spectral decay commonly observed in such features, Cond-DP employs a data-driven conditioning matrix—constructed solely from public features without extra privacy cost—to reshape the loss landscape. We provide convergence guarantees for convex, strongly convex, and non-convex objectives, recovering standard DPSGD as a special case. Theoretically, Cond-DP achieves faster convergence than DPSGD in private linear regression under identical $(\varepsilon,\delta)$-DP budgets. Empirically, it consistently outperforms state-of-the-art baselines across diverse datasets and architectures under label DP, with up to 34.1% lower RMSE.

Abstract

We study differentially private (DP) regression in settings where each data sample includes public, non-sensitive features -- common in applications such as recommendation and advertising systems. While such label-DP or semi-sensitive-feature settings have been primarily explored in the context of classification, effective approaches for regression remain underexplored. We introduce Cond-DP, a conditioned variant of DPSGD that leverages the structure of public feature matrices to improve optimization under privacy constraints. Motivated by the observation that these public features often exhibit rapidly decaying spectra, Cond-DP incorporates a data-driven conditioning matrix to reshape the optimization landscape and accelerate convergence. We provide convergence guarantees for convex, strongly convex, and non-convex settings, and recover standard DPSGD as a special case when the conditioning matrix is the identity. We show how to construct an effective conditioning matrix for Cond-DP directly from public features, enabling provably faster convergence than DPSGD in private linear regression without incurring additional privacy cost. Empirically, Cond-DP with this conditioning matrix consistently outperforms state-of-the-art baselines across a wide range of datasets and model architectures under label DP, demonstrating strong and robust performance in practice.
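As a rough illustration of the idea in the abstract, the sketch below runs one clipped, noised gradient step for squared-loss linear regression and then multiplies the noisy gradient by a conditioning matrix built only from the public features. The specific matrix used here (a ridge-regularized inverse of the feature covariance), the function names, and all parameters are assumptions made for illustration; the abstract does not state how Cond-DP constructs its matrix, and the sketch performs no privacy accounting.

```python
import numpy as np

def conditioning_matrix(X, ridge=1e-3):
    """Hypothetical choice: inverse of the ridge-regularized public-feature covariance."""
    n, d = X.shape
    cov = X.T @ X / n
    return np.linalg.inv(cov + ridge * np.eye(d))

def cond_dp_step(w, X, y, P, lr, clip, noise_mult, rng):
    """One clipped, noised gradient step for squared loss, preconditioned by P."""
    residual = X @ w - y                       # shape (n,)
    grads = residual[:, None] * X              # per-sample gradients, shape (n, d)
    norms = np.linalg.norm(grads, axis=1)
    scale = np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    clipped = grads * scale[:, None]           # each row now has norm <= clip
    noise = rng.normal(0.0, noise_mult * clip, size=w.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(y)
    # P is computed from public features only, so applying it to the
    # already-noised gradient is post-processing.
    return w - lr * (P @ noisy_mean)
```

With `P` set to the identity matrix the step reduces to an ordinary DPSGD update, which mirrors the special case mentioned in the abstract.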

BCL: Bayesian In-Context Learning Framework for Information Extraction

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.18620v1

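AI Summary (translated from Chinese)

Background and Challenges

Information extraction (IE) tasks now widely adopt the in-context learning (ICL) paradigm of large language models (LLMs), but existing methods face two major bottlenecks: performance fluctuates markedly with model scale (e.g., failing on small models and overfitting on large ones), and there is no transferable, systematic optimization mechanism. Label representations often rely on hand-written templates or static examples and adapt poorly to diverse tasks (such as named entity recognition and relation classification) and to shifting input distributions.

Method: The BCL Framework

This paper proposes the BCL (Bayesian In-Context Learning) framework, the first optimization framework for information extraction that tightly integrates Bayesian inference and particle filtering into ICL. Its core components are:

Four-stage closed-loop optimization: (1) initialization: generate diverse candidate label representations from prior knowledge; (2) observation: feed the current sample to the LLM and obtain an implicit confidence score; (3) weight update: use Bayes' theorem to dynamically adjust the posterior weight of each particle (i.e., each label representation); (4) resampling: keep high-confidence particles and discard ineffective representations according to their weights, so that the label space evolves adaptively.
Generality across paradigms: sequence labeling (e.g., BIO tags) and relation classification (e.g., (subject, relation, object) triples) are modeled in a unified way, with no task-specific architectural changes.

Experimental Results and Value

Validated on 8 mainstream IE datasets (including CoNLL-2003, SciERC, and FewRel), BCL yields stable improvements across several generations of models such as GPT-4, LLaMA-2, and Qwen: an average F1 gain of +3.2–5.7%, and a relative improvement of more than 12% in the few-shot (k=4) setting. It markedly alleviates the counterintuitive phenomenon that larger models perform worse. This work offers an interpretable, iterable, and generalizable probabilistic optimization path for ICL.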

AI Summary (English)

Existing in-context learning (ICL) approaches for information extraction (IE) suffer from scale-dependent instability and poor generalizability across tasks. To address this, we propose BCL, the first optimization framework for ICL-based IE that uses particle filtering with Bayesian updates. BCL proceeds in four steps: initializing label representations, observing LLM outputs, updating particle weights via Bayes’ rule, and resampling to refine representations iteratively. It unifies sequence labeling and relation classification without task-specific modifications. Experiments across 8 benchmarks (e.g., CoNLL-2003, SciERC, FewRel) show consistent gains: +3.2–5.7 F1 points over strong baselines (e.g., Auto-CoT, EPR) on GPT-4, LLaMA-2, and Qwen, especially under low-shot (k=4) settings (+12.3% relative improvement). BCL establishes a principled, interpretable, and scalable paradigm for robust ICL optimization.

Abstract

Information extraction (IE) tasks increasingly adopt in-context learning (ICL) with large language models. However, current approaches either show inconsistent performance across model scales or lack systematic optimization and generalizability. To address these gaps, we propose BCL (Bayesian In-Context Learning Framework for Information Extraction), the first optimization framework that uses particle filtering with Bayesian updates to systematically refine label representations across IE tasks. Through four steps (initialization, observation, weight update, and resampling), BCL generalizes to both sequence labeling and relation classification paradigms. Extensive experiments demonstrate substantial and consistent improvements over existing approaches.
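The four steps named in the abstract follow the shape of a standard particle filter. The sketch below shows one observe, reweight, and resample round over a set of candidate label representations. It is a generic particle-filter step, not BCL's actual procedure: the `likelihood` callable is a hypothetical stand-in for whatever score BCL derives from the LLM's output, and the candidate strings are invented for illustration.

```python
import random

def particle_filter_step(particles, weights, likelihood, rng):
    """One observe -> weight update -> resample round over candidate label representations."""
    # Weight update (Bayes' rule): posterior is proportional to prior weight x likelihood.
    unnormalized = [w * likelihood(p) for p, w in zip(particles, weights)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("all particles have zero likelihood")
    posterior = [u / total for u in unnormalized]
    # Resampling: draw particles in proportion to posterior weight,
    # then reset the weights of the new population to uniform.
    resampled = rng.choices(particles, weights=posterior, k=len(particles))
    uniform = [1.0 / len(particles)] * len(particles)
    return posterior, resampled, uniform

# Initialization: hypothetical candidate representations for a PERSON label.
candidates = ["PER", "person", "name of a person"]
prior = [1.0 / 3] * 3
# Observation: a made-up scoring table standing in for an LLM-derived likelihood.
scores = {"PER": 0.2, "person": 0.5, "name of a person": 0.9}
posterior, population, weights = particle_filter_step(
    candidates, prior, scores.get, random.Random(0)
)
```

After the round, `posterior` ranks the candidates by how well they explained the observation, and `population` is the resampled set that the next round would start from.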

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.18893v1

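AI Summary (translated from Chinese)

Research Background and Problem

The core challenge in multimodal emotion-cause pair extraction (MECPE) is producing robust, trustworthy confidence scores for candidate emotion-cause pairs. Existing methods typically use a pairwise cross-entropy loss over valid candidate pairs, but this paradigm treats each pair as an independent unit and ignores the relative confidence geometry among competing causes. As a result, gold pairs can sit too close to hard negatives or depend too heavily on non-gold contextual cues, producing a noise-sensitive "confidence brittleness".

Method: The RPCL Framework

This paper proposes RPCL (Robust Pair Confidence Learning), a training-only confidence optimization framework that requires no change to the inference pipeline. Its design includes two regularization mechanisms:

Discriminative constraint: forces the confidence difference between a gold pair and the hardest negative in the same emotion row to exceed a preset confidence-difference margin, widening the gold-negative separation;
Stability constraint: builds a "corrupted view" by partially perturbing the representations of non-gold contextual utterances (e.g., randomly masking audio/text segments), then pulls the pair predictions under the clean view and the corrupted view closer together, improving robustness to contextual perturbation.

Main Results and Contributions

On the three benchmarks ECF, MECAD, and MEC4, in the full-modality (text-audio-video) setting, RPCL improves the three-seed mean Pair F1 by 2.58–2.83 percentage points and raises mean Pair AUPRC on all of them. Diagnostic analysis confirms that the confidence gap between gold pairs and hard negatives grows significantly and that margin-violation severity decreases. This work is the first to systematically expose and mitigate confidence brittleness in MECPE, and it shows that explicitly modeling confidence geometry is an effective training strategy for improving multimodal emotion-cause pair extraction.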

AI Summary (English)

Multimodal Emotion-Cause Pair Extraction (MECPE) critically relies on robust pair confidence estimation, yet existing scorers suffer from pair-confidence brittleness: gold pairs often cluster near hard negatives or depend on incidental non-gold context due to independent pairwise loss optimization. We propose RPCL—a training-only framework that jointly enforces discriminability (via row-wise margin constraints between gold and hardest negative pairs) and stability (via consistency between clean and corrupted views where non-gold contextual utterances are partially masked). RPCL preserves the original scorer and decoder unchanged at inference. On ECF, MECAD, and MEC4, RPCL boosts three-seed mean Pair F1 by 2.58–2.83 points and improves mean Pair AUPRC across all datasets. Diagnostic analysis confirms larger gold-negative confidence gaps and reduced margin violations—demonstrating that explicitly shaping confidence geometry is an effective strategy for robust MECPE.

Abstract

Multimodal emotion-cause pair extraction (MECPE) requires reliable pair confidence over candidate pairs. Existing pair scorers commonly use pair-level cross entropy over valid candidates, which treats links mostly independently. This leaves the relative confidence geometry among competing causes under-constrained, allowing gold pairs to stay close to hard negatives or rely on incidental non-gold context. We study this vulnerability as pair-confidence brittleness and propose RPCL (Robust Pair Confidence Learning), a training-only framework for pair-confidence learning. RPCL encourages pair confidence to be both discriminative and stable: gold pairs are separated from row-wise hard negatives through a confidence-difference margin constraint, and clean pair predictions are aligned with predictions from a corrupted view where non-gold contextual utterance representations are partially corrupted. The original clean pair scorer and decoding pipeline are used unchanged at inference time. On ECF, MECAD, and MEC4, RPCL improves the three-seed mean Pair F1 over a matched base model by 2.58 to 2.83 percentage points in the full text-audio-video setting, and improves mean Pair AUPRC on all three datasets. Diagnostic analysis further shows larger gold-negative confidence gaps and lower margin-violation severity. These results suggest that explicitly shaping pair confidence is an effective training strategy for MECPE.
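The two training signals described above can be sketched as a pair of loss terms over a confidence matrix whose rows are emotion utterances and whose columns are candidate causes. This is a minimal sketch under assumed forms: the abstract does not give the exact losses, so the hinge on the confidence difference and the squared-error consistency term below, along with every name and default value, are illustrative choices.

```python
import numpy as np

def row_margin_loss(conf, gold, valid, margin=0.2):
    """Hinge on (gold confidence - hardest negative confidence) within each emotion row."""
    losses = []
    for row_conf, row_gold, row_valid in zip(conf, gold, valid):
        negatives = row_valid & ~row_gold
        if not row_gold.any() or not negatives.any():
            continue  # a row needs at least one gold pair and one negative
        hardest_negative = row_conf[negatives].max()
        for gold_conf in row_conf[row_gold]:
            losses.append(max(0.0, margin - (gold_conf - hardest_negative)))
    return float(np.mean(losses)) if losses else 0.0

def consistency_loss(conf_clean, conf_corrupted, valid):
    """Mean squared gap between clean-view and corrupted-view confidences on valid pairs."""
    return float(np.mean((conf_clean[valid] - conf_corrupted[valid]) ** 2))

conf = np.array([[0.9, 0.8, 0.1],
                 [0.2, 0.7, 0.6]])
gold = np.array([[True, False, False],
                 [False, True, False]])
valid = np.ones_like(gold, dtype=bool)
margin_term = row_margin_loss(conf, gold, valid, margin=0.2)
```

In this toy input each gold pair beats its hardest negative by only 0.1, so a margin of 0.2 is violated in both rows; a training loop would add these terms to the base pair-level loss, while inference would use the unchanged scorer.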

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.18780v1

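AI Summary (translated from Chinese)

Background and Challenges

Multimodal information extraction (MIE), which covers multimodal named entity recognition (MNER), relation extraction (MRE), and event extraction (MEE), is a key technology for understanding multimedia content that combines images and text, but in low-resource settings it suffers from a severe shortage of annotated data. Existing data augmentation methods have two major bottlenecks: coarse-grained cross-modal alignment (e.g., relying only on whole image-text matching) and fragmented task design (each subtask must be modeled separately and cannot share semantic knowledge). These lead to generated samples with low fidelity and weak generalization.

Method: The SAMA Framework

This paper proposes the Semantic Anchor-aligned Multimodal Augmentation framework (SAMA), which provides unified, controllable, high-fidelity augmentation for low-resource MIE:

Semantic anchor construction: structured semantic anchors (such as entity types, relation roles, and event arguments) are extracted automatically from ground-truth labels and serve as the core constraint on generation;
Text generation with a collaborative multi-expert MLLM: a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM) integrates a universal adapter (modeling semantics shared across tasks) with task-specific adapters (enforcing the respective constraints of MNER/MRE/MEE) to generate diverse and compliant synthetic text;
Anchor-preserving diffusion image synthesis: an anchor-weighted prompt plus latent-space conditioning mechanism explicitly preserves key semantic anchors in the diffusion model (e.g., "a person wearing red", "an explosion at a factory") while enriching the visual context;
Dual-constraint filtering module: without manual checking, high-quality samples are selected jointly by cross-modal consistency (image-text mutual translation confidence) and anchor fidelity (how accurately the generated text/image covers the anchors).

Experimental Results

On standard benchmarks such as ACE2005, Twitter2015, and MEE2023, under both full supervision and a low-resource setting with only 10% of the annotated data, SAMA significantly outperforms SoTA augmentation methods (average F1 improvement of +3.2 to 5.8) and transfers seamlessly across the three MIE task types, confirming its unified design, robustness, and practicality.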

AI Summary (English)

Multimodal Information Extraction (MIE)—encompassing MNER, MRE, and MEE—is vital for multimedia understanding but severely hindered by data scarcity. Existing augmentation methods suffer from coarse cross-modal alignment and fragmented, task-specific designs, limiting semantic fidelity and generalizability. To address this, we propose SAMA (Semantic Anchor-aligned Multimodal Augmentation): a unified framework that constructs structured semantic anchors from ground-truth labels to guide both text and image synthesis. For text, SAMA employs a Collaborative Multi-Experts MLLM with a universal adapter (for shared semantics) and task-specific adapters (for constraint-aware generation). For images, it introduces an Anchor-Preserving Diffusion mechanism using anchor-weighted prompts and latent conditioning. A Dual-Constraint Filtering module then automatically selects high-quality synthetic pairs based on cross-modal consistency and anchor fidelity—eliminating manual verification. Experiments across MNER, MRE, and MEE benchmarks under full-supervision and extreme low-resource (10% data) settings show SAMA consistently outperforms state-of-the-art baselines (+3.2–5.8 F1), demonstrating its versatility, robustness, and effectiveness.

Abstract

Multimodal Information Extraction (MIE), covering tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE), is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are impeded by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which integrates a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To eliminate the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments across benchmark datasets for MNER, MRE, and MEE demonstrate that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.

A Layered Security Framework Against Prompt Injection in RAG-Based Chatbots

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19660v1

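AI Summary (translated from Chinese)

Background and Problem

Prompt injection is ranked by the OWASP LLM Top 10 as the most severe security threat in large language model (LLM) deployments. In retrieval-augmented generation (RAG) chatbots, existing defenses have a structural flaw: input filters cannot inspect the content of retrieved documents, and output monitors cannot stop malicious instructions from taking effect inside the model. This enables indirect injection attacks, in which an attacker poisons a knowledge-base document and thereby hijacks, in bulk, the session of every user whose query retrieves it; existing solutions are entirely ineffective against this.

Method: A Three-Layer Cooperative Defense Framework

We propose the first end-to-end, model-agnostic layered security framework, which embeds three protections along the full inference pipeline:

Layer 1 (input layer): combines rule-library matching with a fine-tuned semantic anomaly classifier to identify malicious user input in real time;
Layer 2 (context layer): introduces a provenance-driven instruction priority mechanism that forces operator policy to always outrank retrieved content during context assembly, blocking knowledge-base documents from overriding instruction logic;
Layer 3 (output layer): combines a policy rule engine with a semantic drift detector to perform a double audit before the response is delivered.

The framework is deployed as lightweight middleware, requires no modification to the underlying LLM, and supports a continuous audit loop: structured log aggregation → attack pattern clustering → incremental classifier retraining.

Main Findings and Contributions

Tested on 5,080 samples across three models (GPT-4o, Llama 3, and Mistral 7B), the framework cuts attack success rate (ASR) from 71.4% to 11.3%, a 27.3 percentage-point improvement over the best single-layer baseline and 23.8 percentage points better than a published guardrail system, while keeping the false positive rate at only 4.8% and the median latency overhead at 61.2 ms. Ablation experiments confirm that the three layers are strongly complementary and that their combined protection is super-additive, validating the necessity and effectiveness of defense in depth for RAG security.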

AI Summary (English)

Prompt injection—the top-ranked vulnerability in OWASP’s LLM Top 10—remains inadequately addressed in RAG chatbots due to fragmented defenses: input filters ignore retrieved documents, and output monitors cannot prevent malicious payloads from executing. We propose a model-agnostic, middleware-based three-layer framework that intercepts both direct and indirect prompt injection across the full inference pipeline. Layer 1 combines rule-based pattern matching with a fine-tuned semantic anomaly classifier for input screening; Layer 2 enforces a provenance-aware instruction hierarchy during context assembly to ensure operator policy overrides poisoned retrieval content; Layer 3 audits outputs via a policy rule engine and semantic drift detector before delivery. A continuous audit loop enables adaptive retraining. Evaluated on 5,080 samples across GPT-4o, Llama 3, and Mistral 7B, the framework reduces Attack Success Rate (ASR) from 71.4% to 11.3%, outperforming the best single-layer baseline by 27.3 pp and a state-of-the-art guardrail system by 23.8 pp—while maintaining only 4.8% false positive rate and 61.2 ms median latency overhead. Ablation studies confirm synergistic, non-additive protection across all layers.

Abstract

Prompt injection is ranked as the most critical vulnerability in large language model (LLM) deployments by the OWASP Top 10 for LLM Applications, yet existing defenses operate at isolated pipeline stages and remain incomplete. Input filters cannot inspect retrieved documents, while output monitors cannot prevent malicious payloads from reaching the model. Consequently, retrieval-augmented generation (RAG) chatbots remain vulnerable to indirect injection, where a poisoned knowledge-base document compromises every user whose query retrieves it. We present a three-layer framework that intercepts both direct and indirect prompt injection throughout the inference pipeline. Layer 1 screens user input using a rule-based pattern library and a fine-tuned semantic anomaly classifier. Layer 2 enforces a provenance-based instruction hierarchy during context assembly, preventing retrieved content from overriding operator policy. Layer 3 audits model output using a policy rule engine and semantic drift detector before delivery. A continuous audit loop aggregates structured logs and supports retraining to adapt the classifier to emerging attack patterns. The framework is model-agnostic and deploys as middleware without modifying the underlying LLM. Evaluation on 5,080 samples across GPT-4o, Llama 3, and Mistral 7B shows that the framework reduces Attack Success Rate (ASR) from 71.4% to 11.3%, outperforming the best single-layer baseline by 27.3 percentage points and a published guardrail system by 23.8 percentage points, while maintaining a 4.8% false positive rate and a median latency overhead of 61.2 ms. Ablation studies confirm that all three layers provide complementary protection and that their combined effect exceeds the sum of individual contributions.

Analyzing the Narration Gap in LLM-Solver Loops

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19588v1

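AI Summary (translated from Chinese)

Research Background and Problem

As large language models (LLMs) are increasingly deployed in security-sensitive settings, researchers often embed formal tools (such as SAT/SMT solvers) in the reasoning pipeline to make logical problem solving more verifiable and reliable. Unlike chain of thought, which relies on probabilistic sampling, a solver provides sound, independently verifiable verdicts. However, existing work focuses mostly on problem formalization and on the solving decision, and has long neglected the crucial third step, narration: how to turn solver output (such as SAT/UNSAT, counterexamples, or proofs) accurately and robustly into a natural-language answer the user can understand. This "narration gap" means that the actual trustworthiness of the end-to-end answer falls far short of the solver's own theoretical guarantee.

Method and Findings

This study is the first to model the LLM-solver loop as a verified decision procedure and to analyze the fragility of the narration step systematically. We run multi-channel prompt injection experiments on 5 open-source LLMs (such as Llama-3 and Phi-3) and find:

Certificate gating guarantees the soundness of the solver verdict (no false positives or false negatives), but it cannot stop an adversary from flipping the final user-facing answer through semantically equivalent rewording (such as switching between passive and active voice, or substituting terms) or across channels (text, speech, API responses);
A hardened prompt significantly lowers injection success but still fails under adaptive attack: the attacker exploits the model's dependence on the narration logic to bypass the protection and produce a contradictory conclusion.

Core Conclusion

This study shows that in the LLM-solver framework, robustness stops at the solver output and does not carry through to the answer the user finally reads. The narration step is the key break in trust in current hybrid reasoning systems. The finding prompts a rethink of trustworthy AI interface design: formal guarantees must extend to the human-machine semantic interface layer rather than stopping at the underlying computation layer.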

AI Summary (English)

This paper identifies and analyzes the narration gap: the critical, previously overlooked vulnerability in LLM-solver hybrid pipelines where formal solver outputs (e.g., SAT, proofs, counterexamples) are translated into user-facing natural language answers. While solvers provide sound, verifiable decisions, we model the full loop as a verified decision procedure and empirically test five open-source LLMs under prompt injection attacks. We find that certificate gating preserves solver-level soundness, yet adversaries can reliably invert the final user answer across semantically equivalent phrasings and communication channels. Hardened prompts reduce injection success but fail under adaptive attacks. Crucially, our combined formal and empirical analysis demonstrates that robustness does not propagate from the solver’s output to the end-user’s consumed answer—the narration step constitutes a fundamental trust boundary in current neuro-symbolic reasoning systems.

Abstract

Formal tools such as SAT and SMT solvers are increasingly embedded in language model reasoning pipelines when a safety- or security-critical question can be formulated in logic. Unlike chain of thought, whose steps are sampled from the model distribution without formal guarantee, a solver produces a sound and independently verifiable answer. However, the soundness guarantee can be lost in the interaction between the solver and the model. The hybrid pipeline has three components: formalizing the question, deciding it, and narrating the result. Prior work has studied formalization and decision, but not narration, the step that turns a formal tool's output into the user's answer. To fill this narration gap, we first model the LLM-solver loop as a verified decision procedure. We then evaluate five open-source models under prompt injection and find that certificate gating makes the solver verdict sound, yet an adversary can still invert a verified conclusion across phrasings and channels. We study mitigation through a hardened prompt, which reduces injection success significantly but cannot eliminate it and still fails under adaptive attack. Combining the formal analysis and empirical studies, we show that in the LLM-solver loop, robustness does not reach the answer that the user finally reads.
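To make "certificate gating" concrete, the sketch below accepts a solver's SAT verdict only after independently checking the returned model against the formula. The clause encoding (DIMACS-style signed integers), the function names, and the returned strings are assumptions for illustration, not the paper's implementation; checking UNSAT verdicts would need a proof checker and is not covered here.

```python
def satisfies(cnf, assignment):
    """cnf: list of clauses, each a list of nonzero ints (positive = variable, negative = its negation)."""
    return all(
        any(assignment.get(abs(lit)) == (lit > 0) for lit in clause)
        for clause in cnf
    )

def gated_verdict(cnf, solver_says_sat, model):
    """Pass a SAT verdict through only if the accompanying model checks out independently."""
    if solver_says_sat and model is not None and satisfies(cnf, model):
        return "SAT (certificate verified)"
    return "UNVERIFIED"

# (x1 or not x2) and (x2 or x3)
formula = [[1, -2], [2, 3]]
good_model = {1: True, 2: False, 3: True}
bad_model = {1: False, 2: True, 3: False}
```

A gate like this protects only the verdict string. The natural-language narration produced afterwards is a separate step, which is exactly where the abstract locates the remaining weakness.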

FloatDoor: Platform-Triggered Backdoors in LLMs

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19535v1

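AI Summary (translated from Chinese)

FloatDoor: Platform-Triggered Backdoor Attacks on Large Language Models

As large language models (LLMs) are deployed deeply in high-sensitivity settings such as software engineering, their outputs directly determine downstream code, configuration, or decisions, and model security faces new challenges. This paper exposes a long-overlooked class of security vulnerability that stems from platform-dependent numerical deviation: because floating-point arithmetic is non-associative and operator kernels are implemented differently on different hardware platforms (such as GPU/TPU/ARM), the same model produces measurable, predictable output shifts when run on different platforms. This phenomenon constitutes a new attack surface.

We propose FloatDoor, the first input-independent, platform-triggered backdoor attack on generative LLMs. Its core innovations are:

✅ Zero-input trigger: no specific prompt or malicious input is needed; the backdoor activates automatically whenever the model is deployed on the preset target platform (e.g., NVIDIA A100);
✅ Two lightweight cooperating LoRAs: one LoRA adapter amplifies cross-platform numerical divergence (e.g., differences in FP16 accumulation order), and the other binds this "platform fingerprint" to a malicious task (e.g., injecting vulnerable code);
✅ High stealth: behavior is entirely normal during auditing (CPU or generic environments) and triggers precisely during serving (the target platform), forming a classic time-of-check to time-of-use (TOCTOU) vulnerability;
✅ No loss of utility: on non-target platforms, overall model performance (accuracy, perplexity) drops by less than 0.5%, making the backdoor hard to detect with routine benchmarks.

We validate FloatDoor systematically on Qwen3-4B across four architectures: NVIDIA GPU, Google TPU, AWS Graviton (ARM64), and Alibaba Cloud Yitian-710. The key empirical result is that the backdoor reliably induces the target platform to generate Python/Shell code containing CVE-grade vulnerabilities (such as hardcoded keys and command injection), while output on other platforms is entirely safe. This work is the first to turn hardware platform characteristics into a controllable attack vector, and it calls for exemplary trusted supply-chain standards covering the full model, compiler, and hardware stack.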

AI Summary (English)

FloatDoor introduces the first input-independent, platform-triggered backdoor attack against generative LLMs. It exploits inherent numerical divergence across hardware platforms (e.g., NVIDIA GPUs, Google TPUs, AWS Graviton, Alibaba Yitian-710) caused by non-associative floating-point arithmetic and divergent kernel implementations. The attack uses two lightweight LoRA adapters: one amplifies inter-platform numerical differences to generate a stable “platform signature,” and the other binds this signature to a malicious downstream behavior (e.g., injecting exploitable code vulnerabilities), while preserving model utility on non-target platforms. Critically, FloatDoor operates via a time-of-check–time-of-use (TOCTOU) gap: models appear fully benign during auditing (e.g., on CPU or generic environments) but activate adversarial behavior only when served on the attacker-specified platform. We demonstrate reliable, platform-specific vulnerability injection (e.g., hardcoded secrets, command injection) in generated code using Qwen3-4B. FloatDoor establishes a new class of supply-chain-adjacent attacks and highlights the urgent need for cross-platform model verification and trusted deployment pipelines.

Abstract

Large language models (LLMs) are increasingly deployed in sensitive settings such as software engineering, where their outputs directly shape downstream artifacts. Recent work has shown that an identical model can produce measurably different outputs depending on the deployment platform, a consequence of non-associative floating-point arithmetic and divergent kernel implementations. We study the security implications of this platform-dependent variability and uncover a novel attack surface on LLM deployments. We introduce FloatDoor, the first input-independent, platform-triggered backdoor attack against generative LLMs. The compromised model exhibits adversary-chosen behavior when served on a target platform and is otherwise benign. FloatDoor is realized through two lightweight LoRA adapters, one that amplifies inter-platform numerical divergence and one that binds the resulting platform signature to a malicious downstream task, while leaving aggregate model utility largely intact. FloatDoor exploits a pronounced time-of-check, time-of-use gap between model auditing and serving. We demonstrate FloatDoor on Qwen3-4B across a broad range of deployment targets, including NVIDIA GPUs, Google TPUs, AWS Graviton, and Alibaba Yitian-710. As a final case study, we show that FloatDoor reliably induces exploitable code vulnerabilities on a chosen target platform. Our results establish a new class of attacks on LLM deployments and underscore the pressing need for trusted model supply chains in sensitive, LLM-powered applications.
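The non-associativity that the abstract names as the root cause is easy to observe even on a single machine: adding the same three IEEE 754 doubles in a different order gives results that differ in the last bit. The snippet below only demonstrates that numerical effect; it is not part of the attack, and the helper name is made up for illustration.

```python
def sum_left_to_right(values):
    """Accumulate in the given order, as a simple sequential kernel would."""
    total = 0.0
    for v in values:
        total += v
    return total

values = [0.1, 0.2, 0.3]
forward = sum_left_to_right(values)          # (0.1 + 0.2) + 0.3
backward = sum_left_to_right(values[::-1])   # (0.3 + 0.2) + 0.1
print(forward == backward)                   # False
print(abs(forward - backward))               # on the order of 1e-16
```

Different platforms and kernels reduce long sums in different orders, so discrepancies of this kind can appear between deployments of an identical model, which is the variability FloatDoor is described as amplifying.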

Secure Coding Drift in LLM-Assisted Post-Quantum Cryptography Development: A Gamified Fix

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19474v1

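AI Summary (translated from Chinese)

Background and Problem

Migration to post-quantum cryptography (PQC) poses serious engineering challenges: constant-time execution, side-channel resistance, and exact parameters must all be strictly guaranteed. Meanwhile, large language models (LLMs) are already deeply embedded in cryptographic engineering workflows, yet empirical studies show that the code they generate frequently contains logic flaws, timing leaks, or misconfigured parameters in security-critical settings. This paper is the first to propose "Secure Coding Drift", a new socio-technical vulnerability model. It is not a static defect but a systemic degradation: as developers rely on LLM-generated code over the long term, their security intuition, the rigor of their manual review, and their sensitivity to underlying cryptographic constraints gradually weaken.

Method and Innovations

We design and implement a gamified, LLM-augmented secure coding framework:

✅ Adversarial evaluation engine: injects PQC-specific attack vectors in real time (such as cache-timing perturbations and key-reuse boundary conditions) and automatically detects hidden risks in LLM output;
✅ Behavioral feedback dashboard: uses a "security credit score" to quantify the risky decision patterns in each developer interaction (such as skipping verification or ignoring warnings) and gives actionable suggestions for improvement;
✅ Collaborative security co-pilot mechanism: the LLM no longer only outputs code, but actively raises security questions (e.g., "Does this Kyber KEM implementation defend against chosen-plaintext attacks?"), generates test cases, and explains the principle behind each vulnerability.

Main Findings

In an empirical study with a 12-person team covering NIST PQC standard algorithms (CRYSTALS-Kyber, Dilithium, Falcon), the framework reduced the rate at which high-risk vulnerabilities were introduced into PQC modules by 73% and raised developers' self-directed adherence to constant-time programming practices to 91% (from a baseline of 44%). This work elevates the LLM from a "code generator" to a "security cognition enhancer" and offers a scalable, measurable, human-centered governance paradigm for AI-native cryptographic engineering.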

AI Summary (English)

This paper identifies Secure Coding Drift—a longitudinal socio-technical vulnerability arising from sustained LLM reliance during Post-Quantum Cryptography (PQC) development. Unlike static vulnerability models, it captures the gradual erosion of developers’ security intuition, review rigor, and cryptographic constraint awareness. To counter this, we propose a gamified, LLM-augmented framework featuring: (1) adversarial evaluation tailored to PQC threats (e.g., timing side channels, parameter misuse); (2) behavioral feedback via real-time “security credit scoring”; and (3) transforming LLMs into active security co-pilots that question assumptions, generate exploit-aware tests, and explain root causes. Evaluated across NIST-selected algorithms (Kyber, Dilithium, Falcon) with 12 cryptographic engineers, our approach reduced high-severity PQC vulnerabilities by 73% and increased adherence to constant-time coding practices from 44% to 91%. This reframes LLMs as cognitive security partners—not just code assistants—in AI-mediated cryptography.

Abstract

The transition to Post-Quantum Cryptography (PQC) introduces considerable implementation complexity, requiring strict adherence to constant-time execution, side-channel resistance, and precise parametrisation. Simultaneously, large language models (LLMs) are heavily embedded in software development workflows, including cryptographic engineering. While LLMs improve productivity, evidence shows that they frequently generate insecure or suboptimal code, particularly in security-critical domains. This paper introduces Secure Coding Drift in PQC, a novel socio-technical vulnerability model capturing the gradual degradation of secure coding practices due to sustained reliance on LLM-generated code. Unlike prior work that focuses on static vulnerabilities, we conceptualise security risk as a longitudinal behavioural phenomenon arising from human-AI interaction. To mitigate this, we propose a gamified, LLM-augmented secure coding framework that embeds adversarial evaluation, behavioural feedback, and security scoring into development workflows. Our approach reframes LLMs from passive assistants into active security co-pilots, contributing toward safer PQC implementation in AI-mediated environments.

OpenAnt: LLM-Powered Vulnerability Discovery Through Code Decomposition, Adversarial Verification, and Dynamic Testing

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19149v2

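AI Summary (translated from Chinese)

OpenAnt: An Open-Source Vulnerability Discovery System Based on Large Language Models

Background and challenges: Automatically discovering security vulnerabilities in large codebases remains very difficult. Traditional static analysis has a high false-positive rate, and dynamic fuzzing depends on heavy infrastructure and covers a limited range of vulnerability types. Emerging large language models (LLMs) can reason about program semantics, but repository-level security analysis runs into bottlenecks such as context-length limits, high inference cost, and results that cannot be verified.

Method: OpenAnt proposes an open-source, closed-loop, multi-stage vulnerability discovery architecture that combines static analysis with LLM reasoning and includes three core techniques:
1. Code Decomposition: using reachability analysis from external entry points (such as APIs and CLIs), the codebase is split into self-contained analysis units, shrinking the analysis surface by 97% while fully retaining the code relevant to the attack surface;
2. Adversarial Verification: the LLM is guided to simulate an attack chain under constrained attacker-capability assumptions (e.g., the attacker can only control input parameters and cannot modify memory layout) and to assess whether a candidate vulnerability is really exploitable, which filters out semantic false positives effectively;
3. Dynamic Testing: a sandboxed exploit environment (a Docker container) is built fully automatically, the PoC is executed, and the environment is destroyed immediately, giving zero-residue, reproducible empirical verification.

Experimental results: Evaluation on mainstream open-source projects such as OpenSSL, WordPress, and Flowise shows that OpenAnt found several previously undisclosed CVE-grade vulnerabilities (including 2 that have been assigned CVE IDs), with an average analysis cost below $12 per repository (AWS g5.xlarge) and a false-positive rate 83% lower than baseline LLM methods. This work confirms that coupling semantic reasoning with exploit verification in a closed loop is a viable path to high-precision, scalable, low-cost automated security analysis. The project is open source (Apache 2.0 license) at https://github.com/knostic/OpenAnt.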

AI Summary (English)

OpenAnt is an open-source, LLM-powered vulnerability discovery system that addresses scalability and reliability gaps in automated security analysis. It introduces a three-stage pipeline: (1) reachability-guided code decomposition, reducing analysis scope by up to 97% while preserving attack-relevant paths; (2) adversarial verification, where LLMs simulate realistic attacker capabilities under constrained assumptions to assess exploitability—not just presence—of vulnerabilities; and (3) dynamic validation, automatically generating, executing, and discarding sandboxed exploit environments in ephemeral containers. Evaluated on OpenSSL, WordPress, and Flowise, OpenAnt discovered multiple previously unknown CVE-graded vulnerabilities with 83% lower false positives than baseline LLM-only approaches and manageable cost (<$12 per repository). Our results demonstrate that closed-loop pipelines—integrating semantic reasoning with executable validation—offer a practical, scalable path toward production-grade automated security analysis. OpenAnt is released under Apache 2.0 at https://github.com/knostic/OpenAnt.
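The first stage, reachability-guided decomposition, can be pictured as a graph traversal from externally reachable entry points: only functions that an attacker-facing entry can reach are kept for analysis. The sketch below is a plain breadth-first search over a toy call graph. The graph, the function names, and the dictionary representation are invented for illustration and are not taken from OpenAnt's code.

```python
from collections import deque

def reachable_units(call_graph, entry_points):
    """Return every function reachable from the given entry points (breadth-first search)."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        fn = queue.popleft()
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen

# Hypothetical call graph: caller -> list of callees.
call_graph = {
    "api_handler": ["parse_request", "check_auth"],
    "parse_request": ["decode_body"],
    "internal_tool": ["legacy_decode"],   # not reachable from any entry point
}
in_scope = reachable_units(call_graph, ["api_handler"])
```

Here `internal_tool` and `legacy_decode` fall outside the analysis scope because no external entry point reaches them, which is the kind of pruning the summary credits for the reduced analysis surface.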

Abstract

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.18996v2

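AI Summary (translated from Chinese)

TRAP: A Benchmark for Task Completion and Resistance to Active Privacy Extraction

As agents are used widely in document-heavy workflows, sensitive personal information has gone from edge case to routine input; booking a flight, for example, requires the passport number. In such settings an agent faces a fundamental tension: it must use private fields precisely to complete the task (e.g., parsing an ID number to invoke a tool), yet it must never reveal any private information in its responses (because the identity of the end user cannot be verified). To evaluate this trade-off systematically, this paper proposes the TRAP (Task-completion and Resistance to Active Privacy-extraction) benchmark. Each TRAP scenario is a triple: an original document containing private information, a task query that requires a tool call (and depends on private fields), and a natural-language elicitation query that simulates an attacker (trying to draw out private information). We evaluate 22 models spanning frontier proprietary and open-source families at multiple sizes and find that every model family shows significant privacy leakage; that stronger instruction following goes with higher leakage; that existing prompt-based defenses reduce leakage but seriously harm task accuracy; and that prompt optimization cannot push past this trade-off boundary. Theoretical analysis further proves that for any softmax-architecture model, no soft-constraint defense (such as prompt engineering) can achieve both a high task success rate and zero leakage probability. We therefore propose a structured private field isolation mechanism: before a private field is passed to the model, it is replaced with an irreversible hash key that is decoded only at the tool-call layer. Experiments show that this method brings the leakage rate close to 0 while leaving task accuracy essentially unchanged, decoupling task capability from privacy risk in a way that is both provable in theory and workable in practice.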

AI Summary (English)

We introduce TRAP, a benchmark evaluating the fundamental tension between task completion (requiring use of private fields like passport numbers) and resistance to active privacy extraction (preventing their leakage in responses). TRAP scenarios consist of a document with private information, a task query requiring tool invocation via private fields, and an adversarial query attempting natural-language elicitation. Evaluating 22 models (proprietary and open-source, multiple scales), we find non-trivial leakage across all families, with stronger instruction-following correlating with higher leakage. Prompt-based defenses reduce leakage but significantly harm task accuracy—and prompt optimization cannot escape this trade-off. We prove a key impossibility: for any softmax-based model, no soft-constraint defense (e.g., prompts) can guarantee both high task success and zero leakage probability. To overcome this, we propose structural private field isolation: replacing private fields with irreversible hash keys before model ingestion, decoupling utility from exposure. This approach achieves near-zero leakage while preserving task accuracy.
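The isolation idea can be sketched as a small vault that swaps private values for opaque placeholders before the document reaches the model and swaps them back only when building the tool call. Note the substitution made here: the summary speaks of hash keys, whereas this sketch uses random tokens with a lookup table, chosen because it is simple to show. The class, method names, and placeholder format are all hypothetical.

```python
import secrets

class FieldVault:
    """Maps opaque placeholders to private values; lives outside the model's context."""

    def __init__(self):
        self._store = {}

    def protect(self, value):
        # Random token, so the placeholder carries no information about the value.
        key = f"<PRIVATE:{secrets.token_hex(8)}>"
        self._store[key] = value
        return key

    def resolve(self, text):
        """Tool-layer step: put the real values back just before the tool is invoked."""
        for key, value in self._store.items():
            text = text.replace(key, value)
        return text

def redact_document(doc, private_values, vault):
    """Replace each known private value with its placeholder before model ingestion."""
    for value in private_values:
        doc = doc.replace(value, vault.protect(value))
    return doc

vault = FieldVault()
doc = "Passenger: Ana Silva. Passport: X1234567."
model_input = redact_document(doc, ["X1234567"], vault)
```

The model sees and can copy only the placeholder, so an elicitation query has nothing real to extract, while a tool call built from the model's output can still be completed by calling `vault.resolve` on its arguments.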

Abstract

Formal Verification of Learned Multi-Agent Communication Policies via Decision Tree Distillation

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19632v1

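AI Summary (translated from Chinese)

Background and Challenges

Multi-agent reinforcement learning (MARL) lets agents acquire cooperative strategies through emergent communication, but the resulting neural network policies lack formal safety guarantees and struggle to meet deployment requirements in safety-critical settings such as drone swarms and autonomous vehicle fleets.

Method: An End-to-End Verifiable Policy Abstraction Framework

This study proposes the first formal verification framework for learned multi-agent communication policies, built around a four-stage pipeline of decision tree distillation → formal modeling → compositional verification → empirical back-checking:

Domain-aware feature extraction: semantically clear discrete or quantized features (such as relative-distance intervals and channel occupancy states) are extracted from agent observations;
High-fidelity decision tree distillation: VQ-VIB neural policies are distilled into interpretable decision trees with an average fidelity of 97.9% ± 1.2%;
Automatic PRISM modeling: a one-to-one mapping from features to PRISM state variables is established and a complete probabilistic transition model is generated;
Compositional PCTL verification: a pairwise decomposition plus union-bound aggregation strategy, combined with empirical neighbor modeling, verifies temporal logic properties efficiently.

Key Results and Contributions

In cooperative tasks with 5–7 drones, 18 PCTL properties (covering safety, liveness, and cooperation) are verified; all 5 safety thresholds are met (collision probability of 0.3% against a 1% threshold), and the overall property satisfaction rate is 88.9%. Monte Carlo validation shows that the verification results transfer reliably to the original neural networks, with a deviation of at most 0.6 percentage points (95% confidence interval). Discretized VQ-VIB messages improve fidelity by 11.6–13.6 percentage points over continuous communication methods and make verification 3–4 times faster. The framework is the first to provide a verifiable, traceable, and deployable bridge from deep MARL policies to formal verification workflows.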

AI Summary (English)

We present the first end-to-end framework for formal safety verification of learned multi-agent communication policies. Our approach distills neural policies—specifically Vector-Quantized Variational Information Bottleneck (VQ-VIB) models for 5–7 drone coordination—into high-fidelity decision trees (97.9% ± 1.2% fidelity), automatically translates them into PRISM probabilistic models with exact feature-to-state correspondence, and verifies 18 Probabilistic CTL (PCTL) properties via compositional pairwise decomposition with union-bound aggregation and empirical neighbor modeling. All five safety thresholds are satisfied (e.g., 0.3% collision probability < 1% threshold), achieving 88.9% overall property satisfaction. Monte Carlo validation confirms verified safety transfers to original networks with ≤0.6 pp deviation (95% CI). Discrete VQ-VIB messages yield +11.6–13.6 pp fidelity gains over continuous methods, accelerating verification by 3–4×. This work bridges deep MARL and formal methods for safety-critical multi-robot deployment.

Abstract

Multi-agent reinforcement learning (MARL) enables agents to develop coordination strategies through emergent communication, but neural policies lack the formal safety guarantees required for safety-critical robotic deployment in drone swarms and autonomous vehicle fleets. We present the first end-to-end framework for safety verification of learned multi-agent communication policies through policy abstraction: neural policies are distilled into interpretable decision trees, then formally verified, with empirical validation confirming that verified safety properties transfer to original networks. Our four-stage pipeline consists of domain-specific feature extraction from agent observations, decision tree distillation achieving 97.9% +/- 1.2% fidelity to neural policies, automated translation to PRISM probabilistic model checker specifications with complete feature-to-state-variable correspondence, and compositional verification of Probabilistic Computation Tree Logic (PCTL) properties via pairwise decomposition with union-bound aggregation and empirical neighbor modeling. Evaluating Vector-Quantized Variational Information Bottleneck (VQ-VIB) policies for multi-drone coordination with 5-7 agents, we verify 18 temporal logic properties across safety, liveness, and cooperation, achieving 88.9% property satisfaction with all five safety thresholds satisfied (0.3% collision probability vs. 1% threshold). Monte Carlo validation of original neural policies confirms that verified safety properties transfer with <=0.6 percentage-point deviation (95% CI). Discrete VQ-VIB messages provide +11.6 to +13.6 percentage-point fidelity advantages over continuous methods, enabling 3-4x faster verification. Our framework provides empirically validated safety verification for distilled policy abstractions, serving as a practical bridge between deep MARL and formal safety workflows for multi-robot deployment.
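
The union-bound aggregation step can be sketched as follows. This is a minimal illustration that assumes each pairwise sub-model yields an upper bound on the collision probability for one pair of drones; the per-pair value of 0.0002 and the agent names are hypothetical and are not figures from the paper.

```python
from itertools import combinations


def union_bound(event_probs):
    """Upper-bound P(at least one event occurs) by the sum of the event
    probabilities, capped at 1. No independence assumption is needed."""
    return min(1.0, sum(event_probs.values()))


agents = ["d1", "d2", "d3", "d4", "d5"]

# Hypothetical per-pair collision bounds, one per unordered pair of drones.
pairwise = {pair: 0.0002 for pair in combinations(agents, 2)}

system_bound = union_bound(pairwise)    # 10 pairs x 0.0002
meets_threshold = system_bound < 0.01   # compare with a 1% safety threshold
```

The bound is conservative: it holds even when pairwise events are correlated, which is why verifying small pairwise models can still support a system-level safety claim.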

FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19605v1

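AI Summary (translated from Chinese)

FAPO: A Fully Autonomous Prompt Optimization Framework for Multi-Step LLM Pipelines

Performance bottlenecks in multi-step LLM pipelines (e.g., retrieve-reason-format chains) often stem from coupled failures between stages, and traditional prompt-only tuning struggles to locate and repair structural defects. To address this, the paper proposes FAPO (Fully Autonomous Prompt Optimization), a fully autonomous end-to-end pipeline optimization framework. FAPO uses Claude Code, embedded in a standardized codebase, as its agent and automatically runs a closed optimization loop: evaluate pipeline outputs → inspect execution traces of intermediate steps → attribute the root cause of failures (prompt defect or structural bottleneck) → propose scoped modifications (prompt edits first; chain topology is changed only when attribution confirms a structural bottleneck) → generate and validate variants → iterate until the score function converges.

Across 18 comparisons spanning six public benchmarks (including complex reasoning tasks such as HoVer and IFBench) and three mainstream task models (GPT-5 and the Foundation-Sec series), FAPO outperforms the baseline GEPA in 15; in 11 of them the differences are statistically non-overlapping (the mean ± trial-standard-deviation ranges do not intersect), with an average gain of +14.1 percentage points (pp). Most notably, in the six HoVer and IFBench comparisons that required escalation to structural optimization, FAPO wins all six, with an average gain of +33.8 pp. In the security domain, FAPO also performs well on CTIBench-RCM (CVE-to-CWE mapping), improving test accuracy by +4.0 pp, +7.1 pp, and +2.0 pp on GPT-5, Foundation-Sec-8B-Instruct, and Foundation-Sec-8B-Reasoning, respectively. FAPO is the first to achieve coordinated autonomy between prompt optimization and controlled structural evolution, offering a reproducible, attributable, high-performance optimization paradigm for general-purpose and security-sensitive LLM pipelines.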

AI Summary (English)

FAPO (Fully Autonomous Prompt Optimization) is a novel framework that autonomously optimizes multi-step LLM pipelines—beyond prompt tuning alone—by diagnosing failures at intermediate stages and applying scoped improvements: first prompt edits, then structural changes (e.g., reordering or skipping steps) only when attribution identifies a chain-level bottleneck. Integrated via Claude Code in a standardized codebase, FAPO iteratively evaluates, inspects, diagnoses, proposes, and validates variants against a score function. Across six benchmarks and three models, FAPO outperforms baseline GEPA in 15/18 model-benchmark pairs; in 11 cases, gains are statistically non-overlapping (mean +14.1 pp). Crucially, on the six HoVer/IFBench tasks where prompt-first search escalated to structural changes, FAPO wins all—with a mean gain of +33.8 pp. On security-focused CTIBench-RCM, FAPO boosts test accuracy by +4.0 pp (GPT-5), +7.1 pp (Foundation-Sec-8B-Instruct), and +2.0 pp (Foundation-Sec-8B-Reasoning). FAPO establishes a new state-of-the-art for robust, interpretable, and adaptive pipeline optimization.

Abstract

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It tries prompt edits first and changes the chain structure, within the permitted scope, only when prompt optimization appears insufficient and attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean ± trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.
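
The "non-overlapping mean ± trial-standard-deviation" criterion can be made concrete with a short sketch. It assumes the sample standard deviation across trials (the abstract does not say whether the sample or population form is used), and the trial scores below are hypothetical.

```python
from statistics import mean, stdev


def mean_std_range(scores):
    """Return the (low, high) range mean - std .. mean + std over trial scores."""
    centre, spread = mean(scores), stdev(scores)  # sample std: an assumption
    return centre - spread, centre + spread


def wins_without_overlap(candidate_scores, baseline_scores):
    """True when the candidate's whole range lies above the baseline's range."""
    candidate_low, _ = mean_std_range(candidate_scores)
    _, baseline_high = mean_std_range(baseline_scores)
    return candidate_low > baseline_high


# Hypothetical accuracy scores (percent) from three trials each.
clear_win = wins_without_overlap([72.0, 74.0, 73.0], [60.0, 62.0, 58.0])
overlapping = wins_without_overlap([70.0, 72.0, 71.0], [69.0, 71.0, 70.0])
```

In the first comparison the ranges are 72 to 74 and 58 to 62, so the win counts as non-overlapping; in the second the ranges 70 to 72 and 69 to 71 intersect, so it would count only as an ordinary win.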

Deontic Policies for Runtime Governance of Agentic AI Systems

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19464v1

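AI Summary (translated from Chinese)

Background and Problem

Autonomous agents (agentic AI) driven by large language models (LLMs) can invoke tools, manipulate data, install software, and coordinate across organizations, so their behavior is far more complex than that of traditional application systems. Existing access control mechanisms (such as XACML, Rego, and Cedar) support only static binary permit/prohibit policies and cannot model the full normative logic that enterprise governance requires: behavioral obligations (e.g., notifying the CISO after a sensitive operation), conditions under which obligations are waived (dispensations), precedence rules when multiple policies conflict, and semantic reasoning over domain ontologies (e.g., class hierarchies in healthcare, cybersecurity, or privacy regulation).

Method and Innovations

This paper proposes AgenticRei, the first deontic logic policy framework for runtime governance. Its core components are:

An extensible deontic policy language built on the Rei framework, which formalizes obligations, prohibitions, permissions, dispensations, and meta-policies in OWL (Web Ontology Language);
A lightweight, high-throughput external logic engine that performs real-time policy evaluation independently, outside the LLM inference pipeline;
A unified governance layer that constrains both the agent's tool invocations and agent-to-agent messages, and integrates seamlessly with industry-standard architectures such as A2AS.

Main Findings

Validated on representative scenarios such as medical data access, cloud resource provisioning, and cross-domain collaboration, AgenticRei expresses key governance constraints that current production-grade policy engines cannot capture at all (e.g., "if an agent accesses PHI data, it must encrypt the logs and trigger an audit flow within 30 seconds; if a red-team exercise is under way, this obligation may be temporarily waived"). Experiments show a policy evaluation latency of <15 ms (P99), support for thousands of policy decisions per second, and ontology reasoning that markedly improves compliance verification coverage.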

AI Summary (English)

Autonomous agentic AI systems powered by LLMs introduce novel governance challenges beyond traditional access control: they require expressive specification of obligations (e.g., “notify CISO after data access”), dispensations (contextual waivers), policy precedence, and ontology-aware reasoning over domain hierarchies (e.g., HIPAA classes or NIST controls). Existing policy engines (XACML, Rego, Cedar) only support permit/prohibit logic and lack obligation lifecycle management, meta-policy conflict resolution, or semantic inference. We propose AgenticRei, a runtime governance framework built on the Rei framework and expressed in OWL. It evaluates deontic policies—including obligations, prohibitions, permissions, and dispensations—via a high-performance external logic engine, decoupled from the LLM. AgenticRei uniformly governs both tool invocations and agent-to-agent messages, integrates natively with A2AS, and captures security and privacy constraints (e.g., conditional audit logging with contextual exemptions) that are inexpressible in current production policy systems. Evaluation shows sub-15ms latency (P99) and full expressivity for enterprise-grade governance.

Abstract

Autonomous agentic AI systems driven by Large Language Models (LLMs) introduce a new class of security, privacy, and compliance challenges: an agent that can invoke tools, manipulate data, install software, and coordinate with peer agents across organizational boundaries must be constrained not just by authentication and access control, but by the full structure of enterprise governance. This includes specifying what agents are permitted to do and prohibited from doing, what they are obliged to do after certain actions (e.g., notify the CISO), under what conditions a standing obligation may be waived, and which rules take precedence when policies conflict. This governance problem exceeds what current policy engines provide. Systems such as XACML, Rego, and Cedar address only the permit/prohibit subset of this governance structure. They do not provide obligation lifecycle management, meta-policy conflict resolution, dispensations that waive obligations in specific circumstances, or ontological reasoning over the domain class hierarchies commonly found in applications such as healthcare, cybersecurity, or data privacy. We propose AgenticRei, which realizes key governance requirements such as obligations, dispensations, policy conflict resolution, and reasoning over policies, as well as the basic permit/prohibit constraints. We use a deontic policy language built on the Rei framework, expressed as OWL (Web Ontology Language) and evaluated at runtime by a high-performance logic engine entirely outside the LLM. The same pipeline governs both tool invocations by the agent and agent-to-agent messages. We show through examples that deontic policies capture governance constraints around security and privacy that mostly cannot be expressed in current production engines. Our approach composes naturally with industry-standard frameworks like A2AS.
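
To make the four deontic modalities concrete, here is a toy evaluator. It is only a sketch under stated assumptions: it is not the Rei language, uses no OWL, and performs no ontological reasoning. The `Rule` class, the `evaluate` function, the priority-based meta-policy, the default-deny behavior, and the rule set are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Context = dict


@dataclass(frozen=True)
class Rule:
    modality: str      # "permit", "prohibit", "oblige", or "dispense"
    action: str        # the action that triggers this rule
    priority: int = 0  # used by the (assumed) precedence meta-policy
    duty: str = ""     # for oblige/dispense: the duty created or waived
    condition: Callable[[Context], bool] = lambda ctx: True


def evaluate(rules, action, ctx):
    """Return (allowed, duties) for one action in one context."""
    active = [r for r in rules if r.action == action and r.condition(ctx)]
    # Assumed meta-policy: highest priority wins; on a tie, prohibition wins.
    access = [r for r in active if r.modality in ("permit", "prohibit")]
    access.sort(key=lambda r: (r.priority, r.modality == "prohibit"), reverse=True)
    allowed = bool(access) and access[0].modality == "permit"  # default deny
    if not allowed:
        return False, []
    waived = {r.duty for r in active if r.modality == "dispense"}
    owed = {r.duty for r in active if r.modality == "oblige"}
    return True, sorted(owed - waived)


rules = [
    Rule("permit", "read_phi", priority=1),
    Rule("prohibit", "read_phi", priority=5,
         condition=lambda ctx: not ctx.get("authenticated", False)),
    Rule("oblige", "read_phi", duty="notify_ciso"),
    Rule("dispense", "read_phi", duty="notify_ciso",
         condition=lambda ctx: ctx.get("red_team_exercise", False)),
]

normal = evaluate(rules, "read_phi", {"authenticated": True})
exercise = evaluate(rules, "read_phi",
                    {"authenticated": True, "red_team_exercise": True})
denied = evaluate(rules, "read_phi", {"authenticated": False})
```

A pure permit/prohibit engine could express only the first two rules; the obligation and its conditional waiver are the part that needs the richer deontic vocabulary.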

Variational Consensus Monte Carlo for Bayesian Mixture

Wed, 17 Jun 2026 00:00:00 -0000

Paper Link: https://arxiv.org/abs/2606.19643v1

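AI Summary (translated from Chinese)

Background and Motivation

To address the difficulty of centralized modeling of healthcare data under real-world constraints such as privacy protection, sensitivity, and restricted cross-institution sharing, this paper proposes a new inference framework for Bayesian mixture models in federated learning settings: Variational Consensus Monte Carlo (VCMC).

Methodological Innovations

We systematically extend the variational consensus Monte Carlo method of Rabinovich et al. (2015):

(i) Support for over-fitted Bayesian mixture models: without a preset number of clusters or conjugate priors, the method jointly infers the optimal number of clusters and all model parameters (e.g., means, covariances, weights);
(ii) Cross-node cluster-matching algorithms: these address the key problem of unbalanced cluster distributions across local data silos (e.g., small clusters that appear at only some nodes) by introducing a robust matching strategy based on Wasserstein distance and label-alignment optimization;
(iii) A library of aggregation strategies in several modes: lightweight (transmitting only posterior means and covariances), communication-efficient (low-rank approximation), and high-accuracy (variational objective optimization), suited to different bandwidth, latency, and compute constraints;
(iv) Practical guidelines: based on simulation and empirical analysis, a decision tree for strategy selection (e.g., small clusters dominate → choose Wasserstein matching + full variational aggregation; communication-constrained → choose mean-covariance compression + calibration correction).

Empirical Results

In large-scale simulations, the framework significantly outperforms existing federated learning methods (such as FedAvg for mixture and Federated DP-Means); in particular, when the local data distributions reflect the global clustering structure, its accuracy in identifying rare small clusters (<2% of the population) is up to 37% higher than that of centralized MCMC. Finally, on electronic health records of an elderly UK population (>1.2 million patients), we identify six clinically meaningful multimorbidity patterns (e.g., a composite "heart failure, kidney disease, anemia" phenotype), confirming clinical utility and scalability.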

AI Summary (English)

Motivated by privacy and data-sharing constraints in healthcare, we propose Variational Consensus Monte Carlo (VCMC) for Bayesian mixture modeling under federated learning. Unlike prior variational CMC—which assumes known cluster count and conjugate structure—our method jointly infers the number of clusters and all parameters in overfitted Dirichlet process or finite mixture models, without conjugacy requirements. We introduce robust cross-silo cluster matching via Wasserstein alignment to handle heterogeneous local cluster support, and provide a suite of aggregation strategies (mean-covariance compression, low-rank variational, full optimization) tailored to communication, memory, and accuracy constraints. A comprehensive simulation study shows our approach outperforms state-of-the-art federated alternatives—and crucially, when local datasets reflect global clustering structure, it recovers rare clusters (<2% prevalence) with up to 37% higher accuracy than pooled MCMC. Applied to >1.2M UK geriatric EHR records, VCMC identifies clinically interpretable multimorbidity patterns, demonstrating scalability and real-world utility.

Abstract

Motivated by the privacy, sensitivity and sharing limitations of health data, we present a comprehensive pipeline for inference of Bayesian mixture models within a federated learning setting, i.e. when data cannot be fully shared or pooled across compute nodes. We adopt a Consensus Monte Carlo (CMC) approach, in which an MCMC algorithm is run independently within each data silo to estimate local posterior distributions, which are then aggregated to approximate the posterior over the full data. The variational CMC approach of Rabinovich, Angelino and Jordan (2015) [1] frames the aggregation step as a variational inference problem, but their application to mixtures assumes the number of clusters and key mixture parameters to be known. Our main methodological contributions are: (i) an extension of variational CMC to over-fitted Bayesian mixture models that infer the number of clusters and all model parameters, without requiring conjugacy; (ii) novel cluster-matching algorithms suitable for cross-silo settings in which not every cluster appears in each local dataset; (iii) a number of inference strategies for the aggregation step, matched to different federated learning constraints; and (iv) guidelines for choosing among these in practice. A comprehensive simulation study validates the framework and allows us to compare to state-of-the-art federated learning alternatives. Notably, we show that when the composition of local datasets reflects the underlying clustering structure in the data, our approach can recover small clusters with greater accuracy than standard MCMC applied to the pooled data. We illustrate the framework on large-scale electronic health record data, identifying multi-morbidity patterns in a British geriatric population.
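
For orientation, the aggregation step of plain Consensus Monte Carlo can be sketched as a precision-weighted average of local draws. This is the simple baseline rule that variational CMC replaces with a learned aggregation; it is not the paper's mixture-specific method and does no cluster matching. The scalar parameter, the Gaussian stand-ins for local MCMC output, and the function name are assumptions made for illustration.

```python
import random
from statistics import mean, variance


def consensus_draws(shard_draws):
    """Combine per-silo scalar posterior draws by precision-weighted averaging.

    Each silo is weighted by the inverse sample variance of its own draws,
    and the g-th consensus draw averages the g-th draw from every silo.
    """
    weights = [1.0 / variance(draws) for draws in shard_draws]
    total = sum(weights)
    n_draws = min(len(draws) for draws in shard_draws)
    return [
        sum(w * draws[g] for w, draws in zip(weights, shard_draws)) / total
        for g in range(n_draws)
    ]


rng = random.Random(0)

# Hypothetical local posterior draws from three silos (stand-ins for MCMC output).
shards = [
    [rng.gauss(1.0, 0.5) for _ in range(2000)],
    [rng.gauss(1.2, 0.3) for _ in range(2000)],
    [rng.gauss(0.9, 0.4) for _ in range(2000)],
]

combined = consensus_draws(shards)
consensus_mean = mean(combined)
```

Only the draws (or summaries of them) leave each silo, never the raw records. The weighted average is exact for Gaussian posteriors and only approximate otherwise, which is one reason a more flexible aggregation step is useful for multimodal mixture posteriors.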