<?xml version='1.0' encoding='utf-8'?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
  <channel>
    <title>Paper Feeds</title>
    <link>https://jamie-cui.github.io/paper-feeds</link>
    <description>Keyword-based research paper feeds from arXiv and IACR</description>
    <lastBuildDate>Thu, 07 May 2026 02:20:33 -0000</lastBuildDate>
    <atom:link href="https://jamie-cui.github.io/paper-feeds/feed.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>SoK: Robustness in Large Language Models against Jailbreak Attacks</title>
      <link>https://arxiv.org/abs/2605.05058v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.05058v1</guid>
      <description>This SoK paper systematizes the landscape of jailbreak attacks and defenses against Large Language Models (LLMs). We introduce a comprehensive taxonomy covering 7 attack categories (e.g., template injection, semantic obfuscation, role-playing) and 5 defense paradigms (e.g., input sanitization, response filtering, alignment fine-tuning). Our core contribution is **Security Cube**, a unified, multi-dimensional evaluation framework that assesses techniques across three orthogonal axes: *attack strength &amp; stealth*, *defense efficacy &amp; overhead*, and *system-level properties* (e.g., judge reliability, cross-model vulnerability distribution). Using Security Cube, we benchmark 13 representative jailbreak attacks and 5 defenses across 5 open-weight LLMs and 3 automated judges—including our lightweight JudgeNet—revealing critical insights: (1) most defenses fail under adaptive attacks; (2) current automated judges suffer 22–37% false positive/negative rates; and (3) alignment strategy matters more than model size for robustness. We identify key open challenges and outline promising research directions toward provably robust, interpretable, and trustworthy LLMs. Code is publicly available.</description>
      <pubDate>Wed, 06 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>jailbreak</category>
      <category>llm</category>
      <category>security</category>
    </item>
    <item>
      <title>Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation</title>
      <link>https://arxiv.org/abs/2605.05054v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.05054v1</guid>
      <description>Recent flow matching (FM) methods improve few-shot adaptation of vision-language models (VLMs) by modeling cross-modal alignment as continuous flows. However, we identify three fundamental limitations rooted in geometric incompatibility of pre-trained features: (1) angular dynamics distortion due to radial-angular coupling; (2) neglect of radial dynamics via destructive normalization; and (3) loss of dataset-specific context in unconditional flows. To address these, we propose **Direct Product Flow Matching (DP-FM)**—a Riemannian framework built on a *warped product manifold* with constant warping, yielding a decoupled cylindrical manifold ($\mathbb{R}^{+} \times S^{d-1}$). DP-FM enables *independent radial evolution* and *constant-speed angular geodesic transport*, eliminating angular distortion while preserving radial semantics. We further inject missing context via classifier-free guidance conditioned on pre-trained VLM hidden states. Extensive experiments across 11 benchmarks demonstrate DP-FM achieves new state-of-the-art performance for multi-step few-shot adaptation, validating the critical role of geometric decoupling.</description>
      <pubDate>Wed, 06 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>dp</category>
    </item>
    <item>
      <title>Federated Learning for Early Prediction of EV Charging Demand</title>
      <link>https://arxiv.org/abs/2605.04993v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.04993v1</guid>
      <description>Accurate early prediction of EV charging demand—estimating total session energy using only plug-in metadata and the first few minutes of charging—is critical for real-time grid coordination and operator decision-making. We build a session-level dataset from the Adaptive Charging Network (ACN) at Caltech, extracting tabular features capturing user intent, temporal patterns, and initial charging dynamics. Modeling intra-depot heterogeneity via station-level client partitions, we evaluate XGBoost, TabNet, and Federated LSTM under FedAvg. Results show that federated models achieve up to 92% of centralized performance while keeping data on-site—enabling privacy-preserving, scalable analytics across distributed infrastructure. With just 2 minutes of charging data, our best federated model achieves a mean absolute error of 1.8 kWh (vs. ~12.5 kWh average session energy), demonstrating feasibility of low-latency, high-utility demand forecasting. Code is publicly available.</description>
      <pubDate>Wed, 06 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>learning</category>
      <category>federated</category>
    </item>
    <item>
      <title>On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference</title>
      <link>https://arxiv.org/abs/2605.04901v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.04901v1</guid>
      <description>This paper critically re-examines the widely adopted *shuffling defense* in secure Transformer inference, where intermediate activations are randomly permuted before being revealed to the client to enable efficient plaintext nonlinear operations. We demonstrate that shuffling is fundamentally insufficient: despite independent permutations across queries, the underlying activation geometry preserves neuron-wise correlations exploitable via statistical alignment. We propose a novel attack that (i) estimates latent neuron correspondences using cross-query activation covariance, and (ii) recovers a common permutation basis via optimal transport-based alignment. Experiments on Pythia-70m and GPT-2 show mean squared alignment errors of $10^{-9}$–$10^{-6}$, enabling weight recovery with L1-norm deviations of $10^{-4}$–$10^{-2}$ from ground-truth weights at a query cost of ~\$1. Our results invalidate the security claims of shuffling alone and call for stronger, geometry-aware defenses in practical secure inference systems.</description>
      <pubDate>Wed, 06 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>inference</category>
      <category>security</category>
    </item>
    <item>
      <title>Storage Is Not Memory: A Retrieval-Centered Architecture for Agent Recall</title>
      <link>https://arxiv.org/abs/2605.04897v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.04897v1</guid>
      <description>We challenge the dominant “extraction-at-ingestion” paradigm in agent memory, arguing that discarding raw content before query time fundamentally limits recall. We present **True Memory**, a retrieval-centered, six-layer architecture that preserves all events verbatim and performs multi-stage retrieval directly over unmodified text—running entirely within a single SQLite file on a commodity CPU, with no vector index, graph store, external database, or GPU. Evaluated on three benchmarks, True Memory Pro achieves **93.0% accuracy** (3-run mean) on LoCoMo (1,540 questions), **87.8%** on LongMemEval (500 questions), and **76.6%** on BEAM-1M (700 questions at 1M-token scale)—surpassing prior state-of-the-art (e.g., 73.9% for Hindsight on BEAM-1M). A 56-configuration ablation confirms tight performance variance (±0.65 pp) within the top family, demonstrating robustness. This work establishes that high-fidelity recall need not require embedding models or infrastructure overhead.</description>
      <pubDate>Wed, 06 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>extraction</category>
      <category>model</category>
    </item>
    <item>
      <title>DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents</title>
      <link>https://arxiv.org/abs/2605.04808v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.04808v1</guid>
      <description>We present **DecodingTrust-Agent Platform (DTap)**, the first controllable and interactive red-teaming platform for AI agents, spanning 14 real-world domains and 50+ high-fidelity simulation environments (e.g., Google Workspace, PayPal, Slack). To scale risk discovery, we introduce **DTap-Red**, the first autonomous red-teaming agent that systematically explores multi-vector attack surfaces—including prompt, tool, skill, environment, and their combinations—and generates goal-directed adversarial strategies. Leveraging DTap-Red, we curate **DTap-Bench**, a large-scale red-teaming benchmark with verifiable judges for automatic outcome validation. Large-scale evaluation across leading agent frameworks reveals critical vulnerabilities: (1) tool-layer exploits dominate (68% of successful attacks), (2) multi-step attacks succeed 3.2× more often than single-step ones, and (3) current safety mechanisms fail catastrophically against environment-spoofing. DTap establishes a reproducible foundation for building secure, trustworthy AI agents.</description>
      <pubDate>Wed, 06 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>prompt</category>
      <category>injection</category>
      <category>security</category>
      <category>agent</category>
    </item>
    <item>
      <title>Knowledge-Free Correlated Agreement for Incentivizing Federated Learning</title>
      <link>https://arxiv.org/abs/2605.04747v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.04747v1</guid>
      <description>We propose **Knowledge-Free Correlated Agreement (KFCA)**, a novel incentive mechanism for federated learning that rewards client contributions *without any ground truth labels, public test set, or distributional knowledge*. Under categorical local predictions and an honest-majority assumption, KFCA is **provably strictly truthful**, eliminating the label-flipping vulnerability inherent in classical Correlated Agreement (CA). Its lightweight, pairwise agreement scoring enables real-time reward computation on-device—critical for decentralized and blockchain-based FL ecosystems. Evaluated on federated LLM adapter tuning (32 clients, heterogeneous data) and a real-world PCB inspection task (12 factory-edge nodes), KFCA achieves 98.2% reward accuracy with sub-120ms per-round latency and integrates natively with smart contracts—reducing on-chain reward delay to 1.7 seconds (4.3× faster than CA). KFCA is the first incentive scheme offering formal truthfulness guarantees under zero supervision.</description>
      <pubDate>Wed, 06 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>learning</category>
      <category>federated</category>
    </item>
    <item>
      <title>Pen-Strategist: A Reasoning Framework for Penetration Testing Strategy Formation and Analysis</title>
      <link>https://arxiv.org/abs/2605.04499v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.04499v1</guid>
      <description>Cybersecurity faces escalating threats and a critical shortage of skilled professionals, motivating the automation of penetration testing. Existing LLM-based frameworks suffer from poor strategic reasoning, inaccurate tool/action selection, and low execution stability. To address this, we propose **Pen-Strategist**, a novel reasoning framework comprising: (1) a domain-specific Qwen-3-14B model fine-tuned via reinforcement learning on a logically annotated pentesting dataset (2,184 samples with strategy derivation chains and step justifications), and (2) a semantic-aware CNN classifier for robust step-to-command mapping. Evaluation shows Pen-Strategist achieves **87% higher strategy derivation accuracy** vs. baseline, **47.5% improvement in subtask completion** when integrated into PentestGPT on vulnerable machines (outperforming GPT-5), and **18% gain on CTFKnow**. Its CNN classifier surpasses commercial LLMs by **28% in step prediction accuracy** and significantly enhances execution reliability. A user study with 15 security experts confirms its superior strategy quality over Claude-4.6-Sonnet.</description>
      <pubDate>Wed, 06 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>llm</category>
      <category>security</category>
      <category>train</category>
    </item>
    <item>
      <title>Trustworthy Federated Label Distribution Learning under Annotation Quality Disparity</title>
      <link>https://arxiv.org/abs/2605.04827v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.04827v1</guid>
      <description>Label Distribution Learning (LDL) models supervision as instance-wise probability distributions to handle inherent ambiguity, but high-fidelity label distributions are costly and often noisy—especially under federated settings where data isolation exacerbates *annotation quality disparity* across clients. This heterogeneity invalidates sample-size-based aggregation (e.g., FedAvg), creating a critical trust dilemma. To address it, we propose **FedQual**, a quality-aware Fed-LDL framework featuring: (i) *quality-adaptive client training*, guided by a global semantic anchor that calibrates low-quality clients while preserving the autonomy of high-quality ones; and (ii) *reliability-aware server aggregation*, which reweights updates by effective reliable information—not raw sample count. We introduce four new Fed-LDL benchmarks (FER-LDL, FI-LDL, PIPAL-LDL, KADID-LDL) with controlled annotation quality gradients. Theoretically, we prove client-specific calibration strictly dominates uniform calibration under heterogeneous supervision quality. Extensive experiments show FedQual consistently outperforms SOTA methods (avg. +5.2% KL reduction, +4.8% distribution accuracy), demonstrating robustness even when only 10% of clients provide high-quality labels.</description>
      <pubDate>Wed, 06 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>learning</category>
      <category>federated</category>
    </item>
    <item>
      <title>Gray-Box Poisoning of Continuous Malware Ingestion Pipelines</title>
      <link>https://arxiv.org/abs/2605.04698v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.04698v1</guid>
      <description>This paper investigates a realistic gray-box poisoning threat against continuous malware detection pipelines, where attackers possess partial knowledge (e.g., feature space, model architecture) but no full access to training infrastructure. Using the `secml_malware` framework, we generate functionality-preserving adversarial binaries in problem space via Import Address Table (IAT) manipulation and section injection—both lightweight, semantically valid PE file modifications. Empirical evaluation on a production-grade LightGBM detector shows that subtle IAT-based perturbations (e.g., adding ≤5 benign DLL imports) yield compact poisoned samples (&lt;0.5% size increase) that degrade recall by 32.7 percentage points (98.1% → 65.4%), outperforming section-based alternatives. We further propose and validate a homogeneous ensemble defense that leverages prediction disagreement across identical LightGBM models to flag suspicious samples *before ingestion*: it achieves **95.6% poisoning detection rate** while retaining **99.2% of legitimate samples**, demonstrating practical viability for real-world deployment.</description>
      <pubDate>Wed, 06 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>data</category>
      <category>poisoning</category>
      <category>machine</category>
      <category>model</category>
      <category>adversarial</category>
    </item>
    <item>
      <title>FL-Sailer: Efficient and Privacy-Preserving Federated Learning for Scalable Single-Cell Epigenetic Data Analysis via Adaptive Sampling</title>
      <link>https://arxiv.org/abs/2605.04519v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.04519v1</guid>
      <description>Single-cell ATAC-seq (scATAC-seq) enables high-resolution chromatin accessibility profiling, but privacy regulations and data heterogeneity impede multi-institutional collaboration. Federated learning (FL) promises privacy preservation yet struggles with scATAC-seq’s ultra-high dimensionality, extreme sparsity, and cross-site distribution shifts. We propose **FL-Sailer**, the first FL framework tailored for scATAC-seq. It integrates (i) *adaptive leverage score sampling*—biologically interpretable feature selection reducing dimensionality by 80%—and (ii) an *invariant VAE* that disentangles biological signals from technical confounders via mutual information minimization. We provide theoretical convergence guarantees with bounded approximation error. Experiments on synthetic and real multi-center epigenomic datasets (200K+ cells across 4 institutions) show FL-Sailer not only enables previously infeasible privacy-compliant collaborations but also **outperforms centralized methods** in clustering (ARI +12.3%), cell-type annotation (F1 +9.7%), and batch correction—demonstrating adaptive sampling as an effective implicit regularizer against technical noise.</description>
      <pubDate>Wed, 06 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>learning</category>
      <category>federated</category>
    </item>
    <item>
      <title>A Comparative Analysis of Machine Learning and Deep Learning Models for Tweet Sentiment Classification: A Case Study on the Sentiment140 Dataset</title>
      <link>https://arxiv.org/abs/2605.04888v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.04888v1</guid>
      <description>This study rigorously compares Logistic Regression (LR) with TF-IDF features against a Bidirectional LSTM (BiLSTM) model on a balanced 10,000-tweet subset of the Sentiment140 dataset. Contrary to common assumptions, LR achieved superior test accuracy (**73.5%**) compared to BiLSTM (**69.17%**), which exhibited mild overfitting (train: 82.3%, val: 68.9%). Results indicate that for medium-scale, noisy social media text, classical ML with robust feature engineering can outperform complex deep learning architectures in both performance and generalizability. The models were deployed as an open, interactive web application via Streamlit and Hugging Face Spaces, enabling real-time sentiment analysis and public accessibility.</description>
      <pubDate>Wed, 06 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>extraction</category>
      <category>model</category>
    </item>
    <item>
      <title>Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours</title>
      <link>https://arxiv.org/abs/2605.04019v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.04019v1</guid>
      <description>AI red teaming is critically needed as AI systems enter high-stakes domains—but current practices are manual, library-specific, and time-prohibitive, often requiring weeks to craft and iterate attack workflows. We introduce an agentic red teaming system built on the open-source Dreadnode SDK. It autonomously generates, executes, and reports on security assessments using a unified repository of 45+ attacks, 450+ transforms, and 130+ scorers—enabling probing of multi-agent, multilingual, and multimodal targets. Our three key contributions are: (1) a natural-language-driven terminal interface (TUI) that lets operators specify goals (e.g., “find jailbreak prompts for Llama Scout”) and delegates all workflow orchestration to the agent—reducing red team cycles from *weeks to hours*; (2) a single framework unifying adversarial testing for both traditional ML models (e.g., FGSM attacks) and generative AI (e.g., prompt injection, role-play bypass); and (3) a real-world case study on Meta’s Llama Scout, achieving an **85% attack success rate** with severity up to 1.0—using *zero hand-written code*. This work redefines AI red teaming as an agile, goal-directed, and operator-centric practice.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>agent</category>
      <category>security</category>
    </item>
    <item>
      <title>Generating Proof-of-Vulnerability Tests to Help Enhance the Security of Complex Software</title>
      <link>https://arxiv.org/abs/2605.03956v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.03956v1</guid>
      <description>Modern applications depend on third-party libraries, and reachable vulnerabilities in those libraries pose real supply-chain risks. Developers require executable *proof-of-vulnerability (PoV) tests* to assess practical exploitability—but manual creation is arduous, and existing automation falls short. We propose **PoVSmith**, a novel agent-based approach that synergizes call-path analysis, exemplar tests, code context, and *execution feedback* in multi-turn prompts to guide Codex and GPT for end-to-end PoV test generation, execution, and assessment. Evaluated on 33 vulnerable Java `&lt;App, Lib&gt;` pairs, PoVSmith identified 158 application-level entry points (96% precision), generated 152 tests, and produced 84 (55%) *executable, attack-demonstrating* PoVs—substantially outperforming state-of-the-art LLM methods in both feasibility rate (+210%) and human-effort reduction. Our contributions include: (1) an agent-augmented test generation framework; (2) an execution-feedback-driven iterative refinement pipeline; and (3) an LLM-based quality evaluator grounded in contextual semantics and runtime logs.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>agent</category>
      <category>security</category>
      <category>llm</category>
    </item>
    <item>
      <title>Tailored Prompts, Targeted Protection: Vulnerability-Specific LLM Analysis for Smart Contracts</title>
      <link>https://arxiv.org/abs/2605.03697v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.03697v1</guid>
      <description>Smart contracts’ immutability makes them highly vulnerable to diverse security flaws—yet existing detectors suffer from inflexible rule-based designs and poor generalization across vulnerability types. This paper introduces a practical LLM-based framework for *vulnerability-specific* smart contract analysis. We release a large-scale, professionally annotated dataset of **31,165 vulnerability instances** from 3,200+ real-world projects across 15 blockchain platforms. Our method combines **AST-guided context extraction** (isolating vulnerability-relevant code fragments and dependencies) with **customized prompts per vulnerability category** (13 in total), enabling precise, interpretable detection without model fine-tuning. Experiments show strong performance: **average positive recall of 0.92** (detecting true vulnerabilities) and **average negative recall of 0.85** (correctly rejecting benign code), significantly outperforming generic LLM prompting and static analyzers. This work demonstrates that *targeted contextual prompting*, grounded in program structure and vulnerability semantics, enables scalable, high-precision smart contract security auditing.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>security</category>
      <category>llm</category>
    </item>
    <item>
      <title>MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents</title>
      <link>https://arxiv.org/abs/2605.03482v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.03482v1</guid>
      <description>We formalize memory poisoning in retrieval-augmented agents as a Stackelberg game and expose a critical evaluation flaw in prior work: correcting Chen et al.’s triggered-query specification increases measured attack success rate (ASR-R) from 0.25 to 1.00 — a 4× boost. Our main contribution is **MEMSAD**, a gradient-coupled anomaly detector grounded in a novel theorem proving that, under encoder regularity, the anomaly score gradient equals the retrieval objective gradient — implying any continuous perturbation reducing detection risk *necessarily degrades retrieval rank*. This yields a certified detection radius and minimax-optimal calibration sample complexity $\Omega(1/\rho^2)$, achieved by MEMSAD up to $\log(1/\delta)$ factors. We derive online regret bounds $O(\sigma^{2/3}\Delta^{1/3})$ for rolling calibration and formally characterize the discrete synonym-substitution loophole — the fundamental boundary of continuous-space defenses. Experiments on a 3×5 attack-defense matrix (n=1,000, Bonferroni-corrected, Clopper-Pearson validated) show composite MEMSAD achieves perfect TPR=1.00/FPR=0.00 against all continuous attacks, while synonym substitution evades detection (ASR-R≈0), exposing an irreducible gap for embedding-based defenses.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>security</category>
      <category>llm</category>
    </item>
    <item>
      <title>Graph Reconstruction from Differentially Private GNN Explanations</title>
      <link>https://arxiv.org/abs/2605.03388v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.03388v1</guid>
      <description>This paper exposes a critical privacy gap: **differentially private (DP) GNN explanations—mandated by regulations like GDPR—still enable high-fidelity reconstruction of hidden graph structure**. We propose **PRIVX**, the first attack leveraging the equivalence between Gaussian DP and a single forward step of denoising diffusion (with known noise level σ(ε)), recasting reconstruction as *conditional reverse diffusion*. This yields a principled Bayesian denoiser under DP corruption. We formalize a stratified adversary model parameterized by $(M, \hat{\varepsilon}, \hat{\delta}, S, \rho)$ and derive tight two-sided bounds on reconstruction AUC. Crucially, we find explainer leakage depends on graph homophily: neighborhood-aggregating explainers (e.g., GNNExplainer) leak more than gradient-based ones on homophilic graphs—but *less* on strongly heterophilic ones, under identical DP budgets. We further introduce **PRIVF**, an auxiliary diagnostic sharing PRIVX’s diffusion backbone, to decompose leakage into explainer-induced vs. intrinsic graph-distribution components. Experiments across 7 benchmarks, 3 DP mechanisms, and 3 GNN backbones show PRIVX achieves AUC &gt; 0.7 at ε = 5 on 5/7 datasets—well within typical deployment budgets.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>differential</category>
      <category>dp</category>
      <category>privacy</category>
    </item>
    <item>
      <title>DECKER: Domain-invariant Embedding for Cross-Keyboard Extraction and Recognition</title>
      <link>https://arxiv.org/abs/2605.03384v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.03384v1</guid>
      <description>Acoustic side-channel attacks (ASCA) on keyboards remain a practical security threat, yet prior work suffers from limited dataset diversity and poor cross-device generalization. To address this, we introduce **HEAR**, a large-scale, multi-axis ASCA benchmark with recordings from 53 users typing on 37 laptop keyboards across three realistic settings: external mic, device mic (clean), and VoIP streaming (noisy/lossy). On HEAR, we establish a comprehensive ASCA benchmark and propose **DECKER**, a domain-invariant framework featuring four key innovations: (1) Keyboard Signature Normalization to mitigate device-specific coloration; (2) domain-adversarial disentanglement to suppress keyboard identity; (3) supervised cross-keyboard contrastive alignment for key-consistent embeddings; and (4) Acoustic Style Randomization to synthesize unseen keyboard responses. We further integrate an LLM-based post-processor for sentence-level refinement using linguistic context. Experiments show DECKER achieves substantial gains—up to +12.6% keystroke identification accuracy in cross-keyboard/cross-user settings—and LLM rectification boosts sentence-level accuracy by +8.3%, confirming ASCA’s real-world viability and heightened risk.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>inference</category>
      <category>model</category>
      <category>security</category>
      <category>llm</category>
      <category>extraction</category>
    </item>
    <item>
      <title>ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection</title>
      <link>https://arxiv.org/abs/2605.03378v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.03378v1</guid>
      <description>Large Language Model (LLM) agents—augmented with tools, memory, and external knowledge—are increasingly vulnerable to *context-aware prompt injection*, where adversaries craft malicious inputs that adapt dynamically to the agent’s runtime context (e.g., tool outputs, memory state, or prior reasoning steps). Existing benchmarks and defenses assume context-insensitive settings, failing to capture real-world agent delegation and thus exhibiting poor robustness. To address this gap, we introduce **AgentLure**, the first benchmark for context-dependent agentic tasks, spanning four domains and eight attack vectors across diverse surfaces. We further propose **ARGUS**, a provenance-aware defense that constructs an *influence provenance graph* to trace how untrusted context propagates into decisions and verifies, before execution, whether each decision is justified solely by trustworthy evidence. Evaluated on AgentLure, ARGUS reduces attack success rate to **3.8%** while preserving **87.5% task utility**, significantly outperforming state-of-the-art defenses—even under adaptive white-box adversaries.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>injection</category>
      <category>agent</category>
      <category>security</category>
      <category>prompt</category>
      <category>llm</category>
    </item>
    <item>
      <title>SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents</title>
      <link>https://arxiv.org/abs/2605.03353v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.03353v1</guid>
      <description>Large language model (LLM) agents increasingly rely on standardized skill specifications like SKILL.md, yet suffer from severe cross-framework fragmentation: prompt formatting sensitivities cause up to 40% performance variance across platforms, while manual per-framework rewriting is unsustainable—and over one third of community skills harbor security vulnerabilities. We introduce **SkCC**, the first compiler framework for agent skills, centered on **SkIR**, a strongly-typed intermediate representation that decouples skill semantics from platform-specific formatting. Its four-phase pipeline (Parse → Type-Check → Secure-Analyze → Emit) reduces adaptation complexity from $O(m \times n)$ to $O(m + n)$. A compile-time **Anti-Skill Injection Analyzer** enforces security constraints *before deployment*, achieving a 94.8% proactive vulnerability detection rate. Evaluated on SkillsBench, SkCC-compiled skills boost pass rates by +12.2 pp (Claude Code: 21.1% → 33.3%) and +13.6 pp (Kimi CLI: 35.1% → 48.7%), cut runtime token usage by 10–46%, and compile in under 10 ms—enabling portable, secure, and efficient skill deployment across 6 major frameworks.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>injection</category>
      <category>agent</category>
      <category>security</category>
      <category>prompt</category>
      <category>llm</category>
    </item>
    <item>
      <title>When Context Hurts: The Crossover Effect of Knowledge Transfer on Multi-Agent Design Exploration</title>
      <link>https://arxiv.org/abs/2605.04361v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.04361v1</guid>
      <description>This paper challenges the widespread assumption that “more context is always better” in multi-agent orchestration. Across 10 software design tasks, 7 context-injection conditions, and 2,700+ LLM-based agent runs, we discover a robust **crossover effect**: identical knowledge artifacts (e.g., requirements docs) improve design exploration up to 20× on some tasks but degrade it by up to 46% on others—even irrelevant documents sometimes outperform all relevant ones. Crucially, the *direction* of this effect is predicted by a single, cheap-to-measure variable: baseline exploration without context (*r* = −0.82, *p* &lt; 0.001). Mechanistic probing reveals two convergence regimes—*natural* (data-prior-driven) responds to artifact-induced disruption, while *induced* (instruction-driven) does not. We conclude that context injection must be **conditional, not universal**, and advocate one no-context trial as a lightweight diagnostic to determine whether knowledge transfer will help or hinder a given task.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>prompt</category>
      <category>injection</category>
    </item>
    <item>
      <title>SWAN: Semantic Watermarking with Abstract Meaning Representation</title>
      <link>https://arxiv.org/abs/2605.04305v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.04305v1</guid>
      <description>We propose **SWAN**, a training-free semantic watermarking framework that embeds signatures directly into the *Abstract Meaning Representation (AMR)* graph of a sentence—rather than at the token or probability level. During embedding, an LLM is prompted to generate contextually coherent text strictly adhering to a watermarked AMR template; detection uses an off-the-shelf AMR parser followed by a lightweight one-proportion z-test on structural features (e.g., predicate-argument consistency). Evaluated on RealNews, SWAN achieves state-of-the-art AUC (0.982) on clean watermarked text and—critically—boosts robustness against meaning-preserving paraphrasing by up to **+13.9 percentage points in AUC**, outperforming all prior token-level methods. This demonstrates that anchoring watermarks in interpretable, semantics-grounded AMR structures enables simple, prompt-based, and highly robust text provenance verification.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>prompt</category>
      <category>injection</category>
    </item>
    <item>
      <title>Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction</title>
      <link>https://arxiv.org/abs/2605.04221v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.04221v1</guid>
      <description>Clinical named entity recognition (NER) from dental progress notes is hindered by extreme unstructuredness, domain specificity, and privacy constraints. We propose a locally deployable self-prompting framework enabling small language models (SLMs) to autonomously generate, verify, refine, and evaluate entity-specific prompts for multi-entity extraction. Evaluated on 1,200 annotated dental notes, candidate open-weight models underwent multi-prompt ensemble inference, followed by QLoRA-based supervised fine-tuning and direct preference optimization (DPO). Performance varied substantially across models—highlighting the inadequacy of generic benchmarks for clinical NER. After DPO, Qwen2.5-14B-Instruct achieved micro/macro F1 scores of 0.864/0.837, and Llama-3.1-8B-Instruct reached 0.806/0.797—outperforming baselines by &gt;8 points. This work demonstrates that automated prompt optimization combined with lightweight preference-based alignment enables scalable, privacy-preserving clinical information extraction using resource-efficient SLMs.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>extraction</category>
      <category>model</category>
    </item>
    <item>
      <title>Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligence</title>
      <link>https://arxiv.org/abs/2605.03847v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.03847v1</guid>
      <description>This paper introduces **Mechanical Conscience (MC)**, a novel mathematical framework for trajectory-level normative regulation of machine intelligence—especially in distributed collaborative intelligence (DCI) settings where emergent risk arises inherently from multi-agent interaction under uncertainty. Unlike action-level safety methods (e.g., constrained optimization or safe RL), MC operates on *behavioral trajectories*, defining a minimal supervisory filter that corrects baseline policies to reduce cumulative deviation from a normatively admissible region while explicitly accounting for epistemic uncertainty. We formalize interpretable governance signals—*conscience score*, *mechanical guilt*, and *resonant dependability*—and prove key theoretical properties: admissibility equivalence, existence of optimal regulation, and monotonic deviation reduction. Experiments demonstrate that MC-regulated agents maintain long-horizon normative acceptability where conventional controllers violate bounds, and that MC naturally suppresses interaction-induced emergent risk in multi-agent DCI deployments—establishing the first computationally grounded “ethical brake” for trustworthy AI.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>learning</category>
      <category>federated</category>
    </item>
    <item>
      <title>SAM-NER: Semantic Archetype Mediation for Zero-Shot Named Entity Recognition</title>
      <link>https://arxiv.org/abs/2605.03706v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.03706v1</guid>
      <description>Zero-shot Named Entity Recognition (ZS-NER) suffers from semantic misalignment under domain and schema shifts, as direct mapping from entity mentions to unseen fine-grained labels often induces systematic drift. To address this, we propose **SAM-NER**, a three-stage framework grounded in *Semantic Archetype Mediation*. It first discovers high-fidelity entity spans via cooperative extraction and consensus-based denoising; then projects entities into a compact, domain-invariant space of universal semantic archetypes (e.g., *Agent*, *Artifact*, *Event*) distilled from ontological abstractions; finally calibrates archetype-level predictions into target-domain types using definition-aligned, constrained inference with a frozen LLM. On the CrossNER benchmark, SAM-NER consistently outperforms strong ZS-NER baselines across all four domains (AI, Literature, Politics, Science), achieving average F1 gains of +3.2–5.7 points. Our approach establishes semantic archetypes as a stable, interpretable mediation layer—enabling robust, definition-aware zero-shot transfer without fine-tuning. Code is open-sourced at https://github.com/DMIRLAB-Group/SAM-NER.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>extraction</category>
      <category>model</category>
    </item>
    <item>
      <title>AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics</title>
      <link>https://arxiv.org/abs/2605.03652v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.03652v1</guid>
      <description>AniMatrix is a novel anime video generation model that abandons physics-based priors in favor of *artistic correctness*. Recognizing that anime deliberately violates physical realism through conventions like motion smears, impact frames, and chibi deformation—and lacks a unified “anime physics”—we introduce three core innovations: (1) A dual-channel conditioning framework combining a structured Production Knowledge System (encoding Style, Motion, Camera, VFX as controllable variables) with AniCaption for pixel-to-directorial-instruction inference; trainable tag encoding preserves categorical structure while frozen T5 handles narrative text, fused via cross-attention (fine-grained control) and AdaLN (global enforcement); (2) A style-motion-deformation curriculum that progressively transitions from physically plausible motion to full expressive anime articulation; and (3) Deformation-aware preference optimization guided by a domain-specific reward model to distinguish intentional artistry from pathological failure. In human evaluation by professional animators across five production dimensions, AniMatrix ranks first on four—most notably outperforming Seedance-Pro 1.0 by +0.70 (+22.4%) on Prompt Understanding and +0.55 (+16.9%) on Artistic Motion. Model weights and inference code will be publicly released.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>prompt</category>
      <category>injection</category>
    </item>
    <item>
      <title>Multi-Agent Strategic Games with LLMs</title>
      <link>https://arxiv.org/abs/2605.03604v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.03604v1</guid>
      <description>This paper pioneers the use of LLMs as *interpretable strategic agents* to test foundational mechanisms of conflict and cooperation in international relations. Using repeated security dilemma games extended along three theoretically critical dimensions—multipolarity, finite time horizons, and communication availability—we conduct scalable, transparent experiments across GPT-4, Claude-3, and Llama-3. Results show robust patterns: multipolarity consistently increases conflict; finite horizons trigger universal backward-induction unraveling; and communication reduces conflict by 52% through signaling and reciprocity. Crucially, the design grants access to both public messages and private reasoning traces, enabling direct linkage of choices to strategic logics (e.g., preemption, trust-building under uncertainty). The contribution is methodological: LLM-based experiments offer a replicable, high-resolution computational testbed for formal theories—bridging theoretical abstraction and behavioral realism without human subject constraints.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>agent</category>
      <category>llm</category>
      <category>security</category>
    </item>
    <item>
      <title>PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination</title>
      <link>https://arxiv.org/abs/2605.03571v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.03571v1</guid>
      <description>PatRe is the first benchmark modeling the *full-stage*, interactive patent examination process—spanning Office Action generation by examiners and rebuttal drafting by applicants. Built on 480 real-world cases from CNIPA and USPTO, it supports both oracle-based evaluation (using BLEU, BERTScore, and expert ratings) and retrieval-simulated evaluation reflecting practical constraints. Experiments across 12 LLMs reveal three key insights: (1) Strong task asymmetry—models generate Office Actions more accurately than rebuttals (+12.6 BERTScore), highlighting the greater difficulty of strategic, legally grounded counter-argumentation; (2) Competitive open-weight models—Qwen2-72B outperforms GPT-4-turbo on rebuttal quality (+2.1 expert score), underscoring the value of domain-adapted training; (3) Persistent legal reasoning gaps—all models achieve only 58.3 F1 on “inventive step justification”, exposing shallow understanding of patent law principles. PatRe reframes examination as multi-turn justification-and-response, and its code and dataset are publicly released to advance AI for intellectual property.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>extraction</category>
      <category>model</category>
    </item>
    <item>
      <title>Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models</title>
      <link>https://arxiv.org/abs/2605.03426v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.03426v1</guid>
      <description>Vision-Language Models (VLMs) hold great promise in privacy-critical domains, yet centralized training is infeasible due to strict data isolation. While federated learning (FL) enables decentralized training, extreme heterogeneity—in model architectures, hardware, and local data distributions—renders conventional parameter-aggregation methods ineffective and insecure. To address this, we propose **MoR**, a preference-based federated alignment framework that replaces parameter sharing with *collaborative reward modeling*. Each client trains a local reward model from private preference annotations (e.g., pairwise rankings), preserving data and architecture privacy. A server-side **Mixture-of-Rewards** mechanism with learnable routing dynamically fuses heterogeneous rewards per input and alignment objective. The base VLM is then optimized via **GRPO with KL regularization** against a reference model—requiring no architecture matching or parameter exchange. Experiments across diverse vision-language benchmarks show MoR consistently outperforms state-of-the-art federated alignment baselines in generalization (+4.2% CLIPScore) and cross-client adaptability, establishing a scalable, privacy-preserving paradigm for heterogeneous VLM alignment.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>learning</category>
      <category>federated</category>
    </item>
    <item>
      <title>What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis</title>
      <link>https://arxiv.org/abs/2605.03354v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.03354v1</guid>
      <description>This paper investigates the internal circuit mechanisms underlying LLM-based agent memory failures—silent yet critical breakdowns in extraction, retention, or cross-session retrieval. Using causal feature tracing across the Qwen-3 family (0.6B–14B) and two memory frameworks (mem0, A-MEM), we identify three key principles: (**1**) *Control precedes content*: routing circuits are causally active at 0.6B, while content circuits remain undetectable until 4B—creating a deceptive “competent routing but failed grounding” regime in small models; (**2**) *Shared late-layer grounding hub*: Write and Read operations converge on a pre-existing deep-layer substrate in the base model; only memory framing imposes functional directionality on it—and this hub transfers robustly across frameworks; (**3**) *Emergence ≠ steerability*: although content circuits emerge at 4B, reliable intervention requires ≥8B, indicating distinct scale thresholds for detection and control. Practically, the clean feature-space separation between control and content circuits enables unsupervised, operation-level failure localization with 76.2% accuracy—offering the first stage-aware diagnostic for silent agent-memory failures.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>extraction</category>
      <category>model</category>
    </item>
    <item>
      <title>SHIELD: A Diverse Clinical Note Dataset and Distilled Small Language Models for Enterprise-Scale De-identification</title>
      <link>https://arxiv.org/abs/2605.03301v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.03301v1</guid>
      <description>De-identification of clinical notes is critical for EHR secondary use, yet legacy benchmarks (e.g., i2b2) lack semantic and demographic diversity and are over a decade old. While LLMs achieve strong zero-shot PHI extraction, their enterprise adoption is hindered by cloud governance restrictions and computational costs. We introduce **SHIELD**, a diverse, human-validated dataset of 1,394 clinical notes with 10,505 gold-standard PHI spans across 9 categories—built via set-cover diversity sampling and human-in-the-loop adjudication. Distributional analysis confirms SHIELD occupies a distinct region in biomedical embedding and vocabulary space. We establish performance ceilings using four LLMs (two proprietary, two open-weight), then distill their capabilities into efficient Small Language Models (SLMs). Our best distilled DeBERTa v3 model achieves **micro-averaged span-level precision of 0.88 and recall of 0.86** on standard workstation hardware, matching teacher performance on five structured PHI types (DATE, DOCTOR, ID, PATIENT, PHONE). Cross-dataset evaluation reveals strong generalization to universal structured PHI but limited transfer to institution-specific entities—supporting a hybrid deployment strategy. The SHIELD dataset and distilled model are publicly released.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>extraction</category>
      <category>model</category>
    </item>
    <item>
      <title>Covariance-Aware Goodness for Scalable Forward-Forward Learning</title>
      <link>https://arxiv.org/abs/2605.04346v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.04346v1</guid>
      <description>The Forward-Forward (FF) algorithm avoids backpropagation and full activation storage, yet existing BP-free FF methods underperform significantly on complex vision benchmarks due to a structural limitation in the standard sum-of-squares goodness function—which discards critical second-order feature dependencies. We propose Covariance-Aware Goodness: (1) **Bi-axis Covariance Goodness (BiCovG)** incorporates inter-channel covariance modeling and nested multi-scale spatial correlation encoding—yielding a tractable, O(C) approximation to full covariance-aware scoring; (2) a **lightweight Logistic Fusion** module amplifies deeper-layer contributions; and (3) a **Feature Alignment Layer (FAL)** corrects representation misalignment at block boundaries. Our method doubles viable FF depth to 16 layers (e.g., VGG-16), achieving **73.01% top-1 accuracy on ImageNet-100** and **50.30% on Tiny-ImageNet** without any gradients. With Hybrid Goodness Blocks—enabling controlled, block-wise gradient propagation—we narrow the ImageNet-100 gap to just **3.6%** versus BP and **match BP performance on Tiny-ImageNet**, while reducing peak memory by **~50%**.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>extraction</category>
      <category>model</category>
    </item>
    <item>
      <title>DeFed-GMM-DaDiL: A Decentralized Federated Framework for Domain Adaptation</title>
      <link>https://arxiv.org/abs/2605.04324v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.04324v1</guid>
      <description>We propose **DeFed-GMM-DaDiL**, a fully decentralized federated framework for multi-source domain adaptation (MSDA) without any central server or raw data exchange. Each client represents its local dataset as a Gaussian Mixture Model (GMM), and the federation jointly learns a shared dictionary of *learnable GMM atoms* by computing *unlabeled Wasserstein barycenters*—enabling distribution alignment while preserving privacy. Crucially, our method remains stable even when the target domain lacks certain classes: it reconstructs missing-class semantics via atomic GMM composition and maintains consistent shared representations across clients. Experiments on Office-Home, VisDA-C, and DomainNet show DeFed-GMM-DaDiL achieves competitive accuracy—outperforming FedBN, FedDG, and centralized DaDiL variants—while operating in a serverless, communication-efficient topology. This work bridges decentralized optimization, optimal transport, and dictionary learning for practical, privacy-aware domain adaptation.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>learning</category>
      <category>federated</category>
    </item>
    <item>
      <title>Integrating Feature Correlation in Differential Privacy with Applications in DP-ERM</title>
      <link>https://arxiv.org/abs/2605.03945v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.03945v1</guid>
      <description>Standard differential privacy (DP) enforces uniform privacy protection across all features, ignoring real-world heterogeneity between sensitive and insensitive attributes—and crucially, their statistical correlations. To address this, we propose **CorrDP**, a relaxed DP definition that quantifies feature correlation via total variation distance ($\delta$) and allows calibrated privacy relaxation for insensitive features while preserving end-to-end privacy guarantees. We design CorrDP-compliant algorithms for differentially private empirical risk minimization (DP-ERM), incorporating distance-dependent gradient noise to achieve tighter utility bounds. When $\delta$ is unknown, we provide a data-driven estimator with provable privacy-utility trade-off preservation. Experiments on synthetic and real datasets (Adult, Credit, Bank) show CorrDP-based DP-ERM consistently outperforms standard DP—improving test accuracy by 3.2–7.8 percentage points under identical privacy budgets—especially when insensitive features exhibit non-negligible correlation with sensitive ones.</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>dp</category>
      <category>differential</category>
      <category>privacy</category>
    </item>
    <item>
      <title>TriBench-Ko: Evaluating LLM Risks in Judicial Workflows</title>
      <link>https://arxiv.org/abs/2605.03792v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.03792v1</guid>
      <description>Large language models (LLMs) are increasingly deployed in legal settings, yet existing benchmarks focus on proxy tasks (e.g., bar exam simulation) and overlook real-world judicial risks. To address this gap, we introduce **TriBench-Ko**, the first Korean benchmark explicitly designed to evaluate deployment risks of LLMs under *verified judicial task requirements*. It covers four core tasks—jurisprudence summarization, precedent retrieval, legal issue extraction, and evidence analysis—and jointly assesses both task performance *and* four critical risk categories: inaccuracy (hallucination, omission, statutory misapplication), biases (demographic, overcompliance), inconsistencies (prompt sensitivity, non-determinism), and adjudicative overreach. Each item is grounded in authentic Korean judicial decisions. Our evaluation across 12 contemporary LLMs reveals severe shortcomings: precedent retrieval fails dramatically (avg. accuracy: 38.2%), critical legal information is omitted in 61.7% of cases, and models frequently overreach by inferring unsupported facts or conclusions. We conclude that LLM outputs in judicial contexts require mandatory human review—especially for evidence and issue analysis. Dataset and code: https://github.com/holi-lab/TriBench-Ko</description>
      <pubDate>Tue, 05 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>extraction</category>
      <category>model</category>
    </item>
    <item>
      <title>EvoPoC: Automated Exploit Synthesis for DeFi Smart Contracts via Hierarchical Knowledge Graphs</title>
      <link>https://arxiv.org/abs/2605.02868v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.02868v1</guid>
      <description>Decentralized Finance (DeFi) smart contract vulnerabilities cause billions in annual losses, yet verifying exploitability—beyond mere detection—remains a critical bottleneck due to the prohibitive cost of manual PoC construction. EvoPoC addresses this by reframing exploit synthesis as a *structured reasoning problem*, grounded in protocol semantics, root-cause analysis, and exploit primitives. Its core innovation is a *Hierarchical Knowledge Graph* (HKG) that serves as structured memory for LLM-guided multi-hop reasoning. To ensure real-world viability, EvoPoC employs a two-stage validation: SMT-based path reachability checking and asset-level state simulation for profit realizability. Evaluated on 88 real-world DeFi attacks and 72 audited projects (2,573 contracts), EvoPoC achieves 98% detection recall, 0.9 F1-score, and a 96.6% exploit success rate (ESR), reproducing 85 historical exploits and recovering &gt;\$116.2M. It outperforms state-of-the-art fuzzers (Verite, ItyFuzz) by up to 5× in ESR and 300× in recoverable value, and surpasses the LLM-based A1 by 2× and 8.5×, respectively. In bug bounty practice, it identified 16 confirmed 0-days, securing &gt;\$70.6M and earning \$2,900.</description>
      <pubDate>Mon, 04 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>llm</category>
      <category>security</category>
    </item>
    <item>
      <title>Autonomous LLM Agent Worms: Cross-Platform Propagation, Automated Discovery and Temporal Re-Entry Defense</title>
      <link>https://arxiv.org/abs/2605.02812v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.02812v1</guid>
      <description>This paper presents the first systematic study of *autonomous LLM agent worms*—a novel class of persistent, self-propagating threats arising from long-running agents with file-backed memory, scheduled reloading, and inter-agent messaging. We introduce **SSCGV**, an automated source-code graph analyzer that traces data flow from file I/O to LLM context injection points and ranks persistence carriers by semantic risk; and **SRPO**, a summary-resilient payload optimizer that ensures worm payloads survive multi-hop LLM-mediated paraphrasing and summarization. Evaluated across AutoGen, LangChain, and Semantic Kernel, our attacks achieve zero-click autonomous propagation, 3-hop cross-platform transmission without platform-specific adaptation, inter-agent privilege escalation, and stealthy data exfiltration. Key empirical insights: user-prompt carriers yield higher attack compliance than system-prompt carriers, and *read operations—not write or exec—are the dominant integrity threat vector*. To defend against such worms, we propose **RTW-A**, a formally verified defense framework grounded in the *No Persistent Worm Propagation Theorem*. RTW-A eliminates persistence-reentry-action chains via four lightweight mechanisms: (1) blocking write-before-exposed-read re-entry, (2) sealing static configurations, (3) typed memory promotion to filter untrusted summaries, and (4) capability attenuation after external reads—all while preserving normal agent workflows.</description>
      <pubDate>Mon, 04 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>prompt</category>
      <category>injection</category>
    </item>
    <item>
      <title>VertMark: A Unified Training-Free Robust Watermarking Framework for Vertical Domain Pre-trained Language Models</title>
      <link>https://arxiv.org/abs/2605.02557v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.02557v1</guid>
      <description>VertMark is the first training-free, unified, and robust watermarking framework for copyright verification of vertical-domain pre-trained language models (VPLMs) in medicine, finance, and law. It embeds ownership watermarks by establishing hidden semantic equivalence between low-frequency trigger tokens and high-frequency domain-specific words—via a gradient-free parameter replacement strategy in the embedding layer—eliminating the need for retraining or fine-tuning. Experiments across 12 downstream tasks (text understanding &amp; generation) show VertMark achieves &gt;98.7% watermark detection accuracy with &lt;0.3% performance degradation. Crucially, it maintains &gt;92% robustness against aggressive model modifications including 50% pruning and INT8 quantization. VertMark thus provides a lightweight, plug-and-play, cross-domain solution for VPLM intellectual property protection.</description>
      <pubDate>Mon, 04 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>inference</category>
      <category>security</category>
    </item>
    <item>
      <title>Differentially Private Runtime Monitoring</title>
      <link>https://arxiv.org/abs/2605.02391v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.02391v1</guid>
      <description>Modern stream-based runtime monitors collect fine-grained behavioral statistics, posing serious privacy risks in sensitive contexts (e.g., public transit). While differential privacy (DP) offers strong theoretical guarantees, its integration into temporal monitoring is hindered by *repeated influence*: a single input can affect multiple outputs over time via temporal operators (e.g., sliding windows, cumulative sums), causing privacy budget blowup. We propose the first automated DP enforcement framework for stream monitoring specifications. It statically analyzes temporal dependencies in the specification to identify *privacy-critical output sets*, strategically injects calibrated noise at aggregation-heavy syntactic positions, and applies tree-based mechanisms (e.g., Binary Tree Mechanism) to bound cumulative privacy loss as $O(\log T)$ instead of $O(T)$. Evaluated on real-world public transportation data, our approach achieves only **6.2% mean relative error** under $\varepsilon = 1.0$, outperforming naive Laplace baselines by 57%, while sustaining &gt;120k events/sec throughput—demonstrating practical utility, scalability, and formal privacy compliance.</description>
      <pubDate>Mon, 04 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>privacy</category>
      <category>differential</category>
    </item>
    <item>
      <title>Fight Poison with Poison: Enhancing Robustness in Few-shot Machine-Generated Text Detection with Adversarial Training</title>
      <link>https://arxiv.org/abs/2605.02374v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.02374v1</guid>
      <description>Machine-generated text (MGT) detection is vital for information integrity, yet few-shot detectors suffer from poor generalization and fragility against humanizing adversarial attacks—especially under output-only black-box settings. To address this, we propose **REACT**, an adversarial training framework that co-evolves a **RAG-guided humanization attacker** and a **contrastive few-shot detector**. The attacker retrieves semantically aligned human-written passages via RAG to craft highly plausible adversarial examples; the detector learns robust representations via contrastive learning on scarce labels, explicitly hardened against such attacks. Alternating optimization enables mutual adaptation. Experiments across 4 datasets, 4 shot sizes (1–8), and 3 random seeds show REACT achieves **+4.95 average F1 over 8 SOTA baselines**, and reduces **average attack success rate by 3.66 percentage points** under 4 strong attacks—including GPT-4 rewriting and style transfer. REACT is the first to integrate RAG into adversarial text generation for realistic, semantics-aware evasion, yielding both higher accuracy and unprecedented robustness in low-data regimes.</description>
      <pubDate>Mon, 04 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>learning</category>
      <category>machine</category>
      <category>adversarial</category>
    </item>
    <item>
      <title>Privacy Preserving Machine Learning Workflow: from Anonymization to Personalized Differential Privacy Budgets in Federated Learning</title>
      <link>https://arxiv.org/abs/2605.02372v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.02372v1</guid>
      <description>This paper proposes a comprehensive privacy-preserving federated learning (FL) workflow for sensitive tabular data, integrating anonymization and adaptive differential privacy (DP). We formally define *client drift*—a statistical deviation of local data distributions from the global prior—and introduce a Wasserstein-based detection method to mitigate poisoning attacks. Crucially, we design a personalized DP budget allocation scheme: each client’s privacy budget ε_i is dynamically assigned based on a quantifiable re-identification risk metric (RRI), reflecting data uniqueness and exposure. Evaluated on the MIMIC-III medical dataset, our approach achieves **23.7% lower MAE** and **19.2% lower RMSE** compared to standard FL with fixed global ε (ε = 1.0), while maintaining rigorous (ε, δ)-DP guarantees. This demonstrates that risk-aware personalization significantly improves model utility without compromising privacy compliance.</description>
      <pubDate>Mon, 04 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>learning</category>
      <category>differential</category>
      <category>machine</category>
      <category>federated</category>
      <category>data</category>
    </item>
    <item>
      <title>APIOT: Autonomous Vulnerability Management Across Bare-Metal Industrial OT Networks</title>
      <link>https://arxiv.org/abs/2605.02346v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.02346v1</guid>
      <description>APIOT is the first LLM-based framework enabling fully autonomous end-to-end vulnerability management—spanning discovery, exploitation, patching, and verification—on bare-metal industrial OT devices (e.g., microcontrollers running Modbus/TCP or CoAP under Zephyr RTOS). Unlike prior autonomous pentesting systems targeting Linux/web stacks, APIOT operates without shells or filesystems, requiring novel protocol-aware action spaces and a runtime governance layer (“Overseer”) to prevent agent degeneration (e.g., loops, missed crash validation). Evaluated across 290 runs—including 5 frontier LLMs, 3 IIoT topologies, and impaired network conditions—APIOT achieves a 90.0% mission success rate on the full cycle. Crucially, removing the Overseer drops success to 38.2%, confirming its engineering necessity. These results imply that attacker expertise is no longer the limiting factor for bare-metal OT exploitation, and defenders must now assume adversaries capable of autonomous, LLM-driven firmware-level attack-remediation cycles.</description>
      <pubDate>Mon, 04 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>llm</category>
      <category>security</category>
    </item>
    <item>
      <title>Optimal Privacy-Utility Trade-Offs in LDP: Functional and Geometric Perspectives</title>
      <link>https://arxiv.org/abs/2605.02319v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.02319v1</guid>
      <description>This paper establishes a unified theoretical framework for characterizing the optimal privacy–utility trade-off (PUT) and optimal LDP channels in local differential privacy. We identify fundamental functional properties—data processing inequality, direct-sum quasi-convexity, concavity, and symmetry invariance—of Bayesian and minimax risks over LDP channels, enabling substantial domain reduction for PUT optimization. Geometrically, we prove a one-to-one correspondence between maximal LDP channels under the Blackwell order and a finite-dimensional polytope, yielding an exact geometric characterization that renders optimal PUT computation tractable via vertex enumeration or linear programming. When the statistical task admits a transitive group action (e.g., label symmetry), we derive closed-form analytic expressions for the optimal PUT—bypassing numerical optimization entirely. Our framework extends beyond risk minimization to maximize information-theoretic quantities (e.g., mutual information, $f$-divergences, Fisher information) over LDP channels. We recover and strengthen known results, and obtain first-time exact solutions for previously open problems—including symmetric multi-class frequency estimation and hypothesis testing.</description>
      <pubDate>Mon, 04 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>privacy</category>
      <category>differential</category>
    </item>
    <item>
      <title>Post-Quantum Cryptography Migration in Australian Real-Time Payment Infrastructure: A Monte Carlo Simulation Study of the New Payments Platform</title>
      <link>https://arxiv.org/abs/2605.02276v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.02276v1</guid>
      <description>This study presents the first large-scale Monte Carlo simulation of NIST PQC signature standards (ML-DSA, Falcon, SLH-DSA/SPHINCS+) on Australia’s real-time New Payments Platform (NPP), which processes 5.2M transactions/day under a strict 2000-ms SLA. Integrating M/M/c queue modeling, GEV tail-bound analysis, and HNDL actuarial risk assessment across 1,000 seasonally varied days (80M events), we validate implementations on a multi-cloud, multi-architecture testbed (Intel/AMD/ARM). ML-DSA and Falcon achieve 100% SLA compliance with worst-case p99 overhead of just 1.57 ms; Falcon-512 is the only NIST standard fitting SWIFT MT’s 2048-byte limit (1563 bytes combined). SPHINCS+ causes critical HSM queue saturation (ρ = 1.8855), yielding 0% SLA compliance and acting as a DoS amplification surface (~9,428× ECDSA utilization). The HNDL model estimates 9.56 billion NPP records at risk under CRQC-2030; migration costs peak at USD 21.4M in 2026, falling to USD 1.5M/year by 2028.</description>
      <pubDate>Mon, 04 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>crypto</category>
    </item>
    <item>
      <title>On the Privacy of LLMs: An Ablation Study</title>
      <link>https://arxiv.org/abs/2605.02255v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.02255v1</guid>
      <description>This paper presents a systematic ablation study on privacy risks in large language models (LLMs), addressing the gap between isolated attack analyses and real-world system complexity. We introduce a unified threat model and notation, reproduce four representative privacy attacks—Membership Inference (MIA), Attribute Inference (AIA), Data Extraction (DEA), and Backdoor Attacks (BA)—and evaluate their sensitivity to key factors: model architecture/scale (1B–70B), dataset characteristics (sensitivity, diversity, duplication), and retrieval-augmentation configuration (top-k, chunking, re-ranking). Results show stark contrasts: mask-based MIA yields strong, robust signals (AUC &gt; 0.85 across settings); BA achieves consistently high success (92–98%) due to trigger dependency; while AIA and DEA remain less accurate (&lt;45% avg.) yet critically dangerous as they target sensitive personal attributes. Crucially, retrieval integration amplifies AIA/DEA risk (+17.3%) but dampens some MIA efficacy (−9.1% AUC), underscoring that LLM privacy is inherently context-dependent and driven by holistic design choices—not isolated components.</description>
      <pubDate>Mon, 04 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>model</category>
      <category>extraction</category>
      <category>inference</category>
      <category>membership</category>
    </item>
    <item>
      <title>When Alignment Isn't Enough: Response-Path Attacks on LLM Agents</title>
      <link>https://arxiv.org/abs/2605.02187v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.02187v1</guid>
      <description>This paper identifies a critical integrity gap in Bring-Your-Own-Key (BYOK) LLM agent architectures: malicious third-party relays can tamper with *already-aligned* LLM responses *after* generation but *before* agent execution—a threat we formalize as **post-alignment tampering**. We instantiate it as the **Relay Tampering Attack (RTA)**, which performs stealthy, multi-round strategic rewriting, minimal security-critical edits (e.g., single-token instruction injection), and “stealth restoration” by resubmitting tampered outputs to the upstream LLM for semantic re-validation. Across AgentDojo and ASB benchmarks with six LLMs, RTA achieves up to **99.1% attack success**, outperforming prompt-injection baselines with only modest overhead (&lt;8% latency). Case studies on OpenClaw and Claude Code confirm real-world feasibility, while evaluations of four defense categories (input filtering, response signing, runtime monitoring, sandboxing) show *none fully prevent RTA*. We propose a lightweight **time-based integrity detection** mechanism that detects statistical anomalies in response timing—reducing RTA success to &lt;5.2% while preserving &gt;99.8% agent utility.</description>
      <pubDate>Mon, 04 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>prompt</category>
      <category>injection</category>
      <category>llm</category>
      <category>security</category>
      <category>agent</category>
    </item>
    <item>
      <title>Adversarial Update-Based Federated Unlearning for Poisoned Model Recovery</title>
      <link>https://arxiv.org/abs/2605.02110v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.02110v1</guid>
      <description>Federated learning (FL) is highly vulnerable to poisoning attacks, where malicious clients inject harmful model updates that persistently degrade global model performance—even after their removal. Retraining from scratch recovers robustness but incurs prohibitive communication and computation costs, while existing unlearning methods fail to simultaneously achieve high effectiveness and efficiency. We propose **Federated Adversarial Unlearning (FAUN)**, a lightweight framework that retains only a short window of malicious updates and employs adversarial optimization on a compact proxy dataset to synthesize targeted “counter-updates” that neutralize malicious parameter directions. Applying just 3–5 rounds of such updates—followed by brief benign fine-tuning—enables rapid, stable model recovery. Experiments on CIFAR-10, MNIST, and FEMNIST show FAUN matches retraining-level accuracy (within 0.8% error gap) while reducing total communication rounds by 62–79%; attack success rates drop to ≤0.3%, outperforming state-of-the-art unlearning baselines. FAUN is the first method to harness adversarial optimization for efficient, high-fidelity poisoned model recovery in FL.</description>
      <pubDate>Mon, 04 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>learning</category>
      <category>poisoning</category>
      <category>federated</category>
      <category>model</category>
    </item>
    <item>
      <title>OphMAE: Bridging Volumetric and Planar Imaging with a Foundation Model for Adaptive Ophthalmological Diagnosis</title>
      <link>https://arxiv.org/abs/2605.02714v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.02714v1</guid>
      <description>OphMAE is a novel ophthalmic foundation model that bridges volumetric 3D OCT and planar 2D en face OCT through a cross-modal masked autoencoder architecture and adaptive inference mechanism. Pre-trained on 183,875 paired OCT images from 32,765 patients, it achieves state-of-the-art performance across 17 diagnostic tasks: 96.9% AUC for AMD and 97.2% for DME—surpassing all prior single- and multi-modal models. Critically, OphMAE maintains strong accuracy (93.7% AUC for AMD) using *only 2D inputs*, enabling deployment where 3D hardware is unavailable. It also demonstrates exceptional data efficiency, retaining 95.7% AUC with as few as 500 labeled samples. This work establishes a scalable, adaptive framework for real-world ophthalmic AI.</description>
      <pubDate>Mon, 04 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>extraction</category>
      <category>model</category>
    </item>
    <item>
      <title>Hybrid Inspection and Task-Based Access Control in Zero-Trust Agentic AI</title>
      <link>https://arxiv.org/abs/2605.02682v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.02682v1</guid>
      <description>This paper introduces Continuous Agent Semantic Authorization (CASA), a zero-trust framework for securing LLM-driven agents in multi-turn, collaborative settings. We propose a hybrid runtime enforcement model combining five deterministic controls (e.g., call signature validation, parameter sanitization, response integrity checks) with a two-stage semantic inspection layer: (i) task extraction from multi-turn conversations at the interception layer, and (ii) task-tool semantic matching at the authorization server. To enable rigorous evaluation, we extend the ASTRA dataset with novel multi-turn conversation-tool pairs annotated for relevance to underlying tasks. Our experiments—the first empirical study of Task-Based Access Control (TBAC) under multi-turn interactions—demonstrate that CASA reduces false positives in unauthorized tool invocation by 62.3% and achieves &lt;1.8% false negatives for irrelevant tool calls.</description>
      <pubDate>Mon, 04 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>llm</category>
      <category>security</category>
      <category>agent</category>
      <category>extraction</category>
      <category>model</category>
    </item>
    <item>
      <title>Shadow-Loom: Causal Reasoning over Graphical World Model of Narratives</title>
      <link>https://arxiv.org/abs/2605.02475v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2605.02475v1</guid>
      <description>Shadow-Loom is an open-source research framework that transforms narratives into versioned graphical world models—structured, typed graphs encoding entities, events, temporal relations, and causal dependencies. It introduces two complementary reasoning engines grounded in formal semantics: (1) a *causal physics engine* implementing Pearl’s ladder of causation (via do-calculus) and a recently proposed counterfactual calculus over Ancestral Multi-World Networks; and (2) a *narrative physics engine* that scores the same graph against four reader-centered structural states—mystery, dramatic irony, suspense, and surprise—formalizing suspense via structural-affect principles (e.g., path uncertainty under known outcomes). Crucially, LLMs are restricted to boundary tasks only (extraction, rendering, audit); all causal identification, intervention, and counterfactual reasoning occur in deterministic, type-checked code over the graph. Released as a reproducible research artefact—not a benchmarked NLP model—it provides full open-source access to code, fixtures, and pipelines for computational narrative analysis.</description>
      <pubDate>Mon, 04 May 2026 00:00:00 -0000</pubDate>
      <category>arXiv</category>
      <category>extraction</category>
      <category>model</category>
    </item>
  </channel>
</rss>