✨From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images
📝 Summary:
MLLMs struggle to match human cognitive perception of image properties such as memorability and aesthetics. CogIP-Bench evaluates this gap and shows that post-training significantly improves alignment. This enhances human-like perception and improves creative AI tasks.
🔹 Publication Date: Published on Nov 27
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.22805
• PDF: https://arxiv.org/pdf/2511.22805
• Project Page: https://follen-cry.github.io/MLLM-Cognition-project-page/
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#MLLM #CognitiveAI #ImagePerception #AIAlignment #AIResearch
✨Steerability of Instrumental-Convergence Tendencies in LLMs
📝 Summary:
This research investigates AI system steerability, noting a safety-security dilemma. It demonstrates that a short anti-instrumental prompt suffix dramatically reduces unwanted instrumental behaviors, like self-replication, in large language models. For Qwen3-30B, this reduced the convergence rate...
🔹 Publication Date: Published on Jan 4
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.01584
• PDF: https://arxiv.org/pdf/2601.01584
• Github: https://github.com/j-hoscilowicz/instrumental_steering/
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#AISafety #LLMs #AISteering #PromptEngineering #AIAlignment
✨Beyond Binary Preference: Aligning Diffusion Models to Fine-grained Criteria by Decoupling Attributes
📝 Summary:
Current diffusion model alignment struggles with complex, fine-grained human expertise due to simplified preferences. This paper proposes a framework with hierarchical criteria and Complex Preference Optimization (CPO), maximizing positive and minimizing negative attributes to improve generation qu...
🔹 Publication Date: Published on Jan 7
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.04300
• PDF: https://arxiv.org/pdf/2601.04300
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#DiffusionModels #AIAlignment #MachineLearning #GenerativeAI #PreferenceLearning
✨Real-Time Aligned Reward Model beyond Semantics
📝 Summary:
RLHF faces reward overoptimization from reward model misalignment. R2M introduces a new framework that uses real-time policy feedback to dynamically adapt the reward model. This improves alignment by responding to continuous policy distribution shifts beyond just semantics.
🔹 Publication Date: Published on Jan 30
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.22664
• PDF: https://arxiv.org/pdf/2601.22664
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#ReinforcementLearning #AI #MachineLearning #RewardModels #AIAlignment
✨THINKSAFE: Self-Generated Safety Alignment for Reasoning Models
📝 Summary:
ThinkSafe is a self-aligned framework that enhances safety in large reasoning models. It uses lightweight refusal steering and fine-tuning on self-generated responses to preserve reasoning performance and reduce computational costs. ThinkSafe significantly improves safety without degrading native...
🔹 Publication Date: Published on Jan 30
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.23143
• PDF: https://arxiv.org/pdf/2601.23143
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#AISafety #LLMs #AIAlignment #MachineLearning #DeepLearning
✨SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization
📝 Summary:
SLIME is a new objective for aligning large language models, addressing 'unlearning' and 'formatting collapse' issues in prior methods. It maximizes preferred response likelihood, stabilizes rejected token probabilities, and uses dual-margin constraints, achieving superior performance and stable ...
🔹 Publication Date: Published on Feb 2
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.02383
• PDF: https://arxiv.org/pdf/2602.02383
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#LLM #AIAlignment #MachineLearning #NLP #DeepLearning
✨The Truthfulness Spectrum Hypothesis
📝 Summary:
This paper proposes the truthfulness spectrum hypothesis: LLMs contain truth directions ranging from domain-general to domain-specific. While general directions exist, domain-specific ones steer more effectively, with post-training reshaping this geometry to influence behaviors like sycophancy.
🔹 Publication Date: Published on Feb 23
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.20273
• PDF: https://arxiv.org/pdf/2602.20273
• Github: https://github.com/zfying/truth_spec
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#LLMs #AIResearch #AIAlignment #NLP #Truthfulness
✨Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
📝 Summary:
MOSAIC is a framework aligning agentic models for safe multi-step tool use, employing explicit safety reasoning and refusal. It significantly reduces harmful actions, increases refusal for unsafe tasks, cuts privacy leakage, and preserves benign performance.
🔹 Publication Date: Published on Mar 3
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.03205
• PDF: https://arxiv.org/pdf/2603.03205
• Project Page: https://aradhye2002.github.io/mosaic-agent-safety/
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#AISafety #AIAgents #ResponsibleAI #LLMs #AIAlignment
✨Alignment Makes Language Models Normative, Not Descriptive
📝 Summary:
Aligned language models excel at normative, rule-based behavior prediction but struggle with complex descriptive human strategic interactions. Base models predict real human choices in these games better. This reveals a trade-off in model optimization.
🔹 Publication Date: Published on Mar 17
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.17218
• PDF: https://arxiv.org/pdf/2603.17218
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#LLM #AIAlignment #NormativeAI #GameTheory #AIBehavior
✨Internal Safety Collapse in Frontier Large Language Models
📝 Summary:
Frontier LLMs suffer Internal Safety Collapse, continuously generating harmful content under specific task conditions, even for benign tasks. A new framework triggers this vulnerability, yielding 95% safety failure rates and revealing inherent unsafe capabilities despite alignment efforts.
🔹 Publication Date: Published on Mar 4
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.23509
• PDF: https://arxiv.org/pdf/2603.23509
• Project Page: https://wuyoscar.github.io/ISC-Bench
• Github: https://github.com/wuyoscar/ISC-Bench
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#AISafety #LLM #AIAlignment #MachineLearning #AIResearch