Interpreting and Steering LLM Agents for Social Simulations

We compare prompting, sparse autoencoders, and linear probes for interpreting and controlling LLM agent behavior in social science simulations. SAE-based steering outperforms prompting, offering fine-grained, predictable control over preferences and capabilities.

February 15, 2026 · Me

Sybil-Resilient Preference Aggregation for RLHF

We present the first formal framework for defending RLHF preference aggregation against sybil attacks. We prove standard Bradley-Terry is not sybil-safe, propose SQ-BT as a defense, and characterize a tight, fundamental safety-liveness tradeoff.

December 20, 2025 · Me

Probing Circuit Robustness: How Syntactic Form Shapes Neural Circuit Activation in LLMs

We investigate whether neural circuits in LLMs remain stable under semantic-preserving paraphrases. Our central finding: syntactic form, not semantic content, is the primary determinant of circuit activation.

December 15, 2025 · Me

Theorizing with LLMs

This is the first paper I ever contributed to; we did this exactly a year ago.

September 6, 2025 · Me