Interpreting and Steering LLM Agents for Social Simulations
We compare prompting, sparse autoencoders, and linear probes for interpreting and controlling LLM agent behavior in social science simulations. SAE-based steering outperforms prompting, offering fine-grained, predictable control over preferences and capabilities.
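SAE-based steering typically works by adding a scaled decoder direction for an interpretable feature into the model's residual stream. The abstract does not specify the implementation, so this is only a minimal sketch with hypothetical names (`steer_activations`, `sae_decoder`) using toy numpy arrays in place of real model activations:

```python
import numpy as np

def steer_activations(hidden, sae_decoder, feature_idx, alpha):
    """Add a scaled SAE feature direction to residual-stream activations.

    hidden:      (seq_len, d_model) activations at some layer
    sae_decoder: (n_features, d_model) SAE decoder weight matrix
    feature_idx: index of the interpretable feature to steer with
    alpha:       steering strength (positive amplifies, negative suppresses)
    """
    direction = sae_decoder[feature_idx]
    direction = direction / np.linalg.norm(direction)  # unit-norm direction
    return hidden + alpha * direction

# Toy example: 4 tokens, 8-dim residual stream, 16 SAE features.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))
decoder = rng.normal(size=(16, 8))
steered = steer_activations(hidden, decoder, feature_idx=3, alpha=5.0)
```

The scalar `alpha` is what makes this fine-grained relative to prompting: steering strength varies continuously rather than through discrete instruction changes.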
Sybil-Resilient Preference Aggregation for RLHF
We present the first formal framework for defending RLHF preference aggregation against sybil attacks. We prove standard Bradley-Terry is not sybil-safe, propose SQ-BT as a defense, and characterize a tight, fundamental safety-liveness tradeoff.
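The sybil vulnerability of standard Bradley-Terry can be seen directly: because every comparison is weighted equally, an attacker who replays the same preference many times can flip the aggregate ranking. SQ-BT's specific defense is defined in the paper; the sketch below only demonstrates the attack, using the standard MM update for the Bradley-Terry MLE (the `bt_strengths` helper and the toy vote counts are illustrative, not from the paper):

```python
def bt_strengths(wins, iters=100):
    """Bradley-Terry MLE via the classic MM update (Hunter, 2004).
    wins[i][j] = number of comparisons where item i beat item j."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new.append(w_i / denom if denom else p[i])
        s = sum(new)
        p = [x / s for x in new]
    return p

# Honest electorate: item 0 beats item 1 in 6 of 10 comparisons.
honest = [[0, 6], [4, 0]]
# Sybil attack: one attacker replays "1 beats 0" twenty extra times,
# reversing the aggregate preference.
attacked = [[0, 6], [24, 0]]
```

Here `bt_strengths(honest)` ranks item 0 first, while `bt_strengths(attacked)` ranks item 1 first, even though only one attacker's duplicated votes were added.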
Probing Circuit Robustness: How Syntactic Form Shapes Neural Circuit Activation in LLMs
We investigate whether neural circuits in LLMs remain stable under semantic-preserving paraphrases. Our central finding: syntactic form, not semantic content, is the primary determinant of circuit activation.
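Circuit stability under paraphrase has to be quantified somehow; one simple proxy (an assumption here, not necessarily the paper's metric) is the overlap between the sets of most-active components two inputs recruit. A minimal sketch with hypothetical helpers `top_components` and `circuit_overlap`:

```python
import numpy as np

def top_components(activations, k):
    """Indices of the k most strongly active components (a crude 'circuit')."""
    return set(np.argsort(np.abs(activations))[-k:].tolist())

def circuit_overlap(a, b, k=10):
    """Jaccard overlap between the top-k components activated by two inputs.
    1.0 means identical circuits; values near 0 mean disjoint circuits."""
    ca, cb = top_components(a, k), top_components(b, k)
    return len(ca & cb) / len(ca | cb)

# Toy activation vectors for a sentence and two of its paraphrases.
rng = np.random.default_rng(1)
original = rng.normal(size=100)
same_syntax = original + 0.01 * rng.normal(size=100)   # minor lexical change
diff_syntax = rng.normal(size=100)                      # restructured sentence
```

Under the paper's central finding, a paraphrase preserving syntactic form should score high overlap with the original, while a semantically equivalent but syntactically restructured paraphrase should score lower.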
Theorizing with LLMs
This is the first paper I ever contributed to; we did this exactly a year ago.