Interpreting and Steering LLM Agents for Social Simulations
We compare prompting, sparse autoencoders, and linear probes for interpreting and controlling LLM agent behavior in social science simulations. SAE-based steering outperforms prompting, offering fine-grained, predictable control over preferences and capabilities.
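SAE-based steering typically works by adding a scaled decoder direction for an interpretable feature into the model's residual stream. The abstract does not specify the implementation, so this is only a minimal sketch with hypothetical names (`steer_activations`, `sae_decoder`) using toy numpy arrays in place of real model activations:

```python
import numpy as np

def steer_activations(hidden, sae_decoder, feature_idx, alpha):
    """Add a scaled SAE feature direction to residual-stream activations.

    hidden:      (seq_len, d_model) activations at some layer
    sae_decoder: (n_features, d_model) SAE decoder weight matrix
    feature_idx: index of the interpretable feature to steer with
    alpha:       steering strength (positive amplifies, negative suppresses)
    """
    direction = sae_decoder[feature_idx]
    direction = direction / np.linalg.norm(direction)  # unit-norm direction
    return hidden + alpha * direction

# Toy example: 4 tokens, 8-dim residual stream, 16 SAE features.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))
decoder = rng.normal(size=(16, 8))
steered = steer_activations(hidden, decoder, feature_idx=3, alpha=5.0)
```

The scalar `alpha` is what makes this fine-grained relative to prompting: steering strength varies continuously rather than through discrete instruction changes.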
Sybil-Resilient Preference Aggregation for RLHF
We present the first formal framework for defending RLHF preference aggregation against sybil attacks. We prove standard Bradley-Terry is not sybil-safe, propose SQ-BT as a defense, and characterize a tight, fundamental safety-liveness tradeoff.
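The sybil vulnerability of standard Bradley-Terry can be seen directly: because every comparison is weighted equally, an attacker who replays the same preference many times can flip the aggregate ranking. SQ-BT's specific defense is defined in the paper; the sketch below only demonstrates the attack, using the standard MM update for the Bradley-Terry MLE (the `bt_strengths` helper and the toy vote counts are illustrative, not from the paper):

```python
def bt_strengths(wins, iters=100):
    """Bradley-Terry MLE via the classic MM update (Hunter, 2004).
    wins[i][j] = number of comparisons where item i beat item j."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new.append(w_i / denom if denom else p[i])
        s = sum(new)
        p = [x / s for x in new]
    return p

# Honest electorate: item 0 beats item 1 in 6 of 10 comparisons.
honest = [[0, 6], [4, 0]]
# Sybil attack: one attacker replays "1 beats 0" twenty extra times,
# reversing the aggregate preference.
attacked = [[0, 6], [24, 0]]
```

Here `bt_strengths(honest)` ranks item 0 first, while `bt_strengths(attacked)` ranks item 1 first, even though only one attacker's duplicated votes were added.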
Probing Circuit Robustness: How Syntactic Form Shapes Neural Circuit Activation in LLMs
We investigate whether neural circuits in LLMs remain stable under semantic-preserving paraphrases. Our central finding: syntactic form, not semantic content, is the primary determinant of circuit activation.
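Circuit stability under paraphrase has to be quantified somehow; one simple proxy (an assumption here, not necessarily the paper's metric) is the overlap between the sets of most-active components two inputs recruit. A minimal sketch with hypothetical helpers `top_components` and `circuit_overlap`:

```python
import numpy as np

def top_components(activations, k):
    """Indices of the k most strongly active components (a crude 'circuit')."""
    return set(np.argsort(np.abs(activations))[-k:].tolist())

def circuit_overlap(a, b, k=10):
    """Jaccard overlap between the top-k components activated by two inputs.
    1.0 means identical circuits; values near 0 mean disjoint circuits."""
    ca, cb = top_components(a, k), top_components(b, k)
    return len(ca & cb) / len(ca | cb)

# Toy activation vectors for a sentence and two of its paraphrases.
rng = np.random.default_rng(1)
original = rng.normal(size=100)
same_syntax = original + 0.01 * rng.normal(size=100)   # minor lexical change
diff_syntax = rng.normal(size=100)                      # restructured sentence
```

Under the paper's central finding, a paraphrase preserving syntactic form should score high overlap with the original, while a semantically equivalent but syntactically restructured paraphrase should score lower.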
Theorizing with LLMs
This is the first paper I ever contributed to; we did this exactly a year ago.