Interpreting and Steering LLM Agents for Social Simulations
We compare prompting, sparse autoencoders (SAEs), and linear probes for interpreting and controlling LLM agent behavior in social science simulations. SAE-based steering outperforms prompting, offering fine-grained, predictable control over agent preferences and capabilities.
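To make the SAE-based steering idea concrete, here is a minimal sketch (with made-up dimensions and random weights, not the paper's implementation): steering adds a scaled copy of one SAE feature's decoder direction to the model's residual-stream activations, which shifts the model along that interpretable direction by a controllable amount.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: model width and SAE dictionary size.
d_model, d_sae = 64, 512
W_dec = rng.standard_normal((d_sae, d_model))
# Normalize decoder rows so the steering coefficient has a fixed scale.
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

def steer(resid: np.ndarray, feature_idx: int, coeff: float) -> np.ndarray:
    """Add coeff * (unit decoder direction of one SAE feature) to residual activations."""
    return resid + coeff * W_dec[feature_idx]

# Residual-stream activations for 8 token positions.
resid = rng.standard_normal((8, d_model))
steered = steer(resid, feature_idx=42, coeff=5.0)

# The projection of each position onto the chosen feature's direction
# increases by exactly the steering coefficient.
proj_before = resid @ W_dec[42]
proj_after = steered @ W_dec[42]
```

In a real model this addition would be done with a forward hook at a chosen layer; the sketch just shows the arithmetic that gives the method its fine-grained, predictable character: the intervention strength is a single scalar per feature.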