MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents
arXiv:2604.03436v1 Announce Type: cross
Abstract: Sparse autoencoders (SAEs) are increasingly used for safety-relevant applications including alignment detection and model steering.
Why It Matters
Policy stories matter because compliance friction can slow adoption even when model quality keeps improving.