
EleutherAI has released a new study detailing a data-filtering approach aimed at improving the safety of open-weight large language models (LLMs). The research focuses on preventing dangerous knowledge from entering models during pretraining, using a multi-stage filtering pipeline that processes over 400 million documents. Key findings indicate that effective filtering can significantly reduce undesirable knowledge without notable degradation of unrelated model capabilities. The approach is intended to address the vulnerabilities of existing safeguards and make open-weight models more robust against tampering.
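To illustrate the general idea of a multi-stage filtering pipeline, here is a minimal sketch. The stage structure, blocklist, thresholds, and scoring function are all illustrative assumptions for a generic cascade filter, not details from EleutherAI's actual pipeline:

```python
# Hypothetical sketch of a multi-stage pretraining-data filter.
# Stage ordering, blocklist, and threshold are illustrative assumptions,
# not EleutherAI's published pipeline.

def stage1_keyword_filter(doc: str, blocklist: list[str]) -> bool:
    """Cheap first pass: keep the document only if no blocked term appears."""
    text = doc.lower()
    return not any(term in text for term in blocklist)

def stage2_classifier_filter(doc: str, score_fn, threshold: float = 0.5) -> bool:
    """Costlier second pass: keep the document only if a risk score is low."""
    return score_fn(doc) < threshold

def filter_corpus(docs, blocklist, score_fn, threshold: float = 0.5):
    """Run stages in order of increasing cost; later stages only see survivors."""
    survivors = [d for d in docs if stage1_keyword_filter(d, blocklist)]
    return [d for d in survivors if stage2_classifier_filter(d, score_fn, threshold)]

docs = [
    "A recipe for sourdough bread.",
    "How to culture a dangerous pathogen.",
    "Notes on transformer architectures.",
]
blocklist = ["pathogen"]  # illustrative placeholder list
kept = filter_corpus(docs, blocklist, score_fn=lambda d: 0.1)  # stub scorer
```

The design point of a cascade like this is efficiency at corpus scale: cheap lexical checks prune the bulk of documents before any expensive classifier runs.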