Building Pipelines for Data Filtering and Cleansing
Deduplicating Training Data Makes Language Models Better, Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini, 2021. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics. DOI: 10.48550/arXiv.2107.06499 - Demonstrates that deduplicating training data significantly improves language model performance, supporting the need for near-duplicate filtering.
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, Nils Reimers, Iryna Gurevych, 2019. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. DOI: 10.48550/arXiv.1908.10084 - Introduces Sentence-BERT, a method for generating semantically meaningful sentence embeddings, which is fundamental to embedding-based near-duplicate detection.
Hugging Face Datasets Library Documentation, Hugging Face, 2023 - Documents the Hugging Face datasets library, including how to efficiently load, process, and filter large text datasets for machine learning applications.
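As a minimal illustration of the exact-duplicate filtering step that the Lee et al. paper motivates, the sketch below hashes lightly normalized documents and keeps only the first occurrence of each hash. It is a standalone sketch, not code from any of the cited works: the function name, the choice of SHA-256, and the whitespace/case normalization are all illustrative assumptions, and near-duplicate detection (e.g. with MinHash or sentence embeddings) would require additional machinery.

```python
import hashlib

def dedupe_exact(texts):
    """Drop exact duplicates (after light normalization), keeping first occurrences.

    Normalization here (strip + lowercase) is an illustrative choice; real
    pipelines tune this and typically stream documents rather than hold a list.
    """
    seen = set()
    kept = []
    for text in texts:
        # Hash the normalized text so the set stores fixed-size digests,
        # not full documents.
        digest = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick brown fox jumps over the lazy dog. ",  # duplicate after normalization
    "A completely different sentence.",
]
print(dedupe_exact(corpus))  # keeps 2 of the 3 documents
```

The same predicate can be passed to a dataset library's filter method (for example, `Dataset.filter` in Hugging Face datasets) to apply it over a large corpus without materializing everything in memory.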