FineWeb-C: A community-created dataset to improve language models in ALL languages
FineWeb2 significantly advances multilingual pre-training datasets, covering over 1000 languages with high-quality data. The dataset uses approximately 8 terabytes of ...