Hugging Face Releases FinePDFs: a 3-Trillion-Token Dataset Built from PDFs

Robert Krzaczyński — Mon, 15 Sep 2025 08:55:00 GMT

Hugging Face has unveiled FinePDFs, the largest publicly available corpus built entirely from PDFs. The dataset spans 475 million documents in 1,733 languages, totaling roughly 3 trillion tokens. At 3.65 terabytes, FinePDFs adds a new dimension to open training datasets by tapping into a resource long considered too complex and expensive to process.


InfoQ - PDF
