This written version of the video tutorial was generated by an LLM from the video transcript, and supervised by me, Alejandro.
When it comes to building powerful large language models, the dataset often matters more than the model architecture. FineWeb is a prime example of why data quality and scale are so crucial.
In this tutorial, we’ll explore how Hugging Face created one of the largest open datasets for LLM training, and the key insights that make FineWeb stand out from other web-scale datasets.
Why FineWeb Matters
There’s very little information on the web about how to create datasets for training large language models. Even when companies release open-weight models, they rarely release the pre-training data, because there is little incentive to do so.
Hugging Face set out to change that by creating an open dataset that’s both high-quality and well-documented. The result is FineWeb: a 15 trillion token dataset in English derived from CommonCrawl dumps.
The core idea is simple: start with a massive corpus of web text, then apply a series of transformations and cleaning steps to produce a refined dataset.
Getting Data from the Web
Large language models are trained on enormous quantities of data from across the web. But how do you actually collect that data?
Hugging Face used CommonCrawl, a non-profit organization that publishes a fresh crawl of a large slice of the public web every one or two months. They started with 96 snapshots released since 2013, each containing hundreds of terabytes of raw HTML data.
Raw HTML vs. Pre-extracted Text
CommonCrawl offers two formats for download:
- WARC files: raw HTML data that requires custom text extraction
- WET files: pre-extracted text, already cleaned and supposedly HTML-free
Hugging Face’s key finding: extracting the text from raw HTML yourself outperforms the pre-extracted text. When they trained small models on both versions, the one trained on their own extraction performed significantly better.
Extracting text from raw HTML is the most expensive part of the process, so smaller teams might be tempted to skip it. But the quality gains are substantial enough to justify the effort.
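To make "custom extraction" concrete, here is a minimal sketch using only Python's standard library. The FineWeb write-up reports using the trafilatura library for this step; the code below is not that pipeline, just a simplified stand-in. The list of skipped tags and the function name `extract_text` are my own assumptions for illustration.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping elements that rarely hold page content."""

    # Assumed list of boilerplate tags; a real extractor uses richer heuristics.
    SKIPPED_TAGS = {"script", "style", "noscript", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML page, one text node per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()
```

Even this toy version shows why extraction quality matters: whatever the extractor lets through (menus, footers, script fragments) ends up in the training data.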
Base Filtering
After extraction, you need to clean and filter the raw text. FineWeb’s approach includes:
- Adult content removal using URL blocklists
- Language filtering — keeping only English text (a multilingual version also exists)
- Repetition filtering to remove low-quality content
After this initial filtering, they had about 36 trillion tokens — and this was just the first step.
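Two of these base filters can be sketched in a few lines. The blocklist entry, the 0.3 repetition threshold, and the function names are assumptions chosen for illustration, not FineWeb's actual values. Language filtering is left out because it needs a trained language-identification model.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical blocklist; a real one would hold many thousands of domains.
BLOCKED_DOMAINS = {"adult-example.test"}


def url_allowed(url):
    """Reject a URL whose host is a blocked domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def duplicate_line_fraction(text):
    """Fraction of non-empty lines that occur more than once in the document."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 1.0  # treat empty documents as fully repetitive
    counts = Counter(lines)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(lines)


def passes_base_filters(url, text, max_dup_fraction=0.3):
    """Keep a document only if its URL is allowed and it is not too repetitive."""
    return url_allowed(url) and duplicate_line_fraction(text) <= max_dup_fraction
```

The URL check runs first because it is the cheapest: it can discard a page without looking at its text at all.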
Deduplication: A Critical Lesson
Here’s where it gets interesting. The web is full of repeated content: mirrors, scraped copies, templates, syndicated articles, recrawled pages. Deduplication improves model generalization, reduces memorization, and increases data diversity.
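Most web duplicates are near-duplicates rather than exact copies, so exact string matching misses them. The FineWeb write-up reports using MinHash-based fuzzy deduplication; the sketch below shows the general idea only. The word 3-gram shingles, the 64 hash functions, and all function names are my assumptions, not the project's settings.

```python
import hashlib


def shingles(text, n=3):
    """Set of overlapping word n-grams that represents a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed, shingle):
    """One member of a family of 64-bit hash functions, selected by seed."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_hashes=64):
    """Minimum hash value per hash function; None for an empty document."""
    items = shingles(text)
    if not items:
        return None  # callers should skip empty documents
    return [min(_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Share of matching positions, an estimate of the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two documents whose signatures agree in most positions are treated as duplicates, and only one of them is kept. Production pipelines add a bucketing step so that not every pair of documents has to be compared.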
The Failed Approach: Global Deduplication
Hugging Face initially deduplicated the entire dataset against itself, processing snapshots from newest to oldest. The problem? This removed 90% of the data from the oldest snapshots — and the quality didn’t improve.
When they tested this on one of the oldest snapshots, a model trained on the removed data actually performed better than a model trained on the data that remained. What survived the aggressive deduplication was disproportionately ads, keyword lists, and badly formatted text, while much of what got removed was good content that simply also appeared in newer crawls.
The Better Approach: Independent Deduplication
The fix was simple: deduplicate within each snapshot, not against the entire dataset. Each crawl is deduplicated only against itself, then all dumps are combined.
This approach produced the 15 trillion token FineWeb dataset — and it finally matched the performance of RefinedWeb, the dataset used to train Falcon.
Key lesson: Deduplication is not monotonic. More aggressive deduplication can actually hurt your dataset.
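The difference between the two strategies is easiest to see on a toy example. This sketch uses exact matching on normalized text as a stand-in for fuzzy matching, and the snapshot names are made up.

```python
import hashlib


def fingerprint(text):
    """Hash of the lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def dedup_within_snapshot(docs):
    """Keep the first occurrence of each document inside one snapshot."""
    seen, kept = set(), []
    for doc in docs:
        fp = fingerprint(doc)
        if fp not in seen:
            seen.add(fp)
            kept.append(doc)
    return kept


def dedup_independent(snapshots):
    """Deduplicate every snapshot only against itself."""
    return {name: dedup_within_snapshot(docs) for name, docs in snapshots.items()}


def dedup_global(snapshots, order):
    """Deduplicate each snapshot against itself and all snapshots processed before it."""
    seen, result = set(), {}
    for name in order:
        kept = []
        for doc in snapshots[name]:
            fp = fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result


# Hypothetical snapshots: the older one mostly repeats content found in the newer one.
snapshots = {"old": ["A", "B", "B"], "new": ["A", "B", "C"]}
independent = dedup_independent(snapshots)
global_result = dedup_global(snapshots, order=["new", "old"])
```

With global deduplication processed newest first, the old snapshot ends up empty, because everything in it also exists in the new one. Independent deduplication keeps two documents in the old snapshot and three in the new one.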
C4-Style Quality Filters
C4 is a well-performing dataset that achieves impressive results with a handful of strict heuristic rules, including:
- Remove lines that don’t end with terminal punctuation
- Remove documents containing Lorem Ipsum
- Remove lines mentioning JavaScript or cookie notices
- Remove documents containing curly braces
When Hugging Face applied these to FineWeb, the punctuation rule alone removed 30% of tokens — too destructive.
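The rules above are simple enough to sketch directly. Following my reading of the original C4 recipe, the punctuation and JavaScript/cookie rules act on individual lines, while the Lorem Ipsum and curly-brace rules drop the whole document. The phrase list and the function name are illustrative assumptions.

```python
TERMINAL_PUNCTUATION = (".", "!", "?", '"')

# Illustrative phrases only; the real C4 list is longer.
LINE_BLOCK_PHRASES = ("javascript", "cookie policy", "uses cookies")


def c4_style_clean(text):
    """Return the cleaned text, or None if the whole document is dropped."""
    if "lorem ipsum" in text.lower() or "{" in text:
        return None  # document-level rules
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue  # also drops empty lines
        if any(phrase in line.lower() for phrase in LINE_BLOCK_PHRASES):
            continue
        kept.append(line)
    return "\n".join(kept) if kept else None
```

The punctuation rule is clearly the blunt one: headings, list items, and short captions all disappear, which is consistent with the large token loss reported above.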
Custom Filters
They replaced the strict punctuation rule with more nuanced custom filters:
- Lines without punctuation: Remove documents where 12% or fewer of the lines end with punctuation (instead of dropping every line that lacks it)
- Duplicated content: Remove documents where at least 10% of the characters sit in duplicated lines
- Short lines: Remove documents where more than 67% of lines are under 30 characters
This approach removed 22% of tokens instead of 30% — and improved benchmark performance.
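Here is a sketch of these three document-level filters. It reflects my reading of the FineWeb write-up: a document is kept only if more than 12% of its lines end with punctuation, and duplication is measured as the share of characters in lines that occur more than once. Counting every occurrence of a repeated line, and the function names, are my own assumptions.

```python
from collections import Counter

TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def line_stats(text):
    """Per-document ratios used by the filters; None for an empty document."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None
    counts = Counter(lines)
    total_chars = sum(len(ln) for ln in lines)
    # Characters in lines that occur more than once (all occurrences counted).
    dup_chars = sum(len(ln) * c for ln, c in counts.items() if c > 1)
    return {
        "punct_fraction": sum(ln.endswith(TERMINAL_PUNCTUATION) for ln in lines) / len(lines),
        "dup_char_fraction": dup_chars / total_chars,
        "short_fraction": sum(len(ln) < 30 for ln in lines) / len(lines),
    }


def passes_custom_filters(text):
    """Keep a document only if it clears all three thresholds."""
    stats = line_stats(text)
    if stats is None:
        return False
    return (
        stats["punct_fraction"] > 0.12
        and stats["dup_char_fraction"] < 0.10
        and stats["short_fraction"] < 0.67
    )
```

Unlike the C4 rule, a page with a few unpunctuated headings survives intact, and only documents that are mostly menus, lists, or repeated boilerplate are dropped.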
FineWeb-Edu: LLM-Powered Quality Filtering
The most interesting part of the project is FineWeb-Edu: a 1.3 trillion token subset that actually outperforms the full 15 trillion token dataset on educational benchmarks.
How It Works
Llama 3 and Phi-3 reportedly use educational quality filtering in their training pipelines. Hugging Face wanted to replicate this but with open models.
Step 1: Use Llama 3 70B Instruct to score 500,000 FineWeb samples from 0 to 5 based on educational value (0 = no educational value, 5 = outstanding educational value).
Step 2: Train a smaller classifier on these scored examples.
Step 3: Apply the classifier to the entire FineWeb dataset, keeping only documents with a score of 3 or higher.
The result? A 1.3 trillion token dataset that outperforms FineWeb and the other open web datasets it was compared against, with the largest gains on knowledge-heavy benchmarks such as MMLU.
Key lesson: LLMs can bootstrap scalable filters. Use a large model to label a small sample, then train a smaller model to process the full dataset.
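The three steps can be sketched end to end with a toy student model. Everything here is a stand-in: the teacher labels are hard-coded where a real pipeline would call a large LLM, and the "student" is a per-word average rather than a trained text classifier. Only the 0 to 5 scale and the threshold of 3 come from the description above.

```python
from collections import defaultdict

# Hypothetical teacher output: (document, score 0-5) pairs that a large LLM
# would have produced. Hard-coded so the sketch runs offline.
TEACHER_LABELS = [
    ("photosynthesis converts light energy into chemical energy", 5),
    ("the theorem follows from the definition of a derivative", 4),
    ("buy cheap watches online free shipping", 0),
    ("click here to win free prizes online", 0),
]


def train_student(labeled_docs):
    """Toy student: each word gets the mean teacher score of the docs containing it."""
    totals, counts = defaultdict(float), defaultdict(int)
    for text, score in labeled_docs:
        for word in set(text.lower().split()):
            totals[word] += score
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


def predict(student, text, default=0.0):
    """Average the learned scores of the known words in a document."""
    scores = [student[w] for w in text.lower().split() if w in student]
    return sum(scores) / len(scores) if scores else default


def filter_corpus(student, corpus, threshold=3.0):
    """Keep documents whose predicted educational score reaches the threshold."""
    return [doc for doc in corpus if predict(student, doc) >= threshold]


student = train_student(TEACHER_LABELS)
corpus = ["light energy and the derivative", "free watches online", "unseen words only"]
kept = filter_corpus(student, corpus)
```

The expensive model only ever sees the small labeled sample, while the cheap student is the one that runs over the full corpus.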
Key Lessons Learned
Extraction quality matters — Raw HTML processing outperforms pre-extracted text, despite being more expensive.
Deduplication is not monotonic — Aggressive global deduplication can destroy valuable data. Dedup within snapshots instead.
Evaluate with models, not heuristics — The best way to test dataset quality is to train small models on it and evaluate their performance. Heuristic filters should be validated against actual model performance.
Data quality varies over time — Recent CommonCrawl snapshots produce better models. Starting around 2022, there’s a noticeable improvement in data quality, likely due to more AI-generated (and thus well-structured) content on the web.
LLMs can bootstrap scalable filters — Use large models to create training data for smaller, more efficient classifiers.
Conclusion
FineWeb is an excellent example of how to approach dataset creation for LLM training. The documentation is thorough, the results are impressive, and the lessons are applicable to any data pipeline project.
Whether you’re building a dataset from scratch or fine-tuning on existing data, the principles of quality filtering, intelligent deduplication, and model-based evaluation apply universally.
