When it comes to building powerful large language models, the dataset often matters more than the model architecture. FineWeb is a great example of why data quality, filtering, and scale are so important.
In this tutorial, we’ll walk through how Hugging Face created one of the largest open datasets for LLM training, starting from Common Crawl snapshots and ending with FineWeb-Edu, a smaller educational subset that can outperform the full dataset on several benchmarks.
Why FineWeb Matters
There is surprisingly little public information about how to create datasets for training large language models. Even when companies release open models, they rarely release the full pre-training data.
FineWeb is valuable because it documents the process in detail. It is a 15-trillion-token English dataset derived from Common Crawl dumps, and the project explains the tradeoffs behind extraction, filtering, deduplication, and evaluation.
The core idea is simple: start with a huge corpus of web data, then apply a sequence of transformations to turn messy raw web pages into a refined dataset that can train stronger models.
Starting from Common Crawl
Large language models are trained on enormous amounts of text from the web. Hugging Face used Common Crawl as the starting point. Common Crawl is a nonprofit organization that publishes snapshots of the web every one or two months.
FineWeb started from 96 snapshots released since 2013. Each snapshot contains hundreds of terabytes of raw HTML data, so the first challenge is not simply downloading the data but extracting useful text from it.
The raw web contains boilerplate, spam, navigation menus, duplicated pages, non-English text, and many other forms of noise. A dataset pipeline has to remove as much of that as possible without throwing away useful content.
Raw HTML vs. Pre-Extracted Text
Common Crawl provides multiple formats. Two important ones are:
- WARC files: raw web archive files containing the crawled HTML and its metadata.
- WET files: plain text that Common Crawl has already extracted from the pages.
The pre-extracted text is easier and cheaper to use, but Hugging Face found that processing the raw HTML themselves gave better results. They used the Trafilatura library to extract the main text from the WARC files.
This is one of the most expensive parts of the process, so smaller teams might be tempted to skip it. But in the FineWeb experiments, small ablation models trained on the custom-extracted text scored better than models trained on the pre-extracted text.
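To make the extraction step concrete, here is a minimal sketch that uses only Python's standard library. It is not Trafilatura and not what FineWeb ran. It only illustrates pulling visible text out of HTML while skipping obvious boilerplate containers. The `SKIP_TAGS` set, the class name, and the sample page are assumptions made up for this example.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate in this toy example (an assumption, not FineWeb's rules).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collect visible text while ignoring anything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than zero while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-boilerplate text of an HTML page, one chunk per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = """
<html><head><style>p {color: red}</style></head>
<body>
  <nav><a href="/">Home</a> <a href="/about">About</a></nav>
  <p>MinHash estimates Jaccard similarity between documents.</p>
  <script>trackVisit();</script>
  <footer>Copyright 2024</footer>
</body></html>
"""

print(extract_text(page))
```

On this sample page, only the paragraph text survives and the navigation links, script, and footer are dropped. A real extractor such as Trafilatura uses much richer heuristics than a fixed tag list, which is part of why the step is expensive at Common Crawl scale.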
Base Filtering
After extracting text, FineWeb applies a first round of filtering. This includes:
- Removing adult content using URL blocklists.
- Filtering out non-English documents.
- Applying repetition and quality filters to remove obvious low-quality content.
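The three filters above can be sketched as simple predicates. This is a toy illustration under stated assumptions: the blocklist entry, the stopword-based `looks_english` check, and the `max_dup_fraction` threshold are all invented for the example. A real pipeline would use a curated blocklist and a trained language identifier.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical blocklist; real pipelines load large curated lists.
BLOCKED_DOMAINS = {"adult-example.test"}

# Tiny stopword list used as a stand-in for a real language classifier.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "is", "in", "it", "was", "on"}


def url_allowed(url):
    """Reject URLs whose host is a blocked domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def looks_english(text, min_stopword_fraction=0.2):
    """Crude placeholder: share of tokens that are common English stopwords."""
    tokens = text.lower().split()
    if not tokens:
        return False
    hits = sum(token in ENGLISH_STOPWORDS for token in tokens)
    return hits / len(tokens) >= min_stopword_fraction


def duplicate_line_fraction(text):
    """Fraction of non-empty lines that repeat an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    repeated = sum(c - 1 for c in counts.values())  # copies beyond the first
    return repeated / len(lines)


def passes_base_filters(url, text, max_dup_fraction=0.3):
    """Apply the URL, language, and repetition checks in order."""
    return (
        url_allowed(url)
        and looks_english(text)
        and duplicate_line_fraction(text) <= max_dup_fraction
    )


print(passes_base_filters("https://example.org/a", "The cat sat on the mat and it was happy."))
print(passes_base_filters("https://www.adult-example.test/x", "The cat sat on the mat."))
```

Ordering the checks from cheapest to most expensive is a common design choice, since the URL test can reject a page before any text is examined.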
After this first stage, the dataset still contained about 36 trillion tokens. That is a huge amount of data, but it was not yet the final dataset.
The next major challenge was deduplication.
Deduplication: More Is Not Always Better
The web is full of repeated content: mirrors, scraped copies, templates, syndicated articles, and pages that appear across many crawls. Deduplication is useful because it can improve model generalization, reduce memorization, avoid wasting training compute on repeated text, and increase data diversity.
FineWeb used MinHash for deduplication, but one of the most important lessons from the project is that deduplication is not monotonic. More deduplication is not automatically better.
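For readers who have not seen MinHash before, here is a minimal sketch of the idea. Each document is reduced to a short signature, and the share of matching signature positions estimates the Jaccard similarity of the documents' word n-gram sets. The shingle size, the number of hash functions, and the use of salted BLAKE2b are illustrative choices for this example, not FineWeb's configuration.

```python
import hashlib


def shingles(text, n=5):
    """Return the set of word n-grams in a document."""
    words = text.lower().split()
    if len(words) <= n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(shingle, seed):
    """One member of a family of hash functions, selected by seed."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text, num_hashes=64, n=5):
    """Keep the minimum hash value over all shingles, once per hash function."""
    grams = shingles(text, n)
    if not grams:
        raise ValueError("cannot compute a signature for an empty document")
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = ("the quick brown fox jumps over the lazy dog while the sleepy cat "
         "watches from the warm sunny window ledge")
doc_b = doc_a.replace("ledge", "sill")  # near-duplicate: one word changed
doc_c = ("common crawl publishes periodic snapshots of the public web that "
         "researchers can download and process for language model training")

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print("near-duplicate:", estimated_jaccard(sig_a, sig_b))
print("unrelated:     ", estimated_jaccard(sig_a, sig_c))
```

The near-duplicate pair should score high and the unrelated pair close to zero. At web scale, comparing every pair of signatures is infeasible, so production systems group signatures into bands and only compare documents that collide in at least one band.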
The Failed Approach
The first approach deduplicated the entire dataset against itself across snapshots. Hugging Face started with the newest snapshot and worked backwards, deduplicating each snapshot against itself and against every newer snapshot already processed.
This removed a huge amount of data from older snapshots. But when they evaluated the remaining data, the results were worse than expected. In some cases, the removed data was actually better than the data that remained.
The aggressive global deduplication removed useful content that also appeared in newer crawls, while what survived in the older snapshots skewed toward less valuable material such as ads, keyword lists, and boilerplate.
The Better Approach
The fix was to deduplicate within each snapshot instead of globally across all snapshots. Each crawl was deduplicated against itself, and then the resulting snapshots were combined.
This approach improved downstream performance and, combined with the quality filters described next, led to the final 15-trillion-token FineWeb dataset.
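The difference between the two strategies is easy to see on toy data. The sketch below uses exact string matching instead of MinHash to keep it short, and the snapshot names and documents are invented for illustration.

```python
def dedup_per_snapshot(snapshots):
    """Remove duplicates inside each snapshot only, preserving order."""
    return {name: list(dict.fromkeys(docs)) for name, docs in snapshots.items()}


def dedup_global(snapshots, order):
    """Process snapshots in the given order, dropping anything already seen."""
    seen = set()
    result = {}
    for name in order:
        kept = []
        for doc in snapshots[name]:
            if doc not in seen:
                seen.add(doc)
                kept.append(doc)
        result[name] = kept
    return result


# Invented example: good pages persist across crawls, junk is crawl-specific.
snapshots = {
    "2013": ["tutorial", "encyclopedia entry", "old forum spam", "old forum spam"],
    "2023": ["tutorial", "encyclopedia entry", "news article"],
}

per_snapshot = dedup_per_snapshot(snapshots)
global_result = dedup_global(snapshots, order=["2023", "2013"])  # newest first

print(per_snapshot["2013"])   # keeps the tutorial and the encyclopedia entry
print(global_result["2013"])  # only the spam is left in the old snapshot
```

In this toy case, global deduplication strips the older snapshot of everything that also appears in the newer one, so what remains of it is only the crawl-specific junk. That mirrors the failure mode described above, although the real experiments worked with fuzzy matches over billions of documents.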
The lesson is important: deduplication strategy matters. You need to evaluate the effect of the filter, not assume that removing more data is always better.
C4-Style Quality Filters
FineWeb also experiments with quality filters inspired by C4. C4 is a strong baseline dataset that performs well with relatively simple rules.
The C4 rules include:
- Dropping lines that do not end with terminal punctuation.
- Dropping documents that contain "lorem ipsum".
- Dropping lines that mention JavaScript or cookie notices.
- Dropping documents that contain curly braces.
But applying these rules too directly can be destructive. For example, the terminal punctuation rule alone removed about 30% of the tokens.
FineWeb left out that strict rule and added more nuanced document-level filters. Instead of discarding every line that lacks terminal punctuation, they removed documents in which too few lines ended with punctuation. They also filtered out documents with a high share of duplicated lines or of very short lines.
This removed less data while improving model performance.
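A fraction-based filter of this kind can be sketched in a few lines. The threshold defaults and the 30-character cutoff for a "short" line are illustrative assumptions for this example. The actual values should be taken from the FineWeb report or tuned by ablation.

```python
from collections import Counter

TERMINAL_PUNCTUATION = (".", "!", "?", '"', "'")


def keep_document(
    text,
    min_punct_fraction=0.12,     # assumed threshold
    max_short_fraction=0.67,     # assumed threshold
    max_dup_char_fraction=0.10,  # assumed threshold
    short_line_chars=30,         # assumed definition of a short line
):
    """Keep a document unless too many of its lines look like boilerplate."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False

    # Share of lines that end with terminal punctuation.
    punct = sum(ln.endswith(TERMINAL_PUNCTUATION) for ln in lines) / len(lines)

    # Share of lines that are very short (menus, buttons, link lists).
    short = sum(len(ln) < short_line_chars for ln in lines) / len(lines)

    # Share of characters that sit in repeated copies of a line.
    counts = Counter(lines)
    dup_chars = sum(len(ln) * (c - 1) for ln, c in counts.items())
    dup = dup_chars / sum(len(ln) for ln in lines)

    return (
        punct > min_punct_fraction
        and short < max_short_fraction
        and dup < max_dup_char_fraction
    )


article = (
    "Deduplication removes repeated documents from a crawl.\n"
    "It can reduce memorization and save training compute.\n"
    "A short heading\n"
    "Evaluating each filter with small models keeps the pipeline honest."
)
menu = "Home\nAbout\nContact\nLogin\nHome\nAbout"

print(keep_document(article))  # the heading alone does not sink the document
print(keep_document(menu))     # no punctuation, all short, repeated lines
```

The article contains one unpunctuated heading, yet the document is kept whole because most of its lines look like prose. That is the difference from a strict rule that acts on every offending line.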
FineWeb-Edu: Filtering for Educational Value
FineWeb-Edu is one of the most interesting parts of the project. It is a 1.3-trillion-token subset of FineWeb focused on educational content.
The idea is inspired by reports that models like Llama 3 and Phi-3 use educational quality filtering in their data pipelines. Hugging Face wanted to reproduce this idea with open tools.
The process works like this:
- Use a large model, Llama 3 70B Instruct, to score 500,000 FineWeb samples for educational value from 0 to 5.
- Train a smaller classifier on those scored examples.
- Run that classifier over the full FineWeb dataset.
- Keep documents above a chosen educational quality threshold.
With a threshold of 3 or higher, the result is a 1.3-trillion-token dataset that performs especially well on knowledge- and reasoning-heavy benchmarks such as MMLU and ARC.
The key lesson is that LLMs can bootstrap scalable filters. You can use a large model to label a smaller sample, then train a cheaper classifier to process the full dataset.
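The label-then-distill pattern can be shown with a deliberately tiny stand-in. In the sketch below, the hand-written scores play the role of the large model's annotations, and a word-average scorer plays the role of the trained classifier. Neither resembles the real FineWeb-Edu components, and all names and data are invented for illustration.

```python
from collections import defaultdict

# Stand-in for LLM annotations: (document, educational score from 0 to 5).
labeled_sample = [
    ("photosynthesis converts light energy into chemical energy", 5),
    ("the theorem follows from the triangle inequality", 4),
    ("buy cheap watches now limited offer", 0),
    ("click here to win a free prize now", 0),
]


def train_word_scores(labeled_docs):
    """Toy classifier: average the score of every document a word appears in."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for text, score in labeled_docs:
        for word in set(text.lower().split()):
            totals[word] += score
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


def predict_score(word_scores, text, default=0.0):
    """Score a new document as the mean score of its known words."""
    known = [w for w in text.lower().split() if w in word_scores]
    if not known:
        return default
    return sum(word_scores[w] for w in known) / len(known)


def filter_corpus(word_scores, corpus, threshold=3):
    """Keep documents whose rounded predicted score meets the threshold."""
    return [doc for doc in corpus if round(predict_score(word_scores, doc)) >= threshold]


word_scores = train_word_scores(labeled_sample)
corpus = [
    "light energy drives photosynthesis in plants",
    "win cheap watches now",
]
print(filter_corpus(word_scores, corpus))
```

The expensive model is called only on the small labeled sample, while the cheap scorer runs over the whole corpus. That cost split is what makes the approach practical at the scale of trillions of tokens.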
Key Lessons from FineWeb
FineWeb is useful not just as a dataset, but as a case study in dataset engineering. A few lessons stand out.
1. Extraction quality matters
Processing raw HTML is more expensive, but in FineWeb's experiments it produced better training data than Common Crawl's pre-extracted text.
2. Deduplication is not automatically good
Aggressive global deduplication can remove useful data. Deduplicating within snapshots worked better for FineWeb.
3. Evaluate datasets by training models
Heuristics are useful, but the best way to evaluate a training dataset is to train models on it and measure performance.
4. Data quality changes over time
FineWeb found that newer Common Crawl snapshots tended to produce better models. One possible explanation is that recent web content contains a growing share of AI-generated text, which tends to be cleaner and more structured.
5. LLMs can help build data filters
FineWeb-Edu shows how to use a large model to create labels, then train a smaller model to scale that filtering across trillions of tokens.
Conclusion
FineWeb is one of the clearest public examples of how modern LLM pre-training datasets are built. It shows that dataset creation is not just about collecting more text. The extraction method, filtering rules, deduplication strategy, and evaluation loop all matter.
If you’re working on LLM training data, fine-tuning datasets, or data curation pipelines, FineWeb is a valuable reference because it makes many of these tradeoffs explicit.
