The dataset for your next fine-tune.
Tell FineSet what you want to train on. It pulls the papers from arXiv and Semantic Scholar, merges the duplicates, scores them, and hands you clean JSONL.
Real output, not a mockup
These came straight out of the pipeline. One record per line, normalized, cross-referenced between arXiv and Semantic Scholar so duplicates merge, and scored by citations.
{"id": "4ecf6c33e4…", "sources": ["arxiv", "semantic_scholar"], "title": "Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model", "abstract": "Recently the state space models (SSMs) with efficient…", "authors": ["Lianghui Zhu", "Bencheng Liao", "…"], "categories": ["cs.CV", "cs.LG"], "published_date": "2024-01-17", "citation_count": 1826, "has_code": true, "code_url": "https://github.com/hustvl/Vim", "quality_score": 0.8154}
{"id": "75217fd65d…", "sources": ["arxiv", "semantic_scholar"], "title": "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache", "abstract": "Efficiently serving large language models (LLMs) requires batching…", "authors": ["Zirui Liu", "Jiayi Yuan", "…"], "categories": ["cs.CL", "cs.LG", "cs.PF"], "published_date": "2024-02-05", "citation_count": 507, "has_code": true, "code_url": "https://github.com/jy-yuan/KIVI", "quality_score": 0.6765}
{"id": "ab96776701…", "sources": ["arxiv", "semantic_scholar"], "title": "Visual Attention Network", "abstract": "While originally designed for natural language processing tasks,…", "authors": ["Meng-Hao Guo", "Cheng-Ze Lu", "…"], "categories": ["cs.CV"], "published_date": "2022-02-20", "citation_count": 1019, "has_code": true, "code_url": "https://github.com/Visual-Attention-Network", "quality_score": 0.7522}
6,500+ papers across 4 datasets, already on HuggingFace: agents, efficiency, interpretability, and synthetic data.
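If you want to poke at records like these yourself, here is a minimal Python sketch for reading them. It assumes only what the samples above show: one JSON object per line. The two records are abbreviated copies of the samples (fewer fields, ids left truncated as shown), and `parse_jsonl` is a hypothetical helper name, not part of FineSet.

```python
import json

# Two abbreviated records in the shape shown above (a subset of the fields).
SAMPLE_JSONL = """\
{"id": "4ecf6c33e4…", "title": "Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model", "categories": ["cs.CV", "cs.LG"], "citation_count": 1826, "has_code": true, "quality_score": 0.8154}
{"id": "75217fd65d…", "title": "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache", "categories": ["cs.CL", "cs.LG", "cs.PF"], "citation_count": 507, "has_code": true, "quality_score": 0.6765}
"""


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_jsonl(SAMPLE_JSONL)
for rec in records:
    # Print a compact summary of each record.
    print(rec["title"], rec["citation_count"], rec["quality_score"])
```

For a downloaded file, the same function works on the file's contents, for example `parse_jsonl(open(path, encoding="utf-8").read())`.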
Who it's for
ML practitioners
Thousands of papers on your topic as a JSONL dataset that updates daily. Every record carries a quality score, so you can filter out the noise before you train.
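Filtering on the score might look like the sketch below. The titles and scores are taken from the sample records; the 0.7 cutoff and the `filter_by_quality` helper are assumptions for illustration, so pick whatever threshold suits your training run.

```python
# Titles and quality scores copied from the sample records; other fields omitted.
records = [
    {"title": "Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model", "quality_score": 0.8154},
    {"title": "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache", "quality_score": 0.6765},
    {"title": "Visual Attention Network", "quality_score": 0.7522},
]


def filter_by_quality(records, min_score=0.7):
    """Keep records whose quality_score is at least min_score.

    The 0.7 default is an arbitrary example value, not a FineSet recommendation.
    Records with no score are treated as 0.0 and dropped.
    """
    return [r for r in records if r.get("quality_score", 0.0) >= min_score]


kept = filter_by_quality(records)
for rec in kept:
    print(rec["quality_score"], rec["title"])
```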
Researchers
Every new paper in your subfield, structured and deduplicated, waiting for you in the morning.
Builders
Tell it what data you want to track and the records keep flowing. No scrapers to write, none to babysit.
How it works
Describe your topic
Give it keywords and, if you want, arXiv categories. Something like "RLHF, cs.LG, since 2023".
The pipeline assembles it
It pulls from every source that fits, normalizes the fields, merges duplicates, scores quality, and strips PII.
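To make the "merges duplicates" step concrete, here is a rough sketch of one way it could work. This page does not describe FineSet's actual matching logic, so the approach below (matching on a normalized title, then combining the `sources` lists and filling in missing fields) is an assumption, and `normalize_title` and `merge_duplicates` are hypothetical names.

```python
import re


def normalize_title(title):
    """Lowercase and collapse punctuation and whitespace so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_duplicates(records):
    """Merge records that share a normalized title (an assumed matching rule)."""
    merged = {}
    for rec in records:
        key = normalize_title(rec["title"])
        if key not in merged:
            # First time we see this paper: copy it so the input is not mutated.
            merged[key] = {**rec, "sources": list(rec.get("sources", []))}
            continue
        kept = merged[key]
        # Union the source lists, preserving first-seen order.
        for src in rec.get("sources", []):
            if src not in kept["sources"]:
                kept["sources"].append(src)
        # Fill in fields the first copy was missing.
        for field, value in rec.items():
            if field != "sources" and kept.get(field) is None:
                kept[field] = value
    return list(merged.values())


# The same paper as it might arrive from two sources (illustrative input).
raw = [
    {"title": "Visual Attention Network", "sources": ["arxiv"], "citation_count": None},
    {"title": "Visual attention network.", "sources": ["semantic_scholar"], "citation_count": 1019},
]
merged = merge_duplicates(raw)
print(merged)
```

Title matching is the simplest rule to show here. A production pipeline would more likely key on a shared identifier such as an arXiv ID where both sources expose one.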
Download JSONL
Export the dataset as JSONL. It is refreshed every day, so it stays current. Parquet and a one-click push to HuggingFace are coming soon.
Questions
What is FineSet?
FineSet turns a plain-English topic into a clean, deduplicated, export-ready training dataset. You describe what you want to train on, and FineSet assembles the records, scores their quality, and gives you JSONL.
Where does the data come from?
For research papers, FineSet pulls from arXiv and Semantic Scholar, cross-references them so duplicate papers merge, and keeps citation counts and metadata. More sources, such as GitHub, job boards, news, and forums, will follow.
What format are the datasets in?
JSONL, one record per line, with fields like title, abstract, authors, categories, citation_count, has_code, and a 0 to 1 quality score. Parquet and a one-click push to HuggingFace are coming.
How often are datasets updated?
Daily. Once you subscribe to a topic, FineSet refreshes the dataset with new matching records every day.
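If you keep a local copy, one way to fold each day's records into it is an upsert keyed on `id`. This is a sketch under assumptions: it supposes that ids stay stable across refreshes, which this page does not state, and the ids and counts below are made up for illustration.

```python
def upsert_records(existing, fresh):
    """Return existing records updated with fresh ones, matched on "id".

    A fresh record replaces an existing record with the same id;
    records with new ids are appended in arrival order.
    """
    by_id = {rec["id"]: rec for rec in existing}
    for rec in fresh:
        by_id[rec["id"]] = rec
    return list(by_id.values())


# Hypothetical ids and citation counts, for illustration only.
local = [
    {"id": "paper-1", "citation_count": 10},
    {"id": "paper-2", "citation_count": 3},
]
todays = [
    {"id": "paper-2", "citation_count": 4},  # updated record
    {"id": "paper-3", "citation_count": 0},  # new record
]
local = upsert_records(local, todays)
print(local)
```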
How much does FineSet cost?
FineSet is in early access. Join the waitlist for a free spot. Pricing for heavier usage will be announced later.
Can I get a dataset on my own topic?
Yes. Describe any research area or data domain and FineSet builds the pipeline for it. Join the waitlist to request your topic.