FineSet — The dataset for your next fine-tune

Real output, not a mockup

The records below came straight out of the pipeline: one per line, normalized, cross-referenced between arXiv and Semantic Scholar so duplicates merge, and scored by citations.

efficient-llm-papers.jsonl

{"id": "4ecf6c33e4…", "sources": ["arxiv", "semantic_scholar"], "title": "Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model", "abstract": "Recently the state space models (SSMs) with efficient…", "authors": ["Lianghui Zhu", "Bencheng Liao", "…"], "categories": ["cs.CV", "cs.LG"], "published_date": "2024-01-17", "citation_count": 1826, "has_code": true, "code_url": "https://github.com/hustvl/Vim", "quality_score": 0.8154}
{"id": "75217fd65d…", "sources": ["arxiv", "semantic_scholar"], "title": "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache", "abstract": "Efficiently serving large language models (LLMs) requires batching…", "authors": ["Zirui Liu", "Jiayi Yuan", "…"], "categories": ["cs.CL", "cs.LG", "cs.PF"], "published_date": "2024-02-05", "citation_count": 507, "has_code": true, "code_url": "https://github.com/jy-yuan/KIVI", "quality_score": 0.6765}
{"id": "ab96776701…", "sources": ["arxiv", "semantic_scholar"], "title": "Visual Attention Network", "abstract": "While originally designed for natural language processing tasks,…", "authors": ["Meng-Hao Guo", "Cheng-Ze Lu", "…"], "categories": ["cs.CV"], "published_date": "2022-02-20", "citation_count": 1019, "has_code": true, "code_url": "https://github.com/Visual-Attention-Network", "quality_score": 0.7522}

6,500+ papers across 4 datasets, already on HuggingFace: agents, efficiency, interpretability, and synthetic data.

Who it's for

ML practitioners

Thousands of papers on your topic as a JSONL dataset that updates daily. Quality scored, so you can filter out the noise before you train.
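As a rough illustration of filtering on the quality score, here is a minimal Python sketch. The helper names `read_jsonl` and `filter_by_quality` and the 0.7 threshold are assumptions for this example, not part of FineSet; the three inline records are abbreviated from the sample output, with shortened titles and most fields omitted.

```python
import json
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty line of JSONL text."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_by_quality(records: Iterable[dict], min_score: float = 0.7) -> list:
    """Keep records whose quality_score is at least min_score."""
    # Records without a score are treated as 0.0 (an assumption of this sketch).
    return [r for r in records if r.get("quality_score", 0.0) >= min_score]


# Abbreviated records modeled on the sample output (titles shortened).
sample = """\
{"title": "Vision Mamba", "citation_count": 1826, "quality_score": 0.8154}
{"title": "KIVI", "citation_count": 507, "quality_score": 0.6765}
{"title": "Visual Attention Network", "citation_count": 1019, "quality_score": 0.7522}
"""

kept = filter_by_quality(read_jsonl(sample.splitlines()), min_score=0.7)
print([r["title"] for r in kept])  # ['Vision Mamba', 'Visual Attention Network']
```

The same helpers work on a downloaded file: open it and pass the file object to `read_jsonl`, since iterating a text file yields one line at a time.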

Researchers

Every new paper in your subfield, structured and deduplicated, waiting for you in the morning.

Builders

Tell FineSet what data you want to track and the records keep flowing. No scrapers to write, none to babysit.

How it works

1. Describe your topic

Give FineSet keywords and, if you want, arXiv categories. Something like "RLHF, cs.LG, since 2023".
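To make the idea of a topic concrete, here is one way such a request could be represented and applied to records. This is only a sketch: `TopicSpec` and its fields are hypothetical names invented for this example, and the page does not say how FineSet itself interprets a topic. The two test records reuse fields from the sample output.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TopicSpec:
    """Hypothetical structured form of a topic request; not FineSet's API."""
    keywords: List[str]
    categories: List[str] = field(default_factory=list)
    since_year: Optional[int] = None

    def matches(self, record: dict) -> bool:
        """True if the record hits a keyword, a category, and the date cutoff."""
        text = f"{record.get('title', '')} {record.get('abstract', '')}".lower()
        if not any(kw.lower() in text for kw in self.keywords):
            return False
        wanted = set(self.categories)
        if wanted and not wanted & set(record.get("categories", [])):
            return False
        if self.since_year is not None:
            # published_date is "YYYY-MM-DD" in the sample records.
            year = int(record.get("published_date", "0000")[:4])
            if year < self.since_year:
                return False
        return True


spec = TopicSpec(keywords=["quantization"], categories=["cs.LG"], since_year=2023)

kivi = {
    "title": "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache",
    "categories": ["cs.CL", "cs.LG", "cs.PF"],
    "published_date": "2024-02-05",
}
van = {
    "title": "Visual Attention Network",
    "categories": ["cs.CV"],
    "published_date": "2022-02-20",
}

print(spec.matches(kivi), spec.matches(van))  # True False
```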

2. The pipeline assembles it

It pulls from every source that fits, normalizes the fields, merges duplicates, scores quality, and strips PII.
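The duplicate-merging step can be pictured with a short sketch. It assumes duplicates are detected by a normalized title and that the larger citation count wins when sources disagree; FineSet's actual matching rule is not described here, and the function names are made up for illustration. The second input record is a made-up variant of a sample title.

```python
import re


def title_key(title: str) -> str:
    """Lowercase and collapse punctuation and spaces so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_duplicates(records: list) -> list:
    """Merge records that share a normalized title, keeping first-seen order."""
    merged = {}
    for rec in records:
        key = title_key(rec["title"])
        if key not in merged:
            # Copy so the caller's records are not mutated.
            merged[key] = {**rec, "sources": list(rec.get("sources", []))}
            continue
        kept = merged[key]
        for src in rec.get("sources", []):
            if src not in kept["sources"]:
                kept["sources"].append(src)
        # Assumption: prefer the larger citation count when sources disagree.
        kept["citation_count"] = max(
            kept.get("citation_count", 0), rec.get("citation_count", 0)
        )
    return list(merged.values())


# Illustrative input: the same paper as seen by two sources.
raw = [
    {"title": "Visual Attention Network", "sources": ["arxiv"]},
    {
        "title": "Visual attention network.",
        "sources": ["semantic_scholar"],
        "citation_count": 1019,
    },
]

for rec in merge_duplicates(raw):
    print(rec["title"], rec["sources"], rec["citation_count"])
```

Title matching is the simplest possible key. A production pipeline would more likely match on stable identifiers such as an arXiv ID or DOI when both sources expose them.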

3. Download JSONL

Export the dataset as JSONL. It stays fresh, refreshed every day. Parquet and a one-click push to HuggingFace are coming soon.

Questions

What is FineSet?

FineSet turns a plain-English topic into a clean, deduplicated, export-ready training dataset. You describe what you want to train on, and FineSet assembles the records, scores their quality, and gives you JSONL.

Where does the data come from?

For research papers, FineSet pulls from arXiv and Semantic Scholar, cross-references them so duplicate papers merge, and keeps citation counts and metadata. More sources, such as GitHub, job boards, news, and forums, will follow.

What format are the datasets in?

JSONL, one record per line, with fields like title, abstract, authors, categories, citation_count, has_code, and a 0 to 1 quality score. Parquet and a one-click push to HuggingFace are coming.
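If you want to sanity-check a file before training on it, a small validator along these lines may help. The field names and types are inferred from the sample records and the list above; this is not an official schema, and `validate_record` is a hypothetical helper.

```python
import json

# Field names and types inferred from the sample records; not an official schema.
EXPECTED_TYPES = {
    "title": str,
    "abstract": str,
    "authors": list,
    "categories": list,
    "citation_count": int,
    "has_code": bool,
    "quality_score": (int, float),  # JSON may encode 1.0 as the integer 1
}


def validate_record(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty if none)."""
    record = json.loads(line)
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"wrong type for {name}")
    score = record.get("quality_score")
    if isinstance(score, (int, float)) and not 0.0 <= score <= 1.0:
        problems.append("quality_score outside 0 to 1")
    return problems


good = (
    '{"title": "KIVI", "abstract": "(abstract text)", "authors": ["Zirui Liu"], '
    '"categories": ["cs.CL"], "citation_count": 507, "has_code": true, '
    '"quality_score": 0.6765}'
)
bad = '{"title": "KIVI", "quality_score": 1.5}'

print(validate_record(good))  # []
print(validate_record(bad))
```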

How often are datasets updated?

Daily. Once you subscribe to a topic, FineSet refreshes the dataset with new matching records every day.

How much does FineSet cost?

FineSet is in early access. Join the waitlist for a free spot. Pricing for larger usage will come later.

Can I get a dataset on my own topic?

Yes. Describe any research area or data domain and FineSet builds the pipeline for it. Join the waitlist to request your topic.

Get your dataset.

Join the waitlist
