ALITH

BOMForge-AI Dataset — Now Open Source

BOMForge AI Team · TECHgium 2026, Jaya Engineering College · Released under CC BY 4.0

"The quality of your data is the ceiling of your model. Not your GPU. Not your team. Not your funding. Your data."

We spent months travelling across India — boarding buses, walking into factory floors, sitting with production engineers, and learning the unglamorous craft of manufacturing planning. We did this to solve a problem that no AI company in the world had solved: the automatic conversion of Engineering Bills of Materials (eBOM) into Manufacturing Bills of Materials (mBOM).

Today, we are open-sourcing the dataset we built from that journey. Completely free. For anyone to train on, research with, or build upon.

Official Announcement

The BOMForge-AI Dataset — 12,500 curated eBOM-to-mBOM transformation pairs spanning 9 manufacturing industries — is now publicly available under the Creative Commons Attribution 4.0 (CC BY 4.0) licence. It is free for training, research, academic study, and commercial use with attribution.

12,500 curated pairs total · 4,606 field-collected pairs · 15 factories visited · 9 industries covered

Why This Dataset Matters

Every manufactured product starts with an Engineering BOM, a flat list of components. Before a factory can begin production, a skilled engineer must manually convert that list into a Manufacturing BOM: assigning machines, defining operation sequences (Op 10 → Op 20 → Op 30), deciding what to make versus buy, setting work centres, calculating lead times, and mapping routings. For a mid-size manufacturer handling 150 BOMs a month, this consumes roughly 750 engineer-hours every month (about five hours per BOM), with a 15–20% error rate.

Yet despite this scale, we found no public dataset for eBOM-to-mBOM conversion. We searched Hugging Face, Kaggle, and the academic literature; the closest public resource was a 791-pair electronics component dataset covering a single industry.

"We are not releasing a demo dataset. We are releasing the institutional memory of Indian manufacturing — and we are giving it away."

The Factories Behind the Data

The 4,606 field-collected pairs in this dataset were gathered from real factory floors, with the cooperation of production engineers who generously shared their domain expertise. Below is a full account of every facility we visited and what we learned there.

Factory	Location	Industry	BOM Types Studied	Pairs
Trident Group	Madhya Pradesh	Textiles	Yarn Construction BOMs, Weaving Process BOMs, Wet Processing routing	847
Rockman Industries	Ludhiana, Punjab	Automotive	High Pressure Die Casting, CNC machining, surface treatment, assembly routing	623
JSG Innotech	Gurugram, Haryana	Automotive	Automotive press shop BOMs, wiring harness BOMs	445
GVL Electro Controls	Maharashtra	Electrical	MCC panels, PCC switchboards, Factory Acceptance Test BOMs	312
Reve Pharma	Maharashtra	Pharma	Formulation BOMs, granulation routing, blister packing sequences	234
Kumar Pumps	Andhra Pradesh	Metal Engineering	Foundry pattern BOMs, hydrostatic test routing	189
GIE Jewels	Jaipur, Rajasthan	Jewellery	Investment casting BOMs, stone setting routing, BIS Hallmarking process	143

The seven primary sites above account for 2,793 of the 4,606 field-collected pairs. The remainder was collected from eight additional facilities covering garments, plastics, personal care, and general metal fabrication across Tamil Nadu, Gujarat, and Karnataka, bringing the total to 15 factories across 9 Indian states.

Supplementary Sources

The field-collected pairs form the irreplaceable core of the dataset. To extend coverage and improve generalisation across SAP material types and ERP structures, we supplemented with:

SAP S/4HANA reference BOM templates
Open-source PCB and electronics hardware BOMs
IIT Bombay and NIT Trichy manufacturing research datasets
ERPNext and Odoo demo system data

All supplementary data was cleaned, structured to a unified schema, and validated against real-world routing conventions before inclusion.

Dataset Schema — Full Field Reference

Each record in the dataset is a structured JSON object representing one complete eBOM-to-mBOM transformation pair. The schema was designed for direct compatibility with SAP S/4HANA, Oracle Cloud SCM, and ERPNext.

Field Name	Type	Description	Example Values
record_id	string	Unique identifier for each eBOM–mBOM pair	BF-00001, BF-04606
source_type	enum	Whether the pair is field-collected or supplementary	field, synthetic, reference
industry	string	Manufacturing sector of the source factory	automotive, pharma, textiles, jewellery
factory_state	string	Indian state where the factory is located	Punjab, Maharashtra, Rajasthan
ebom_item_code	string	Engineering BOM line item identifier	AS-SHAFT-EN24-001
ebom_description	string	Full text description of the component as it appears in the design BOM	Shaft, EN24 Steel, Ø32mm × 420mm
ebom_quantity	float	Base quantity per assembly unit	1.0, 4.0, 0.025
ebom_uom	string	Unit of measure in the engineering BOM	EA, KG, MTR, LTR, SET
sap_material_type	enum	SAP material classification assigned by the model	FERT, HALB, ROH, VERP
make_or_buy	enum	Procurement decision for this item	MAKE, BUY, SUBCONTRACT
routing_ops	array	Ordered list of manufacturing operations (SAP routing format)	[Op 10, Op 20, Op 30, ...]
op_code	string	SAP standard operation code (within each routing_ops entry)	0010, 0020, 0030
op_description	string	Plain-text name of the manufacturing operation	HPDC Casting, CNC Turning, Shot Blasting
work_centre	string	Machine or station assigned to the operation	CNC-LATHE-01, CMM-INSP, PAINT-LINE-B
std_time_min	float	Standard time in minutes for the operation	12.5, 45.0, 3.0
lead_time_days	integer	Procurement or production lead time in calendar days	3, 14, 45
scrap_percentage	float	Expected scrappage allowance for this operation (%)	2.5, 0.5, 8.0
component_group	string	Logical kit grouping assigned during intelligent grouping (Model 2)	FASTENER-KIT-A, SEAL-ASSEMBLY-01
mbom_line_item	object	Complete mBOM output record — the target for model training	{ material, plant, routing, BOM_usage }
erp_target	enum	ERP system this record was validated against	SAP_S4HANA, ORACLE_SCM, ERPNEXT
validation_status	enum	Human-reviewed quality flag	verified, auto-generated, pending-review
annotator_notes	string	Free-text comments from the production engineer who reviewed the pair	"Shaft goes to cylindrical grinding, not painting"
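To make the field reference concrete, here is a minimal Python sketch of one record and a basic sanity check. It rests on several assumptions: the record values are illustrative rather than taken from the dataset, the per-operation fields (op_code, op_description, work_centre, std_time_min, scrap_percentage) are assumed to be nested inside each routing_ops entry, and the example values listed for the enum fields are treated as the complete set. The names ENUMS, EXAMPLE_RECORD, and validate_record are hypothetical.

```python
from typing import Any

# Enum values copied from the "Example Values" column of the field reference.
# Assumption: these lists are exhaustive.
ENUMS = {
    "source_type": {"field", "synthetic", "reference"},
    "sap_material_type": {"FERT", "HALB", "ROH", "VERP"},
    "make_or_buy": {"MAKE", "BUY", "SUBCONTRACT"},
    "erp_target": {"SAP_S4HANA", "ORACLE_SCM", "ERPNEXT"},
    "validation_status": {"verified", "auto-generated", "pending-review"},
}

# Illustrative record only; not an actual row from the dataset.
EXAMPLE_RECORD: dict[str, Any] = {
    "record_id": "BF-00001",
    "source_type": "field",
    "industry": "automotive",
    "factory_state": "Punjab",
    "ebom_item_code": "AS-SHAFT-EN24-001",
    "ebom_description": "Shaft, EN24 Steel, Ø32mm × 420mm",
    "ebom_quantity": 1.0,
    "ebom_uom": "EA",
    "sap_material_type": "HALB",
    "make_or_buy": "MAKE",
    "routing_ops": [
        {"op_code": "0010", "op_description": "CNC Turning",
         "work_centre": "CNC-LATHE-01", "std_time_min": 12.5,
         "scrap_percentage": 2.5},
        {"op_code": "0020", "op_description": "CMM Inspection",
         "work_centre": "CMM-INSP", "std_time_min": 3.0,
         "scrap_percentage": 0.5},
    ],
    "lead_time_days": 14,
    "erp_target": "ERPNEXT",
    "validation_status": "verified",
}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Check every enum field against its allowed values.
    for field, allowed in ENUMS.items():
        value = record.get(field)
        if value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    # Zero-padded op codes ("0010", "0020", ...) sort correctly as strings.
    codes = [op.get("op_code", "") for op in record.get("routing_ops", [])]
    if codes != sorted(codes):
        problems.append("routing_ops: op_code values are not in ascending order")
    return problems


print(validate_record(EXAMPLE_RECORD))
```

A check like this is useful mainly as a first filter before training; the published files may nest or name fields differently, so the sketch should be adapted to the actual release.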

How to Use This Dataset

For Model Training

The dataset is structured as instruction-response pairs suitable for fine-tuning transformer-based language models. We used QLoRA (Quantised Low-Rank Adaptation) for our own SLM fine-tune. QLoRA, introduced by Dettmers et al. in 2023, trains small low-rank adapter weights on top of a frozen, 4-bit quantised base model, which brings fine-tuning of large models within reach of a single GPU. The ebom_description, industry, and factory_state fields form the instruction context; the routing_ops, sap_material_type, make_or_buy, and mbom_line_item fields form the expected output.
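The field split described above can be sketched as a small conversion function. This is an illustration under assumptions, not the team's actual preprocessing: the prompt wording, the JSON serialisation of the target, and the names CONTEXT_FIELDS, TARGET_FIELDS, and to_training_pair are all hypothetical.

```python
import json
from typing import Any

# Field split as described in the post.
CONTEXT_FIELDS = ("ebom_description", "industry", "factory_state")
TARGET_FIELDS = ("routing_ops", "sap_material_type", "make_or_buy", "mbom_line_item")


def to_training_pair(record: dict[str, Any]) -> dict[str, str]:
    """Turn one dataset record into an instruction/response pair."""
    context = "\n".join(f"{name}: {record[name]}" for name in CONTEXT_FIELDS)
    # Hypothetical prompt wording; any consistent template would do.
    instruction = (
        "Convert this engineering BOM line into a manufacturing BOM entry.\n"
        + context
    )
    # Keep only the target fields that are present in the record.
    target = {name: record[name] for name in TARGET_FIELDS if name in record}
    # ensure_ascii=False keeps symbols such as Ø and × readable.
    response = json.dumps(target, ensure_ascii=False, sort_keys=True)
    return {"instruction": instruction, "response": response}


sample = {
    "ebom_description": "Shaft, EN24 Steel, Ø32mm × 420mm",
    "industry": "automotive",
    "factory_state": "Punjab",
    "sap_material_type": "HALB",
    "make_or_buy": "MAKE",
    "routing_ops": [{"op_code": "0010", "op_description": "CNC Turning"}],
}
pair = to_training_pair(sample)
print(pair["instruction"])
print(pair["response"])
```

Serialising the target as sorted-key JSON keeps the output format stable across records, which tends to make generated answers easier to parse back into an ERP import.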

For Research

Researchers studying industrial AI, knowledge graph construction, manufacturing process ontologies, or ERP automation will find the schema and multi-industry coverage particularly useful. The dataset spans genuinely heterogeneous process logic — granulation-before-compression in pharma is structurally different from HPDC-before-CNC in automotive — making it a strong benchmark for domain transfer and cross-industry generalisation studies.
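For domain-transfer experiments, one simple setup is to hold out an entire industry for testing and train on the rest. The sketch below assumes records are dictionaries with an industry field as in the schema; the function name split_by_held_out_industry is hypothetical.

```python
from typing import Any, Iterable


def split_by_held_out_industry(
    records: Iterable[dict[str, Any]], held_out: str
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Return (train, test), where test holds every record of one industry."""
    train: list[dict[str, Any]] = []
    test: list[dict[str, Any]] = []
    for record in records:
        if record["industry"] == held_out:
            test.append(record)
        else:
            train.append(record)
    return train, test


records = [
    {"record_id": "BF-00001", "industry": "automotive"},
    {"record_id": "BF-00002", "industry": "pharma"},
    {"record_id": "BF-00003", "industry": "textiles"},
    {"record_id": "BF-00004", "industry": "pharma"},
]
train, test = split_by_held_out_industry(records, held_out="pharma")
print(len(train), len(test))
```

Repeating the split once per industry gives a leave-one-industry-out evaluation, which measures how well routing knowledge learned in some sectors carries over to an unseen one.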

For Benchmarking

The closest comparable public dataset is a 791-pair electronics component resource. The BOMForge-AI Dataset is roughly 16× larger by pair count (12,500 vs 791) and covers nine industries rather than one. We welcome comparisons and hope it becomes a standard benchmark for manufacturing BOM intelligence tasks.
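The post does not define a scoring method, so the following is only one possible way to compare predictions against reference records. It assumes each record carries make_or_buy and a routing_ops list whose entries have op_description; the two metrics and the names routing_signature and score are hypothetical choices.

```python
from typing import Any


def routing_signature(record: dict[str, Any]) -> list[str]:
    """Ordered, case-insensitive list of operation names for one record."""
    return [
        op["op_description"].strip().lower()
        for op in record.get("routing_ops", [])
    ]


def score(
    predictions: list[dict[str, Any]], references: list[dict[str, Any]]
) -> dict[str, float]:
    """Exact-match rate for routing sequences and for make/buy decisions."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    total = len(references)
    if total == 0:
        return {"routing_exact_match": 0.0, "make_or_buy_accuracy": 0.0}
    pairs = list(zip(predictions, references))
    routing_hits = sum(
        routing_signature(p) == routing_signature(r) for p, r in pairs
    )
    decision_hits = sum(
        p.get("make_or_buy") == r.get("make_or_buy") for p, r in pairs
    )
    return {
        "routing_exact_match": routing_hits / total,
        "make_or_buy_accuracy": decision_hits / total,
    }
```

Exact match on the full operation sequence is deliberately strict: a routing with one misplaced operation counts as wrong, which mirrors how a single routing error can invalidate a production plan.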

Licence: Creative Commons Attribution 4.0 International (CC BY 4.0)
You are free to share, copy, distribute, and adapt this dataset, including for commercial purposes, provided you give appropriate credit to the BOMForge AI team, link to the original release and the licence, and indicate if changes were made. You do not need to ask for permission. You do not need to share your derivatives under the same terms.

What We Are Building With It

At BOMForge AI, we are training a domain-specific Small Language Model on this dataset: a model that understands manufacturing routing the way a 20-year veteran production engineer does. Not because it pattern-matches keywords, but because it has seen 12,500 conversion examples across 9 industries, 4,606 of them collected on real factory floors.

Our current fine-tune reached 85% conversion accuracy before GPU memory constraints halted the full training run. We are working to complete a full A100 training run to reach our 95% target. The platform supports direct import and export to SAP S/4HANA, Oracle Cloud SCM, and ERPNext — with the ERPNext integration live today.

We built this dataset ourselves. We are open-sourcing it because we believe the Indian manufacturing ecosystem deserves AI tools built on real Indian manufacturing data — not synthetic approximations, not single-industry proxies.

"We are building the institutional memory of Indian manufacturing. And we are the only ones doing it."

If you train on this data, find errors, extend the schema, or build something interesting — please reach out. This is a living dataset, and the community that grows around it will make it better.

Dataset released May 2026 · CC BY 4.0 · 12,500 pairs · 9 industries · 9 Indian states
