BOMForge-AI Dataset — Now Open Source

Open Source Release  ·  Manufacturing AI  ·  BOMForge

We Built the World's Largest eBOM-to-mBOM Dataset. Now It's Yours.

12,500 curated eBOM-to-mBOM conversion pairs, 4,606 of them collected in the field across 15 Indian factories, 9 industries, and 9 states, released freely for research and training.

"The quality of your data is the ceiling of your model. Not your GPU. Not your team. Not your funding. Your data."

We spent months travelling across India: boarding buses, walking onto factory floors, sitting with production engineers, and learning the unglamorous craft of manufacturing planning. We did this to solve a problem that, as far as we could find, no AI company had solved: the automatic conversion of Engineering Bills of Materials (eBOM) into Manufacturing Bills of Materials (mBOM).

Today, we are open-sourcing the dataset we built from that journey. Completely free. For anyone to train on, research with, or build upon.

Official Announcement

The BOMForge-AI Dataset — 12,500 curated eBOM-to-mBOM transformation pairs spanning 9 manufacturing industries — is now publicly available under the Creative Commons Attribution 4.0 (CC BY 4.0) licence. It is free for training, research, academic study, and commercial use with attribution.

12,500 Curated pairs total
4,606 Field-collected pairs
15 Factories visited
9 Industries covered

Why This Dataset Matters

Every physical product that gets manufactured starts with an Engineering BOM: a flat list of components. Before any factory can begin production, a skilled engineer must manually convert that list into a Manufacturing BOM: assigning machines, defining operation sequences (Op 10 → Op 20 → Op 30), deciding what to make versus buy, setting work centres, calculating lead times, and mapping routings. For a mid-size manufacturer handling 150 BOMs a month, this consumes roughly 750 engineer-hours every month (about five hours per BOM), with a 15–20% error rate.

Yet despite this scale, we could find no public multi-industry dataset for eBOM-to-mBOM conversion. We searched Hugging Face, Kaggle, and the academic literature; the closest public resource was a 791-pair electronics component dataset covering a single industry.

"We are not releasing a demo dataset. We are releasing the institutional memory of Indian manufacturing — and we are giving it away."

The Factories Behind the Data

The 4,606 field-collected pairs in this dataset were gathered from real factory floors, with the cooperation of production engineers who generously shared their domain expertise. The table below lists the seven primary facilities we visited and the BOM types we studied at each.

| Factory | Location | Industry | BOM Types Studied | Pairs |
|---|---|---|---|---|
| Trident Group | Madhya Pradesh | Textiles | Yarn Construction BOMs, Weaving Process BOMs, Wet Processing routing | 847 |
| Rockman Industries | Ludhiana, Punjab | Automotive | High Pressure Die Casting, CNC machining, surface treatment, assembly routing | 623 |
| JSG Innotech | Gurugram, Haryana | Automotive | Automotive press shop BOMs, wiring harness BOMs | 445 |
| GVL Electro Controls | Maharashtra | Electrical | MCC panels, PCC switchboards, Factory Acceptance Test BOMs | 312 |
| Reve Pharma | Maharashtra | Pharma | Formulation BOMs, granulation routing, blister packing sequences | 234 |
| Kumar Pumps | Andhra Pradesh | Metal Engineering | Foundry pattern BOMs, hydrostatic test routing | 189 |
| GIE Jewels | Jaipur, Rajasthan | Jewellery | Investment casting BOMs, stone setting routing, BIS Hallmarking process | 143 |

Beyond these seven primary sites, data was collected from eight additional facilities covering garments, plastics, personal care, and general metal fabrication across Tamil Nadu, Gujarat, and Karnataka — bringing the total to 15 factories across 9 Indian states.

Supplementary Sources

The field-collected pairs form the irreplaceable core of the dataset. To extend coverage and improve generalisation across SAP material types and ERP structures, we supplemented with:

  • SAP S/4HANA reference BOM templates
  • Open-source PCB and electronics hardware BOMs
  • IIT Bombay and NIT Trichy manufacturing research datasets
  • ERPNext and Odoo demo system data

All supplementary data was cleaned, structured to a unified schema, and validated against real-world routing conventions before inclusion.

Dataset Schema — Full Field Reference

Each record in the dataset is a structured JSON object representing one complete eBOM-to-mBOM transformation pair. The schema was designed for direct compatibility with SAP S/4HANA, Oracle Cloud SCM, and ERPNext.

| Field Name | Type | Description | Example Values |
|---|---|---|---|
| record_id | string | Unique identifier for each eBOM–mBOM pair | BF-00001, BF-04606 |
| source_type | enum | Whether the pair is field-collected or supplementary | field, synthetic, reference |
| industry | string | Manufacturing sector of the source factory | automotive, pharma, textiles, jewellery |
| factory_state | string | Indian state where the factory is located | Punjab, Maharashtra, Rajasthan |
| ebom_item_code | string | Engineering BOM line item identifier | AS-SHAFT-EN24-001 |
| ebom_description | string | Full text description of the component as it appears in the design BOM | Shaft, EN24 Steel, Ø32mm × 420mm |
| ebom_quantity | float | Base quantity per assembly unit | 1.0, 4.0, 0.025 |
| ebom_uom | string | Unit of measure in the engineering BOM | EA, KG, MTR, LTR, SET |
| sap_material_type | enum | SAP material classification assigned by the model | FERT, HALB, ROH, VERP |
| make_or_buy | enum | Procurement decision for this item | MAKE, BUY, SUBCONTRACT |
| routing_ops | array | Ordered list of manufacturing operations (SAP routing format) | [Op 10, Op 20, Op 30, ...] |
| op_code | string | SAP standard operation code (within each routing_ops entry) | 0010, 0020, 0030 |
| op_description | string | Plain-text name of the manufacturing operation | HPDC Casting, CNC Turning, Shot Blasting |
| work_centre | string | Machine or station assigned to the operation | CNC-LATHE-01, CMM-INSP, PAINT-LINE-B |
| std_time_min | float | Standard time in minutes for the operation | 12.5, 45.0, 3.0 |
| lead_time_days | integer | Procurement or production lead time in calendar days | 3, 14, 45 |
| scrap_percentage | float | Expected scrappage allowance for this operation (%) | 2.5, 0.5, 8.0 |
| component_group | string | Logical kit grouping assigned during intelligent grouping (Model 2) | FASTENER-KIT-A, SEAL-ASSEMBLY-01 |
| mbom_line_item | object | Complete mBOM output record, the target for model training | { material, plant, routing, BOM_usage } |
| erp_target | enum | ERP system this record was validated against | SAP_S4HANA, ORACLE_SCM, ERPNEXT |
| validation_status | enum | Human-reviewed quality flag | verified, auto-generated, pending-review |
| annotator_notes | string | Free-text comments from the production engineer who reviewed the pair | "Shaft goes to cylindrical grinding, not painting" |
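
To make the field reference concrete, here is a minimal Python sketch of what one record might look like, together with a small sanity check. Several things here are assumptions rather than facts about the release: the values are stitched together from the example column above and do not reproduce a real row; the per-operation fields are nested inside each routing_ops entry (the table states this only for op_code); the mbom_line_item values are placeholders; the listed example values are treated as the complete enum sets; and check_record is a hypothetical helper, not part of any BOMForge tooling.

```python
# Illustrative record only: values are taken from the "Example Values"
# column or invented as placeholders, not copied from a real dataset row.
example_record = {
    "record_id": "BF-00001",
    "source_type": "field",
    "industry": "automotive",
    "factory_state": "Punjab",
    "ebom_item_code": "AS-SHAFT-EN24-001",
    "ebom_description": "Shaft, EN24 Steel, Ø32mm × 420mm",
    "ebom_quantity": 1.0,
    "ebom_uom": "EA",
    "sap_material_type": "HALB",
    "make_or_buy": "MAKE",
    # Assumption: per-operation fields live inside each routing_ops entry.
    "routing_ops": [
        {"op_code": "0010", "op_description": "CNC Turning",
         "work_centre": "CNC-LATHE-01", "std_time_min": 12.5,
         "scrap_percentage": 2.5},
        {"op_code": "0020", "op_description": "CMM Inspection",  # placeholder name
         "work_centre": "CMM-INSP", "std_time_min": 3.0,
         "scrap_percentage": 0.5},
    ],
    "lead_time_days": 14,
    "component_group": "SEAL-ASSEMBLY-01",
    # Placeholder values; only the four key names come from the schema table.
    "mbom_line_item": {"material": "AS-SHAFT-EN24-001", "plant": "PLANT-01",
                       "routing": "ROUTING-01", "BOM_usage": "1"},
    "erp_target": "SAP_S4HANA",
    "validation_status": "verified",
    "annotator_notes": "Shaft goes to cylindrical grinding, not painting",
}

# Assumption: the example values in the table are the full set of enum members.
ALLOWED = {
    "source_type": {"field", "synthetic", "reference"},
    "sap_material_type": {"FERT", "HALB", "ROH", "VERP"},
    "make_or_buy": {"MAKE", "BUY", "SUBCONTRACT"},
    "erp_target": {"SAP_S4HANA", "ORACLE_SCM", "ERPNEXT"},
    "validation_status": {"verified", "auto-generated", "pending-review"},
}


def check_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field, allowed in ALLOWED.items():
        if record.get(field) not in allowed:
            problems.append(f"{field}: unexpected value {record.get(field)!r}")
    # Operation codes should increase along the routing (0010, 0020, ...).
    codes = [int(op["op_code"]) for op in record.get("routing_ops", [])]
    if any(later <= earlier for earlier, later in zip(codes, codes[1:])):
        problems.append("routing_ops: op_code values are not strictly increasing")
    return problems


print(check_record(example_record))
```

A check like this is cheap to run over every record before training and flags rows whose enum values or operation order do not match what the field reference describes.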

How to Use This Dataset

For Model Training

The dataset is structured as instruction-response pairs suitable for fine-tuning transformer-based language models. For our own SLM fine-tune we used QLoRA (Quantised Low-Rank Adaptation), a technique that trains small low-rank adapters on top of a 4-bit quantised base model, which makes it practical to fine-tune models such as LLaMA on a single GPU. The ebom_description, industry, and factory_state fields form the instruction context; the routing_ops, sap_material_type, make_or_buy, and mbom_line_item fields form the expected output.
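
The split between context fields and output fields can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the preprocessing used for the BOMForge fine-tune: the prompt wording, the "name: value" layout, the JSON-encoded response, and the to_instruction_pair helper are all hypothetical, and the sample record is made up from the schema's example values.

```python
import json

# Field groupings as described in the paragraph above.
CONTEXT_FIELDS = ("ebom_description", "industry", "factory_state")
TARGET_FIELDS = ("routing_ops", "sap_material_type", "make_or_buy", "mbom_line_item")


def to_instruction_pair(record):
    """Turn one dataset record into an instruction/response text pair."""
    context = "\n".join(f"{name}: {record[name]}" for name in CONTEXT_FIELDS)
    # Placeholder prompt wording, not the prompt used by the dataset authors.
    instruction = (
        "Convert this engineering BOM line into a manufacturing BOM.\n" + context
    )
    response = json.dumps(
        {name: record[name] for name in TARGET_FIELDS},
        ensure_ascii=False,
        sort_keys=True,
    )
    return {"instruction": instruction, "response": response}


# Made-up sample built from the schema's example values.
sample = {
    "ebom_description": "Shaft, EN24 Steel, Ø32mm × 420mm",
    "industry": "automotive",
    "factory_state": "Punjab",
    "routing_ops": [{"op_code": "0010", "op_description": "CNC Turning"}],
    "sap_material_type": "HALB",
    "make_or_buy": "MAKE",
    "mbom_line_item": {"material": "AS-SHAFT-EN24-001", "plant": "PLANT-01"},
}

pair = to_instruction_pair(sample)
print(pair["instruction"])
print(pair["response"])
```

Serialising the response as JSON keeps the target machine-checkable after generation; a plain-text target would work too, at the cost of harder evaluation.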

For Research

Researchers studying industrial AI, knowledge graph construction, manufacturing process ontologies, or ERP automation will find the schema and multi-industry coverage particularly useful. The dataset spans genuinely heterogeneous process logic — granulation-before-compression in pharma is structurally different from HPDC-before-CNC in automotive — making it a strong benchmark for domain transfer and cross-industry generalisation studies.
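
One simple way to set up such a cross-industry study is a leave-one-industry-out split: train on every industry but one, then evaluate on the held-out one. The sketch below assumes records are plain dictionaries carrying the industry field from the schema; the leave_one_industry_out helper and the toy records are hypothetical.

```python
from collections import defaultdict


def leave_one_industry_out(records):
    """Yield (held_out_industry, train, test) once per industry."""
    by_industry = defaultdict(list)
    for record in records:
        by_industry[record["industry"]].append(record)
    for held_out in sorted(by_industry):
        test = by_industry[held_out]
        train = [
            record
            for name, group in by_industry.items()
            if name != held_out
            for record in group
        ]
        yield held_out, train, test


# Toy records standing in for real dataset rows.
toy = [
    {"record_id": "A1", "industry": "automotive"},
    {"record_id": "A2", "industry": "automotive"},
    {"record_id": "P1", "industry": "pharma"},
    {"record_id": "T1", "industry": "textiles"},
]

for held_out, train, test in leave_one_industry_out(toy):
    print(held_out, len(train), len(test))
```

Reporting accuracy per held-out industry, rather than a single pooled number, shows which process logics transfer and which do not.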

For Benchmarking

The closest comparable public dataset is a 791-pair electronics component resource. At 12,500 pairs, the BOMForge-AI Dataset is roughly 16 times larger and covers nine industries rather than one. We welcome comparisons and hope it becomes a standard benchmark for manufacturing BOM intelligence tasks.

Licence: Creative Commons Attribution 4.0 International (CC BY 4.0)
You are free to share, copy, distribute, and adapt this dataset, including for commercial purposes, provided you give appropriate credit to the BOMForge AI team, link to the original release and the licence, and indicate if changes were made. You do not need to ask for permission. You do not need to share your derivatives under the same terms.

What We Are Building With It

At BOMForge AI, we are training a domain-specific Small Language Model on this dataset: a model that understands manufacturing routing the way a 20-year veteran production engineer does. Not because it pattern-matches keywords, but because it has seen 12,500 curated conversion examples across 9 industries.

Our current fine-tune reached 85% conversion accuracy before GPU memory constraints halted the full training run. We are working to complete a full A100 training run to reach our 95% target. The platform supports direct import and export to SAP S/4HANA, Oracle Cloud SCM, and ERPNext — with the ERPNext integration live today.

We built this dataset ourselves. We are open-sourcing it because we believe the Indian manufacturing ecosystem deserves AI tools built on real Indian manufacturing data — not synthetic approximations, not single-industry proxies.

"We are building the institutional memory of Indian manufacturing. And we are the only ones doing it."

If you train on this data, find errors, extend the schema, or build something interesting — please reach out. This is a living dataset, and the community that grows around it will make it better.
