Open Source Release · Manufacturing AI · BOMForge
We Built the World's Largest eBOM-to-mBOM Dataset. Now It's Yours.
12,500 real manufacturing conversion pairs — collected across 15 Indian factories, 9 industries, and 9 states — released freely for research and training.
"The quality of your data is the ceiling of your model. Not your GPU. Not your team. Not your funding. Your data."
We spent months travelling across India: boarding buses, walking onto factory floors, sitting with production engineers, and learning the unglamorous craft of manufacturing planning. We did this to solve a problem that no AI company in the world had solved: the automatic conversion of Engineering Bills of Materials (eBOM) into Manufacturing Bills of Materials (mBOM).
Today, we are open-sourcing the dataset we built from that journey. Completely free. For anyone to train on, research with, or build upon.
Official Announcement
The BOMForge-AI Dataset — 12,500 curated eBOM-to-mBOM transformation pairs spanning 9 manufacturing industries — is now publicly available under the Creative Commons Attribution 4.0 (CC BY 4.0) licence. It is free for training, research, academic study, and commercial use with attribution.
Why This Dataset Matters
Every physical product that gets manufactured starts with an Engineering BOM, a flat list of components. Before any factory can begin production, a skilled engineer must manually convert that list into a Manufacturing BOM: assigning machines, defining operation sequences (Op 10 → Op 20 → Op 30), deciding what to make versus buy, setting work centres, calculating lead times, and mapping routings. For a mid-size manufacturer handling 150 BOMs a month, this consumes roughly 750 engineer-hours every month (about five hours per BOM), with a 15–20% error rate.
Yet despite this scale, we could find no public dataset for eBOM-to-mBOM conversion. We searched Hugging Face, Kaggle, and the academic literature; the closest public resource was a 791-pair electronics component dataset covering a single industry. Beyond that, the gap was absolute.
The Factories Behind the Data
The 4,606 field-collected pairs in this dataset were gathered from real factory floors, with the cooperation of production engineers who generously shared their domain expertise. Below is an account of the seven primary facilities we visited and the BOM types we studied at each.
| Factory | Location | Industry | BOM Types Studied | Pairs |
|---|---|---|---|---|
| Trident Group | Madhya Pradesh | Textiles | Yarn Construction BOMs, Weaving Process BOMs, Wet Processing routing | 847 |
| Rockman Industries | Ludhiana, Punjab | Automotive | High Pressure Die Casting, CNC machining, surface treatment, assembly routing | 623 |
| JSG Innotech | Gurugram, Haryana | Automotive | Automotive press shop BOMs, wiring harness BOMs | 445 |
| GVL Electro Controls | Maharashtra | Electrical | MCC panels, PCC switchboards, Factory Acceptance Test BOMs | 312 |
| Reve Pharma | Maharashtra | Pharma | Formulation BOMs, granulation routing, blister packing sequences | 234 |
| Kumar Pumps | Andhra Pradesh | Metal Engineering | Foundry pattern BOMs, hydrostatic test routing | 189 |
| GIE Jewels | Jaipur, Rajasthan | Jewellery | Investment casting BOMs, stone setting routing, BIS Hallmarking process | 143 |
Beyond these seven primary sites, data was collected from eight additional facilities covering garments, plastics, personal care, and general metal fabrication across Tamil Nadu, Gujarat, and Karnataka — bringing the total to 15 factories across 9 Indian states.
Supplementary Sources
The field-collected pairs form the irreplaceable core of the dataset. To extend coverage and improve generalisation across SAP material types and ERP structures, we supplemented with:
- SAP S/4HANA reference BOM templates
- Open-source PCB and electronics hardware BOMs
- IIT Bombay and NIT Trichy manufacturing research datasets
- ERPNext and Odoo demo system data
All supplementary data was cleaned, structured to a unified schema, and validated against real-world routing conventions before inclusion.
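As a rough illustration of what a routing-convention check might look like, here is a minimal sketch. The specific rules (four-digit operation codes in strictly ascending order, positive standard times) are our assumptions for the example, not the full validation logic used on the dataset. The field names op_code, op_description, and std_time_min come from the schema reference in this post; the function name check_routing is hypothetical.

```python
def check_routing(routing_ops):
    """Return a list of problems found in one routing; an empty list means it passed.

    Assumed conventions (illustrative only):
      * op_code is a four-digit string such as "0010"
      * op codes are strictly ascending within a routing
      * std_time_min is a positive number
    """
    problems = []
    previous = -1
    for position, op in enumerate(routing_ops):
        code = op.get("op_code", "")
        is_four_digits = (
            isinstance(code, str)
            and len(code) == 4
            and code.isascii()
            and code.isdigit()
        )
        if not is_four_digits:
            problems.append(f"op {position}: op_code {code!r} is not four digits")
            continue
        number = int(code)
        if number <= previous:
            problems.append(f"op {position}: op_code {code} is out of sequence")
        previous = number
        if op.get("std_time_min", 0) <= 0:
            problems.append(f"op {position}: std_time_min must be positive")
    return problems


# Two made-up routings: one well-formed, one with two deliberate faults.
good = [
    {"op_code": "0010", "op_description": "HPDC Casting", "std_time_min": 12.5},
    {"op_code": "0020", "op_description": "CNC Turning", "std_time_min": 45.0},
]
bad = [
    {"op_code": "0020", "op_description": "CNC Turning", "std_time_min": 45.0},
    {"op_code": "0010", "op_description": "HPDC Casting", "std_time_min": 0},
]

print(check_routing(good))
print(check_routing(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to review a whole batch of supplementary records in a single pass.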
Dataset Schema — Full Field Reference
Each record in the dataset is a structured JSON object representing one complete eBOM-to-mBOM transformation pair. The schema was designed for direct compatibility with SAP S/4HANA, Oracle Cloud SCM, and ERPNext.
| Field Name | Type | Description | Example Values |
|---|---|---|---|
| record_id | string | Unique identifier for each eBOM–mBOM pair | BF-00001, BF-04606 |
| source_type | enum | Whether the pair is field-collected (field) or supplementary (synthetic, reference) | field, synthetic, reference |
| industry | string | Manufacturing sector of the source factory | automotive, pharma, textiles, jewellery |
| factory_state | string | Indian state where the factory is located | Punjab, Maharashtra, Rajasthan |
| ebom_item_code | string | Engineering BOM line item identifier | AS-SHAFT-EN24-001 |
| ebom_description | string | Full text description of the component as it appears in the design BOM | Shaft, EN24 Steel, Ø32mm × 420mm |
| ebom_quantity | float | Base quantity per assembly unit | 1.0, 4.0, 0.025 |
| ebom_uom | string | Unit of measure in the engineering BOM | EA, KG, MTR, LTR, SET |
| sap_material_type | enum | SAP material classification assigned by the model | FERT, HALB, ROH, VERP |
| make_or_buy | enum | Procurement decision for this item | MAKE, BUY, SUBCONTRACT |
| routing_ops | array | Ordered list of manufacturing operations (SAP routing format) | [Op 10, Op 20, Op 30, ...] |
| op_code | string | SAP standard operation code (within each routing_ops entry) | 0010, 0020, 0030 |
| op_description | string | Plain-text name of the manufacturing operation | HPDC Casting, CNC Turning, Shot Blasting |
| work_centre | string | Machine or station assigned to the operation | CNC-LATHE-01, CMM-INSP, PAINT-LINE-B |
| std_time_min | float | Standard time in minutes for the operation | 12.5, 45.0, 3.0 |
| lead_time_days | integer | Procurement or production lead time in calendar days | 3, 14, 45 |
| scrap_percentage | float | Expected scrappage allowance for this operation (%) | 2.5, 0.5, 8.0 |
| component_group | string | Logical kit grouping assigned during intelligent grouping (Model 2) | FASTENER-KIT-A, SEAL-ASSEMBLY-01 |
| mbom_line_item | object | Complete mBOM output record — the target for model training | { material, plant, routing, BOM_usage } |
| erp_target | enum | ERP system this record was validated against | SAP_S4HANA, ORACLE_SCM, ERPNEXT |
| validation_status | enum | Human-reviewed quality flag | verified, auto-generated, pending-review |
| annotator_notes | string | Free-text comments from the production engineer who reviewed the pair | "Shaft goes to cylindrical grinding, not painting" |
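To make the field reference concrete, here is a sketch of what a single record could look like once loaded in Python. The values are stitched together from the example column above and are illustrative, not an actual row from the dataset. Several details are assumptions on our part: the nesting of op_description, work_centre, std_time_min, and scrap_percentage inside each routing_ops entry, the work centre name CYL-GRIND-01, the placeholder values inside mbom_line_item, and the use of None for an item that belongs to no kit.

```python
import json

# Allowed values for the enum fields, copied from the field reference table.
ALLOWED = {
    "source_type": {"field", "synthetic", "reference"},
    "sap_material_type": {"FERT", "HALB", "ROH", "VERP"},
    "make_or_buy": {"MAKE", "BUY", "SUBCONTRACT"},
    "erp_target": {"SAP_S4HANA", "ORACLE_SCM", "ERPNEXT"},
    "validation_status": {"verified", "auto-generated", "pending-review"},
}

# Illustrative record; not a real row from the dataset.
record = {
    "record_id": "BF-00001",
    "source_type": "field",
    "industry": "automotive",
    "factory_state": "Punjab",
    "ebom_item_code": "AS-SHAFT-EN24-001",
    "ebom_description": "Shaft, EN24 Steel, Ø32mm × 420mm",
    "ebom_quantity": 1.0,
    "ebom_uom": "EA",
    "sap_material_type": "HALB",
    "make_or_buy": "MAKE",
    "routing_ops": [
        {"op_code": "0010", "op_description": "CNC Turning",
         "work_centre": "CNC-LATHE-01", "std_time_min": 12.5, "scrap_percentage": 2.5},
        {"op_code": "0020", "op_description": "Cylindrical Grinding",
         "work_centre": "CYL-GRIND-01", "std_time_min": 45.0, "scrap_percentage": 0.5},
        {"op_code": "0030", "op_description": "CMM Inspection",
         "work_centre": "CMM-INSP", "std_time_min": 3.0, "scrap_percentage": 0.5},
    ],
    "lead_time_days": 14,
    "component_group": None,  # assumption: no kit grouping for this item
    "mbom_line_item": {       # placeholder values, for shape only
        "material": "AS-SHAFT-EN24-001",
        "plant": "1000",
        "routing": ["0010", "0020", "0030"],
        "BOM_usage": "1",
    },
    "erp_target": "SAP_S4HANA",
    "validation_status": "verified",
    "annotator_notes": "Shaft goes to cylindrical grinding, not painting",
}


def invalid_enum_fields(rec):
    """Return the names of enum fields whose value is not in the allowed set."""
    return [name for name, allowed in ALLOWED.items() if rec.get(name) not in allowed]


print(invalid_enum_fields(record))
print(json.dumps(record, ensure_ascii=False, indent=2)[:200])
```

Passing ensure_ascii=False keeps characters such as Ø and × readable when the record is written back out as JSON.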
How to Use This Dataset
For Model Training
The dataset is structured as instruction-response pairs suitable for fine-tuning transformer-based language models. We used QLoRA (Quantised Low-Rank Adaptation) for our own SLM fine-tune. QLoRA, introduced by researchers at the University of Washington, trains small low-rank adapters on top of a 4-bit quantised base model, which makes fine-tuning feasible on a single GPU. The ebom_description, industry, and factory_state fields form the instruction context; the routing_ops, sap_material_type, make_or_buy, and mbom_line_item fields form the expected output.
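One way to turn a record into an instruction-response pair is sketched below. The split between input and output fields follows the paragraph above. The prompt wording, the instruction and response key names, and the helper name to_training_pair are our own placeholders, so adapt them to whatever template your fine-tuning framework expects.

```python
import json

# Field split as described in this post.
INPUT_FIELDS = ("ebom_description", "industry", "factory_state")
OUTPUT_FIELDS = ("routing_ops", "sap_material_type", "make_or_buy", "mbom_line_item")


def to_training_pair(record):
    """Build one instruction-response pair from a dataset record.

    The prompt text is a placeholder; only the field split comes from the post.
    """
    context = "\n".join(f"{name}: {record[name]}" for name in INPUT_FIELDS)
    instruction = (
        "Convert this engineering BOM line into a manufacturing BOM.\n" + context
    )
    # Serialise the target fields as JSON so the model learns a parseable format.
    response = json.dumps(
        {name: record[name] for name in OUTPUT_FIELDS},
        ensure_ascii=False,
        sort_keys=True,
    )
    return {"instruction": instruction, "response": response}


# Made-up minimal record carrying only the fields the helper reads.
example = {
    "ebom_description": "Tablet blend, 500 mg",
    "industry": "pharma",
    "factory_state": "Maharashtra",
    "routing_ops": [{"op_code": "0010", "op_description": "Granulation"}],
    "sap_material_type": "HALB",
    "make_or_buy": "MAKE",
    "mbom_line_item": {"material": "TB-500", "plant": "1000"},
}

pair = to_training_pair(example)
print(pair["instruction"])
print(pair["response"])
```

Emitting the response as JSON with sorted keys gives every training target the same field order, which tends to make model outputs easier to parse and score.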
For Research
Researchers studying industrial AI, knowledge graph construction, manufacturing process ontologies, or ERP automation will find the schema and multi-industry coverage particularly useful. The dataset spans genuinely heterogeneous process logic — granulation-before-compression in pharma is structurally different from HPDC-before-CNC in automotive — making it a strong benchmark for domain transfer and cross-industry generalisation studies.
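For cross-industry generalisation studies, a leave-one-industry-out split is a natural starting point: train on eight industries, evaluate on the ninth, and rotate. The sketch below shows the idea using only the industry field from the schema. The function name and the toy records are hypothetical, and this is not an official benchmark split.

```python
from collections import defaultdict


def leave_one_industry_out(records):
    """Yield (held_out_industry, train_records, test_records) for each industry."""
    by_industry = defaultdict(list)
    for record in records:
        by_industry[record["industry"]].append(record)
    # Sort so that the order of the splits is reproducible.
    for held_out in sorted(by_industry):
        train = [
            record
            for name, group in by_industry.items()
            if name != held_out
            for record in group
        ]
        yield held_out, train, by_industry[held_out]


# Toy records standing in for real dataset rows.
toy = [
    {"record_id": "A1", "industry": "automotive"},
    {"record_id": "A2", "industry": "automotive"},
    {"record_id": "P1", "industry": "pharma"},
]

splits = list(leave_one_industry_out(toy))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Because the held-out industry never appears in training, accuracy on the test side measures how well routing logic learned in one sector transfers to another.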
For Benchmarking
The closest comparable public dataset is a 791-pair electronics component resource. At 12,500 pairs, the BOMForge-AI Dataset is nearly 16 times larger and covers nine industries rather than one. We welcome comparisons and hope it becomes a standard benchmark for manufacturing BOM intelligence tasks.
You are free to share, copy, distribute, and adapt this dataset — including for commercial purposes — provided you give appropriate credit to the BOMForge AI team and link to the original release. You do not need to ask for permission. You do not need to share your derivatives under the same terms.
What We Are Building With It
At BOMForge AI, we are training a domain-specific Small Language Model on this dataset: a model that understands manufacturing routing the way a 20-year veteran production engineer does. Not because it pattern-matches keywords, but because it has seen 12,500 conversion examples across 9 industries, 4,606 of them collected directly on factory floors.
Our current fine-tune reached 85% conversion accuracy before GPU memory constraints halted the full training run. We are working to complete a full A100 training run to reach our 95% target. The platform supports direct import and export to SAP S/4HANA, Oracle Cloud SCM, and ERPNext — with the ERPNext integration live today.
We built this dataset ourselves. We are open-sourcing it because we believe the Indian manufacturing ecosystem deserves AI tools built on real Indian manufacturing data — not synthetic approximations, not single-industry proxies.
If you train on this data, find errors, extend the schema, or build something interesting — please reach out. This is a living dataset, and the community that grows around it will make it better.