Every computational chemist or drug discovery researcher knows the feeling: you find the perfect paper or patent, spot the key molecular structures buried in the figures, and then resign yourself to redrawing them by hand.
It’s a quiet but universal frustration. Collectively we’ve built powerful modeling tools, sophisticated AI systems, and highly automated drug discovery pipelines, yet we still rely on screenshots and sketchpads to reconstruct molecules from PDFs.
At Deep Origin, we believe the next generation of drug discovery tools shouldn’t start with anonymous training data. They should start with your lived scientific context—the patents, papers, and presentations that already hold the molecules you care about. Data you’ve been making note of for years in talks, memos, competitor patents, and the research of others. Data that’s largely locked into PDFs.
That’s precisely why we’ve created DO Patent, our tool for extracting full-molecule data from PDFs instantly with over 98% accuracy (more on this later).
The invisible bottleneck in every discovery project
For every medicinal or computational chemist, early-stage work begins not with modeling or simulation—but with curation.
Before you can explore binding pockets or predict properties, you need a starting dataset: a clean, structured list of molecules relevant to your target or mechanism. And most of that data still lives inside documents—patents, publications, notes—rendered as static images.
Creating that dataset is slow, manual work. Even with text mining or OCR, chemical structure diagrams rarely survive the extraction process intact. Molecules that differ only by a single substituent are often mislabeled or ignored. What should be a one-hour task turns into several days of redrawing and checking.
What DO Patent does
DO Patent lets scientists turn any PDF—patent, publication, or presentation—into a collection of editable, exportable molecular structures.
Upload a file, and the tool automatically identifies chemical diagrams, extracts them as SMILES strings, and assigns each a confidence score reflecting how certain the AI is about its interpretation.
If a molecule falls below your chosen confidence threshold, it’s flagged for manual review. Click the molecule, and DO Patent jumps to the precise spot in the source document so you can quickly verify or correct it using an integrated molecular editor.
That’s it. No manual redrawing, no hunting through pages of figures, no ambiguity about where a structure came from. When you’re done, you can export your curated molecules directly for use in Balto, our conversational molecular modeling assistant, or any other tool that supports SMILES input.
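For readers who script their downstream steps, the review-and-export flow described above can be sketched in a few lines of Python. This is a minimal illustration, not DO Patent's API: the `ExtractedMolecule` record, its field names, the naming scheme in the export, and the 0.90 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedMolecule:
    """Hypothetical record for one extracted structure (not a DO Patent type)."""
    smiles: str
    confidence: float  # assumed to lie in [0, 1]
    page: int          # page of the source PDF the diagram came from


def triage(molecules, threshold=0.90):
    """Split molecules into (auto_accepted, needs_review) by confidence."""
    accepted = [m for m in molecules if m.confidence >= threshold]
    flagged = [m for m in molecules if m.confidence < threshold]
    return accepted, flagged


def to_smi_lines(molecules):
    """Format molecules as 'SMILES<TAB>name' lines, a common .smi layout."""
    return [f"{m.smiles}\tpage{m.page}_mol{i}"
            for i, m in enumerate(molecules, 1)]


molecules = [
    ExtractedMolecule("CC(=O)Oc1ccccc1C(=O)O", 0.99, page=3),         # aspirin
    ExtractedMolecule("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", 0.72, page=5),  # caffeine
]
accepted, flagged = triage(molecules)
smi_lines = to_smi_lines(accepted)
```

Here the aspirin record clears the threshold and is exported, while the caffeine record lands in `flagged` together with its page number, so a reviewer knows where to look in the source document.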
A tool built for real scientific use
We didn’t build DO Patent as a novelty. We built it for the moments when chemists, biologists, and data scientists need fast, reproducible access to the chemical content scattered across the world’s literature.
From early dataset construction to competitive intelligence and IP analysis, DO Patent fits the workflows scientists already use:
- Computational chemists can assemble datasets for model training or screening in hours instead of days.
- Medicinal chemists can review competitors’ filings or supplement internal compound libraries.
- Biologists can capture relevant chemical matter from presentations or posters and hand it directly to their collaborators.
- Academic researchers can mine open literature without running into the download limits imposed by proprietary databases.
Everything runs in the browser. There’s no installation, no setup, and no coding required. Just drag and drop your PDFs.
How it works under the hood
While many text-based patent parsers rely on line-drawing heuristics or simply match patents to a database, DO Patent uses proprietary ML trained on diverse real-world document formats. It recognizes molecular diagrams in context, interprets bond topology, and cross-checks extracted structures against chemical syntax rules before outputting SMILES.
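DO Patent's own checks are proprietary and are not described here. As a rough illustration of what a syntax-level check on SMILES can catch, the standard-library sketch below rejects strings with unbalanced branches, unclosed bracket atoms, or unpaired ring-closure labels. The function name is hypothetical, and a production pipeline would rely on a full cheminformatics toolkit to check valence and aromaticity as well.

```python
def has_plausible_smiles_syntax(smiles: str) -> bool:
    """Cheap structural sanity check; it does NOT validate chemistry."""
    if not smiles:
        return False
    depth = 0           # current nesting depth of '(' branches
    open_rings = set()  # ring-closure labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip the whole bracket atom, e.g. [NH4+] or [C@@H];
            # digits inside it are isotopes or counts, not ring closures.
            end = smiles.find("]", i)
            if end == -1:
                return False
            i = end + 1
            continue
        if ch == "]":
            return False  # closing bracket with no opener
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "%":
            # Two-digit ring closure such as %12
            label = smiles[i + 1:i + 3]
            if len(label) != 2 or not label.isdigit():
                return False
            open_rings ^= {label}
            i += 3
            continue
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: first sight opens, second closes
        i += 1
    return depth == 0 and not open_rings
```

A check like this is only a first filter: it would accept a string such as `C(C)(C)(C)(C)C`, which is syntactically fine but gives carbon five bonds.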
Each extracted molecule receives a confidence score based on internal consistency and prediction uncertainty. These scores aren’t cosmetic—they determine which molecules are auto-accepted and which require a quick human glance.
Transparency is a core part of our design philosophy. Every extraction shows you both the predicted molecule and its source figure side-by-side, so you can make an informed call about accuracy in the event a molecule is flagged. Once you feel good about flagged molecules, simply export as SMILES to continue your discovery.
Validation: how well does it really work?
To understand how reliable DO Patent really is, we turned to one of our most experienced chemists. Over the course of 100 hours, he manually benchmarked DO Patent against real-world patents for marketed drugs from all major pharmaceutical companies.
Each molecule was reviewed bond by bond. If even a single atom or bond was incorrect, the molecule was marked as a failed extraction.
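To make the all-or-nothing rule concrete, here is a small sketch of how such a pass/fail score could be computed. The representation is an assumption for illustration only: each molecule is a pair of (atoms, bonds) with atom indices already aligned between the reference and the extraction, a matching step that the manual review handled by eye.

```python
def molecule_is_correct(reference, extracted) -> bool:
    """Pass only if every atom and every bond matches exactly."""
    ref_atoms, ref_bonds = reference
    ext_atoms, ext_bonds = extracted
    return ref_atoms == ext_atoms and ref_bonds == ext_bonds


def strict_accuracy(pairs) -> float:
    """Fraction of (reference, extracted) pairs that match exactly."""
    if not pairs:
        raise ValueError("need at least one molecule to score")
    passed = sum(molecule_is_correct(ref, ext) for ref, ext in pairs)
    return passed / len(pairs)


# Atoms: {index: element}. Bonds: {frozenset({i, j}): bond order}.
ethanol = ({0: "C", 1: "C", 2: "O"},
           {frozenset({0, 1}): 1, frozenset({1, 2}): 1})
# Same atoms, but the C-O bond was misread as a double bond.
misread = ({0: "C", 1: "C", 2: "O"},
           {frozenset({0, 1}): 1, frozenset({1, 2}): 2})

score = strict_accuracy([(ethanol, ethanol), (ethanol, misread)])
```

With one exact match and one molecule that differs by a single bond order, `score` is 0.5: the misread structure earns no partial credit.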
Across a benchmark of more than 30 patents and thousands of individual structures, over 99% of full-molecule structural elements were correctly extracted.
What took a trained chemist 100 hours to validate could have been completed by DO Patent in minutes, at a cost of roughly $270 for full extraction. Redrawing those same molecules manually would take far longer.
That combination of accuracy and efficiency is what makes DO Patent more than a convenience feature—it’s a force multiplier for chemists and data scientists who need verified molecular data fast.
You can view the full dataset here.
Why it matters
Every machine learning model in drug discovery starts with data. But the process of creating and cleaning that data is often invisible—the kind of scientific labor that never makes it into the paper or the grant proposal.
By automating the extraction of chemical structures directly from source documents, DO Patent doesn’t just save time. It liberates knowledge that already exists in the literature but isn’t machine-readable.
That means faster hypothesis testing, richer internal databases, and fewer blind spots when exploring chemical space. It also means that the data scientists, modelers, and chemists using Balto or other Deep Origin tools can start from the same foundation of verified, transparent molecular data.
A few design choices worth mentioning
- Edit in place – Correct or refine structures directly in the interface; no need for external editors.
- Bulk processing – Upload dozens of PDFs at once to build large-scale datasets.
- Reference tracking – Each molecule is automatically linked to its originating figure and page for full traceability.
- Transparent AI – Confidence scores surface uncertainty rather than hiding it, giving users control.
- Privacy by design – Your uploaded documents are processed securely and never shared or used for model retraining.
Built to be accessible
We want every scientist—from small biotech founders to graduate students—to have access to data extraction tools powerful enough for professional work.
That’s why DO Patent includes a free monthly usage quota covering up to 50 pages of extraction. After that, standard pricing is $0.10 per page, with an academic rate of $0.06 per page for users registering with a .edu email address.
No subscriptions. No hidden limits. Just pay for what you process.
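If you want to budget a project up front, the pricing above reduces to simple arithmetic. The sketch below works in whole cents to avoid floating-point rounding. It assumes the 50 free pages are deducted first on both the standard and academic tiers, which is our reading for this example rather than a statement of billing policy, and the function name is hypothetical.

```python
FREE_PAGES_PER_MONTH = 50
STANDARD_CENTS_PER_PAGE = 10  # $0.10 per page
ACADEMIC_CENTS_PER_PAGE = 6   # $0.06 per page


def monthly_cost_cents(pages: int, academic: bool = False) -> int:
    """Estimated monthly charge in cents for a given page count."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    # Assumption: the free quota is subtracted before any billing.
    billable = max(0, pages - FREE_PAGES_PER_MONTH)
    rate = ACADEMIC_CENTS_PER_PAGE if academic else STANDARD_CENTS_PER_PAGE
    return billable * rate


small_job = monthly_cost_cents(40)                     # within the free quota
standard_job = monthly_cost_cents(550)                 # 500 billable pages
academic_job = monthly_cost_cents(550, academic=True)  # same pages, .edu rate
```

Under these assumptions, 40 pages cost nothing, 550 pages cost $50.00 at the standard rate, and the same 550 pages cost $30.00 at the academic rate.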
From documents to discovery
When we launched Balto, our conversational molecular modeling assistant, we set out to make advanced simulation and structure prediction accessible to everyone. DO Patent extends that same philosophy upstream—helping scientists start their modeling workflows with clean, verified molecular data.
Together, these tools close the loop between what’s known and what’s possible: extracting structures from literature, exploring their properties, and designing the next generation of molecules—all within the same ecosystem.
The future of drug discovery won’t belong to those with the biggest databases, but to those who can make the best use of the information already in front of them.
With DO Patent, that future starts one PDF at a time.
Try DO Patent for free today 💪
