Making AI-generated code more accurate in any language

A novel method helps steer large language models toward producing output that follows the syntax and structure of programming or other formal languages.

Developers increasingly rely on large language models (LLMs) to streamline code generation. But that efficiency comes at a cost if the generated code fails to run correctly or deviates from the rules of the target programming language.

While some existing strategies aim to enforce language compliance, they often either compromise the semantic intent or demand excessive computing power, especially for more intricate tasks.

Researchers from MIT and collaborators have proposed a new method that guides LLMs to create output that is both valid in structure and correct in meaning—without the overhead of trial and error. Their system uses probabilistic weighting to prioritize the most promising partial outputs and discard weaker ones early, optimizing both performance and resource use.

Thanks to this approach, smaller LLMs were able to outperform significantly larger ones across several applied scenarios, including writing queries in SQL, planning robotic actions, and designing molecular structures.

This innovation could be especially beneficial for non-programmers. With it, users might issue natural language commands—like asking for a database report—and receive accurate, code-based responses tailored to their request.

“This isn’t just a theoretical advancement,” says MIT graduate student João Loula, one of the lead authors. “It has practical potential to make tools like programming assistants and AI for scientific research much more reliable.”

Loula co-authored the study with Benjamin LeBrun from the Mila-Quebec AI Institute and Li Du from Johns Hopkins University. Senior contributors include MIT’s Vikash Mansinghka, Yale’s Alexander K. Lew, ETH Zurich’s Tim Vieira, and Timothy J. O’Donnell of McGill University and Mila. The team will present their findings at the International Conference on Learning Representations.

Guiding generation with structure and intent
Traditional methods of ensuring structured LLM output often involve post-generation validation, such as running code to confirm it’s error-free. If the output fails validation, the user must start over—an inefficient loop.
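
To make the inefficiency concrete, the sketch below shows that generate-then-validate loop in Python. The helpers `generate_code` and `is_valid` are hypothetical stand-ins for an LLM call and a syntax or execution check; this illustrates the pattern the researchers are contrasting against, not code from the study.

```python
# Sketch of the traditional generate-then-validate loop described above.
# `generate_code` and `is_valid` are assumed helpers (an LLM call and a
# syntax/execution check); they are not part of the researchers' system.
from typing import Callable, Optional


def generate_until_valid(
    prompt: str,
    generate_code: Callable[[str], str],   # one full LLM generation per call
    is_valid: Callable[[str], bool],       # e.g. try to parse or run the output
    max_attempts: int = 10,
) -> Optional[str]:
    """Regenerate from scratch until the output validates: the inefficient loop."""
    for _ in range(max_attempts):
        candidate = generate_code(prompt)
        if is_valid(candidate):
            return candidate               # every earlier attempt was wasted work
    return None                            # give up after exhausting the budget
```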

Alternatively, developers might check output incrementally, correcting mistakes along the way. But this often disrupts the original intention behind the prompt, causing a mismatch between what the user meant and what the code ends up doing.
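
A common way to implement this incremental checking is token-level constrained decoding: at each step, candidate tokens that would break the target language's grammar are filtered out before sampling. The sketch below assumes hypothetical `next_token_probs` and `prefix_is_legal` helpers and is not the researchers' method; it also shows where the drift can come from, since the model may be forced onto a legal but low-probability token that no longer matches the prompt.

```python
# Illustrative token-by-token constrained decoding. `next_token_probs` and
# `prefix_is_legal` are assumed helpers (an LLM's next-token distribution and
# an incremental grammar check); this is not the method from the paper.
import random
from typing import Callable, Dict


def constrained_decode(
    prompt: str,
    next_token_probs: Callable[[str], Dict[str, float]],  # token -> probability
    prefix_is_legal: Callable[[str], bool],               # grammar check on a prefix
    max_tokens: int = 64,
) -> str:
    output = ""
    for _ in range(max_tokens):
        probs = next_token_probs(prompt + output)
        # Keep only tokens whose continuation is still grammatical.
        legal = {t: p for t, p in probs.items() if prefix_is_legal(output + t)}
        if not legal:
            break
        # Renormalize and sample. The model can be pushed onto a legal but
        # unlikely token here, which is how the output drifts from the intent.
        tokens = list(legal)
        total = sum(legal.values())
        output += random.choices(tokens, [legal[t] / total for t in tokens])[0]
    return output
```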

“Checking for structure is straightforward, but preserving meaning is much more complex,” Loula explains. “You can instantly verify syntax, but ensuring a piece of code does what the user wants requires running it.”

The researchers designed a system that injects expert-like knowledge into the generation process, nudging the LLM to produce results that align with both structural constraints and user intent.

Rather than training the LLM from scratch, their method supplements the model’s capabilities by integrating expert heuristics. This fusion approach offers a more scalable path than pure deep learning.

Their technique, built on sequential Monte Carlo sampling, enables multiple generation paths to proceed in parallel. Each path is evaluated and given a weight based on how likely it is to produce a valid, semantically accurate result.

At each generation step, the model concentrates on the highest-weighted outputs and abandons those less likely to succeed. This resembles having a knowledgeable assistant guide the model’s decision-making process at every stage.
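
In spirit, the process works like a particle filter over partial generations. The following is a minimal sketch of such a sequential Monte Carlo loop, assuming hypothetical `propose_extension` (one LLM continuation step) and `constraint_weight` (an expert score for how structurally and semantically promising a partial output is) functions; it illustrates the general idea rather than the team's implementation.

```python
# Minimal illustrative sequential Monte Carlo loop over partial generations.
# `propose_extension` and `constraint_weight` are assumed helpers; this is a
# sketch of the general technique, not the researchers' code.
import random
from typing import Callable


def smc_generate(
    prompt: str,
    propose_extension: Callable[[str], str],     # one LLM continuation step
    constraint_weight: Callable[[str], float],   # how promising this partial output is
    num_particles: int = 8,
    num_steps: int = 32,
) -> str:
    particles = [""] * num_particles
    weights = [1.0] * num_particles

    for _ in range(num_steps):
        # Extend every partial generation in parallel and reweight it by how
        # likely it is to lead to a valid, meaning-preserving result.
        for i, partial in enumerate(particles):
            particles[i] = partial + propose_extension(prompt + partial)
            weights[i] *= constraint_weight(particles[i])

        # Resample: concentrate computation on high-weight candidates and
        # abandon those unlikely to succeed.
        total = sum(weights)
        if total == 0:
            break
        particles = random.choices(particles, [w / total for w in weights], k=num_particles)
        weights = [1.0] * num_particles

    # Return the candidate the constraints score highest.
    return max(particles, key=constraint_weight)
```

In this framing, the rules and intended meaning that users supply (see the next paragraph) would enter only through the weighting function, while the LLM itself drives the proposals.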

Users simply provide the rules and intended meaning. The architecture takes over from there, steering generation toward the optimal result.

“We’ve handled the math so users don’t have to,” Loula says. “Whatever constraints you need, the model can factor them in and converge on the correct output.”

Small models, big results
The researchers applied their technique across various tasks, including generating Python code, SQL statements, molecular diagrams, and robotic plans.

Despite working with compact, open-source LLMs, the team achieved results that outperformed much larger, specialized models. For instance, on Python generation tasks, a lightweight model using their architecture surpassed a commercial system more than twice its size.

“This shows just how far we can push smaller models with the right strategy,” Loula notes.

Looking ahead, the team plans to expand their method to handle larger blocks of generated content, rather than working step by step. They also aim to integrate learning so that models improve over time as they produce more accurate outputs.

Eventually, this could open new possibilities for non-technical users. Imagine systems for automating data modeling or querying generative databases with plain English instructions.

Mansinghka suggests it could even support systems that interpret user questions while modeling data meaningfully—essentially making AI systems more conversational and semantically aware.

“One of the biggest challenges in linguistics is grounding language in real-world meaning,” says O’Donnell. “LLMs predict likely text, but that’s not the same as understanding. Our work shows it’s possible—at least in narrow domains—to map language to grounded meanings. It’s a small but critical step toward more human-like machine communication.”
