Try to play Sudoku with a large language model, and you’ll quickly see where AI reasoning breaks down. These complex systems can easily verify a solution you’ve entered, but they struggle to fill in the nine-by-nine grid themselves while obeying the puzzle’s strict rules.
This struggle with constraints, whether it’s designing molecules, writing math proofs, or planning a travel itinerary, is common across language models. Even the largest systems need considerable time and computing power to handle these open-ended, rule-heavy requests reliably. A team from MIT’s Computer Science and Artificial Intelligence Laboratory has developed a collaborative approach that sidesteps the problem entirely.
The new method, called DisCIPL (short for “Distributional Constraints by Inference Programming with Language Models”), uses a powerful LLM as a planner, or “boss,” that instructs multiple smaller “follower” models on how to proceed. This collective approach helps the small models produce more accurate responses than leading systems like OpenAI’s GPT-4o, at a fraction of the cost of large reasoning models such as o1.
Delegating the Details to Small, Cheap AIs
The framework works much like contracting a company for a particular job. The “boss” model, GPT-4o in the initial experiments, receives the complex request and works out a strategic plan. It then translates that plan into a clear set of instructions for the smaller models, written in a specialized programming language called LLaMPPL. Such a program can encode rules like “write eight lines of poetry where each line has exactly eight words,” cueing the smaller models to contribute their parts to the final answer.
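To make that concrete, here is a minimal Python sketch of the kind of program the planner might write. This is not the real LLaMPPL API: the toy `propose_line` function stands in for a small follower model, and the program enforces the eight-by-eight rule by rejecting and resampling any candidate line that breaks it.

```python
import random

# Conceptual sketch of a planner-written constraint program (not the real
# LLaMPPL API). The "follower" here is a toy stand-in for a small language
# model: it proposes candidate lines, and the program enforces "exactly
# eight lines, exactly eight words per line" by rejecting violations.

WORDS = "the a quiet river light stone morning bird wind cold".split()

def propose_line():
    """Hypothetical follower call: draw a candidate line of 5-10 words."""
    return " ".join(random.choices(WORDS, k=random.randint(5, 10)))

def eight_by_eight_poem():
    lines = []
    while len(lines) < 8:                  # constraint: exactly 8 lines
        candidate = propose_line()
        if len(candidate.split()) == 8:    # constraint: 8 words per line
            lines.append(candidate)        # accept; otherwise resample
    return "\n".join(lines)

print(eight_by_eight_poem())
```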
The boss can even correct a follower’s output where needed, perhaps replacing one model’s phrasing that doesn’t fit the poem with a better option from another. Much of the efficiency stems from the follower models, small Llama-3.2-1B systems developed by Meta. These tiny models are 1,000 to 10,000 times cheaper per token than comparable reasoning models, which allowed the researchers to run dozens of them in parallel for a fraction of the cost.
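That correction step works like the resampling move in sequential Monte Carlo, the style of inference LLaMPPL programs are built around. Here is a toy sketch under that assumption: each “particle” is one follower’s partial draft, drafts that break the constraint get weight zero, and surviving drafts are copied into the freed-up slots. The function names and the simple 0-or-1 scoring rule are illustrative, not taken from the actual system.

```python
import random

# Toy particle-resampling step: drafts are reweighted by a constraint
# check, and low-weight drafts are replaced with copies of better ones.

def resample(drafts, score):
    weights = [score(d) for d in drafts]
    if sum(weights) == 0:
        return drafts                        # nothing survives; keep all
    # refill every slot, drawing drafts in proportion to their weights
    return random.choices(drafts, weights=weights, k=len(drafts))

# Usage: keep only partial lines that still have at most 8 words.
drafts = ["the quiet river", "a a a a a a a a a a", "cold wind"]
fits = lambda d: 1.0 if len(d.split()) <= 8 else 0.0
print(resample(drafts, fits))  # the 10-word draft gets replaced
```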
In testing, DisCIPL produced 40.1 percent shorter reasoning and an impressive 80.2 percent cost savings over OpenAI’s o1. The division of labor delivers major financial benefits that will appeal to any budget-conscious developer.
“We are working toward improving LMs’ inference efficiency, particularly on the many modern applications of these models that involve generating outputs subject to constraints.” – Gabriel Grand, MIT PhD student and CSAIL researcher
Outperforming the Giants on Real-World Puzzles
MIT’s team put the system through rigorous testing, including real-world tasks like compiling ingredient lists, planning travel itineraries, and writing grant proposals with strict word limits. Where GPT-4o struggled with these requests, the DisCIPL framework shined.
In one writing test, the models were given a highly specific prompt: create a sentence of exactly 18 words, where the fourth word must be “Glasgow,” the eighth “in,” and the 11th “and.” Watching the system generate a string of text that perfectly met these tight, technical rules made it clear the collective approach was working. DisCIPL crafted coherent outputs while achieving accuracy similar to o1’s.
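The article doesn’t show the program the planner produced for this test, but the constraint itself is easy to pin down in code. This illustrative Python check, with a sentence invented for the example, is what any generated output would have to pass:

```python
# Illustrative check for the test's constraint: exactly 18 words, with
# "Glasgow", "in", and "and" at positions 4, 8, and 11 (indices 3, 7, 10).

def satisfies(sentence: str) -> bool:
    words = sentence.split()
    return (
        len(words) == 18
        and words[3] == "Glasgow"
        and words[7] == "in"
        and words[10] == "and"
    )

example = ("We finally reached Glasgow after two days in heavy rain "
           "and we found the old station completely empty")
print(satisfies(example))  # True
```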
Lead author Gabriel Grand says that DisCIPL enables language models to guide one another toward better answers, improving the overall speed and efficiency of the system. He emphasized the long-term goal: society needs models that can provide accurate answers while using less energy, a pressing concern as heavier use drives up the amount of energy language models consume.
Senior author Jacob Andreas sees broader promise in the ability to “auto-formalize” text generation itself, or represent it with code. The team now plans to expand the framework to tackle even harder problems, like mathematical reasoning, and to test its ability to meet users’ fuzzier preferences.
“The ability to auto-formalize text generation itself promises the same kinds of efficiency gains and guarantees we’ve seen in fields like math and robotics.” – Jacob Andreas, senior author
This work suggests an exciting counterpoint to the common idea that only massive models can handle complex tasks. If the problem is all about constraints, a coordinated crowd of smaller models may be the most practical way to keep an AI honest.