How to make small models punch way above their weight with DeepClause (and Prolog!)

Juni 16, 2026

tldr; with the DeepClause harness it’s possible increase benchmark performance of small models such as qwen3.6-35b-a3b by as much as 90%! If you’re a local LLM user and you want to get the last bit of performance out of your model, then consider trying DeepClause and especially its planning capabilities.

[The following is mostly written by a real person ;-]

It’s been a bit of common wisdom these days that non-SOTA models can be pushed towards better performance if they are paired with the right harness. As my very own (somewhat) opinionated harness called DeepClause has finally reached a more or less stable state, I’ve decided to try and see how well it would fare on a benchmark.

But let’s first try to understand what makes DeepClause different.

The core of DeepClause consists of a domain specific language (DSL) called “DeepClause Meta Language” (DML). It is build on top of Prolog and its main purpose is to concisely express LLM-based workflows and agent logic. The core predicates allow precise control of context (=the messages[] array) and orchestration of agent loops.

Here is a very short example:

tool(run_code(Code, Output), "Execute code in a sandboxed VM") :-
    exec(vm_exec(code: Code), Output).

agent_main :-
    system("You are a helpful coding assistant."),
    user(UserMsg),
    task("Help the user with their coding request.", string(Response)),
    answer(Response).

Some explanations are in order here:

tool/2 defines a tool that the LLM can call during a task. Here it wraps vm_exec — arbitrary code execution in a sandboxed VM. The LLM decides when to call it and what arguments to pass.
system/1 sets the system prompt — the instruction the LLM sees first.
user/1 adds that text to the conversation memory. From this point on, every subsequent task/N call inherits it as context.
task/N sends the accumulated memory (system prompt + user messages) to the LLM and gets a response. During execution, the LLM can call any defined tool/2 — and the arguments it passes are shaped by everything in its context.

The execution semantics of DML are those of Prolog, of course. This means that we can define branches for any predicate (including agent_main), so that failures in tasks or deterministic code will trigger backtracking. This makes DML very suitable for expressing plan and search logic (in the classical AI sense). Also, this also allows us to implement all sorts of test time compute strategies.

When it comes to local models or more generally smaller and cheaper models, we can often observe that they struggle with executing long running tasks. Even though the supported context length itself is not much of an issue anymore, it’s relatively safe to say that smaller models struggle to keep track of multi step plansinvolving e a large amount of tools calls. So, I do believe it does make sense to try to find a mechanism that constrains them in their execution logic, but leaves enough freedom to actually find a solution. This is where DML comes in. It is a simple and concise language that lends itself to express exactly those long horizon plans where small models fail. And since it can be executed deterministically, we can increase the chances of successfully finishing a longer task even with a less than SOTA model. Moreover, we can easily let a SOTA model do the planning and execute using a smaller, faster model.

Attacking the DeepPlanning benchmark

The DeepPlanning Benchmark (https://arxiv.org/abs/2601.18137) is a recent benchmark that tests model long horizon planning capabilities. It is quite a challenging benchmark and a suitable candidate to test how we can improve upon base model capabilities with a smart harness.

To attack the benchmark we can now run the following steps:

Define the planner module: We write a short DML program that reads the user request incl. all constraints and produces as output an executable plan in the form of a DML program. This program can be considered as an auto generated agent tailored to a specific task.
For each problem instance in the benchmark we run the planner and execute the generated plan.

Although we could generate the DML plans in one go using an LLM with a typical coding-agent style loop, we choose a much simpler approach for the first step: Our planner DML prompts the LLM to output a list of steps as natural language descriptions and then have the planner deterministically convert the results into DML. Compared to using a full coding-agent-style loop (which DeepClause also supports) this is both simple and robust and saves us a great deal of tokens.

The generated plans then look as follows:

:- use_module(library(http/json)).

% --- Travel Tool Bridge ---

run_tool(ToolName, ArgsDict, Result) :-
    param(db_path, DbPath),
    param(bridge_dir, BridgeDir),
    param(bench_dir, BenchDir),
    param(python_path, PythonPath),
    (var(PythonPath) -> PythonPath = 'python3' ; true),
    atom_json_dict(ArgsJson, ArgsDict, []),
    format(string(ArgsFile), ".dc_bridge_~w.json", [ToolName]),
    exec(write_file(path: ArgsFile, content: ArgsJson), _),
    format(string(Cmd), "~w '~w/python-bridge.py' --domain travel --db-path '~w' --bench-dir '~w' --tool ~w --args-file '~w'", [PythonPath, BridgeDir, DbPath, BenchDir, ToolName, ArgsFile]),
    exec(bash(command: Cmd), Raw),
    parse_bridge_result(Raw, Result).

tool(query_train_info(Origin, Dest, Date, Result),
     "Search for train tickets between two cities on a given date. Returns train number, times, stations, duration, seat class, remaining seats, price.") :-
    run_tool(query_train_info, _{origin: Origin, destination: Dest, depDate: Date}, Result).


[ ... more tool definitions ...]

agent_main(Request) :-
    system("You are a travel planning agent creating a detailed itinerary for 4 travelers from Zhengzhou to Quanzhou (Nov 12-15, 2025). Execute the planned gathering steps autonomously without asking for user input.\n\nRULES:\n- All information must come from tool results — never fabricate names, prices, or details\n- Restaurant names must be EXACT matches from recommend_restaurants/query_restaurant_details results\n- Use recommend_restaurants with attraction/hotel coordinates passed as STRINGS, not restaurant coordinates\n- Current City on intercity travel days must use 'from CityA to CityB' format (literal 'from' and 'to')\n- Budget: all prices are per-unit; multiply by number of travelers/rooms in the final summary\n- Schedule times must be continuous with no gaps; do not schedule breakfast; full days require lunch+dinner\n- Last day (Nov 15) must end at the departure airport/station for the return journey\n- Select the shortest-duration direct outbound flight, a 4-star hotel with free parking, the 3 highest-rated attractions, and the cheapest restaurant in Donghai Bay area\n\nProceed through all gathering steps, then compile the complete itinerary with continuous daily schedules, exact venue names, and accurate budget calculations."),
    user(Request),
    task("Search for direct flights from Zhengzhou to Quanzhou on November 12, 2025, and return flights from Quanzhou to Zhengzhou on November 15, 2025, for 4 passengers, identif
ying the shortest-duration direct flight option for the outbound journey."),
    task("Search for four-star hotels in Quanzhou with free parking available for check-in on November 12, 2025 and check-out on November 15, 2025, accommodating 4 guests."),
    task("Search for and recommend top-rated attractions in Quanzhou, then retrieve details to identify the three highest-rated attractions."),
    task("Search for restaurants in the Donghai Bay area of Quanzhou and find the one with the lowest average spending per person."),
    task('Based on all gathered information, generate the complete travel itinerary inside <plan></plan> tags. Include day-by-day schedule with times, Current City, attractions,
meals with EXACT restaurant names from tool results, transport details, and a budget summary where per-unit costs are multiplied by number of travelers/rooms.', string(DraftPlan)
),
    task('Review this travel plan for errors and fix them. Check ALL of the following:
1. BUDGET: Is total cost (per-person costs * travelers, per-room costs * rooms) within budget?
2. TIME CONTINUITY: Are times continuous with no gaps or overlaps? Each activity end time = next start time.
3. MEAL RULES: Full sightseeing days need lunch AND dinner. No breakfast. Meals 1-2 hours. At least 2 hours between lunch and dinner.
4. DAILY STRUCTURE: Every day except last ends at hotel. Last day ends at departure airport/station.
5. GEOSPATIAL: No teleportation — travel_city between different locations.
6. DIVERSITY: No repeating restaurants or attractions across days.
7. NAMES: All names EXACTLY match tool results — no abbreviations or renames.
8. CURRENT CITY: Intercity days must say "from CityA to CityB".

Plan to review:
{DraftPlan}

If the plan has errors, output the CORRECTED plan inside <plan></plan> tags. If it is already correct, output it unchanged inside <plan></plan> tags. Store the final verified plan in VerifiedPlan.', string(VerifiedPlan)),
    answer(VerifiedPlan).

agent_main(Request) :-
    system('You are a travel planning assistant. Create a travel plan using the available tools. Output inside <plan></plan> tags with budget summary.'),
    user(Request),
    task('Create a complete travel plan using all available tools. Output inside <plan></plan> tags.', string(Plan)),
    answer(Plan).
```

As we can see from this, this enforces a clear structure on what our travel planning agent can do in each task instance. Alternatively, if we were to give the request directly to an LLM plus basic harness that just runs one big agent loop, then it would be much more likely that the model would at some point struggle with executing all steps in the correct order, while simultaneously correctly reasoning about what to do with all those tool call results.

To verify this claim, let’s look at what this gives us in the case of the travel planning benchmark, which tests long-horizon agentic planning with verifiable constraints (time, budget, geography). The benchmark requires agents to gather information via tool calls, reason about local constraints, and produce globally coherent multi-day itineraries evaluated across 8 commonsense dimensions and personalized hard constraints.

Here are the results for qwen3.6-35b-a3b:

Keypoints:

As we can see, we get an impressive boost in performance on the composite score of more than 90%!
Also, we can see that the baseline agent has a much lower delivery rate, this means it fails completely (without any result) with a higher probability than the DeepClause agent.
Why is Case Accuracy = 0? This means that none of the produced travel plan actually satisfied all constraints imposed by the benchmark. So, this Benchmark is actually very hard, even for SOTA LLMs! According to the original benchmark paper, even Gemini 3 Pro Preview only scored a measly 0.7% on this metric! This also suggests that there are plenty of more things to try with the DeepClause approach (DML natively supports constraint logic programming after all!)

Benchmark Reference: https://arxiv.org/abs/2601.18137

Some more thoughts and notes

Is this a groundbreaking result? Well, probably no. But it is yet another data point for the claim that we can reach acceptable performance at much lower costs (and even with local models!) if we combine the right harness with the right model
A lot of the magic happens behind the scenes: during execution, when the Prolog interpreter encounters a task predicates, it yields back to the calling Javascript loop, where the actual agent loop is implemented. This bit of code is optimized to work with smaller models (retry loops, nudging, …)
These are the results for specific model only, there are plenty of more experiments to run (Also, this is a blog post, not a paper). Anybody wants to help?
How well will this work for me and my particular use case"?? I have no idea, please try and let me know how it goes!

Substack von Andreas

Diskussion über diese Post

Sind Sie bereit für mehr?