<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Ker102 Research | AI & Prompt Engineering Insights]]></title><description><![CDATA[Empirical research, deep dives, and data-driven insights into Large Language Models and prompt engineering.]]></description><link>https://blog.kaelux.dev</link><image><url>https://cdn.hashnode.com/uploads/logos/69c053a9d9da55a9a5dc37e2/cf5ada2b-b8d2-4ed4-9766-bff9081d6a98.jpg</url><title>Ker102 Research | AI &amp; Prompt Engineering Insights</title><link>https://blog.kaelux.dev</link></image><generator>RSS for Node</generator><lastBuildDate>Sat, 09 May 2026 05:45:55 GMT</lastBuildDate><atom:link href="https://blog.kaelux.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[What We Learned From Analyzing 28,000 Production AI System Prompts]]></title><description><![CDATA[Over the last few months developing PromptTriage, we've collected and analyzed over 28,000 production system prompts. Most are bloated, contradictory, and actively hurt reasoning quality.

📉 Anti-Pat]]></description><link>https://blog.kaelux.dev/28k-ai-system-prompts-analysis</link><guid isPermaLink="true">https://blog.kaelux.dev/28k-ai-system-prompts-analysis</guid><category><![CDATA[AI]]></category><category><![CDATA[#PromptEngineering]]></category><category><![CDATA[Devops]]></category><category><![CDATA[MachineLearning]]></category><dc:creator><![CDATA[Kristofer Jussmann]]></dc:creator><pubDate>Tue, 24 Mar 2026 14:44:31 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69c053a9d9da55a9a5dc37e2/e81c28fc-b7bf-4017-a03a-59a846846585.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Over the last few months developing PromptTriage, we've collected and analyzed over <strong>28,000 production system prompts</strong>. Most are bloated, contradictory, and actively hurt reasoning quality.</p>
<hr />
<h2>📉 Anti-Pattern 1: The "Emotional Blackmail" Scaffold (14%)</h2>
<p><img src="https://github.com/Ker102/PromptTriage/releases/download/28k-study-system/Bar_chart_anti-pattern_202603241606.jpeg" alt="Anti-Pattern Prevalence in 28,000 Production System Prompts" />
<em>Caption: Anti-Pattern Prevalence in 28,000 Production System Prompts.</em></p>
<p>Over <strong>14%</strong> still contain emotional appeals:</p>
<blockquote>
<p><em>"Take a deep breath. If you miss a bug, the company will lose millions."</em></p>
</blockquote>
<p><strong>Why it fails:</strong> Modern RLHF has trained out the "anxiety" response. Emotional filler just adds tokens that pull the model's attention away from the actual task.</p>
<hr />
<h2>🏗️ Anti-Pattern 2: The "Just in Case" Clause (62%)</h2>
<p><strong>62% of prompts over 300 words</strong> contained contradictory constraints. Our Study E data showed that short prompts (&lt;50 words, scoring 80.1/100) consistently outperform long ones (&gt;300 words, 66.9/100).</p>
<hr />
<h2>🎭 Anti-Pattern 3: The "World Class Expert" Trap (80%)</h2>
<p>Nearly <strong>80%</strong> started with "Act as a world-class expert." Our Study C showed this provides no measurable lift on modern models (~78/100 with or without it).</p>
<hr />
<h2>🚀 The Fix: The 50-Word Rule</h2>
<ol>
<li><strong>State the task (~10 words):</strong> "Extract data from SEC filings."</li>
<li><strong>State the negatives (~20 words):</strong> "Do not include pleasantries. Do not output markdown."</li>
<li><strong>Halt.</strong></li>
</ol>
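<p>As a sanity check, the rule is easy to encode. A minimal sketch in Python; the prompt wording and the JSON requirement are illustrative placeholders, not taken from the study:</p>

```python
# A minimal sketch of the 50-word rule. The task wording and the
# JSON requirement are illustrative placeholders, not from the study.
SYSTEM_PROMPT = (
    "Extract company name, fiscal year, and net income from SEC filings. "  # 1. the task
    "Do not include pleasantries. Do not output markdown. "                 # 2. the negatives
    "Return one JSON object per filing."                                    # 3. halt
)

assert len(SYSTEM_PROMPT.split()) < 50, "over the 50-word budget"
```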
<p><em><a href="https://prompttriage.kaelux.dev">PromptTriage</a> compresses 500-word prompts to the optimal 50-word framework.</em></p>
]]></content:encoded></item><item><title><![CDATA[AI Format Wars: Does the Shape of Your Prompt Matter? (1,080 Evals Later)]]></title><description><![CDATA[AI Format Wars: Does the Shape of Your Prompt Matter? (1,080 Evals Later)
We spend hours tweaking the words in our prompts, but how much thought do we give to the structure? If you ask an AI to return]]></description><link>https://blog.kaelux.dev/ai-format-wars</link><guid isPermaLink="true">https://blog.kaelux.dev/ai-format-wars</guid><category><![CDATA[AI]]></category><category><![CDATA[llm]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[#PromptEngineering]]></category><dc:creator><![CDATA[Kristofer Jussmann]]></dc:creator><pubDate>Sun, 22 Mar 2026 22:19:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69c053a9d9da55a9a5dc37e2/93128a9b-e410-48d8-887e-d8bbab9ee09f.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>AI Format Wars: Does the Shape of Your Prompt Matter? (1,080 Evals Later)</h1>
<p>We spend hours tweaking the <em>words</em> in our prompts, but how much thought do we give to the <strong>structure</strong>? If you ask an AI to return data in JSON vs. Markdown, or if you write a concise 50-word prompt vs. a detailed 500-word prompt, does the quality of the reasoning actually change?</p>
<p>To find out, I ran <strong>Study E v2: The Format Wars</strong>.</p>
<p>I subjected 5 frontier models to <strong>1,080 rigorous evaluations</strong> across 12 distinct task domains (coding, math, data extraction, analysis, creative writing, and more). Every single evaluation was scored blindly by a 3-judge LLM jury on a 100-point scale.</p>
<p>The results completely changed how I build AI applications.</p>
<hr />
<h2>🔬 The Setup: 1,080 Evaluations</h2>
<p>We tested five heavyweight models:</p>
<ul>
<li><strong>GPT-5.4</strong> (OpenAI)</li>
<li><strong>Nemotron 3 Super 120B</strong> (Nvidia)</li>
<li><strong>Claude Sonnet 4.6</strong> (Anthropic)</li>
<li><strong>Gemini 3.1 Pro</strong> (Google)</li>
<li><strong>Qwen 3.5 397B</strong> (Alibaba)</li>
</ul>
<p>For each model, we ran 216 evaluations testing 18 unique prompt configurations:</p>
<ul>
<li><strong>6 Formats:</strong> Plain Text, Markdown, XML, JSON, YAML, Hybrid (Text + Code Blocks)</li>
<li><strong>3 Lengths:</strong> Short (&lt;50 words), Medium (~150 words), Long (&gt;300 words)</li>
</ul>
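<p>The grid above is small enough to write down. A quick sketch, with names mirroring the post:</p>

```python
from itertools import product

# The evaluation grid described above: 6 output formats x 3 prompt
# lengths = 18 configurations, run over 12 task domains per model.
FORMATS = ["plain_text", "markdown", "xml", "json", "yaml", "hybrid"]
LENGTHS = ["short", "medium", "long"]
TASK_DOMAINS = 12
MODELS = 5

configs = list(product(FORMATS, LENGTHS))
print(len(configs))                          # 18 configurations
print(len(configs) * TASK_DOMAINS)           # 216 evaluations per model
print(len(configs) * TASK_DOMAINS * MODELS)  # 1080 evaluations total
```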
<p>The scoring was handled by a ruthless 3-judge panel (Llama 4 Maverick, Claude Opus 4.6, and Atla Selene Mini) grading on instruction following, reasoning quality, formatting adherence, and edge-case handling.</p>
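<p>Mechanically, the jury step looks something like this. Averaging the three judges is my assumption for illustration; the post fixes the judges and the 100-point scale, not the aggregation rule:</p>

```python
from statistics import mean

# Blind 3-judge jury: Llama 4 Maverick, Claude Opus 4.6, Atla Selene Mini.
# Each scores 0-100; taking the mean is an assumed aggregation rule.
def jury_score(judge_scores: dict[str, float]) -> float:
    assert len(judge_scores) == 3, "the panel has exactly three judges"
    return mean(judge_scores.values())

score = jury_score({"maverick": 90, "opus": 85, "selene": 80})
```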
<hr />
<h2>🏆 Finding 1: The Model Rankings</h2>
<p>Before looking at formats, how did the models perform overall across all 18 configurations?</p>
<p><img src="https://github.com/Ker102/PromptTriage/releases/download/Research-chart/Enhance_horizontal_bar_202603222302.jpeg" alt="Overall Model Rankings — Average Score out of 100 (1,080 evaluations)" /></p>
<ol>
<li>🥇 <strong>GPT-5.4:</strong> <code>88.1 / 100</code> — Won 10 out of 12 task domains</li>
<li>🥈 <strong>Nemotron 120B:</strong> <code>85.1 / 100</code> — Won 1 domain (Data Extraction), extremely close to GPT-5.4</li>
<li>🥉 <strong>Claude Sonnet 4.6:</strong> <code>69.5 / 100</code></li>
<li><strong>Gemini 3.1 Pro:</strong> <code>62.6 / 100</code> — Won 1 domain (Question Answering)</li>
<li><strong>Qwen 397B:</strong> <code>61.0 / 100</code></li>
</ol>
<p><strong>Takeaway:</strong> GPT-5.4 is the undeniable reasoning king right now. But Nvidia's Nemotron 120B is a shocking powerhouse: it finished just three points behind and actually beat GPT-5.4 outright in Data Extraction tasks. If you aren't testing Nemotron in your pipelines, you are missing out.</p>
<p><img src="https://github.com/Ker102/PromptTriage/releases/download/Research-chart/Pie_chart_with_202603222343.jpeg" alt="Task Domain Winners — GPT-5.4 dominates 10/12, but Nemotron owns Extraction" /></p>
<hr />
<h2>🧱 Finding 2: The Best Format is... JSON?</h2>
<p>If you want the highest quality reasoning and instruction following from an LLM, what format should you ask it to return?</p>
<p><img src="https://github.com/Ker102/PromptTriage/releases/download/Research-chart/Horizontal_bar_chart_202603222343.jpeg" alt="Format Impact on Reasoning Quality — Averaged Over All 5 Models" /></p>
<ol>
<li><strong>YAML:</strong> <code>74.6 / 100</code></li>
<li><strong>JSON:</strong> <code>74.4 / 100</code> (Statistical tie with YAML)</li>
<li><strong>Hybrid:</strong> <code>73.5 / 100</code></li>
<li><strong>XML:</strong> <code>73.3 / 100</code></li>
<li><strong>Markdown:</strong> <code>72.9 / 100</code></li>
<li><strong>Plain Text:</strong> <code>70.8 / 100</code></li>
</ol>
<p><strong>Takeaway:</strong> Asking the model to structure its output in JSON or YAML doesn't just make it easier for your code to parse—<strong>it actually improves the model's reasoning.</strong></p>
<p>Why? Forcing the model into a strict structural schema (like JSON keys) acts as a <strong>cognitive scaffold</strong>. It forces the model to categorize its thoughts before generating output, leading to fewer hallucinations and better instruction adherence. Plain unstructured text performed the worst across the board.</p>
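<p>In practice, the scaffold is just a template the model has to fill. A sketch; the keys are hypothetical, not from the study:</p>

```python
import json

# Asking for fixed JSON keys makes the model sort its answer into
# categories before it writes a word of prose. Keys are hypothetical.
SCHEMA = {
    "claim": "<one-sentence answer>",
    "evidence": ["<supporting fact>", "..."],
    "confidence": "<low | medium | high>",
}

prompt = (
    "Answer the question below. Respond with ONLY a JSON object "
    "matching this template:\n" + json.dumps(SCHEMA, indent=2)
)
```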
<p>But here's the nuance: different models prefer different formats:</p>
<p><img src="https://github.com/Ker102/PromptTriage/releases/download/Research-chart/Heatmap_chart_with_202603222343.jpeg" alt="Format × Model Heatmap — The sweet-spot varies by model" /></p>
<p><em>Note: YAML and JSON were statistically tied at the top overall, but Nemotron and Qwen in particular performed slightly better when outputting YAML.</em></p>
<hr />
<h2>📏 Finding 3: The Prompt Length Paradox</h2>
<p>We've been trained to write massive, highly detailed "megaprompts" with endless context. But the data reveals a startling paradox:</p>
<p><img src="https://github.com/Ker102/PromptTriage/releases/download/Research-chart/3-bar_chart_enhancements_202603222343.jpeg" alt="The Length Paradox — Shorter Prompts Win Across All Models" /></p>
<ul>
<li><strong>Short Prompts (&lt;50 words):</strong> <code>80.1 / 100</code></li>
<li><strong>Medium Prompts (~150 words):</strong> <code>72.8 / 100</code></li>
<li><strong>Long Prompts (&gt;300 words):</strong> <code>66.9 / 100</code></li>
</ul>
<p><strong>Takeaway:</strong> Across all 5 models and all 6 formats, <strong>short prompts absolutely demolished long prompts</strong> (a 13-point average gap).</p>
<p>When you flood the context window with too many instructions, constraints, and examples, the model suffers from <strong>attention dilution</strong>. It forgets the primary objective and gets bogged down trying to satisfy secondary constraints.</p>
<p>The worst combination in the entire study? <strong>Qwen 397B given a Long prompt asking for Plain Text (38.8/100).</strong></p>
<hr />
<h2>🏅 Finding 4: The Best and Worst Combinations</h2>
<p>What are the absolute best and worst model + format + length trios?</p>
<p><img src="https://github.com/Ker102/PromptTriage/releases/download/Research-chart/Dual-color_bar_chart_202603222343.jpeg" alt="Top 5 vs Bottom 5 Combinations — The gap is massive (53+ points)" /></p>
<p>The <strong>Golden Combo</strong> scored <code>92.2 / 100</code>: GPT-5.4 + Hybrid Output + Short Prompt.</p>
<hr />
<h2>🚀 The Ultimate Prompting Formula</h2>
<p>If you want to maximize the performance of a modern LLM, the data points to a clear formula:</p>
<ol>
<li><strong>Keep it brief:</strong> State your objective clearly in under 50 words. Drop the fluff.</li>
<li><strong>Demand structure:</strong> Ask the model to return its answer in JSON, YAML, or a hybrid with code blocks. Avoid unstructured text.</li>
<li><strong>Use the right model:</strong> GPT-5.4 for general reasoning/coding, Nemotron 120B for extraction.</li>
</ol>
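<p>Rule 3 is easy to operationalize as a routing table. A sketch; the model identifiers come from the post but are placeholders, not any provider's actual API names:</p>

```python
# Route by task domain per the rankings above. Identifiers are
# placeholders; map them to your provider's real model names.
MODEL_BY_DOMAIN = {
    "data_extraction": "nemotron-120b",  # won the Data Extraction domain
}
DEFAULT_MODEL = "gpt-5.4"                # won 10 of 12 domains

def pick_model(domain: str) -> str:
    return MODEL_BY_DOMAIN.get(domain, DEFAULT_MODEL)

print(pick_model("data_extraction"))  # nemotron-120b
print(pick_model("coding"))           # gpt-5.4
```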
<p>I built <a href="https://prompttriage.kaelux.dev">PromptTriage</a> specifically to help developers automatically refactor those bloated 500-word megaprompts down into the high-scoring "Short + Structured" shape that performed best in this data.</p>
<p><em>Data lovers: The full 1,080-row dataset and analysis script are open-sourced in the <a href="https://github.com/Ker102/PromptTriage">PromptTriage repo</a>.</em></p>
]]></content:encoded></item></channel></rss>