<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://jnidzwetzki.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://jnidzwetzki.github.io/" rel="alternate" type="text/html" /><updated>2026-03-04T21:34:02+00:00</updated><id>https://jnidzwetzki.github.io/feed.xml</id><title type="html">Jan’s website and blog</title><subtitle>Jan&apos;s blog on big data, databases, and distributed systems</subtitle><entry><title type="html">pg_plan_alternatives: Tracing PostgreSQL’s Query Plan Alternatives using eBPF</title><link href="https://jnidzwetzki.github.io/2026/03/04/pg-plan-alternatives.html" rel="alternate" type="text/html" title="pg_plan_alternatives: Tracing PostgreSQL’s Query Plan Alternatives using eBPF" /><published>2026-03-04T00:00:00+00:00</published><updated>2026-03-04T00:00:00+00:00</updated><id>https://jnidzwetzki.github.io/2026/03/04/pg-plan-alternatives</id><content type="html" xml:base="https://jnidzwetzki.github.io/2026/03/04/pg-plan-alternatives.html"><![CDATA[<p>PostgreSQL uses a cost-based optimizer (CBO) to determine the best execution plan for a given query. The optimizer considers multiple alternative plans during the planning phase. Using the <code class="language-plaintext highlighter-rouge">EXPLAIN</code> command, a user can only inspect the chosen plan, but not the alternatives that were considered. To address this gap, I developed <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code>, a tool that uses eBPF to instrument the PostgreSQL optimizer and trace all alternative plans and their costs that were considered during the planning phase. This information helps the user understand the optimizer’s decision-making process and tune system parameters. 
This article explains how <a href="https://github.com/jnidzwetzki/pg_plan_alternatives">pg_plan_alternatives</a> works, provides examples, and discusses the insights the tool can provide.</p>

<!--more-->

<h1 id="cost-based-optimization">Cost-Based Optimization</h1>
<p>SQL is a declarative language, which means that users specify only what they want to achieve, not how to achieve it. For example, should the query <code class="language-plaintext highlighter-rouge">SELECT * FROM mytable WHERE age &gt; 50;</code> perform a full table scan and apply a filter, or should it use an index (see <a href="/2025/06/03/art-of-query-optimization.html">this blog post</a> for more details)? The optimizer of the database management system is responsible for determining the best execution plan for a given query. During query planning, the optimizer generates multiple alternative plans. Many DBMSs perform <a href="https://dl.acm.org/doi/10.1145/582095.582099">cost-based optimization</a>, where each plan is annotated with a cost estimate, a numerical value representing the estimated resource usage (e.g., CPU time, I/O operations) required to execute the plan. The optimizer then selects the plan with the lowest estimated cost as the final execution plan for the query.</p>

<p>To calculate the costs of the plan nodes, the optimizer uses a cost model that accounts for factors such as the number of rows predicted to be processed (based on statistics and selectivity estimates) and configurable cost constants (e.g., the cost of reading a page or processing a tuple).</p>
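<p>As a rough illustration of such a cost model, the following Python sketch computes a sequential scan estimate in its simplest form: pages read times <code class="language-plaintext highlighter-rouge">seq_page_cost</code> plus rows processed times <code class="language-plaintext highlighter-rouge">cpu_tuple_cost</code>. The two constants use PostgreSQL's default values; the table size (5 pages, 1000 rows) and the function name are assumptions made for this example, and PostgreSQL's real cost functions account for more factors (filters, parallelism, and so on).</p>

```python
# Simplified sketch of a sequential scan cost estimate (not PostgreSQL's actual code).
SEQ_PAGE_COST = 1.0    # default value of the seq_page_cost setting
CPU_TUPLE_COST = 0.01  # default value of the cpu_tuple_cost setting


def seq_scan_cost(pages, rows, seq_page_cost=SEQ_PAGE_COST, cpu_tuple_cost=CPU_TUPLE_COST):
    """Return (startup_cost, total_cost) for scanning a table without a filter."""
    startup_cost = 0.0                # the first row is available immediately
    io_cost = seq_page_cost * pages   # every page is read once
    cpu_cost = cpu_tuple_cost * rows  # every row is processed once
    return startup_cost, io_cost + cpu_cost


# Assumed example table: 1000 rows stored in 5 pages.
startup, total = seq_scan_cost(pages=5, rows=1000)
print(f"cost={startup:.2f}..{total:.2f}")
```

<p>With these assumed inputs, the sketch prints <code class="language-plaintext highlighter-rouge">cost=0.00..15.00</code>. Changing one of the constants shifts the estimate, which is why the constants should reflect the actual hardware.</p>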

<h2 id="query-plans-in-postgresql">Query Plans in PostgreSQL</h2>
<p>Using the <code class="language-plaintext highlighter-rouge">EXPLAIN</code> command in PostgreSQL, you can see the final chosen plan and its estimated total cost, and the costs of the individual plan nodes. For example, using <code class="language-plaintext highlighter-rouge">EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM test1 WHERE id = 5;</code>, the query plan of the given select query is shown:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">jan2</span><span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="p">(</span><span class="k">VERBOSE</span><span class="p">,</span> <span class="k">ANALYZE</span><span class="p">)</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">test1</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">5</span><span class="p">;</span>
 <span class="n">QUERY</span> <span class="n">PLAN</span>
<span class="c1">------------------------------------------------------------------------------------------------------------------------------</span>
 <span class="k">Index</span> <span class="k">Only</span> <span class="n">Scan</span> <span class="k">using</span> <span class="n">test1_pkey</span> <span class="k">on</span> <span class="k">public</span><span class="p">.</span><span class="n">test1</span>  <span class="p">(</span><span class="n">cost</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">28</span><span class="p">..</span><span class="mi">8</span><span class="p">.</span><span class="mi">29</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1</span> <span class="n">width</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">153</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">160</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
   <span class="k">Output</span><span class="p">:</span> <span class="n">id</span>
   <span class="k">Index</span> <span class="n">Cond</span><span class="p">:</span> <span class="p">(</span><span class="n">test1</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="mi">5</span><span class="p">)</span>
 <span class="n">Heap</span> <span class="n">Fetches</span><span class="p">:</span> <span class="mi">1</span>
 <span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">1</span><span class="p">.</span><span class="mi">166</span> <span class="n">ms</span>
 <span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">284</span> <span class="n">ms</span>
<span class="p">(</span><span class="mi">6</span> <span class="k">rows</span><span class="p">)</span>
</code></pre></div></div>

<p>The plan consists of only one <code class="language-plaintext highlighter-rouge">Index Only Scan</code> node with an estimated cost of <code class="language-plaintext highlighter-rouge">0.28..8.29</code>: the startup cost is <code class="language-plaintext highlighter-rouge">0.28</code> and the total cost is <code class="language-plaintext highlighter-rouge">8.29</code>. The startup cost is the estimated cost incurred before the first row can be returned, while the total cost is the estimated cost of returning all rows.</p>

<p>A more complex example with a join might look like this:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">jan2</span><span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="p">(</span><span class="k">VERBOSE</span><span class="p">,</span> <span class="k">ANALYZE</span><span class="p">)</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">test1</span> <span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">test2</span> <span class="k">ON</span> <span class="p">(</span><span class="n">test1</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">test2</span><span class="p">.</span><span class="n">id</span><span class="p">);</span>
 <span class="n">QUERY</span> <span class="n">PLAN</span>
<span class="c1">-------------------------------------------------------------------------------------------------------------------------</span>
 <span class="n">Hash</span> <span class="k">Left</span> <span class="k">Join</span> <span class="p">(</span><span class="n">cost</span><span class="o">=</span><span class="mi">27</span><span class="p">.</span><span class="mi">50</span><span class="p">..</span><span class="mi">45</span><span class="p">.</span><span class="mi">14</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">width</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">625</span><span class="p">..</span><span class="mi">1</span><span class="p">.</span><span class="mi">422</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
   <span class="k">Output</span><span class="p">:</span> <span class="n">test1</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">test2</span><span class="p">.</span><span class="n">id</span>
 <span class="k">Inner</span> <span class="k">Unique</span><span class="p">:</span> <span class="k">true</span>
   <span class="n">Hash</span> <span class="n">Cond</span><span class="p">:</span> <span class="p">(</span><span class="n">test1</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">test2</span><span class="p">.</span><span class="n">id</span><span class="p">)</span>
 <span class="o">-&gt;</span>  <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="k">public</span><span class="p">.</span><span class="n">test1</span>  <span class="p">(</span><span class="n">cost</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">00</span><span class="p">..</span><span class="mi">15</span><span class="p">.</span><span class="mi">00</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">width</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">038</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">220</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
         <span class="k">Output</span><span class="p">:</span> <span class="n">test1</span><span class="p">.</span><span class="n">id</span>
 <span class="o">-&gt;</span>  <span class="n">Hash</span> <span class="p">(</span><span class="n">cost</span><span class="o">=</span><span class="mi">15</span><span class="p">.</span><span class="mi">00</span><span class="p">..</span><span class="mi">15</span><span class="p">.</span><span class="mi">00</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">width</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">571</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">572</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
         <span class="k">Output</span><span class="p">:</span> <span class="n">test2</span><span class="p">.</span><span class="n">id</span>
 <span class="n">Buckets</span><span class="p">:</span> <span class="mi">1024</span>  <span class="n">Batches</span><span class="p">:</span> <span class="mi">1</span> <span class="n">Memory</span> <span class="k">Usage</span><span class="p">:</span> <span class="mi">44</span><span class="n">kB</span>
 <span class="o">-&gt;</span>  <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="k">public</span><span class="p">.</span><span class="n">test2</span>  <span class="p">(</span><span class="n">cost</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">00</span><span class="p">..</span><span class="mi">15</span><span class="p">.</span><span class="mi">00</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">width</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">019</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">191</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
               <span class="k">Output</span><span class="p">:</span> <span class="n">test2</span><span class="p">.</span><span class="n">id</span>
 <span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">3</span><span class="p">.</span><span class="mi">436</span> <span class="n">ms</span>
 <span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">1</span><span class="p">.</span><span class="mi">551</span> <span class="n">ms</span>
<span class="p">(</span><span class="mi">13</span> <span class="k">rows</span><span class="p">)</span>
</code></pre></div></div>

<p>In this case, the plan consists of a <code class="language-plaintext highlighter-rouge">Hash Left Join</code> node with an estimated cost of <code class="language-plaintext highlighter-rouge">27.50..45.14</code>. The plan also contains two <code class="language-plaintext highlighter-rouge">Seq Scan</code> nodes, each with an estimated cost of <code class="language-plaintext highlighter-rouge">0.00..15.00</code>, one for each of the <code class="language-plaintext highlighter-rouge">test1</code> and <code class="language-plaintext highlighter-rouge">test2</code> tables. Furthermore, a <code class="language-plaintext highlighter-rouge">Hash</code> node is used to build a hash table for <code class="language-plaintext highlighter-rouge">test2</code>; its cost is <code class="language-plaintext highlighter-rouge">15.00..15.00</code>, i.e., its startup cost equals its total cost, because the hash table has to be built completely before the join can probe it.</p>

<h2 id="structure-of-a-query-plan">Structure of a Query Plan</h2>
<p>Like most database management systems, PostgreSQL organizes query execution as a tree of plan nodes. Each node represents a specific operation (e.g., scan, join, aggregate), requests input tuples from its child nodes (like an iterator), and provides the operation’s result as output. The leaf nodes usually read the data of tables, and the root node provides the final result of the query. The interface of the nodes is standardized, so nodes can be easily combined to create different plans (see the <a href="https://cs-people.bu.edu/mathan/reading-groups/papers-classics/volcano.pdf">open-next-close protocol</a>). Most nodes pass tuples along in a streaming manner and work on only one tuple at a time. However, some nodes, such as sort operations, have to read the entire input of their child node before they can emit the first output tuple.</p>

<p>Another representation of the plan is the following, which shows the plan nodes and their relationships in a graph format:</p>

<div class="mermaid">
graph TB
 A[Hash Left Join]
 B[Seq Scan test1]
 C[Hash]
 D[Seq Scan test2]
 A --&gt; B
 A --&gt; C
 C --&gt; D
</div>
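<p>The iterator-style interface described in this section can be sketched with Python generators. This is only an illustration of the idea, not PostgreSQL's executor: the class names are hypothetical, the "table" is an in-memory list, and the filter is modeled as a separate node for simplicity.</p>

```python
# Minimal sketch of iterator-style plan nodes using Python generators.
# The class names are illustrative only and do not mirror PostgreSQL's code.

class SeqScan:
    """Leaf node: reads all rows of an in-memory 'table'."""
    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        for row in self.rows:
            yield row  # streaming: one tuple at a time


class Filter:
    """Streaming node: passes on only the rows matching a predicate."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def __iter__(self):
        for row in self.child:
            if self.predicate(row):
                yield row


class Sort:
    """Blocking node: must consume its whole input before emitting a row."""
    def __init__(self, child, key):
        self.child = child
        self.key = key

    def __iter__(self):
        buffered = sorted(self.child, key=self.key)  # reads the entire input first
        yield from buffered


table = [{"id": 3}, {"id": 1}, {"id": 2}]
plan = Sort(Filter(SeqScan(table), lambda row: row["id"] != 2), key=lambda row: row["id"])
result = [row["id"] for row in plan]
print(result)
```

<p>Because every node exposes the same iteration interface, the nodes can be stacked in any order. Only <code class="language-plaintext highlighter-rouge">Sort</code> has to buffer its complete input before it can produce its first row.</p>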

<h1 id="trace-plan-alternatives-using-pg_plan_alternatives">Trace Plan Alternatives using pg_plan_alternatives</h1>
<p>For all queries, the optimizer considers multiple alternative plans during the planning phase. However, PostgreSQL does not provide a way to inspect these alternatives. This is where <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code> comes into play.</p>

<p><code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code> uses <a href="https://ebpf.io/">eBPF</a> (Extended Berkeley Packet Filter) to instrument the PostgreSQL optimizer. eBPF allows loading custom programs into the kernel and attaching them to various events, such as function calls. By attaching an eBPF program to the <a href="https://github.com/postgres/postgres/blob/f191dc676632614ea1c74616f457096114f9fa29/src/backend/optimizer/util/pathnode.c#L459"><code class="language-plaintext highlighter-rouge">add_path</code></a> function of the PostgreSQL optimizer, the tool can capture all the alternative paths that are generated and considered. <a href="https://github.com/postgres/postgres/blob/b30656ce0071806ce649f2b69a4d06018d5c01a4/src/include/nodes/pathnodes.h#L1950"><code class="language-plaintext highlighter-rouge">Paths</code></a> are an early lightweight representation of a plan node during query planning. Such a path consists of an operator, the estimated costs, and the number of tuples the node is expected to process.</p>

<h2 id="high-level-architecture-of-pg_plan_alternatives">High-Level Architecture of pg_plan_alternatives</h2>

<p>pg_plan_alternatives consists of three main components:</p>
<ul>
  <li>An eBPF program that runs in kernel space.</li>
  <li>A user-space script, <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code>, that collects events emitted by the eBPF program and loads the eBPF program into the kernel.</li>
  <li>A visualization script, <code class="language-plaintext highlighter-rouge">visualize_plan_graph</code>, that takes the collected events and visualizes the alternative plans.</li>
</ul>

<div class="mermaid">
graph LR
 A["eBPF program<br />(kernel space)"]
 B["pg_plan_alternatives<br />(user space)"]
 C["PostgreSQL<br />(user space)"]
 D["visualize_plan_graph<br />(user space)"]
 A --&gt;|attaches to| C
 A --&gt;|emits events| B
 B --&gt;|visualizes| D
</div>

<p>The <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code> Python script, which runs in user space, collects the data emitted by the eBPF program and prints the received events. These events can then be visualized using the <code class="language-plaintext highlighter-rouge">visualize_plan_graph</code> script, which is also part of the <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code> project. The visualization shows the alternative plans and their costs in a graph format, which makes it easier to understand the decision-making process of the optimizer.</p>

<h3 id="capturing-paths-during-query-planning">Capturing Paths During Query Planning</h3>
<p>Capturing the alternatives at the moment they are generated is necessary because there is no point in time at which all alternative query plans are available in memory. The optimizer immediately frees alternatives that are not promising using <a href="https://github.com/postgres/postgres/blob/f191dc676632614ea1c74616f457096114f9fa29/src/backend/optimizer/util/pathnode.c#L664"><code class="language-plaintext highlighter-rouge">pfree</code></a> and keeps only the most promising ones (e.g., those with lower estimated costs). A second probe is attached to the <a href="https://github.com/postgres/postgres/blob/f191dc676632614ea1c74616f457096114f9fa29/src/backend/optimizer/plan/createplan.c#L339"><code class="language-plaintext highlighter-rouge">create_plan</code></a> function, which is called once query planning is done and is responsible for creating the final execution plan. This allows <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code> to determine which of the alternatives was finally chosen by the optimizer.</p>
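<p>Why the alternatives have to be captured on the fly can be illustrated with a heavily simplified Python model of the pruning step. This is a sketch under strong assumptions, not PostgreSQL's logic: the real <code class="language-plaintext highlighter-rouge">add_path</code> also takes sort order, parameterization, row counts, and parallel safety into account and compares costs with a fuzz factor. The cost values are example numbers.</p>

```python
from dataclasses import dataclass


@dataclass
class Path:
    node_type: str
    startup_cost: float
    total_cost: float


def add_path(pathlist, new_path):
    """Simplified pruning: keep a path only if no other path dominates it.

    A path dominates another if it is at least as cheap in both startup
    and total cost. Returns the paths that were discarded by this call.
    """
    def dominates(a, b):
        return b.startup_cost >= a.startup_cost and b.total_cost >= a.total_cost

    # Reject the new path if an existing path is at least as good.
    if any(dominates(old, new_path) for old in pathlist):
        return [new_path]

    # Otherwise drop every existing path that the new one dominates.
    discarded = [old for old in pathlist if dominates(new_path, old)]
    pathlist[:] = [old for old in pathlist if not dominates(new_path, old)]
    pathlist.append(new_path)
    return discarded


paths = []
seen = []  # what a probe attached to add_path would observe
for candidate in [
    Path("T_IndexOnlyScan", 0.28, 43.27),
    Path("T_SeqScan", 0.00, 15.00),
    Path("T_BitmapHeapScan", 25.52, 40.52),
]:
    seen.append(candidate.node_type)
    add_path(paths, candidate)

print(len(seen), [p.node_type for p in paths])
```

<p>In this model, three candidates pass through <code class="language-plaintext highlighter-rouge">add_path</code>, but only one remains in the list afterwards. Inspecting the list at the end of planning would therefore miss the discarded alternatives, whereas a probe on the function call sees all of them.</p>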

<h3 id="handling-function-parameters-in-ebpf">Handling Function Parameters in eBPF</h3>
<p>A challenge is dealing with the function parameters in the eBPF program, since PostgreSQL structs like <code class="language-plaintext highlighter-rouge">Path</code> are opaque to eBPF. The function parameters are therefore pointers to opaque data structures, but the eBPF program needs to access certain fields (e.g., the type of the path or the costs).</p>

<p>There are three approaches to this problem:</p>

<ul>
  <li>
    <p>Copy PostgreSQL structs from the source code into the eBPF program. This is possible because both are written in C. However, eBPF supports a limited set of datatypes, and complex struct members (such as pointers to other structs) must be resolved and converted to simple data types, which requires considerable work.</p>
  </li>
  <li>
    <p>Hard-code the byte offsets of the fields. Like copying the structs, this ties the tool to one specific PostgreSQL version and build, which makes it fragile when the PostgreSQL codebase changes.</p>
  </li>
  <li>
    <p>The approach used by <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code> is to analyze the PostgreSQL binary and extract the offsets of the relevant struct fields using <a href="https://dwarfstd.org/">DWARF</a> debug information. This debug information lets the plan tracer <a href="https://github.com/jnidzwetzki/pg_plan_alternatives/blob/bd37a1b56495c43877956dce85fb81db2eaf08ba/src/pg_plan_alternatives/helper.py#L64">determine</a> the byte offset of each field. These offsets are extracted by the user-space part of <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code> and provided to the eBPF program using <code class="language-plaintext highlighter-rouge">#define</code> directives. With these offsets, the eBPF program can locate the relevant fields and read the necessary information. Extracting offsets dynamically allows <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code> to adapt to different PostgreSQL versions without changing the eBPF program.</p>
  </li>
</ul>
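<p>The last step of the third approach, handing the extracted offsets to the eBPF program as <code class="language-plaintext highlighter-rouge">#define</code> directives, could look roughly like the following sketch. The function name, the macro naming scheme, and the offset values are hypothetical and chosen only for illustration; they are not taken from the tool's source, and real offsets have to come from the DWARF data of the target binary.</p>

```python
# Sketch: turn struct field offsets into C preprocessor definitions.
# The macro names and the offset values below are made up for illustration;
# real offsets must be read from the debug information of the target binary.

def render_defines(struct_name, field_offsets):
    """Return '#define' lines, one per field, to prepend to the eBPF C source."""
    lines = []
    for field, offset in sorted(field_offsets.items()):
        macro = f"OFFSET_{struct_name}_{field}".upper()
        lines.append(f"#define {macro} {offset}")
    return "\n".join(lines)


example_offsets = {"startup_cost": 56, "total_cost": 64, "rows": 40}  # hypothetical values
header = render_defines("Path", example_offsets)
print(header)
```

<p>The generated text can be placed in front of the eBPF C source before it is compiled, so the same C code works with whatever offsets were found for the binary at hand.</p>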

<h1 id="insights-from-pg_plan_alternatives">Insights from pg_plan_alternatives</h1>
<p>The insights that <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code> provides are useful in several situations:</p>

<ul>
  <li>
    <p>The PostgreSQL planner can be tuned using configuration parameters such as <code class="language-plaintext highlighter-rouge">random_page_cost</code> or <code class="language-plaintext highlighter-rouge">cpu_tuple_cost</code>. These parameters feed into the cost functions and influence the planner’s estimates; they should match the actual system environment so the optimizer can make good decisions. Using <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code>, you can see which alternatives are considered and how close their costs are.</p>
  </li>
  <li>
    <p>Extension developers who rewrite query plans during the planning phase should be aware of the alternatives considered by the optimizer, since their code must handle all relevant cases correctly. <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code> helps visualize the alternatives generated by the optimizer.</p>
  </li>
</ul>

<h1 id="examples">Examples</h1>
<p>This section shows examples of how <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code> can be used to gain insights into PostgreSQL’s query planning process. All examples below use two tables, each with 1000 rows and up-to-date statistics:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">test1</span><span class="p">(</span><span class="n">id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">);</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">test2</span><span class="p">(</span><span class="n">id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">test1</span> <span class="k">SELECT</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000</span><span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">test2</span> <span class="k">SELECT</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000</span><span class="p">);</span>

<span class="k">ANALYZE</span><span class="p">;</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code> tracer can be installed using the following command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>pg_plan_alternatives
</code></pre></div></div>

<h2 id="simple-select-query">Simple SELECT Query</h2>

<p>To inspect the alternative plans for a simple <code class="language-plaintext highlighter-rouge">SELECT</code> query, the query tracer can be started as follows:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>pg_plan_alternatives <span class="nt">-x</span> /home/jan/postgresql-sandbox/bin/REL_17_1_DEBUG/bin/postgres <span class="nt">-n</span> <span class="si">$(</span>pg_config <span class="nt">--includedir-server</span><span class="si">)</span>/nodes/nodetags.h
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">-x</code> parameter specifies the path to the PostgreSQL binary that should be instrumented, while the <code class="language-plaintext highlighter-rouge">-n</code> parameter specifies the path to the <code class="language-plaintext highlighter-rouge">nodetags.h</code> header file, which contains the definitions of the plan node types. When no <code class="language-plaintext highlighter-rouge">-p</code> parameter is specified, the tool will trace all running processes for that binary. After starting the tracer, the following query can be executed:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">test1</span><span class="p">;</span>
</code></pre></div></div>

<p>The tracer then prints output similar to the following:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">================================================================================</span>
PostgreSQL Plan Alternatives Tracer
Binary: /home/jan/postgresql-sandbox/bin/REL_17_1_DEBUG/bin/postgres
Tracing all PostgreSQL processes
<span class="o">================================================================================</span>

Received event: <span class="nv">PID</span><span class="o">=</span>3917080, <span class="nv">Type</span><span class="o">=</span>ADD_PATH, <span class="nv">PathType</span><span class="o">=</span>T_SeqScan
<span class="o">[</span>20:14:54.116] <span class="o">[</span>PID 3917080] ADD_PATH: T_SeqScan <span class="o">(</span><span class="nv">startup</span><span class="o">=</span>0.00, <span class="nv">total</span><span class="o">=</span>15.00, <span class="nv">rows</span><span class="o">=</span>1000, <span class="nv">parent_rti</span><span class="o">=</span>1, <span class="nv">parent_oid</span><span class="o">=</span>26144<span class="o">)</span>
Received event: <span class="nv">PID</span><span class="o">=</span>3917080, <span class="nv">Type</span><span class="o">=</span>ADD_PATH, <span class="nv">PathType</span><span class="o">=</span>T_IndexOnlyScan
<span class="o">[</span>20:14:54.118] <span class="o">[</span>PID 3917080] ADD_PATH: T_IndexOnlyScan <span class="o">(</span><span class="nv">startup</span><span class="o">=</span>0.28, <span class="nv">total</span><span class="o">=</span>43.27, <span class="nv">rows</span><span class="o">=</span>1000, <span class="nv">parent_rti</span><span class="o">=</span>1, <span class="nv">parent_oid</span><span class="o">=</span>26144<span class="o">)</span>
Received event: <span class="nv">PID</span><span class="o">=</span>3917080, <span class="nv">Type</span><span class="o">=</span>ADD_PATH, <span class="nv">PathType</span><span class="o">=</span>T_BitmapHeapScan
<span class="o">[</span>20:14:54.118] <span class="o">[</span>PID 3917080] ADD_PATH: T_BitmapHeapScan <span class="o">(</span><span class="nv">startup</span><span class="o">=</span>25.52, <span class="nv">total</span><span class="o">=</span>40.52, <span class="nv">rows</span><span class="o">=</span>1000, <span class="nv">parent_rti</span><span class="o">=</span>1, <span class="nv">parent_oid</span><span class="o">=</span>26144<span class="o">)</span>
Received event: <span class="nv">PID</span><span class="o">=</span>3917080, <span class="nv">Type</span><span class="o">=</span>ADD_PATH, <span class="nv">PathType</span><span class="o">=</span>T_SeqScan
<span class="o">[</span>20:14:54.118] <span class="o">[</span>PID 3917080] ADD_PATH: T_SeqScan <span class="o">(</span><span class="nv">startup</span><span class="o">=</span>0.00, <span class="nv">total</span><span class="o">=</span>15.00, <span class="nv">rows</span><span class="o">=</span>1000, <span class="nv">parent_oid</span><span class="o">=</span>26144<span class="o">)</span>
Received event: <span class="nv">PID</span><span class="o">=</span>3917080, <span class="nv">Type</span><span class="o">=</span>CREATE_PLAN, <span class="nv">PathType</span><span class="o">=</span>T_SeqScan
<span class="o">[</span>20:14:54.118] <span class="o">[</span>PID 3917080] CREATE_PLAN: T_SeqScan <span class="o">(</span><span class="nv">startup</span><span class="o">=</span>0.00, <span class="nv">total</span><span class="o">=</span>15.00<span class="o">)</span> <span class="o">[</span>CHOSEN]
</code></pre></div></div>

<p>The output already gives insights into the planning process. For example, the optimizer considered three different plans for scanning the <code class="language-plaintext highlighter-rouge">test1</code> table: a <code class="language-plaintext highlighter-rouge">SeqScan</code>, an <code class="language-plaintext highlighter-rouge">IndexOnlyScan</code>, and a <code class="language-plaintext highlighter-rouge">BitmapHeapScan</code>, each shown with its estimated costs. The final <code class="language-plaintext highlighter-rouge">CREATE_PLAN</code> event shows that the optimizer chose the <code class="language-plaintext highlighter-rouge">SeqScan</code> plan, the alternative with the lowest total cost, which is also reflected in the <code class="language-plaintext highlighter-rouge">EXPLAIN</code> output of the query:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">jan2</span><span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="p">(</span><span class="k">VERBOSE</span><span class="p">,</span> <span class="k">ANALYZE</span><span class="p">)</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">test1</span><span class="p">;</span>
 <span class="n">QUERY</span> <span class="n">PLAN</span>
<span class="c1">-------------------------------------------------------------------------------------------------------------</span>
 <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="k">public</span><span class="p">.</span><span class="n">test1</span>  <span class="p">(</span><span class="n">cost</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">00</span><span class="p">..</span><span class="mi">15</span><span class="p">.</span><span class="mi">00</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">width</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">119</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">291</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
   <span class="k">Output</span><span class="p">:</span> <span class="n">id</span>
 <span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">855</span> <span class="n">ms</span>
 <span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">437</span> <span class="n">ms</span>
<span class="p">(</span><span class="mi">4</span> <span class="k">rows</span><span class="p">)</span>
</code></pre></div></div>
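
<p>The text output is regular enough to be post-processed with a short script. The following Python sketch parses lines in the format shown above into dictionaries. The function name <code class="language-plaintext highlighter-rouge">parse_trace_line</code> and the regular expression are illustrative assumptions derived only from the sample output; they are not part of <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code>.</p>

```python
import re

# Matches the human-readable tracer lines shown above, for example:
# [20:14:54.118] [PID 3917080] ADD_PATH: T_SeqScan (startup=0.00, total=15.00, rows=1000, parent_oid=26144)
# The pattern is an assumption based on that sample output only.
LINE_RE = re.compile(
    r"\[(\d{2}:\d{2}:\d{2}\.\d{3})\] \[PID (\d+)\] (\w+): (\w+) \(([^)]*)\)"
)


def parse_trace_line(line):
    """Return a dict for an ADD_PATH or CREATE_PLAN line, or None for other lines."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    timestamp, pid, event, path_type, fields = match.groups()
    record = {
        "timestamp": timestamp,
        "pid": int(pid),
        "event": event,
        "path_type": path_type,
        "chosen": "[CHOSEN]" in line,
    }
    # The parenthesized part is a comma-separated list of key=value pairs.
    for pair in fields.split(","):
        key, _, value = pair.strip().partition("=")
        if not key:
            continue
        record[key] = float(value) if "." in value else int(value)
    return record


if __name__ == "__main__":
    sample = [
        "[20:14:54.118] [PID 3917080] ADD_PATH: T_SeqScan (startup=0.00, total=15.00, rows=1000, parent_oid=26144)",
        "Received event: PID=3917080, Type=CREATE_PLAN, PathType=T_SeqScan",
        "[20:14:54.118] [PID 3917080] CREATE_PLAN: T_SeqScan (startup=0.00, total=15.00) [CHOSEN]",
    ]
    for raw in sample:
        print(parse_trace_line(raw))
```

<p>Lines that do not follow the bracketed format, such as the <code class="language-plaintext highlighter-rouge">Received event</code> debug lines, yield <code class="language-plaintext highlighter-rouge">None</code> and can simply be skipped.</p>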

<p>To visualize the alternatives, the tracer output can be formatted as JSON and stored in a file. To run <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code> in that mode, the following command can be used:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>pg_plan_alternatives <span class="nt">-x</span> /home/jan/postgresql-sandbox/bin/REL_17_1_DEBUG/bin/postgres <span class="nt">-n</span> <span class="si">$(</span>pg_config <span class="nt">--includedir-server</span><span class="si">)</span>/nodes/nodetags.h <span class="nt">-j</span> <span class="nt">-o</span> examples/select.json
</code></pre></div></div>

<p>While the tracer is running, the <code class="language-plaintext highlighter-rouge">SELECT</code> query has to be executed again so that its planning events are captured. After that, the <code class="language-plaintext highlighter-rouge">visualize_plan_graph</code> script can be used to visualize the alternatives:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>visualize_plan_graph <span class="nt">-i</span> examples/select.json <span class="nt">-o</span> examples/select.svg <span class="nt">--db-url</span> psql://localhost/jan2 <span class="nt">-v</span> 
</code></pre></div></div>

<p>This produces an <code class="language-plaintext highlighter-rouge">.svg</code> file with the following content:</p>

<figure class="row">
    
    
    
    <div class="column" style="flex: 0 0 50.0%">
        
        
        
          <a href="/assets/img/pg_plan_alternatives/select.svg" target="_blank" rel="noopener">
        
            <img class="single" src="/assets/img/pg_plan_alternatives/select.svg" alt="select.svg" />
        
          </a>
        
    </div>
    
    
    <figcaption class="caption-style">Alternative query plans to perform a SELECT query</figcaption>
</figure>

<p>The graph shows all nodes considered for scanning the base relation <code class="language-plaintext highlighter-rouge">test1</code> and their costs. The green <code class="language-plaintext highlighter-rouge">T_SeqScan</code> node is the one finally chosen by the optimizer, with costs of <code class="language-plaintext highlighter-rouge">0.00..15.00</code>, while the gray-blue nodes are the alternatives considered but not chosen.</p>

<h2 id="select-query-with-where-clause">SELECT Query with WHERE Clause</h2>

<p>The second example is a <code class="language-plaintext highlighter-rouge">SELECT</code> query with a <code class="language-plaintext highlighter-rouge">WHERE</code> clause. The <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code> tracer is started in the same way as in the previous example. The query is as follows:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">test1</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">5</span><span class="p">;</span>
</code></pre></div></div>

<p>PostgreSQL will choose an <code class="language-plaintext highlighter-rouge">Index Only Scan</code> for this query, since there is an index on the <code class="language-plaintext highlighter-rouge">id</code> column. The <code class="language-plaintext highlighter-rouge">EXPLAIN</code> output of the query should look as follows:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">jan2</span><span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="p">(</span><span class="k">VERBOSE</span><span class="p">,</span> <span class="k">ANALYZE</span><span class="p">)</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">test1</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">5</span><span class="p">;</span>
 <span class="n">QUERY</span> <span class="n">PLAN</span>
<span class="c1">------------------------------------------------------------------------------------------------------------------------------</span>
 <span class="k">Index</span> <span class="k">Only</span> <span class="n">Scan</span> <span class="k">using</span> <span class="n">test1_pkey</span> <span class="k">on</span> <span class="k">public</span><span class="p">.</span><span class="n">test1</span>  <span class="p">(</span><span class="n">cost</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">28</span><span class="p">..</span><span class="mi">8</span><span class="p">.</span><span class="mi">29</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1</span> <span class="n">width</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">153</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">160</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
   <span class="k">Output</span><span class="p">:</span> <span class="n">id</span>
   <span class="k">Index</span> <span class="n">Cond</span><span class="p">:</span> <span class="p">(</span><span class="n">test1</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="mi">5</span><span class="p">)</span>
   <span class="n">Heap</span> <span class="n">Fetches</span><span class="p">:</span> <span class="mi">1</span>
 <span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">1</span><span class="p">.</span><span class="mi">166</span> <span class="n">ms</span>
 <span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">284</span> <span class="n">ms</span>
<span class="p">(</span><span class="mi">6</span> <span class="k">rows</span><span class="p">)</span>
</code></pre></div></div>

<p>The output of the plan visualization should look as follows:</p>

<figure class="row">
    
    
    
    <div class="column" style="flex: 0 0 50.0%">
        
        
        
          <a href="/assets/img/pg_plan_alternatives/select_where.svg" target="_blank" rel="noopener">
        
            <img class="single" src="/assets/img/pg_plan_alternatives/select_where.svg" alt="select_where.svg" />
        
          </a>
        
    </div>
    
    
    <figcaption class="caption-style">Alternative query plans to perform a SELECT WHERE query</figcaption>
</figure>

<p>The same nodes as in the previous example are shown, but now the <code class="language-plaintext highlighter-rouge">T_IndexOnlyScan</code> node is the one that was chosen by the optimizer with costs of <code class="language-plaintext highlighter-rouge">0.28..8.29</code>, while the <code class="language-plaintext highlighter-rouge">T_SeqScan</code> and <code class="language-plaintext highlighter-rouge">T_BitmapHeapScan</code> nodes are the alternatives that were considered but not chosen.</p>

<h2 id="select-query-with-order-by-clause">SELECT Query with ORDER BY Clause</h2>

<p>The third example is a <code class="language-plaintext highlighter-rouge">SELECT</code> query with an <code class="language-plaintext highlighter-rouge">ORDER BY</code> clause. The example query is as follows:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">test1</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">id</span><span class="p">;</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">EXPLAIN</code> output of the query shows that an <code class="language-plaintext highlighter-rouge">Index Only Scan</code> is used to scan the <code class="language-plaintext highlighter-rouge">test1</code> table. Because the index returns the rows already sorted by <code class="language-plaintext highlighter-rouge">id</code>, no separate sort step is needed.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">EXPLAIN</span> <span class="p">(</span><span class="k">VERBOSE</span><span class="p">,</span> <span class="k">ANALYZE</span><span class="p">)</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">test1</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">id</span><span class="p">;</span>
 <span class="n">QUERY</span> <span class="n">PLAN</span>
<span class="c1">-------------------------------------------------------------------------------------------------------------------------------------</span>
 <span class="k">Index</span> <span class="k">Only</span> <span class="n">Scan</span> <span class="k">using</span> <span class="n">test1_pkey</span> <span class="k">on</span> <span class="k">public</span><span class="p">.</span><span class="n">test1</span>  <span class="p">(</span><span class="n">cost</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">28</span><span class="p">..</span><span class="mi">43</span><span class="p">.</span><span class="mi">27</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">width</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">192</span><span class="p">..</span><span class="mi">5</span><span class="p">.</span><span class="mi">385</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
   <span class="k">Output</span><span class="p">:</span> <span class="n">id</span>
   <span class="n">Heap</span> <span class="n">Fetches</span><span class="p">:</span> <span class="mi">1000</span>
 <span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">1</span><span class="p">.</span><span class="mi">167</span> <span class="n">ms</span>
 <span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">5</span><span class="p">.</span><span class="mi">579</span> <span class="n">ms</span>
<span class="p">(</span><span class="mi">5</span> <span class="k">rows</span><span class="p">)</span>
</code></pre></div></div>

<p>However, this time the optimizer also considered producing the result by performing a <code class="language-plaintext highlighter-rouge">Seq Scan</code> and then sorting the output using a <code class="language-plaintext highlighter-rouge">T_Sort</code> node. This is followed by a <code class="language-plaintext highlighter-rouge">T_Result</code> node, an internal PostgreSQL node used for operations such as applying a projection. Since this path is more expensive than the <code class="language-plaintext highlighter-rouge">Index Only Scan</code> path, the optimizer did not choose it.</p>

<figure class="row">
    
    
    
    <div class="column">
        
        
        
          <a href="/assets/img/pg_plan_alternatives/select_order.svg" target="_blank" rel="noopener">
        
            <img class="single" src="/assets/img/pg_plan_alternatives/select_order.svg" alt="select_order.svg" />
        
          </a>
        
    </div>
    
    
    <figcaption class="caption-style">Alternative query plans to perform a SELECT ORDER BY query</figcaption>
</figure>

<h2 id="select-query-with-group-by-clause">SELECT Query with GROUP BY Clause</h2>

<p>The fourth example is a <code class="language-plaintext highlighter-rouge">SELECT</code> query with a <code class="language-plaintext highlighter-rouge">GROUP BY</code> clause. This time, a simple <code class="language-plaintext highlighter-rouge">COUNT</code> aggregation is performed, and a <code class="language-plaintext highlighter-rouge">GROUP BY</code> is applied on the <code class="language-plaintext highlighter-rouge">id</code> column. The example query is as follows:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">test1</span> <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">id</span><span class="p">;</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">EXPLAIN</code> output of the query shows that a <code class="language-plaintext highlighter-rouge">HashAggregate</code> node is used to perform the aggregation, which is fed by a <code class="language-plaintext highlighter-rouge">Seq Scan</code> node that scans the <code class="language-plaintext highlighter-rouge">test1</code> table. The <code class="language-plaintext highlighter-rouge">HashAggregate</code> node has an estimated cost of <code class="language-plaintext highlighter-rouge">20.00..30.00</code>.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">jan2</span><span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="p">(</span><span class="k">VERBOSE</span><span class="p">,</span> <span class="k">ANALYZE</span><span class="p">)</span> <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">test1</span> <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">id</span><span class="p">;</span>
 <span class="n">QUERY</span> <span class="n">PLAN</span>
<span class="c1">-------------------------------------------------------------------------------------------------------------------</span>
 <span class="n">HashAggregate</span>  <span class="p">(</span><span class="n">cost</span><span class="o">=</span><span class="mi">20</span><span class="p">.</span><span class="mi">00</span><span class="p">..</span><span class="mi">30</span><span class="p">.</span><span class="mi">00</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">width</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">3</span><span class="p">.</span><span class="mi">171</span><span class="p">..</span><span class="mi">4</span><span class="p">.</span><span class="mi">009</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
   <span class="k">Output</span><span class="p">:</span> <span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">),</span> <span class="n">id</span>
   <span class="k">Group</span> <span class="k">Key</span><span class="p">:</span> <span class="n">test1</span><span class="p">.</span><span class="n">id</span>
   <span class="n">Batches</span><span class="p">:</span> <span class="mi">1</span> <span class="n">Memory</span> <span class="k">Usage</span><span class="p">:</span> <span class="mi">193</span><span class="n">kB</span>
   <span class="o">-&gt;</span>  <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="k">public</span><span class="p">.</span><span class="n">test1</span>  <span class="p">(</span><span class="n">cost</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">00</span><span class="p">..</span><span class="mi">15</span><span class="p">.</span><span class="mi">00</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">width</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">176</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">769</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
         <span class="k">Output</span><span class="p">:</span> <span class="n">id</span>
 <span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">2</span><span class="p">.</span><span class="mi">297</span> <span class="n">ms</span>
 <span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">4</span><span class="p">.</span><span class="mi">637</span> <span class="n">ms</span>
<span class="p">(</span><span class="mi">8</span> <span class="k">rows</span><span class="p">)</span>
</code></pre></div></div>

<p>The plan visualization shows that the optimizer considered two further alternatives: aggregating on top of an <code class="language-plaintext highlighter-rouge">Index Only Scan</code>, and sorting the result of the <code class="language-plaintext highlighter-rouge">Seq Scan</code> before aggregating. Both are more expensive than the chosen plan and were therefore not selected. The chosen plan has a total cost of <code class="language-plaintext highlighter-rouge">30.00</code>, while the two alternatives have total costs of <code class="language-plaintext highlighter-rouge">58.28</code> and <code class="language-plaintext highlighter-rouge">82.33</code>.</p>

<figure class="row">
    
    
    
    <div class="column">
        
        
        
          <a href="/assets/img/pg_plan_alternatives/select_group.svg" target="_blank" rel="noopener">
        
            <img class="single" src="/assets/img/pg_plan_alternatives/select_group.svg" alt="select_group.svg" />
        
          </a>
        
    </div>
    
    
    <figcaption class="caption-style">Alternative query plans to perform a SELECT GROUP BY query</figcaption>
</figure>
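
<p>The cost figures above can be approximately reproduced by hand. The following Python sketch is a simplified version of the cost arithmetic, assuming PostgreSQL's default cost parameters (<code class="language-plaintext highlighter-rouge">cpu_tuple_cost = 0.01</code> and <code class="language-plaintext highlighter-rouge">cpu_operator_cost = 0.0025</code>) and taking the row count and scan costs from the <code class="language-plaintext highlighter-rouge">EXPLAIN</code> outputs shown in this article. The function names are hypothetical, and the planner's real cost functions cover many more cases (for example, sorts that spill to disk).</p>

```python
import math

# PostgreSQL default cost parameters (assumed to be unchanged on the test system).
CPU_TUPLE_COST = 0.01
CPU_OPERATOR_COST = 0.0025

ROWS = 1000                    # rows in test1, taken from the EXPLAIN output
GROUPS = 1000                  # id is unique, so every row forms its own group
SEQ_SCAN_TOTAL = 15.00         # total cost of the Seq Scan on test1
INDEX_ONLY_SCAN_TOTAL = 43.27  # total cost of the full Index Only Scan (rounded)


def hash_aggregate_cost(input_total, rows, groups):
    """Simplified HashAggregate cost with one group column and one aggregate."""
    startup = input_total
    startup += CPU_OPERATOR_COST * rows        # COUNT(*) transition per input row
    startup += CPU_OPERATOR_COST * rows        # hashing the group key per input row
    total = startup + CPU_TUPLE_COST * groups  # emitting one row per group
    return startup, total


def sort_cost(input_total, rows):
    """Simplified in-memory sort cost: roughly N * log2(N) comparisons."""
    startup = input_total + 2.0 * CPU_OPERATOR_COST * rows * math.log2(rows)
    total = startup + CPU_OPERATOR_COST * rows
    return startup, total


def group_aggregate_cost(input_total, rows, groups):
    """Simplified GroupAggregate cost over an already sorted input."""
    total = input_total
    total += CPU_OPERATOR_COST * rows  # comparing the group key per input row
    total += CPU_OPERATOR_COST * rows  # COUNT(*) transition per input row
    total += CPU_TUPLE_COST * groups   # emitting one row per group
    return total


if __name__ == "__main__":
    startup, total = hash_aggregate_cost(SEQ_SCAN_TOTAL, ROWS, GROUPS)
    print("HashAggregate over Seq Scan: %.2f..%.2f" % (startup, total))

    _, sorted_total = sort_cost(SEQ_SCAN_TOTAL, ROWS)
    print("Sort + GroupAggregate: %.2f"
          % group_aggregate_cost(sorted_total, ROWS, GROUPS))

    print("GroupAggregate over Index Only Scan: about %.2f"
          % group_aggregate_cost(INDEX_ONLY_SCAN_TOTAL, ROWS, GROUPS))
```

<p>Under these assumptions, the sketch yields <code class="language-plaintext highlighter-rouge">20.00..30.00</code> for the chosen plan and <code class="language-plaintext highlighter-rouge">82.33</code> for the sort-based alternative. The aggregation on top of the <code class="language-plaintext highlighter-rouge">Index Only Scan</code> comes out at <code class="language-plaintext highlighter-rouge">58.27</code> instead of <code class="language-plaintext highlighter-rouge">58.28</code>, because the input cost used here is the already rounded value from the <code class="language-plaintext highlighter-rouge">EXPLAIN</code> output.</p>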

<h2 id="join-query">JOIN Query</h2>

<p>The fifth example is a <code class="language-plaintext highlighter-rouge">JOIN</code> query, which joins the <code class="language-plaintext highlighter-rouge">test1</code> and <code class="language-plaintext highlighter-rouge">test2</code> tables on the <code class="language-plaintext highlighter-rouge">id</code> column. The query used in this example is:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">test1</span> <span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">test2</span> <span class="k">ON</span> <span class="p">(</span><span class="n">test1</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">test2</span><span class="p">.</span><span class="n">id</span><span class="p">);</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">EXPLAIN</code> output of the query shows that a <code class="language-plaintext highlighter-rouge">Hash Left Join</code> is used to perform the join, which is fed by two <code class="language-plaintext highlighter-rouge">Seq Scan</code> nodes that scan the <code class="language-plaintext highlighter-rouge">test1</code> and <code class="language-plaintext highlighter-rouge">test2</code> tables. The <code class="language-plaintext highlighter-rouge">Hash Left Join</code> node has an estimated cost of <code class="language-plaintext highlighter-rouge">27.50..45.14</code>.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">jan2</span><span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="p">(</span><span class="k">VERBOSE</span><span class="p">,</span> <span class="k">ANALYZE</span><span class="p">)</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">test1</span> <span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">test2</span> <span class="k">ON</span> <span class="p">(</span><span class="n">test1</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">test2</span><span class="p">.</span><span class="n">id</span><span class="p">);</span>
 <span class="n">QUERY</span> <span class="n">PLAN</span>
<span class="c1">-------------------------------------------------------------------------------------------------------------------------</span>
 <span class="n">Hash</span> <span class="k">Left</span> <span class="k">Join</span> <span class="p">(</span><span class="n">cost</span><span class="o">=</span><span class="mi">27</span><span class="p">.</span><span class="mi">50</span><span class="p">..</span><span class="mi">45</span><span class="p">.</span><span class="mi">14</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">width</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">625</span><span class="p">..</span><span class="mi">1</span><span class="p">.</span><span class="mi">422</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
   <span class="k">Output</span><span class="p">:</span> <span class="n">test1</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">test2</span><span class="p">.</span><span class="n">id</span>
   <span class="k">Inner</span> <span class="k">Unique</span><span class="p">:</span> <span class="k">true</span>
   <span class="n">Hash</span> <span class="n">Cond</span><span class="p">:</span> <span class="p">(</span><span class="n">test1</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">test2</span><span class="p">.</span><span class="n">id</span><span class="p">)</span>
   <span class="o">-&gt;</span>  <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="k">public</span><span class="p">.</span><span class="n">test1</span>  <span class="p">(</span><span class="n">cost</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">00</span><span class="p">..</span><span class="mi">15</span><span class="p">.</span><span class="mi">00</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">width</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">038</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">220</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
         <span class="k">Output</span><span class="p">:</span> <span class="n">test1</span><span class="p">.</span><span class="n">id</span>
   <span class="o">-&gt;</span>  <span class="n">Hash</span> <span class="p">(</span><span class="n">cost</span><span class="o">=</span><span class="mi">15</span><span class="p">.</span><span class="mi">00</span><span class="p">..</span><span class="mi">15</span><span class="p">.</span><span class="mi">00</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">width</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">571</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">572</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
         <span class="k">Output</span><span class="p">:</span> <span class="n">test2</span><span class="p">.</span><span class="n">id</span>
         <span class="n">Buckets</span><span class="p">:</span> <span class="mi">1024</span>  <span class="n">Batches</span><span class="p">:</span> <span class="mi">1</span> <span class="n">Memory</span> <span class="k">Usage</span><span class="p">:</span> <span class="mi">44</span><span class="n">kB</span>
         <span class="o">-&gt;</span>  <span class="n">Seq</span> <span class="n">Scan</span> <span class="k">on</span> <span class="k">public</span><span class="p">.</span><span class="n">test2</span>  <span class="p">(</span><span class="n">cost</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">00</span><span class="p">..</span><span class="mi">15</span><span class="p">.</span><span class="mi">00</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">width</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">019</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">191</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1000</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
               <span class="k">Output</span><span class="p">:</span> <span class="n">test2</span><span class="p">.</span><span class="n">id</span>
 <span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">3</span><span class="p">.</span><span class="mi">436</span> <span class="n">ms</span>
 <span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">1</span><span class="p">.</span><span class="mi">551</span> <span class="n">ms</span>
<span class="p">(</span><span class="mi">13</span> <span class="k">rows</span><span class="p">)</span>
</code></pre></div></div>

<p>The plan visualization shows that the optimizer considered many alternatives for joining the tables. Most alternatives try joining <code class="language-plaintext highlighter-rouge">test1</code> with <code class="language-plaintext highlighter-rouge">test2</code> using different join algorithms (e.g., <code class="language-plaintext highlighter-rouge">Nested Loop</code>, <code class="language-plaintext highlighter-rouge">Merge Join</code>, or <code class="language-plaintext highlighter-rouge">Hash Join</code>) and different access paths for the base relations, such as <code class="language-plaintext highlighter-rouge">Seq Scan</code>, <code class="language-plaintext highlighter-rouge">Index Only Scan</code>, or <code class="language-plaintext highlighter-rouge">Bitmap Heap Scan</code>. The optimizer also considered flipping the join order and joining <code class="language-plaintext highlighter-rouge">test2</code> with <code class="language-plaintext highlighter-rouge">test1</code>. However, all of these alternatives are more expensive than the chosen plan. For example, the Nested Loop Join using an <code class="language-plaintext highlighter-rouge">Index Only Scan</code> on <code class="language-plaintext highlighter-rouge">test1</code> and a <code class="language-plaintext highlighter-rouge">Seq Scan</code> on <code class="language-plaintext highlighter-rouge">test2</code> has an estimated cost of <code class="language-plaintext highlighter-rouge">27515.83</code>, which is much higher than the chosen plan’s total cost of <code class="language-plaintext highlighter-rouge">45.14</code>.</p>

<figure class="row">
    
    
    
    <div class="column">
        
        
        
          <a href="/assets/img/pg_plan_alternatives/join.svg" target="_blank" rel="noopener">
        
            <img class="single" src="/assets/img/pg_plan_alternatives/join.svg" alt="join.svg" />
        
          </a>
        
    </div>
    
    
    <figcaption class="caption-style">Alternative query plans to perform a JOIN query</figcaption>
</figure>
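
<p>To get a feeling for why a two-table join already produces so many candidates, the following Python sketch enumerates the naive cross product of the join algorithms, access paths, and join orders named above. This is only an illustrative upper bound under simplifying assumptions: the function name is hypothetical, and the planner does not enumerate plans this way. It discards dominated paths early and skips combinations that are not applicable, so the tracer reports fewer alternatives.</p>

```python
from itertools import permutations, product

# Names taken from the text above. The cross product is a naive upper bound,
# not the set of paths that the PostgreSQL planner actually generates.
JOIN_ALGORITHMS = ["Nested Loop", "Merge Join", "Hash Join"]
ACCESS_PATHS = ["Seq Scan", "Index Only Scan", "Bitmap Heap Scan"]
RELATIONS = ["test1", "test2"]


def naive_join_alternatives():
    """Yield (algorithm, outer relation, outer path, inner relation, inner path)."""
    for outer, inner in permutations(RELATIONS, 2):
        for algorithm, outer_path, inner_path in product(
            JOIN_ALGORITHMS, ACCESS_PATHS, ACCESS_PATHS
        ):
            yield algorithm, outer, outer_path, inner, inner_path


if __name__ == "__main__":
    alternatives = list(naive_join_alternatives())
    print(len(alternatives), "naive combinations")
    for alternative in alternatives[:3]:
        print(alternative)
```

<p>With three join algorithms, three access paths per relation, and two join orders, the sketch lists 54 combinations for this small query.</p>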

<h2 id="join-query-with-where-clause">JOIN Query with WHERE Clause</h2>

<p>The last example is a <code class="language-plaintext highlighter-rouge">JOIN</code> query with a <code class="language-plaintext highlighter-rouge">WHERE</code> clause. The example query is as follows:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">test1</span> <span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">test2</span> <span class="k">ON</span> <span class="p">(</span><span class="n">test1</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">test2</span><span class="p">.</span><span class="n">id</span><span class="p">)</span> <span class="k">WHERE</span> <span class="n">test1</span><span class="p">.</span><span class="n">id</span><span class="o">=</span><span class="mi">123</span><span class="p">;</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">EXPLAIN</code> output indicates that a <code class="language-plaintext highlighter-rouge">Nested Loop Left Join</code> is used to perform the join, which is fed by an <code class="language-plaintext highlighter-rouge">Index Only Scan</code> on the <code class="language-plaintext highlighter-rouge">test1</code> table and an <code class="language-plaintext highlighter-rouge">Index Only Scan</code> on the <code class="language-plaintext highlighter-rouge">test2</code> table. The <code class="language-plaintext highlighter-rouge">Nested Loop Left Join</code> node has an estimated cost of <code class="language-plaintext highlighter-rouge">0.55..16.60</code>.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">jan2</span><span class="o">=#</span> <span class="k">EXPLAIN</span> <span class="p">(</span><span class="k">VERBOSE</span><span class="p">,</span> <span class="k">ANALYZE</span><span class="p">)</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">test1</span> <span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">test2</span> <span class="k">ON</span> <span class="p">(</span><span class="n">test1</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">test2</span><span class="p">.</span><span class="n">id</span><span class="p">)</span> <span class="k">WHERE</span> <span class="n">test1</span><span class="p">.</span><span class="n">id</span><span class="o">=</span><span class="mi">123</span><span class="p">;</span>
 <span class="n">QUERY</span> <span class="n">PLAN</span>
<span class="c1">------------------------------------------------------------------------------------------------------------------------------------</span>
 <span class="n">Nested</span> <span class="n">Loop</span> <span class="k">Left</span> <span class="k">Join</span> <span class="p">(</span><span class="n">cost</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">55</span><span class="p">..</span><span class="mi">16</span><span class="p">.</span><span class="mi">60</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1</span> <span class="n">width</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">183</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">189</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
   <span class="k">Output</span><span class="p">:</span> <span class="n">test1</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">test2</span><span class="p">.</span><span class="n">id</span>
   <span class="k">Inner</span> <span class="k">Unique</span><span class="p">:</span> <span class="k">true</span>
   <span class="o">-&gt;</span>  <span class="k">Index</span> <span class="k">Only</span> <span class="n">Scan</span> <span class="k">using</span> <span class="n">test1_pkey</span> <span class="k">on</span> <span class="k">public</span><span class="p">.</span><span class="n">test1</span>  <span class="p">(</span><span class="n">cost</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">28</span><span class="p">..</span><span class="mi">8</span><span class="p">.</span><span class="mi">29</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1</span> <span class="n">width</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">139</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">143</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
         <span class="k">Output</span><span class="p">:</span> <span class="n">test1</span><span class="p">.</span><span class="n">id</span>
         <span class="k">Index</span> <span class="n">Cond</span><span class="p">:</span> <span class="p">(</span><span class="n">test1</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="mi">123</span><span class="p">)</span>
         <span class="n">Heap</span> <span class="n">Fetches</span><span class="p">:</span> <span class="mi">1</span>
   <span class="o">-&gt;</span>  <span class="k">Index</span> <span class="k">Only</span> <span class="n">Scan</span> <span class="k">using</span> <span class="n">test2_pkey</span> <span class="k">on</span> <span class="k">public</span><span class="p">.</span><span class="n">test2</span>  <span class="p">(</span><span class="n">cost</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">28</span><span class="p">..</span><span class="mi">8</span><span class="p">.</span><span class="mi">29</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1</span> <span class="n">width</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="p">(</span><span class="n">actual</span> <span class="nb">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">032</span><span class="p">..</span><span class="mi">0</span><span class="p">.</span><span class="mi">032</span> <span class="k">rows</span><span class="o">=</span><span class="mi">1</span> <span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
         <span class="k">Output</span><span class="p">:</span> <span class="n">test2</span><span class="p">.</span><span class="n">id</span>
         <span class="k">Index</span> <span class="n">Cond</span><span class="p">:</span> <span class="p">(</span><span class="n">test2</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="mi">123</span><span class="p">)</span>
         <span class="n">Heap</span> <span class="n">Fetches</span><span class="p">:</span> <span class="mi">1</span>
 <span class="n">Planning</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">1</span><span class="p">.</span><span class="mi">116</span> <span class="n">ms</span>
 <span class="n">Execution</span> <span class="nb">Time</span><span class="p">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">336</span> <span class="n">ms</span>
<span class="p">(</span><span class="mi">13</span> <span class="k">rows</span><span class="p">)</span>
</code></pre></div></div>

<p>The plan visualization shows that the optimizer considered different ways to access the base relations and different join algorithms. It also considered performing a <code class="language-plaintext highlighter-rouge">Bitmap Heap Scan</code> on <code class="language-plaintext highlighter-rouge">test2</code> and materializing the result as input for the <code class="language-plaintext highlighter-rouge">Nested Loop</code> join.</p>

<figure class="row">
    
    
    
    <div class="column">
        
        
        
          <a href="/assets/img/pg_plan_alternatives/join_where.svg" target="_blank" rel="noopener">
        
            <img class="single" src="/assets/img/pg_plan_alternatives/join_where.svg" alt="join_where.svg" />
        
          </a>
        
    </div>
    
    
    <figcaption class="caption-style">Alternative query plans to perform a JOIN WHERE query</figcaption>
</figure>

<h1 id="conclusion">Conclusion</h1>
<p>This article discussed <a href="https://github.com/jnidzwetzki/pg_plan_alternatives"><code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code></a>, a tool that uses eBPF to trace the alternative query plans that the PostgreSQL optimizer considers during the planning phase. The tool consists of an eBPF program that runs in kernel space and a user-space script that collects the events emitted by the eBPF program and visualizes the alternatives. With <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code>, you can gain insight into the optimizer’s decision-making process and tune system parameters accordingly. The article also explained the basics of cost-based optimization and the structure of query plans in PostgreSQL, and it showed several examples of how <code class="language-plaintext highlighter-rouge">pg_plan_alternatives</code> can be used to inspect the alternative plans for different types of queries. The tool is open source and available on GitHub.</p>]]></content><author><name>Jan Nidzwetzki</name></author><category term="PostgreSQL" /><category term="Performance" /><category term="eBPF" /><category term="Profiling" /><summary type="html"><![CDATA[PostgreSQL uses a cost-based optimizer (CBO) to determine the best execution plan for a given query. The optimizer considers multiple alternative plans during the planning phase. Using the EXPLAIN command, a user can only inspect the chosen plan, but not the alternatives that were considered. To address this gap, I developed pg_plan_alternatives, a tool that uses eBPF to instrument the PostgreSQL optimizer and trace all alternative plans and their costs that were considered during the planning phase. This information helps the user understand the optimizer’s decision-making process and tune system parameters. 
This article explains how pg_plan_alternatives works, provides examples, and discusses the insights the tool can provide.]]></summary></entry><entry><title type="html">eBPF Tracing of PostgreSQL Spinlocks</title><link href="https://jnidzwetzki.github.io/2026/02/08/postgresql-spinlocks.html" rel="alternate" type="text/html" title="eBPF Tracing of PostgreSQL Spinlocks" /><published>2026-02-08T00:00:00+00:00</published><updated>2026-02-08T00:00:00+00:00</updated><id>https://jnidzwetzki.github.io/2026/02/08/postgresql-spinlocks</id><content type="html" xml:base="https://jnidzwetzki.github.io/2026/02/08/postgresql-spinlocks.html"><![CDATA[<p>PostgreSQL uses a process-based architecture where each connection is handled by a separate process. Some data structures are shared between these processes, for example, the shared buffer cache or the write-ahead log (WAL). To coordinate access to these shared resources, PostgreSQL uses several locking mechanisms, including spinlocks. Spinlocks are intended for very short-term protection of shared structures: rather than immediately putting a waiting process to sleep, they busy-wait and repeatedly check whether the lock is free. Under contention, PostgreSQL also applies an adaptive backoff that can include brief sleeps.</p>

<p>This article explains what spinlocks are and how they are implemented in PostgreSQL. It also describes how spinlocks can be monitored and demonstrates how my new <code class="language-plaintext highlighter-rouge">pg_spinlock_tracer</code> <a href="https://github.com/jnidzwetzki/pg-lock-tracer">tool</a> can be used to trace spinlock internals using eBPF.</p>

<!--more-->

<h1 id="what-are-spinlocks">What are Spinlocks?</h1>
<p>When multiple processes need to access a shared resource, locks are used to ensure that only one process can modify the resource at a time. With a conventional blocking lock, a process that finds the lock unavailable is put to sleep until the lock can be acquired. This reduces CPU usage since the waiting process does not consume CPU cycles while sleeping. However, putting a process to sleep and waking it up again involves context switches, which take time and add latency to the operation. If the lock is expected to be held for a very short time, it may be more efficient for the waiting process to continuously check if the lock is available instead of sleeping. That is what spinlocks do: the waiting process spins in a loop, repeatedly checking the lock’s status until the lock can be acquired. Using a spinlock avoids the sleep/wakeup latency but consumes CPU cycles while spinning. If the hardware has only a few CPU cores, spinning can waste CPU cycles that other processes could have used and lead to worse overall performance.</p>

<h1 id="implementation-in-postgresql">Implementation in PostgreSQL</h1>
<p>The PostgreSQL implementation of spinlocks is mainly in <code class="language-plaintext highlighter-rouge">src/include/storage/s_lock.h</code> and <code class="language-plaintext highlighter-rouge">src/backend/storage/lmgr/s_lock.c</code>. The spinlock API provides four basic operations:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">SpinLockInit</code>: Initializes a spinlock.</li>
  <li><code class="language-plaintext highlighter-rouge">SpinLockAcquire</code>: Acquires a spinlock, blocking until it is available.</li>
  <li><code class="language-plaintext highlighter-rouge">SpinLockRelease</code>: Releases a spinlock.</li>
  <li><code class="language-plaintext highlighter-rouge">SpinLockFree</code>: Checks if a spinlock is free.</li>
</ul>

<p><em>Note:</em> <code class="language-plaintext highlighter-rouge">SpinLockAcquire</code> can also <a href="https://github.com/postgres/postgres/blob/7467041cde9ed1966cb3ea18da8ac119b462c2e4/src/backend/storage/lmgr/s_lock.c#L89">raise a <code class="language-plaintext highlighter-rouge">PANIC</code> error</a> if the lock cannot be acquired after a fixed number of delayed retries. In that case, the server terminates, performs crash recovery on restart, and becomes available again once recovery finishes.</p>

<h2 id="using-spinlocks">Using Spinlocks</h2>
<p>To use a spinlock, it must first be initialized using <code class="language-plaintext highlighter-rouge">SpinLockInit</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">slock_t</span> <span class="n">mutex</span><span class="p">;</span>
<span class="n">SpinLockInit</span><span class="p">(</span><span class="o">&amp;</span><span class="n">mutex</span><span class="p">);</span>
</code></pre></div></div>

<p>After initialization, the lock can be acquired and released as needed:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">SpinLockAcquire</span><span class="p">(</span><span class="o">&amp;</span><span class="n">mutex</span><span class="p">);</span>
<span class="cm">/* critical section */</span>
<span class="n">SpinLockRelease</span><span class="p">(</span><span class="o">&amp;</span><span class="n">mutex</span><span class="p">);</span>
</code></pre></div></div>

<p>To determine if a spinlock is currently held by another process, the function <code class="language-plaintext highlighter-rouge">SpinLockFree</code> can be used:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">SpinLockFree</span><span class="p">(</span><span class="o">&amp;</span><span class="n">mutex</span><span class="p">))</span>
<span class="p">{</span>
    <span class="cm">/* lock is currently held */</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Spinlocks are used in several places in the PostgreSQL codebase, for example, to coordinate access in the write-ahead log (WAL) <a href="https://github.com/postgres/postgres/blob/7467041cde9ed1966cb3ea18da8ac119b462c2e4/src/backend/access/transam/xlog.c#L1137">implementation</a> or during <a href="https://github.com/postgres/postgres/blob/1653ce5236c4948550e52d15d54e4b6bb66a23b1/src/backend/postmaster/checkpointer.c#L426">checkpoints</a>.</p>

<h2 id="implementation-details">Implementation Details</h2>
<p>The implementation is split into a platform-independent part and platform-specific parts. The platform-independent code in <code class="language-plaintext highlighter-rouge">s_lock.c</code> defines the API and higher-level behavior, while <code class="language-plaintext highlighter-rouge">s_lock.h</code> pulls in platform-specific assembly implementations depending on the target architecture.</p>

<h3 id="acquiring-a-spinlock">Acquiring a Spinlock</h3>
<p>To acquire a spinlock, PostgreSQL performs an atomic test-and-set (TAS) on the lock variable. The lock value is 0 when <a href="https://github.com/postgres/postgres/blob/7467041cde9ed1966cb3ea18da8ac119b462c2e4/src/backend/storage/lmgr/s_lock.c#L118">free</a> and 1 when held. The TAS operation is atomic to avoid races where two processes both observe a free lock and try to acquire it simultaneously.</p>

<p>The platform-independent code for acquiring a lock <a href="https://github.com/postgres/postgres/blob/7467041cde9ed1966cb3ea18da8ac119b462c2e4/src/backend/storage/lmgr/s_lock.c#L97C1-L112C2">looks as follows</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">s_lock</span><span class="p">(</span><span class="k">volatile</span> <span class="n">slock_t</span> <span class="o">*</span><span class="n">lock</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">file</span><span class="p">,</span> <span class="kt">int</span> <span class="n">line</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">func</span><span class="p">)</span>
<span class="p">{</span>
	<span class="n">SpinDelayStatus</span> <span class="n">delayStatus</span><span class="p">;</span>

	<span class="n">init_spin_delay</span><span class="p">(</span><span class="o">&amp;</span><span class="n">delayStatus</span><span class="p">,</span> <span class="n">file</span><span class="p">,</span> <span class="n">line</span><span class="p">,</span> <span class="n">func</span><span class="p">);</span>

	<span class="k">while</span> <span class="p">(</span><span class="n">TAS_SPIN</span><span class="p">(</span><span class="n">lock</span><span class="p">))</span>
	<span class="p">{</span>
		<span class="n">perform_spin_delay</span><span class="p">(</span><span class="o">&amp;</span><span class="n">delayStatus</span><span class="p">);</span>
	<span class="p">}</span>

	<span class="n">finish_spin_delay</span><span class="p">(</span><span class="o">&amp;</span><span class="n">delayStatus</span><span class="p">);</span>

	<span class="k">return</span> <span class="n">delayStatus</span><span class="p">.</span><span class="n">delays</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>A struct <code class="language-plaintext highlighter-rouge">SpinDelayStatus</code> is used to track the number of spins and delays (this will be discussed in the next section). The platform-dependent macro <code class="language-plaintext highlighter-rouge">TAS_SPIN</code> performs the fast-path check and the actual test-and-set operation on the lock variable. As long as the lock is held by another process, <code class="language-plaintext highlighter-rouge">TAS_SPIN</code> returns 1 and the loop continues, calling <code class="language-plaintext highlighter-rouge">perform_spin_delay</code> before the next attempt. Once the lock becomes available, <code class="language-plaintext highlighter-rouge">TAS_SPIN</code> returns 0 and the loop terminates.</p>

<p>The implementation of <code class="language-plaintext highlighter-rouge">TAS_SPIN</code> for the x86-64 architecture <a href="https://github.com/postgres/postgres/blob/7467041cde9ed1966cb3ea18da8ac119b462c2e4/src/include/storage/s_lock.h#L216C1-L230C2">looks as follows</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define TAS_SPIN(lock)    (*(lock) ? 1 : TAS(lock))
</span>
<span class="k">static</span> <span class="n">__inline__</span> <span class="kt">int</span>
<span class="nf">tas</span><span class="p">(</span><span class="k">volatile</span> <span class="n">slock_t</span> <span class="o">*</span><span class="n">lock</span><span class="p">)</span>
<span class="p">{</span>
	<span class="n">slock_t</span>		<span class="n">_res</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>

	<span class="n">__asm__</span> <span class="n">__volatile__</span><span class="p">(</span>
		<span class="s">"   lock         </span><span class="se">\n</span><span class="s">"</span>
		<span class="s">"   xchgb  %0,%1 </span><span class="se">\n</span><span class="s">"</span>
<span class="o">:</span>		<span class="s">"+q"</span><span class="p">(</span><span class="n">_res</span><span class="p">),</span> <span class="s">"+m"</span><span class="p">(</span><span class="o">*</span><span class="n">lock</span><span class="p">)</span>
<span class="o">:</span>		<span class="cm">/* no inputs */</span>
<span class="o">:</span>		<span class="s">"memory"</span><span class="p">,</span> <span class="s">"cc"</span><span class="p">);</span>
	<span class="k">return</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="n">_res</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The macro <code class="language-plaintext highlighter-rouge">TAS_SPIN</code> first checks whether the lock variable is non-zero; if so, it returns 1 immediately without performing the atomic exchange. If the lock variable is 0, it calls <code class="language-plaintext highlighter-rouge">TAS(lock)</code> (which ultimately invokes <code class="language-plaintext highlighter-rouge">tas</code>) to perform the atomic test-and-set operation.</p>

<p>The <code class="language-plaintext highlighter-rouge">tas</code> function performs the atomic exchange using inline assembly. The <code class="language-plaintext highlighter-rouge">lock</code> prefix ensures the instruction is executed atomically across multiple CPU cores. The <code class="language-plaintext highlighter-rouge">xchgb</code> instruction swaps <code class="language-plaintext highlighter-rouge">_res</code> and the lock variable: <code class="language-plaintext highlighter-rouge">_res</code> starts at 1, so if the lock was free (0), the swap sets the lock to 1 and <code class="language-plaintext highlighter-rouge">_res</code> becomes 0 (success). If the lock was already 1, the swap exchanges 1 for 1, so the lock is unchanged and <code class="language-plaintext highlighter-rouge">_res</code> remains 1 (failure to acquire). The function returns <code class="language-plaintext highlighter-rouge">_res</code> (0 on success, 1 on failure).</p>
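
<p>The same test-and-test-and-set idea can be sketched portably with C11 atomics instead of inline assembly. The following is a minimal illustration, not PostgreSQL code: the names <code class="language-plaintext highlighter-rouge">demo_slock_t</code>, <code class="language-plaintext highlighter-rouge">demo_tas</code>, <code class="language-plaintext highlighter-rouge">demo_tas_spin</code>, and <code class="language-plaintext highlighter-rouge">demo_unlock</code> are hypothetical, and the memory orderings are one reasonable choice for a lock, not something taken from <code class="language-plaintext highlighter-rouge">s_lock.h</code>.</p>

```c
#include <stdatomic.h>

/* Hypothetical stand-in for slock_t: 0 = free, 1 = held. */
typedef atomic_uchar demo_slock_t;

/* Atomic exchange, analogous to tas(): returns 0 if the lock was acquired. */
static inline int
demo_tas(demo_slock_t *lock)
{
	return (int) atomic_exchange_explicit(lock, 1, memory_order_acquire);
}

/* Analogous to TAS_SPIN: read first and skip the exchange while the lock is held. */
static inline int
demo_tas_spin(demo_slock_t *lock)
{
	if (atomic_load_explicit(lock, memory_order_relaxed) != 0)
		return 1;
	return demo_tas(lock);
}

/* Release the lock by storing 0. */
static inline void
demo_unlock(demo_slock_t *lock)
{
	atomic_store_explicit(lock, 0, memory_order_release);
}
```

<p>A caller would loop on <code class="language-plaintext highlighter-rouge">demo_tas_spin</code> and back off between attempts, as <code class="language-plaintext highlighter-rouge">s_lock</code> does. The plain read in <code class="language-plaintext highlighter-rouge">demo_tas_spin</code> keeps waiting processes from issuing atomic exchanges on the shared cache line while the lock is still held.</p>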

<h3 id="spinlock-contention">Spinlock Contention</h3>
<p>When the lock cannot be acquired, the <a href="https://github.com/postgres/postgres/blob/7467041cde9ed1966cb3ea18da8ac119b462c2e4/src/backend/storage/lmgr/s_lock.c#L126">function <code class="language-plaintext highlighter-rouge">perform_spin_delay</code></a> is invoked. It implements an adaptive backoff and looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">perform_spin_delay</span><span class="p">(</span><span class="n">SpinDelayStatus</span> <span class="o">*</span><span class="n">status</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">[...]</span>
	<span class="k">if</span> <span class="p">(</span><span class="o">++</span><span class="p">(</span><span class="n">status</span><span class="o">-&gt;</span><span class="n">spins</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">spins_per_delay</span><span class="p">)</span>
	<span class="p">{</span>
		<span class="k">if</span> <span class="p">(</span><span class="o">++</span><span class="p">(</span><span class="n">status</span><span class="o">-&gt;</span><span class="n">delays</span><span class="p">)</span> <span class="o">&gt;</span> <span class="n">NUM_DELAYS</span><span class="p">)</span>
			<span class="n">s_lock_stuck</span><span class="p">(</span><span class="n">status</span><span class="o">-&gt;</span><span class="n">file</span><span class="p">,</span> <span class="n">status</span><span class="o">-&gt;</span><span class="n">line</span><span class="p">,</span> <span class="n">status</span><span class="o">-&gt;</span><span class="n">func</span><span class="p">);</span>

		<span class="k">if</span> <span class="p">(</span><span class="n">status</span><span class="o">-&gt;</span><span class="n">cur_delay</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="cm">/* first time to delay? */</span>
			<span class="n">status</span><span class="o">-&gt;</span><span class="n">cur_delay</span> <span class="o">=</span> <span class="n">MIN_DELAY_USEC</span><span class="p">;</span>

		<span class="p">[...]</span>
		<span class="n">pg_usleep</span><span class="p">(</span><span class="n">status</span><span class="o">-&gt;</span><span class="n">cur_delay</span><span class="p">);</span>
		<span class="p">[...]</span>

		<span class="cm">/* increase delay by a random fraction between 1X and 2X */</span>
		<span class="n">status</span><span class="o">-&gt;</span><span class="n">cur_delay</span> <span class="o">+=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="p">(</span><span class="n">status</span><span class="o">-&gt;</span><span class="n">cur_delay</span> <span class="o">*</span>
			<span class="n">pg_prng_double</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pg_global_prng_state</span><span class="p">)</span> <span class="o">+</span> <span class="mi">0</span><span class="p">.</span><span class="mi">5</span><span class="p">);</span>

		<span class="cm">/* wrap back to minimum delay when max is exceeded */</span>
		<span class="k">if</span> <span class="p">(</span><span class="n">status</span><span class="o">-&gt;</span><span class="n">cur_delay</span> <span class="o">&gt;</span> <span class="n">MAX_DELAY_USEC</span><span class="p">)</span>
			<span class="n">status</span><span class="o">-&gt;</span><span class="n">cur_delay</span> <span class="o">=</span> <span class="n">MIN_DELAY_USEC</span><span class="p">;</span>

		<span class="n">status</span><span class="o">-&gt;</span><span class="n">spins</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
	<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On each invocation of the function, the number of spins is increased by one. Once the number of spins reaches a threshold (<code class="language-plaintext highlighter-rouge">spins_per_delay</code>), PostgreSQL sleeps (<code class="language-plaintext highlighter-rouge">pg_usleep</code>) before the next attempt to acquire the lock; the first sleep lasts <code class="language-plaintext highlighter-rouge">MIN_DELAY_USEC</code> microseconds. This turns PostgreSQL’s spinlocks into a hybrid approach (spin first, then sleep) and serves as a safety mechanism to prevent excessive CPU usage under high contention. The sleep is only performed after that many unsuccessful spins, which indicates that the lock has been held for an extended period by another process.</p>

<p>Additionally, after every sleep the delay grows by a random fraction of its current value, so each delay is between one and two times the previous one and the delay increases roughly exponentially with the number of delays. If the delay exceeds the maximum value (<code class="language-plaintext highlighter-rouge">MAX_DELAY_USEC</code>, 1000000 microseconds, i.e., one second), it is wrapped back to the minimum value (<code class="language-plaintext highlighter-rouge">MIN_DELAY_USEC</code>, 1000 microseconds, i.e., one millisecond). This prevents the delay from growing indefinitely and ensures that the process will eventually wake up and try to acquire the lock again. The random fraction adds jitter, which can help reduce contention by preventing multiple processes from waking up and trying to acquire the lock at the same time.</p>

<p>If the number of delays exceeds <code class="language-plaintext highlighter-rouge">NUM_DELAYS</code> (1000), PostgreSQL calls <a href="https://github.com/postgres/postgres/blob/7467041cde9ed1966cb3ea18da8ac119b462c2e4/src/backend/storage/lmgr/s_lock.c#L78-L92">s_lock_stuck</a>, which raises a <code class="language-plaintext highlighter-rouge">PANIC</code> error indicating that the lock appears stuck.</p>
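
<p>To make the growth and wrap-around of the delay concrete, the following sketch replays the arithmetic of <code class="language-plaintext highlighter-rouge">perform_spin_delay</code> with a fixed fraction instead of a random one. It is an illustration under stated assumptions, not PostgreSQL code: all <code class="language-plaintext highlighter-rouge">demo_</code> and <code class="language-plaintext highlighter-rouge">DEMO_</code> names are hypothetical, the constants are copied from the values mentioned above, and the <code class="language-plaintext highlighter-rouge">fraction</code> parameter stands in for the value returned by <code class="language-plaintext highlighter-rouge">pg_prng_double</code>.</p>

```c
#include <assert.h>

#define DEMO_MIN_DELAY_USEC 1000L
#define DEMO_MAX_DELAY_USEC 1000000L

/*
 * One backoff step: grow the delay by fraction * cur_delay (rounded) and
 * wrap back to the minimum once the maximum is exceeded.
 * fraction stands in for the random value in [0, 1).
 */
static long
demo_next_delay(long cur_delay, double fraction)
{
	assert(fraction >= 0.0 && fraction < 1.0);

	cur_delay += (long) (cur_delay * fraction + 0.5);
	if (cur_delay > DEMO_MAX_DELAY_USEC)
		cur_delay = DEMO_MIN_DELAY_USEC;
	return cur_delay;
}

/* Total time slept (in microseconds) over num_delays consecutive sleeps. */
static long
demo_total_sleep_usec(int num_delays, double fraction)
{
	long		cur_delay = DEMO_MIN_DELAY_USEC;	/* length of the first sleep */
	long		total = 0;

	for (int i = 0; i < num_delays; i++)
	{
		total += cur_delay;
		cur_delay = demo_next_delay(cur_delay, fraction);
	}
	return total;
}
```

<p>With a fraction of 0.5, the first three sleeps are 1000, 1500, and 2250 microseconds. Calling <code class="language-plaintext highlighter-rouge">demo_total_sleep_usec</code> with 1000 delays and different fractions gives a rough feel for how long a backend waits before <code class="language-plaintext highlighter-rouge">s_lock_stuck</code> is reached; the real duration varies because the fraction is random.</p>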

<h1 id="monitoring-spinlocks">Monitoring Spinlocks</h1>
<p>Monitoring spinlocks and understanding spinlock contention can be crucial for diagnosing performance issues in PostgreSQL. In the following sections, artificial spinlock contention is created and then observed using the <code class="language-plaintext highlighter-rouge">pg_stat_activity</code> view and the <code class="language-plaintext highlighter-rouge">pg_spinlock_tracer</code> tool.</p>

<p><em>Note:</em> This example should not be executed on a production system, since it will cause the server to become unresponsive, and the server may eventually terminate due to the <code class="language-plaintext highlighter-rouge">PANIC</code> error raised by <code class="language-plaintext highlighter-rouge">s_lock_stuck</code>.</p>

<h2 id="creating-artificial-spinlock-contention">Creating Artificial Spinlock Contention</h2>
<p>To create this artificial contention, two sessions are opened to the same database. Afterward, two different tables are created:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">data1</span> <span class="p">(</span><span class="n">id</span> <span class="nb">INT</span><span class="p">);</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">data2</span> <span class="p">(</span><span class="n">id</span> <span class="nb">INT</span><span class="p">);</span>
</code></pre></div></div>

<p>Furthermore, a debugger is attached to the first session, and a breakpoint is set in <code class="language-plaintext highlighter-rouge">ReserveXLogInsertLocation</code>. This function is responsible for reserving space in the write-ahead log (WAL) for a new record. It uses a spinlock to coordinate access to the WAL insertion point. Afterward, the first session performs an <code class="language-plaintext highlighter-rouge">INSERT</code> statement, which will cause the process to acquire the spinlock in <code class="language-plaintext highlighter-rouge">ReserveXLogInsertLocation</code> and then wait at the breakpoint.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">data1</span> <span class="k">VALUES</span> <span class="p">(</span><span class="mi">1</span><span class="p">);</span>
</code></pre></div></div>

<p>After the breakpoint is hit, step through the function in the debugger, as shown in the following figure, until the <a href="https://github.com/postgres/postgres/blob/73dd7163c5d19f93b629d1ccd9d2a2de6e9667f6/src/backend/access/transam/xlog.c#L1137">line</a> <code class="language-plaintext highlighter-rouge">SpinLockAcquire(&amp;Insert-&gt;insertpos_lck);</code> has been executed. The first session then holds the spinlock while it is stopped in the debugger.</p>

<figure class="row">
    
    <div class="column">
        <img class="single" src="/assets/img/spinlock-gdb.png" alt="spinlock-gdb.png" />
    </div>
    
    
    <figcaption class="caption-style"></figcaption>
</figure>

<p>In the second session, another <code class="language-plaintext highlighter-rouge">INSERT</code> statement is executed, which will also try to acquire the same spinlock in <code class="language-plaintext highlighter-rouge">ReserveXLogInsertLocation</code> and wait for the lock to be released by the first session.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">data2</span> <span class="k">VALUES</span> <span class="p">(</span><span class="mi">1</span><span class="p">);</span>
</code></pre></div></div>

<p>Two different tables are used to ensure that the contention is on the spinlock in <code class="language-plaintext highlighter-rouge">ReserveXLogInsertLocation</code> and not on another lock related to the table access.</p>

<h2 id="using-pg_stat_activity">Using pg_stat_activity</h2>

<p>The view <code class="language-plaintext highlighter-rouge">pg_stat_activity</code> of the <a href="https://www.postgresql.org/docs/18/monitoring-stats.html">cumulative statistics system</a> provides information about the current activity of all sessions in the PostgreSQL server. Lock contention can also be seen in this view.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mydb</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">pid</span><span class="p">,</span> <span class="n">backend_start</span><span class="p">,</span> <span class="n">wait_event_type</span><span class="p">,</span> <span class="n">wait_event</span><span class="p">,</span> <span class="k">state</span><span class="p">,</span> <span class="n">query</span> <span class="k">from</span> <span class="n">pg_stat_activity</span><span class="p">;</span>
   <span class="n">pid</span>   <span class="o">|</span>         <span class="n">backend_start</span>         <span class="o">|</span> <span class="n">wait_event_type</span> <span class="o">|</span>     <span class="n">wait_event</span>      <span class="o">|</span> <span class="k">state</span>  <span class="o">|</span>  <span class="n">query</span>                                            
<span class="c1">---------+-------------------------------+-----------------+---------------------+--------+-----------------------------</span>
 <span class="mi">2129513</span> <span class="o">|</span> <span class="mi">2026</span><span class="o">-</span><span class="mi">02</span><span class="o">-</span><span class="mi">08</span> <span class="mi">19</span><span class="p">:</span><span class="mi">48</span><span class="p">:</span><span class="mi">26</span><span class="p">.</span><span class="mi">32229</span><span class="o">+</span><span class="mi">01</span>  <span class="o">|</span>                 <span class="o">|</span>                     <span class="o">|</span> <span class="n">active</span> <span class="o">|</span> <span class="k">insert</span> <span class="k">into</span> <span class="n">data1</span> <span class="k">values</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
 <span class="mi">2129736</span> <span class="o">|</span> <span class="mi">2026</span><span class="o">-</span><span class="mi">02</span><span class="o">-</span><span class="mi">08</span> <span class="mi">19</span><span class="p">:</span><span class="mi">49</span><span class="p">:</span><span class="mi">00</span><span class="p">.</span><span class="mi">578201</span><span class="o">+</span><span class="mi">01</span> <span class="o">|</span> <span class="n">Timeout</span>         <span class="o">|</span> <span class="n">SpinDelay</span>           <span class="o">|</span> <span class="n">active</span> <span class="o">|</span> <span class="k">insert</span> <span class="k">into</span> <span class="n">data2</span> <span class="k">values</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="p">[...]</span>
</code></pre></div></div>

<p>The output shows that the second session (PID 2129736) is waiting for a <code class="language-plaintext highlighter-rouge">SpinDelay</code>, which indicates that it is trying to acquire a spinlock but is currently delayed due to contention. More information about this view and the meaning of the different columns can be found in the <a href="https://www.postgresql.org/docs/18/monitoring-stats.html#MONITORING-PG-STAT-ACTIVITY-VIEW">documentation</a>.</p>

<p>However, this view only provides a high-level overview of the lock contention and does not provide detailed information about the spinlock behavior, such as the number of spins and delays or the current delay value. For that, a more detailed tracing tool is needed.</p>

<h2 id="tracing-spinlocks-with-pg_spinlock_tracer">Tracing Spinlocks with pg_spinlock_tracer</h2>

<p>To trace spinlock contention in PostgreSQL, I implemented <code class="language-plaintext highlighter-rouge">pg_spinlock_tracer</code> as part of the <a href="https://github.com/jnidzwetzki/pg-lock-tracer">pg-lock-tracer project</a>. The tool uses eBPF to instrument the <code class="language-plaintext highlighter-rouge">perform_spin_delay</code> function and prints the contents of the <code class="language-plaintext highlighter-rouge">SpinDelayStatus</code> struct. For instance, it reports the number of spins and delays, the current delay, and the source location where the spinlock is being attempted.</p>

<p>Unlike the PostgreSQL view, <code class="language-plaintext highlighter-rouge">pg_spinlock_tracer</code> shows the internals of spinlock acquisition and contention, which can be useful for understanding behavior. A simple output of the tool looks as follows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pg_spinlock_delay_tracer -x /home/jan/postgresql-sandbox/bin/REL_17_1_DEBUG/bin/postgres
[...]
13180680737869452 [Pid 1864403] SpinDelay spins=996 delays=939 cur_delay=566086 at ReserveXLogInsertLocation, xlog.c:1132
13180680737874986 [Pid 1864403] SpinDelay spins=997 delays=939 cur_delay=566086 at ReserveXLogInsertLocation, xlog.c:1132
13180680737880522 [Pid 1864403] SpinDelay spins=998 delays=939 cur_delay=566086 at ReserveXLogInsertLocation, xlog.c:1132
13180680737886009 [Pid 1864403] SpinDelay spins=999 delays=939 cur_delay=566086 at ReserveXLogInsertLocation, xlog.c:1132
13180681304189362 [Pid 1864403] SpinDelay spins=0 delays=940 cur_delay=661655 at ReserveXLogInsertLocation, xlog.c:1132
13180681304227806 [Pid 1864403] SpinDelay spins=1 delays=940 cur_delay=661655 at ReserveXLogInsertLocation, xlog.c:1132
13180681304241759 [Pid 1864403] SpinDelay spins=2 delays=940 cur_delay=661655 at ReserveXLogInsertLocation, xlog.c:1132
13180681304255150 [Pid 1864403] SpinDelay spins=3 delays=940 cur_delay=661655 at ReserveXLogInsertLocation, xlog.c:1132
[...]
</code></pre></div></div>

<p>The output shows that PID 1864403 (the second session to PostgreSQL) is trying to acquire a spinlock in <code class="language-plaintext highlighter-rouge">ReserveXLogInsertLocation</code> (xlog.c:1132). In the example, the process spins up to 999 times; once it reaches the threshold, it sleeps for <code class="language-plaintext highlighter-rouge">cur_delay</code> microseconds, and the spin counter is reset (visible as <code class="language-plaintext highlighter-rouge">spins=0</code>). The delay value then grows for subsequent attempts.</p>
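
<p>The relationship between the reported delay and the actual sleep can be checked directly from the trace. The following Python sketch parses a few of the lines shown above and compares <code class="language-plaintext highlighter-rouge">cur_delay</code> with the gap between the timestamps before and after the sleep. It assumes that the first column is a timestamp in nanoseconds; the function names <code class="language-plaintext highlighter-rouge">parse_trace</code> and <code class="language-plaintext highlighter-rouge">sleep_gaps</code> are made up for this illustration and are not part of the tool.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

# Sample lines copied from the trace output above.
TRACE = """\
13180680737880522 [Pid 1864403] SpinDelay spins=998 delays=939 cur_delay=566086 at ReserveXLogInsertLocation, xlog.c:1132
13180680737886009 [Pid 1864403] SpinDelay spins=999 delays=939 cur_delay=566086 at ReserveXLogInsertLocation, xlog.c:1132
13180681304189362 [Pid 1864403] SpinDelay spins=0 delays=940 cur_delay=661655 at ReserveXLogInsertLocation, xlog.c:1132
13180681304227806 [Pid 1864403] SpinDelay spins=1 delays=940 cur_delay=661655 at ReserveXLogInsertLocation, xlog.c:1132
"""

LINE_RE = re.compile(
    r"^(\d+) \[Pid (\d+)\] SpinDelay spins=(\d+) delays=(\d+) cur_delay=(\d+) at (.+)$"
)


def parse_trace(text):
    """Parse tracer output into a list of dictionaries, one per event."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip "[...]" and other non-event lines
        ts, pid, spins, delays, cur_delay = (int(g) for g in match.groups()[:5])
        events.append({
            "ts": ts, "pid": pid, "spins": spins, "delays": delays,
            "cur_delay": cur_delay, "location": match.group(6),
        })
    return events


def sleep_gaps(events):
    """Return (announced delay in us, observed gap in us) for each sleep."""
    gaps = []
    for prev, cur in zip(events, events[1:]):
        # The delay counter increases by one each time the process sleeps.
        if cur["delays"] == prev["delays"] + 1:
            # Assumption: timestamps are in nanoseconds.
            gaps.append((prev["cur_delay"], (cur["ts"] - prev["ts"]) // 1000))
    return gaps


print(sleep_gaps(parse_trace(TRACE)))
</code></pre></div></div>

<p>For this excerpt, the observed gap around the sleep is roughly 566 ms for an announced delay of 566086, which is consistent with reading <code class="language-plaintext highlighter-rouge">cur_delay</code> as a sleep time in microseconds.</p>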

<h1 id="conclusion">Conclusion</h1>
<p>This article provided an overview of spinlocks in PostgreSQL, their implementation details, and how to observe spinlock contention. Spinlocks are a crucial part of PostgreSQL’s locking mechanism for short-term protection of shared resources. Understanding how they work and how to analyze contention can be valuable for diagnosing performance issues in PostgreSQL. The cumulative statistics system provides some insights into lock contention. The new <code class="language-plaintext highlighter-rouge">pg_spinlock_tracer</code> tool offers a more detailed view of the spinlock behavior and contention patterns.</p>]]></content><author><name>Jan Nidzwetzki</name></author><category term="PostgreSQL" /><category term="Performance" /><category term="eBPF" /><category term="Profiling" /><summary type="html"><![CDATA[PostgreSQL uses a process-based architecture where each connection is handled by a separate process. Some data structures are shared between these processes, for example, the shared buffer cache or the write-ahead log (WAL). To coordinate access to these shared resources, PostgreSQL uses several locking mechanisms, including spinlocks. Spinlocks are intended for very short-term protection of shared structures: rather than immediately putting a waiting process to sleep, they busy-wait and repeatedly check whether the lock is free. Under contention, PostgreSQL also applies an adaptive backoff that can include brief sleeps. This article explains what spinlocks are and how they are implemented in PostgreSQL. 
It also describes how spinlocks can be monitored and demonstrates how my new pg_spinlock_tracer tool can be used to trace spinlock internals using eBPF.]]></summary></entry><entry><title type="html">Analyzing PostgreSQL Performance Using Flame Graphs</title><link href="https://jnidzwetzki.github.io/2025/07/05/postgresql-flamegraph.html" rel="alternate" type="text/html" title="Analyzing PostgreSQL Performance Using Flame Graphs" /><published>2025-07-05T00:00:00+00:00</published><updated>2025-07-05T00:00:00+00:00</updated><id>https://jnidzwetzki.github.io/2025/07/05/postgresql-flamegraph</id><content type="html" xml:base="https://jnidzwetzki.github.io/2025/07/05/postgresql-flamegraph.html"><![CDATA[<p>A flame graph is a graphical representation that helps to quickly understand where a program spends most of its processing time. These graphs are based on sampled information collected by a profiler while the observed software is running. At regular intervals, the profiler captures and stores the current call stack. A flame graph is then generated from this data to provide a visual representation of the functions in which the software spends most of its processing time. This is useful for understanding the characteristics of a program and for improving its performance.</p>

<p>This blog post explores the fundamentals of flame graphs and offers a few practical tips on utilizing them to identify and debug performance bottlenecks in PostgreSQL.</p>

<!--more-->

<p>The content presented in this blog post is based on material found in other articles or <a href="https://www.brendangregg.com/flamegraphs.html">blog posts</a>, as well as in Brendan Gregg’s excellent book on <a href="https://www.brendangregg.com/linuxperf.html">system performance</a>. Over the years, I have collected a number of commands in my lab notebook that I typically use when diagnosing PostgreSQL-related performance problems. I have shared these commands in several emails over the years, so I decided to write a whole blog post on this topic.</p>

<h1 id="flame-graphs">Flame Graphs</h1>

<p>Flame graphs are based on data captured by a profiler. They aggregate call stacks to make it easier to see where a program spends most of its processing time. Without aggregation, it is difficult to see the big picture in the thousands (or more) of call stacks that a profiler collects.</p>

<p>When a flame graph is created, these call stacks are collapsed, and the time spent in similar call stacks is summed up. Based on this data, the flame graph is created. The idea behind this is as follows: the more time a program spends in a particular code path, the more often those call stacks will appear in the samples. Since the resulting graph consists of call stacks of different heights, and the stacks are usually colored in red to yellow tones, it looks like a flame.</p>
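
<p>The collapsing step can be sketched in a few lines of Python. The samples below are hypothetical, and the helper name <code class="language-plaintext highlighter-rouge">fold</code> is made up for this illustration; the output uses the folded text format (frames joined by semicolons, followed by a count) that the FlameGraph scripts consume.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter

# Hypothetical samples: each tuple is one captured call stack, root first.
samples = [
    ("main", "ExecModifyTable", "ExecInsert"),
    ("main", "ExecModifyTable", "ExecInsert"),
    ("main", "ExecModifyTable", "ExecProcNode"),
    ("main", "ExecModifyTable", "ExecInsert"),
]


def fold(stacks):
    """Collapse identical call stacks and count how often each one occurs."""
    counts = Counter(";".join(stack) for stack in stacks)
    return [f"{path} {count}" for path, count in sorted(counts.items())]


for line in fold(samples):
    print(line)
</code></pre></div></div>

<p>Four samples collapse into two lines here: the stack ending in <code class="language-plaintext highlighter-rouge">ExecInsert</code> with a count of 3 and the stack ending in <code class="language-plaintext highlighter-rouge">ExecProcNode</code> with a count of 1.</p>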

<p>Brendan Gregg states in <a href="https://dl.acm.org/doi/10.1145/2927299.2927301">‘The Flame Graph’, ACM Queue, Vol 14, No 2</a>:</p>

<blockquote>
  <p>A flame graph visualizes a collection of stack traces (aka call stacks), shown as an adjacency diagram with an inverted icicle layout. Flame graphs are commonly used to visualize CPU profiler output, where stack traces are collected using sampling.</p>
</blockquote>

<p>An example flame graph looks like this:</p>

<figure class="row">
    
    <div class="column">
        <img class="single" src="/assets/img/flamegraph.png" alt="flamegraph.png" />
    </div>
    
    
    <figcaption class="caption-style"></figcaption>
</figure>

<p>In this flame graph, it can be seen that most of the time is spent in the <code class="language-plaintext highlighter-rouge">ExecModifyTable</code> function, which is part of the PostgreSQL executor. This function calls other functions like <code class="language-plaintext highlighter-rouge">ExecInsert</code> or <code class="language-plaintext highlighter-rouge">ExecProcNode</code>. Of these, <code class="language-plaintext highlighter-rouge">ExecInsert</code> takes the most time. So, when performing a performance analysis (and searching for functions that are worth optimizing), it is important to focus on the functions with a wide bar on the x-axis. The wider the bar, the more time is spent in that function and the functions it calls.</p>
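
<p>How the bar widths follow from the sampled stacks can be sketched as follows. The folded lines and their counts are hypothetical, and <code class="language-plaintext highlighter-rouge">frame_widths</code> is a made-up helper rather than a part of any tool; it only illustrates that a frame is as wide as the share of samples in which it appears on the stack.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter

# Hypothetical folded stacks: "frame;frame;frame sample_count".
folded = [
    "main;ExecModifyTable;ExecInsert 3",
    "main;ExecModifyTable;ExecProcNode 1",
]


def frame_widths(lines):
    """Return the relative width (0.0 to 1.0) of every frame in the graph."""
    widths = Counter()
    total = 0
    for line in lines:
        path, count = line.rsplit(" ", 1)
        frames = path.split(";")
        total += int(count)
        # A sample counts for the leaf frame and for all of its callers.
        for depth in range(1, len(frames) + 1):
            widths[";".join(frames[:depth])] += int(count)
    return {path: samples / total for path, samples in widths.items()}


for path, width in frame_widths(folded).items():
    print(f"{width:.2f} {path}")
</code></pre></div></div>

<p>In this made-up profile, <code class="language-plaintext highlighter-rouge">ExecModifyTable</code> spans the full width because it is on the stack in every sample, while <code class="language-plaintext highlighter-rouge">ExecInsert</code> covers three quarters of it.</p>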

<h2 id="creating-a-flame-graph">Creating a Flame Graph</h2>
<p>As I primarily work on diagnosing performance issues in PostgreSQL, this blog post will focus on creating flame graphs for PostgreSQL. However, the presented methods can also be used to generate flame graphs for other applications written in C or Rust.</p>

<p>To profile a PostgreSQL backend process, the PID (Process ID) of the process needs to be known. To get the PID of the PostgreSQL backend process, you can use the following command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">mydb</span><span class="o">=</span><span class="c"># select pg_backend_pid();</span>
 pg_backend_pid
<span class="nt">----------------</span>
        2112031
<span class="o">(</span>1 row<span class="o">)</span>
</code></pre></div></div>

<p>In this example, the PID is <code class="language-plaintext highlighter-rouge">2112031</code>. Now, the data for the flame graph can be collected. Two ways to collect and process the data are available: using only the <code class="language-plaintext highlighter-rouge">perf</code> tool or using <code class="language-plaintext highlighter-rouge">perf</code> together with the <code class="language-plaintext highlighter-rouge">FlameGraph</code> tool by Brendan Gregg. The first method is more straightforward since only one command is needed to collect and process the data. However, the second method is more flexible and allows for more advanced processing of the collected data. So, let’s start with the second method, which uses the <code class="language-plaintext highlighter-rouge">FlameGraph</code> tool.</p>

<h3 id="processing-the-collected-data-using-flamegraph">Processing the Collected Data using FlameGraph</h3>

<p>The FlameGraph tool can be found in the <a href="https://github.com/brendangregg/FlameGraph">FlameGraph repository</a>. To use it, you need to clone the repository:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/brendangregg/FlameGraph
</code></pre></div></div>

<p>This needs to be done only once. Afterward, the <code class="language-plaintext highlighter-rouge">perf</code> tool can be used to collect the data for the flame graph. <a href="https://perfwiki.github.io/main/">Perf</a> is a powerful profiler available for Linux systems.</p>

<p>The following command captures the call stacks of the PostgreSQL backend process with PID <code class="language-plaintext highlighter-rouge">2112031</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>perf record <span class="nt">-a</span> <span class="nt">-g</span> <span class="nt">-F</span> 111 <span class="nt">-o</span> data.perf <span class="nt">-p</span> 2112031
</code></pre></div></div>

<p>The parameter <code class="language-plaintext highlighter-rouge">-a</code> requests system-wide collection on all CPUs; because a PID is given with <code class="language-plaintext highlighter-rouge">-p</code>, <code class="language-plaintext highlighter-rouge">perf</code> restricts the recording to that process. The parameter <code class="language-plaintext highlighter-rouge">-g</code> enables call graph recording, <code class="language-plaintext highlighter-rouge">-F 111</code> sets the sampling frequency to 111 Hz, and <code class="language-plaintext highlighter-rouge">-o data.perf</code> specifies the output file. The frequency should be adjusted based on the monitored process (shorter workloads may require a higher frequency). In addition, the frequency should be set to a value that is not in lockstep with any cyclic task (e.g., a job that is executed every 100 ms) in the system. Otherwise, the profiler will always or never take its samples at the moment the cyclic task is executed. So, the value 111 Hz is a good choice, as it is not a common value for cyclic tasks.</p>
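
<p>The effect of a sampling frequency that runs in lockstep with a cyclic task can be illustrated with a small simulation. The following Python sketch assumes a hypothetical task that is busy for the first 2 ms of every 100 ms (so it really accounts for 2 % of the time) and counts how many samples fall into that window; the function name and all numbers are chosen only for this illustration.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def hit_fraction(freq_hz, offset_us=0, duration_s=10,
                 period_us=100_000, busy_us=2_000):
    """Fraction of samples that land inside the cyclic task's busy window."""
    samples = freq_hz * duration_s
    hits = 0
    for k in range(samples):
        # Time of the k-th sample in microseconds.
        t_us = offset_us + k * 1_000_000 // freq_hz
        # The task is busy during the first busy_us of every period.
        if t_us % period_us in range(busy_us):
            hits += 1
    return hits / samples


print("100 Hz, aligned: ", hit_fraction(100))
print("100 Hz, shifted: ", hit_fraction(100, offset_us=5_000))
print("111 Hz:          ", hit_fraction(111))
</code></pre></div></div>

<p>At 100 Hz, the result depends entirely on how the sampler happens to be aligned with the task: the task appears either in 10 % of the samples or not at all. At 111 Hz, the samples drift through the task’s period, and the measured share stays close to the real 2 %.</p>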

<p>Now, the workload that you want to profile should be executed. For example, the following query can be executed a few times to let the PostgreSQL backend process do some work:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="k">data</span> <span class="p">(</span><span class="k">key</span><span class="p">,</span> <span class="n">value</span><span class="p">)</span> <span class="k">SELECT</span> <span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="p">::</span><span class="nb">text</span> <span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100000</span><span class="p">)</span> <span class="n">i</span><span class="p">;</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">\watch</code> command of the <code class="language-plaintext highlighter-rouge">psql</code> client can be used to execute the query repeatedly to allow the backend process to do some work. After some time (usually when the workload that should be profiled is complete), <code class="language-plaintext highlighter-rouge">perf record</code> can be terminated using CTRL+C. Afterward, <code class="language-plaintext highlighter-rouge">perf</code> is used again to process the collected data. The following command generates a text file with the call stacks:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>perf script <span class="nt">-i</span> data.perf <span class="o">&gt;</span> data.stacks
</code></pre></div></div>

<p>This command reads the data from the <code class="language-plaintext highlighter-rouge">data.perf</code> file and writes the call stacks to the <code class="language-plaintext highlighter-rouge">data.stacks</code> file, which can then be used to generate a flame graph. However, before the flame graph can be generated, the <code class="language-plaintext highlighter-rouge">data.stacks</code> file needs to be processed and <code class="language-plaintext highlighter-rouge">folded</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/FlameGraph/stackcollapse-perf.pl data.stacks <span class="o">&gt;</span> data.folded
</code></pre></div></div>

<p>Folding means that the call stacks are aggregated so that the time spent in similar call stacks is summed up. Based on this data, the flame graph can be created:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/FlameGraph/flamegraph.pl data.folded <span class="o">&gt;</span> data.svg
</code></pre></div></div>

<p>The resulting SVG can be found <a href="/assets/misc/flamegraph/flamegraph.svg">here</a> for reference.</p>

<h3 id="processing-the-data-with-perf-script">Processing the Data with ‘perf script’</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>perf script flamegraph <span class="nt">-a</span> <span class="nt">-F</span> 111 <span class="nt">-p</span> 2112031 <span class="nb">sleep </span>60
</code></pre></div></div>

<p>This command unifies all the manual commands we have used so far. It collects the call stacks of the PostgreSQL backend process with PID <code class="language-plaintext highlighter-rouge">2112031</code> for 60 seconds, and then generates a flame graph afterward.</p>

<p>Unfortunately, this method does not work out of the box on Debian-based distributions. Instead, the following error is shown:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Flame Graph template /usr/share/d3-flame-graph/d3-flamegraph-base.html does not exist. Please <span class="nb">install </span>the js-d3-flame-graph <span class="o">(</span>RPM<span class="o">)</span> or libjs-d3-flame-graph <span class="o">(</span>deb<span class="o">)</span> package, specify an existing flame graph template <span class="o">(</span><span class="nt">--template</span> PATH<span class="o">)</span> or another output format <span class="o">(</span><span class="nt">--format</span> FORMAT<span class="o">)</span><span class="nb">.</span>
</code></pre></div></div>

<p>The reason for the error is that the required template file is not available on Debian-based systems. This is a known issue and has been reported in the <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1002492">Debian bug tracker</a>. Until the bug is fixed and the package is created for Debian, you can use the package from the Fedora repository. To download the package and convert it to a Debian package, use the following commands:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget https://rpmfind.net/linux/fedora/linux/development/rawhide/Everything/x86_64/os/Packages/j/js-d3-flame-graph-4.0.7-10.fc42.noarch.rpm
<span class="nb">sudo </span>apt-get <span class="nb">install </span>alien
fakeroot alien js-d3-flame-graph-4.0.7-10.fc42.noarch.rpm
<span class="nb">sudo </span>dpkg <span class="nt">-i</span> js-d3-flame-graph_4.0.7-11_all.deb
</code></pre></div></div>

<p>Afterward, the <code class="language-plaintext highlighter-rouge">perf script</code> command can be used to generate a flame graph. Although this works, I prefer to use the <code class="language-plaintext highlighter-rouge">FlameGraph</code> tool directly, as it also allows you to generate differential flame graphs (see below).</p>

<p>The resulting HTML page can be found <a href="/assets/misc/flamegraph/flamegraph.html">here</a> for reference.</p>

<h2 id="build-types">Build Types</h2>
<p>When profiling, it is important to decide which build type should be used. A highly optimized build is typically used in production and can also be profiled. However, many functions are inlined in such builds, which can lead to misleading results in the flame graph. For example, if a function is inlined, it will not appear in the flame graph, even if it consumes a significant amount of time. Alternatively, a debug build can be used for profiling. However, a debug build is not optimized and may exhibit different performance characteristics than a production build. Additionally, some functions might only be executed in a debug build (e.g., assertions), which can also lead to misleading results in the flame graph. Therefore, it is essential to be aware of the build type used for profiling and to interpret the results accordingly.</p>

<p>For some hard-to-catch performance problems, I have created profiles for both build types (optimized and debug) and compared the results.</p>

<p>When compiling a debug version of PostgreSQL, I use the following options:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">CFLAGS</span><span class="o">=</span><span class="s2">"-ggdb -O0 -g3 -fno-omit-frame-pointer"</span>
</code></pre></div></div>

<h2 id="different-types-of-flame-graphs">Different Types of Flame Graphs</h2>
<p>Several types of flame graphs exist. In addition to the standard flame graph, the most common are off-CPU flame graphs and differential flame graphs. These will be discussed in the following sections.</p>

<h3 id="on-cpu--off-cpu-flame-graphs">On-CPU / Off-CPU Flame Graphs</h3>
<p>Usually, flame graphs show the time a process spends executing code on the CPU. However, sometimes it is also useful to see how much time a process spends waiting for resources (e.g., I/O operations). In this case, off-CPU flame graphs can be used. For these, the profiler attributes the time during which the process is not running on the CPU to the call stack that led to the wait.</p>

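<p>The data for an off-CPU flame graph can be collected with the <code class="language-plaintext highlighter-rouge">offcputime</code> tool from the BCC project (installed as <code class="language-plaintext highlighter-rouge">offcputime-bpfcc</code> on Debian-based systems). For example:</p>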

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>offcputime-bpfcc <span class="nt">-df</span> <span class="nt">-p</span> 12345 <span class="o">&gt;</span> out.stacks
</code></pre></div></div>

<p>The parameter <code class="language-plaintext highlighter-rouge">-d</code> means that there should be a <em>delimiter between kernel/user stacks</em>, and <code class="language-plaintext highlighter-rouge">-f</code> produces the needed <em>folded</em> format. <code class="language-plaintext highlighter-rouge">12345</code> must be replaced with the process ID of the process to be observed.</p>

<p>Afterward, the output can be processed as follows:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/FlameGraph/flamegraph.pl <span class="nt">--color</span><span class="o">=</span>io <span class="nt">--title</span><span class="o">=</span><span class="s2">"Off-CPU Time Flame Graph"</span> <span class="nt">--countname</span><span class="o">=</span>us &lt; out.stacks <span class="o">&gt;</span> out.svg
</code></pre></div></div>

<h3 id="differential-flame-graphs">Differential Flame Graphs</h3>
<p>Differential flame graphs are used to compare two different profiler runs. Usually, the first run is a baseline against which the second run (e.g., after a potential performance improvement) is compared.</p>

<p>Using a differential flame graph makes the differences between the runs more obvious. Otherwise, the width of the bars has to be compared, which is a hard task. To create a differential flame graph, the <code class="language-plaintext highlighter-rouge">.folded</code> files of the two runs need to be created first. Afterward, the script <code class="language-plaintext highlighter-rouge">difffolded.pl</code> from the <code class="language-plaintext highlighter-rouge">FlameGraph</code> tool can be used to create a differential flame graph:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/FlameGraph/difffolded.pl data.folded data2.folded | ~/FlameGraph/flamegraph.pl <span class="o">&gt;</span> diff.svg
</code></pre></div></div>

<p>The resulting SVG file will clearly highlight the differences between the two profiler runs. For example:</p>

<figure class="row">
    
    <div class="column">
        <img class="single" src="/assets/img/flamegraph_diff.png" alt="flamegraph_diff.png" />
    </div>
    
    
    <figcaption class="caption-style"></figcaption>
</figure>

<p>Functions marked in red are slower in the second run, while functions marked in blue are faster.</p>
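
<p>The idea behind the comparison can be sketched in Python: for every call stack, the sample counts of the two runs are subtracted. The folded profiles below are hypothetical, and the helpers <code class="language-plaintext highlighter-rouge">parse_folded</code> and <code class="language-plaintext highlighter-rouge">diff_folded</code> are made up for this illustration; this is not how <code class="language-plaintext highlighter-rouge">difffolded.pl</code> is implemented, and its output format differs.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def parse_folded(lines):
    """Turn 'frame;frame;frame count' lines into {stack: count}."""
    counts = {}
    for line in lines:
        stack, count = line.rsplit(" ", 1)
        counts[stack] = counts.get(stack, 0) + int(count)
    return counts


def diff_folded(baseline, candidate):
    """Per-stack change in sample count (candidate minus baseline)."""
    before, after = parse_folded(baseline), parse_folded(candidate)
    stacks = sorted(set(before).union(after))
    return {stack: after.get(stack, 0) - before.get(stack, 0) for stack in stacks}


# Hypothetical folded profiles of two runs.
run1 = ["main;ExecModifyTable;ExecInsert 30", "main;ExecModifyTable;ExecProcNode 10"]
run2 = ["main;ExecModifyTable;ExecInsert 18", "main;ExecModifyTable;ExecProcNode 12"]

for stack, delta in diff_folded(run1, run2).items():
    print(f"{delta:+d} {stack}")
</code></pre></div></div>

<p>A positive difference means that a stack was sampled more often in the second run (drawn in red), and a negative difference means that it was sampled less often (drawn in blue).</p>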

<h1 id="conclusion">Conclusion</h1>

<p>This blog post provides an overview of how to create flame graphs for PostgreSQL (and other applications) using the <code class="language-plaintext highlighter-rouge">perf</code> tool and the <code class="language-plaintext highlighter-rouge">FlameGraph</code> tool. I frequently use flame graphs to gain insight into a program’s performance characteristics, identify potential bottlenecks, and find functions that are worth optimizing.</p>

<p>The methods presented work well with C or Rust code. If you want to profile a Java application, I highly recommend using the <a href="https://github.com/async-profiler/async-profiler">async-profiler</a> tool, which is optimized for Java applications and provides similar functionality.</p>]]></content><author><name>Jan Nidzwetzki</name></author><category term="PostgreSQL" /><category term="Performance" /><summary type="html"><![CDATA[A flame graph is a graphical representation that helps to quickly understand where a program spends most of its processing time. These graphs are based on sampled information collected by a profiler while the observed software is running. At regular intervals, the profiler captures and stores the current call stack. A flame graph is then generated from this data to provide a visual representation of the functions in which the software spends most of its processing time. This is useful for understanding the characteristics of a program and for improving its performance. This blog post explores the fundamentals of flame graphs and offers a few practical tips on utilizing them to identify and debug performance bottlenecks in PostgreSQL.]]></summary></entry><entry><title type="html">The Art of SQL Query Optimization</title><link href="https://jnidzwetzki.github.io/2025/06/03/art-of-query-optimization.html" rel="alternate" type="text/html" title="The Art of SQL Query Optimization" /><published>2025-06-03T00:00:00+00:00</published><updated>2025-06-03T00:00:00+00:00</updated><id>https://jnidzwetzki.github.io/2025/06/03/art-of-query-optimization</id><content type="html" xml:base="https://jnidzwetzki.github.io/2025/06/03/art-of-query-optimization.html"><![CDATA[<p>SQL is a declarative language; only the result of the query is specified. The exact steps to produce the result must be determined by the DBMS. Often, multiple ways exist to calculate a query result. 
For example, the DBMS can choose to use an index or perform a sequential scan on a table to find the needed tuples.</p>

<p>The query optimizer is responsible for finding the most efficient plan for a given query. The plan generator creates possible plans, which are then evaluated based on their costs. Afterward, the cheapest plan is chosen and executed. When the DBMS expects to return a large portion of the table, a full table scan can be more efficient than following the pointers in an index structure. However, it is hard to determine when the DBMS prefers one plan over another and when the switch between plans occurs.</p>
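
<p>Why the cheapest plan changes with the size of the result can be illustrated with a toy cost model in Python. The three constants are PostgreSQL’s default values for <code class="language-plaintext highlighter-rouge">seq_page_cost</code>, <code class="language-plaintext highlighter-rouge">random_page_cost</code>, and <code class="language-plaintext highlighter-rouge">cpu_tuple_cost</code>. Everything else is an assumption made for this sketch: the formulas are heavily simplified and are not the ones PostgreSQL uses, and the table size is made up.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># PostgreSQL's default planner cost parameters.
SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0
CPU_TUPLE_COST = 0.01


def plan_costs(selectivity, table_pages=1_000, table_rows=100_000):
    """Toy cost estimates; NOT the formulas PostgreSQL actually uses."""
    matching_rows = selectivity * table_rows
    return {
        # Read every page sequentially and process every tuple.
        "seq scan": table_pages * SEQ_PAGE_COST + table_rows * CPU_TUPLE_COST,
        # Simplification: one random page access per matching tuple.
        "index scan": matching_rows * (RANDOM_PAGE_COST + CPU_TUPLE_COST),
    }


def cheapest_plan(selectivity):
    """Pick the plan with the lowest estimated cost."""
    costs = plan_costs(selectivity)
    return min(costs, key=costs.get)


for selectivity in (0.001, 0.004, 0.006, 0.01, 0.1):
    print(selectivity, cheapest_plan(selectivity))
</code></pre></div></div>

<p>With these made-up numbers, the choice flips from the index scan to the sequential scan at a selectivity of roughly 0.5 %. In a real system, the switch point depends on many more factors, which is why it is hard to predict by hand.</p>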

<p>In a few evenings, <a href="https://jnidzwetzki.github.io/2025/05/18/building-a-query-plan-explorer.html">I implemented the plan explorer</a> for PostgreSQL. It iterates over a search space and generates visualizations that show when the plan changes and how many tuples are expected versus the actual number returned. This blog post examines the “art” of query optimization. It discusses the plan explorer tool, the images the tool generates, and the insights the tool provides into the decisions made by the PostgreSQL query optimizer.</p>

<!--more-->

<h1 id="plan-explorer">Plan Explorer</h1>
<p>Plan Explorer is a tool that iterates over a two-dimensional search space and executes a given SQL query for each parameter combination. Based on the returned query plans (and actual query executions), various diagrams are generated. These diagrams have an artistic aspect, but they also provide valuable insights into the workings of the query optimizer and visually represent the decisions made (i.e., at which point PostgreSQL considers using a specific index).</p>

<p>The tool is based on the ideas of <a href="https://dl.acm.org/doi/10.14778/1920841.1921027">Picasso</a>, but implemented as a modern web application.</p>

<blockquote>
  <p>The Picasso database query optimizer visualizer, Jayant R. Haritsa, Proceedings of the VLDB Endowment, Volume 3, Issue 1-2, September 2010</p>
</blockquote>

<h2 id="server-mode">Server Mode</h2>
<p>Initially, the tool was developed as a standalone website that does not require any server component. The WebAssembly build of PostgreSQL, <a href="https://pglite.dev/">PGlite</a>, is used to run PostgreSQL within the user’s browser and extract the needed query plans.</p>

<p>In the most recent version of the tool, an optional server mode was added. In this mode, queries are sent to a <em>REST</em> endpoint on a web server, where a script forwards them to an actual PostgreSQL server. The server component thus acts as a proxy for query execution. The proxy is necessary because, due to security constraints, JavaScript running in a browser cannot open the raw TCP connections that the PostgreSQL wire protocol requires. The benefit of querying an actual PostgreSQL server is that large datasets can be preloaded and custom extensions can be integrated into the database server. This allows the tool to be used in these environments as well (e.g., a PostgreSQL extension developer could check their custom cost models and whether PostgreSQL picks the desired query plans).</p>

<p>The architecture of the plan explorer is shown in the diagram below:</p>

<pre><code class="language-mermaid">flowchart LR
    subgraph **Browser**
        UI["WebUI"]
        PGLite["PGlite (WebAssembly PostgreSQL)"]
        UI -- "SQL, Parameters" --&gt; PGLite
        PGLite -- "Query Results, Plans" --&gt; UI
    end
    subgraph "**Proxy Mode** (Optional)"
        Proxy["Proxy Server"]
        PG["PostgreSQL Database"]
        UI -- "HTTP (SQL/Params)" --&gt; Proxy
        Proxy -- "SQL" --&gt; PG
        PG -- "Results" --&gt; Proxy
        Proxy -- "Results" --&gt; UI
    end
</code></pre>

<p>A further feature of the server mode is that the queries can also be executed (<code class="language-plaintext highlighter-rouge">EXPLAIN (ANALYZE)</code>) instead of only being planned (<code class="language-plaintext highlighter-rouge">EXPLAIN</code>). When only the planning of the query is performed, the tool operates much faster and can quickly iterate over the given search space. On the other hand, if the query is actually executed, PostgreSQL provides further information, such as the actual number of returned tuples or the execution time.</p>
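
<p>The difference between the two modes can be sketched as a small helper that wraps a query in the matching <code class="language-plaintext highlighter-rouge">EXPLAIN</code> statement. This is only an illustration: the function name <code class="language-plaintext highlighter-rouge">explain_statement</code> is hypothetical, and it is an assumption that the plans are requested in JSON format.</p>

<pre><code class="language-python">def explain_statement(sql, execute=False):
    """Wrap a query in EXPLAIN; with execute=True the query is also run."""
    # ANALYZE makes PostgreSQL execute the query and report actual rows and timings.
    options = ["ANALYZE", "FORMAT JSON"] if execute else ["FORMAT JSON"]
    return "EXPLAIN (" + ", ".join(options) + ") " + sql.rstrip("; \n") + ";"


query = "SELECT * FROM data WHERE key = 1;"
print(explain_statement(query))                # planning only
print(explain_statement(query, execute=True))  # planning and execution
</code></pre>

<p>The first statement only plans the query, so it returns quickly even for expensive queries. The second one pays the full execution cost for every point of the search space.</p>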

<h2 id="generated-drawings">Generated Drawings</h2>
<p>The plan explorer tool creates several different drawings from one query. These drawings are:</p>

<ul>
  <li>The query plans used</li>
  <li>The expected costs of the query</li>
  <li>The actual execution time of the query</li>
  <li>The estimated number of result tuples</li>
  <li>The actual number of result tuples</li>
  <li>The difference between the estimated and actual number of result tuples</li>
</ul>

<p>For a database administrator, the different query plans and their distribution might be the most interesting output of the tool. For a developer who has implemented custom scan nodes and cost models, the other outputs may also be of interest. They enable database developers to determine if the cost models function as expected.</p>
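
<p>As a rough illustration of the last drawing, the estimation error for one point of the search space can be read from the root node of the JSON plan. The sketch below assumes that the signed difference is used; the function name <code class="language-plaintext highlighter-rouge">row_estimate_error</code> is hypothetical and the tool itself may compute the value differently.</p>

<pre><code class="language-python">def row_estimate_error(explain_output):
    """Return (estimated, actual, difference) for the root plan node."""
    root = explain_output[0]["Plan"]
    estimated = root["Plan Rows"]    # optimizer estimate
    actual = root["Actual Rows"]     # only present when the query was executed
    return estimated, actual, estimated - actual


# Trimmed-down JSON plan; the numbers are the root node values of the
# second query plan discussed later in this post.
sample = [{"Plan": {"Node Type": "Hash Join", "Plan Rows": 60183, "Actual Rows": 60000}}]
print(row_estimate_error(sample))  # (60183, 60000, 183)
</code></pre>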

<h1 id="example-query-discussion">Example Query Discussion</h1>

<p>In this section, the output of the tool for the following query is discussed. The query performs a self-join of the table <code class="language-plaintext highlighter-rouge">data</code> on the attribute <code class="language-plaintext highlighter-rouge">key</code>. Two predicates in the <code class="language-plaintext highlighter-rouge">WHERE</code> clause define filter conditions. The exact values for these conditions are determined by the search space. In this example, the search space for both dimensions is the interval <code class="language-plaintext highlighter-rouge">[0; 50000]</code>, sampled in steps of 1000.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="k">data</span> <span class="n">d1</span> <span class="k">LEFT</span> <span class="k">JOIN</span> <span class="k">data</span> <span class="n">d2</span> <span class="k">ON</span> <span class="p">(</span><span class="n">d1</span><span class="p">.</span><span class="k">key</span> <span class="o">=</span> <span class="n">d2</span><span class="p">.</span><span class="k">key</span><span class="p">)</span> <span class="k">WHERE</span> <span class="n">d1</span><span class="p">.</span><span class="k">key</span> <span class="o">&gt;</span> <span class="o">%%</span><span class="n">DIMENSION0</span><span class="o">%%</span> <span class="k">AND</span> <span class="n">d2</span><span class="p">.</span><span class="k">key</span> <span class="o">&gt;</span> <span class="o">%%</span><span class="n">DIMENSION1</span><span class="o">%%</span><span class="p">;</span>
</code></pre></div></div>
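
<p>How such a template could be expanded over the search space is sketched below. The helper names <code class="language-plaintext highlighter-rouge">axis</code> and <code class="language-plaintext highlighter-rouge">expand</code> are hypothetical, and it is an assumption that both ends of the interval are part of the search space. This is not the tool's actual implementation.</p>

<pre><code class="language-python">from itertools import product

# Query template from this post; the %%DIMENSIONn%% placeholders are
# replaced with concrete values for every point of the search space.
QUERY_TEMPLATE = (
    "SELECT * FROM data d1 LEFT JOIN data d2 ON (d1.key = d2.key) "
    "WHERE d1.key &gt; %%DIMENSION0%% AND d2.key &gt; %%DIMENSION1%%;"
)


def axis(start, stop, step):
    """Return the values of one dimension, including both interval ends."""
    return list(range(start, stop + 1, step))


def expand(template, dim0, dim1):
    """Return (x, y, sql) for every parameter combination of the grid."""
    queries = []
    for x, y in product(dim0, dim1):
        sql = template.replace("%%DIMENSION0%%", str(x))
        sql = sql.replace("%%DIMENSION1%%", str(y))
        queries.append((x, y, sql))
    return queries


values = axis(0, 50000, 1000)   # 0, 1000, ..., 50000
grid = expand(QUERY_TEMPLATE, values, values)
print(len(values), len(grid))   # 51 2601
</code></pre>

<p>Under these assumptions, each dimension has 51 values, so 2601 queries have to be planned (or executed) for one drawing.</p>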

<p>The query is executed after the database is prepared with the following SQL commands:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="k">data</span><span class="p">(</span><span class="k">key</span> <span class="nb">integer</span><span class="p">,</span> <span class="n">value</span> <span class="nb">text</span><span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="k">data</span> <span class="p">(</span><span class="k">key</span><span class="p">,</span> <span class="n">value</span><span class="p">)</span> <span class="k">SELECT</span> <span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="p">::</span><span class="nb">text</span> <span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100000</span><span class="p">)</span> <span class="n">i</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">INDEX</span> <span class="k">ON</span> <span class="k">data</span><span class="p">(</span><span class="k">key</span><span class="p">);</span>
<span class="k">ANALYZE</span> <span class="k">data</span><span class="p">;</span>
</code></pre></div></div>

<p>So, the table <code class="language-plaintext highlighter-rouge">data</code> consists of 100000 tuples and has an index on the attribute <code class="language-plaintext highlighter-rouge">key</code>.</p>

<p>The query plan explorer reveals that five different query plans are used by PostgreSQL for the given parameter combinations of the search space.</p>

<h2 id="query-plans">Query Plans</h2>

<figure class="row">
    
    <div class="column">
        <img class="single" src="/assets/img/planexplorer/plans.svg" alt="plans.svg" />
    </div>
    
    
    <figcaption class="caption-style"></figcaption>
</figure>

<p>In the following two subsections, two of the five query plans are discussed. Query plan 1 is the light blue area on the middle left of the drawing. Query plan 2 is the area in the upper left corner of the drawing.</p>

<h3 id="query-plan-1">Query Plan 1</h3>
<p>Fingerprints are used by the query plan explorer to classify query plans as identical: two plans share a fingerprint when their structure is the same, even if the parameters in some nodes differ. The first query plan has the following fingerprint: <code class="language-plaintext highlighter-rouge">Hash Join &gt; Seq Scan(d1) &gt; Hash &gt; Seq Scan(d2)</code></p>

<p>So, the query plan consists of a hash join. One child node of the hash join is a sequential scan and the other child is a hash operator which also reads the tuples of the <code class="language-plaintext highlighter-rouge">data</code> table. No index scans are used.</p>

<p><strong>Note:</strong> One interesting optimization in PostgreSQL is that the join type is changed during planning. The query specifies a <code class="language-plaintext highlighter-rouge">LEFT JOIN</code>, but an <code class="language-plaintext highlighter-rouge">INNER JOIN</code> (<code class="language-plaintext highlighter-rouge">"Join Type": "Inner",</code>) is actually executed. A left join guarantees that every tuple of the left input relation is contained in the output. If no join partner is found, the tuple is added to the output and all attributes of the right relation are populated with <code class="language-plaintext highlighter-rouge">NULL</code> values. However, since a filter is applied on the key attribute of the right relation (<code class="language-plaintext highlighter-rouge">d2.key &gt; %%DIMENSION1%%</code>), none of these artificially generated tuples would fulfill the filter condition: comparing <code class="language-plaintext highlighter-rouge">NULL</code> with a value never evaluates to true, so a tuple with a <code class="language-plaintext highlighter-rouge">NULL</code> key is always eliminated by the filter. There is no need to generate tuples that are removed later. Therefore, PostgreSQL changes the join type to an inner join and does not generate these tuples at all.</p>
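
<p>The equivalence behind this rewrite can be checked with a small experiment. The sketch below uses SQLite from the Python standard library instead of PostgreSQL, and the tables <code class="language-plaintext highlighter-rouge">l</code> and <code class="language-plaintext highlighter-rouge">r</code> are hypothetical. It only shows that both join types return the same result once the filter is applied; it says nothing about how the PostgreSQL planner performs the rewrite.</p>

<pre><code class="language-python">import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE l(key integer);
    CREATE TABLE r(key integer);
    INSERT INTO l VALUES (1), (2), (3), (4), (5);
    INSERT INTO r VALUES (3), (4), (5), (6), (7);
    """
)

QUERY = "SELECT l.key, r.key FROM l {kind} JOIN r ON (l.key = r.key) {where} ORDER BY l.key"
FILTER = "WHERE r.key &gt; 3"


def run(kind, where=""):
    """Execute the join with the given join type and optional filter."""
    return conn.execute(QUERY.format(kind=kind, where=where)).fetchall()


# Without a filter, the left join pads unmatched left tuples with NULL (None).
print(run("LEFT"))
# The filter on the right relation never accepts the NULL-padded tuples ...
print(run("LEFT", FILTER))
# ... so the inner join returns exactly the same result.
print(run("INNER", FILTER))
</code></pre>

<p>The first result contains the padded tuples <code class="language-plaintext highlighter-rouge">(1, None)</code> and <code class="language-plaintext highlighter-rouge">(2, None)</code>, while the two filtered results are identical.</p>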

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="w">
  </span><span class="p">{</span><span class="w">
    </span><span class="nl">"Plan"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"Node Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Hash Join"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Parallel Aware"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Async Capable"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Join Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Inner"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Startup Cost"</span><span class="p">:</span><span class="w"> </span><span class="mi">3040</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Total Cost"</span><span class="p">:</span><span class="w"> </span><span class="mi">6205</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Plan Rows"</span><span class="p">:</span><span class="w"> </span><span class="mi">100000</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Plan Width"</span><span class="p">:</span><span class="w"> </span><span class="mi">18</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Actual Startup Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">20.053</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Actual Total Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">55.887</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Actual Rows"</span><span class="p">:</span><span class="w"> </span><span class="mi">100000</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Actual Loops"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Inner Unique"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Hash Cond"</span><span class="p">:</span><span class="w"> </span><span class="s2">"(d1.key = d2.key)"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Plans"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"Node Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Seq Scan"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Parent Relationship"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Outer"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Parallel Aware"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Async Capable"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Relation Name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Alias"</span><span class="p">:</span><span class="w"> </span><span class="s2">"d1"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Startup Cost"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Total Cost"</span><span class="p">:</span><span class="w"> </span><span class="mi">1790</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Plan Rows"</span><span class="p">:</span><span class="w"> </span><span class="mi">100000</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Plan Width"</span><span class="p">:</span><span class="w"> </span><span class="mi">9</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Actual Startup Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.006</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Actual Total Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">10.072</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Actual Rows"</span><span class="p">:</span><span class="w"> </span><span class="mi">100000</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Actual Loops"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Filter"</span><span class="p">:</span><span class="w"> </span><span class="s2">"(key &gt; 0)"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Rows Removed by Filter"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"Node Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Hash"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Parent Relationship"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Inner"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Parallel Aware"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Async Capable"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Startup Cost"</span><span class="p">:</span><span class="w"> </span><span class="mi">1790</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Total Cost"</span><span class="p">:</span><span class="w"> </span><span class="mi">1790</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Plan Rows"</span><span class="p">:</span><span class="w"> </span><span class="mi">100000</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Plan Width"</span><span class="p">:</span><span class="w"> </span><span class="mi">9</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Actual Startup Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">20.036</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Actual Total Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">20.036</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Actual Rows"</span><span class="p">:</span><span class="w"> </span><span class="mi">100000</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Actual Loops"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Hash Buckets"</span><span class="p">:</span><span class="w"> </span><span class="mi">131072</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Original Hash Buckets"</span><span class="p">:</span><span class="w"> </span><span class="mi">131072</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Hash Batches"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Original Hash Batches"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Peak Memory Usage"</span><span class="p">:</span><span class="w"> </span><span class="mi">5115</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Plans"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
            </span><span class="p">{</span><span class="w">
              </span><span class="nl">"Node Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Seq Scan"</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Parent Relationship"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Outer"</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Parallel Aware"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Async Capable"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Relation Name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data"</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Alias"</span><span class="p">:</span><span class="w"> </span><span class="s2">"d2"</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Startup Cost"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Total Cost"</span><span class="p">:</span><span class="w"> </span><span class="mi">1790</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Plan Rows"</span><span class="p">:</span><span class="w"> </span><span class="mi">100000</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Plan Width"</span><span class="p">:</span><span class="w"> </span><span class="mi">9</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Actual Startup Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.003</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Actual Total Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">9.906</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Actual Rows"</span><span class="p">:</span><span class="w"> </span><span class="mi">100000</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Actual Loops"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Filter"</span><span class="p">:</span><span class="w"> </span><span class="s2">"(key &gt; 0)"</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Rows Removed by Filter"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="w">
            </span><span class="p">}</span><span class="w">
          </span><span class="p">]</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="nl">"Planning Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.504</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Triggers"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
    </span><span class="nl">"Execution Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">58.059</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>
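
<p>A fingerprint like the one given for this plan could be derived from the JSON output with a pre-order walk over the plan nodes. The following is a minimal sketch under the assumption that only the node type and the relation alias enter the fingerprint; the function name <code class="language-plaintext highlighter-rouge">fingerprint</code> is hypothetical and the actual implementation of the plan explorer may differ.</p>

<pre><code class="language-python">SEP = " &gt; "


def fingerprint(node):
    """Pre-order walk over a plan node that keeps only structural information."""
    label = node["Node Type"]
    if "Alias" in node:
        # Scan nodes carry the alias of the relation they read.
        label = label + "(" + node["Alias"] + ")"
    children = [fingerprint(child) for child in node.get("Plans", [])]
    return SEP.join([label] + children)


# Structure of the plan shown above, reduced to the keys used here.
plan = {
    "Node Type": "Hash Join",
    "Plans": [
        {"Node Type": "Seq Scan", "Alias": "d1"},
        {"Node Type": "Hash", "Plans": [{"Node Type": "Seq Scan", "Alias": "d2"}]},
    ],
}
print(fingerprint(plan))
</code></pre>

<p>Costs, row counts, and filter constants are ignored by this walk, so two plans that differ only in these values map to the same fingerprint.</p>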

<h3 id="query-plan-2">Query Plan 2</h3>
<p>The second query plan, <code class="language-plaintext highlighter-rouge">Hash Join &gt; Seq Scan(d1) &gt; Hash &gt; Index Scan(d2)</code>, has a structure similar to the first one, with one exception: an index scan instead of a sequential scan provides the input of the hash operation (<code class="language-plaintext highlighter-rouge">"Node Type": "Index Scan"</code>).</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="w">
  </span><span class="p">{</span><span class="w">
    </span><span class="nl">"Plan"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"Node Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Hash Join"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Parallel Aware"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Async Capable"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Join Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Inner"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Startup Cost"</span><span class="p">:</span><span class="w"> </span><span class="mf">2530.96</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Total Cost"</span><span class="p">:</span><span class="w"> </span><span class="mf">5297.79</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Plan Rows"</span><span class="p">:</span><span class="w"> </span><span class="mi">60183</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Plan Width"</span><span class="p">:</span><span class="w"> </span><span class="mi">18</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Actual Startup Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">26.732</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Actual Total Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">48.043</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Actual Rows"</span><span class="p">:</span><span class="w"> </span><span class="mi">60000</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Actual Loops"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Inner Unique"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Hash Cond"</span><span class="p">:</span><span class="w"> </span><span class="s2">"(d1.key = d2.key)"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Plans"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"Node Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Seq Scan"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Parent Relationship"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Outer"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Parallel Aware"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Async Capable"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Relation Name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Alias"</span><span class="p">:</span><span class="w"> </span><span class="s2">"d1"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Startup Cost"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Total Cost"</span><span class="p">:</span><span class="w"> </span><span class="mi">1790</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Plan Rows"</span><span class="p">:</span><span class="w"> </span><span class="mi">100000</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Plan Width"</span><span class="p">:</span><span class="w"> </span><span class="mi">9</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Actual Startup Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.004</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Actual Total Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">9.898</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Actual Rows"</span><span class="p">:</span><span class="w"> </span><span class="mi">100000</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Actual Loops"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Filter"</span><span class="p">:</span><span class="w"> </span><span class="s2">"(key &gt; 0)"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Rows Removed by Filter"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"Node Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Hash"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Parent Relationship"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Inner"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Parallel Aware"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Async Capable"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Startup Cost"</span><span class="p">:</span><span class="w"> </span><span class="mf">1778.67</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Total Cost"</span><span class="p">:</span><span class="w"> </span><span class="mf">1778.67</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Plan Rows"</span><span class="p">:</span><span class="w"> </span><span class="mi">60183</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Plan Width"</span><span class="p">:</span><span class="w"> </span><span class="mi">9</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Actual Startup Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">18.759</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Actual Total Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">18.76</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Actual Rows"</span><span class="p">:</span><span class="w"> </span><span class="mi">60000</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Actual Loops"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Hash Buckets"</span><span class="p">:</span><span class="w"> </span><span class="mi">65536</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Original Hash Buckets"</span><span class="p">:</span><span class="w"> </span><span class="mi">65536</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Hash Batches"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Original Hash Batches"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Peak Memory Usage"</span><span class="p">:</span><span class="w"> </span><span class="mi">2973</span><span class="p">,</span><span class="w">
          </span><span class="nl">"Plans"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
            </span><span class="p">{</span><span class="w">
              </span><span class="nl">"Node Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Index Scan"</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Parent Relationship"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Outer"</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Parallel Aware"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Async Capable"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Scan Direction"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Forward"</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Index Name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data_key_idx"</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Relation Name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data"</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Alias"</span><span class="p">:</span><span class="w"> </span><span class="s2">"d2"</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Startup Cost"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.29</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Total Cost"</span><span class="p">:</span><span class="w"> </span><span class="mf">1778.67</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Plan Rows"</span><span class="p">:</span><span class="w"> </span><span class="mi">60183</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Plan Width"</span><span class="p">:</span><span class="w"> </span><span class="mi">9</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Actual Startup Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.05</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Actual Total Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">11.329</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Actual Rows"</span><span class="p">:</span><span class="w"> </span><span class="mi">60000</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Actual Loops"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Index Cond"</span><span class="p">:</span><span class="w"> </span><span class="s2">"(key &gt; 40000)"</span><span class="p">,</span><span class="w">
              </span><span class="nl">"Rows Removed by Index Recheck"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="w">
            </span><span class="p">}</span><span class="w">
          </span><span class="p">]</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="nl">"Planning Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.131</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Triggers"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
    </span><span class="nl">"Execution Time"</span><span class="p">:</span><span class="w"> </span><span class="mf">49.318</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>

<h2 id="expected-total-cost-and-actual-time">Expected Total Cost and Actual Time</h2>
<p>The tool also draws the expected total cost and the actual execution time. The first image shows that PostgreSQL assumes the highest costs in the lower left corner. This is the point where both predicates of the <code class="language-plaintext highlighter-rouge">WHERE</code> condition let all tuples (<code class="language-plaintext highlighter-rouge">&gt; 0</code>) pass. The costs decrease when one of the predicates filters out more tuples and PostgreSQL needs to join fewer tuples. The lowest costs are in the upper right corner of the image. The lines show some irregularities because the query plan changes between parameter combinations, which also affects the total cost of the query.</p>

<figure class="row">
    
    <div class="column">
        <img class="single" src="/assets/img/planexplorer/total_cost.svg" alt="total_cost.svg" />
    </div>
    
    
    <figcaption class="caption-style"></figcaption>
</figure>

<p>Closely related to the last image is the total execution time. It shows how long the DBMS actually needs to execute the query. The image also shows the trend that the query becomes faster when fewer tuples have to be processed. However, the times are a bit more scattered. There are two reasons for this: (i) in the current version of the tool, the query is executed only once and outliers are not handled (e.g., by taking the average over multiple executions), and (ii) the query is short, so changing the parameters does not have a big impact on the actual execution time.</p>

<figure class="row">
    
    <div class="column">
        <img class="single" src="/assets/img/planexplorer/total_time.svg" alt="total_time.svg" />
    </div>
    
    
    <figcaption class="caption-style"></figcaption>
</figure>

<h2 id="expected-and-actual-tuples">Expected and Actual Tuples</h2>
<p>The next three images show the expected tuples, the actually returned tuples, and the difference between these two values. Comparing the first two images shows that the patterns of the expected and the actually returned tuples differ, i.e., PostgreSQL made a misprediction. The first drawing shows diagonal lines, whereas the drawing of the actual values shows horizontal lines (i.e., changing one of the parameters leads to more tuples in the output independently of the other parameter). PostgreSQL, in contrast, expects some correlation (so-called <em>cross-column dependencies</em>) between these parameters. See the function <code class="language-plaintext highlighter-rouge">clauselist_selectivity()</code> and its <a href="https://github.com/postgres/postgres/blob/58fbfde152b28ca119fef4168550a1a4fef61560/src/backend/optimizer/path/clausesel.c#L57-L98">comment</a> for more details. Using <a href="https://www.postgresql.org/docs/17/planner-stats.html">extended statistics</a> might help to get a better prediction (this might be covered in a follow-up blog post).</p>

<figure class="row">
    
    <div class="column">
        <img class="single" src="/assets/img/planexplorer/expected_tuples.svg" alt="expected_tuples.svg" />
    </div>
    
    
    <figcaption class="caption-style"></figcaption>
</figure>

<figure class="row">
    
    <div class="column">
        <img class="single" src="/assets/img/planexplorer/actual_tuples.svg" alt="actual_tuples.svg" />
    </div>
    
    
    <figcaption class="caption-style"></figcaption>
</figure>

<p>The last drawing shows the difference between these two graphs. Even if a DBMS user is not usually pleased that the number of result tuples has been incorrectly estimated (as this may result in non-optimal plans being used), this error rewards us with a nice-looking picture.</p>

<figure class="row">
    
    <div class="column">
        <img class="single" src="/assets/img/planexplorer/expected_vs_actual_tuples.svg" alt="expected_vs_actual_tuples.svg" />
    </div>
    
    
    <figcaption class="caption-style"></figcaption>
</figure>

<h1 id="conclusion">Conclusion</h1>
<p>This blog post covers the basic functionality of <a href="https://jnidzwetzki.github.io/planexplorer/">plan explorer</a>. The tool is now available as open source at <a href="https://github.com/jnidzwetzki/planexplorer">GitHub</a>. The tool iterates over a two-dimensional search space and executes a query for every parameter combination. Based on the information returned from the database system, the tool draws visualizations of query plans, estimated values like time and returned tuples, and actual values such as the number of returned tuples.</p>

<p>Apart from the artistic aspect of the drawings, the tool gives you insights into the query optimizer decisions of PostgreSQL and highlights mispredictions, such as an incorrect number of returned tuples. So, the tool can be used by anyone who wants to tune cost models or integrate their own cost models into PostgreSQL. The new server mode allows you to connect to real PostgreSQL installations running on another system. Therefore, queries on large (custom) data sets can be analyzed, as well as custom PostgreSQL extensions that are not yet available as a WebAssembly build and integrated into PGlite.</p>]]></content><author><name>Jan Nidzwetzki</name></author><category term="PostgreSQL" /><category term="Query Optimization" /><category term="Research" /><summary type="html"><![CDATA[SQL is a declarative language; only the result of the query is specified. The exact steps to produce the result must be determined by the DBMS. Often, multiple ways exist to calculate a query result. For example, the DBMS can choose to use an index or perform a sequential scan on a table to find the needed tuples. The query optimizer is responsible for finding the most efficient plan for a given query. The plan generator creates possible plans, which are then evaluated based on their costs. Afterward, the cheapest plan is chosen and executed. When the DBMS expects to return a large portion of the table, a full table scan can be more efficient than following the pointers in an index structure. However, it is hard to determine when the DBMS prefers one plan over another and when the switch between plans occurs. In a few evenings, I implemented the plan explorer for PostgreSQL. It iterates over a search space and generates visualizations that show when the plan changes and how many tuples are expected versus the actual number returned. This blog post examines the “art” of query optimization. 
It discusses the plan explorer tool, the images the tool generates, and the insights the tool provides into the decisions made by the PostgreSQL query optimizer.]]></summary></entry><entry><title type="html">Building a Query Plan Explorer using GitHub Copilot</title><link href="https://jnidzwetzki.github.io/2025/05/18/building-a-query-plan-explorer.html" rel="alternate" type="text/html" title="Building a Query Plan Explorer using GitHub Copilot" /><published>2025-05-18T00:00:00+00:00</published><updated>2025-05-18T00:00:00+00:00</updated><id>https://jnidzwetzki.github.io/2025/05/18/building-a-query-plan-explorer</id><content type="html" xml:base="https://jnidzwetzki.github.io/2025/05/18/building-a-query-plan-explorer.html"><![CDATA[<p><em>Large language models</em> (LLMs) that generate code are common nowadays. For a <a href="https://code.visualstudio.com/blogs/2025/04/07/agentMode">couple of weeks</a> now, VS Code has had an agent mode that performs multi-step coding tasks.</p>

<p>I was actively involved in web development roughly 20–25 years ago, when <a href="https://de.wikipedia.org/wiki/Common_Gateway_Interface">CGI</a>, Perl, and early versions of PHP were popular. I have no idea how modern web development actually works. I always had some projects in mind that I wanted to create, but I never had the time to dig into one of the modern JavaScript frameworks like React. GitHub Copilot now seems like a way to create (web) applications just by describing the requirements (i.e., <a href="https://en.wikipedia.org/wiki/Vibe_coding">vibe coding</a>) for an entire application.</p>

<p>This post describes my experience building a PostgreSQL query plan explorer using React and VS Code in two evenings—without writing a single line of code myself.</p>

<!--more-->

<p>When I started with web development, building web applications was quite simple. You had HTML, the first version of CSS, some software, and the <em>Common Gateway Interface</em> (CGI) standard. To build a dynamic website, you just wrote plain HTML to stdout, like:</p>

<div class="language-perl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/perl</span>

<span class="k">print</span> <span class="p">"</span><span class="s2">Content-type: text/html</span><span class="se">\n\n</span><span class="p">";</span>
<span class="k">print</span> <span class="p">"</span><span class="s2">&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;Hello World&lt;/body&gt;&lt;/html&gt;</span><span class="se">\n</span><span class="p">";</span>
</code></pre></div></div>

<p>Since that time, the internet has evolved. CSS became increasingly popular, and I switched to PHP for web development. JavaScript, AJAX, and further technologies have changed the way websites work. In my career, I developed mostly backend components and switched to database internals after I finished university. So, I lost track of how modern web applications work.</p>

<p>In 2020, I worked in a management role and led a team of web developers. I learned a bit about recent technologies like React, GraphQL, and Node.js, but never wrote code myself. However, since then, I have wanted to build a modern web application to understand how this works today. But I never had enough time to look deeper into any of these technologies.</p>

<p>The moment the agent mode for GitHub Copilot was released, it became clear to me that I wanted to challenge myself and see if I could build a modern web application without writing a single line of code—just by knowing the requirements, knowing which technologies can be used, and relying on 25-year-old knowledge about web development.</p>

<h1 id="the-project">The Project</h1>
<p>Since I spend most of my day with database internals, it was clear that I wanted to build something DBMS-related. A few weeks ago, a friend of mine drew my attention to <a href="https://dl.acm.org/doi/10.14778/1920841.1921027">Picasso</a>.</p>

<blockquote>
  <p>The Picasso database query optimizer visualizer, Jayant R. Haritsa, Proceedings of the VLDB Endowment, Volume 3, Issue 1-2, September 2010</p>
</blockquote>

<h2 id="picasso-database-query-optimizer-visualizer">Picasso Database Query Optimizer Visualizer</h2>
<p>The Picasso database query optimizer visualizer creates a multi-dimensional search space and executes a database query with different parameters from this search space to determine the query plan used by the database system. The resulting query plans (e.g., an index scan and a full table scan) are fingerprinted and plotted in an output graph. Similar plans are shown with the same color. Thus, the tool visualizes the query plans used and their distribution, generating images from that information.</p>

<h2 id="plan-explorer">Plan Explorer</h2>
<p>The idea was to build a tool similar to Picasso using React as a static website. The required PostgreSQL installation to execute SQL queries and run the query optimizer can be embedded directly into the browser using <a href="https://pglite.dev/">PGlite</a> to get a standalone app without any need for a database server.</p>

<p>PGlite is a <a href="https://en.wikipedia.org/wiki/WebAssembly">WASM</a> (WebAssembly – bytecode that is directly loaded and executed in the browser) build of PostgreSQL that runs in any WASM-capable browser (WebAssembly is supported by most browsers these days).</p>

<h2 id="query-plans-in-postgresql">Query Plans in PostgreSQL</h2>
<p>In PostgreSQL, you can get the query plan used by prefixing a query with <code class="language-plaintext highlighter-rouge">EXPLAIN</code>. If the option <code class="language-plaintext highlighter-rouge">FORMAT JSON</code> is added, the returned query plan is in JSON format, which is useful for processing the query plans with JavaScript.</p>

<p>For example, if you have the following table structure:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="k">data</span><span class="p">(</span><span class="k">key</span> <span class="nb">integer</span><span class="p">,</span> <span class="n">value</span> <span class="nb">text</span><span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="k">data</span> <span class="p">(</span><span class="k">key</span><span class="p">,</span> <span class="n">value</span><span class="p">)</span> <span class="k">SELECT</span> <span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="p">::</span><span class="nb">text</span> <span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100000</span><span class="p">)</span> <span class="n">i</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">INDEX</span> <span class="k">ON</span> <span class="k">data</span><span class="p">(</span><span class="k">key</span><span class="p">);</span>
</code></pre></div></div>

<p>And you perform a <code class="language-plaintext highlighter-rouge">SELECT</code> statement on the table with a <code class="language-plaintext highlighter-rouge">WHERE</code> clause of the form <code class="language-plaintext highlighter-rouge">key &gt; ...</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="k">data</span> <span class="k">WHERE</span> <span class="k">key</span> <span class="o">&gt;</span> <span class="p">...;</span>
</code></pre></div></div>

<p>PostgreSQL could use:</p>

<ul>
  <li>An index scan to get the tuples that qualify.</li>
  <li>Or a full table scan and apply the filter on each of the scanned tuples.</li>
</ul>

<p>The query optimizer determines the fastest query plan. When the query returns many tuples, the index scan is very costly. The index structure has to be traversed, and the referenced tuples have to be read from the table. Therefore, a lot of random I/O is performed, and traversing the index also adds overhead. If the query returns a large fraction of the table, it could be faster to perform a full table scan and apply a filter condition to each of the tuples.</p>

<p>In contrast, if only a small fraction of the table is returned by the query (i.e., the selectivity of the predicate in the <code class="language-plaintext highlighter-rouge">WHERE</code> clause is low), the overhead added by the index scan is less than applying the filter condition to all tuples. So, it is beneficial to use the index.</p>

<p>However, there is no easy way to determine when this change between the two query plans will happen. This tool will run the query with different <code class="language-plaintext highlighter-rouge">WHERE</code> clauses to answer this question. The example <a href="#select-using-a-table-scan-or-an-index">section</a>
shows how this decision will look for a concrete query.</p>
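
<p>As a minimal sketch, the plan for one concrete value can also be requested by hand. The value <code class="language-plaintext highlighter-rouge">40000</code> is picked only for illustration, and running <code class="language-plaintext highlighter-rouge">ANALYZE</code> first is an assumption to make sure the planner works with current statistics:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Refresh the planner statistics for the example table
ANALYZE data;

-- Return the chosen plan as JSON without executing the query
EXPLAIN (FORMAT JSON)
SELECT * FROM data WHERE key &gt; 40000;
</code></pre></div></div>

<p>Repeating this statement with different values is essentially what the tool automates.</p>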

<h2 id="tool-features">Tool Features</h2>
<p>So, the main tasks of the tool are:</p>

<ul>
  <li>Iterating over a one- or two-dimensional search space.</li>
  <li>Letting PGlite generate the query plan for each parameter combination of the search space.</li>
  <li>Fingerprinting the returned query plans and finding similar query plans (i.e., plans with the same structure).</li>
  <li>Generating a clear visualization from the gathered data.</li>
</ul>

<p>Using GitHub Copilot running in agent mode and GPT 4.1, I was able to build the desired tool in two evenings (roughly 2 × 2.5 hours) and a few small corrections in the days afterwards. I described the requirements (e.g., “the user should be able to determine a one or two-dimensional search space”, “dimension 1 of the search space should be optional”, “the tool should have a modern UI”).</p>

<p>The tool looks like this:</p>

<figure class="row">
    
    <div class="column">
        <img class="single" src="/assets/img/plan_explorer.png" alt="plan_explorer.png" />
    </div>
    
    
    <figcaption class="caption-style"></figcaption>
</figure>

<p>In the upper part of the tool, up to two dimensions, ranges, and steps can be defined. These describe the one- or two-dimensional search space that should be iterated by the tool. Afterwards, preparation steps to set up the database can be defined (e.g., creating tables and filling data). Then, the actual query with placeholders (<code class="language-plaintext highlighter-rouge">%%DIMENSION0%%</code> for the value of the first dimension and <code class="language-plaintext highlighter-rouge">%%DIMENSION1%%</code> for the value of the second dimension) can be defined.</p>
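
<p>For the single-table example from the previous section, the query template entered into the tool might look like the following sketch. It is a template and not plain SQL, since the placeholder stands for the current value of the first dimension:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Template: %%DIMENSION0%% stands for the current value of dimension 0
SELECT * FROM data WHERE key &gt; %%DIMENSION0%%;
</code></pre></div></div>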

<p>A build of the tool can be found at <a href="https://jnidzwetzki.github.io/planexplorer/">https://jnidzwetzki.github.io/planexplorer/</a> if you want to try it yourself.</p>

<h2 id="example-query-plans">Example Query Plans</h2>
<p>The tool provides useful insights into the decisions made by the query planner. In this section, the tool output for three different queries is discussed.</p>

<h3 id="select-using-a-table-scan-or-an-index">Select Using a Table Scan or an Index</h3>
<p>The query plan section already discussed that PostgreSQL can pick an index scan if it is more efficient than a full table scan. The following image shows the output of the tool when dimension 0 (the first dimension) was iterated from 0 to 50000 in steps of 10000 and the query:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="k">data</span> <span class="k">WHERE</span> <span class="k">key</span> <span class="o">&gt;</span> <span class="p">...;</span>
</code></pre></div></div>

<p>was executed.</p>

<figure class="row">
    
    <div class="column">
        <img class="single" src="/assets/img/plan_explorer_query1.png" alt="plan_explorer_query1.png" />
    </div>
    
    
    <figcaption class="caption-style"></figcaption>
</figure>

<p>It can be seen in the produced image that after the value of 40000, the query plan changes (light blue vs. dark blue). This is also the expected behavior. When fewer tuples are returned by the query, it is beneficial to use the index.</p>

<p>Below the visualization, the tool shows the query plans that were actually used. The first one is the full table scan:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="w">
 </span><span class="p">{</span><span class="w">
    </span><span class="nl">"Plan"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"Node Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Seq Scan"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Parallel Aware"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Async Capable"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Relation Name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Alias"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Startup Cost"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Total Cost"</span><span class="p">:</span><span class="w"> </span><span class="mi">1790</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Plan Rows"</span><span class="p">:</span><span class="w"> </span><span class="mi">100000</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Plan Width"</span><span class="p">:</span><span class="w"> </span><span class="mi">9</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Filter"</span><span class="p">:</span><span class="w"> </span><span class="s2">"(key &gt; 0)"</span><span class="w">
    </span><span class="p">}</span><span class="w">
 </span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>

<p>The second query plan is the index scan using the index on the <code class="language-plaintext highlighter-rouge">key</code> attribute.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="w">
 </span><span class="p">{</span><span class="w">
    </span><span class="nl">"Plan"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"Node Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Index Scan"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Parallel Aware"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Async Capable"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Scan Direction"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Forward"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Index Name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data_key_idx"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Relation Name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Alias"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Startup Cost"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.29</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Total Cost"</span><span class="p">:</span><span class="w"> </span><span class="mf">1769.94</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Plan Rows"</span><span class="p">:</span><span class="w"> </span><span class="mi">59896</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Plan Width"</span><span class="p">:</span><span class="w"> </span><span class="mi">9</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Index Cond"</span><span class="p">:</span><span class="w"> </span><span class="s2">"(key &gt; 40000)"</span><span class="w">
    </span><span class="p">}</span><span class="w">
 </span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>
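
<p>The total cost of 1790 reported for the sequential scan can be related to the planner's cost parameters. The following query is a sketch based on the sequential scan cost formula described in the PostgreSQL documentation (pages read times <code class="language-plaintext highlighter-rouge">seq_page_cost</code>, plus rows scanned times <code class="language-plaintext highlighter-rouge">cpu_tuple_cost</code>, plus one <code class="language-plaintext highlighter-rouge">cpu_operator_cost</code> per row for the filter). It assumes that the statistics in <code class="language-plaintext highlighter-rouge">pg_class</code> are current; the column alias is chosen for illustration only:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Recompute the estimated sequential scan cost from the catalog statistics
SELECT relpages,
       reltuples,
       relpages * current_setting('seq_page_cost')::float8
         + reltuples * current_setting('cpu_tuple_cost')::float8
         + reltuples * current_setting('cpu_operator_cost')::float8
         AS estimated_seq_scan_cost
FROM pg_class
WHERE relname = 'data';
</code></pre></div></div>

<p>With the default settings (<code class="language-plaintext highlighter-rouge">seq_page_cost = 1</code>, <code class="language-plaintext highlighter-rouge">cpu_tuple_cost = 0.01</code>, <code class="language-plaintext highlighter-rouge">cpu_operator_cost = 0.0025</code>) and 100000 tuples, the CPU part alone is 100000 × 0.0125 = 1250, so a total cost of 1790 would correspond to roughly 540 table pages.</p>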

<h3 id="changing-the-random-page-costs">Changing the Random Page Costs</h3>
<p>The previous example showed when PostgreSQL changes from a table scan to an index scan as the selectivity of a predicate changes. A one-dimensional search space was used to execute the query.</p>

<p>However, the exact point at which PostgreSQL switches from one plan to another also depends on the set costs for page access. For instance, the setting <code class="language-plaintext highlighter-rouge">random_page_cost</code> has a default value of 4 and describes the costs that occur when a page is accessed in random order (in contrast to <code class="language-plaintext highlighter-rouge">seq_page_cost</code> with a default value of 1 when a page is accessed sequentially). In this experiment, the same query is performed, but a two-dimensional search space is used. The value for <code class="language-plaintext highlighter-rouge">random_page_cost</code> is changed from 0 to 8 in steps of 0.25. The result can be seen in the following image:</p>

<figure class="row">
    
    <div class="column">
        <img class="single" src="/assets/img/plan_explorer_query2.svg" alt="plan_explorer_query2.svg" />
    </div>
    
    
    <figcaption class="caption-style"></figcaption>
</figure>

<p>Light blue is again the query plan that uses the sequential (full table) scan and dark blue is the query plan that uses the index scan.</p>

<p>It can be seen that PostgreSQL uses the index scan much earlier when the <code class="language-plaintext highlighter-rouge">random_page_cost</code> is low (i.e., the penalty of following the pointers in the index and accessing pages in random order is lower). In contrast, when the <code class="language-plaintext highlighter-rouge">random_page_cost</code> is high, PostgreSQL starts to use the index scan much later.</p>
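
<p>A single point of this search space can be reproduced by hand, as sketched below. The values <code class="language-plaintext highlighter-rouge">1.0</code> and <code class="language-plaintext highlighter-rouge">20000</code> are arbitrary picks for illustration and are not taken from the image:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Lower the random page cost for the current session only
SET random_page_cost = 1.0;

-- Show the plan the optimizer picks with this setting
EXPLAIN SELECT * FROM data WHERE key &gt; 20000;

-- Restore the configured default
RESET random_page_cost;
</code></pre></div></div>

<p>Comparing the output with the plan produced under the default setting shows whether the changed cost parameter moved the switch point for this particular value.</p>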

<h3 id="performing-a-self-join">Performing a Self-Join</h3>
<p>The last example shows a more complex scenario with more than two different query plans. The example query now performs a self-join and has the same <code class="language-plaintext highlighter-rouge">WHERE</code> clause as the previous examples:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SET</span> <span class="n">random_page_cost</span> <span class="o">=</span> <span class="o">%%</span><span class="n">DIMENSION1</span><span class="o">%%</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="k">data</span> <span class="n">d1</span> <span class="k">LEFT</span> <span class="k">JOIN</span> <span class="k">data</span> <span class="n">d2</span> <span class="k">ON</span> <span class="p">(</span><span class="n">d1</span><span class="p">.</span><span class="k">key</span> <span class="o">=</span> <span class="n">d2</span><span class="p">.</span><span class="k">key</span><span class="p">)</span> <span class="k">WHERE</span> <span class="n">d1</span><span class="p">.</span><span class="k">key</span> <span class="o">&gt;</span> <span class="o">%%</span><span class="n">DIMENSION0</span><span class="o">%%</span><span class="p">;</span>
</code></pre></div></div>

<p>Again, dimension 0 changes the selectivity of the <code class="language-plaintext highlighter-rouge">WHERE</code> clause and dimension 1 changes the <code class="language-plaintext highlighter-rouge">random_page_cost</code>.</p>

<figure class="row">
    
    <div class="column">
        <img class="single" src="/assets/img/plan_explorer_query3.svg" alt="plan_explorer_query3.svg" />
    </div>
    
    
    <figcaption class="caption-style"></figcaption>
</figure>

<p>The generated image shows that PostgreSQL now uses four different query plans to execute the query. Even in this new query, PostgreSQL can decide whether to use the index or not. Furthermore, the join order can be changed and further optimizations can be applied (however, the details will not be covered in this blog post).</p>

<h1 id="conclusion">Conclusion</h1>
<p>I was able to build a modern web application that uses an in-browser version of PostgreSQL to visualize the query plans used for particular queries in just a few hours, despite having only minimal skills in modern web development. GitHub Copilot with GPT 4.1 did a very good job, and vibe coding really seems to be a viable approach for building simple web apps.</p>

<p>I definitely learned less than I would have by building this tool in plain React and reading all the needed tutorials. But I spent just a couple of hours on the problem and have a usable tool. Otherwise, this would have been a multi-week project and I would never have taken up this development.</p>

<p>The created tool is available at <a href="https://jnidzwetzki.github.io/planexplorer/">https://jnidzwetzki.github.io/planexplorer/</a> and could be used by the database (research) community to explore the generated query plans by PostgreSQL. It might also be a valuable tool for PostgreSQL extension developers who create their own operators and cost models and want to understand when a particular query plan is chosen by the query optimizer.</p>]]></content><author><name>Jan Nidzwetzki</name></author><category term="PostgreSQL" /><category term="Query Optimization" /><category term="Research" /><summary type="html"><![CDATA[Large language models (LLMs) that generate code are nowadays common. Since a couple of weeks, VS Code has an agent mode that performs multi-step coding tasks. I was actively involved in web development roughly 20–25 years ago, when CGI, Perl, and early versions of PHP were popular. I have no idea how modern web development actually works. I always had some projects in mind that I wanted to create, but I never had the time to dig into one of the modern JavaScript frameworks like React. GitHub Copilot now seems like a way to create (web) applications just by describing the requirements (i.e., vibe coding) for an entire application. 
This post describes my experience building a PostgreSQL query plan explorer using React and VS Code in two evenings—without writing a single line of code myself.]]></summary></entry><entry><title type="html">Introduction to Snapshots and Tuple Visibility in PostgreSQL</title><link href="https://jnidzwetzki.github.io/2024/04/03/postgres-and-snapshots.html" rel="alternate" type="text/html" title="Introduction to Snapshots and Tuple Visibility in PostgreSQL" /><published>2024-04-03T00:00:00+00:00</published><updated>2024-04-03T00:00:00+00:00</updated><id>https://jnidzwetzki.github.io/2024/04/03/postgres-and-snapshots</id><content type="html" xml:base="https://jnidzwetzki.github.io/2024/04/03/postgres-and-snapshots.html"><![CDATA[<p>Like many relational DBMSs, PostgreSQL uses multi-version concurrency control (MVCC) to support parallel transactions and coordinate concurrent access to tuples. Snapshots are used to determine which version of a tuple is visible in a given transaction. Each transaction that modifies data has a transaction ID (<code class="language-plaintext highlighter-rouge">txid</code>). Tuples are stored with two attributes (<code class="language-plaintext highlighter-rouge">xmin</code>, <code class="language-plaintext highlighter-rouge">xmax</code>) that determine in which snapshots (and transactions) they are visible.</p>

<p>This blog post discusses some implementation details of snapshots.</p>

<!--more-->

<h2 id="tuple-visibility">Tuple Visibility</h2>

<p>The following table is used in this article to illustrate how snapshots work in PostgreSQL.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">temperature</span> <span class="p">(</span>
  <span class="nb">time</span> <span class="n">timestamptz</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="n">value</span> <span class="nb">float</span>
<span class="p">);</span>
</code></pre></div></div>

<p>Let’s insert the first record into this table. This is done by creating a new transaction, checking the current transaction ID (if assigned), inserting a new tuple, checking the transaction ID again, and committing the transaction.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">BEGIN</span><span class="p">;</span>

<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">txid_current_if_assigned</span><span class="p">();</span>
 <span class="n">txid_current_if_assigned</span>
<span class="c1">--------------------------</span>

<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">temperature</span> <span class="k">VALUES</span><span class="p">(</span><span class="n">now</span><span class="p">(),</span> <span class="mi">4</span><span class="p">);</span>

<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">txid_current_if_assigned</span><span class="p">();</span>
 <span class="n">txid_current_if_assigned</span>
<span class="c1">--------------------------</span>
                  <span class="mi">5062286</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span>

<span class="k">COMMIT</span><span class="p">;</span>
</code></pre></div></div>

<p>An important observation in this example is that PostgreSQL assigns a transaction ID to a transaction only once it modifies data. This optimization avoids unnecessary work and slows down the consumption of transaction IDs. Because the transaction ID is a 32-bit integer, the ID space is finite and the counter eventually wraps around. PostgreSQL handles this by freezing old tuples to manage <a href="https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND">transaction ID wraparounds</a> properly.</p>
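
<p>The wraparound can be made concrete with a small sketch. PostgreSQL compares normal transaction IDs modulo 2<sup>32</sup> by interpreting their difference as a signed 32-bit value. The Python function below mimics that comparison; the name <code class="language-plaintext highlighter-rouge">xid_precedes</code> is chosen for this illustration only, and the special transaction IDs below 3, which PostgreSQL treats separately, are ignored.</p>

```python
XID_MASK = (1 << 32) - 1   # transaction IDs are 32-bit values
HALF_RANGE = 1 << 31       # half of the circular ID space


def xid_precedes(a: int, b: int) -> bool:
    """Return True if XID a is logically older than XID b (modulo 2**32)."""
    diff = (a - b) & XID_MASK
    # A difference in the upper half of the range corresponds to a
    # negative signed 32-bit value, which means that a is older than b.
    return diff >= HALF_RANGE


# Without wraparound, the comparison matches the plain numeric order.
print(xid_precedes(5062286, 5062291))
# After a wraparound, a numerically large XID can be older than a small one.
print(xid_precedes(4294967290, 10))
```

<p>Every transaction ID therefore has roughly two billion IDs in its past and two billion in its future, which is why old tuples have to be frozen before the counter comes around again.</p>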

<p>The <a href="https://www.postgresql.org/docs/16/ddl-system-columns.html">system attributes</a> <code class="language-plaintext highlighter-rouge">xmin</code> and <code class="language-plaintext highlighter-rouge">xmax</code> store the ID of the transaction that inserted a tuple and the ID of the transaction that deleted it (0 if the tuple has not been deleted). Together, they bound the range of transactions that can see the tuple. Additionally, the <code class="language-plaintext highlighter-rouge">ctid</code> attribute indicates the tuple’s physical location (the page number and the position within that page). These attributes are displayed only when explicitly listed in a <code class="language-plaintext highlighter-rouge">SELECT</code> statement:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">xmin</span><span class="p">,</span> <span class="n">xmax</span><span class="p">,</span> <span class="n">ctid</span><span class="p">,</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">temperature</span><span class="p">;</span>
  <span class="n">xmin</span>   <span class="o">|</span> <span class="n">xmax</span> <span class="o">|</span>  <span class="n">ctid</span> <span class="o">|</span>             <span class="nb">time</span>              <span class="o">|</span> <span class="n">value</span>
<span class="c1">---------+------+-------+-------------------------------+-------</span>
 <span class="mi">5062286</span> <span class="o">|</span>    <span class="mi">0</span> <span class="o">|</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span> <span class="o">|</span> <span class="mi">2024</span><span class="o">-</span><span class="mi">04</span><span class="o">-</span><span class="mi">02</span> <span class="mi">22</span><span class="p">:</span><span class="mi">06</span><span class="p">:</span><span class="mi">03</span><span class="p">.</span><span class="mi">035868</span><span class="o">+</span><span class="mi">02</span> <span class="o">|</span>     <span class="mi">4</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span>
</code></pre></div></div>

<p>The output indicates that the tuple was created by transaction <code class="language-plaintext highlighter-rouge">5062286</code> and, simplified, that all transactions with a transaction ID <code class="language-plaintext highlighter-rouge">&gt;= 5062286</code> can see it. When the tuple is deleted, the <code class="language-plaintext highlighter-rouge">xmax</code> value is updated with the transaction ID of the deleting transaction. The <code class="language-plaintext highlighter-rouge">ctid</code> value <code class="language-plaintext highlighter-rouge">(0,1)</code> means the tuple is the first on page 0. Now, let’s delete the tuple:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">BEGIN</span><span class="p">;</span>

<span class="k">DELETE</span> <span class="k">FROM</span> <span class="n">temperature</span><span class="p">;</span>

<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">txid_current_if_assigned</span><span class="p">();</span>
 <span class="n">txid_current_if_assigned</span>
<span class="c1">--------------------------</span>
                  <span class="mi">5062291</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span>

<span class="k">COMMIT</span><span class="p">;</span>
</code></pre></div></div>

<p>When a <code class="language-plaintext highlighter-rouge">SELECT</code> statement is executed afterwards, no rows are returned, even though the deleted tuple is still stored on the page together with its <code class="language-plaintext highlighter-rouge">xmin</code> and <code class="language-plaintext highlighter-rouge">xmax</code> values.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">xmin</span><span class="p">,</span> <span class="n">xmax</span><span class="p">,</span> <span class="n">ctid</span><span class="p">,</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">temperature</span><span class="p">;</span>
 <span class="n">xmin</span> <span class="o">|</span> <span class="n">xmax</span> <span class="o">|</span> <span class="n">ctid</span> <span class="o">|</span> <span class="nb">time</span> <span class="o">|</span> <span class="n">value</span>
<span class="c1">------+------+------+------+-------</span>
<span class="p">(</span><span class="mi">0</span> <span class="k">rows</span><span class="p">)</span>
</code></pre></div></div>

<p>This behavior is due to the internal scanner. If a tuple is not visible in the current transaction snapshot, it is skipped. To retrieve these values, we need to use lower-level tools instead of a simple <code class="language-plaintext highlighter-rouge">SELECT</code>.</p>

<p>The <a href="https://www.postgresql.org/docs/current/pageinspect.html">pageinspect</a> extension for PostgreSQL allows us to examine all tuples stored on a page and decode their internal flags and attributes. After loading the extension, we can inspect the pages of a relation.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Load the extension</span>
<span class="k">CREATE</span> <span class="n">EXTENSION</span> <span class="n">pageinspect</span><span class="p">;</span>

<span class="c1">-- Get the tuples of the first page of the relation 'temperature'</span>
<span class="k">SELECT</span> <span class="n">lp</span><span class="p">,</span> <span class="n">t_xmin</span><span class="p">,</span> <span class="n">t_xmax</span> <span class="k">FROM</span> <span class="n">heap_page_items</span><span class="p">(</span><span class="n">get_raw_page</span><span class="p">(</span><span class="s1">'temperature'</span><span class="p">,</span> <span class="mi">0</span><span class="p">));</span>

 <span class="n">lp</span> <span class="o">|</span> <span class="n">t_xmin</span>  <span class="o">|</span> <span class="n">t_xmax</span>
<span class="c1">----+---------+---------</span>
  <span class="mi">1</span> <span class="o">|</span> <span class="mi">5062286</span> <span class="o">|</span> <span class="mi">5062291</span>
</code></pre></div></div>

<p>The output shows that the first tuple on page 0 (with <code class="language-plaintext highlighter-rouge">ctid</code> <code class="language-plaintext highlighter-rouge">(0,1)</code>) has a <code class="language-plaintext highlighter-rouge">t_xmax</code> value of <code class="language-plaintext highlighter-rouge">5062291</code>, which matches the transaction ID that deleted the tuple. Thus, any transaction with a transaction ID greater than <code class="language-plaintext highlighter-rouge">5062291</code> will not see this tuple.</p>

<h2 id="snapshots">Snapshots</h2>

<p>When PostgreSQL scans a table, a snapshot must be specified. The <code class="language-plaintext highlighter-rouge">table_beginscan</code> function takes the snapshot data as its second parameter:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="n">TableScanDesc</span> <span class="n">table_beginscan</span><span class="p">(</span><span class="n">Relation</span> <span class="n">rel</span><span class="p">,</span>
    <span class="n">Snapshot</span> <span class="n">snapshot</span><span class="p">,</span> <span class="kt">int</span> <span class="n">nkeys</span><span class="p">,</span> <span class="k">struct</span> <span class="n">ScanKeyData</span> <span class="o">*</span><span class="n">key</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="internal-data-structures">Internal Data Structures</h3>

<p>Typically, the <a href="https://github.com/postgres/postgres/blob/06c418e163e913966e17cb2d3fb1c5f8a8d58308/src/backend/utils/time/snapmgr.c#L216">transaction snapshot</a> is used as a parameter for this function. The structure <a href="https://github.com/postgres/postgres/blob/06c418e163e913966e17cb2d3fb1c5f8a8d58308/src/include/utils/snapshot.h#L142">SnapshotData</a> contains all the information that is part of a snapshot. In this blog post, we focus on the following attributes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">SnapshotData</span>
<span class="p">{</span>
  <span class="p">[...]</span>
	<span class="cm">/*
	 * An MVCC snapshot can never see the effects of XIDs &gt;= xmax. It can see
	 * the effects of all older XIDs except those listed in the snapshot. xmin
	 * is stored as an optimization to avoid needing to search the XID arrays
	 * for most tuples.
	 */</span>
	<span class="n">TransactionId</span> <span class="n">xmin</span><span class="p">;</span>			<span class="cm">/* all XID &lt; xmin are visible to me */</span>
	<span class="n">TransactionId</span> <span class="n">xmax</span><span class="p">;</span>			<span class="cm">/* all XID &gt;= xmax are invisible to me */</span>

	<span class="cm">/*
	 * For normal MVCC snapshot this contains the all xact IDs that are in
	 * progress, unless the snapshot was taken during recovery in which case
	 * it's empty. For historic MVCC snapshots, the meaning is inverted, i.e.
	 * it contains *committed* transactions between xmin and xmax.
	 *
	 * note: all ids in xip[] satisfy xmin &lt;= xip[i] &lt; xmax
	 */</span>
	<span class="n">TransactionId</span> <span class="o">*</span><span class="n">xip</span><span class="p">;</span>
	<span class="n">uint32</span>		<span class="n">xcnt</span><span class="p">;</span>			<span class="cm">/* # of xact ids in xip[] */</span>
  <span class="p">[...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">xmin</code> field holds the ID of the oldest transaction that was still active when the snapshot was taken. All transactions with a transaction ID lower than this value have already finished (committed or aborted). Thus, all tuples written by committed transactions with a lower transaction ID are visible in this snapshot. The <code class="language-plaintext highlighter-rouge">xmax</code> field marks the upper bound of the snapshot: it is one more than the highest transaction ID that had completed when the snapshot was taken. All tuples with a transaction ID greater than or equal to <code class="language-plaintext highlighter-rouge">xmax</code> are invisible in the current snapshot.</p>

<p>Why are the <code class="language-plaintext highlighter-rouge">xip</code> and <code class="language-plaintext highlighter-rouge">xcnt</code> fields needed? For transaction IDs between <code class="language-plaintext highlighter-rouge">xmin</code> and <code class="language-plaintext highlighter-rouge">xmax</code>, it must be determined whether the transaction was committed or in progress when the snapshot was created.</p>

<p>A DBMS processes queries from multiple users, who can start transactions at any time. The start and commit times of these transactions are not ordered. This means there might be transactions with a transaction ID larger than <code class="language-plaintext highlighter-rouge">xmin</code> that were already committed when the snapshot was created. However, some other transactions in the range <code class="language-plaintext highlighter-rouge">[xmin, xmax)</code> might still be uncommitted. Since the data of committed and uncommitted transactions must be handled properly, an array of transaction IDs <code class="language-plaintext highlighter-rouge">xip</code> of length <code class="language-plaintext highlighter-rouge">xcnt</code> is defined. It contains all transactions with an ID greater than or equal to <code class="language-plaintext highlighter-rouge">xmin</code> and lower than <code class="language-plaintext highlighter-rouge">xmax</code> that were in progress when the snapshot was taken.</p>
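
<p>The three rules from the struct comments can be sketched in a few lines of Python. This is a simplified model and not PostgreSQL’s actual visibility code: the names <code class="language-plaintext highlighter-rouge">Snapshot</code>, <code class="language-plaintext highlighter-rouge">xid_visible</code>, and <code class="language-plaintext highlighter-rouge">tuple_visible</code> are made up for this illustration, and the sketch assumes that every finished transaction committed, that <code class="language-plaintext highlighter-rouge">t_xmax</code> is only ever set by a delete, and that no wraparound occurs.</p>

```python
from typing import NamedTuple, Tuple


class Snapshot(NamedTuple):
    xmin: int             # every XID below this value has finished
    xmax: int             # every XID at or above this value is invisible
    xip: Tuple[int, ...]  # XIDs in [xmin, xmax) that were still in progress


def xid_visible(xid: int, snap: Snapshot) -> bool:
    """Are the effects of transaction xid visible in the snapshot?"""
    if xid >= snap.xmax:
        return False      # started after the snapshot was taken
    if xid < snap.xmin:
        return True       # finished before the snapshot (assumed committed)
    return xid not in snap.xip  # in between: visible unless still in progress


def tuple_visible(t_xmin: int, t_xmax: int, snap: Snapshot) -> bool:
    """A tuple is visible if its insert is visible and its delete is not."""
    inserted = xid_visible(t_xmin, snap)
    deleted = t_xmax != 0 and xid_visible(t_xmax, snap)
    return inserted and not deleted


snap = Snapshot(xmin=5062310, xmax=5062313, xip=(5062310, 5062311))
print(tuple_visible(5062286, 0, snap))        # inserted long ago, never deleted
print(tuple_visible(5062286, 5062291, snap))  # deleted by a finished transaction
print(tuple_visible(5062310, 0, snap))        # inserted by an in-progress transaction
```

<p>The early exits on <code class="language-plaintext highlighter-rouge">xmin</code> and <code class="language-plaintext highlighter-rouge">xmax</code> reflect the optimization mentioned in the struct comment: for most tuples, the <code class="language-plaintext highlighter-rouge">xip</code> array does not have to be searched at all.</p>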

<h3 id="example">Example</h3>

<p>To illustrate this behavior, let’s perform a practical example using three transactions.</p>

<h4 id="transaction-1">Transaction 1</h4>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">BEGIN</span><span class="p">;</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">temperature</span> <span class="k">VALUES</span><span class="p">(</span><span class="n">now</span><span class="p">(),</span> <span class="mi">5</span><span class="p">);</span>

<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">txid_current_if_assigned</span><span class="p">();</span>
 <span class="n">txid_current_if_assigned</span>
<span class="c1">--------------------------</span>
                  <span class="mi">5062310</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span>
</code></pre></div></div>

<p>The first transaction inserts new data into the <code class="language-plaintext highlighter-rouge">temperature</code> table but remains uncommitted. The transaction has a transaction ID of <code class="language-plaintext highlighter-rouge">5062310</code>.</p>

<h4 id="transaction-2">Transaction 2</h4>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">BEGIN</span><span class="p">;</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">temperature</span> <span class="k">VALUES</span><span class="p">(</span><span class="n">now</span><span class="p">(),</span> <span class="mi">5</span><span class="p">);</span>

<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">txid_current_if_assigned</span><span class="p">();</span>
 <span class="n">txid_current_if_assigned</span>
<span class="c1">--------------------------</span>
                  <span class="mi">5062311</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span>
</code></pre></div></div>

<p>The second transaction also inserts data into the same table but remains uncommitted. The transaction ID is <code class="language-plaintext highlighter-rouge">5062311</code>.</p>

<h4 id="transaction-3">Transaction 3</h4>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">pg_current_snapshot</span><span class="p">();</span>
 <span class="n">pg_current_snapshot</span>
<span class="c1">---------------------</span>
 <span class="mi">5062310</span><span class="p">:</span><span class="mi">5062310</span><span class="p">:</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span>
</code></pre></div></div>

<p>The third transaction uses the <code class="language-plaintext highlighter-rouge">pg_current_snapshot</code> function to get the current snapshot. The output indicates that all changes by transactions with an ID lower than <code class="language-plaintext highlighter-rouge">5062310</code> are visible. Changes made by transactions with an ID equal to or greater than <code class="language-plaintext highlighter-rouge">5062310</code> are not visible, and the snapshot lists no in-progress transactions.</p>

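<p>So, what happened to the still-pending transactions <code class="language-plaintext highlighter-rouge">5062310</code> and <code class="language-plaintext highlighter-rouge">5062311</code>? The upper bound of a snapshot follows the most recently completed transaction, and no transaction with an ID of <code class="language-plaintext highlighter-rouge">5062310</code> or higher has completed in this demo system so far. Both pending transactions are therefore already excluded by the upper bound and do not need to be listed separately. This changes as soon as a newer transaction ID is used:</p>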

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">pg_current_xact_id_if_assigned</span><span class="p">();</span>
 <span class="n">pg_current_xact_id_if_assigned</span>
<span class="c1">--------------------------------</span>

<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span>

<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">pg_current_xact_id</span><span class="p">();</span>
 <span class="n">pg_current_xact_id</span>
<span class="c1">--------------------</span>
            <span class="mi">5062312</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span>

<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">pg_current_snapshot</span><span class="p">();</span>
       <span class="n">pg_current_snapshot</span>
<span class="c1">---------------------------------</span>
 <span class="mi">5062310</span><span class="p">:</span><span class="mi">5062313</span><span class="p">:</span><span class="mi">5062310</span><span class="p">,</span><span class="mi">5062311</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span>
</code></pre></div></div>

<p>Unlike the <code class="language-plaintext highlighter-rouge">pg_current_xact_id_if_assigned</code> function, the <code class="language-plaintext highlighter-rouge">pg_current_xact_id</code> function forces the assignment of a transaction ID to the current transaction. In this case, the transaction ID is <code class="language-plaintext highlighter-rouge">5062312</code>. Using this transaction ID also updates the snapshot.</p>

<p>The first value remains the same: all tuples modified by transactions with an ID lower than <code class="language-plaintext highlighter-rouge">5062310</code> are visible in the current snapshot. However, the upper limit (<code class="language-plaintext highlighter-rouge">xmax</code>) has changed. Now, changes made by transactions with an ID equal to or greater than <code class="language-plaintext highlighter-rouge">5062313</code> are not visible in the current snapshot. Since our transaction ID is <code class="language-plaintext highlighter-rouge">5062312</code>, it makes sense that these changes should not be visible. What about the new part <code class="language-plaintext highlighter-rouge">5062310,5062311</code>? This is the <code class="language-plaintext highlighter-rouge">xip</code> part of the snapshot, indicating that transactions <code class="language-plaintext highlighter-rouge">5062310</code> and <code class="language-plaintext highlighter-rouge">5062311</code> were uncommitted when the snapshot was taken. Therefore, these changes should also not be visible in the current snapshot. As soon as one of these transactions commits and we take a new snapshot, the transaction ID is removed from <code class="language-plaintext highlighter-rouge">xip</code>, and the changes become visible in the current snapshot.</p>
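
<p>The textual form <code class="language-plaintext highlighter-rouge">xmin:xmax:xip_list</code> returned by <code class="language-plaintext highlighter-rouge">pg_current_snapshot</code> is easy to take apart in a script. The following Python sketch splits the two snapshots from this example into their three components; the helper name <code class="language-plaintext highlighter-rouge">parse_snapshot</code> is hypothetical and not part of any PostgreSQL client library.</p>

```python
def parse_snapshot(text):
    """Split 'xmin:xmax:xip_list' into (xmin, xmax, tuple of in-progress XIDs)."""
    xmin, xmax, xip_list = text.strip().split(":")
    # The list is empty when no transaction in the range was in progress.
    in_progress = tuple(int(xid) for xid in xip_list.split(",") if xid)
    return int(xmin), int(xmax), in_progress


first = parse_snapshot("5062310:5062310:")
second = parse_snapshot("5062310:5062313:5062310,5062311")
print(first)
print(second)
```

<p>Comparing both results shows the difference discussed above: the lower bound is unchanged, the upper bound has moved, and the two pending transactions now appear in the in-progress list.</p>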

<h3 id="exporting-snapshots">Exporting Snapshots</h3>

<p>Another interesting feature of PostgreSQL is the ability to <a href="https://www.postgresql.org/docs/current/functions-admin.html">export snapshots</a> and load them in other sessions. A snapshot can be exported by calling the <code class="language-plaintext highlighter-rouge">pg_export_snapshot</code> function. The function returns the snapshot ID and creates a corresponding file in the <code class="language-plaintext highlighter-rouge">pg_snapshots</code> folder of the data directory.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">BEGIN</span><span class="p">;</span>

<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">pg_export_snapshot</span><span class="p">();</span>
 <span class="n">pg_export_snapshot</span>
<span class="c1">---------------------</span>
 <span class="mi">0000000</span><span class="k">C</span><span class="o">-</span><span class="mi">000005</span><span class="n">F6</span><span class="o">-</span><span class="mi">1</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span>
</code></pre></div></div>

<p>This file contains the same information as returned by <code class="language-plaintext highlighter-rouge">pg_current_snapshot</code>, which we discussed earlier. Additionally, it includes further information about the isolation level and the database ID.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> ~/postgresql-sandbox/data/REL_15_1_DEBUG/pg_snapshots/0000000C-000005F6-1
vxid:12/1526
pid:1362769
dbid:706615
iso:1
ro:0
xmin:5062310
xmax:5062313
xcnt:2
xip:5062310
xip:5062311
sof:0
sxcnt:0
rec:0
</code></pre></div></div>
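
<p>For debugging purposes, such a file can be read with a few lines of Python. The sketch below is based only on the sample shown above and assumes one <code class="language-plaintext highlighter-rouge">key:value</code> pair per line with a repeated <code class="language-plaintext highlighter-rouge">xip</code> key; the file layout is an internal detail of PostgreSQL and may differ between versions. The function name <code class="language-plaintext highlighter-rouge">parse_exported_snapshot</code> is made up for this illustration.</p>

```python
from typing import Dict, List, Union

SnapshotFields = Dict[str, Union[str, List[int]]]


def parse_exported_snapshot(content: str) -> SnapshotFields:
    """Parse the key:value lines of an exported snapshot file."""
    fields: SnapshotFields = {}
    in_progress: List[int] = []
    for line in content.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        if key == "xip":
            in_progress.append(int(value))  # the key repeats once per XID
        else:
            fields[key] = value
    fields["xip"] = in_progress
    return fields


# Content of the file shown above.
SAMPLE = """vxid:12/1526
pid:1362769
dbid:706615
iso:1
ro:0
xmin:5062310
xmax:5062313
xcnt:2
xip:5062310
xip:5062311
sof:0
sxcnt:0
rec:0
"""

snapshot = parse_exported_snapshot(SAMPLE)
print(snapshot["xmin"], snapshot["xmax"], snapshot["xip"])
```

<p>The number of collected <code class="language-plaintext highlighter-rouge">xip</code> entries should match the <code class="language-plaintext highlighter-rouge">xcnt</code> value stored in the file, which makes for a simple consistency check.</p>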

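<p>This exported snapshot can be loaded into another transaction by calling <code class="language-plaintext highlighter-rouge">SET TRANSACTION SNAPSHOT '0000000C-000005F6-1'</code>, so that it runs with the same snapshot as the transaction that created it. The importing transaction must use the <em>Repeatable Read</em> or <em>Serializable</em> isolation level and has to execute the command before its first query. In addition, the exporting transaction must still be open.</p>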

<h3 id="snapshots-and-transaction-isolation-level">Snapshots and Transaction Isolation Level</h3>

<p>Depending on the <a href="https://www.postgresql.org/docs/current/transaction-iso.html">isolation level</a>, the snapshot is taken once, with the first statement of the transaction (<em>Repeatable Read</em>), or for every statement in the transaction (<em>Read Committed</em>). When a new snapshot is created for each statement inside a transaction, committed data from other transactions becomes visible in the current transaction. If only one snapshot is created for the entire transaction, the <code class="language-plaintext highlighter-rouge">xmax</code> value remains constant, no new data from transactions with a higher ID becomes visible, and reads are repeatable.</p>
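
<p>The difference between the two isolation levels can be illustrated with a toy model in Python. The classes <code class="language-plaintext highlighter-rouge">ToyDatabase</code> and <code class="language-plaintext highlighter-rouge">ToyTransaction</code> are purely hypothetical. Instead of <code class="language-plaintext highlighter-rouge">xmin</code>, <code class="language-plaintext highlighter-rouge">xmax</code>, and <code class="language-plaintext highlighter-rouge">xip</code>, the model simply freezes the set of committed transaction IDs, which is enough to show when a snapshot is refreshed.</p>

```python
class ToyDatabase:
    """Hypothetical in-memory model that only tracks committed XIDs."""

    def __init__(self):
        self.committed = set()

    def take_snapshot(self):
        # A frozen copy of the committed set stands in for a real snapshot.
        return frozenset(self.committed)


class ToyTransaction:
    def __init__(self, db, isolation):
        self.db = db
        self.isolation = isolation  # "read committed" or "repeatable read"
        self.snapshot = None

    def visible_xids(self):
        """Run one 'statement' and return the XIDs whose changes it sees."""
        # Read Committed refreshes the snapshot for every statement;
        # Repeatable Read keeps the snapshot of its first statement.
        if self.isolation == "read committed" or self.snapshot is None:
            self.snapshot = self.db.take_snapshot()
        return self.snapshot


db = ToyDatabase()
db.committed.add(100)

rc = ToyTransaction(db, "read committed")
rr = ToyTransaction(db, "repeatable read")
rc.visible_xids()
rr.visible_xids()

db.committed.add(101)  # another transaction commits in the meantime

print(sorted(rc.visible_xids()))  # new snapshot, includes the new commit
print(sorted(rr.visible_xids()))  # original snapshot, unchanged
```

<p>In the model, only the transaction running under <em>Read Committed</em> observes the transaction that committed between its two statements.</p>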

<h2 id="summary">Summary</h2>

<p>This blog post discusses the basics of multi-version concurrency control in PostgreSQL. It then introduces snapshots and explains how they control the visibility of tuples. The integration with the table scan API is also discussed.</p>]]></content><author><name>Jan Nidzwetzki</name></author><category term="PostgreSQL" /><category term="Snapshots" /><category term="MVCC" /><summary type="html"><![CDATA[Like many relational DBMSs, PostgreSQL uses multi-version concurrency control (MVCC) to support parallel transactions and coordinate concurrent access to tuples. Snapshots are used to determine which version of a tuple is visible in a given transaction. Each transaction that modifies data has a transaction ID (txid). Tuples are stored with two attributes (xmin, xmax) that determine in which snapshots (and transactions) they are visible. This blog post discusses some implementation details of snapshots.]]></summary></entry><entry><title type="html">Trace PostgreSQL Row-Level Locks with pg_row_lock_tracer</title><link href="https://jnidzwetzki.github.io/2024/02/28/trace-postgresql-row-level-locks.html" rel="alternate" type="text/html" title="Trace PostgreSQL Row-Level Locks with pg_row_lock_tracer" /><published>2024-02-28T00:00:00+00:00</published><updated>2024-02-28T00:00:00+00:00</updated><id>https://jnidzwetzki.github.io/2024/02/28/trace-postgresql-row-level-locks</id><content type="html" xml:base="https://jnidzwetzki.github.io/2024/02/28/trace-postgresql-row-level-locks.html"><![CDATA[<p>PostgreSQL uses several types of locks to coordinate parallel transactions and manage access to resources like tuples, tables, and in-memory data structures.</p>

<p>Heavyweight locks are used to control access to tables. Lightweight locks (LWLocks) manage access to shared data structures, for example when data is added to the write-ahead log (WAL). Row-level locks control access to individual tuples. For example, tuples need to be locked when executing an SQL statement like <code class="language-plaintext highlighter-rouge">SELECT * FROM table WHERE i &gt; 10 FOR UPDATE;</code>. The tuples returned by the query are internally locked with an exclusive lock (<code class="language-plaintext highlighter-rouge">LockTupleExclusive</code>). Another transaction attempting to lock the same tuples must wait until the first transaction releases the locks.</p>

<p>In this article, we discuss the tool <code class="language-plaintext highlighter-rouge">pg_row_lock_tracer</code>, which uses eBPF and UProbes to trace PostgreSQL’s row-locking behavior. The tool can be downloaded from the <a href="https://github.com/jnidzwetzki/pg-lock-tracer">pg-lock-tracer project website</a>.</p>

<p>This is the third article in a series about tracing PostgreSQL locks. The first article covers the <a href="/2023/01/11/trace-postgresql-locks-with-pg-lock-tracer.html">tracing of heavyweight locks</a>, and the second article focuses on <a href="/2023/01/17/trace-postgresql-lw-locks.html">LW locks</a>.</p>

<!--more-->

<h2 id="background">Background</h2>
<p>PostgreSQL implements <a href="https://www.postgresql.org/docs/current/explicit-locking.html#LOCKING-ROWS">four different row lock modes</a>. These can be requested by adding <code class="language-plaintext highlighter-rouge">FOR UPDATE</code>, <code class="language-plaintext highlighter-rouge">FOR NO KEY UPDATE</code>, <code class="language-plaintext highlighter-rouge">FOR SHARE</code>, or <code class="language-plaintext highlighter-rouge">FOR KEY SHARE</code> to a SELECT statement. Additionally, operations like updates automatically acquire these locks before modifying a tuple. For example, when a transaction successfully acquires a <code class="language-plaintext highlighter-rouge">FOR UPDATE</code> lock on a tuple, an update operation by another parallel transaction is blocked until the first transaction releases the lock. Row locks can be requested by calling the function <a href="https://github.com/postgres/postgres/blob/2a6b47cb50eb9b62b050de2cddd03a9ac267e61f/src/backend/access/heap/heapam_handler.c#L359">heapam_tuple_lock</a>.</p>

<h3 id="lock-types">Lock Types</h3>
<p>Internally, these locks are called <code class="language-plaintext highlighter-rouge">LockTupleKeyShare</code>, <code class="language-plaintext highlighter-rouge">LockTupleShare</code>, <code class="language-plaintext highlighter-rouge">LockTupleNoKeyExclusive</code>, and <code class="language-plaintext highlighter-rouge">LockTupleExclusive</code>. They are defined in the enum <a href="https://github.com/postgres/postgres/blob/f0827b443e6014a9d9fdcdd099603576154a3733/src/include/nodes/lockoptions.h#L49">LockTupleMode</a>. These locks have varying strengths, and some are <em>compatible</em> (i.e., multiple transactions can hold locks simultaneously on the same row), while others are <em>conflicting</em> (i.e., only one lock can be held at a time, and a conflicting lock request must wait).</p>
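
<p>Which modes conflict can be written down as a small lookup table. The Python sketch below follows the conflict table for row-level locks in the PostgreSQL documentation and uses the internal mode names as plain strings. The <code class="language-plaintext highlighter-rouge">CONFLICTS</code> dictionary and the <code class="language-plaintext highlighter-rouge">locks_conflict</code> helper are illustrative only and do not mirror how PostgreSQL stores this information.</p>

```python
# For each held lock mode: the set of requested modes that must wait.
CONFLICTS = {
    "LockTupleKeyShare": {"LockTupleExclusive"},
    "LockTupleShare": {"LockTupleNoKeyExclusive", "LockTupleExclusive"},
    "LockTupleNoKeyExclusive": {
        "LockTupleShare",
        "LockTupleNoKeyExclusive",
        "LockTupleExclusive",
    },
    "LockTupleExclusive": {
        "LockTupleKeyShare",
        "LockTupleShare",
        "LockTupleNoKeyExclusive",
        "LockTupleExclusive",
    },
}


def locks_conflict(held: str, requested: str) -> bool:
    """Return True if the requested mode conflicts with the held mode."""
    return requested in CONFLICTS[held]


# Two FOR SHARE locks can be held on the same row at the same time.
print(locks_conflict("LockTupleShare", "LockTupleShare"))
# A FOR UPDATE request has to wait for an existing FOR KEY SHARE lock.
print(locks_conflict("LockTupleKeyShare", "LockTupleExclusive"))
```

<p>The table also shows the purpose of <code class="language-plaintext highlighter-rouge">LockTupleNoKeyExclusive</code>: it is the only exclusive mode that can coexist with a <code class="language-plaintext highlighter-rouge">LockTupleKeyShare</code> lock.</p>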

<h3 id="lock-behavior">Lock Behavior</h3>
<p>Users can specify various lock behaviors in addition to different lock modes. For instance, if a tuple is already locked and a second transaction requests a conflicting lock, the user can choose to skip the lock instead of waiting. The possible behaviors are defined in the enum <a href="https://github.com/postgres/postgres/blob/f0827b443e6014a9d9fdcdd099603576154a3733/src/include/nodes/lockoptions.h#L36">LockWaitPolicy</a>.</p>

<p>For example, the following SQL query acquires a <code class="language-plaintext highlighter-rouge">LockTupleExclusive</code> row lock if it does not conflict with existing locks. Any already locked tuples are skipped by the current transaction:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="k">table</span> <span class="k">WHERE</span> <span class="n">i</span> <span class="o">&gt;</span> <span class="mi">10</span> <span class="k">FOR</span> <span class="k">UPDATE</span> <span class="n">SKIP</span> <span class="n">LOCKED</span><span class="p">;</span>
</code></pre></div></div>

<p>A transaction that successfully acquires these locks can assume that no other transaction will modify the tuples in parallel. The returned values from the SELECT statement can then be processed, modified, and updated in subsequent UPDATE statements before being committed.</p>

<h3 id="lock-results">Lock Results</h3>
<p>The possible outcomes of a lock operation are defined in the enum <a href="https://github.com/postgres/postgres/blob/f0827b443e6014a9d9fdcdd099603576154a3733/src/include/access/tableam.h#L71">TM_Result</a>. A lock can be granted (<code class="language-plaintext highlighter-rouge">TM_Ok</code>), or it may fail for various reasons: the tuple is invisible to the current snapshot (<code class="language-plaintext highlighter-rouge">TM_Invisible</code>), already modified by the same backend process (<code class="language-plaintext highlighter-rouge">TM_SelfModified</code>), updated (<code class="language-plaintext highlighter-rouge">TM_Updated</code>), or deleted (<code class="language-plaintext highlighter-rouge">TM_Deleted</code>). Additionally, if the lock is instructed not to wait, it may return <code class="language-plaintext highlighter-rouge">TM_BeingModified</code> if another transaction is currently modifying the tuple, or <code class="language-plaintext highlighter-rouge">TM_WouldBlock</code> if the lock would otherwise block.</p>

<h2 id="pg_row_lock_tracer">pg_row_lock_tracer</h2>
<p><code class="language-plaintext highlighter-rouge">pg_row_lock_tracer</code> enables real-time tracing of PostgreSQL row-level locks using eBPF and UProbes. It also provides statistics about requested locks and their outcomes.</p>

<h2 id="download-and-usage">Download and Usage</h2>

<p>The lock tracer can be installed via the Python package installer <code class="language-plaintext highlighter-rouge">pip</code>:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>pg-lock-tracer
</code></pre></div></div>

<p>Once installed, the locks of one or more running processes can be traced:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Trace the row locks of the given PostgreSQL binary
pg_row_lock_tracer -x /home/jan/postgresql-sandbox/bin/REL_14_9_DEBUG/bin/postgres

# Trace the row locks of PID 1234
pg_row_lock_tracer -p 1234 -x /home/jan/postgresql-sandbox/bin/REL_14_9_DEBUG/bin/postgres

# Trace the row locks of PIDs 1234 and 5678
pg_row_lock_tracer -p 1234 -p 5678 -x /home/jan/postgresql-sandbox/bin/REL_14_9_DEBUG/bin/postgres

# Trace the row locks of PID 1234 with verbose output
pg_row_lock_tracer -p 1234 -x /home/jan/postgresql-sandbox/bin/REL_14_9_DEBUG/bin/postgres -v

# Trace the row locks and display statistics
pg_row_lock_tracer -x /home/jan/postgresql-sandbox/bin/REL_14_9_DEBUG/bin/postgres --statistics
</code></pre></div></div>

<p>A sample output of the tool looks as follows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[...]
2783502701862408 [Pid 2604491] LOCK_TUPLE_END TM_OK in 13100 ns
2783502701877081 [Pid 2604491] LOCK_TUPLE (Tablespace 1663 database 305234 relation 313419) - (Block and offset 7 143) - LOCK_TUPLE_EXCLUSIVE LOCK_WAIT_BLOCK
2783502701972367 [Pid 2604491] LOCK_TUPLE_END TM_OK in 95286 ns
2783502701988387 [Pid 2604491] LOCK_TUPLE (Tablespace 1663 database 305234 relation 313419) - (Block and offset 7 144) - LOCK_TUPLE_EXCLUSIVE LOCK_WAIT_BLOCK
2783502702001690 [Pid 2604491] LOCK_TUPLE_END TM_OK in 13303 ns
2783502702016387 [Pid 2604491] LOCK_TUPLE (Tablespace 1663 database 305234 relation 313419) - (Block and offset 7 145) - LOCK_TUPLE_EXCLUSIVE LOCK_WAIT_BLOCK
2783502702029375 [Pid 2604491] LOCK_TUPLE_END TM_OK in 12988 ns
</code></pre></div></div>

<p>The tool’s output shows the tuples being locked, the type of locks used, and additional options such as <code class="language-plaintext highlighter-rouge">LOCK_WAIT_BLOCK</code>. It also includes the result of the lock operation (<code class="language-plaintext highlighter-rouge">TM_OK</code>).</p>
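<p>Because the output is line-oriented, it can be post-processed with a few lines of code. The following Python sketch extracts the duration of each lock operation from the <code class="language-plaintext highlighter-rouge">LOCK_TUPLE_END</code> lines. It assumes the exact line format of the sample output above, which may differ between versions of the tool; the function name <code class="language-plaintext highlighter-rouge">lock_durations</code> is made up for this example:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

# Four lines copied from the sample output above.
SAMPLE = """\
2783502701877081 [Pid 2604491] LOCK_TUPLE (Tablespace 1663 database 305234 relation 313419) - (Block and offset 7 143) - LOCK_TUPLE_EXCLUSIVE LOCK_WAIT_BLOCK
2783502701972367 [Pid 2604491] LOCK_TUPLE_END TM_OK in 95286 ns
2783502701988387 [Pid 2604491] LOCK_TUPLE (Tablespace 1663 database 305234 relation 313419) - (Block and offset 7 144) - LOCK_TUPLE_EXCLUSIVE LOCK_WAIT_BLOCK
2783502702001690 [Pid 2604491] LOCK_TUPLE_END TM_OK in 13303 ns
"""

# Assumed format: timestamp, PID, event name, lock result, duration in ns.
END_LINE = re.compile(r"^\d+ \[Pid (\d+)\] LOCK_TUPLE_END (\w+) in (\d+) ns$")


def lock_durations(text):
    """Return (pid, result, duration_ns) for every LOCK_TUPLE_END line."""
    events = []
    for line in text.splitlines():
        match = END_LINE.match(line)
        if match:
            pid, result, duration = match.groups()
            events.append((int(pid), result, int(duration)))
    return events


events = lock_durations(SAMPLE)
total_ns = sum(duration for _, _, duration in events)
print(len(events), "lock operations,", total_ns // len(events), "ns on average")
</code></pre></div></div>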

<p>When the <code class="language-plaintext highlighter-rouge">--statistics</code> option is used, the tool collects and displays statistics about the traced locks upon termination (e.g., after pressing CTRL+C):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Lock statistics:
================

Used wait policies:
+---------+-----------------+----------------+-----------------+
|   PID   | LOCK_WAIT_BLOCK | LOCK_WAIT_SKIP | LOCK_WAIT_ERROR |
+---------+-----------------+----------------+-----------------+
| 2604491 |       1440      |       0        |        0        |
+---------+-----------------+----------------+-----------------+

Lock modes:
+---------+---------------------+------------------+---------------------------+----------------------+
|   PID   | LOCK_TUPLE_KEYSHARE | LOCK_TUPLE_SHARE | LOCK_TUPLE_NOKEYEXCLUSIVE | LOCK_TUPLE_EXCLUSIVE |
+---------+---------------------+------------------+---------------------------+----------------------+
| 2604491 |          0          |        0         |             0             |         1440         |
+---------+---------------------+------------------+---------------------------+----------------------+

Lock results:
+---------+-------+--------------+-----------------+------------+------------+------------------+---------------+
|   PID   | TM_OK | TM_INVISIBLE | TM_SELFMODIFIED | TM_UPDATED | TM_DELETED | TM_BEINGMODIFIED | TM_WOULDBLOCK |
+---------+-------+--------------+-----------------+------------+------------+------------------+---------------+
| 2604491 |  1440 |      0       |        0        |     0      |     0      |        0         |       0       |
+---------+-------+--------------+-----------------+------------+------------+------------------+---------------+
</code></pre></div></div>

<h2 id="summary">Summary</h2>
<p><code class="language-plaintext highlighter-rouge">pg_row_lock_tracer</code> is a tool for tracing PostgreSQL row-level locks. It is available for download on <a href="https://github.com/jnidzwetzki/pg-lock-tracer/">GitHub</a>. Using eBPF and UProbes, it enables real-time tracing of row lock activity. Like its related tools (<code class="language-plaintext highlighter-rouge">pg_lock_tracer</code> and <code class="language-plaintext highlighter-rouge">pg_lw_lock_tracer</code>), it is designed for debugging and analyzing lock behavior and performance issues.</p>

<p>This is the third article in a series about tracing PostgreSQL locks. The first part discusses a lock tracer for heavyweight locks, while the second part focuses on tracing LW locks.</p>]]></content><author><name>Jan Nidzwetzki</name></author><category term="PostgreSQL" /><category term="Tracing" /><summary type="html"><![CDATA[PostgreSQL uses several types of locks to coordinate parallel transactions and manage access to resources like tuples, tables, and in-memory data structures. Heavyweight locks are used to control access to tables. Lightweight locks (LWLocks) manage access to data structures, such as adding data to the write-ahead log (WAL). Row-level locks control access to individual tuples. For example, tuples need to be locked when executing an SQL statement like SELECT * FROM table WHERE i &gt; 10 FOR UPDATE;. The tuples returned by the query are internally locked with an exclusive lock (LOCK_TUPLE_EXCLUSIVE). Another transaction attempting to lock the same tuples must wait until the first transaction releases the locks. In this article, we discuss the tool pg_row_lock_tracer, which uses eBPF and UProbes to trace PostgreSQL’s row-locking behavior. The tool can be downloaded from the pg-lock-tracer project website. This is the third article in a series about tracing PostgreSQL locks. 
The first article covers the tracing of heavyweight locks, and the second article focuses on LW locks.]]></summary></entry><entry><title type="html">Index the PostgreSQL Source Code with Elixir</title><link href="https://jnidzwetzki.github.io/2024/01/11/index-postgresql-source-code-with-elixir.html" rel="alternate" type="text/html" title="Index the PostgreSQL Source Code with Elixir" /><published>2024-01-11T00:00:00+00:00</published><updated>2024-01-11T00:00:00+00:00</updated><id>https://jnidzwetzki.github.io/2024/01/11/index-postgresql-source-code-with-elixir</id><content type="html" xml:base="https://jnidzwetzki.github.io/2024/01/11/index-postgresql-source-code-with-elixir.html"><![CDATA[<p>When working with the internals of PostgreSQL, it is helpful to navigate the source code quickly and look up symbols and definitions efficiently. I use VS Code for programming. However, finding definitions does not always work reliably, and the full-text search is slow and often returns too many results, missing the desired hit (e.g., the definition of a function). For a long time, I kept the <a href="https://doxygen.postgresql.org/">Doxygen build</a> of PostgreSQL open in my browser. However, Doxygen can be cumbersome to use and only shows the current version of PostgreSQL. Sometimes, the source code for an older version is needed. To address these issues, I set up a local copy of the <a href="https://github.com/bootlin/elixir">Elixir Cross Referencer</a>.</p>

<!--more-->

<p>The <a href="https://github.com/bootlin/elixir">Elixir Cross Referencer</a> is a source code indexer that provides a web interface and an API for quickly looking up symbols. I had used it several times while navigating the <a href="https://elixir.bootlin.com/linux/latest/source">Linux source code</a> and wondered what it would take to set up a local installation for PostgreSQL.</p>

<p>To my surprise, it was easier than expected. Elixir can be installed using Docker, and custom images for new projects can be created effortlessly. For instance, to create a new Docker image containing a copy of the PostgreSQL source code, the following commands need to be executed:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git clone https://github.com/bootlin/elixir.git

$ cd elixir

$ docker build -t elixir:postgresql-11-01-2024 --build-arg GIT_REPO_URL=https://github.com/postgres/postgres.git --build-arg PROJECT=postgresql . -f docker/debian/Dockerfile
</code></pre></div></div>

<p>The last command builds a new Docker image called <code class="language-plaintext highlighter-rouge">elixir:postgresql-11-01-2024</code>. This process takes some time to complete. The two <code class="language-plaintext highlighter-rouge">build-arg</code> parameters are sufficient to clone and index the PostgreSQL repository. Once the image is created, it should appear as an available image in the local Docker installation.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ docker images

REPOSITORY                                                       TAG                     IMAGE ID       CREATED        SIZE
elixir                                                           postgresql-11-01-2024   fb993f66c1cc   2 hours ago    2.38GB
</code></pre></div></div>

<p>Next, a new container can be started using the image. I use the parameter <code class="language-plaintext highlighter-rouge">-p 8081:80</code> to map port 80 of the container to port 8081 on my local system.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ docker run -d -p 8081:80 elixir:postgresql-11-01-2024
</code></pre></div></div>

<p>Once the container is running, the PostgreSQL source code can be accessed by opening the URL <code class="language-plaintext highlighter-rouge">http://localhost:8081/postgresql/latest/source</code>.</p>

<figure class="row">
    
    <div class="column">
        <img src="/assets/img/elixir-postgresql.jpg" alt="elixir-postgresql.jpg" />
    </div>
    
    <div class="column">
        <img src="/assets/img/elixir-postgresql2.jpg" alt="elixir-postgresql2.jpg" />
    </div>
    
    
    <figcaption class="caption-style"></figcaption>
</figure>

<p>If you want to customize the header of the Elixir installation, you can modify the file <code class="language-plaintext highlighter-rouge">templates/header.html</code> before building the Docker image. More information about customizing the image can be found in <a href="https://github.com/bootlin/elixir#building-docker-images">the project’s documentation</a>.</p>]]></content><author><name>Jan Nidzwetzki</name></author><category term="PostgreSQL" /><category term="Development" /><summary type="html"><![CDATA[When working with the internals of PostgreSQL, it is helpful to navigate the source code quickly and look up symbols and definitions efficiently. I use VS Code for programming. However, finding definitions does not always work reliably, and the full-text search is slow and often returns too many results, missing the desired hit (e.g., the definition of a function). For a long time, I kept the Doxygen build of PostgreSQL open in my browser. However, Doxygen can be cumbersome to use and only shows the current version of PostgreSQL. Sometimes, the source code for an older version is needed. To address these issues, I set up a local copy of the Elixir Cross Referencer.]]></summary></entry><entry><title type="html">Using Bpftrace to Trace PostgreSQL Vacuum Operations</title><link href="https://jnidzwetzki.github.io/2023/08/23/using-bpftrace-to-trace-postgresql.html" rel="alternate" type="text/html" title="Using Bpftrace to Trace PostgreSQL Vacuum Operations" /><published>2023-08-23T00:00:00+00:00</published><updated>2023-08-23T00:00:00+00:00</updated><id>https://jnidzwetzki.github.io/2023/08/23/using-bpftrace-to-trace-postgresql</id><content type="html" xml:base="https://jnidzwetzki.github.io/2023/08/23/using-bpftrace-to-trace-postgresql.html"><![CDATA[<p>The <a href="https://ebpf.io/">eBPF technology</a> of the Linux kernel allows it to monitor applications with minimal overhead. 
<a href="https://github.com/torvalds/linux/blob/master/kernel/events/uprobes.c">UProbes</a> can be used to trace the invocation and exit of functions in programs. Modern tools to observe databases (like <a href="https://jnidzwetzki.github.io/2023/01/11/trace-postgresql-locks-with-pg-lock-tracer.html">pg-lock-tracer</a>) are built on top of eBPF. However, these full-fledged tools are often written in C and Python and require some development effort. Sometimes, a ‘quick and dirty’ solution for a particular observation is sufficient. With bpftrace, users can create eBPF programs with a few lines of code. In this article, we develop a simple bpftrace program to observe the execution of vacuum calls in PostgreSQL and analyze their latency.</p>

<!--more-->

<blockquote>
  <p>⚠️ An updated and slightly revised version of this post is available in the <a href="https://www.timescale.com/blog/using-bpftrace-to-trace-postgresql-vacuum-operations/">Timescale company blog</a>.</p>
</blockquote>

<h2 id="used-environment">Used Environment</h2>

<p>PostgreSQL is a database management system that uses <a href="https://www.postgresql.org/docs/current/sql-vacuum.html">vacuum operations</a> to reclaim space from dead (e.g., updated or deleted) tuples. 
In this post, we will trace the vacuum calls and measure the time the vacuum operations take per table.</p>

<p>In the following examples, a PostgreSQL 14 server is used. The PostgreSQL binary is located at <code class="language-plaintext highlighter-rouge">/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres</code>. In addition, the examples are executed in a database with these two tables:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">testtable1</span> <span class="p">(</span>
   <span class="n">id</span> <span class="nb">int</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
   <span class="n">value</span> <span class="nb">int</span> <span class="k">NOT</span> <span class="k">NULL</span>
<span class="p">);</span>

<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">testtable2</span> <span class="p">(</span>
   <span class="n">id</span> <span class="nb">int</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
   <span class="n">value</span> <span class="nb">int</span> <span class="k">NOT</span> <span class="k">NULL</span>
<span class="p">);</span>
</code></pre></div></div>

<p><strong>Note:</strong> Depending on the C compiler and the applied optimizations, the symbols of internal functions (i.e., functions declared as <code class="language-plaintext highlighter-rouge">static</code>) might not be visible. In this case, uprobes cannot be used to trace the function invocations. There are two possible solutions: (1) remove the <code class="language-plaintext highlighter-rouge">static</code> modifier from the function declaration and recompile PostgreSQL, or (2) create a full <a href="https://github.com/jnidzwetzki/pg-lock-tracer/#postgresql-build">debug build</a> of PostgreSQL.</p>

<h2 id="using-funclatency-bpfcc-to-trace-function-calls">Using funclatency-bpfcc to Trace Function Calls</h2>

<p>Let’s explore the existing solutions before developing our own tool to trace the vacuum operations. The tool <code class="language-plaintext highlighter-rouge">funclatency-bpfcc</code> is available for most Linux distributions (on Debian, it is contained in the package <em>bpfcc-tools</em>). It traces the entry and exit of a function and measures the function latency (i.e., the time the function needs to complete).</p>

<p>In PostgreSQL, the function <code class="language-plaintext highlighter-rouge">vacuum_rel</code> is invoked when a vacuum operation on a relation is performed. To trace these function calls with <code class="language-plaintext highlighter-rouge">funclatency-bpfcc</code>, the path of the PostgreSQL binary and the function name have to be provided. For instance:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>funclatency-bpfcc <span class="nt">-r</span> /home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel

Tracing 1 functions <span class="k">for</span> <span class="s2">"/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel"</span>... Hit Ctrl-C to end.
</code></pre></div></div>

<p>Afterward, an eBPF program is loaded into the Linux kernel. One probe is attached to the function entry and a second probe to the function exit. The latency between these two events is measured and stored.</p>

<p>To execute some vacuum operations, we perform the following SQL statement in a second session:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">database</span><span class="o">=#</span> <span class="k">VACUUM</span> <span class="k">FULL</span><span class="p">;</span>
<span class="k">VACUUM</span> <span class="k">FULL</span>
</code></pre></div></div>

<p>This SQL statement triggers PostgreSQL to perform a vacuum operation on all tables of the current database. After the vacuum operations are done, the <code class="language-plaintext highlighter-rouge">funclatency-bpfcc</code> program can be stopped (by pressing CTRL+C). This ends the observation of the binary and prints the recorded execution times to the terminal.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>funclatency-bpfcc <span class="nt">-r</span> /home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
<span class="o">[</span>...]
^C
Function <span class="o">=</span> b<span class="s1">'vacuum_rel'</span> <span class="o">[</span>876997]
     nsecs               : count     distribution
         0 -&gt; 1          : 0        |                                        |
         2 -&gt; 3          : 0        |                                        |
         4 -&gt; 7          : 0        |                                        |
         8 -&gt; 15         : 0        |                                        |
        16 -&gt; 31         : 0        |                                        |
        32 -&gt; 63         : 0        |                                        |
        64 -&gt; 127        : 0        |                                        |
       128 -&gt; 255        : 0        |                                        |
       256 -&gt; 511        : 0        |                                        |
       512 -&gt; 1023       : 0        |                                        |
      1024 -&gt; 2047       : 0        |                                        |
      2048 -&gt; 4095       : 0        |                                        |
      4096 -&gt; 8191       : 0        |                                        |
      8192 -&gt; 16383      : 0        |                                        |
     16384 -&gt; 32767      : 0        |                                        |
     32768 -&gt; 65535      : 0        |                                        |
     65536 -&gt; 131071     : 0        |                                        |
    131072 -&gt; 262143     : 0        |                                        |
    262144 -&gt; 524287     : 0        |                                        |
    524288 -&gt; 1048575    : 0        |                                        |
   1048576 -&gt; 2097151    : 0        |                                        |
   2097152 -&gt; 4194303    : 0        |                                        |
   4194304 -&gt; 8388607    : 2        |<span class="k">*</span>                                       |
   8388608 -&gt; 16777215   : 13       |<span class="k">***********</span>                             |
  16777216 -&gt; 33554431   : 44       |<span class="k">****************************************</span>|
  33554432 -&gt; 67108863   : 7        |<span class="k">******</span>                                  |
  67108864 -&gt; 134217727  : 1        |                                        |

avg <span class="o">=</span> 22765358 nsecs, total: 1525279002 nsecs, count: 67

Detaching...
</code></pre></div></div>

<p>The output shows that the function <code class="language-plaintext highlighter-rouge">vacuum_rel</code> was called 67 times with an average latency of <code class="language-plaintext highlighter-rouge">22765358 nsecs</code> (about 22.8 ms). In addition, a histogram of the function latency is printed. This is useful, but it would be even more helpful to know how much time the vacuum call for each relation takes. The tool does not support this because it does not evaluate the parameters of the function (e.g., the OID of the relation that the current function invocation should vacuum). However, this is something that we can do with <code class="language-plaintext highlighter-rouge">bpftrace</code>.</p>
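<p>The summary line and the histogram can be cross-checked with a bit of arithmetic. The following Python sketch recomputes the average from the printed total and count and determines the power-of-two bucket into which the average falls. The helper <code class="language-plaintext highlighter-rouge">log2_bucket</code> is a name made up for this example; it only mirrors the bucket boundaries shown in the output and is not code from <code class="language-plaintext highlighter-rouge">funclatency-bpfcc</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Values copied from the funclatency-bpfcc output above.
total_ns = 1525279002
count = 67
histogram_counts = [2, 13, 44, 7, 1]  # the non-empty buckets


def log2_bucket(value):
    """Return the (lower, upper) bounds of the power-of-two bucket of value."""
    if value in (0, 1):
        return 0, 1  # the first bucket covers 0 and 1
    lower = 2 ** (value.bit_length() - 1)
    return lower, 2 * lower - 1


avg_ns = total_ns // count
print(avg_ns)                 # 22765358
print(log2_bucket(avg_ns))    # (16777216, 33554431)
print(sum(histogram_counts))  # 67, matches the reported count
</code></pre></div></div>

<p>The average falls into the bucket with the most entries (44 of the 67 calls), so the mean is a reasonable summary of this particular run.</p>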

<h2 id="tracing-function-entries">Tracing Function Entries</h2>

<p>Let’s start with a very simple bpftrace program that prints a line once the <code class="language-plaintext highlighter-rouge">vacuum_rel</code> function is invoked in the PostgreSQL binary. <code class="language-plaintext highlighter-rouge">bpftrace</code> is called with the eBPF program that should be loaded into the Linux kernel. The eBPF programs that are passed to bpftrace have the following <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#language">syntax</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;probe1&gt; {
        &lt;Actions&gt;
}

[...]

&lt;probeN&gt; {
        &lt;Actions&gt;
}
</code></pre></div></div>

<p>The syntax to define a uprobe on a userland binary is: <code class="language-plaintext highlighter-rouge">uprobe:library_name:function_name[+offset]</code>. For instance, to define a uprobe on the function invocation of <code class="language-plaintext highlighter-rouge">vacuum_rel</code> in the binary <code class="language-plaintext highlighter-rouge">/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres</code> and print the line <code class="language-plaintext highlighter-rouge">Vacuum started</code>, the following bpftrace call can be used:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">$</span> <span class="n">sudo</span> <span class="n">bpftrace</span> <span class="o">-</span><span class="n">e</span> <span class="err">'</span>
<span class="n">uprobe</span><span class="o">:/</span><span class="n">home</span><span class="o">/</span><span class="n">jan</span><span class="o">/</span><span class="n">postgresql</span><span class="o">-</span><span class="n">sandbox</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">REL_14_2_DEBUG</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">postgres</span><span class="o">:</span><span class="n">vacuum_rel</span> <span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"Vacuum started</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="p">}</span>
<span class="err">'</span>

<span class="n">Attaching</span> <span class="mi">1</span> <span class="n">probe</span><span class="p">...</span>
<span class="n">Vacuum</span> <span class="n">started</span>
<span class="n">Vacuum</span> <span class="n">started</span>
<span class="n">Vacuum</span> <span class="n">started</span>
<span class="n">Vacuum</span> <span class="n">started</span>
<span class="n">Vacuum</span> <span class="n">started</span>
<span class="n">Vacuum</span> <span class="n">started</span>
<span class="n">Vacuum</span> <span class="n">started</span>
<span class="n">Vacuum</span> <span class="n">started</span>
<span class="n">Vacuum</span> <span class="n">started</span>
<span class="p">[...]</span>
</code></pre></div></div>

<p>As soon as the <code class="language-plaintext highlighter-rouge">VACUUM FULL</code> SQL statement is executed in another terminal session, the program starts to print the message on the screen. This is a good start, but we still have less information than the existing tool <code class="language-plaintext highlighter-rouge">funclatency-bpfcc</code> provides: the latency of the function calls is missing.</p>

<h2 id="tracing-function-returns--latency">Tracing Function Returns / Latency</h2>

<p>To measure the latency of the function invocations, we need two things:</p>

<ul>
  <li>We need to define a second probe that is invoked when the observed function returns; this can be done with a <code class="language-plaintext highlighter-rouge">uretprobe</code>.</li>
  <li>The time between the function invocation and the return has to be measured.</li>
</ul>

<p>A <code class="language-plaintext highlighter-rouge">uretprobe</code> in bpftrace can be defined using the same syntax (<code class="language-plaintext highlighter-rouge">uretprobe:binary:function</code>) as the already defined <code class="language-plaintext highlighter-rouge">uprobe</code>. In addition, bpftrace supports variables such as associative arrays. We use such an array to capture the start time of a function invocation: <code class="language-plaintext highlighter-rouge">@start[tid] = nsecs;</code>. The key of the array is the ID of the current thread (<code class="language-plaintext highlighter-rouge">tid</code>). This way, multiple threads (or processes, as in our case with PostgreSQL) can be traced simultaneously without overwriting each other’s start time.</p>

<p>In the uretprobe, we take the current time and subtract the time of the function invocation (<code class="language-plaintext highlighter-rouge">nsecs - @start[tid]</code>) to get the time the function call needs. In addition, we use a predicate (<code class="language-plaintext highlighter-rouge">/@start[tid]/</code>) to let bpftrace know that the body of the <code class="language-plaintext highlighter-rouge">uretprobe</code> should only be executed if this array value is defined. This predicate prevents us from handling a function return without having seen the corresponding function entry (e.g., when the bpftrace program is started in the middle of a running function call, only the <code class="language-plaintext highlighter-rouge">uretprobe</code> fires for this call).</p>

<p><strong>Note:</strong> It is not guaranteed that the eBPF events are delivered and processed in order by bpftrace. Especially when a function call is short and there are many function invocations, the events could be processed out of order (e.g., we see two function entry events followed by two function return events). In this case, function latency observations with bpftrace become imprecise. To avoid this, we use <code class="language-plaintext highlighter-rouge">VACUUM FULL</code> calls instead of <code class="language-plaintext highlighter-rouge">vacuum</code> calls. These calls are <a href="https://www.postgresql.org/docs/current/sql-vacuum.html">much more expensive</a> since they rewrite the table. Therefore, they take longer and can be reliably observed by bpftrace.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">$</span> <span class="n">sudo</span> <span class="n">bpftrace</span> <span class="o">-</span><span class="n">e</span> <span class="err">'</span>
<span class="n">uprobe</span><span class="o">:/</span><span class="n">home</span><span class="o">/</span><span class="n">jan</span><span class="o">/</span><span class="n">postgresql</span><span class="o">-</span><span class="n">sandbox</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">REL_14_2_DEBUG</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">postgres</span><span class="o">:</span><span class="n">vacuum_rel</span>
<span class="p">{</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"Performing vacuum</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
        <span class="err">@</span><span class="n">start</span><span class="p">[</span><span class="n">tid</span><span class="p">]</span> <span class="o">=</span> <span class="n">nsecs</span><span class="p">;</span>
<span class="p">}</span>

<span class="n">uretprobe</span><span class="o">:/</span><span class="n">home</span><span class="o">/</span><span class="n">jan</span><span class="o">/</span><span class="n">postgresql</span><span class="o">-</span><span class="n">sandbox</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">REL_14_2_DEBUG</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">postgres</span><span class="o">:</span><span class="n">vacuum_rel</span>
<span class="o">/</span><span class="err">@</span><span class="n">start</span><span class="p">[</span><span class="n">tid</span><span class="p">]</span><span class="o">/</span>
<span class="p">{</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"Vacuum call took %d ns</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">nsecs</span> <span class="o">-</span> <span class="err">@</span><span class="n">start</span><span class="p">[</span><span class="n">tid</span><span class="p">]);</span>
        <span class="n">delete</span><span class="p">(</span><span class="err">@</span><span class="n">start</span><span class="p">[</span><span class="n">tid</span><span class="p">]);</span>
<span class="p">}</span>
<span class="err">'</span>
</code></pre></div></div>

<p>After running this bpftrace call and executing <code class="language-plaintext highlighter-rouge">VACUUM FULL</code> in a second session, we see the following output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Attaching 2 probes...
Performing vacuum
Vacuum call took 37486735 ns
Performing vacuum
Vacuum call took 16491130 ns
Performing vacuum
Vacuum call took 32443568 ns
Performing vacuum
Vacuum call took 17959933 ns
[...]
</code></pre></div></div>

<p>For each call of the <code class="language-plaintext highlighter-rouge">vacuum_rel</code> function in PostgreSQL, we measure the time the vacuum operation takes. However, it would be convenient to also trace the OID or the name of the relation that is being vacuumed. This requires handling the parameters of the observed function.</p>

<h2 id="handle-function-parameters">Handle Function Parameters</h2>

<p>The function <code class="language-plaintext highlighter-rouge">vacuum_rel</code> has the following signature in PostgreSQL 14. The first parameter is the <code class="language-plaintext highlighter-rouge">Oid</code> (an <a href="https://github.com/postgres/postgres/blob/1951d21b29939ddcb0e30a018cf413b949e40d97/src/include/postgres_ext.h#L31">unsigned int</a>) of the processed relation. The second parameter is a pointer to a <code class="language-plaintext highlighter-rouge">RangeVar</code> struct, which <em>could</em> contain the name of the relation. The third parameter is a pointer to a <code class="language-plaintext highlighter-rouge">VacuumParams</code> struct, which contains additional parameters for the vacuum operation. The last parameter is a <code class="language-plaintext highlighter-rouge">BufferAccessStrategy</code>, which defines the buffer access strategy.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">bool</span> <span class="n">vacuum_rel</span><span class="p">(</span><span class="n">Oid</span> <span class="n">relid</span><span class="p">,</span>
        <span class="n">RangeVar</span> <span class="o">*</span><span class="n">relation</span><span class="p">,</span>
        <span class="n">VacuumParams</span> <span class="o">*</span><span class="n">params</span><span class="p">,</span>
        <span class="n">BufferAccessStrategy</span> <span class="n">bstrategy</span> 
<span class="p">)</span>
</code></pre></div></div>

<p>Bpftrace provides access to the function parameters via the keywords <code class="language-plaintext highlighter-rouge">arg0</code>, <code class="language-plaintext highlighter-rouge">arg1</code>, …, <code class="language-plaintext highlighter-rouge">argN</code>. To include the Oid in the output, we only need to print the first parameter of the function.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">$</span> <span class="n">sudo</span> <span class="n">bpftrace</span> <span class="o">-</span><span class="n">e</span> <span class="err">'</span>

<span class="n">uprobe</span><span class="o">:/</span><span class="n">home</span><span class="o">/</span><span class="n">jan</span><span class="o">/</span><span class="n">postgresql</span><span class="o">-</span><span class="n">sandbox</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">REL_14_2_DEBUG</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">postgres</span><span class="o">:</span><span class="n">vacuum_rel</span>
<span class="p">{</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"Performing vacuum of OID %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">arg0</span><span class="p">);</span>
        <span class="err">@</span><span class="n">start</span><span class="p">[</span><span class="n">tid</span><span class="p">]</span> <span class="o">=</span> <span class="n">nsecs</span><span class="p">;</span>
<span class="p">}</span>

<span class="n">uretprobe</span><span class="o">:/</span><span class="n">home</span><span class="o">/</span><span class="n">jan</span><span class="o">/</span><span class="n">postgresql</span><span class="o">-</span><span class="n">sandbox</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">REL_14_2_DEBUG</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">postgres</span><span class="o">:</span><span class="n">vacuum_rel</span>
<span class="o">/</span><span class="err">@</span><span class="n">start</span><span class="p">[</span><span class="n">tid</span><span class="p">]</span><span class="o">/</span>
<span class="p">{</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"Vacuum call took %d ns</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">nsecs</span> <span class="o">-</span> <span class="err">@</span><span class="n">start</span><span class="p">[</span><span class="n">tid</span><span class="p">]);</span>
        <span class="n">delete</span><span class="p">(</span><span class="err">@</span><span class="n">start</span><span class="p">[</span><span class="n">tid</span><span class="p">]);</span>
<span class="p">}</span>
<span class="err">'</span>
</code></pre></div></div>

<p>When the <code class="language-plaintext highlighter-rouge">VACUUM FULL</code> operation is executed again in a second terminal, the output looks as follows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Attaching 2 probes...
[...]
Performing vacuum of OID 1153888
Vacuum call took 37486734 ns
Performing vacuum of OID 1153891
Vacuum call took 49535256 ns
Performing vacuum of OID 2619
Vacuum call took 39575635 ns
Performing vacuum of OID 2840
Vacuum call took 40683526 ns
Performing vacuum of OID 1247
Vacuum call took 14683600 ns
Performing vacuum of OID 4171
Vacuum call took 20587503 ns
</code></pre></div></div>

<p>To determine which Oid belongs to which relation, the following SQL statement can be executed:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">blog</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">oid</span><span class="p">,</span> <span class="n">relname</span> <span class="k">FROM</span> <span class="n">pg_class</span> <span class="k">WHERE</span> <span class="n">oid</span> <span class="k">IN</span> <span class="p">(</span><span class="mi">1153888</span><span class="p">,</span> <span class="mi">1153891</span><span class="p">);</span>
   <span class="n">oid</span>   <span class="o">|</span>  <span class="n">relname</span>   
<span class="c1">---------+------------</span>
 <span class="mi">1153888</span> <span class="o">|</span> <span class="n">testtable1</span>
 <span class="mi">1153891</span> <span class="o">|</span> <span class="n">testtable2</span>
<span class="p">(</span><span class="mi">2</span> <span class="k">rows</span><span class="p">)</span>
</code></pre></div></div>

<p>The result shows that the Oids <code class="language-plaintext highlighter-rouge">1153888</code> and <code class="language-plaintext highlighter-rouge">1153891</code> belong to the tables <code class="language-plaintext highlighter-rouge">testtable1</code> and <code class="language-plaintext highlighter-rouge">testtable2</code>, which we have created in one of the first sections of this article. These values belong to our test environment. In your environment, different Oids might be shown.</p>

<h2 id="handle-function-struct-parameters">Handle Function Struct Parameters</h2>

<p>So far, we have processed simple parameters with <code class="language-plaintext highlighter-rouge">bpftrace</code> (like Oids, which are unsigned integers). However, many parameters in PostgreSQL are structs. Fortunately, these structs can be handled in bpftrace programs as well.</p>

<p>The second parameter of the <code class="language-plaintext highlighter-rouge">vacuum_rel</code> function is a RangeVar struct. This struct is <a href="https://github.com/postgres/postgres/blob/2a8b40e3681921943a2989fd4ec6cdbf8766566c/src/include/nodes/primnodes.h#L63">defined in PostgreSQL 14</a> as follows:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">RangeVar</span>
<span class="p">{</span>
	<span class="n">NodeTag</span>	<span class="n">type</span><span class="p">;</span>
	<span class="kt">char</span> <span class="o">*</span><span class="n">catalogname</span><span class="p">;</span>
	<span class="kt">char</span> <span class="o">*</span><span class="n">schemaname</span><span class="p">;</span>
	<span class="kt">char</span> <span class="o">*</span><span class="n">relname</span><span class="p">;</span>
	<span class="p">[...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To process the struct, the following bpftrace program can be used. Note that the internal <code class="language-plaintext highlighter-rouge">NodeTag</code> data type of PostgreSQL is replaced by a simple int. <code class="language-plaintext highlighter-rouge">NodeTag</code> is an <code class="language-plaintext highlighter-rouge">enum</code>, and enums are typically backed by an integer type in C. To handle this enum correctly, we could either (1) copy the enum definition into the bpftrace program or (2) replace it with a data type of the same size. To keep the bpftrace program simple, the second option is used here. The next three struct members are char pointers that contain the catalog name, the schema name, and the name of the relation. The <code class="language-plaintext highlighter-rouge">schemaname</code> and the <code class="language-plaintext highlighter-rouge">relname</code> are the fields we are interested in. The struct contains more members, but these are omitted to keep the example clear.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">$</span> <span class="n">sudo</span> <span class="n">bpftrace</span> <span class="o">-</span><span class="n">e</span> <span class="err">'</span>
<span class="k">struct</span> <span class="n">RangeVar</span>
<span class="p">{</span>
	<span class="kt">int</span> <span class="n">type</span><span class="p">;</span>
	<span class="kt">char</span> <span class="o">*</span><span class="n">catalogname</span><span class="p">;</span>
	<span class="kt">char</span> <span class="o">*</span><span class="n">schemaname</span><span class="p">;</span>
	<span class="kt">char</span> <span class="o">*</span><span class="n">relname</span><span class="p">;</span>
<span class="p">};</span>

<span class="n">uprobe</span><span class="o">:/</span><span class="n">home</span><span class="o">/</span><span class="n">jan</span><span class="o">/</span><span class="n">postgresql</span><span class="o">-</span><span class="n">sandbox</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">REL_14_2_DEBUG</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">postgres</span><span class="o">:</span><span class="n">vacuum_rel</span>
<span class="p">{</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"[PID %d] Performing vacuum of OID %d (%s.%s)</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="n">arg0</span><span class="p">,</span> <span class="n">str</span><span class="p">(((</span><span class="k">struct</span> <span class="n">RangeVar</span><span class="o">*</span><span class="p">)</span> <span class="n">arg1</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">schemaname</span><span class="p">),</span> <span class="n">str</span><span class="p">(((</span><span class="k">struct</span> <span class="n">RangeVar</span><span class="o">*</span><span class="p">)</span> <span class="n">arg1</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">relname</span><span class="p">));</span>
        <span class="err">@</span><span class="n">start</span><span class="p">[</span><span class="n">tid</span><span class="p">]</span> <span class="o">=</span> <span class="n">nsecs</span><span class="p">;</span>
<span class="p">}</span>

<span class="n">uretprobe</span><span class="o">:/</span><span class="n">home</span><span class="o">/</span><span class="n">jan</span><span class="o">/</span><span class="n">postgresql</span><span class="o">-</span><span class="n">sandbox</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">REL_14_2_DEBUG</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">postgres</span><span class="o">:</span><span class="n">vacuum_rel</span>
<span class="o">/</span><span class="err">@</span><span class="n">start</span><span class="p">[</span><span class="n">tid</span><span class="p">]</span><span class="o">/</span>
<span class="p">{</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"[PID %d] Vacuum call took %d ns</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="n">nsecs</span> <span class="o">-</span> <span class="err">@</span><span class="n">start</span><span class="p">[</span><span class="n">tid</span><span class="p">]);</span>
        <span class="n">delete</span><span class="p">(</span><span class="err">@</span><span class="n">start</span><span class="p">[</span><span class="n">tid</span><span class="p">]);</span>
<span class="p">}</span>
<span class="err">'</span>
</code></pre></div></div>

<p>After the struct is defined, the members of the struct can be accessed as in a regular C program. For example: <code class="language-plaintext highlighter-rouge">((struct RangeVar*) arg1)-&gt;schemaname</code>. In addition, we print the process ID (PID) of the process that triggered the uprobe. This makes it possible to identify the process that performed the vacuum operation.</p>
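<p>The reason why replacing the enum with a type of the same size is sufficient can be illustrated outside of bpftrace. The following Python sketch is only an illustration and not part of the tracing setup: it mirrors the truncated struct with <code class="language-plaintext highlighter-rouge">ctypes</code>, prints the member offsets, and reads the names through a raw address, similar to what the cast of <code class="language-plaintext highlighter-rouge">arg1</code> does. The class <code class="language-plaintext highlighter-rouge">RangeVarAlt</code> and the helper functions are hypothetical names introduced for this example, and the sketch assumes that the enum occupies four bytes.</p>

```python
import ctypes


class RangeVar(ctypes.Structure):
    """Mirror of the truncated struct from the bpftrace program."""
    _fields_ = [
        ("type", ctypes.c_int),            # stands in for the NodeTag enum
        ("catalogname", ctypes.c_char_p),
        ("schemaname", ctypes.c_char_p),
        ("relname", ctypes.c_char_p),
    ]


class RangeVarAlt(ctypes.Structure):
    """Hypothetical variant: a different 4-byte type for the first member."""
    _fields_ = [
        ("type", ctypes.c_uint32),
        ("catalogname", ctypes.c_char_p),
        ("schemaname", ctypes.c_char_p),
        ("relname", ctypes.c_char_p),
    ]


def offsets(struct):
    """Return the byte offset of every member of a ctypes struct."""
    return {name: getattr(struct, name).offset for name, _ in struct._fields_}


def qualified_name(address):
    """Read schema and relation name from a raw address (like arg1)."""
    rv = RangeVar.from_address(address)
    # Unset pointers are returned as None, so fall back to an empty name
    schema = (rv.schemaname or b"").decode()
    relname = (rv.relname or b"").decode()
    return f"{schema}.{relname}"


if __name__ == "__main__":
    for name, offset in offsets(RangeVar).items():
        print(f"{name}: offset {offset}")

    sample = RangeVar(0, None, b"public", b"testtable1")
    print(qualified_name(ctypes.addressof(sample)))
```

<p>Both variants yield the same offsets, because only the size and alignment of the first member determine where the pointers start. If a member before the fields of interest had a different size, the pointers would be read from the wrong position.</p>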

<p>When running the following SQL statements in a second terminal:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">VACUUM</span> <span class="k">FULL</span> <span class="k">public</span><span class="p">.</span><span class="n">testtable1</span><span class="p">;</span>
<span class="k">VACUUM</span> <span class="k">FULL</span> <span class="k">public</span><span class="p">.</span><span class="n">testtable2</span><span class="p">;</span>
</code></pre></div></div>

<p>The bpftrace program shows the following output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Attaching 2 probes...
[PID 616516] Performing vacuum of OID 1153888 (public.testtable1)
[PID 616516] Vacuum call took 23683600 ns
[PID 616516] Performing vacuum of OID 1153891 (public.testtable2)
[PID 616516] Vacuum call took 24240837 ns
</code></pre></div></div>

<p>The table names are extracted from the <code class="language-plaintext highlighter-rouge">RangeVar</code> data structure and shown in the output. However, this data structure is not always populated by PostgreSQL. It might be empty when running <code class="language-plaintext highlighter-rouge">VACUUM FULL</code> without specifying a table name. Therefore, we use two separate invocations with explicit table names to force PostgreSQL to populate it.</p>

<h2 id="optimizing-the-bpftrace-program-using-maps">Optimizing the Bpftrace Program Using Maps</h2>

<p>The bpftrace programs we have developed so far use one or more <code class="language-plaintext highlighter-rouge">printf</code> statements directly. Each <code class="language-plaintext highlighter-rouge">printf</code> call is comparatively expensive and reduces the event rate the bpftrace program can keep up with.</p>

<p>This can be optimized by storing the data in a map that is printed when bpftrace is stopped. To do this, we use three maps: <code class="language-plaintext highlighter-rouge">@start</code>, <code class="language-plaintext highlighter-rouge">@oid</code>, and <code class="language-plaintext highlighter-rouge">@vacuum</code>. The first two maps are populated in the uprobe of the <code class="language-plaintext highlighter-rouge">vacuum_rel</code> function. The map <code class="language-plaintext highlighter-rouge">@start</code> contains the time at which the probe was triggered, and the map <code class="language-plaintext highlighter-rouge">@oid</code> contains the Oid passed as the first function parameter.</p>

<p>When the function returns and the <code class="language-plaintext highlighter-rouge">uretprobe</code> is triggered, the <code class="language-plaintext highlighter-rouge">@vacuum</code> map is populated. The key is the Oid and the value is the time needed to perform the vacuum operation. In addition, the entries of the current thread are removed from the first two maps.</p>

<p>When bpftrace exits (e.g., by pressing CTRL+C), all populated maps are printed automatically. By using these three maps, we have separated the actual monitoring from the output; the expensive output is produced only after the monitoring is done.</p>

<p>In addition, in the following program, we use the two special probes <code class="language-plaintext highlighter-rouge">BEGIN</code> and <code class="language-plaintext highlighter-rouge">END</code>, which bpftrace triggers when the observation begins and ends.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">$</span> <span class="n">sudo</span> <span class="n">bpftrace</span> <span class="o">-</span><span class="n">e</span> <span class="err">'</span>

<span class="n">uprobe</span><span class="o">:/</span><span class="n">home</span><span class="o">/</span><span class="n">jan</span><span class="o">/</span><span class="n">postgresql</span><span class="o">-</span><span class="n">sandbox</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">REL_14_2_DEBUG</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">postgres</span><span class="o">:</span><span class="n">vacuum_rel</span>
<span class="p">{</span>
        <span class="err">@</span><span class="n">start</span><span class="p">[</span><span class="n">tid</span><span class="p">]</span> <span class="o">=</span> <span class="n">nsecs</span><span class="p">;</span>
        <span class="err">@</span><span class="n">oid</span><span class="p">[</span><span class="n">tid</span><span class="p">]</span> <span class="o">=</span> <span class="n">arg0</span><span class="p">;</span>
<span class="p">}</span>

<span class="n">uretprobe</span><span class="o">:/</span><span class="n">home</span><span class="o">/</span><span class="n">jan</span><span class="o">/</span><span class="n">postgresql</span><span class="o">-</span><span class="n">sandbox</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">REL_14_2_DEBUG</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">postgres</span><span class="o">:</span><span class="n">vacuum_rel</span>
<span class="o">/</span><span class="err">@</span><span class="n">start</span><span class="p">[</span><span class="n">tid</span><span class="p">]</span><span class="o">/</span>
<span class="p">{</span>

        <span class="err">@</span><span class="n">vacuum</span><span class="p">[</span><span class="err">@</span><span class="n">oid</span><span class="p">[</span><span class="n">tid</span><span class="p">]]</span> <span class="o">=</span> <span class="n">nsecs</span> <span class="o">-</span> <span class="err">@</span><span class="n">start</span><span class="p">[</span><span class="n">tid</span><span class="p">];</span>
        <span class="n">delete</span><span class="p">(</span><span class="err">@</span><span class="n">start</span><span class="p">[</span><span class="n">tid</span><span class="p">]);</span>
        <span class="n">delete</span><span class="p">(</span><span class="err">@</span><span class="n">oid</span><span class="p">[</span><span class="n">tid</span><span class="p">]);</span>

<span class="p">}</span>

<span class="n">BEGIN</span>
<span class="p">{</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"VACUUM calles are traced, press CTRL+C to stop tracing</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">END</span> 
<span class="p">{</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"</span><span class="se">\n\n</span><span class="s">Needed time in ns to perform VACUUM FULL per Oid</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="p">}</span>
<span class="err">'</span>
</code></pre></div></div>

<p>After bpftrace is started, the first message is printed. After the program is stopped, the second message is printed. In addition, the content of the <code class="language-plaintext highlighter-rouge">@vacuum</code> map is printed. For each Oid, the needed time for the vacuum operations is shown.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>VACUUM calles are traced, press CTRL+C to stop tracing
^C

Needed time in ns to perform VACUUM FULL per Oid

@vacuum[1153888]: 7526823
@vacuum[1153891]: 8462672
@vacuum[2613]: 10764797
@vacuum[2995]: 11429589
@vacuum[6102]: 11436539
@vacuum[12801]: 14373934
@vacuum[6106]: 14396012
@vacuum[3118]: 14507167
@vacuum[3596]: 14695385
@vacuum[12811]: 14871237
@vacuum[3429]: 15106778
@vacuum[3350]: 15158742
@vacuum[2611]: 15432053
@vacuum[3764]: 15534169
@vacuum[2601]: 16055863
@vacuum[3602]: 16128624
@vacuum[2605]: 16405419
@vacuum[2616]: 16914195
@vacuum[3576]: 17003920
[...]
</code></pre></div></div>
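<p>The map dump is plain text and easy to post-process. The following Python sketch is a possible helper and not part of the bpftrace program: it parses lines of the form shown above and reports the slowest relations in milliseconds. The function names <code class="language-plaintext highlighter-rouge">parse_vacuum_map</code> and <code class="language-plaintext highlighter-rouge">slowest</code> are hypothetical, and the sketch assumes that the output format matches the listing above exactly.</p>

```python
import re

# Matches map entries such as "@vacuum[1153888]: 7526823"
MAP_LINE = re.compile(r"^@vacuum\[(\d+)\]:\s+(\d+)$")


def parse_vacuum_map(text):
    """Return {oid: nanoseconds} for every '@vacuum[oid]: ns' line in text."""
    durations = {}
    for line in text.splitlines():
        match = MAP_LINE.match(line.strip())
        if match:
            durations[int(match.group(1))] = int(match.group(2))
    return durations


def slowest(durations, limit=3):
    """Return the `limit` slowest (oid, milliseconds) pairs, slowest first."""
    ranked = sorted(durations.items(), key=lambda item: item[1], reverse=True)
    return [(oid, ns / 1_000_000) for oid, ns in ranked[:limit]]


if __name__ == "__main__":
    sample = """
@vacuum[1153888]: 7526823
@vacuum[1153891]: 8462672
@vacuum[2613]: 10764797
"""
    for oid, ms in slowest(parse_vacuum_map(sample)):
        print(f"OID {oid}: {ms:.2f} ms")
```

<p>Lines that do not match the pattern, such as the heading or the <code class="language-plaintext highlighter-rouge">[...]</code> marker, are skipped. The resulting Oids can then be resolved to relation names with the <code class="language-plaintext highlighter-rouge">pg_class</code> query shown earlier.</p>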

<h2 id="conclusion">Conclusion</h2>
<p>This article provides a brief overview of eBPF. To trace the function latency of PostgreSQL vacuum calls, we used the tool <code class="language-plaintext highlighter-rouge">funclatency-bpfcc</code>. Additionally, we utilized bpftrace to create a tool that allows for more in-depth observation of the calls. Our bpftrace script also takes into account the parameters of the PostgreSQL <code class="language-plaintext highlighter-rouge">vacuum_rel</code> function, enabling us to monitor the vacuum time per relation.</p>]]></content><author><name>Jan Nidzwetzki</name></author><category term="PostgreSQL" /><category term="eBPF" /><category term="Debugging" /><summary type="html"><![CDATA[The eBPF technology of the Linux kernel makes it possible to monitor applications with minimal overhead. UProbes can be used to trace the invocation and exit of functions in programs. Modern tools to observe databases (like pg-lock-tracer) are built on top of eBPF. However, these full-fledged tools are often written in C and Python and require some development effort. Sometimes, a ‘quick and dirty’ solution for a particular observation would be sufficient. With bpftrace, users can create eBPF programs with a few lines of code. 
In this article, we develop a simple bpftrace program to observe the execution of vacuum calls in PostgreSQL and analyze the delay.]]></summary></entry><entry><title type="html">GDB Pretty Print Extension for PostgreSQL Bitmapsets</title><link href="https://jnidzwetzki.github.io/2023/04/09/gdb-pretty-print-for-postgresql-bitmapset.html" rel="alternate" type="text/html" title="GDB Pretty Print Extension for PostgreSQL Bitmapsets" /><published>2023-04-09T00:00:00+00:00</published><updated>2023-04-09T00:00:00+00:00</updated><id>https://jnidzwetzki.github.io/2023/04/09/gdb-pretty-print-for-postgresql-bitmapset</id><content type="html" xml:base="https://jnidzwetzki.github.io/2023/04/09/gdb-pretty-print-for-postgresql-bitmapset.html"><![CDATA[<p>To store sets of integer values efficiently, PostgreSQL internally uses a data structure called <a href="https://github.com/postgres/postgres/blob/master/src/include/nodes/bitmapset.h">Bitmapset</a>. A wide range of operations are supported on the <code class="language-plaintext highlighter-rouge">Bitmapset</code>.</p>

<!--more-->

<p>This data structure is widely used in the PostgreSQL code. Internally, so-called <code class="language-plaintext highlighter-rouge">words</code> of bits store which elements are part of the set. For instance, the data structure supports efficiently testing whether an integer is part of the set (using the <code class="language-plaintext highlighter-rouge">bms_is_member</code> function), adding new values (using the <code class="language-plaintext highlighter-rouge">bms_add_member</code>, <code class="language-plaintext highlighter-rouge">bms_add_members</code>, or <code class="language-plaintext highlighter-rouge">bms_add_range</code> functions), and iterating over the values (using the <code class="language-plaintext highlighter-rouge">bms_next_member</code> and <code class="language-plaintext highlighter-rouge">bms_prev_member</code> functions).</p>
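<p>To make the idea of words of bits more concrete, the following Python sketch models such a set as a list of integers. It is a simplified illustration under stated assumptions and not PostgreSQL's implementation: the functions <code class="language-plaintext highlighter-rouge">add_member</code>, <code class="language-plaintext highlighter-rouge">is_member</code>, and <code class="language-plaintext highlighter-rouge">members</code> are hypothetical Python stand-ins, only non-negative integers are handled, and the word size of 32 bits is an assumption that can differ between builds.</p>

```python
BITS_PER_WORD = 32  # assumption for this sketch; the real size depends on the build


def add_member(words, x):
    """Return a new word list in which the bit for x is set."""
    word_no, bit = divmod(x, BITS_PER_WORD)
    # Append zero words until the word that holds x exists
    words = list(words) + [0] * max(0, word_no + 1 - len(words))
    words[word_no] |= 1 << bit
    return words


def is_member(words, x):
    """Check whether the bit for x is set."""
    word_no, bit = divmod(x, BITS_PER_WORD)
    return word_no < len(words) and bool((words[word_no] >> bit) & 1)


def members(words):
    """Decode all set bits into an ascending list of integers."""
    return [
        word_no * BITS_PER_WORD + bit
        for word_no, word in enumerate(words)
        for bit in range(BITS_PER_WORD)
        if (word >> bit) & 1
    ]


if __name__ == "__main__":
    words = []
    for value in (0, 3, 33):
        words = add_member(words, value)
    print(words)
    print(members(words))
```

<p>The value 33 does not fit into the first word, so a second word is added and bit 1 of that word is set. Looking only at the raw words, it is hard to tell which integers are stored, which is the problem discussed in the next section.</p>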

<h2 id="dumping-the-content-of-the-bitmapset">Dumping the Content of the Bitmapset</h2>
<p>However, the content of this data structure is difficult to debug. The debugger does not show the stored content due to the lack of knowledge about the semantics of the bits. A lot of internal PostgreSQL data structures can be dumped using the <code class="language-plaintext highlighter-rouge">pprint</code> <a href="https://github.com/postgres/postgres/blob/c8e1ba736b2b9e8c98d37a5b77c4ed31baf94147/src/backend/nodes/print.c#L54">function</a>. Unfortunately, the <code class="language-plaintext highlighter-rouge">pprint</code> function is unable to print the content of the Bitmapset.</p>

<p>For instance, when GDB prints the content of the set, the output looks as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) print *node_state-&gt;unused_batch_states
$1 = {nwords = 1, words = 0x5588689773f0}
</code></pre></div></div>

<p>The output indicates that one <code class="language-plaintext highlighter-rouge">word</code> (consisting of 32 bits) is used to represent the stored values. Unfortunately, the output does not show which values are stored.</p>

<p>A <a href="https://postgrespro.com/list/thread-id/1900731">patch</a> that introduces a function called <code class="language-plaintext highlighter-rouge">bmsToString</code> was discussed on the PostgreSQL developer mailing list. This function can also be used to display the content of a Bitmapset. However, it can only be called while PostgreSQL is running. When a core dump of a crashed PostgreSQL process is examined with GDB, the function cannot be used.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) call bmsToString(chunk_state-&gt;unused_batch_states)
$6 = 0x5588689a8818 "(b 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15)"
</code></pre></div></div>

<p>Because the Bitmapset data structure is used heavily inside PostgreSQL and there is no reliable way to print its content while debugging, I have developed a GDB extension that makes the content displayable in the debugger.</p>

<h2 id="a-gdb-extension-to-show-the-content-of-the-bitmapset">A GDB extension to show the content of the Bitmapset</h2>

<p>The GDB debugger can be <a href="https://sourceware.org/gdb/onlinedocs/gdb/Python.html">extended</a> using Python scripts. The <em>Pretty Printing API</em> can be used to develop <a href="https://sourceware.org/gdb/onlinedocs/gdb/Pretty-Printing-API.html">pretty printers</a> that analyze data structures and improve how the debugger displays them.</p>

<p>The following Python script shows such an extension. It registers a new set of pretty printers via a <code class="language-plaintext highlighter-rouge">RegexpCollectionPrettyPrinter</code> object. These printers are called when GDB prints a value of the <code class="language-plaintext highlighter-rouge">Bitmapset</code> or <code class="language-plaintext highlighter-rouge">Relids</code> data type. The printer decodes the <code class="language-plaintext highlighter-rouge">words</code> of the Bitmapset into decimal values, adds these values to a list, and converts this list into a string.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">gdb.printing</span> <span class="kn">import</span> <span class="n">PrettyPrinter</span><span class="p">,</span> <span class="n">register_pretty_printer</span>
<span class="kn">import</span> <span class="nn">gdb</span>

<span class="k">class</span> <span class="nc">BitmapsetPrettyPrinter</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">val</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">val</span> <span class="o">=</span> <span class="n">val</span>

    <span class="k">def</span> <span class="nf">to_string</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">values</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="n">bits_per_word</span> <span class="o">=</span> <span class="mi">32</span>

        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">val</span> <span class="ow">is</span> <span class="bp">None</span> <span class="ow">or</span> <span class="bp">self</span><span class="p">.</span><span class="n">val</span><span class="p">.</span><span class="nb">type</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
           <span class="k">return</span> <span class="s">"0x0"</span>

        <span class="n">words</span> <span class="o">=</span> <span class="bp">None</span>

        <span class="k">try</span><span class="p">:</span>
           <span class="n">words</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">val</span><span class="p">[</span><span class="s">"nwords"</span><span class="p">]</span>
        <span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
          <span class="k">return</span> <span class="s">'is not iterable'</span>

        <span class="k">for</span> <span class="n">word_no</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">words</span><span class="p">):</span>
           <span class="n">word</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">val</span><span class="p">[</span><span class="s">"words"</span><span class="p">][</span><span class="n">word_no</span><span class="p">]</span>
           <span class="k">for</span> <span class="n">bit</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">bits_per_word</span><span class="p">):</span>
              <span class="k">if</span> <span class="n">word</span> <span class="o">&amp;</span> <span class="p">(</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">bit</span><span class="p">):</span>
                  <span class="n">values</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">word_no</span> <span class="o">*</span> <span class="n">bits_per_word</span> <span class="o">+</span> <span class="n">bit</span><span class="p">)</span>

        <span class="k">return</span> <span class="sa">f</span><span class="s">"PGBitmapset (</span><span class="si">{</span><span class="nb">str</span><span class="p">(</span><span class="n">values</span><span class="p">)</span><span class="si">}</span><span class="s">)"</span>

    <span class="k">def</span> <span class="nf">display_hint</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="s">'PGBitmapset'</span>

<span class="k">def</span> <span class="nf">build_pretty_printer</span><span class="p">():</span>
    <span class="n">pp</span> <span class="o">=</span> <span class="n">gdb</span><span class="p">.</span><span class="n">printing</span><span class="p">.</span><span class="n">RegexpCollectionPrettyPrinter</span><span class="p">(</span><span class="s">"PostgreSQLPrettyPrinter"</span><span class="p">)</span>
    <span class="n">pp</span><span class="p">.</span><span class="n">add_printer</span><span class="p">(</span><span class="s">'Bitmapset'</span><span class="p">,</span> <span class="s">'^Bitmapset$'</span><span class="p">,</span> <span class="n">BitmapsetPrettyPrinter</span><span class="p">)</span>
    <span class="n">pp</span><span class="p">.</span><span class="n">add_printer</span><span class="p">(</span><span class="s">'Relids'</span><span class="p">,</span> <span class="s">'^Relids$'</span><span class="p">,</span> <span class="n">BitmapsetPrettyPrinter</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">pp</span>

<span class="n">register_pretty_printer</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">build_pretty_printer</span><span class="p">(),</span> <span class="n">replace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
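
<p>The decoding loop of the pretty printer can also be tried outside of GDB. The following is a minimal sketch in plain Python and not part of the GDB script above: the function name <code class="language-plaintext highlighter-rouge">decode_bitmapset_words</code> and its parameters are made up for this illustration, and the default word width of 64 bits is an assumption that matches 64-bit builds of recent PostgreSQL versions.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def decode_bitmapset_words(words, bits_per_word=64):
    """Return the member indices encoded in a list of bitmap words.

    Hypothetical helper for illustration only; it works on plain Python
    integers instead of gdb.Value objects.
    """
    members = []
    for word_no, word in enumerate(words):
        for bit in range(bits_per_word):
            # Integer division by 2**bit moves the tested bit into the
            # lowest position; the remainder modulo 2 is that bit.
            if (word // 2**bit) % 2 == 1:
                members.append(word_no * bits_per_word + bit)
    return members


# A single word with bits 1 to 15 set (0xFFFE) contains the members 1..15.
print(decode_bitmapset_words([0xFFFE]))
</code></pre></div></div>

<p>Each set bit contributes the member <code class="language-plaintext highlighter-rouge">word_no * bits_per_word + bit</code>, so the word width has to match the width that PostgreSQL was compiled with. Otherwise, members stored in the second or a later word are reported with a wrong offset.</p>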

<h2 id="registering-the-pretty-printer">Registering the Pretty Printer</h2>

<p>This Python script can be stored in a new file and loaded into GDB via the <code class="language-plaintext highlighter-rouge">source</code> command.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) source /home/jan/dev/postgresql_printer.py
</code></pre></div></div>

<p>After the file is loaded, the two pretty printers are registered. The command <code class="language-plaintext highlighter-rouge">info pretty-printer</code> lists the pretty printers that are registered in GDB. With the two new printers loaded, the output looks as follows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) info pretty-printer
global pretty-printers:
  PostgreSQLPrettyPrinter
    Bitmapset
    Relids
  builtin
    mpx_bound128
[...]
</code></pre></div></div>

<p>Printing the content of the variable <code class="language-plaintext highlighter-rouge">unused_batch_states</code> in GDB now produces the following output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) print *node_state-&gt;unused_batch_states
$3 = PGBitmapset ([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
</code></pre></div></div>

<p>The output now clearly shows which integer values are part of the Bitmapset. This is similar to the output of the <code class="language-plaintext highlighter-rouge">bmsToString</code> function shown above. The main difference is that the GDB extension also works when a core dump file is analyzed and no PostgreSQL process is running.</p>

<p>The pretty printer has to be loaded via the <code class="language-plaintext highlighter-rouge">source</code> command every time GDB is restarted, which is cumbersome. To avoid this, the command can be added to the <code class="language-plaintext highlighter-rouge">~/.gdbinit</code> file. GDB executes the commands in this file automatically on every start.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat ~/.gdbinit
source /home/jan/dev/postgresql_printer.py
</code></pre></div></div>]]></content><author><name>Jan Nidzwetzki</name></author><category term="PostgreSQL" /><category term="GDB" /><category term="Debugging" /><summary type="html"><![CDATA[To store sets of integer values efficiently, PostgreSQL internally uses a data structure called Bitmapset. A wide range of operations are supported on the Bitmapset.]]></summary></entry></feed>