Jan’s website and blog

Introduction to Snapshots and Tuple Visibility in PostgreSQL

2024-04-03T00:00:00+00:00

Like many relational DBMSs, PostgreSQL uses multi-version concurrency control (MVCC) to support concurrently running transactions and to coordinate concurrent access to tuples. Snapshots determine which version of a tuple is visible in which transaction. Each transaction that modifies data has a transaction ID (txid). Tuples are stored together with two attributes (xmin, xmax) that determine in which snapshots (and thus in which transactions) they are visible.

This blog post discusses some of the implementation details of snapshots.

Tuple Visibility

The following table is used in this article to illustrate how snapshots work in PostgreSQL.

CREATE TABLE temperature (
  time timestamptz NOT NULL,
  value float
);

So, let’s insert the first record in this table. This is done by creating a new transaction, getting the current transaction ID if available, inserting a new tuple, getting the transaction ID again, and committing the transaction.

BEGIN;

SELECT * FROM txid_current_if_assigned();
 txid_current_if_assigned
--------------------------

(1 row)

INSERT INTO temperature VALUES(now(), 4);

SELECT * FROM txid_current_if_assigned();
 txid_current_if_assigned
--------------------------
                  5062286
(1 row)

COMMIT;

An important thing that can be seen in this example is that PostgreSQL assigns a transaction ID to a transaction only once it modifies data. This avoids unneeded work and slows down the consumption of transaction IDs. Because the transaction ID is a 32-bit integer, the value space is exhausted at some point. PostgreSQL can deal with this overflow (i.e., tuples can be frozen to handle transaction ID wrap-arounds properly).
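To make the wrap-around idea more concrete, here is a minimal Python sketch of how two 32-bit transaction IDs can be compared modulo 2^32. The function name xid_precedes is made up for this illustration. The sketch ignores the reserved transaction IDs that PostgreSQL treats specially, so it is a simplified model and not the server's actual code.

```python
def xid_precedes(a: int, b: int) -> bool:
    """Return True if XID a is logically older than XID b.

    Simplified model (assumption): both values are ordinary 32-bit XIDs.
    The difference is computed modulo 2**32 and then interpreted as a
    signed 32-bit number, so the comparison keeps working after the
    counter has wrapped around.
    """
    diff = (a - b) & 0xFFFFFFFF       # unsigned 32-bit difference
    if diff >= 0x80000000:            # reinterpret as signed 32-bit value
        diff -= 0x100000000
    return diff < 0


# Without wrap-around the result matches a plain integer comparison.
print(xid_precedes(5062286, 5062291))

# After a wrap-around, the numerically small XID 10 is the newer one.
print(xid_precedes(4294967290, 10))
```

With this kind of comparison, only about two billion transaction IDs can be told apart in each direction, which is why old tuples have to be frozen before the counter gets that far ahead of them.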

The system attributes xmin and xmax determine the first transaction and the last transaction that are able to see a particular tuple. In addition, the ctid attribute shows the location of the tuple: the page number and the position of the tuple on that page. The values of these attributes are only shown when they are mentioned explicitly in a SELECT statement:

SELECT xmin, xmax, ctid, * FROM temperature;
  xmin   | xmax |  ctid |             time              | value
---------+------+-------+-------------------------------+-------
 5062286 |    0 | (0,1) | 2024-04-02 22:06:03.035868+02 |     4
(1 row)

The output means that all transactions that have a transaction ID >= 5062286 see this tuple. When the tuple is deleted, the xmax value is populated with the ID of the deleting transaction, which marks the upper bound of the transactions that can see this tuple. The ctid of (0,1) means that the tuple is the first tuple on page 0. Now we delete the tuple:

BEGIN;

DELETE FROM temperature;

SELECT * FROM txid_current_if_assigned();
 txid_current_if_assigned
--------------------------
                  5062291
(1 row)

COMMIT;

However, when a SELECT statement is performed, nothing is returned, instead of a tuple with a populated xmin and xmax value.

SELECT xmin, xmax, ctid, * FROM temperature;
 xmin | xmax | ctid | time | value
------+------+------+------+-------
(0 rows)

The reason for this behavior is the internal scanner: if a tuple is not visible in the current transaction snapshot, the scanner does not return it. To get these values from the tuple, we need to use more low-level tools instead of a simple SELECT.

The pageinspect extension for PostgreSQL allows us to get all tuples that are stored on a page and also decode the internal flags and attributes. The extension needs to be loaded and afterward, the pages of a relation can be examined.

-- Load the extension
CREATE EXTENSION pageinspect;

-- Get the tuples of the first page of the relation 'temperature'
SELECT lp, t_xmin, t_xmax FROM heap_page_items(get_raw_page('temperature', 0));

 lp | t_xmin  | t_xmax
----+---------+---------
  1 | 5062286 | 5062291

The output shows that the first tuple of page 0 (the ctid of (0,1) in the output above) has a t_xmax value of 5062291, which is the ID of the transaction that deleted the tuple. So, no transaction with a transaction ID larger than 5062291 sees this tuple.

Snapshots

When PostgreSQL scans a table, a snapshot has to be specified. See the table_beginscan function, which takes the snapshot data as the second parameter:

static inline TableScanDesc table_beginscan(Relation rel,
    Snapshot snapshot, int nkeys, struct ScanKeyData *key)

Internal Data Structures

Usually, the transaction snapshot is used as a parameter for this function. The structure SnapshotData contains all the information that is part of a snapshot. In this blog post, we will focus on the following attributes:

typedef struct SnapshotData
{
  [...]
	/*
	 * An MVCC snapshot can never see the effects of XIDs >= xmax. It can see
	 * the effects of all older XIDs except those listed in the snapshot. xmin
	 * is stored as an optimization to avoid needing to search the XID arrays
	 * for most tuples.
	 */
	TransactionId xmin;			/* all XID < xmin are visible to me */
	TransactionId xmax;			/* all XID >= xmax are invisible to me */

	/*
	 * For normal MVCC snapshot this contains the all xact IDs that are in
	 * progress, unless the snapshot was taken during recovery in which case
	 * it's empty. For historic MVCC snapshots, the meaning is inverted, i.e.
	 * it contains *committed* transactions between xmin and xmax.
	 *
	 * note: all ids in xip[] satisfy xmin <= xip[i] < xmax
	 */
	TransactionId *xip;
	uint32		xcnt;			/* # of xact ids in xip[] */
  [...]
}

The field xmin defines the oldest transaction that was still active when the snapshot was taken. All transactions with a txid lower than this value have already finished. So, the changes of all committed transactions with a lower txid are visible in this snapshot. xmax defines the upper bound of the snapshot: all tuples with a txid >= xmax are invisible to the current snapshot.

For what reason are the fields xip and xcnt needed? For the transaction IDs between xmin and xmax, it needs to be determined if the transaction was committed or in progress when the snapshot was created.

A DBMS processes the queries of multiple users, who can start transactions at any time. Transactions do not necessarily commit in the order in which they were started. This means that there might be transactions with a transaction ID larger than xmin that are already committed when the snapshot is created. However, some other transactions in the range [xmin, xmax) have still not been committed. Since the data of the committed and uncommitted transactions needs to be handled properly, an array of transaction IDs xip of the length xcnt is defined. It contains all transactions with an ID >= xmin and < xmax that were in progress when the snapshot was taken.
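The interplay of xmin, xmax, and xip can be sketched in a few lines of Python. This is a simplified model and not PostgreSQL's actual visibility code: the names Snapshot, xid_visible, and tuple_visible are made up for this illustration, and the sketch assumes that every finished transaction has committed (it ignores aborted transactions, hint bits, changes of the own transaction, and transaction ID wrap-around). The transaction IDs are illustrative values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    xmin: int                      # all XIDs < xmin are finished
    xmax: int                      # all XIDs >= xmax are invisible
    xip: frozenset = frozenset()   # XIDs in progress when the snapshot was taken


def xid_visible(snap: Snapshot, xid: int) -> bool:
    """Are the effects of transaction `xid` visible in the snapshot?"""
    if xid < snap.xmin:
        return True                # assumption: finished means committed
    if xid >= snap.xmax:
        return False               # started after the snapshot was taken
    return xid not in snap.xip     # in between: visible unless still in progress


def tuple_visible(snap: Snapshot, t_xmin: int, t_xmax: int) -> bool:
    """A tuple is visible if its insert is visible and its delete is not."""
    if not xid_visible(snap, t_xmin):
        return False
    if t_xmax == 0:                # 0 means the tuple was never deleted
        return True
    return not xid_visible(snap, t_xmax)


snap = Snapshot(xmin=5062310, xmax=5062313, xip=frozenset({5062310, 5062311}))

print(tuple_visible(snap, 5062286, 0))        # inserted long ago, never deleted
print(tuple_visible(snap, 5062286, 5062291))  # inserted and deleted before the snapshot
print(tuple_visible(snap, 5062310, 0))        # inserted by an in-progress transaction
```

The early return for XIDs below xmin is the optimization mentioned in the struct comment: for most tuples, the xip array does not have to be searched at all.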

Example

To illustrate the behavior, let’s perform a practical example using three transactions.

Transaction 1

BEGIN;

INSERT INTO temperature VALUES(now(), 5);

SELECT * FROM txid_current_if_assigned();
 txid_current_if_assigned
--------------------------
                  5062310
(1 row)

The first transaction inserts new data into the table temperature but stays uncommitted. The transaction has a transaction ID of 5062310.

Transaction 2

BEGIN;

INSERT INTO temperature VALUES(now(), 5);

SELECT * FROM txid_current_if_assigned();
 txid_current_if_assigned
--------------------------
                  5062311
(1 row)

Also, the second transaction inserts data into the same table but also stays uncommitted. The ID of this transaction is 5062311.

Transaction 3

SELECT * FROM pg_current_snapshot();
 pg_current_snapshot
---------------------
 5062310:5062310:
(1 row)

The third transaction uses the function pg_current_snapshot to get the current snapshot. The output of the function means that all changes by transactions with an ID lower than 5062310 are visible. Changes by transactions with an ID equal to or larger than 5062310 are not visible, and no in-progress transaction is listed at this point.

So, what happened to the still pending transactions 5062310 and 5062311? Since no further transactions have been committed so far in this demo system, PostgreSQL has not changed the current transaction ID. However, this can be changed:

SELECT * FROM pg_current_xact_id_if_assigned();
 pg_current_xact_id_if_assigned
--------------------------------

(1 row)

SELECT * FROM pg_current_xact_id();
 pg_current_xact_id
--------------------
            5062312
(1 row)

SELECT * FROM pg_current_snapshot();
       pg_current_snapshot
---------------------------------
 5062310:5062313:5062310,5062311
(1 row)

In contrast to the function pg_current_xact_id_if_assigned, the function pg_current_xact_id forces the assignment of a transaction ID to the current transaction. In our case, this is 5062312. The usage of this transaction ID also leads to an update of the snapshot.

The first value stays the same: all changes made by transactions with an ID lower than 5062310 are still visible in the current snapshot. However, the upper limit (xmax) has changed. Now, all changes made by transactions with an ID equal to or larger than 5062313 are not visible in the current snapshot. Since our transaction ID is 5062312, it makes sense that these changes should not be visible. What about the new part 5062310,5062311? This is the xip part of the snapshot and means that the two transactions, 5062310 and 5062311, were uncommitted at the moment when the snapshot was taken. Therefore, their changes are also not visible in the current snapshot. As soon as one of these transactions commits and we take a new snapshot, its transaction ID is removed from xip, and therefore, its changes become visible in the new snapshot.
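The textual form xmin:xmax:xip_list shown in the outputs above is easy to take apart. The following Python sketch splits such a string into its three parts; the function name parse_snapshot is made up for this example.

```python
def parse_snapshot(text: str):
    """Split the text form 'xmin:xmax:xip_list' into (xmin, xmax, xip)."""
    xmin_part, xmax_part, xip_part = text.strip().split(":")
    # The third part is empty when no transaction was in progress.
    xip = [int(xid) for xid in xip_part.split(",") if xid]
    return int(xmin_part), int(xmax_part), xip


print(parse_snapshot("5062310:5062310:"))
print(parse_snapshot("5062310:5062313:5062310,5062311"))
```

For the second snapshot, the parsed list contains the two in-progress transactions, and both satisfy xmin <= xid < xmax, as required by the comment in the SnapshotData struct.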

Exporting Snapshots

Another interesting feature of PostgreSQL is the ability to export snapshots and load them in other sessions. The export of a snapshot can be done by calling the function pg_export_snapshot. The function returns the ID of the snapshot and creates a corresponding file in the pg_snapshots folder of the data directory.

BEGIN;

SELECT * FROM pg_export_snapshot();
 pg_export_snapshot
---------------------
 0000000C-000005F6-1
(1 row)

This file contains the same information as returned by pg_current_snapshot, which we discussed above. In addition, it contains further information about the used isolation level or the used database ID.

$ cat ~/postgresql-sandbox/data/REL_15_1_DEBUG/pg_snapshots/0000000C-000005F6-1
vxid:12/1526
pid:1362769
dbid:706615
iso:1
ro:0
xmin:5062310
xmax:5062313
xcnt:2
xip:5062310
xip:5062311
sof:0
sxcnt:0
rec:0

This exported snapshot can be loaded into another transaction by calling SET TRANSACTION SNAPSHOT '0000000C-000005F6-1', so that this transaction runs with the same snapshot as the transaction that created the snapshot.
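To relate the exported file to the snapshot fields discussed earlier, the following Python sketch reads the key:value lines of the dump shown above. The function name parse_exported_snapshot is made up for this example, and the file layout is taken only from the dump in this post; it is an internal format, so the sketch assumes nothing beyond what is shown here.

```python
SNAPSHOT_FILE = """\
vxid:12/1526
pid:1362769
dbid:706615
iso:1
ro:0
xmin:5062310
xmax:5062313
xcnt:2
xip:5062310
xip:5062311
sof:0
sxcnt:0
rec:0
"""


def parse_exported_snapshot(content: str) -> dict:
    """Collect the key:value lines; the repeated xip lines become a list."""
    fields = {}
    xip = []
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(":")   # split at the first colon only
        if key == "xip":
            xip.append(int(value))
        else:
            fields[key] = value
    fields["xip"] = xip
    return fields


snapshot = parse_exported_snapshot(SNAPSHOT_FILE)
print(snapshot["xmin"], snapshot["xmax"], snapshot["xip"])
```

The values of xmin, xmax, and xip are the same as in the output of pg_current_snapshot above, and xcnt matches the number of xip lines.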

Snapshots and Transaction Isolation Level

Depending on the isolation level, the snapshot is taken when the transaction is started (Repeatable read) or for every statement in the transaction (Read committed). When a new snapshot is created for each statement inside of a transaction, the committed data from other transactions becomes visible in the current transaction. If only one snapshot is created for the entire transaction, the xmax value stays constant, no new data from transactions with a higher ID becomes visible, and reads are repeatable.
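The difference between the two isolation levels can be illustrated with a toy model in Python. The classes Database and Transaction are made up for this sketch, and a snapshot is modeled as a plain copy of the committed rows, which is a strong simplification of what PostgreSQL does. In this sketch, the Repeatable read snapshot is taken lazily by the first SELECT of the transaction.

```python
class Database:
    """Toy store that only tracks committed rows."""

    def __init__(self):
        self.committed_rows = []

    def commit(self, row):
        self.committed_rows.append(row)

    def take_snapshot(self):
        # Assumption: a snapshot is modeled as a frozen copy of the rows.
        return tuple(self.committed_rows)


class Transaction:
    def __init__(self, db, isolation):
        self.db = db
        self.isolation = isolation
        self._snapshot = None

    def select(self):
        # Read committed: new snapshot for every statement.
        # Repeatable read: reuse the first snapshot for the whole transaction.
        if self.isolation == "read committed" or self._snapshot is None:
            self._snapshot = self.db.take_snapshot()
        return list(self._snapshot)


db = Database()
db.commit(4)

read_committed = Transaction(db, "read committed")
repeatable_read = Transaction(db, "repeatable read")

print(read_committed.select(), repeatable_read.select())

db.commit(5)  # another transaction commits in the meantime

print(read_committed.select(), repeatable_read.select())
```

After the second commit, only the Read committed transaction sees the new row, because it is the only one that takes a fresh snapshot for its next statement.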

Summary

This blog post discusses the basics of multi-version concurrency control in PostgreSQL. Afterward, it introduces snapshots and shows how they control the visibility of tuples. The integration with the table scan API is discussed as well.

Trace PostgreSQL Row-Level Locks with pg_row_lock_tracer

2024-02-28T00:00:00+00:00

PostgreSQL uses several types of locks to coordinate parallel running transactions and grant access to resources like tuples, tables, and in-memory data structures.

Heavyweight locks are used to control the access to tables. Lightweight locks (LWLocks) control access to data structures, such as adding data to the write-ahead log (WAL). Row-level locks are used to control access to tuples. For example, individual tuples need to be locked when an SQL statement like SELECT * FROM table WHERE i > 10 FOR UPDATE; is executed. The tuples that are returned by the query are internally locked with an exclusive lock (LOCK_TUPLE_EXCLUSIVE). Another transaction that tries to lock the same tuples has to wait until the first transaction unlocks the tuples.

In this article, the tool pg_row_lock_tracer is discussed. The tool employs eBPF and UProbes to trace the row-locking behavior of PostgreSQL. It can be downloaded from the website of the pg-lock-tracer project.

This is the third article that deals with the tracing of PostgreSQL locks. The first article deals with the tracing of heavyweight locks. The second article deals with LW locks.

Background

PostgreSQL implements four different row lock modes. They can be requested by adding FOR UPDATE, FOR NO KEY UPDATE, FOR SHARE, or FOR KEY SHARE to a SELECT statement. Also, operations like updates acquire these locks automatically before a tuple is updated. For example, when a transaction successfully performs a FOR UPDATE lock on a tuple, an update operation of another parallel running transaction is blocked until the lock of the first transaction is released. Row-locks can be requested by calling the function heapam_tuple_lock.

Lock Types

Internally, these locks are called LockTupleKeyShare, LockTupleShare, LockTupleNoKeyExclusive, and LockTupleExclusive. They are defined in the enum LockTupleMode. These locks have different strengths. Some lock modes are compatible (i.e., multiple transactions can hold locks on the same row at the same time), while others conflict (i.e., only one of the locks can be held at a time, and a transaction that requests a conflicting lock has to wait until it is granted).
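Which of the four modes conflict with each other can be written down as a small lookup table. The following Python sketch encodes the conflict table for row-level locks from the PostgreSQL documentation, using the internal mode names from above as keys. The dictionary CONFLICTS and the function conflicts are made up for this illustration and are not part of PostgreSQL.

```python
ALL_MODES = (
    "LockTupleKeyShare",        # FOR KEY SHARE
    "LockTupleShare",           # FOR SHARE
    "LockTupleNoKeyExclusive",  # FOR NO KEY UPDATE
    "LockTupleExclusive",       # FOR UPDATE
)

# For each held mode: the set of requested modes that have to wait.
CONFLICTS = {
    "LockTupleKeyShare": {"LockTupleExclusive"},
    "LockTupleShare": {"LockTupleNoKeyExclusive", "LockTupleExclusive"},
    "LockTupleNoKeyExclusive": {
        "LockTupleShare",
        "LockTupleNoKeyExclusive",
        "LockTupleExclusive",
    },
    "LockTupleExclusive": set(ALL_MODES),
}


def conflicts(held: str, requested: str) -> bool:
    """Return True if `requested` cannot be granted while `held` is held."""
    return requested in CONFLICTS[held]


print(conflicts("LockTupleKeyShare", "LockTupleNoKeyExclusive"))
print(conflicts("LockTupleShare", "LockTupleExclusive"))
```

The table is symmetric: if mode A has to wait for mode B, then B also has to wait for A. The weakest mode, LockTupleKeyShare, only conflicts with LockTupleExclusive.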

Lock Behavior

The user has the ability to specify various lock behaviors in addition to different lock modes. For instance, if a tuple is already locked and a second transaction requests a conflicting lock and would have to wait, the user can choose to skip the lock. The possible behaviors are defined in the enum LockWaitPolicy.

For example, the following SQL query acquires a LockTupleExclusive row lock on every matching tuple for which the lock does not conflict. Tuples that are already locked are skipped; the current transaction does not try to lock them.

SELECT * FROM table WHERE i > 10 FOR UPDATE SKIP LOCKED;

The transaction that has successfully acquired the locks can assume that nobody else can modify the tuples in parallel. So, the values returned by the SELECT statement can be processed, changed in subsequent UPDATE statements, and committed afterward.
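The effect of the three wait policies (LOCK_WAIT_BLOCK, LOCK_WAIT_SKIP, and LOCK_WAIT_ERROR in the tracer output) can be modeled with a short Python sketch. The names WaitPolicy, LockNotAvailable, and lock_rows are made up for this illustration. Rows are represented by plain integers, and instead of really sleeping, the sketch only reports the rows the transaction would have to wait for.

```python
from enum import Enum


class WaitPolicy(Enum):
    BLOCK = "LOCK_WAIT_BLOCK"   # default: wait for the lock
    SKIP = "LOCK_WAIT_SKIP"     # SKIP LOCKED: ignore locked rows
    ERROR = "LOCK_WAIT_ERROR"   # NOWAIT: fail immediately


class LockNotAvailable(Exception):
    """Hypothetical error type for the NOWAIT case."""


def lock_rows(candidates, locked_by_others, policy):
    """Return (acquired, must_wait) for the candidate rows."""
    acquired, must_wait = [], []
    for row in candidates:
        if row not in locked_by_others:
            acquired.append(row)          # no conflict: lock is granted
        elif policy is WaitPolicy.SKIP:
            continue                      # silently skip the locked row
        elif policy is WaitPolicy.ERROR:
            raise LockNotAvailable(f"row {row} is locked")
        else:
            must_wait.append(row)         # a real backend would sleep here
    return acquired, must_wait


rows = [11, 12, 13]
print(lock_rows(rows, {12}, WaitPolicy.SKIP))
print(lock_rows(rows, {12}, WaitPolicy.BLOCK))
```

With SKIP, the caller gets a result without waiting, but it has to accept that some matching rows are missing from it.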

Lock Results

The possible results of the lock operation are defined in the enum TM_Result. The lock can be granted (TM_Ok), or the lock cannot be granted because the tuple is invisible to the used snapshot (TM_Invisible), was already modified by the same backend process (TM_SelfModified), or was updated (TM_Updated) or deleted (TM_Deleted). In addition, when the lock is instructed not to wait, it can return TM_BeingModified when another transaction currently modifies the tuple, or TM_WouldBlock when the request would block.

pg_row_lock_tracer

pg_row_lock_tracer makes it possible to trace the locking behavior of these row-level locks of a PostgreSQL process in real time using eBPF and UProbes. In addition, statistics about the requested locks and the locking results can be generated.

Download and Usage

The lock tracer can be installed via the Python package installer pip:

pip install pg-lock-tracer

Afterward, the locks of one or more running processes can be traced:

# Trace the row locks of the given PostgreSQL binary
pg_row_lock_tracer -x /home/jan/postgresql-sandbox/bin/REL_14_9_DEBUG/bin/postgres

# Trace the row locks of the PID 1234
pg_row_lock_tracer -p 1234 -x /home/jan/postgresql-sandbox/bin/REL_14_9_DEBUG/bin/postgres

# Trace the row locks of the PID 1234 and 5678
pg_row_lock_tracer -p 1234 -p 5678 -x /home/jan/postgresql-sandbox/bin/REL_14_9_DEBUG/bin/postgres

# Trace the row locks of the PID 1234 and be verbose
pg_row_lock_tracer -p 1234 -x /home/jan/postgresql-sandbox/bin/REL_14_9_DEBUG/bin/postgres -v

# Trace the row locks and show statistics
pg_row_lock_tracer -x /home/jan/postgresql-sandbox/bin/REL_14_9_DEBUG/bin/postgres --statistics

A sample output of the tool looks as follows:

[...]
2783502701862408 [Pid 2604491] LOCK_TUPLE_END TM_OK in 13100 ns
2783502701877081 [Pid 2604491] LOCK_TUPLE (Tablespace 1663 database 305234 relation 313419) - (Block and offset 7 143) - LOCK_TUPLE_EXCLUSIVE LOCK_WAIT_BLOCK
2783502701972367 [Pid 2604491] LOCK_TUPLE_END TM_OK in 95286 ns
2783502701988387 [Pid 2604491] LOCK_TUPLE (Tablespace 1663 database 305234 relation 313419) - (Block and offset 7 144) - LOCK_TUPLE_EXCLUSIVE LOCK_WAIT_BLOCK
2783502702001690 [Pid 2604491] LOCK_TUPLE_END TM_OK in 13303 ns
2783502702016387 [Pid 2604491] LOCK_TUPLE (Tablespace 1663 database 305234 relation 313419) - (Block and offset 7 145) - LOCK_TUPLE_EXCLUSIVE LOCK_WAIT_BLOCK
2783502702029375 [Pid 2604491] LOCK_TUPLE_END TM_OK in 12988 ns

The tool’s output contains the tuples that are being locked and shows the type of lock that is used. Tuples are identified by the block and the offset within a relation (Block and offset 7 145). The output also contains additional options of the lock call, such as LOCK_WAIT_BLOCK. The result of the lock operation (TM_OK) and the time the lock call needed are included as well.
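If the trace output is redirected into a file, it can be processed further with a few lines of Python. The following sketch aggregates the LOCK_TUPLE_END lines per PID and result. The regular expression is derived only from the sample output shown above, so it is an assumption about the line format and not a documented interface of the tool; the function name summarize is made up as well.

```python
import re
from collections import defaultdict

SAMPLE = """\
2783502701877081 [Pid 2604491] LOCK_TUPLE (Tablespace 1663 database 305234 relation 313419) - (Block and offset 7 143) - LOCK_TUPLE_EXCLUSIVE LOCK_WAIT_BLOCK
2783502701972367 [Pid 2604491] LOCK_TUPLE_END TM_OK in 95286 ns
2783502701988387 [Pid 2604491] LOCK_TUPLE (Tablespace 1663 database 305234 relation 313419) - (Block and offset 7 144) - LOCK_TUPLE_EXCLUSIVE LOCK_WAIT_BLOCK
2783502702001690 [Pid 2604491] LOCK_TUPLE_END TM_OK in 13303 ns
"""

# Matches only the lines that report the end of a lock call.
END_LINE = re.compile(r"^\d+ \[Pid (\d+)\] LOCK_TUPLE_END (\w+) in (\d+) ns$")


def summarize(trace: str) -> dict:
    """Count lock results and sum up the latency per (pid, result)."""
    stats = defaultdict(lambda: {"count": 0, "total_ns": 0})
    for line in trace.splitlines():
        match = END_LINE.match(line.strip())
        if match is None:
            continue                      # LOCK_TUPLE lines are ignored here
        pid, result, duration = int(match[1]), match[2], int(match[3])
        entry = stats[(pid, result)]
        entry["count"] += 1
        entry["total_ns"] += duration
    return dict(stats)


for (pid, result), entry in summarize(SAMPLE).items():
    print(pid, result, entry["count"], entry["total_ns"] / entry["count"])
```

For simple counts, the built-in --statistics option of the tool is the more convenient choice; a script like this is mainly useful for custom evaluations such as latency per result.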

When the option --statistics is used, statistics about the traced locks can be collected. The statistics are shown during the termination of the tool (after hitting CTRL+C).

Lock statistics:
================

Used wait policies:
+---------+-----------------+----------------+-----------------+
|   PID   | LOCK_WAIT_BLOCK | LOCK_WAIT_SKIP | LOCK_WAIT_ERROR |
+---------+-----------------+----------------+-----------------+
| 2604491 |       1440      |       0        |        0        |
+---------+-----------------+----------------+-----------------+

Lock modes:
+---------+---------------------+------------------+---------------------------+----------------------+
|   PID   | LOCK_TUPLE_KEYSHARE | LOCK_TUPLE_SHARE | LOCK_TUPLE_NOKEYEXCLUSIVE | LOCK_TUPLE_EXCLUSIVE |
+---------+---------------------+------------------+---------------------------+----------------------+
| 2604491 |          0          |        0         |             0             |         1440         |
+---------+---------------------+------------------+---------------------------+----------------------+

Lock results:
+---------+-------+--------------+-----------------+------------+------------+------------------+---------------+
|   PID   | TM_OK | TM_INVISIBLE | TM_SELFMODIFIED | TM_UPDATED | TM_DELETED | TM_BEINGMODIFIED | TM_WOULDBLOCK |
+---------+-------+--------------+-----------------+------------+------------+------------------+---------------+
| 2604491 |  1440 |      0       |        0        |     0      |     0      |        0         |       0       |
+---------+-------+--------------+-----------------+------------+------------+------------------+---------------+

Summary

pg_row_lock_tracer is a tracer for PostgreSQL row-level locks. The tool is available on GitHub for download. It uses eBPF and UProbes to trace the row lock activity in real-time. Like the related programs (pg_lock_tracer and pg_lw_lock_tracer), this tool is also intended for debugging and analyzing lock behavior and performance problems.

This is the third article that deals with tracing PostgreSQL locks. A description of a lock tracer for heavyweight locks can be found in the first part of this article series about locks. Tracing LW locks is discussed in the second part of the series about lock tracing in PostgreSQL.

Index the PostgreSQL Source Code with Elixir

2024-01-11T00:00:00+00:00

While working with the internals of PostgreSQL, it is helpful to be able to navigate around the source code quickly and look up symbols and definitions fast. I use Visual Studio Code for programming. However, finding definitions does not always work reliably, and the full-text search is slow and often returns many results but not the desired hit (e.g., the definition of a function). For a long time, I had the Doxygen build of PostgreSQL open in my browser. However, Doxygen is sometimes cumbersome to use and it only shows the current version of PostgreSQL. Sometimes, the source code for an older version is needed. To solve these problems, I set up a local copy of the Elixir Cross Referencer.

The Elixir Cross Referencer is a source code indexer that provides a web interface and an API to quickly look up symbols. I used it several times when I navigated through the Linux source code, and I was wondering what needs to be done to set up a local installation for PostgreSQL.

To my surprise, this is easier than expected. Elixir can be installed using Docker, and custom images for new projects can be created easily. For instance, to create a new Docker image which contains a copy of the PostgreSQL source code, the following commands have to be executed:

$ git clone https://github.com/bootlin/elixir.git

$ cd elixir

$ docker build -t elixir:postgresql-11-01-2024 --build-arg GIT_REPO_URL=https://github.com/postgres/postgres.git --build-arg PROJECT=postgresql . -f docker/debian/Dockerfile

The last command builds a new Docker image called elixir:postgresql-11-01-2024. This command takes some time to complete. The two build-arg parameters are enough to clone and index the PostgreSQL repository. After the image is created, it should be shown as an available image of the local Docker installation.

$ docker images

REPOSITORY                                                       TAG                     IMAGE ID       CREATED        SIZE
elixir                                                           postgresql-11-01-2024   fb993f66c1cc   2 hours ago    2.38GB

Afterward, a new container with the image can be started. I use the parameter -p 8081:80 to make port 80 of the container available as port 8081 of my local system.

$ docker run -d -p 8081:80 elixir:postgresql-11-01-2024

After the container is started, the PostgreSQL source code can be accessed by opening the URL http://172.17.0.2:8081/postgresql/latest/source.

If you want to modify the header of the Elixir installation, you can modify the file templates/header.html before building the Docker image. More information about customizing the image can be found in the documentation of the project.

Using Bpftrace to Trace PostgreSQL Vacuum Operations

2023-08-23T00:00:00+00:00

The eBPF technology of the Linux kernel makes it possible to monitor applications with minimal overhead. UProbes can be used to trace the invocation and exit of functions in programs. Modern tools to observe databases (like pg-lock-tracer) are built on top of eBPF. However, these fully fledged tools are often written in C and Python and require some development effort. Sometimes, a ‘quick and dirty’ solution for a particular observation would be sufficient. With bpftrace, users can create eBPF programs with a few lines of code. In this article, we develop a simple bpftrace program to observe the execution of vacuum calls in PostgreSQL and analyze the delay.

⚠️ An updated and slightly revised version of this post is available in the Timescale company blog.

Used Environment

PostgreSQL is a database management system that uses vacuum operations to reclaim space from dead (e.g., updated or deleted) tuples. In this post, we will trace the vacuum calls and determine the needed time for the vacuum operations per table.

In the following examples, a PostgreSQL 14 server is used. The PostgreSQL binary is located at /home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres. In addition, the examples are executed in a database with these two tables:

CREATE TABLE testtable1 (
   id int NOT NULL,
   value int NOT NULL
);

CREATE TABLE testtable2 (
   id int NOT NULL,
   value int NOT NULL
);

Note: Depending on the used C compiler and the applied optimizations, the symbols of internal functions (i.e., functions declared as static) might not be visible. In this case, uprobes cannot be used to trace the function invocations. To address this issue, there are two possible solutions: (1) remove the static modifier from the function declaration and recompile PostgreSQL, or (2) create a full debug build of PostgreSQL.

Using funclatency-bpfcc to Trace Function Calls

Let’s explore the solutions that already exist before developing our own tool to trace the vacuum operations. The tool funclatency-bpfcc is available for most Linux distributions (on Debian, it is contained in the package bpfcc-tools). It traces the entry and exit of a function and measures the function latency (i.e., the time the function needs to complete).

In PostgreSQL, the function vacuum_rel is invoked when a vacuum operation on a relation is performed. To trace these function calls with funclatency-bpfcc, the path of the PostgreSQL binary and the function name have to be provided. For instance:

$ sudo funclatency-bpfcc -r /home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel

Tracing 1 functions for "/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel"... Hit Ctrl-C to end.

Afterward, an eBPF program is loaded into the Linux kernel, one uprobe is defined on the function entry, and one uprobe is defined on the function exit. The latency between these two events is measured and stored.

To execute some vacuum operations, we perform the following SQL statement in a second session:

database=# VACUUM FULL;
VACUUM FULL

This SQL statement triggers PostgreSQL to perform a vacuum operation on all tables of the currently open database. After the vacuum operations are done, the funclatency-bpfcc program can be stopped (by pressing CTRL+C). This ends the observation of the binary and shows the recorded execution times on the terminal.

$ sudo funclatency-bpfcc -r /home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
[...]
^C
Function = b'vacuum_rel' [876997]
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 0        |                                        |
   2097152 -> 4194303    : 0        |                                        |
   4194304 -> 8388607    : 2        |*                                       |
   8388608 -> 16777215   : 13       |***********                             |
  16777216 -> 33554431   : 44       |****************************************|
  33554432 -> 67108863   : 7        |******                                  |
  67108864 -> 134217727  : 1        |                                        |

avg = 22765358 nsecs, total: 1525279002 nsecs, count: 67

Detaching...

The output contains the information that the function vacuum_rel was called 67 times and that the average function time is 22765358 nsecs. In addition, a histogram of the function latency is printed. This is useful, but it would be even more helpful to know how much time the vacuum call for each individual relation needs. This is not supported by the tool because it does not evaluate the parameters of the function (e.g., the OID of the relation that the current function invocation should vacuum). However, this is something that we can do with bpftrace.
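The bucket boundaries of the histogram are powers of two, and the average follows directly from the printed total and count. The following Python sketch reproduces both for the output above. The function name bucket_bounds is made up for this example, and the sketch only mirrors the layout of the printed histogram; it is not the code of funclatency-bpfcc.

```python
def bucket_bounds(nsecs: int):
    """Return the (low, high) bounds of the power-of-two bucket of a value."""
    if nsecs <= 1:
        return (0, 1)                       # first bucket of the histogram
    low = 1 << (nsecs.bit_length() - 1)     # largest power of two <= nsecs
    return (low, 2 * low - 1)


total_nsecs = 1525279002
count = 67

average = total_nsecs // count              # integer division, as in the output
print(average)
print(bucket_bounds(average))
```

The average of 22765358 nsecs falls into the bucket 16777216 -> 33554431, which is also the bucket with the most entries (44 of the 67 calls).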

Tracing Function Entries

Let’s start with a very simple bpftrace program that prints a line once the vacuum_rel function is invoked in the PostgreSQL binary. bpftrace is called with the eBPF program that should be loaded into the Linux kernel. The eBPF programs that are passed to bpftrace have the following syntax:

<probe1> {
        <actions>
}

[...]

<probeN> {
        <actions>
}

The syntax to define a uprobe on a userland binary is: uprobe:library_name:function_name[+offset]. For instance, to define a uprobe on the invocation of the function vacuum_rel in the binary /home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres and print the line Vacuum started, the following bpftrace call can be used:

$ sudo bpftrace -e '
uprobe:/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel {
    printf("Vacuum started\n");
}
'

Attaching 1 probe...
Vacuum started
Vacuum started
Vacuum started
Vacuum started
Vacuum started
Vacuum started
Vacuum started
Vacuum started
Vacuum started
[...]

As soon as the VACUUM FULL SQL statement in PostgreSQL is executed in another terminal session, the program starts to print the message on the screen. This is a good start, but we still have less information available than output by the existing tool funclatency-bpfcc. The latency of the function calls is missing.

Tracing Function Returns / Latency

To measure the latency of the function invocations, we need two things:

We need to define a second probe that is invoked when the observed function returns; this can be done with a uretprobe.
The time between the function invocation and the return has to be measured.

A uretprobe in bpftrace can be defined using the same syntax (uretprobe:binary:function) as the already defined uprobe. In addition, bpftrace supports variables like associative arrays. We use such an array to capture the start time of a function invocation: @start[tid] = nsecs;. The key of the array is the ID of the current thread (tid). So, multiple threads (and processes, as in our case with PostgreSQL) can be traced simultaneously without overwriting the start time of the last function invocation.

In the uretprobe, we take the current time and subtract the time of the function invocation (nsecs - @start[tid]) to get the time the function call needs. In addition, we use a predicate (/@start[tid]/) to let bpftrace know that we only want to execute the body of the uretprobe if this array value is defined. Using this predicate, we prevent handling a function return without having seen the function entry before (e.g., we start the bpftrace program in the middle of a running function call, and we get only the uretprobe invocation for this function call).

Note: It is not guaranteed that the eBPF events are delivered and processed in order by bpftrace. Especially when a function call is short and we have a lot of function invocations, the events could be processed out of order (e.g., we see two function entry events followed by two function return events). In this case, function latency observations with bpftrace become imprecise. To avoid this, we use VACUUM FULL calls instead of plain VACUUM calls. These calls are much more expensive since they rewrite the table. Therefore, they take longer and can be reliably observed by bpftrace.

$ sudo bpftrace -e '
uprobe:/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
{
        printf("Performing vacuum\n");
        @start[tid] = nsecs;
}

uretprobe:/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
/@start[tid]/
{
        printf("Vacuum call took %d ns\n", nsecs - @start[tid]);
        delete(@start[tid]);
}
'

After running this bpftrace call and executing VACUUM FULL in a second session, we see the following output:

Attaching 2 probes...
Performing vacuum
Vacuum call took 37486735 ns
Performing vacuum
Vacuum call took 16491130 ns
Performing vacuum
Vacuum call took 32443568 ns
Performing vacuum
Vacuum call took 17959933 ns
[...]
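The pairing of the two probes can also be modelled in plain Python. The following is only a sketch of the logic, not of eBPF itself: the dictionary start plays the role of the @start map, and the function names on_enter and on_return as well as the event timestamps are made up for this illustration.

```python
# Plain-Python model of the uprobe/uretprobe pairing (no eBPF involved).
start = {}  # plays the role of the bpftrace map @start, keyed by thread id


def on_enter(tid, now_ns):
    """Model of the uprobe body: remember when the call started."""
    start[tid] = now_ns


def on_return(tid, now_ns):
    """Model of the uretprobe body, including the /@start[tid]/ predicate."""
    if tid not in start:        # return seen without a matching enter
        return None
    latency = now_ns - start[tid]
    del start[tid]              # corresponds to delete(@start[tid])
    return latency


# Hypothetical event stream: thread 7 returns without a recorded enter,
# threads 1 and 2 overlap in time.
print(on_return(7, 50))    # None
on_enter(1, 100)
on_enter(2, 120)
print(on_return(1, 400))   # 300
print(on_return(2, 500))   # 380
```

Because the start time is stored per thread ID, the overlapping calls of threads 1 and 2 do not disturb each other, and the return of thread 7 is ignored.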

For each call of the vacuum_rel function in PostgreSQL, we measure the time the vacuum operation needs. However, it would be convenient if we could also trace the OID or the name of the relation that is vacuumed by the current vacuum operation. This requires handling the parameters of the observed function.

Handle Function Parameters

The function vacuum_rel has the following signature in PostgreSQL 14. The first parameter is the Oid (an unsigned int) of the processed relation. The second parameter is a RangeVar struct, which could contain the name of the relation. The third parameter is a VacuumParams struct, which contains additional parameters for the vacuum operation, and the last parameter is a BufferAccessStrategy, which defines the access strategy of the used buffer.

static bool vacuum_rel(Oid relid,
        RangeVar *relation,
        VacuumParams *params,
        BufferAccessStrategy bstrategy 
)

bpftrace provides access to the function parameters via the keywords arg0, arg1, …, argN. To include the Oid in our logging output, we only need to print the first parameter of the function.

$ sudo bpftrace -e '

uprobe:/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
{
        printf("Performing vacuum of OID %d\n", arg0);
        @start[tid] = nsecs;
}

uretprobe:/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
/@start[tid]/
{
        printf("Vacuum call took %d ns\n", nsecs - @start[tid]);
        delete(@start[tid]);
}
'

When the VACUUM FULL operation is executed again in a second terminal, the output looks as follows:

Attaching 2 probes...
[...]
Performing vacuum of OID 1153888
Vacuum call took 37486734 ns
Performing vacuum of OID 1153891
Vacuum call took 49535256 ns
Performing vacuum of OID 2619
Vacuum call took 39575635 ns
Performing vacuum of OID 2840
Vacuum call took 40683526 ns
Performing vacuum of OID 1247
Vacuum call took 14683600 ns
Performing vacuum of OID 4171
Vacuum call took 20587503 ns

To determine which Oid belongs to which relation, the following SQL statement can be executed:

blog=# SELECT oid, relname FROM pg_class WHERE oid IN (1153888, 1153891);
   oid   |  relname   
---------+------------
 1153888 | testtable1
 1153891 | testtable2
(2 rows)

The result shows that the Oids 1153888 and 1153891 belong to the tables testtable1 and testtable2, which we have created in one of the first sections of this article. These values belong to our test environment. In your environment, different Oids might be shown.
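The lookup above can be prepared automatically from the bpftrace output. The following Python sketch pairs each "Performing vacuum" line with the latency line that follows it and builds the pg_class query. The function names parse_vacuum_output and build_lookup_query are made up for this example, and the pairing assumes that the lines of a single backend are not interleaved with those of other processes.

```python
import re

# Sample lines copied from the bpftrace output shown above.
output = """\
Performing vacuum of OID 1153888
Vacuum call took 37486734 ns
Performing vacuum of OID 1153891
Vacuum call took 49535256 ns
"""


def parse_vacuum_output(text):
    """Pair each 'Performing vacuum' line with the following latency line."""
    results = []
    current_oid = None
    for line in text.splitlines():
        if m := re.match(r"Performing vacuum of OID (\d+)$", line):
            current_oid = int(m.group(1))
        elif (m := re.match(r"Vacuum call took (\d+) ns$", line)) and current_oid is not None:
            results.append((current_oid, int(m.group(1))))
            current_oid = None
    return results


def build_lookup_query(oids):
    """Build the pg_class lookup for a list of integer Oids."""
    oid_list = ", ".join(str(int(oid)) for oid in oids)
    return f"SELECT oid, relname FROM pg_class WHERE oid IN ({oid_list});"


pairs = parse_vacuum_output(output)
print(pairs)
print(build_lookup_query(oid for oid, _ in pairs))
```

The generated statement has the same shape as the query executed manually above.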

Handle Function Struct Parameters

So far, we have processed simple parameters with bpftrace (like Oids, which are unsigned integers). However, many parameters in PostgreSQL are structs. Fortunately, these structs can be handled in bpftrace programs as well.

The second parameter of the vacuum_rel function is a RangeVar struct. This struct is defined in PostgreSQL 14 as follows:

typedef struct RangeVar
{
	NodeTag	type;
	char *catalogname;
	char *schemaname;
	char *relname;
	[...]
}

To process the struct, the following bpftrace program can be used. Please note that the internal NodeTag data type of PostgreSQL is replaced by a simple int. NodeTag is an enum, and enums are backed by an integer type in C. To handle this enum correctly, we could either (1) copy the enum definition into the bpftrace program or (2) replace it with a data type of the same size. To keep the bpftrace program simple, the second option is used here. The next three struct members are char pointers, which contain the catalog name, the schema name, and the name of the relation. The schemaname and relname members are the fields we are interested in. The struct contains more members, but they are omitted to keep the example clear. Omitting them is safe because the offsets of the leading members are not affected by members that follow them.

$ sudo bpftrace -e '
struct RangeVar
{
	int type;
	char *catalogname;
	char *schemaname;
	char *relname;
};

uprobe:/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
{
        printf("[PID %d] Performing vacuum of OID %d (%s.%s)\n", pid, arg0, str(((struct RangeVar*) arg1)->schemaname), str(((struct RangeVar*) arg1)->relname));
        @start[tid] = nsecs;
}

uretprobe:/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
/@start[tid]/
{
        printf("[PID %d] Vacuum call took %d ns\n", pid, nsecs - @start[tid]);
        delete(@start[tid]);
}
'

After the struct is defined, the members of the struct can be accessed as in a regular C program. For example: ((struct RangeVar*) arg1)->schemaname. In addition, we also print the process ID (PID) of the process that has triggered the uprobe. This makes it possible to identify the process that has performed the vacuum operation.
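Why an int can stand in for the NodeTag enum becomes visible when the layout of the shortened struct is inspected. The following Python sketch mirrors the struct with the standard ctypes module and prints the offset of each member. It only illustrates the layout rules of the platform the script runs on; it does not read anything from PostgreSQL.

```python
import ctypes


class RangeVar(ctypes.Structure):
    """Mirror of the shortened struct from the bpftrace program."""
    _fields_ = [
        ("type", ctypes.c_int),           # stands in for the NodeTag enum
        ("catalogname", ctypes.c_char_p),
        ("schemaname", ctypes.c_char_p),
        ("relname", ctypes.c_char_p),
    ]


for name, _ in RangeVar._fields_:
    field = getattr(RangeVar, name)
    print(f"{name:12} offset={field.offset:2} size={field.size}")
```

On a typical 64-bit Linux system, the pointer members are expected at the offsets 8, 16, and 24, because the compiler pads the 4-byte type member to the alignment of a pointer. As long as the replacement type has the same size as the enum, the offsets of schemaname and relname stay the same.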

When running the following SQL statements in a second terminal:

VACUUM FULL public.testtable1;
VACUUM FULL public.testtable2;

The bpftrace program shows the following output:

Attaching 2 probes...
[PID 616516] Performing vacuum of OID 1153888 (public.testtable1)
[PID 616516] Vacuum call took 23683600 ns
[PID 616516] Performing vacuum of OID 1153891 (public.testtable2)
[PID 616516] Vacuum call took 24240837 ns

The table names are extracted from the RangeVar data structure and shown in the output. However, this data structure is not always populated by PostgreSQL. The data structure might be empty when running VACUUM FULL without specifying a table name. Therefore, we use two single invocations with explicit table names to force PostgreSQL to populate this data structure.

Optimizing the Bpftrace Program Using Maps

The bpftrace programs we have developed so far use one or more printf statements directly. A printf call is slow and reduces the throughput the bpftrace program can monitor.

This can be optimized by storing the data in maps that are printed when bpftrace is stopped. To do this, we use three maps: @start, @oid, and @vacuum. The first two maps are populated in the uprobe of the vacuum_rel function. The map @start contains the time when the probe is triggered, and the map @oid contains the Oid passed as the first function parameter.

When the function returns and the uretprobe is triggered, the @vacuum map is populated. The key is the Oid, and the value is the time needed to perform the vacuum operation. In addition, the entries for the current thread are removed from the first two maps.

When bpftrace exits (e.g., after pressing CTRL+C), all populated maps are printed automatically. By using these three maps, we have separated the actual monitoring from the output; the expensive printing happens after the monitoring is done.

In addition, the following program uses the two special probes BEGIN and END, which are triggered by bpftrace when the observation begins and ends.

$ sudo bpftrace -e '

uprobe:/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
{
        @start[tid] = nsecs;
        @oid[tid] = arg0;
}

uretprobe:/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
/@start[tid]/
{

        @vacuum[@oid[tid]] = nsecs - @start[tid];
        delete(@start[tid]);
        delete(@oid[tid]);

}

BEGIN
{
        printf("VACUUM calls are traced, press CTRL+C to stop tracing\n");
}

END
{
        printf("\n\nNeeded time in ns to perform VACUUM FULL per Oid\n");
}
'

After bpftrace is started, the first message is printed. After the program is stopped, the second message is printed, followed by the content of the @vacuum map. For each Oid, the time needed for the vacuum operation is shown.

VACUUM calls are traced, press CTRL+C to stop tracing
^C

Needed time in ns to perform VACUUM FULL per Oid

@vacuum[1153888]: 7526823
@vacuum[1153891]: 8462672
@vacuum[2613]: 10764797
@vacuum[2995]: 11429589
@vacuum[6102]: 11436539
@vacuum[12801]: 14373934
@vacuum[6106]: 14396012
@vacuum[3118]: 14507167
@vacuum[3596]: 14695385
@vacuum[12811]: 14871237
@vacuum[3429]: 15106778
@vacuum[3350]: 15158742
@vacuum[2611]: 15432053
@vacuum[3764]: 15534169
@vacuum[2601]: 16055863
@vacuum[3602]: 16128624
@vacuum[2605]: 16405419
@vacuum[2616]: 16914195
@vacuum[3576]: 17003920
[...]
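The map dump is easy to post-process. The following Python sketch parses lines in the format shown above and prints the timings in milliseconds, slowest relation first. The function name parse_vacuum_map is made up for this example. Since the bpftrace program assigns @vacuum[oid] on every return, the sketch assumes one value per Oid (the last vacuum call of that relation).

```python
import re

# A few lines copied from the map dump shown above.
dump = """\
@vacuum[1153888]: 7526823
@vacuum[1153891]: 8462672
@vacuum[2613]: 10764797
"""


def parse_vacuum_map(text):
    """Return {oid: nanoseconds} for every '@vacuum[oid]: value' line."""
    pattern = re.compile(r"^@vacuum\[(\d+)\]: (\d+)$")
    result = {}
    for line in text.splitlines():
        m = pattern.match(line.strip())
        if m:
            result[int(m.group(1))] = int(m.group(2))
    return result


timings = parse_vacuum_map(dump)
for oid, ns in sorted(timings.items(), key=lambda item: item[1], reverse=True):
    print(f"OID {oid}: {ns / 1_000_000:.2f} ms")
```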

Conclusion

This article provides a brief overview of eBPF. To trace the function latency of PostgreSQL vacuum calls, we used the tool funclatency-bpfcc. Additionally, we utilized bpftrace to create a tool that allows for more in-depth observation of the calls. Our bpftrace script also takes into account the parameters of the PostgreSQL vacuum_rel function, enabling us to monitor the vacuum time per relation.

GDB Pretty Print Extension for PostgreSQL Bitmapsets

2023-04-09T00:00:00+00:00

To store sets of integer values efficiently, PostgreSQL internally uses a data structure called Bitmapset. A wide range of operations is supported on the Bitmapset.

This data structure is widely used in the PostgreSQL code. Internally, so-called words of bits store the information about which elements are part of the set. For instance, this data structure supports efficient tests of whether an integer is part of the set (using the bms_is_member function), adding new values (using the bms_add_member, bms_add_members, or bms_add_range functions), and iterating over the values (using the bms_next_member and bms_prev_member functions).
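The idea behind the data structure can be illustrated with a small Python model. This is only a sketch and not the PostgreSQL implementation: the class ToyBitmapset and the constant BITS_PER_WORD are made up for this example, and only the general behavior of the bms_* functions named above is imitated (for example, that the iteration ends with a negative return value).

```python
BITS_PER_WORD = 64  # assumption for this sketch; PostgreSQL derives it from the platform


class ToyBitmapset:
    """Simplified model: a list of integer words, one bit per member."""

    def __init__(self):
        self.words = []

    def add_member(self, x):
        word_no, bit_no = divmod(x, BITS_PER_WORD)
        while len(self.words) <= word_no:
            self.words.append(0)
        self.words[word_no] |= 1 << bit_no

    def is_member(self, x):
        word_no, bit_no = divmod(x, BITS_PER_WORD)
        return word_no < len(self.words) and bool((self.words[word_no] >> bit_no) & 1)

    def next_member(self, prev):
        """Smallest member greater than prev, or -2 if there is none."""
        for x in range(prev + 1, len(self.words) * BITS_PER_WORD):
            if self.is_member(x):
                return x
        return -2


s = ToyBitmapset()
for value in (3, 70):
    s.add_member(value)
print(s.is_member(3), s.is_member(4))

# Iterate over all members, starting with -1 as the previous member.
member = -1
while (member := s.next_member(member)) >= 0:
    print(member)
```

The value 70 does not fit into the first word, so the model allocates a second word and sets bit 6 in it (70 = 1 * 64 + 6).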

Dumping the Content of the Bitmapset

However, the content of this data structure is difficult to debug. The debugger does not show the stored content due to the lack of knowledge about the semantics of the bits. A lot of internal PostgreSQL data structures can be dumped using the pprint function. Unfortunately, the pprint function is unable to print the content of the Bitmapset.

For instance, when GDB is asked to print the content of the set, the output looks as follows:

(gdb) print *node_state->unused_batch_states
$1 = {nwords = 1, words = 0x5588689773f0}

The output indicates that one word is used to represent the stored values (a word is 32 or 64 bits wide, depending on the platform and the PostgreSQL version). Unfortunately, the output does not reveal which values are stored.

A patch was discussed on the PostgreSQL developer mailing list to introduce a function called bmsToString. This function can also be used to display the content of a Bitmapset. However, this function can only be called while PostgreSQL is running. When a core dump of a crashed PostgreSQL process is examined with GDB, the function cannot be used.

(gdb) call bmsToString(chunk_state->unused_batch_states)
$6 = 0x5588689a8818 "(b 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15)"

Because the Bitmapset data structure is used heavily inside PostgreSQL and the database server has no reliable way to print the content during debugging, I have developed a GDB extension to solve this problem. This article presents the extension, which makes the content displayable in the debugger.

A GDB extension to show the content of the Bitmapset

The debugger GDB can be extended using Python scripts. The Pretty Printing API can be used to develop pretty printers that analyze data structures and improve the output of the debugger when they are displayed.

The following Python script shows such an extension. It registers a new collection of pretty printers using the RegexpCollectionPrettyPrinter class. These printers are called when a value of the type Bitmapset or Relids is printed by GDB. The printer decodes the words of the Bitmapset into decimal values, adds these values to a list, and converts this list into a string.

from gdb.printing import PrettyPrinter, register_pretty_printer
import gdb

class BitmapsetPrettyPrinter(object):
    def __init__(self, val):
        self.val = val

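    def to_string(self):
        values = []

        if self.val is None or self.val.type is None:
            return "0x0"

        try:
            # Convert to a plain Python int so that range() and the bit
            # operations below do not depend on gdb.Value semantics.
            words = int(self.val["nwords"])
            # Derive the word width from the element type of the words
            # array instead of hard-coding it; bitmapword is 32 or 64 bits
            # wide depending on the platform and the PostgreSQL version.
            bits_per_word = self.val["words"].type.target().sizeof * 8
        except Exception:
            return 'is not iterable'

        for word_no in range(words):
            word = int(self.val["words"][word_no])
            for bit in range(bits_per_word):
                if word & (1 << bit):
                    values.append(word_no * bits_per_word + bit)

        return f"PGBitmapset ({str(values)})"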

    def display_hint(self):
        return 'PGBitmapset'

def build_pretty_printer():
    pp = gdb.printing.RegexpCollectionPrettyPrinter("PostgreSQLPrettyPrinter")
    pp.add_printer('Bitmapset', '^Bitmapset$', BitmapsetPrettyPrinter)
    pp.add_printer('Relids', '^Relids$', BitmapsetPrettyPrinter)
    return pp

register_pretty_printer(None, build_pretty_printer(), replace=True)
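The decoding loop of the printer can be exercised outside of GDB. The following sketch extracts the loop into a standalone function; the name decode_words is made up for this example, and the words are passed as plain Python integers instead of GDB values.

```python
def decode_words(words, bits_per_word):
    """Return the members encoded in a list of bitmap words."""
    members = []
    for word_no, word in enumerate(words):
        for bit in range(bits_per_word):
            if word & (1 << bit):
                members.append(word_no * bits_per_word + bit)
    return members


# One word with the lowest 16 bits set.
print(decode_words([0xFFFF], 32))
# With 32-bit words, bit 0 of the second word represents the member 32.
print(decode_words([0b101, 0b1], 32))
```

The second call shows why the word width matters: the same words decoded with a different width yield different members for all words after the first one.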

Registering the Pretty Printer

This Python script can be stored in a new file and loaded via the source command into GDB.

(gdb) source /home/jan/dev/postgresql_printer.py

After the file is loaded, the two pretty printers are registered. By using the command info pretty-printer, GDB shows which pretty printers are registered. After loading the two new printers, the output looks as follows:

(gdb) info pretty-printer
global pretty-printers:
  PostgreSQLPrettyPrinter
    Bitmapset
    Relids
  builtin
    mpx_bound128
[...]

When the content of the variable unused_batch_states is now printed in GDB, it looks as follows.

(gdb) print *node_state->unused_batch_states
$3 = PGBitmapset ([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])

The output now clearly shows which integer values are part of the bitmap set. This is similar to the output of the bmsToString function shown above. The main difference is that the GDB extension also works when coredump files are analyzed and PostgreSQL is not running.

The pretty printer has to be loaded via the source command every time GDB is restarted. This is cumbersome. To ease the work with this extension, the command can be added to the ~/.gdbinit file. The commands of this file are automatically executed every time GDB is invoked.

cat ~/.gdbinit 
source /home/jan/dev/postgresql_printer.py

Trace PostgreSQL LWLocks with pg_lw_lock_tracer

2023-01-17T00:00:00+00:00

The Database Management System PostgreSQL uses lightweight locks (LWLocks) to control access to shared memory data structures. In this article, the tool pg_lw_lock_tracer is presented that allows tracing these kinds of locks. The tool can be downloaded from the website of the project.

This is the second article that deals with tracing PostgreSQL locks. The first article deals with the tracing of heavyweight locks and can be found here.

Goal of the Tool

pg_lw_lock_tracer is a tracer for lightweight locks. It can attach to a running PostgreSQL process and trace the lock and unlock events of lightweight locks. An LWLock can be taken in shared (LW_SHARED) or exclusive (LW_EXCLUSIVE) mode. In addition, PostgreSQL implements a special LW_WAIT_UNTIL_FREE mode to wait until an LWLock becomes free. pg_lw_lock_tracer also gathers statistics about the acquired locks and wait times.

Trace Points

The LWLock events are traced by pg_lw_lock_tracer in real-time. The tool uses Userland Statically Defined Tracing (USDT) to trace these events. These are static trace points that are defined in the source code of PostgreSQL. To enable this functionality, PostgreSQL has to be compiled with --enable-dtrace.

To check if a PostgreSQL binary was compiled with active trace points, the program bpftrace can be used. It can list all USDT trace points that are defined in a binary. For example, the following command can be used to list all trace points of the binary /home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres.

sudo bpftrace -l "usdt:/home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres:*"

If the command returns output like the following, the PostgreSQL binary was compiled with enabled trace points:

[...]
usdt:/home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres:postgresql:clog__checkpoint__start
usdt:/home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres:postgresql:clog__checkpoint__done
usdt:/home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres:postgresql:multixact__checkpoint__start
usdt:/home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres:postgresql:multixact__checkpoint__done
usdt:/home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres:postgresql:subtrans__checkpoint__start
usdt:/home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres:postgresql:subtrans__checkpoint__done
usdt:/home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres:postgresql:twophase__checkpoint__start
usdt:/home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres:postgresql:twophase__checkpoint__done
[...]

If it returns an empty output, no trace points are defined in the binary and PostgreSQL needs to be re-compiled with --enable-dtrace to use pg_lw_lock_tracer.
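The check can be scripted by filtering the listing for LWLock-related trace points. The following Python sketch works on the text of the listing. The function name lwlock_probes is made up for this example, and the lwlock__acquire line in the sample is an illustrative addition that is not part of the truncated listing shown above.

```python
def lwlock_probes(listing):
    """Return the probe names from `bpftrace -l` output that concern LWLocks."""
    probes = []
    for line in listing.splitlines():
        line = line.strip()
        if not line.startswith("usdt:"):
            continue
        # The probe name is the part after the last colon.
        name = line.rsplit(":", 1)[-1]
        if name.startswith("lwlock__"):
            probes.append(name)
    return probes


# Sample listing; the second line is an assumption added for illustration.
sample = """\
usdt:/home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres:postgresql:clog__checkpoint__start
usdt:/home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres:postgresql:lwlock__acquire
"""

found = lwlock_probes(sample)
print(found if found else "no LWLock trace points found")
```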

Download and Usage

The lock tracer can be installed via the Python package installer pip:

pip install pg-lock-tracer

Afterward, the locks of one or more running processes can be traced:

# Trace the LW locks of the PID 1234
pg_lw_lock_tracer -p 1234

# Trace the LW locks of the PIDs 1234 and 5678
pg_lw_lock_tracer -p 1234 -p 5678

# Trace the LW locks of the PID 1234 and be verbose
pg_lw_lock_tracer -p 1234 -v

# Trace the LW locks of the PID 1234 and collect statistics
pg_lw_lock_tracer -p 1234 -v --statistics

A sample output looks as follows:

===> Ready to trace
2904552881615298 [Pid 1704367] Acquired lock LockFastPath (mode LW_EXCLUSIVE) / LWLockAcquire()
2904552881673849 [Pid 1704367] Unlock LockFastPath
2904552881782910 [Pid 1704367] Acquired lock ProcArray (mode LW_SHARED) / LWLockAcquire()
2904552881803614 [Pid 1704367] Unlock ProcArray
2904552881865272 [Pid 1704367] Acquired lock LockFastPath (mode LW_EXCLUSIVE) / LWLockAcquire()
2904552881883641 [Pid 1704367] Unlock LockFastPath
2904552882095131 [Pid 1704367] Acquired lock ProcArray (mode LW_SHARED) / LWLockAcquire()
2904552882114171 [Pid 1704367] Unlock ProcArray
2904552882225372 [Pid 1704367] Acquired lock XidGen (mode LW_EXCLUSIVE) / LWLockAcquire()
2904552882246673 [Pid 1704367] Unlock XidGen
2904552882270279 [Pid 1704367] Acquired lock LockManager (mode LW_EXCLUSIVE) / LWLockAcquire()
2904552882296782 [Pid 1704367] Unlock LockManager
2904552882335466 [Pid 1704367] Acquired lock BufferMapping (mode LW_SHARED) / LWLockAcquire()
2904552882358198 [Pid 1704367] Unlock BufferMapping
2904552882379951 [Pid 1704367] Acquired lock BufferContent (mode LW_EXCLUSIVE) / LWLockAcquire()
2904552882415333 [Pid 1704367] Acquired lock WALInsert (mode LW_EXCLUSIVE) / LWLockAcquire()
2904552882485459 [Pid 1704367] Unlock WALInsert
2904552882506167 [Pid 1704367] Unlock BufferContent
2904552882590752 [Pid 1704367] Acquired lock WALInsert (mode LW_EXCLUSIVE) / LWLockAcquire()
2904552882611656 [Pid 1704367] Unlock WALInsert
2904552882638194 [Pid 1704367] Wait for WALWrite
2904554401202251 [Pid 1704367] Wait for WALWrite lock took 1518564057 ns
[...]
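The trace output can be analyzed further with a few lines of Python. The following sketch pairs each "Acquired lock" event with the matching "Unlock" event and computes how long the lock was held. It assumes that the leading number of each line is a timestamp in nanoseconds (which is consistent with the reported wait time in the last two lines above) and that a process holds at most one lock per tranche at a time. The function name hold_times is made up for this example.

```python
import re

# Lines copied from the sample output shown above.
trace = """\
2904552881615298 [Pid 1704367] Acquired lock LockFastPath (mode LW_EXCLUSIVE) / LWLockAcquire()
2904552881673849 [Pid 1704367] Unlock LockFastPath
2904552882379951 [Pid 1704367] Acquired lock BufferContent (mode LW_EXCLUSIVE) / LWLockAcquire()
2904552882415333 [Pid 1704367] Acquired lock WALInsert (mode LW_EXCLUSIVE) / LWLockAcquire()
2904552882485459 [Pid 1704367] Unlock WALInsert
2904552882506167 [Pid 1704367] Unlock BufferContent
"""

ACQUIRED = re.compile(r"^(\d+) \[Pid (\d+)\] Acquired lock (\w+) ")
UNLOCK = re.compile(r"^(\d+) \[Pid (\d+)\] Unlock (\w+)$")


def hold_times(text):
    """Return (tranche, held_ns) tuples for every acquire/unlock pair."""
    acquired_at = {}  # (pid, tranche) -> timestamp of the acquire event
    result = []
    for line in text.splitlines():
        if m := ACQUIRED.match(line):
            acquired_at[(m.group(2), m.group(3))] = int(m.group(1))
        elif m := UNLOCK.match(line):
            started = acquired_at.pop((m.group(2), m.group(3)), None)
            if started is not None:
                result.append((m.group(3), int(m.group(1)) - started))
    return result


for tranche, held_ns in hold_times(trace):
    print(f"{tranche}: held for {held_ns} ns")
```

For the sample lines, the LockFastPath lock was held for 58551 ns, and the nested WALInsert lock was released before the surrounding BufferContent lock.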

When the option --statistics is used, statistics about the traced locks can be collected. The statistics are shown during the termination of the tool (after hitting CTRL+c).

A tranche is the identifier of the resource that is protected by the lock. LWLocks can be acquired using different functions in PostgreSQL:

The function LWLockAcquire(...) (link) is the most commonly used function to acquire LWLocks. If the lock can be granted, it is granted and the function returns. Otherwise, the function waits until the lock is available, acquires it, and returns.
The function LWLockConditionalAcquire(...) (link) also tries to acquire the lock. If it is not directly available, it just returns false.
The function LWLockAcquireOrWait(...) (link) tries to acquire the lock. If it is not directly available, it waits until the lock is available but does not acquire the lock.

From the PostgreSQL source code (link):

The semantics of this function are a bit funky. If the lock is currently free, it is acquired in the given mode, and the function returns true. If the lock isn’t immediately free, the function waits until it is released and returns false, but does not acquire the lock.
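The difference between the three functions can be illustrated with a toy lock in Python. This is only a model of the semantics described above and not the way PostgreSQL implements LWLocks: the class ToyLWLock is made up for this example, it supports only an exclusive mode, and it does not track which thread holds the lock.

```python
import threading


class ToyLWLock:
    """Exclusive-only toy lock; not how PostgreSQL implements LWLocks."""

    def __init__(self):
        self._cond = threading.Condition()
        self._held = False

    def acquire(self):
        """Like LWLockAcquire: wait until the lock is free, then take it."""
        with self._cond:
            while self._held:
                self._cond.wait()
            self._held = True

    def conditional_acquire(self):
        """Like LWLockConditionalAcquire: never wait, report success."""
        with self._cond:
            if self._held:
                return False
            self._held = True
            return True

    def acquire_or_wait(self):
        """Like LWLockAcquireOrWait: take a free lock, else only wait."""
        with self._cond:
            if not self._held:
                self._held = True
                return True
            while self._held:
                self._cond.wait()
            return False  # the lock is free again, but was not acquired

    def release(self):
        with self._cond:
            self._held = False
            self._cond.notify_all()


lock = ToyLWLock()
print(lock.acquire_or_wait())       # True: the lock was free and is now held
print(lock.conditional_acquire())   # False: already held, returns without waiting
threading.Timer(0.1, lock.release).start()
print(lock.acquire_or_wait())       # False: waited for the release, did not acquire
print(lock.conditional_acquire())   # True: the lock is free again
```

The third call shows the unusual behavior of the last function: it blocks until the timer thread releases the lock and then returns without holding it.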

Depending on the function used to acquire the LWLock, different counters are increased in the statistics.

Lock statistics:
================

Locks per tranche
+---------------+----------+--------------------------+------------------------+-------------------------------+-----------------------------+-------+----------------+
|    Tranche    | Acquired | AcquireOrWait (Acquired) | AcquireOrWait (Waited) | ConditionalAcquire (Acquired) | ConditionalAcquire (Failed) | Waits | Wait time (ns) |
+---------------+----------+--------------------------+------------------------+-------------------------------+-----------------------------+-------+----------------+
| BufferContent |    1     |            0             |           0            |               0               |              0              |   0   |       0        |
| BufferMapping |    1     |            0             |           0            |               0               |              0              |   0   |       0        |
|  LockFastPath |    4     |            0             |           0            |               0               |              0              |   0   |       0        |
|  LockManager  |    2     |            0             |           0            |               0               |              0              |   0   |       0        |
|  PgStatsData  |    0     |            0             |           0            |               4               |              0              |   0   |       0        |
|   ProcArray   |    2     |            0             |           0            |               1               |              0              |   0   |       0        |
|   WALInsert   |    2     |            0             |           0            |               0               |              0              |   0   |       0        |
|    WALWrite   |    0     |            1             |           1            |               0               |              0              |   1   |   1518564057   |
|    XactSLRU   |    0     |            0             |           0            |               1               |              0              |   0   |       0        |
|     XidGen    |    1     |            0             |           0            |               0               |              0              |   0   |       0        |
+---------------+----------+--------------------------+------------------------+-------------------------------+-----------------------------+-------+----------------+

Locks per type
+--------------+----------+
|  Lock type   | Requests |
+--------------+----------+
| LW_EXCLUSIVE |    18    |
|  LW_SHARED   |    3     |
+--------------+----------+

Summary

pg_lw_lock_tracer is a tracer for PostgreSQL lightweight locks. The tool is available on GitHub for download. It uses Userland Statically Defined Tracing to trace the LWLock activity in real-time. Statistics about wait times of the LWLocks are also collected. This makes the tool very useful for performance analysis.

A description of a lock tracer for heavyweight locks can be found in the first part of this article series about locks.

Trace PostgreSQL locks with pg_lock_tracer

2023-01-11T00:00:00+00:00

The DBMS PostgreSQL uses locks to synchronize access to resources like tables. To get more information about the locks, the view pg_locks shows which relation is currently locked by which process. However, this view shows only the current state of the locks. To show the locking activity in real-time, the new lock tracing tool pg_lock_tracer can be used. pg_lock_tracer is an open-source tool that I have recently created. It can be downloaded from the website of the project.

Goal of the Tool

The tool employs a Berkeley Packet Filter (BPF) program to trace the locking activity of a PostgreSQL process in real-time with very low overhead. In addition, statistics about the taken locks (e.g., number of locks, lock types, delay) are measured by the tool. After the tool is running, the taken locks of the process are shown in real-time.

The tracer is intended for developers or system administrators who want additional information about the internals of PostgreSQL. In addition to the lock types, the tool shows table open and close activity, transactions, deadlocks, errors, and the way a lock is granted (fast-path locking or local locks).

The output of the tool is intended to be readable by a human. However, by using the --json flag, the output is generated in JSON format and can be processed by further tools.

Download and Usage

To install the lock tracer, the Python package installer pip can be used:

pip install pg-lock-tracer

This command installs the lock tracer with most of the needed dependencies. However, the BPF Python bindings need to be installed via the package manager of the used Linux distribution; they are currently not available via pip. To install them on an Ubuntu or Debian based system, the following command can be used:

apt install python3-bpfcc

Execute the Tracer

In this section, a simple query is traced. After the tracer is installed, it can be executed. The following command uses the PostgreSQL binary /home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres and observes the process with the ID 327578 (the SQL query SELECT * from pg_backend_pid(); can be used to determine the PID of the PostgreSQL backend process).

To resolve the used Object identifiers (OIDs) in the lock call, pg_lock_tracer can connect to the catalog of the database and get the real names of the tables. For example, the OID 3081 is translated into pg_catalog.pg_extension_name_index. Because every database has its own catalog with OIDs, the OID resolver has to be specified per traced process. By using the --statistics parameter, statistics about the locks are shown before the tool is terminated.

pg_lock_tracer -x /home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres -p 327578 -r 327578:sql://jan@localhost/test2 --statistics

Execute the SQL Query

After the tracer is running, a SQL query can be executed. In this example, the following SQL is used:

CREATE TABLE metrics(ts timestamptz NOT NULL, id int NOT NULL, value float);

Output of the Tracer

===> Ready to trace queries
745064333930117 [Pid 327578] Query begin 'create table metrics(ts timestamptz NOT NULL, id int NOT NULL, value float);'
745064333965769 [Pid 327578] Transaction begin
745064334157640 [Pid 327578] Table open 3079 (pg_catalog.pg_extension) AccessShareLock
745064334176147 [Pid 327578] Lock object 3079 (pg_catalog.pg_extension) AccessShareLock
745064334204453 [Pid 327578] Lock granted (fastpath) 3079 (pg_catalog.pg_extension) AccessShareLock
745064334224361 [Pid 327578] Lock granted (local) 3079 (pg_catalog.pg_extension) AccessShareLock (Already hold local 0)
745064334243659 [Pid 327578] Lock was acquired in 67512 ns
745064334285877 [Pid 327578] Lock object 3081 (pg_catalog.pg_extension_name_index) AccessShareLock
745064334309610 [Pid 327578] Lock granted (fastpath) 3081 (pg_catalog.pg_extension_name_index) AccessShareLock
745064334328475 [Pid 327578] Lock granted (local) 3081 (pg_catalog.pg_extension_name_index) AccessShareLock (Already hold local 0)
745064334345266 [Pid 327578] Lock was acquired in 59389 ns
745064334562977 [Pid 327578] Lock ungranted (fastpath) 3081 (pg_catalog.pg_extension_name_index) AccessShareLock
745064334583578 [Pid 327578] Lock ungranted (local) 3081 (pg_catalog.pg_extension_name_index) AccessShareLock (Hold local 0)
745064334608957 [Pid 327578] Table close 3079 (pg_catalog.pg_extension) AccessShareLock
745064334631046 [Pid 327578] Lock ungranted (fastpath) 3079 (pg_catalog.pg_extension) AccessShareLock
745064334649932 [Pid 327578] Lock ungranted (local) 3079 (pg_catalog.pg_extension) AccessShareLock (Hold local 0)
745064334671897 [Pid 327578] Table open 3079 (pg_catalog.pg_extension) AccessShareLock
745064334688382 [Pid 327578] Lock object 3079 (pg_catalog.pg_extension) AccessShareLock
745064334712042 [Pid 327578] Lock granted (fastpath) 3079 (pg_catalog.pg_extension) AccessShareLock
745064334731081 [Pid 327578] Lock granted (local) 3079 (pg_catalog.pg_extension) AccessShareLock (Already hold local 0)
745064334748288 [Pid 327578] Lock was acquired in 59906 ns
745064334772367 [Pid 327578] Lock object 3081 (pg_catalog.pg_extension_name_index) AccessShareLock
745064334795943 [Pid 327578] Lock granted (fastpath) 3081 (pg_catalog.pg_extension_name_index) AccessShareLock
745064334814983 [Pid 327578] Lock granted (local) 3081 (pg_catalog.pg_extension_name_index) AccessShareLock (Already hold local 0)
745064334832570 [Pid 327578] Lock was acquired in 60203 ns
[...]

The output of the tracer is truncated to keep the example readable. The full output of the tracer for the query can be found here.

After the query is executed, the lock tracer can be terminated by pressing CTRL + c. It stops tracing the process, shows the collected statistics, and terminates afterward.

Lock statistics:
================

Locks per oid
+----------------------------------------------+----------+------------------------------+
|                  Lock Name                   | Requests | Total Lock Request Time (ns) |
+----------------------------------------------+----------+------------------------------+
|     pg_catalog.pg_depend_reference_index     |    20    |           1174663            |
|             pg_catalog.pg_depend             |    8     |            456525            |
|              pg_catalog.pg_type              |    5     |            282986            |
|     pg_catalog.pg_type_typname_nsp_index     |    4     |            229317            |
|         pg_catalog.pg_type_oid_index         |    4     |            300239            |
|             pg_catalog.pg_class              |    3     |            180540            |
|        pg_catalog.pg_class_oid_index         |    3     |            172549            |
|     pg_catalog.pg_depend_depender_index      |    3     |            171186            |
|    pg_catalog.pg_class_relname_nsp_index     |    2     |            114311            |
|           pg_catalog.pg_attribute            |    2     |            113041            |
|  pg_catalog.pg_attribute_relid_attnum_index  |    2     |            113299            |
|                public.metrics                |    2     |            223162            |
| pg_catalog.pg_class_tblspc_relfilenode_index |    1     |            56426             |
|  pg_catalog.pg_attribute_relid_attnam_index  |    1     |            57238             |
|            pg_catalog.pg_shdepend            |    1     |            65878             |
|    pg_catalog.pg_shdepend_reference_index    |    1     |            63127             |
+----------------------------------------------+----------+------------------------------+

Lock types
+---------------------+---------------------------+
|      Lock Type      | Number of requested locks |
+---------------------+---------------------------+
|   AccessShareLock   |             32            |
|   RowExclusiveLock  |             28            |
| AccessExclusiveLock |             2             |
+---------------------+---------------------------+

More Options of the Tracer

The lock tracer provides a lot of additional options. For example, the types of the events can be restricted or stack traces can be generated for every locking event. To trace only locking events (-t LOCK) and generate stack traces for every lock event (-s LOCK), the tracer can be invoked as follows:

pg_lock_tracer -x /home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres -p 1051967 -r 1051967:psql://jan@localhost/test2 -s LOCK -t LOCK

The output of the tracer looks as follows:

[...]
1990162746005798 [Pid 1051967] Lock object 3079 (pg_catalog.pg_extension) AccessShareLock
	LockRelationOid+0x0 [postgres]
	table_open+0x1d [postgres]
	parse_analyze+0xed [postgres]
	pg_analyze_and_rewrite+0x49 [postgres]
	exec_simple_query+0x2db [postgres]
	PostgresMain+0x833 [postgres]
	ExitPostmaster+0x0 [postgres]
	BackendStartup+0x1b1 [postgres]
	ServerLoop+0x2d9 [postgres]
	PostmasterMain+0x1286 [postgres]
	startup_hacks+0x0 [postgres]
	__libc_start_main+0xea [libc-2.31.so]
	[unknown]
[...]
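
For post-processing such a trace, the event lines can be separated from the stack frames with a small script. The following Python sketch is not part of pg_lock_tracer; the regular expression is derived only from the single sample line shown above, so other event types are an open assumption and are simply skipped.

```python
import re
from collections import Counter

# Pattern derived from the sample lock event line shown above.
# Other event types (transactions, queries, ...) are not covered.
LOCK_EVENT = re.compile(
    r"^(?P<timestamp>\d+) \[Pid (?P<pid>\d+)\] Lock object (?P<oid>\d+) "
    r"\((?P<relation>[^)]+)\) (?P<mode>\w+)$"
)


def parse_lock_event(line):
    """Return the fields of a lock event line, or None for any other line."""
    match = LOCK_EVENT.match(line.strip())
    if match is None:
        return None
    event = match.groupdict()
    for key in ("timestamp", "pid", "oid"):
        event[key] = int(event[key])
    return event


def count_lock_modes(lines):
    """Count how often each lock mode appears in the given trace lines."""
    events = (parse_lock_event(line) for line in lines)
    return Counter(event["mode"] for event in events if event is not None)


sample = [
    "1990162746005798 [Pid 1051967] Lock object 3079 (pg_catalog.pg_extension) AccessShareLock",
    "\tLockRelationOid+0x0 [postgres]",
]
print(parse_lock_event(sample[0]))
print(count_lock_modes(sample))
```

Stack frame lines do not match the pattern and are ignored, so the counter only reflects the lock events themselves.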

To resolve one of these addresses to a line in the source code, the debugger gdb can be used. For example, to resolve exec_simple_query+0x2db to a line, the following command has to be executed:

gdb /home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres
[...]
(gdb) info line *(exec_simple_query+0x2db)
Line 1130 of "postgres.c" starts at address 0x5d4758  and ends at 0x5d477f .

It can be seen that the address exec_simple_query+0x2db resolves to line 1130 of the file postgres.c.
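
When many frames have to be resolved, typing the gdb command by hand becomes tedious. The following Python sketch turns a stack frame line from the trace into the matching info line command. The helper names are hypothetical, and the non-interactive gdb invocation (-batch and -ex) is an alternative to the interactive session shown above, not something the tracer does itself.

```python
import re

# Matches frames such as "exec_simple_query+0x2db [postgres]".
FRAME = re.compile(
    r"^(?P<symbol>\w+)\+(?P<offset>0x[0-9a-fA-F]+) \[(?P<module>[^\]]+)\]$"
)


def gdb_info_line(frame):
    """Build the gdb 'info line' command for a frame, or None if it has no symbol."""
    match = FRAME.match(frame.strip())
    if match is None:
        return None
    return "info line *({symbol}+{offset})".format(**match.groupdict())


def gdb_argv(binary, frame):
    """Build an argument list that lets gdb resolve the frame and exit."""
    command = gdb_info_line(frame)
    if command is None:
        return None
    return ["gdb", "-batch", "-ex", command, binary]


print(gdb_info_line("\texec_simple_query+0x2db [postgres]"))
```

Frames without a symbol, such as [unknown], yield None and have to be skipped by the caller.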

More information about all the options of pg_lock_tracer can be found in the help output:

usage: pg_lock_tracer [-h] [-v] [-j] -p PID [PID ...] -x PATH [-r [OIDResolver ...]]
                      [-s [{DEADLOCK,LOCK,UNLOCK} ...]] [-t [{TRANSACTION,QUERY,TABLE,LOCK,ERROR} ...]]
                      [-o OUTPUT_FILE] [--statistics] [-d]

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         be verbose
  -j, --json            generate output as JSON data
  -p PID [PID ...], --pid PID [PID ...]
                        the pid(s) to trace
  -x PATH, --exe PATH   path to binary
  -r [OIDResolver ...], --oid-resolver [OIDResolver ...]
                        OID resolver for a PID. The resolver has to be specified in format 
  -s [{DEADLOCK,LOCK,UNLOCK} ...], --stacktrace [{DEADLOCK,LOCK,UNLOCK} ...]
                        print stacktrace on every of these events
  -t [{TRANSACTION,QUERY,TABLE,LOCK,ERROR} ...], --trace [{TRANSACTION,QUERY,TABLE,LOCK,ERROR} ...]
                        events to trace (default: All events are traced)
  -o OUTPUT_FILE, --output OUTPUT_FILE
                        write the trace into output file
  --statistics          print lock statistics
  -d, --dry-run         compile and load the BPF program but exit afterward

usage examples:
# Trace use binary '/home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres' for tracing and trace pid 1234
pg_lock_tracer -x /home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres -p 1234

# Trace two PIDs
pg_lock_tracer -x /home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres -p 1234 -p 5678

# Be verbose
pg_lock_tracer -x /home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres -p 1234 -v 

# Use the given db connection to access the catalog of PID 1234 to resolve OIDs
pg_lock_tracer -x /home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres -p 1234 -r 1234:psql://jan@localhost/test2

# Output in JSON format
pg_lock_tracer -x /home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres -p 1234 -j

# Print stacktrace on deadlock
pg_lock_tracer -x /home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres -p 1234 -s DEADLOCK

# Print stacktrace for locks and deadlocks
pg_lock_tracer -x /home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres -p 1234 -s LOCK, DEADLOCK

# Trace only Transaction and Query related events
pg_lock_tracer -x /home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres -p 1234 -t TRANSACTION QUERY

# Write the output into file 'trace'
pg_lock_tracer -x /home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres -p 1234 -o trace

# Show statistics about locks
pg_lock_tracer -x /home/jan/postgresql-sandbox/bin/REL_15_1_DEBUG/bin/postgres -p 1234 --statistics

Summary

pg_lock_tracer is my new open-source tracing tool for PostgreSQL lock activity. It uses the Berkeley Packet Filter (BPF) to trace a running PostgreSQL process and shows the lock activity in real-time. The tool can be downloaded from the website of the project.

Measuring and visualizing I/O latency with ioping and gnuplot

2022-09-21T00:00:00+00:00

Reading and writing data from mass storage (volumes) is a quite common pattern in software. However, some time elapses between starting the I/O request (i.e., reading or writing data) and the completion of the request. The elapsed time to complete the request is called I/O latency. Older magnetic hard disks need some time to position the head on the right track of the magnetic disk. Also, newer flash-based disks like SSDs need some time to read the data. In addition, at common cloud providers, different types of block volumes are available that provide a different I/O latency. With the software ioping, this latency can be measured.

Having a good understanding of the I/O latency of the mass storage device in use is crucial for implementing fast, low-latency software systems. For example, in a database management system, multiple I/O requests are usually needed to execute a query. Monitoring the I/O latency can also be useful to detect a defect on a storage device (e.g., bad sectors on a hard disk) that leads to higher processing times of the software running on it.

The ioping software can determine the I/O latency of a device. Like with the regular ping command, the delay of requests is measured.

Installing and Executing ioping

The ioping package is included in most modern distributions. On Debian-based distributions, the software can be installed as follows:

$ sudo apt install ioping

The software provides a wide range of options. Only the target directory in which the I/O requests are performed is required. In addition, the parameter -c allows specifying how many requests should be performed.

In the following example, 10 I/O requests are executed in the directory /tmp. By default, each I/O request has a size of 4 kilobytes; other sizes can be specified by using the -s parameter. With the parameter -i, the delay between two requests can be specified. By default, a delay of one second is used.

$ ioping -c 10 /tmp
4 KiB <<< /tmp (ext4 /dev/vda1 39.3 GiB): request=1 time=423.2 us (warmup)
4 KiB <<< /tmp (ext4 /dev/vda1 39.3 GiB): request=2 time=636.9 us
4 KiB <<< /tmp (ext4 /dev/vda1 39.3 GiB): request=3 time=629.6 us
4 KiB <<< /tmp (ext4 /dev/vda1 39.3 GiB): request=4 time=612.6 us
4 KiB <<< /tmp (ext4 /dev/vda1 39.3 GiB): request=5 time=599.9 us
4 KiB <<< /tmp (ext4 /dev/vda1 39.3 GiB): request=6 time=590.8 us
4 KiB <<< /tmp (ext4 /dev/vda1 39.3 GiB): request=7 time=612.9 us
4 KiB <<< /tmp (ext4 /dev/vda1 39.3 GiB): request=8 time=638.5 us (slow)
4 KiB <<< /tmp (ext4 /dev/vda1 39.3 GiB): request=9 time=608.0 us
4 KiB <<< /tmp (ext4 /dev/vda1 39.3 GiB): request=10 time=685.0 us (slow)

--- /tmp (ext4 /dev/vda1 39.3 GiB) ioping statistics ---
9 requests completed in 5.61 ms, 36 KiB read, 1.60 k iops, 6.26 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 590.8 us / 623.8 us / 685.0 us / 26.5 us

In the summary of the command above, it can be seen that the system has an average I/O latency of 623.8 us.

To get an output that is more suitable for post-processing, the option -print-count can be used. After n requests, raw statistics are printed. With the option -quiet, the normal output can be suppressed. So, to get a good output that can be used for further processing, the options -print-count 1 -quiet can be used. For example:

$ ioping -print-count 1 -c 10  -quiet /tmp
1 580777 1722 7052621 580777 580777 580777 0 2 1000832419
1 633389 1579 6466800 633389 633389 633389 0 1 1000147838
1 591484 1691 6924955 591484 591484 591484 0 1 999936195
1 638930 1565 6410718 638930 638930 638930 0 1 1000043713
1 617406 1620 6634208 617406 617406 617406 0 1 999983362
1 598996 1669 6838109 598996 598996 598996 0 1 999990179
1 564540 1771 7255465 564540 564540 564540 0 1 999962737
1 602750 1659 6795521 602750 602750 602750 0 1 1000025151
1 643763 1553 6362590 643763 643763 643763 0 1 1000046368

The format of the raw statistics is as follows:

Column	Meaning	Remarks
1	count of requests in statistics	
2	running time	ns
3	requests per second	iops
4	transfer speed	bytes / second
5	minimal request time	ns
6	average request time	ns
7	maximum request time	ns
8	request time standard deviation	ns
9	total requests	including warmup, too slow, or too fast
10	total running time	ns
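
To work with these columns by name instead of by position, a line of raw statistics can be parsed into a small record. The following Python sketch assumes that every line carries all ten columns from the table above; the field names are my own choice and not taken from ioping.

```python
from typing import NamedTuple


class RawStats(NamedTuple):
    """One line of ioping raw statistics; field names are illustrative."""
    requests: int
    running_time_ns: int
    iops: int
    bytes_per_second: int
    min_ns: int
    avg_ns: int
    max_ns: int
    stddev_ns: int
    total_requests: int
    total_running_time_ns: int


def parse_raw_stats(line):
    """Parse one whitespace-separated line with exactly ten integer columns."""
    fields = line.split()
    if len(fields) != len(RawStats._fields):
        raise ValueError(
            f"expected {len(RawStats._fields)} columns, got {len(fields)}"
        )
    return RawStats(*(int(field) for field in fields))


stats = parse_raw_stats("1 580777 1722 7052621 580777 580777 580777 0 2 1000832419")
print(stats.avg_ns / 1000)  # average request time in microseconds
```

A line with a different number of columns raises a ValueError instead of being silently misread.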

Comparing Volumes

To generate graphs of two different mass-storage devices, ioping is executed in the following example in the AWS cloud. A t3a.small instance is launched, and a 100 GB gp2 and a 100 GB gp3 EBS volume are attached to the EC2 instance. According to some posts (see this and this), the newer gp3 EBS volume type might have a higher I/O latency than the older gp2 volume type. Let’s see if this can be confirmed by ioping and a plot of the individual execution times.

To execute the following commands, both volumes are formatted with an ext4 file system using the default parameters of mkfs.ext4. The gp2 volume is mounted to the mount point /mnt/gp2 and the gp3 volume is mounted to the mount point /mnt/gp3. To compare the latency, ioping is executed for each of the mount points.

GP2 volume

sudo ioping -c 100 -i 100ms /mnt/gp2
[...]
--- /mnt/gp2 (ext4 /dev/nvme1n1 97.9 GiB) ioping statistics ---
99 requests completed in 36.8 ms, 396 KiB read, 2.69 k iops, 10.5 MiB/s
generated 100 requests in 9.90 s, 400 KiB, 10 iops, 40.4 KiB/s
min/avg/max/mdev = 240.7 us / 371.9 us / 1.44 ms / 174.4 us

GP3 volume

sudo ioping -c 100 -i 100ms /mnt/gp3
[...]
--- /mnt/gp3 (ext4 /dev/nvme2n1 97.9 GiB) ioping statistics ---
99 requests completed in 52.2 ms, 396 KiB read, 1.90 k iops, 7.41 MiB/s
generated 100 requests in 9.90 s, 400 KiB, 10 iops, 40.4 KiB/s
min/avg/max/mdev = 246.9 us / 527.5 us / 1.20 ms / 207.2 us

It can be seen in the output of the commands that the gp2 volume has an average latency of 371.9 us, while the gp3 volume has an average latency of 527.5 us.
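
To put the two averages into relation, the relative difference can be calculated. The following Python sketch only uses the two values reported above; keep in mind that they come from a single run with 100 requests per volume.

```python
def relative_increase(baseline, value):
    """Return how much larger value is than baseline, as a fraction of baseline."""
    return (value - baseline) / baseline


gp2_avg_us = 371.9  # average latency of the gp2 volume from the run above
gp3_avg_us = 527.5  # average latency of the gp3 volume from the run above

increase = relative_increase(gp2_avg_us, gp3_avg_us)
print(f"gp3 average latency is {increase:.1%} higher than gp2")
```

In this run, the average latency of the gp3 volume is roughly 42 percent higher than that of the gp2 volume.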

Generate Graphs

Gnuplot is a tool that can be used to plot and visualize data. To generate the raw data for the visualization, the following commands can be executed.

sudo ioping -c 1000 -i 100ms -print-count 1 -quiet /mnt/gp2 > gp2.out
sudo ioping -c 1000 -i 100ms -print-count 1 -quiet /mnt/gp3 > gp3.out

After these commands are executed, the files gp2.out and gp3.out with the ioping statistics are created. These files can be processed directly by gnuplot using the following template:

set autoscale
set grid x y

set ylabel "I/O latency (us)"
set xlabel "I/O request number"
set term svg

set output "/dev/null"
set title "EBS GP2 volume attached to a t3a.small EC2 instance"
plot 'gp2.out' using (column(0)):($6/1000)
min_y = GPVAL_DATA_Y_MIN
max_y = GPVAL_DATA_Y_MAX
f(x) = mean_y
fit f(x) 'gp2.out' using (column(0)):($6/1000) via mean_y

stddev_y = sqrt(FIT_WSSR / (FIT_NDF + 1 ))

set label 1 gprintf("Minimum = %g", min_y) at 20, 100
set label 2 gprintf("Average = %g", mean_y) at 20, 1650
set label 3 gprintf("Maximum = %g", max_y) at 20, 1720
set label 4 gprintf("Standard deviation = %g", stddev_y) at 20, 1790

set yrange [0:max_y+300]
set output "gp2.svg"
plot min_y with filledcurves y1=mean_y lt 1 lc rgb "#bbbbdd" title "< Average", \
     max_y with filledcurves y1=mean_y lt 1 lc rgb "#bbddbb" title "> Average", \
     'gp2.out' using (column(0)):($6/1000) pt 2 title "", \
     mean_y lt 1 title "Average"

reset

set autoscale
set grid x y

set ylabel "I/O latency (us)"
set xlabel "I/O request number"
set term svg

set output "/dev/null"
set title "EBS GP3 volume attached to a t3a.small EC2 instance"
plot 'gp3.out' using (column(0)):($6/1000)
min_y = GPVAL_DATA_Y_MIN
max_y = GPVAL_DATA_Y_MAX
f(x) = mean_y
fit f(x) 'gp3.out' using (column(0)):($6/1000) via mean_y

stddev_y = sqrt(FIT_WSSR / (FIT_NDF + 1 ))

set label 1 gprintf("Minimum = %g", min_y) at 20, 100
set label 2 gprintf("Average = %g", mean_y) at 20, 2700
set label 3 gprintf("Maximum = %g", max_y) at 20, 2800
set label 4 gprintf("Standard deviation = %g", stddev_y) at 20, 2900

set yrange [0:max_y+500]
set output "gp3.svg"
plot min_y with filledcurves y1=mean_y lt 1 lc rgb "#bbbbdd" title "< Average", \
     max_y with filledcurves y1=mean_y lt 1 lc rgb "#bbddbb" title "> Average", \
     'gp3.out' using (column(0)):($6/1000) pt 2 title "", \
     mean_y lt 1 title "Average"

When this template is stored in the same directory as the statistics files with the name ioplot.plot and the command gnuplot ioplot.plot is executed, two SVG images are generated. These images contain a plot of the I/O latency along with the minimum, the average, and the maximum I/O latency.

The average I/O latency is roughly the same as shown in the initial commands (an average latency of 377 us for the gp2 volume type and 527 us for the gp3 volume type). The suspicion that gp3 volumes have a higher latency is supported by this measurement. In addition, the standard deviation of the request times is higher (140 us for the gp2 volume type and 229 us for the gp3 volume type).
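
The values printed in the plots can be cross-checked without gnuplot. The following Python sketch computes the same four numbers from the raw statistics, assuming ten columns per line with the average request time in column 6. Because the gnuplot template divides by FIT_NDF + 1, which equals the number of data points for a fit with one parameter, the population standard deviation is used here.

```python
import statistics


def summarize(lines, column=6):
    """Summarize one 1-based column (given in ns) of raw statistics in microseconds."""
    values_us = [
        int(line.split()[column - 1]) / 1000 for line in lines if line.strip()
    ]
    return {
        "min": min(values_us),
        "avg": statistics.fmean(values_us),
        "max": max(values_us),
        "stddev": statistics.pstdev(values_us),
    }


# Two sample lines in the ten-column raw statistics format.
sample = [
    "1 580777 1722 7052621 580777 580777 580777 0 2 1000832419",
    "1 633389 1579 6466800 633389 633389 633389 0 1 1000147838",
]
summary = summarize(sample)
print(summary)
```

To process a complete measurement, an open file can be passed instead of the sample list, for example summarize(open("gp2.out")).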

Summary

ioping is a tool to measure the I/O latency of a volume. gnuplot is a tool that can be used to plot and visualize data. It can be used to plot the raw statistics of ioping.

A HTTPs reverse proxy for Docker with Traefik and Let’s encrypt

2022-08-27T00:00:00+00:00

Docker is one of the most popular runtimes for Containers these days. Often the services in the containers offer a web interface. To ensure that this service can be accessed securely (via HTTPs) on the standard port 443/tcp, reverse proxies are usually used. The reverse proxy receives the incoming requests on port 443, provides the appropriate TLS certificate and distributes the traffic depending on the URL to the respective containers. For a long time, Nginx was the quasi-standard for this task. However, Traefik has also been used for some time. This reverse proxy is discussed in this article.

Nginx is a widely used webserver that can also be used as a reverse proxy. However, the disadvantage of using Nginx as a reverse proxy is that an additional configuration file has to be maintained and additional tools for obtaining SSL-Certificates from Let’s encrypt have to be configured.

Complex Docker deployments are often maintained by using Docker Compose. Traefik integrates seamlessly into such a deployment. The complete reverse proxy can be configured using labels and deployed by including an additional Docker image.

Docker and Reverse Proxies

The first question is, what is a reverse proxy and why is it needed?

A reverse proxy terminates the incoming HTTP(s) connections from the Internet and forwards these connections to internal systems. Often, these systems are not directly reachable from the Internet. The connection forwarding is performed based on the provided URL. In addition, the reverse proxy terminates the HTTPs connection (otherwise, the proxy could not determine the URL from the encrypted connection). The data of the connection could be forwarded as encrypted (HTTPs) or unencrypted (HTTP) connections from the proxy to the actual system.

When using multiple container images, the reverse proxy also performs a further task. If several containers provide a web interface, only one container can use the port 80/tcp or 443/tcp and receive the incoming HTTP and HTTPs connections. The other containers must use non-default ports, which might be inconvenient for users (e.g., https://example.com:10000). Using the reverse proxy, the proxy can listen on the default ports and forward the connections based on the URLs to the actual container ports. This is illustrated in the following image:

flowchart LR
    A["HTTPs-Request\n(443/tcp)"] --> C{Traefik}
    subgraph Docker Host
        C -->|domain1.example.com| D["Container A\n(10000/tcp)"]
        C -->|domain2.example.com/path1| E["Container B\n(10001/tcp)"]
        C -->|domain3.example.com/path/2/| F["Container C\n(10002/tcp)"]
    end

Because most container images do not support HTTPs connections out-of-the-box, the incoming HTTPs traffic is forwarded unencrypted as regular HTTP traffic to the containers. However, the traffic is only forwarded within the local system and never leaves the Docker host, so it cannot be intercepted by an attacker on the network.
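
The forwarding decision itself is easy to picture as a lookup. The following Python sketch is a toy model of the routing shown in the diagram, not Traefik's implementation; the hostnames and ports are the example values from the diagram, and matching on an exact hostname plus a path prefix is a simplifying assumption.

```python
# (hostname, path prefix, local container port), as in the diagram above.
ROUTES = [
    ("domain1.example.com", "/", 10000),
    ("domain2.example.com", "/path1", 10001),
    ("domain3.example.com", "/path/2/", 10002),
]


def select_backend(host, path):
    """Return the port of the first route matching host and path prefix, else None."""
    for route_host, prefix, port in ROUTES:
        if host == route_host and path.startswith(prefix):
            return port
    return None


print(select_backend("domain1.example.com", "/index.html"))
print(select_backend("domain2.example.com", "/path1/status"))
print(select_backend("unknown.example.com", "/"))
```

A request that matches no route gets no backend. A real proxy would answer such a request with an error page.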

Installing and Configuring Traefik

To use Traefik, the provided container image has to be downloaded and started. This can be done by using the following lines in a Docker compose file. The compose file also applies a basic configuration to Traefik. The software listens to all requests on ports 80/tcp and 443/tcp. In addition, these ports of the Docker host are forwarded to this container.

traefik:
    image: "traefik:v2.2"
    container_name: "traefik"
    restart: unless-stopped
    command:
        - "--api.insecure=false"
        - "--api.dashboard=true"
        - "--providers.docker=true"
        - "--providers.docker.exposedbydefault=false"
        - "--entrypoints.web.address=:80"
        - "--entrypoints.websecure.address=:443"
    labels:
        - "traefik.enable=true"
    ports:
    - "80:80"
    - "443:443"
    volumes:
    - "/var/run/docker.sock:/var/run/docker.sock:ro"

In addition, the Docker Socket (the file /var/run/docker.sock) is made accessible to the Traefik container via a volume mount. This is needed to notify Traefik automatically on configuration changes (e.g., a new container is started) and to let Traefik determine the labels that are applied to the containers to build the needed configuration at runtime.

Let’s Encrypt Certificates

Let’s Encrypt is a certificate authority that provides free certificates. These certificates are valid for 90 days and have to be renewed before they expire. Traefik can handle the certificate management automatically, so no additional tools (e.g., certbot) are needed. To use this feature, the label traefik.http.routers..tls.certresolver=myresolver has to be applied to the container (see the complete configuration example below).

Before the myresolver certresolver can be used, it has to be defined and configured. This can be done by adding the following options to the start of the Traefik binary.

# Use a TLS challenge to request new certificates
- "--certificatesresolvers.myresolver.acme.tlschallenge=true"

# Use the E-Mail email@example.com to request the certificates
- "--certificatesresolvers.myresolver.acme.email=email@example.com"

# Store the certificates in the following file
- "--certificatesresolvers.myresolver.acme.storage=/letsencrypt/acme.json"

To store the requested certificates permanently and let them survive restarts of the Traefik container, the directory /letsencrypt of the Traefik container should be mapped as a volume to the host system. This can be done by adding the entry /root/traefik/letsencrypt:/letsencrypt to the volumes section of the Docker compose file.

Notice: The Let’s Encrypt service has some rate limits. When these rate limits are reached, no new certificates are provided for a few days. During the setup of a system, it can be useful to use the sandbox (staging) CA of Let’s Encrypt. This CA does not issue certificates that are trusted by browsers, but the local settings can be checked. To test the configuration using the Let’s Encrypt sandbox CA, the following setting can be used:

 - "--certificatesresolvers.myresolver.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory"

If everything works as expected, the setting can be removed and the directory /root/traefik/letsencrypt can be deleted. When Traefik is restarted, the certificates are requested from the official Let’s encrypt CA.

Tuning HTTPs Options

To improve the strength of the HTTPs connections and get a good rating in tests (like the SSL server test of SSL labs), the encryption settings have to be adjusted. For example, the available ciphers must be restricted and the TLS protocol versions have to be limited.

This configuration can be done using a separate configuration file that is mounted as a volume into the Traefik container. The following file can be stored as /root/traefik/dynamic.yml on the Docker host. It is mounted into the Traefik container by adding the entry /root/traefik/dynamic.yml:/dynamic.yml:ro to the volumes section of the Docker compose file, and it is loaded by passing the option --providers.file.filename=/dynamic.yml to the Traefik binary.

tls:
 options:
   default:
     minVersion: VersionTLS12

     cipherSuites:
       - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
       - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
       - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
       - TLS_AES_128_GCM_SHA256
       - TLS_AES_256_GCM_SHA384
       - TLS_CHACHA20_POLY1305_SHA256
       
     curvePreferences:
       - CurveP521
       - CurveP384

     sniStrict: true

Admin Dashboard

Traefik ships with a dashboard that allows exploring the active configuration. To enable the dashboard, a hostname has to be chosen and the following labels have to be applied to the Traefik container. The hostname console.example.com is used in this example and has to be replaced by the real hostname. In addition, the password for the user has to be set.

# Process HTTPs traffic for the dashboard
- "traefik.http.routers.dashboard.entrypoints=websecure"

# Use the myresolver certificate resolver to request TLS certificates
- "traefik.http.routers.dashboard.tls.certresolver=myresolver"

# Listen to the Hostname "console.example.com"
- "traefik.http.routers.dashboard.rule=Host(`console.example.com`)"

# Forward all traffic to the dashboard service
- "traefik.http.routers.dashboard.service=api@internal"

# Protect the access by a username and a password
- "traefik.http.routers.dashboard.middlewares=auth"

# Set the password for the user "myuser"
- "traefik.http.middlewares.auth.basicauth.users=myuser:[...]/"

Note: The hashed version of the password for the user has to be included in the configuration file. The entry can be generated by using the following command:

echo $(htpasswd -nb myuser mysecret) | sed -e s/\\$/\\$\\$/g

The htpasswd tool is included in the apache2-utils package on Debian-based distributions like Ubuntu.
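
The sed call in the command above doubles every dollar sign, because Docker Compose treats a single $ as the start of a variable reference and $$ as a literal dollar sign. The same escaping can be sketched in Python. The function name is hypothetical, and the entry below is a made-up placeholder, not a real password hash.

```python
def escape_for_compose(htpasswd_entry):
    """Double every dollar sign so that Docker Compose does not interpolate it."""
    return htpasswd_entry.replace("$", "$$")


# Made-up placeholder in the user:hash layout that htpasswd -nb prints.
entry = "myuser:$apr1$hypothetical$placeholder"
print(escape_for_compose(entry))
```

The escaped string is what goes after basicauth.users= in the label.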

Setting Labels for a Container

Like in Kubernetes, the configuration for the Containers is done based on labels. These labels are parsed by Traefik and the needed configuration is created at runtime. To let Traefik handle the HTTP and HTTPs-connections for one container, the following labels have to be applied to the container:

# Enable Traefik for this container
- "traefik.enable=true"

# Also handle the incoming HTTPs traffic for the host and use a certificate that is requested via the "myresolver" certresolver
- "traefik.http.routers.develop-platform.rule=Host(`myservice.example.com`)"
- "traefik.http.routers.develop-platform.entrypoints=websecure"
- "traefik.http.routers.develop-platform.tls.certresolver=myresolver"

The labels above ensure that the traffic to the HTTPs port is handled properly. Unencrypted HTTP traffic for the domain is not handled so far. Therefore, an error message is shown in the browser if a user opens the domain via a regular HTTP connection. So, it might be useful to redirect all HTTP requests automatically to HTTPs. This can be done by using the following labels.

# Handle the incoming HTTP traffic for the host "myservice.example.com" and perform an automatic redirect to HTTPs
- "traefik.http.routers.develop-platform-plain.entrypoints=web"
- "traefik.http.routers.develop-platform-plain.rule=Host(`myservice.example.com`)"
- "traefik.http.routers.develop-platform-plain.middlewares=redirect-https"

The Complete Configuration

In this subsection, the complete configuration is shown. It starts one container with a web interface (called develop-platform in this example) and it starts the Traefik proxy that terminates the HTTP and HTTPs connections on the Docker host. The complete stack can be deployed by invoking docker-compose up -d.

version: "3.4"

services:

    develop-platform:
        image: nginxdemos/hello
        restart: unless-stopped
        labels:
            - "traefik.enable=true"
            - "traefik.http.routers.develop-platform.rule=Host(`myservice.example.com`)"
            - "traefik.http.routers.develop-platform.entrypoints=websecure"
            - "traefik.http.routers.develop-platform.tls.certresolver=myresolver"
            - "traefik.http.routers.develop-platform-plain.entrypoints=web"
            - "traefik.http.routers.develop-platform-plain.rule=Host(`myservice.example.com`)"
            - "traefik.http.routers.develop-platform-plain.middlewares=redirect-https"

    traefik:
        image: "traefik:v2.2"
        container_name: "traefik"
        restart: unless-stopped
        command:
            - "--api.insecure=false"
            - "--api.dashboard=true"
            - "--providers.file.filename=/dynamic.yml"
            - "--providers.docker=true"
            - "--providers.docker.exposedbydefault=false"
            - "--entrypoints.web.address=:80"
            - "--entrypoints.websecure.address=:443"
            - "--certificatesresolvers.myresolver.acme.tlschallenge=true"
            - "--certificatesresolvers.myresolver.acme.email=email@example.com"
            - "--certificatesresolvers.myresolver.acme.storage=/letsencrypt/acme.json"
        labels:
            - "traefik.enable=true"

            - "traefik.http.middlewares.redirect-https.redirectScheme.scheme=https"
            - "traefik.http.middlewares.redirect-https.redirectScheme.permanent=true"

            - "traefik.http.routers.dashboard-plain.entrypoints=web"
            - "traefik.http.routers.dashboard-plain.rule=Host(`console.example.com`)"
            - "traefik.http.routers.dashboard-plain.middlewares=redirect-https"

            - "traefik.http.routers.dashboard.entrypoints=websecure"
            - "traefik.http.routers.dashboard.tls.certresolver=myresolver"
            - "traefik.http.routers.dashboard.rule=Host(`console.example.com`)"
            - "traefik.http.routers.dashboard.service=api@internal"
            - "traefik.http.routers.dashboard.middlewares=auth"
            - "traefik.http.middlewares.auth.basicauth.users=myuser:[...]/"
        ports:
            - "80:80"
            - "443:443"
        volumes:
            - "./traefik/letsencrypt:/letsencrypt"
            - "./traefik/dynamic.yml:/dynamic.yml:ro"
            - "/var/run/docker.sock:/var/run/docker.sock:ro"

Backup your Data encrypted to AWS S3 using Duplicity

2022-08-07T00:00:00+00:00

No one wants to create backups; everyone just wants to be able to restore data – that’s the old saying in IT. However, to be able to restore data, backups need to be created on a regular basis. To ensure that major disasters can be survived, the backups should be stored in a different location. In this article, the software Duplicity is used to create automated backups and store them on an AWS S3 bucket in the AWS cloud. These backups are encrypted using GPG.

Creation of an AWS S3 Bucket

The Amazon Simple Storage Service (S3) is a service that is specialized in storing files in a scalable manner. The data can be stored in various storage classes. Currently (in 2022), storing a gigabyte costs approximately 2.3 cents (USD) per month. With a minimum storage period of 30 days (storage class Standard-Infrequent Access), the costs drop to 1.2 cents per month. In addition to the cost of storage, there is the cost of data transfer. Inbound traffic to AWS is free; outbound traffic is billed. Costs may also be incurred for operations (e.g., PUT, DELETE). More details can be found in the AWS price calculator.

When using the Amazon S3 Glacier Deep Archive storage class, the costs drop even further. However, access to the data takes longer, and the data also has a higher minimum storage period (up to 180 days). AWS itself recommends using the “Standard Infrequent Access” storage class for backups. In most storage classes, data is stored in at least three availability zones at the same time, which results in very high durability (99.999999999%, i.e., eleven nines).

Even though the cost is made up of many individual components, storing the data remains inexpensive. According to the AWS pricing calculator, storing 25 GB of data and backing up 10 GB once a month costs 1.42 USD.

Create a new Bucket

Open the S3 configuration in the AWS console and click the Create Bucket button. To create the bucket, a unique name (e.g., my-backup-bucket) has to be specified. In addition, the Object Ownership can be set to ACLs disabled (recommended), and the public access can be set to Block all public access. Using this setting, accessing the bucket requires an IAM account. Also, the Bucket Versioning can be set to Disabled, and the Server-side encryption can also be disabled, because Duplicity already encrypts the backup volumes on the client side before they are transferred to the S3 bucket.

Bucket Access Management

To grant Duplicity access to this bucket, a policy has to be created that allows read and write access to this bucket. This policy can be attached to an IAM user group. Afterward, a new IAM user with an API key and API secret can be created and assigned to this user group.

In the first step, the Identity and Access Management (IAM) console has to be opened. Then a new policy can be created. During the policy creation, the permissions of the policy can be specified as JSON. For a policy that allows access to the bucket my-backup-bucket, the following code can be used. In your setup, my-backup-bucket has to be replaced by your actual bucket name.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListObjectsInBucket",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::my-backup-bucket"
            ]
        },
        {
            "Sid": "AllObjectActions",
            "Effect": "Allow",
            "Action": "s3:*Object",
            "Resource": [
                "arn:aws:s3:::my-backup-bucket/*"
            ]
        }
    ]
}

Before the policy can be stored, a name has to be chosen. In this example, I used s3-backup-bucket-read-write.
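
Because the bucket name appears in two places of the policy, it is easy to replace only one of them by accident. The following Python sketch generates the same policy document for a given bucket name. The function name is hypothetical; the statements are copied from the JSON above.

```python
import json


def backup_bucket_policy(bucket_name):
    """Build the read/write policy shown above for the given bucket name."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListObjectsInBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                "Sid": "AllObjectActions",
                "Effect": "Allow",
                "Action": "s3:*Object",
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


policy = backup_bucket_policy("my-backup-bucket")
print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the policy editor of the IAM console.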

Afterward, a new user group has to be created. This can also be done in the IAM console. The user group is named s3-backup-group in this example. In the field Attach permissions policies, the s3-backup-bucket-read-write policy has to be chosen and attached to this user group. Afterward, the user group can be created.

Finally, a new IAM user account can be created that allows access to the S3 bucket. In this example, the user is named duplicity-backup-user. During the creation, under the setting Select AWS credential type, the option Access key - Programmatic access has to be activated to allow access via a key and a secret. The option Password - AWS Management Console access has to stay disabled. On the permissions tab, the s3-backup-group user group can be attached to this user, and the user can be created. After clicking the Create user button, the API key and the API secret are shown. These values should be noted now since the API secret is only displayed once. The setup of the S3 bucket is now complete.

Installation and Configuration of Duplicity

Duplicity can now be installed. On a Debian-based distribution, this can be done by executing the following command:

apt install duplicity python3-boto gpg

Setup Encryption

Even though the access to the S3 bucket is protected, it is recommended to encrypt the backups. This can be done with a GPG key. In this case, the backups can only be restored when the private key (and the password for the key) is known. Therefore, the GPG key pair also has to be backed up (e.g., on a USB stick that is stored at another physical location).

If a GPG key is already present, this key can be used. Otherwise, a new key pair (a private key and a public key) has to be created. This can be done by executing gpg --gen-key and creating a key (2048 - 4096 bits with no expiration date).

Afterward, gpg --list-keys can be called and the ID of the key (e.g., 45DBFFF2) should be noted down. The ID is used later during the configuration of Duplicity.
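Depending on the GnuPG version, gpg --list-keys may print the full fingerprint instead of a short ID like the one above. As a small sketch, the short ID can be derived from the machine-readable output of gpg --list-keys --with-colons, where the first field is the record type and the fifth field is the 16-digit key ID. The helper name short_key_ids is made up for this example.

```sh
#!/bin/sh
# Reads the output of `gpg --list-keys --with-colons` on stdin and prints
# the short ID (the last 8 hex digits of the key ID) of every public key.
# The function name is an assumption made for this example.
short_key_ids() {
    awk -F: '$1 == "pub" { print substr($5, length($5) - 7) }'
}

# Usage (not run automatically):
#   gpg --list-keys --with-colons | short_key_ids
```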

Perform Backups

To perform the actual backups, a shell script like the following one should be created and executed as a cron job on a regular basis. In the script, the variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY have to be set to the API key and the API secret of the created IAM user. In addition, my-backup-bucket has to be replaced by the name of the actual S3 bucket, and the ID of the GPG key (GPG_KEY) has to be adjusted. Compression is already performed by default. The compression ratio can be improved as shown in the script. However, stronger compression also takes longer, which can significantly increase the time needed to create a backup. Here, a good balance between storage costs and runtime has to be found.

By specifying --include, the directories that should be backed up can be specified. The line duplicity remove-older-than 2M ensures that backups that are older than two months are deleted. The next invocation of duplicity performs the actual backup. The flag --s3-use-ia ensures that all created files are stored in the infrequent access storage class. In addition, the backups are encrypted by using a GPG key. Normally, an incremental backup is performed. However, after one month, a full backup (--full-if-older-than 1M) is created.

#!/bin/sh

export AWS_ACCESS_KEY_ID="[....]"
export AWS_SECRET_ACCESS_KEY="[....]"

DEST=boto3+s3://my-backup-bucket/
GPG_KEY="45DBFFF2"
VERBOSE=""
#VERBOSE="-v8"

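# Compression
# The extra GPG options are kept in the positional parameters, so that the
# quoted option string is passed to duplicity as a single argument.
set --
# Uncomment for stronger (but slower) compression:
#set -- --gpg-options "--compress-algo=bzip2 --bzip2-compress-level=9"

# Directories to back up (expanded unquoted below; paths must not contain spaces)
INCLUDES="--include /home --include /root"

# Delete backups that are older than two months
duplicity remove-older-than 2M "${DEST}" --force
# Incremental backup; a full backup is created if the last one is older than one month
duplicity ${VERBOSE} "$@" --s3-use-ia --encrypt-key "${GPG_KEY}" --full-if-older-than 1M ${INCLUDES} --exclude '**' / "${DEST}"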
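If the list of directories grows, the value of INCLUDES can also be generated instead of being written by hand. The following helper is a minimal sketch of this idea. The function name build_includes is made up for this example, and paths that contain whitespace are not supported, because the script expands ${INCLUDES} unquoted.

```sh
#!/bin/sh
# Builds the value for INCLUDES from a list of directories.
# Limitation: paths with whitespace are not supported, because the backup
# script relies on word splitting when it expands ${INCLUDES}.
build_includes() {
    includes=""
    for dir in "$@"; do
        # Append "--include <dir>", separated by a space after the first entry.
        includes="${includes:+${includes} }--include ${dir}"
    done
    printf '%s\n' "${includes}"
}

# Usage:
#   INCLUDES=$(build_includes /home /root)
```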

Restoring Data

After a backup has been created, files can be restored. To restore files, the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and DEST (as in the script above) have to be set by calling export .... Afterward, by calling duplicity list-current-files ${DEST}, the files in the most recent backup can be shown. By calling duplicity list-current-files --time 2D ${DEST}, the files as of the backup state two days ago are shown.
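Since the file listing of a backup can be long, it is often handy to filter it. The following sketch wraps the listing command in a small function. It assumes that AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and DEST are already set as described; the function name find_in_backup is made up for this example.

```sh
#!/bin/sh
# Lists the files of a backup and keeps only the lines that match a pattern.
# $1: pattern to search for (passed to grep)
# $2: optional age or date of the backup (defaults to "now")
# Assumes DEST and the AWS credentials are already set in the environment.
find_in_backup() {
    pattern="$1"
    age="${2:-now}"
    duplicity list-current-files --time "${age}" "${DEST}" | grep -- "${pattern}"
}

# Usage (not run automatically):
#   find_in_backup notes.txt       # search the most recent backup
#   find_in_backup notes.txt 2D    # search the state from two days ago
```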

Restore a Single File

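By calling duplicity restore --file-to-restore filename --time 2022-05-18 ${DEST} /tmp/restore/filename, the version of the file filename as of 2022-05-18 is restored as /tmp/restore/filename. The path passed to --file-to-restore has to be given relative to the root of the backup (i.e., without a leading slash).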
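Because the backup in this setup starts at /, it is easy to pass an absolute path by accident. The following sketch strips a leading slash before calling duplicity. It assumes that DEST and the AWS credentials are set as described above; the function name restore_file is made up for this example.

```sh
#!/bin/sh
# Restores a single file from the backup.
# $1: path of the file (a leading slash is removed, because duplicity
#     expects a path relative to the root of the backup)
# $2: age or date of the version to restore (e.g., 2D or 2022-05-18)
# $3: target path for the restored file
# Assumes DEST and the AWS credentials are already set in the environment.
restore_file() {
    path="${1#/}"
    when="$2"
    target="$3"
    duplicity restore --file-to-restore "${path}" --time "${when}" "${DEST}" "${target}"
}

# Usage (not run automatically):
#   restore_file /home/user/notes.txt 2022-05-18 /tmp/restore/notes.txt
```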

Restore a Complete Backup

A complete backup can be restored as well. For example, this can be done by calling duplicity restore -t 4D ${DEST} /restore. This command restores the backup from four days ago (-t 4D) into the directory /restore.