
Advanced JSON Querying in PostgreSQL: Beyond Basic JSONB

Sophisticated techniques for working with complex JSON structures efficiently

Introduction

PostgreSQL has evolved into a powerful hybrid database system, combining the reliability of relational databases with the flexibility of NoSQL solutions through its robust JSON capabilities. While basic JSON operations are well documented, this article explores advanced techniques for querying, manipulating, and optimizing complex JSON structures in PostgreSQL.

JSON vs. JSONB: Understanding the Foundation

PostgreSQL offers two primary JSON data types:

  • JSON: Stores data in text format, preserving whitespace and key ordering

  • JSONB: Stores data in binary format, offering significant performance advantages

For most advanced querying use cases, JSONB is the superior choice thanks to its faster processing, better indexing support, and optimization for read operations. JSONB, which stands for "JSON Binary", stores documents in a decomposed binary format: while JSON stores data as plain text (making writes slightly faster), JSONB's binary representation enables significantly faster reads and powerful indexing capabilities. Let's dig into why and how.
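A quick way to see the difference is to cast the same literal to both types; JSON preserves the text verbatim, while JSONB normalizes whitespace and deduplicates keys (the last value wins):

-- JSON keeps the input text exactly as written
SELECT '{"b": 2, "a": 1, "a": 3}'::json;
-- Result: {"b": 2, "a": 1, "a": 3}

-- JSONB stores a normalized binary form: whitespace collapsed, duplicate keys removed
SELECT '{"b": 2, "a": 1, "a": 3}'::jsonb;
-- Result: {"a": 3, "b": 2}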

Path-Based Access with JSON Path Expressions

There are two main ways to extract data from a JSON column: chained operators and JSON Path expressions. JSON Path expressions were introduced in PostgreSQL 12 and further optimized in version 16. They provide a standardized way to access nested JSON data, with improved performance over traditional queries built from chained operators; JSON Path queries can perform 15-25% faster for deeply nested data.

Note: JSON Path expressions and JSON Path queries are closely related but not exactly the same. JSON Path expressions are syntax patterns that specify how to navigate and select elements within JSON data structures; they define the path to traverse through a JSON document (similar to how XPath works for XML). JSON Path queries are operations that use JSON Path expressions to retrieve or manipulate data from JSON documents; these queries can include additional operations such as filtering, searching, or transforming the data located by the expressions. Postgres provides the following jsonb_path functions:

  • jsonb_path_query: Extracts all JSON items that match the JSON path expression

  • jsonb_path_query_first: Returns only the first JSON item that matches the path expression

  • jsonb_path_query_array: Returns all matching items as a JSON array

  • jsonb_path_exists: Returns true if the path expression matches at least one JSON item

  • jsonb_path_match: Tests whether a JSON path predicate expression matches

-- Traditional approach with chained operators
SELECT * FROM products 
WHERE specs->'dimensions'->'height' > '10';

-- More efficient JSON Path approach
SELECT * FROM products 
WHERE jsonb_path_query_first(specs, '$.dimensions.height') > '10';
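When the goal is simply to filter rows, the comparison can also be written inside the path itself as a predicate. This is a minimal sketch against the same hypothetical products/specs schema:

-- Filter rows using a path predicate (rows where height > 10)
SELECT * FROM products
WHERE jsonb_path_exists(specs, '$.dimensions.height ? (@ > 10)');

-- Collect all values under dimensions into a single JSON array per row
SELECT id, jsonb_path_query_array(specs, '$.dimensions.*') AS dims
FROM products;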

Transforming JSON Data for Efficiency

Restructuring your queries can yield substantial performance improvements. The pattern below improves query performance by 20-35% in benchmark tests by allowing the JSON operations to be processed more efficiently.

-- Less efficient approach with transformation in WHERE clause
SELECT * FROM orders 
WHERE (order_data->'items'->>'count')::int > 5;

-- More efficient approach: hoist the extraction into a LATERAL join
SELECT * FROM orders 
CROSS JOIN LATERAL jsonb_path_query_first(order_data, '$.items.count') AS item_count 
WHERE (item_count)::int > 5;

NOTE: CROSS JOIN LATERAL in PostgreSQL is a powerful combination of two join types that enables row-by-row processing with context from previous tables.

A standard CROSS JOIN performs a cartesian product between two tables, generating all possible combinations of rows. The LATERAL keyword enables the right-side table or subquery to reference columns from tables that appear earlier in the FROM clause.

Lateral joins are particularly effective for working with JSON arrays and objects, enabling you to unnest and process collection elements more efficiently.

-- Extracting and processing array elements
SELECT u.id, elem.value 
FROM users u, 
LATERAL json_array_elements(u.data->'items') AS elem;
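Objects can be unnested the same way. The sketch below assumes a jsonb data column with a hypothetical settings object; with a plain json column, json_each works identically:

-- Expanding an object's key/value pairs into one row each
SELECT u.id, attr.key, attr.value
FROM users u,
LATERAL jsonb_each(u.data->'settings') AS attr;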

CTEs (Common Table Expressions) can significantly simplify complex JSON queries by breaking them into logical components.

WITH json_data AS (
  SELECT data FROM users
)
SELECT json_data.data->>'name' AS name 
FROM json_data;
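A more realistic sketch of that pattern, reusing the hypothetical orders table from earlier (the status key is illustrative), extracts and casts values in one stage, then filters and aggregates on the results in the next:

-- Stage 1: extract and cast JSON fields; Stage 2: filter and aggregate on the results
WITH order_items AS (
  SELECT id,
    (order_data->'items'->>'count')::int AS item_count,
    order_data->>'status' AS status
  FROM orders
)
SELECT status, count(*) AS order_count, avg(item_count) AS avg_items
FROM order_items
WHERE item_count > 5
GROUP BY status;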

GIN (Generalized Inverted Index) indexes are particularly effective for working with JSONB data. Unlike traditional indexes, GIN is designed for handling cases where the indexed items are composite values. Consider these various indexing approaches:

-- Basic GIN index covering every key and value in the document
CREATE INDEX idx_products_specs ON products USING GIN (specs);

-- Partial index for specific use cases
CREATE INDEX idx_active_electronics ON products 
USING GIN (specs jsonb_path_ops) 
WHERE (specs->>'status') = 'active' AND (specs->>'category') = 'electronics';

Partial indexes reduce index size and improve maintenance operations while targeting specific query needs.
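As a rough sketch, a containment query shaped like the one below (the brand key is hypothetical) can take advantage of the partial jsonb_path_ops index above, because it repeats the index's WHERE conditions and uses the @> operator that jsonb_path_ops supports:

-- Containment search that matches the partial index's predicate
SELECT * FROM products
WHERE specs @> '{"brand": "Acme"}'
  AND specs->>'status' = 'active'
  AND specs->>'category' = 'electronics';

The default GIN operator class additionally supports the key-existence operators ?, ?| and ?&, while jsonb_path_ops indexes are smaller and faster but only support containment and path-match operators.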

PostgreSQL 16 introduced improved function result caching, which can significantly boost the performance of frequently used JSON functions.

For reports requiring complex JSON aggregations or frequent access patterns, consider using a materialized view.

CREATE MATERIALIZED VIEW product_dimensions AS 
SELECT id, 
  (specs->'dimensions'->>'width')::numeric AS width, 
  (specs->'dimensions'->>'height')::numeric AS height, 
  (specs->'dimensions'->>'depth')::numeric AS depth 
FROM products 
WHERE specs ? 'dimensions';

-- Refresh on a schedule
REFRESH MATERIALIZED VIEW product_dimensions;

Materialized views allow you to shift processing to off-peak hours and can deliver sub-second query response times for complex JSON-based reports.
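If readers need to keep querying the view while it refreshes, REFRESH MATERIALIZED VIEW CONCURRENTLY avoids locking them out; it requires a unique index on the view (the index name below is illustrative):

-- CONCURRENTLY needs a unique index on the materialized view
CREATE UNIQUE INDEX idx_product_dimensions_id ON product_dimensions (id);

REFRESH MATERIALIZED VIEW CONCURRENTLY product_dimensions;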

Another efficient way to leverage JSONB in PostgreSQL is to combine traditional columns and JSONB. If a key appears frequently in your JSONB blobs, it is likely better stored as a column, while using JSONB as a "catch-all" for more variable data. Consider this hybrid approach:

CREATE TABLE products (
  id SERIAL PRIMARY KEY,
  name TEXT,
  price NUMERIC,
  category TEXT,
  -- Frequently accessed fields as columns
  created_at TIMESTAMP,
  -- Variable data as JSONB
  specs JSONB
);
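With that layout, queries can do the heavy filtering on the regular columns and reach into JSONB only for the variable attributes; a brief sketch (the wireless key is hypothetical):

-- Fixed columns handle the selective filters; JSONB handles the variable attributes
SELECT id, name, price
FROM products
WHERE category = 'electronics'
  AND price < 100
  AND specs @> '{"wireless": true}';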

PostgreSQL's support for generated columns allows you to automatically derive columns from JSON data, improving query performance for frequently accessed fields.

ALTER TABLE employees 
ADD COLUMN full_name TEXT GENERATED ALWAYS AS 
(data->>'first_name' || ' ' || data->>'last_name') STORED;
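Because the generated column is stored like any other column, it can be indexed and filtered directly; a minimal sketch (the index name and sample value are illustrative):

-- Index and query the generated column like a regular field
CREATE INDEX idx_employees_full_name ON employees (full_name);

SELECT * FROM employees WHERE full_name = 'Jane Smith';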

Note: Regular monitoring helps identify and address JSON query bottlenecks. Use EXPLAIN ANALYZE to examine execution plans for complex JSON queries and monitor pg_stat_statements to track query performance metrics.
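For example, this is one way to check whether a JSONB filter uses an index or falls back to a sequential scan (the filter itself is illustrative):

-- Inspect the actual execution plan and buffer usage for a JSONB containment filter
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM products
WHERE specs @> '{"category": "electronics"}';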

-- Enable performance tracking (these settings persist and apply only to the specified database)
ALTER DATABASE my_database SET track_io_timing = on;
ALTER DATABASE my_database SET pg_stat_statements.track = 'all';

Once you've enabled track_io_timing, you can view the collected I/O statistics using the pg_stat_database view. Here are some common ways to use it:

-- Basic I/O statistics
SELECT 
    datname,
    blk_read_time,
    blk_write_time
FROM pg_stat_database
WHERE datname = 'my_database';

-- detailed I/O metrics with calculated percentages:
SELECT 
    datname,
    blk_read_time, 
    blk_write_time,
    blk_read_time + blk_write_time AS total_io_time,
    (blk_read_time / (blk_read_time + blk_write_time)) * 100 AS read_pct,
    (blk_write_time / (blk_read_time + blk_write_time)) * 100 AS write_pct
FROM pg_stat_database 
WHERE datname = 'my_database'
AND (blk_read_time + blk_write_time) > 0;

-- reset statistics to start fresh
SELECT pg_stat_reset();

Once you've enabled pg_stat_statements.track = ALL, you can use it to collect and analyze detailed query performance statistics. Here are common ways to use it:

-- install the extension if not already installed
-- (the pg_stat_statements module must also be listed in shared_preload_libraries, which requires a restart)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;


-- most time consuming queries
SELECT 
    query,
    calls,
    total_exec_time,
    mean_exec_time,
    rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- queries with highest average execution time:
SELECT 
    query,
    calls,
    mean_exec_time,
    stddev_exec_time,
    rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- frequently called queries
SELECT 
    query,
    calls,
    total_exec_time,
    mean_exec_time,
    rows
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;

-- reset to start fresh
SELECT pg_stat_statements_reset();

Congratulations on making it to the end of this article! Thanks again for reading. Please like and SHARE.

This week’s sponsor is YOU. Please let me know articles you would like to see in the future. Looking forward to hearing from y’all. Send your suggestions to [email protected]  :)