DuckDB Crash Course
From Big Data to AI Insights: Why DuckDB is Your Essential Tool
DuckDB is an in-process analytical database management system, similar to SQLite but optimized for analytical queries (OLAP) rather than transactional workloads (OLTP). It is designed to be embedded within applications and uses columnar storage to process large datasets with complex queries quickly. Let's take a closer look.
Installing DuckDB is simple and straightforward. If you are on Windows, the easiest way is to download the CLI executable and add it to your PATH. If you want to use DuckDB in Python or R, just install it with the usual package manager.
# Python
pip install duckdb
# R
install.packages("duckdb")
# CLI - Windows
# Download ZIP from official website
Invoke-WebRequest -Uri https://github.com/duckdb/duckdb/releases/download/v0.9.2/duckdb_cli-windows-amd64.zip -OutFile duckdb_cli-windows-amd64.zip
# Extract the ZIP file
Expand-Archive -Path duckdb_cli-windows-amd64.zip -DestinationPath C:\duckdb
# Add to PATH for the current session (use System Properties to make it permanent)
$env:PATH += ";C:\duckdb"
# Alternative: Install via Chocolatey
choco install duckdb
# Alternative: Install via Scoop
scoop install duckdb
# CLI - macOS
brew install duckdb
# CLI - Linux
wget https://github.com/duckdb/duckdb/releases/download/v0.9.2/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
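Once installed, a quick sanity check from any shell confirms the CLI is available on your PATH:
# Print the CLI version (sanity check)
duckdb --version
# Start an interactive in-memory session
duckdb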
DuckDB vs SQLite
DuckDB differs from SQLite in several key ways:
Storage model: DuckDB uses columnar storage (optimized for analytical queries) vs SQLite's row-based storage
Vectorized execution: DuckDB processes data in batches for better CPU utilization
Query optimization: DuckDB has an advanced optimizer for complex analytical queries
Parallelism: DuckDB runs queries across multiple threads, while SQLite executes each query on a single thread
External data: DuckDB can query external files directly (CSV, Parquet, etc.), as shown in the sketch below
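To illustrate that last point, here is a minimal sketch of querying a file in place, with no import step or schema definition; the file name sales.parquet and its amount column are hypothetical stand-ins for your own data:
-- Aggregate a Parquet file directly ('sales.parquet' and 'amount' are hypothetical)
SELECT count(*) AS row_count, sum(amount) AS total
FROM read_parquet('sales.parquet');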
DuckDB is written in C++ and has first-party support for languages like Python, R, Java, and C. Other languages like JavaScript, Ruby, and Go are supported through maintained bindings or extensions. Here are some examples of using DuckDB.
Python Example
import duckdb
# Connect to an in-memory database
con = duckdb.connect()
# Create and query a table
con.execute("CREATE TABLE items(id INTEGER, name VARCHAR)")
con.execute("INSERT INTO items VALUES (1, 'DuckDB'), (2, 'SQLite')")
result = con.execute("SELECT * FROM items").fetchall()
print(result)
# Query a CSV file directly; fetchdf() returns a pandas DataFrame
result = con.execute("SELECT * FROM read_csv('data.csv') LIMIT 5").fetchdf()
print(result)
R Example
library(duckdb)
# Connect to an in-memory database
con <- dbConnect(duckdb())
dbExecute(con, "CREATE TABLE test (id INTEGER, name VARCHAR)")
dbExecute(con, "INSERT INTO test VALUES (1, 'DuckDB'), (2, 'R')")
dbGetQuery(con, "SELECT * FROM test")
# Close the connection and shut down the database
dbDisconnect(con, shutdown = TRUE)
Java Example
import java.sql.*;

public class DuckDBExample {
    public static void main(String[] args) throws Exception {
        // Create a connection (requires the DuckDB JDBC driver on the classpath)
        Connection conn = DriverManager.getConnection("jdbc:duckdb:");
        Statement stmt = conn.createStatement();
        // Execute queries
        stmt.execute("CREATE TABLE example(id INTEGER, name VARCHAR)");
        stmt.execute("INSERT INTO example VALUES (1, 'DuckDB'), (2, 'Java')");
        ResultSet rs = stmt.executeQuery("SELECT * FROM example");
        // Print results
        while (rs.next()) {
            System.out.println(rs.getInt("id") + ": " + rs.getString("name"));
        }
    }
}
Rust Example
use duckdb::{Connection, Result};

fn main() -> Result<()> {
    // Open an in-memory database (uses the duckdb crate)
    let conn = Connection::open_in_memory()?;
    conn.execute("CREATE TABLE test (id INTEGER, name TEXT)", [])?;
    conn.execute("INSERT INTO test VALUES (1, 'DuckDB'), (2, 'Rust')", [])?;
    // Prepare a query and map each row to an (id, name) tuple
    let mut stmt = conn.prepare("SELECT * FROM test")?;
    let rows = stmt.query_map([], |row| {
        Ok((row.get::<_, i32>(0)?, row.get::<_, String>(1)?))
    })?;
    for row in rows {
        let (id, name) = row?;
        println!("{}: {}", id, name);
    }
    Ok(())
}
Extensions in DuckDB
DuckDB also has a robust extension system that adds some cool functionality. Some of the built-in extensions are:
parquet: Read/write Parquet files (loaded by default)
json: JSON processing functions
sqlite: Connect to SQLite databases
postgres: Connect to PostgreSQL databases (also known as postgres_scanner)
mysql: Connect to MySQL databases (also known as mysql_scanner)
httpfs: Access remote files over HTTP(S) and S3
excel: Load Excel files
Other extensions are also commonly used and are super useful:
arrow: Apache Arrow integration
spatial: Geospatial data types and functions
iceberg: Read Apache Iceberg tables
aws: Amazon S3 integration
These extensions must be installed and loaded before use. To install and load an extension, just add this before your query.
INSTALL extension_name; LOAD extension_name;
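As an example, here is a sketch of reading a Parquet file straight from S3 via the httpfs extension; the bucket, region, and credentials below are placeholders to replace with your own:
INSTALL httpfs;
LOAD httpfs;
-- Placeholder credentials (the aws extension can load these from your environment instead)
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'YOUR_KEY_ID';
SET s3_secret_access_key = 'YOUR_SECRET';
SELECT * FROM read_parquet('s3://your-bucket/data.parquet') LIMIT 5;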
Apache Arrow Integration
Apache Arrow is a columnar memory format for cross-language data exchange. DuckDB integrates well with Arrow; in Python it can even query a pyarrow Table in scope by its variable name:
import pyarrow as pa
import duckdb
# Create Arrow table
data = {
'id': [1, 2, 3],
'name': ['one', 'two', 'three']
}
arrow_table = pa.Table.from_pydict(data)
# Query Arrow table with DuckDB
con = duckdb.connect()
result = con.execute("SELECT * FROM arrow_table WHERE id > 1").fetchall()
print(result)
# Convert DuckDB result to Arrow
arrow_result = con.execute("SELECT * FROM arrow_table").arrow()
CLI Examples for Data Sources
-- Read CSV with automatic schema detection
SELECT * FROM read_csv('data.csv');
-- Read CSV with schema specification
SELECT * FROM read_csv('data.csv', header=true, columns={'id': 'INTEGER', 'name': 'VARCHAR'});
-- Read Parquet file
SELECT * FROM read_parquet('data.parquet');
-- Read with filters pushed down
SELECT * FROM read_parquet('data.parquet', hive_partitioning=1) WHERE date_column = '2023-01-01';
-- Read JSON file
SELECT * FROM read_json('data.json');
-- Read JSON with specific options
SELECT * FROM read_json('data.json', format='auto', columns={'id': 'INTEGER', 'name': 'VARCHAR'});
-- Read Excel file (the excel extension provides read_xlsx)
INSTALL excel;
LOAD excel;
SELECT * FROM read_xlsx('data.xlsx', sheet = 'Sheet1');
-- Read from HTTP source
INSTALL httpfs;
LOAD httpfs;
SELECT * FROM read_csv('https://example.com/data.csv');
Complex Joins Example
In my opinion, one of the coolest features of DuckDB is its ability to join across multiple data sources and types. Once a database is attached, its tables can be joined like any local table. Here's how to join multiple data sources in DuckDB:
-- Install and load the required extensions
INSTALL sqlite;
LOAD sqlite;
INSTALL postgres;
LOAD postgres;
INSTALL mysql;
LOAD mysql;
-- Attach the external databases
ATTACH 'orders.db' AS orders_db (TYPE sqlite);
ATTACH 'host=localhost user=postgres password=password dbname=master' AS pg (TYPE postgres);
ATTACH 'host=localhost user=root password=password database=dis_live' AS my (TYPE mysql);
-- Join multiple sources
SELECT
    c.distid, c.name, c.email,
    j.age, j.color,
    m.code, m.name AS country_name,
    count(o.ordernumber) AS order_count,
    sum(o.order_amount) AS order_total,
    count(oi.itemid) AS item_count
-- CSV file
FROM read_csv('d:/temp3/names.csv') c
-- JSON file
JOIN read_json('d:/temp3/ages.json') j ON j.distid = c.distid
-- SQLite DB
JOIN orders_db.orders o ON o.dist_id = c.distid
-- Postgres DB
JOIN pg.order_items oi ON oi.distid = c.distid
-- MySQL DB
JOIN my.states m ON m.code = o.state AND m.country = 'US'
GROUP BY
    c.distid, c.name, c.email,
    j.age, j.color,
    m.code, m.name;
DuckDB is quickly becoming essential in modern data workflows due to its ability to process analytical queries on large datasets with impressive speed and minimal setup. As an embedded SQL OLAP database, it excels at columnar analytics while integrating seamlessly with Python and R. For data professionals facing growing data volumes, DuckDB offers a lightweight yet powerful alternative to complex data warehouse setups, enabling fast in-memory processing of gigabyte-scale datasets with familiar SQL syntax. Learning DuckDB today means gaining a valuable tool that bridges the gap between local development and production-scale data engineering.
Congratulations! Thanks again for reading. Hopefully you learned something from this article. If so, please like and SHARE. 🙂