Lineage Overview
LeapSQL automatically tracks how data flows through your transformations, providing both table-level and column-level lineage.
What is Data Lineage?
Data lineage traces the origin and transformation of data as it moves through your pipeline:
raw_customers ──┐
├──> stg_customers ──> customer_360
raw_orders ─────┴──> stg_orders ────┘LeapSQL tracks two levels of lineage:
- Table Lineage: Which tables/models depend on which other tables
- Column Lineage: Which specific columns flow into which output columns
Why Lineage Matters
Impact Analysis
Know what breaks when something changes:
If raw_customers schema changes:
└── stg_customers (directly affected)
└── customer_360 (indirectly affected)
└── customer_report (indirectly affected)Debugging
Trace data issues back to their source:
customer_360.lifetime_value is wrong
└── Comes from stg_orders.amount (check here)
└── Comes from raw_orders.amount (or here)Compliance
Understand where sensitive data flows:
raw_customers.email (PII)
└── stg_customers.email
└── customer_360.email
└── email_campaigns.recipientDocumentation
Auto-generated data flow documentation.
How LeapSQL Tracks Lineage
Automatic Detection
LeapSQL parses your SQL to automatically detect lineage:
/*---
name: customer_summary
---*/
SELECT
c.customer_id, -- From stg_customers.customer_id
c.name, -- From stg_customers.name
COUNT(o.order_id) as order_count, -- From stg_orders.order_id
SUM(o.amount) as total_spent -- From stg_orders.amount
FROM stg_customers c
LEFT JOIN stg_orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.nameLeapSQL automatically knows:
- Table lineage:
customer_summarydepends onstg_customersandstg_orders - Column lineage:
customer_summary.customer_id<-stg_customers.customer_idcustomer_summary.name<-stg_customers.namecustomer_summary.order_count<-stg_orders.order_id(via COUNT)customer_summary.total_spent<-stg_orders.amount(via SUM)
No Manual Annotation
Unlike some tools, you don't need to annotate your SQL:
-- Not required: ref('stg_customers')
-- Not required: source('raw', 'customers')
-- Just write normal SQL!
SELECT * FROM stg_customersViewing Lineage
DAG Command
View table-level lineage:
leapsql dagOutput:
raw_customers
└── stg_customers
└── customer_summary
└── customer_report
raw_orders
└── stg_orders
└── customer_summary
└── customer_report
└── order_metricsState Database
Lineage is stored in the state database (.leapsql/state.db):
-- Table dependencies
SELECT * FROM dependencies;
-- Column lineage
SELECT * FROM column_lineage;Lineage Types
Direct Lineage
Column passes through unchanged:
SELECT
customer_id, -- Direct: customer_id ← source.customer_id
name -- Direct: name ← source.name
FROM sourceTransformed Lineage
Column is derived from source columns:
SELECT
UPPER(name) as name, -- Transformed: name ← source.name
first_name || ' ' || last_name as full_name -- Transformed: full_name ← first_name, last_name
FROM sourceAggregated Lineage
Column is aggregated from source:
SELECT
customer_id,
COUNT(*) as order_count, -- Aggregated: order_count ← orders (implicit)
SUM(amount) as total -- Aggregated: total ← source.amount
FROM orders
GROUP BY customer_idFiltered Lineage
Source data is filtered but columns pass through:
SELECT
customer_id,
name
FROM customers
WHERE status = 'active' -- Filter doesn't change lineageComplex Lineage Scenarios
JOINs
LeapSQL tracks lineage through joins:
SELECT
c.customer_id, -- From customers.customer_id
c.name, -- From customers.name
o.order_date, -- From orders.order_date
o.amount -- From orders.amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_idSubqueries
Lineage is traced through subqueries:
SELECT
customer_id,
total_amount
FROM (
SELECT
customer_id,
SUM(amount) as total_amount -- From orders.amount
FROM orders
GROUP BY customer_id
) subqueryCTEs
Common Table Expressions are handled correctly:
WITH customer_totals AS (
SELECT
customer_id,
SUM(amount) as total -- From orders.amount
FROM orders
GROUP BY customer_id
)
SELECT
c.name,
ct.total -- Traced back to orders.amount
FROM customers c
JOIN customer_totals ct ON c.customer_id = ct.customer_idUNION
Lineage tracks multiple sources:
SELECT customer_id, amount FROM orders_2023
UNION ALL
SELECT customer_id, amount FROM orders_2024
-- Lineage: output columns ← orders_2023 AND orders_2024CASE Statements
Conditional lineage is tracked:
SELECT
customer_id,
CASE
WHEN region = 'US' THEN domestic_price
ELSE international_price
END as price
-- Lineage: price ← region, domestic_price, international_price
FROM productsLineage Limitations
Dynamic SQL
Template-generated SQL is analyzed after rendering:
SELECT {{ dynamic_column }} FROM {{ dynamic_table }}
-- Lineage based on rendered valuesExternal Functions
Custom database functions may obscure lineage:
SELECT my_custom_function(col1, col2) as result
-- LeapSQL may not know which columns affect resultComplex Expressions
Very complex expressions may have simplified lineage:
SELECT
COALESCE(
NULLIF(TRIM(a.col1), ''),
b.col2,
'default'
) as value
-- Lineage shows: value ← a.col1, b.col2Best Practices
1. Keep Transformations Traceable
-- Good: Clear lineage
SELECT
customer_id,
revenue / orders as avg_order_value
FROM metrics
-- Less clear: What contributed to this?
SELECT
customer_id,
complex_udf(a, b, c, d, e, f) as mystery_value
FROM metrics2. Use Meaningful Aliases
-- Good: Aliases help understanding
SELECT
src.customer_id,
src.name as customer_name,
SUM(ord.amount) as total_revenue
FROM customers src
JOIN orders ord ON ...
-- Less helpful
SELECT
a.x,
a.y,
SUM(b.z)
FROM t1 a
JOIN t2 b ON ...3. Document Complex Derivations
/*---
name: customer_score
meta:
column_notes:
risk_score: |
Calculated using formula:
- Base: customer_age_days
- Modified by: order_count, return_rate
- See: docs/risk-scoring.md
---*/Next Steps
- Table Lineage - Deep dive into table dependencies
- Column Lineage - Column-level tracking details
- Dependencies - How dependencies are detected