
Extracting Lineage from Kafka

Introduction

Kafka is the nervous system of the modern data stack. Whether you stream events between microservices, fan data out to multiple consumers, or bridge operational systems with analytics platforms, Kafka often serves as the real-time backbone of your entire organization.

Kafka decouples producers from consumers and routes data through topics rather than direct connections. Tracing exactly which fields flow from a producer through a topic and into a downstream consumer requires analysis that goes beyond runtime observation. When a schema change in a Kafka producer silently breaks a downstream consumer, traditional monitoring tools only surface the problem after data has already been corrupted.

Foundational connects to your existing repositories and reads your Kafka producer and consumer code directly. It builds field-level lineage from the code itself and does not require a running cluster or changes to the application.


Why this framework matters in your data stack

Kafka topics can sit anywhere in your data stack: between upstream microservices and data warehouses, between event producers and stream processing jobs, or between operational systems and data lakes. Every topic is a potential point of failure when schemas evolve.

An upstream change, such as a renamed field in a Kafka message schema or a key removed from a producer payload, can silently break downstream consumers, stream processors, and the warehouse tables they populate. Data teams need a way to predict downstream impact before they deploy.
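For example, a consumer that reads a field by name fails the moment a producer renames that field. The sketch below is hypothetical, using the kafka-python client; the broker address, topic, and field names are illustrative:

import json
from kafka import KafkaConsumer

# Hypothetical consumer that breaks when a producer renames a field.
# Assumes a JSON-encoded "orders" topic on a local broker.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # If the producer renames "total_amount" to "amount", this lookup
    # raises KeyError only at runtime, after bad data has already flowed.
    print(event["total_amount"])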

Foundational CI, powered by comprehensive end-to-end lineage, thoroughly checks every pending pull request to ensure it doesn't disrupt downstream transformations, dashboards, ML models, and more.


How Foundational analyzes this framework

Foundational's code engine scans and extracts lineage straight from Kafka producer and consumer source code. Data teams gain full visibility into Kafka-based data flows, including custom serialization logic and schema definitions that runtime tools cannot observe before deployment.

This shift-left approach lets teams review data flow changes in pending pull requests so that changes impacting your data stack do not lead to data incidents, pipeline disruptions, or data quality issues.

Foundational analyzes each pattern to extract field-level lineage:

  • Kafka producers: Code that writes messages to Kafka topics, extracting the fields and schema being produced.

  • Kafka consumers: Code that reads messages from Kafka topics, determining which fields are consumed and where they flow downstream.

  • Schema Registry definitions: Avro, Protobuf, and JSON Schema definitions registered in a Schema Registry are used to resolve field-level lineage across topics.

  • Kafka streams: Stream processing topologies defined in code, traced to map how fields are transformed, filtered, and routed between topics.

  • Other Kafka-based frameworks: Kafka Connect connectors and custom consumer implementations in any supported language.

Multi-step extraction process


Foundational extracts lineage in four steps:

Identify producers, consumers, and schema definitions

Foundational scans accessible repositories to locate Kafka producer and consumer code and schema definition files. It uses heuristics to detect Kafka client usage patterns (for example, calls to KafkaProducer.send() and KafkaConsumer.poll()) as well as Avro and Protobuf schema files. An Avro schema for an order event looks like this:

{
  "type": "record",
  "name": "OrderPlaced",
  "namespace": "com.example.events",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "total_amount", "type": "double"},
    {"name": "status", "type": "string"}
  ]
}
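A producer emitting events that match this schema might look like the sketch below (a minimal example using the kafka-python client with JSON encoding; the broker address and topic name are assumptions). The KafkaProducer.send() call and the field names in the payload are exactly the patterns the heuristics look for:

import json
from kafka import KafkaProducer

# Minimal producer sketch: sends an OrderPlaced event to the "orders" topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("orders", {
    "order_id": "ord-1001",
    "customer_id": "cust-42",
    "total_amount": 99.95,
    "status": "PLACED",
})
producer.flush()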

Resolve topic-to-schema mappings

Foundational links each Kafka topic to its schema definition. It resolves Schema Registry references, inline schema declarations, and schema objects instantiated in producer code. This mapping allows field-level lineage to trace across topic boundaries rather than stopping at the topic name.
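For example, a producer that registers its Avro schema through Confluent's Schema Registry makes the topic-to-schema link explicit in code. This is a sketch using the confluent-kafka client; the registry URL, schema file path, and topic name are assumptions:

from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

# Load the OrderPlaced schema shown earlier (the file path is illustrative).
with open("order_placed.avsc") as f:
    schema_str = f.read()

registry = SchemaRegistryClient({"url": "http://localhost:8081"})
avro_serializer = AvroSerializer(registry, schema_str)

# Binding the Avro serializer to this producer ties the "orders" topic
# to the OrderPlaced schema, which is the mapping Foundational resolves.
producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",
    "value.serializer": avro_serializer,
})
producer.produce(topic="orders", value={
    "order_id": "ord-1001",
    "customer_id": "cust-42",
    "total_amount": 99.95,
    "status": "PLACED",
})
producer.flush()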

Trace field flow through consumers and processors

The engine follows data from the producer through the topic and into each consumer. It analyzes how individual fields are accessed, filtered, transformed, or forwarded, including in Kafka Streams topologies where data is re-keyed, joined, or aggregated across multiple topics.
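The sketch below shows the kind of flow the engine traces: a consumer filters on one field and forwards a subset of fields to another topic. It is hypothetical, using the kafka-python client; topic names are illustrative:

import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    # "status" is read only as a filter; "order_id" and "total_amount"
    # flow onward into the downstream "placed-orders" topic.
    if event["status"] == "PLACED":
        producer.send("placed-orders", {
            "order_id": event["order_id"],
            "total_amount": event["total_amount"],
        })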

Construct and link the graph

A single Kafka service rarely contains the full picture of end-to-end data flow. Foundational merges code analysis results with schema definitions and metadata from connected data systems. You see the exact fields flowing from producers, through topics, and into downstream warehouse tables, data lakes, and applications.


Advantages of Foundational's approach

Foundational gives data teams:

  • Early visibility: Shows how Kafka schema changes impact data flows during development, integrated into your source control (e.g., GitHub, GitLab, Azure Repos, or Bitbucket).

  • Shift-left impact analysis: Detects breaking schema changes in open pull requests before they reach production, so downstream consumers can prepare in advance.

  • Cross-system field-level tracking: Connects Kafka topics to their upstream producers and downstream consumers, including stream processors, warehouse loaders, and data lake writers, in a single lineage graph.

  • Reduced breakages: Prevents downstream consumers, dashboards, and ML pipelines from breaking due to upstream schema changes.


Set up Kafka lineage in Foundational

Setup is seamless.

  1. Connect the repositories that contain your Kafka producer and consumer code, as well as any schema definition files. There is no need to manually annotate code, add instrumentation, or connect to a running Kafka cluster.

  2. From there, the code engine automatically identifies Kafka usage patterns, resolves topic-to-schema mappings, extracts field-level lineage, detects changes in pull requests, and evaluates downstream impact.

