apache_beam.ml.rag.types module

Core types for RAG pipelines. This module contains the dataclasses used throughout the RAG pipeline implementation, including the Chunk and Embedding types that define the data contracts between pipeline stages.

class apache_beam.ml.rag.types.Content(text: str | None = None)[source]

Bases: object

Container for embeddable content. Add new types as and when necessary.

Parameters:

text – Text content to be embedded

text: str | None = None
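
A minimal sketch of wrapping raw strings as Content objects. The Content class below is a local stand-in that mirrors the documented signature so the example runs without Beam installed; in a real pipeline it would be imported from apache_beam.ml.rag.types. The helper to_contents is hypothetical and not part of the module.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Content:
    """Local stand-in mirroring the documented signature (assumption)."""
    text: Optional[str] = None


def to_contents(raw_texts: List[str]) -> List[Content]:
    """Hypothetical helper: wrap non-blank strings as Content objects."""
    return [Content(text=t.strip()) for t in raw_texts if t.strip()]


contents = to_contents(["Apache Beam is a unified model.", "   ", "RAG needs chunks."])

# With no arguments, text falls back to its documented default of None.
empty = Content()
```

Because text is optional, downstream stages should presumably check for None before embedding.
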
class apache_beam.ml.rag.types.Embedding(dense_embedding: List[float] | None = None, sparse_embedding: Tuple[List[int], List[float]] | None = None)[source]

Bases: object

Represents vector embeddings.

Parameters:
  • dense_embedding – Dense vector representation

  • sparse_embedding – Optional sparse vector representation for hybrid search

dense_embedding: List[float] | None = None
sparse_embedding: Tuple[List[int], List[float]] | None = None
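
A hedged sketch of how the two fields might be populated and compared. It assumes the sparse tuple is laid out as (indices, values), which the type Tuple[List[int], List[float]] suggests but the text above does not state. The Embedding class is a local stand-in mirroring the documented signature, and sparse_dot is a hypothetical helper, not part of the module.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

SparseVector = Tuple[List[int], List[float]]


@dataclass
class Embedding:
    """Local stand-in mirroring the documented signature (assumption)."""
    dense_embedding: Optional[List[float]] = None
    sparse_embedding: Optional[SparseVector] = None


def sparse_dot(a: SparseVector, b: SparseVector) -> float:
    """Hypothetical helper: dot product of two (indices, values) vectors."""
    b_lookup = dict(zip(b[0], b[1]))
    return sum(v * b_lookup.get(i, 0.0) for i, v in zip(a[0], a[1]))


# Assumed layout: indices [2, 7] carry the weights [1.0, 2.0].
doc = Embedding(dense_embedding=[0.5, 0.5, 0.0], sparse_embedding=([2, 7], [1.0, 2.0]))
query = Embedding(dense_embedding=[1.0, 0.0, 0.0], sparse_embedding=([7, 9], [0.5, 3.0]))

# Dense similarity: plain dot product over aligned positions.
dense_score = sum(x * y for x, y in zip(doc.dense_embedding, query.dense_embedding))
# Sparse similarity: only index 7 is shared by both vectors.
sparse_score = sparse_dot(doc.sparse_embedding, query.sparse_embedding)
```

A hybrid search would typically combine the two scores; how they are weighted is left to the retrieval backend and is not defined by this module.
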
class apache_beam.ml.rag.types.Chunk(content: Content, id: str = <factory>, index: int = 0, metadata: Dict[str, Any] = <factory>, embedding: Embedding | None = None)[source]

Bases: object

Represents a chunk of embeddable content with metadata.

Parameters:
  • content – The actual content of the chunk

  • id – Unique identifier for the chunk

  • index – Index of this chunk within the original document

  • metadata – Additional metadata about the chunk (e.g., document source)

  • embedding – Vector embeddings of the content

content: Content
id: str
index: int = 0
metadata: Dict[str, Any]
embedding: Embedding | None = None
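
The fields above can be sketched end to end as follows. The three classes are local stand-ins mirroring the documented signatures so the example runs without Beam installed. Two things are assumptions: that the `<factory>` default for id generates a fresh unique string (a UUID is used here) and that metadata defaults to an empty dict. The splitter split_into_chunks is hypothetical and not part of the module.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Content:
    text: Optional[str] = None


@dataclass
class Embedding:
    dense_embedding: Optional[List[float]] = None
    sparse_embedding: Optional[Tuple[List[int], List[float]]] = None


@dataclass
class Chunk:
    """Local stand-in mirroring the documented signature (assumption)."""
    content: Content
    # Assumption: the documented <factory> yields a unique string per chunk.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    index: int = 0
    # Assumption: the documented <factory> yields an empty dict.
    metadata: Dict[str, Any] = field(default_factory=dict)
    embedding: Optional[Embedding] = None


def split_into_chunks(text: str, source: str, size: int = 20) -> List[Chunk]:
    """Hypothetical fixed-width splitter, for illustration only."""
    pieces = [text[i:i + size] for i in range(0, len(text), size)]
    return [
        Chunk(content=Content(text=piece), index=i, metadata={"source": source})
        for i, piece in enumerate(pieces)
    ]


document = "Beam pipelines pass chunks between RAG stages."
chunks = split_into_chunks(document, source="intro.txt")

# A later embedding stage fills in the optional field.
# The vector here is a placeholder, not a real model output.
for chunk in chunks:
    chunk.embedding = Embedding(dense_embedding=[float(len(chunk.content.text))])
```

Leaving embedding as None until an embedding stage runs lets the same Chunk type flow through both the chunking and the embedding stages.
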