apache_beam.ml.rag.types module

Core types for RAG pipelines.

This module contains the core dataclasses used throughout the RAG pipeline implementation. The primary type is EmbeddableItem, which represents any content that can be embedded and stored in a vector database.

Types:
  • Content: Container for embeddable content

  • Embedding: Dense and/or sparse vector embedding

  • EmbeddableItem: Universal container for embeddable content

  • Chunk: Alias for EmbeddableItem (backward compatibility)

class apache_beam.ml.rag.types.Content(text: str | None = None)[source]

Bases: object

Container for embeddable content.

Parameters:

text – Text content to be embedded.

text: str | None = None
class apache_beam.ml.rag.types.Embedding(dense_embedding: List[float] | None = None, sparse_embedding: Tuple[List[int], List[float]] | None = None)[source]

Bases: object

Represents a vector embedding as a dense vector, a sparse vector, or both.

Parameters:
  • dense_embedding – Dense vector representation.

  • sparse_embedding – Optional sparse vector representation for hybrid search.

dense_embedding: List[float] | None = None
sparse_embedding: Tuple[List[int], List[float]] | None = None
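
The type of sparse_embedding, Tuple[List[int], List[float]], reads most naturally as two parallel lists: the non-zero indices and their values. The sketch below assumes that layout (the source gives only the type) and uses a local stand-in dataclass instead of importing Beam. sparse_to_dense is a hypothetical helper for illustration, not part of this module.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Embedding:
    """Local stand-in mirroring the documented fields (not Beam's class)."""
    dense_embedding: Optional[List[float]] = None
    sparse_embedding: Optional[Tuple[List[int], List[float]]] = None


def sparse_to_dense(sparse: Tuple[List[int], List[float]],
                    size: int) -> List[float]:
    """Hypothetical helper: expand (indices, values) into a dense list."""
    indices, values = sparse
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


emb = Embedding(
    dense_embedding=[0.1, 0.2, 0.3],
    # Assumed layout: (indices, values)
    sparse_embedding=([0, 4], [0.5, 1.5]),
)
expanded = sparse_to_dense(emb.sparse_embedding, size=6)
```

Both fields default to None, so an Embedding can carry only a dense vector, only a sparse one, or both for hybrid search.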
class apache_beam.ml.rag.types.EmbeddableItem(content: ~apache_beam.ml.rag.types.Content, id: str = <factory>, index: int = 0, metadata: ~typing.Dict[str, ~typing.Any] = <factory>, embedding: ~apache_beam.ml.rag.types.Embedding | None = None)[source]

Bases: object

Universal container for embeddable content.

Represents any content that can be embedded and stored in a vector database. Use factory methods for convenient construction, or construct directly with a Content object.

Examples

Text (via factory):

    item = EmbeddableItem.from_text(
        "hello world", metadata={'src': 'doc'})

Text (direct, equivalent to old Chunk usage):

    item = EmbeddableItem(content=Content(text="hello"), index=3)

Parameters:
  • content – The content to embed.

  • id – Unique identifier.

  • index – Position within source document (for chunking use cases).

  • metadata – Additional metadata (e.g., document source, language).

  • embedding – Embedding populated by the embedding step.

content: Content
id: str
index: int = 0
metadata: Dict[str, Any]
embedding: Embedding | None = None
classmethod from_text(text: str, *, id: str | None = None, index: int = 0, metadata: Dict[str, Any] | None = None) EmbeddableItem[source]

Create an EmbeddableItem with text content.

Parameters:
  • text – The text content to embed

  • id – Unique identifier (auto-generated if not provided)

  • index – Position within source document (for chunking)

  • metadata – Additional metadata

property dense_embedding: List[float] | None
property sparse_embedding: Tuple[List[int], List[float]] | None
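
To make the shape of EmbeddableItem concrete, here is a minimal stand-in built only from the signatures documented above. It is a sketch, not Beam's source. Two behaviors are assumptions: that the id factory produces a random UUID string, and that the dense_embedding and sparse_embedding properties delegate to the embedding field and return None while it is unset.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Content:
    text: Optional[str] = None


@dataclass
class Embedding:
    dense_embedding: Optional[List[float]] = None
    sparse_embedding: Optional[Tuple[List[int], List[float]]] = None


@dataclass
class EmbeddableItem:
    """Stand-in mirroring the documented signature; not Beam's source."""
    content: Content
    # Assumption: the <factory> default for id yields a random UUID string.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    index: int = 0
    metadata: Dict[str, Any] = field(default_factory=dict)
    embedding: Optional[Embedding] = None

    @classmethod
    def from_text(cls, text: str, *, id: Optional[str] = None,
                  index: int = 0,
                  metadata: Optional[Dict[str, Any]] = None
                  ) -> "EmbeddableItem":
        return cls(
            content=Content(text=text),
            id=id if id is not None else str(uuid.uuid4()),
            index=index,
            metadata=metadata if metadata is not None else {},
        )

    @property
    def dense_embedding(self) -> Optional[List[float]]:
        # Assumption: delegate to the embedding field, None when unset.
        return self.embedding.dense_embedding if self.embedding else None

    @property
    def sparse_embedding(self) -> Optional[Tuple[List[int], List[float]]]:
        return self.embedding.sparse_embedding if self.embedding else None


# Factory construction and the equivalent direct construction.
item = EmbeddableItem.from_text("hello world", metadata={"src": "doc"})
same = EmbeddableItem(
    content=Content(text="hello world"), id=item.id, metadata={"src": "doc"})

# Simulate the embedding step populating the item.
item.embedding = Embedding(dense_embedding=[0.1, 0.2])
```

In this sketch, metadata uses a default factory so that items never share one mutable dict, which is the usual reason a dataclass signature shows <factory> instead of a literal default.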
apache_beam.ml.rag.types.Chunk

alias of EmbeddableItem
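
Because Chunk is an alias and not a subclass, code written against the old name keeps working and produces EmbeddableItem instances. The sketch below shows that aliasing pattern with a deliberately trimmed stand-in class; its fields are simplified for illustration and do not match the real signature.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmbeddableItem:
    """Trimmed stand-in for illustration; the real class has more fields."""
    text: Optional[str] = None
    index: int = 0


# A plain assignment creates an alias: both names refer to the same class.
Chunk = EmbeddableItem

legacy = Chunk(text="hello", index=3)
```

With an alias, isinstance checks and type comparisons behave the same under either name, so existing pipelines need no changes beyond, optionally, updating the import.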