apache_beam.ml.rag.types module
Core types for RAG pipelines.
This module contains the core dataclasses used throughout the RAG pipeline implementation. The primary type is EmbeddableItem, which represents any content that can be embedded and stored in a vector database.
- Types:
Content: Container for embeddable content
Embedding: Vector embedding with optional metadata
EmbeddableItem: Universal container for embeddable content
Chunk: Alias for EmbeddableItem (backward compatibility)
- class apache_beam.ml.rag.types.Content(text: str | None = None)[source]
Bases: object
Container for embeddable content.
- Parameters:
text – Text content to be embedded.
- class apache_beam.ml.rag.types.Embedding(dense_embedding: List[float] | None = None, sparse_embedding: Tuple[List[int], List[float]] | None = None)[source]
Bases: object
Represents vector embeddings with optional metadata.
- Parameters:
dense_embedding – Dense vector representation.
sparse_embedding – Optional sparse vector representation for hybrid search.
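The sparse form pairs a list of ints with a list of floats. The sketch below assumes the first list holds the indices of the non-zero dimensions and the second holds the matching values. That is a common sparse-vector convention, but the text above does not state it. The `Embedding` class here is a local stand-in that mirrors the documented fields rather than the Beam class, and `sparse_to_dict` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Embedding:
    """Local stand-in mirroring the documented fields (not the Beam class)."""
    dense_embedding: Optional[List[float]] = None
    sparse_embedding: Optional[Tuple[List[int], List[float]]] = None


def sparse_to_dict(embedding: Embedding) -> Dict[int, float]:
    """Hypothetical helper: map each non-zero index to its value.

    Assumes sparse_embedding is an (indices, values) pair of equal length.
    """
    if embedding.sparse_embedding is None:
        return {}
    indices, values = embedding.sparse_embedding
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    return dict(zip(indices, values))


# An embedding carrying both representations, as used for hybrid search.
hybrid = Embedding(
    dense_embedding=[0.1, 0.2, 0.3],
    sparse_embedding=([2, 7], [0.5, 1.5]),
)
weights = sparse_to_dict(hybrid)
```

Both fields default to None, so an `Embedding` may carry only a dense vector, only a sparse one, or both.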
- class apache_beam.ml.rag.types.EmbeddableItem(content: ~apache_beam.ml.rag.types.Content, id: str = <factory>, index: int = 0, metadata: ~typing.Dict[str, ~typing.Any] = <factory>, embedding: ~apache_beam.ml.rag.types.Embedding | None = None)[source]
Bases: object
Universal container for embeddable content.
Represents any content that can be embedded and stored in a vector database. Use factory methods for convenient construction, or construct directly with a Content object.
Examples
- Text (via factory):
    item = EmbeddableItem.from_text(
        "hello world", metadata={'src': 'doc'})
- Text (direct, equivalent to old Chunk usage):
    item = EmbeddableItem(content=Content(text="hello"), index=3)
- Parameters:
content – The content to embed.
id – Unique identifier.
index – Position within source document (for chunking use cases).
metadata – Additional metadata (e.g., document source, language).
embedding – Embedding populated by the embedding step.
- classmethod from_text(text: str, *, id: str | None = None, index: int = 0, metadata: Dict[str, Any] | None = None) → EmbeddableItem[source]
Create an EmbeddableItem with text content.
- Parameters:
text – The text content to embed.
id – Unique identifier (auto-generated if not provided).
index – Position within source document (for chunking use cases).
metadata – Additional metadata (e.g., document source, language).
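The two construction paths can be sketched with stand-in dataclasses that mirror the documented signatures. This is not the Beam source. In particular, the `<factory>` default for `id` is assumed here to produce a UUID string, and `metadata` is assumed to default to a fresh empty dict.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Content:
    text: Optional[str] = None


@dataclass
class Embedding:
    dense_embedding: Optional[List[float]] = None
    sparse_embedding: Optional[Tuple[List[int], List[float]]] = None


@dataclass
class EmbeddableItem:
    """Stand-in mirroring the documented signature (not the Beam source)."""
    content: Content
    # Assumption: the documented "<factory>" default generates a UUID string.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    index: int = 0
    # default_factory gives every instance its own dict, never a shared one.
    metadata: Dict[str, Any] = field(default_factory=dict)
    embedding: Optional[Embedding] = None

    @classmethod
    def from_text(
        cls,
        text: str,
        *,
        id: Optional[str] = None,
        index: int = 0,
        metadata: Optional[Dict[str, Any]] = None,
    ) -> "EmbeddableItem":
        """Wrap plain text in Content and fill in the documented defaults."""
        return cls(
            content=Content(text=text),
            id=id if id is not None else str(uuid.uuid4()),
            index=index,
            metadata=metadata if metadata is not None else {},
        )


# Factory path: the caller never touches Content directly.
via_factory = EmbeddableItem.from_text("hello world", metadata={"src": "doc"})

# Direct path: construct Content explicitly, as older Chunk-based code did.
direct = EmbeddableItem(content=Content(text="hello"), index=3)
```

In both paths `embedding` starts as None and is expected to be populated later by the embedding step.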
- apache_beam.ml.rag.types.Chunk
alias of EmbeddableItem
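If the alias is a plain name binding, as "alias of" suggests, then code written against `Chunk` gets `EmbeddableItem` instances. A minimal sketch under that assumption follows. It uses a deliberately simplified stand-in class with a single hypothetical `text` field, since the real class has more fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmbeddableItem:
    """Simplified stand-in; the documented class has more fields."""
    text: Optional[str] = None


# A plain assignment makes Chunk a second name for the same class object.
Chunk = EmbeddableItem

# Legacy-style construction yields an ordinary EmbeddableItem.
legacy = Chunk(text="hello")
```

Because both names refer to one class, `isinstance` checks and type annotations written against either name accept the same objects.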