apache_beam.ml.rag.chunking.langchain module
class apache_beam.ml.rag.chunking.langchain.LangChainChunker(text_splitter: TextSplitter, document_field: str, metadata_fields: List[str], chunk_id_fn: Callable[[Chunk], str] | None = None)[source]
Bases: ChunkingTransformProvider
A ChunkingTransformProvider that uses LangChain text splitters.
This provider integrates LangChain’s text splitting capabilities into Beam’s MLTransform framework. It supports various text splitting strategies through LangChain’s TextSplitter interface, including recursive character splitting and other methods.
The provider:

- Takes documents with text content and metadata
- Splits text using the configured LangChain splitter
- Preserves document metadata in the resulting chunks
- Assigns unique IDs to chunks (configurable via chunk_id_fn)
Example usage:

```python
import apache_beam as beam
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=100, chunk_overlap=20
)
chunker = LangChainChunker(
    text_splitter=splitter,
    document_field='text',
    metadata_fields=['source']
)

with beam.Pipeline() as p:
    chunks = (
        p
        | beam.Create([{'text': 'long document…', 'source': 'doc.txt'}])
        | MLTransform(…).with_transform(chunker)
    )
```
Parameters:

- text_splitter – A LangChain TextSplitter instance that defines how documents are split into chunks.
- document_field – Name of the field in each input dictionary that contains the text to split.
- metadata_fields – List of field names to copy from input documents to chunk metadata. These fields are preserved in each chunk created from the document.
- chunk_id_fn – Optional function that takes a Chunk and returns a str, used to generate chunk IDs. If not provided, random UUIDs are used.
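As a sketch of what a custom chunk_id_fn might look like: the function below derives a deterministic ID by hashing the chunk text together with a `source` metadata field, so re-running a pipeline produces stable IDs instead of random UUIDs. The minimal `Chunk`/`Content` dataclasses here are stand-ins so the snippet is self-contained; the real classes live in apache_beam.ml.rag.types, and the exact field layout assumed here (`content.text`, `metadata`) should be checked against that module.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict


# Stand-ins for apache_beam.ml.rag.types, used only to keep this
# snippet self-contained; field names are an assumption.
@dataclass
class Content:
    text: str


@dataclass
class Chunk:
    content: Content
    metadata: Dict[str, Any] = field(default_factory=dict)


def deterministic_chunk_id(chunk: Chunk) -> str:
    """Hash the source document name and chunk text into a stable,
    16-character hex ID, so identical inputs always get the same ID."""
    key = f"{chunk.metadata.get('source', '')}:{chunk.content.text}"
    return hashlib.sha256(key.encode('utf-8')).hexdigest()[:16]


# Passed to the chunker as: LangChainChunker(..., chunk_id_fn=deterministic_chunk_id)
```

A deterministic ID function like this is useful when chunks are upserted into a vector store, since reprocessing the same document overwrites the old entries rather than duplicating them.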
get_splitter_transform() → PTransform[PCollection[Dict[str, Any]], PCollection[Chunk]][source]