Apache Parquet I/O connector
The Beam SDKs include built-in transforms that can read data from and write data to Apache Parquet files.
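For example, a minimal Java pipeline sketch that reads Parquet files into Avro GenericRecords and writes them back out could look like the following. The Avro schema, file paths, and class name here are illustrative assumptions, not prescribed by the connector:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ParquetIOExample {
  public static void main(String[] args) {
    // Hypothetical Avro schema; replace with the schema of your files.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read all Parquet files matching the pattern into GenericRecords.
    PCollection<GenericRecord> records =
        pipeline.apply(ParquetIO.read(schema).from("/path/to/input-*.parquet"));

    // Write the records back out as Parquet files via FileIO and ParquetIO.sink.
    records.apply(
        FileIO.<GenericRecord>write()
            .via(ParquetIO.sink(schema))
            .to("/path/to/output")
            .withSuffix(".parquet"));

    pipeline.run().waitUntilFinish();
  }
}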
Before you start
To use ParquetIO, add the Maven artifact dependency to your pom.xml file.
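For example, a dependency block like the following could be used. The version shown (2.60.0, matching the Python SDK release noted below) is an assumption; substitute the Beam version your project uses:

<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-io-parquet</artifactId>
    <version>2.60.0</version>
</dependency>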
ParquetIO comes preinstalled with the Apache Beam Python SDK (version 2.60.0).
Using ParquetIO with Spark before 2.4
ParquetIO depends on an API introduced in Apache Parquet 1.10.0. Spark 2.4.x is compatible, and no additional steps are necessary. Older versions of Spark will not work out of the box, since a pre-installed version of the Parquet libraries will take precedence during execution. The following workaround should be applied.
Note: The following technique allows you to execute your pipeline with ParquetIO correctly. The Parquet files that are consumed or generated by this Beam connector should remain interoperable with the other tools on your cluster.
1. Include the Parquet artifact normally and ensure that it brings in the correct version of Parquet as a transitive dependency.
2. Relocate the following packages:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <createDependencyReducedPom>false</createDependencyReducedPom>
    <filters>
      <filter>
        <artifact>*:*</artifact>
        <excludes>
          <exclude>META-INF/*.SF</exclude>
          <exclude>META-INF/*.DSA</exclude>
          <exclude>META-INF/*.RSA</exclude>
        </excludes>
      </filter>
    </filters>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>shaded</shadedClassifierName>
        <relocations>
          <relocation>
            <pattern>org.apache.parquet</pattern>
            <shadedPattern>shaded.org.apache.parquet</shadedPattern>
          </relocation>
          <!-- Some packages are shaded already, and on the original Spark classpath. Shade them more. -->
          <relocation>
            <pattern>shaded.parquet</pattern>
            <shadedPattern>reshaded.parquet</shadedPattern>
          </relocation>
          <relocation>
            <pattern>org.apache.avro</pattern>
            <shadedPattern>shaded.org.apache.avro</shadedPattern>
          </relocation>
        </relocations>
        <transformers>
          <transformer
              implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
This technique has been tested to work on Spark 2.2.3, Spark 2.3.3 and Spark 2.4.3 (although it is optional for Spark 2.4+).