Apache Parquet I/O connector
The Beam SDKs include built-in transforms that can read data from and write data to Apache Parquet files.
Before you start
To use ParquetIO with the Java SDK, add the Maven artifact dependency to your pom.xml file.
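For a Maven build, the dependency entry might look like the sketch below. The coordinates `org.apache.beam:beam-sdks-java-io-parquet` are the published ParquetIO artifact. The version shown is only the number that appears on this page and should be treated as a placeholder for whichever Beam release your pipeline uses.

```xml
<!-- Sketch of a pom.xml dependency entry for ParquetIO. -->
<!-- The version is a placeholder: use the Beam release your pipeline is built against. -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-io-parquet</artifactId>
  <version>2.41.0</version>
</dependency>
```

Keeping this version aligned with the other Beam artifacts in the same POM avoids mixing SDK releases on the classpath.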
ParquetIO comes preinstalled with the Apache Beam Python SDK.
Using ParquetIO with Spark before 2.4
ParquetIO depends on an API introduced in Apache Parquet 1.10.0. Spark 2.4.x is compatible and no additional steps are necessary. Older versions of Spark will not work out of the box since a pre-installed version of Parquet libraries will take precedence during execution. The following workaround should be applied.
Note: The following technique allows you to execute your pipeline with ParquetIO correctly. The Parquet files that are consumed or generated by this Beam connector should remain interoperable with the other tools on your cluster.
Include the Parquet artifact normally and ensure that it brings in the correct version of Parquet as a transitive dependency.
Relocate the following packages:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <createDependencyReducedPom>false</createDependencyReducedPom>
    <filters>
      <filter>
        <artifact>*:*</artifact>
        <excludes>
          <exclude>META-INF/*.SF</exclude>
          <exclude>META-INF/*.DSA</exclude>
          <exclude>META-INF/*.RSA</exclude>
        </excludes>
      </filter>
    </filters>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>shaded</shadedClassifierName>
        <relocations>
          <relocation>
            <pattern>org.apache.parquet</pattern>
            <shadedPattern>shaded.org.apache.parquet</shadedPattern>
          </relocation>
          <!-- Some packages are shaded already, and on the original spark classpath. Shade them more. -->
          <relocation>
            <pattern>shaded.parquet</pattern>
            <shadedPattern>reshaded.parquet</shadedPattern>
          </relocation>
          <relocation>
            <pattern>org.apache.avro</pattern>
            <shadedPattern>shaded.org.apache.avro</shadedPattern>
          </relocation>
        </relocations>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
This technique has been tested to work on Spark 2.2.3, Spark 2.3.3 and Spark 2.4.3 (although it is optional for Spark 2.4+).
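The first step of the workaround asks you to ensure that the correct Parquet version arrives as a transitive dependency. If your build resolves an older Parquet, one possible way to force the version is a `dependencyManagement` entry, sketched below. The choice of `parquet-avro` and `parquet-hadoop` as the artifacts to pin is an assumption for illustration, and 1.10.0 is simply the minimum version named above; check your own dependency tree to see which Parquet artifacts and versions actually apply.

```xml
<!-- Sketch only: pins Parquet artifacts to the minimum version ParquetIO needs. -->
<!-- The artifacts listed here are assumed; adjust to match your dependency tree. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-avro</artifactId>
      <version>1.10.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-hadoop</artifactId>
      <version>1.10.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Pinning only affects which version Maven resolves. The relocation in the shade plugin configuration is still what keeps the older Parquet classes on the Spark classpath from taking precedence.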
Last updated on 2020/08/14