Built-in I/O Transforms

Apache Parquet I/O connector

The Beam SDKs include built-in transforms that can read data from and write data to Apache Parquet files.

Before you start

To use ParquetIO, add the Maven artifact dependency to your pom.xml file.

<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-io-parquet</artifactId>
    <version>2.60.0</version>
</dependency>
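With the dependency in place, a minimal sketch of reading and writing Parquet files with the Java SDK could look like the following. The Avro schema, the file paths, and the class name are placeholders for illustration.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ParquetExample {
  public static void main(String[] args) {
    // Placeholder Avro schema describing the records stored in the Parquet files.
    Schema schema = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"User\", \"fields\": ["
            + "{\"name\": \"name\", \"type\": \"string\"},"
            + "{\"name\": \"age\", \"type\": \"int\"}]}");

    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read the matching Parquet files into a PCollection of Avro GenericRecords.
    PCollection<GenericRecord> records =
        pipeline.apply("ReadParquet",
            ParquetIO.read(schema).from("/path/to/input-*.parquet"));

    // Write the records back out as Parquet files via FileIO with a ParquetIO sink.
    records.apply("WriteParquet",
        FileIO.<GenericRecord>write()
            .via(ParquetIO.sink(schema))
            .to("/path/to/output/")
            .withSuffix(".parquet"));

    pipeline.run().waitUntilFinish();
  }
}

Run this like any other Beam pipeline; the runner is taken from the pipeline options passed on the command line.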

ParquetIO comes preinstalled with the Apache Beam Python SDK (2.60.0).

Using ParquetIO with Spark before 2.4

ParquetIO depends on an API introduced in Apache Parquet 1.10.0. Spark 2.4.x is compatible and needs no additional steps. Older versions of Spark will not work out of the box, since the Parquet libraries pre-installed on the cluster take precedence during execution. In that case, apply the following workaround.

Note: The following technique allows you to execute your pipeline with ParquetIO correctly. The Parquet files that are consumed or generated by this Beam connector should remain interoperable with the other tools on your cluster.

Include the Parquet artifact normally and ensure that it brings in the correct version of Parquet as a transitive dependency.

<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-io-parquet</artifactId>
    <version>${beam.version}</version>
</dependency>

Relocate the following packages:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <createDependencyReducedPom>false</createDependencyReducedPom>
    <filters>
      <filter>
        <artifact>*:*</artifact>
        <excludes>
          <exclude>META-INF/*.SF</exclude>
          <exclude>META-INF/*.DSA</exclude>
          <exclude>META-INF/*.RSA</exclude>
        </excludes>
      </filter>
    </filters>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>shaded</shadedClassifierName>
        <relocations>
          <relocation>
            <pattern>org.apache.parquet</pattern>
            <shadedPattern>shaded.org.apache.parquet</shadedPattern>
          </relocation>
          <!-- Some packages are shaded already, and on the original spark classpath. Shade them more. -->
          <relocation>
            <pattern>shaded.parquet</pattern>
            <shadedPattern>reshaded.parquet</shadedPattern>
          </relocation>
          <relocation>
            <pattern>org.apache.avro</pattern>
            <shadedPattern>shaded.org.apache.avro</shadedPattern>
          </relocation>
        </relocations>
        <transformers>
          <transformer
            implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
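With shadedArtifactAttached and shadedClassifierName set as above, running mvn package should produce an additional jar with the shaded classifier (for example, target/<your-artifact>-<version>-shaded.jar) alongside the regular artifact. Submit that shaded jar to Spark so that the relocated Parquet and Avro classes are used instead of the versions pre-installed on the cluster.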

This technique has been tested to work on Spark 2.2.3, Spark 2.3.3 and Spark 2.4.3 (although it is optional for Spark 2.4+).