public class CsvIO
extends java.lang.Object
PTransforms for reading and writing CSV files.
 Reading from CSV files is not yet implemented. Please see https://github.com/apache/beam/issues/24552.
A CSVFormat must meet the following conditions to be considered valid when reading CSV:

- String[] header - must contain at least one column name, and all column names must be non-empty.
- boolean allowDuplicateHeaderNames - must be false.
- boolean allowMissingColumnNames - must be false.
- boolean ignoreHeaderCase - must be false.
- boolean skipHeaderRecord - must be false. The header is already accounted for during parsing.
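For example, a CSVFormat meeting these conditions could be configured as follows. This is a minimal sketch with assumed column names, not an excerpt from the Beam documentation:

```java
// A CSVFormat configured to satisfy the read-side conditions listed above.
// The column names ("bank", "purchaseAmount", "transactionId") are assumptions for illustration.
CSVFormat readFormat =
    CSVFormat.DEFAULT
        .withHeader("bank", "purchaseAmount", "transactionId") // at least one non-empty column name
        .withAllowDuplicateHeaderNames(false)
        .withAllowMissingColumnNames(false)
        .withIgnoreHeaderCase(false)
        .withSkipHeaderRecord(false); // the header is already accounted for during parsing
```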
The following CSVFormat parameters are either not relevant for parsing CSV or are validated satisfactorily by the Apache Commons CSV library:

- boolean autoFlush
- char commentMarker
- char delimiter
- char escape
- char quote
- org.apache.commons.csv.QuoteMode quoteMode
- String nullString
- char recordSeparator
- java.lang.Object... headerComments
- boolean ignoreEmptyLines
- boolean ignoreSurroundingSpaces
- boolean trim
- boolean skipHeaderRecord
- boolean trailingDelimiter
To write a PCollection to one or more CSV files, use CsvIO.Write, obtained via writeRows(java.lang.String, org.apache.commons.csv.CSVFormat) or write(java.lang.String, org.apache.commons.csv.CSVFormat). CsvIO.Write supports writing Row or custom Java types using an inferred Schema. Examples below show both scenarios. See the Beam Programming Guide on inferring schemas for more information on how to enable Beam to infer a Schema from a custom Java type.
 
CsvIO.Write only supports writing the parts of Schema-aware types that do not contain any nested Schema.FieldTypes, such as Schema.TypeName.ROW or repeated Schema.TypeName.ARRAY types. See VALID_FIELD_TYPE_SET for valid Schema.FieldTypes.
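As an illustration, the sketch below (with assumed field names) contrasts a flat Schema that CsvIO.Write can handle with one containing a nested ROW field and a repeated ARRAY field, which it cannot:

```java
// A flat Schema with only scalar field types: writable by CsvIO.Write.
Schema writableSchema =
    Schema.builder()
        .addStringField("bank")
        .addDoubleField("purchaseAmount")
        .addInt64Field("transactionId")
        .build();

// A Schema containing a nested ROW field and a repeated ARRAY field:
// these fields are not supported by CsvIO.Write.
Schema nestedSchema =
    Schema.builder()
        .addStringField("bank")
        .addRowField("details", writableSchema)          // nested Schema.TypeName.ROW
        .addArrayField("tags", Schema.FieldType.STRING)  // repeated Schema.TypeName.ARRAY
        .build();
```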
 
Suppose we have the following Transaction class annotated with
 @DefaultSchema(JavaBeanSchema.class) so that Beam can infer its Schema:
 
 @DefaultSchema(JavaBeanSchema.class)
 public class Transaction {
   public Transaction() { … }
   public Long getTransactionId() { … }
   public void setTransactionId(Long transactionId) { … }
   public String getBank() { … }
   public void setBank(String bank) { … }
   public double getPurchaseAmount() { … }
   public void setPurchaseAmount(double purchaseAmount) { … }
 }
 From a PCollection<Transaction>, CsvIO.Write can write one or many CSV files, automatically creating the header based on its inferred Schema.
 
 PCollection<Transaction> transactions = ...
 transactions.apply(CsvIO.<Transaction>write("path/to/folder/prefix", CSVFormat.DEFAULT));
 The resulting CSV files will look like the following, with the header repeated for every file. By default, CsvIO.Write writes all fields in sorted order of the field names.
 
 bank,purchaseAmount,transactionId
 A,10.23,12345
 B,54.65,54321
 C,11.76,98765
 To control the order and subset of fields that CsvIO.Write writes, use CSVFormat.withHeader(java.lang.String...). Note, however, the following constraints:

- The header column names must match field names in the Schema; matching is case sensitive.
- The header may only include Schema fields that are valid Schema.FieldTypes; see VALID_FIELD_TYPE_SET.
- CSVFormat only allows repeated header columns when CSVFormat.withAllowDuplicateHeaderNames() is set.
 The following example shows the use of CSVFormat.withHeader(java.lang.String...) to control the order and subset of Transaction fields.
 
 PCollection<Transaction> transactions = ...
 transactions.apply(
  CsvIO
    .<Transaction>write("path/to/folder/prefix", CSVFormat.DEFAULT.withHeader("transactionId", "purchaseAmount"))
 );
 The resulting CSV files will look like the following, where the header is repeated for every file but only includes the listed subset of fields, in their listed order.
 transactionId,purchaseAmount
 12345,10.23
 54321,54.65
 98765,11.76
 In addition to header customization, CsvIO.Write supports CSVFormat.withHeaderComments(java.lang.Object...) as shown below. Note that CSVFormat.withCommentMarker(char) is
 required when specifying header comments.
 
 PCollection<Transaction> transactions = ...
 transactions
    .apply(
        CsvIO.<Transaction>write("path/to/folder/prefix",
        CSVFormat.DEFAULT
          .withCommentMarker('#')
          .withHeaderComments("Bank Report", "1970-01-01", "Operator: John Doe"))
    );
 The resulting CSV files will look like the following, where the header and header comments are repeated for every shard file.
 # Bank Report
 # 1970-01-01
 # Operator: John Doe
 bank,purchaseAmount,transactionId
 A,10.23,12345
 B,54.65,54321
 C,11.76,98765
 A PCollection of Rows works just like the custom Java types illustrated above, except we use writeRows(java.lang.String, org.apache.commons.csv.CSVFormat) as shown below for the same Transaction class. We derive Transaction's Schema using a DefaultSchema.DefaultSchemaProvider. Note that hard-coding the Rows below is for illustration purposes. Developers are instead encouraged to take advantage of DefaultSchema.DefaultSchemaProvider.toRowFunction(org.apache.beam.sdk.values.TypeDescriptor<T>), as sketched after the example output below.
 
 DefaultSchemaProvider defaultSchemaProvider = new DefaultSchemaProvider();
 Schema schema = defaultSchemaProvider.schemaFor(TypeDescriptor.of(Transaction.class));
 PCollection<Row> transactions = pipeline.apply(Create.of(
  Row
    .withSchema(schema)
    .withFieldValue("bank", "A")
    .withFieldValue("purchaseAmount", 10.23)
    .withFieldValue("transactionId", "12345")
    .build(),
  Row
    .withSchema(schema)
    .withFieldValue("bank", "B")
    .withFieldValue("purchaseAmount", 54.65)
    .withFieldValue("transactionId", "54321")
    .build(),
  Row
    .withSchema(schema)
    .withFieldValue("bank", "C")
    .withFieldValue("purchaseAmount", 11.76)
    .withFieldValue("transactionId", "98765")
    .build()
 ));
 transactions.apply(
  CsvIO
    .writeRows("gs://bucket/path/to/folder/prefix", CSVFormat.DEFAULT)
 );
 Writing the transactions PCollection of Rows would yield the following CSV
 file content.
 
 bank,purchaseAmount,transactionId
 A,10.23,12345
 B,54.65,54321
 C,11.76,98765
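 As referenced above, the following is a minimal sketch of deriving the Rows with DefaultSchema.DefaultSchemaProvider.toRowFunction(org.apache.beam.sdk.values.TypeDescriptor<T>) instead of hard-coding them. The Transaction instances and the MapElements conversion step are assumptions for illustration:

```java
// A sketch, not an excerpt from the Beam docs: convert Transaction objects to Rows
// using the provider's toRowFunction, then write them with CsvIO.writeRows.
DefaultSchemaProvider defaultSchemaProvider = new DefaultSchemaProvider();
TypeDescriptor<Transaction> typeDescriptor = TypeDescriptor.of(Transaction.class);
Schema schema = defaultSchemaProvider.schemaFor(typeDescriptor);
SerializableFunction<Transaction, Row> toRow =
    defaultSchemaProvider.toRowFunction(typeDescriptor);

PCollection<Row> transactions =
    pipeline
        .apply(Create.of(transactionA, transactionB, transactionC)) // assumed Transaction instances
        .apply(MapElements.into(TypeDescriptors.rows()).via(toRow))
        .setRowSchema(schema);

transactions.apply(CsvIO.writeRows("gs://bucket/path/to/folder/prefix", CSVFormat.DEFAULT));
```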
 CsvIO.Write does not support the following CSVFormat properties and will throw an
 IllegalArgumentException.
| Modifier and Type | Class and Description |
|---|---|
| static class | CsvIO.Write<T>: PTransform for writing CSV files. |

| Modifier and Type | Field and Description |
|---|---|
| static java.util.Set<Schema.FieldType> | VALID_FIELD_TYPE_SET: The valid Schema.FieldType from which CsvIO converts CSV records to the fields. |

| Constructor and Description |
|---|
| CsvIO() |

| Modifier and Type | Method and Description |
|---|---|
| static <T> CsvIOParse<T> | parse(java.lang.Class<T> klass, CSVFormat csvFormat): Instantiates a CsvIOParse for parsing CSV string records into custom Schema-mapped Class<T>es from the records' assumed CsvFormat. |
| static CsvIOParse<Row> | parseRows(Schema schema, CSVFormat csvFormat): Instantiates a CsvIOParse for parsing CSV string records into Rows from the records' assumed CsvFormat and expected Schema. |
| static <T> CsvIO.Write<T> | write(java.lang.String to, CSVFormat csvFormat): Instantiates a CsvIO.Write for writing user types in CSVFormat format. |
| static CsvIO.Write<Row> | writeRows(java.lang.String to, CSVFormat csvFormat) |
public static final java.util.Set<Schema.FieldType> VALID_FIELD_TYPE_SET
The valid Schema.FieldType from which CsvIO converts CSV records to the fields.
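A short sketch of consulting this set, for instance to check whether a given field can be converted (the check itself is illustrative, not prescribed by the Beam docs):

```java
// Check whether a Schema field's type is among the field types CsvIO can convert.
Schema.Field field = Schema.Field.of("purchaseAmount", Schema.FieldType.DOUBLE);
boolean convertible = CsvIO.VALID_FIELD_TYPE_SET.contains(field.getType());
```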
 public static <T> CsvIO.Write<T> write(java.lang.String to, CSVFormat csvFormat)
Instantiates a CsvIO.Write for writing user types in CSVFormat format.

public static CsvIO.Write<Row> writeRows(java.lang.String to, CSVFormat csvFormat)
public static <T> CsvIOParse<T> parse(java.lang.Class<T> klass, CSVFormat csvFormat)
Instantiates a CsvIOParse for parsing CSV string records into custom Schema-mapped Class<T>es from the records' assumed CsvFormat.
 See the Beam
 Programming Guide on how to configure your custom Class<T> for Beam to infer its
 Schema using a SchemaProvider annotation such as AutoValueSchema or
 JavaBeanSchema.
 The example below parses CSV string records, originating for instance from TextIO.Read, into an AutoValueSchema annotated AutoValue data class PCollection.
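 A minimal sketch follows, assuming a hypothetical SomeDataClass annotated for AutoValueSchema inference with someString and someInteger properties (the class body and property names are assumptions for illustration):

```java
// SomeDataClass is a data class configured for Beam to automatically infer its Schema.
@DefaultSchema(AutoValueSchema.class)
@AutoValue
public abstract static class SomeDataClass {
  public abstract String getSomeString();
  public abstract Integer getSomeInteger();
}

// Read CSV records from Google Cloud Storage using TextIO.
PCollection<String> csvRecords =
    pipeline.apply(TextIO.read().from("gs://bucket/folder/*.csv"));

// Parse the records into SomeDataClass elements; parse errors are available separately.
CsvIOParseResult<SomeDataClass> result =
    csvRecords.apply(
        CsvIO.parse(
            SomeDataClass.class,
            CSVFormat.DEFAULT.withHeader("someString", "someInteger")));

PCollection<SomeDataClass> output = result.getOutput();
PCollection<CsvIOParseError> errors = result.getErrors();
```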
public static CsvIOParse<Row> parseRows(Schema schema, CSVFormat csvFormat)
Instantiates a CsvIOParse for parsing CSV string records into Rows from the records' assumed CsvFormat and expected Schema.
 The example below parses CSV string records, originating for instance from TextIO.Read, into a Row PCollection.
 
 // Define the expected Schema.
 Schema schema = Schema.of(
  Schema.Field.of("someString", FieldType.STRING),
  Schema.Field.of("someInteger", FieldType.INT32)
 );
 // Pipeline example reads CSV string records from Google Cloud storage and writes to BigQuery.
 Pipeline pipeline = Pipeline.create();
 // Read CSV records from Google Cloud storage using TextIO.
 PCollection<String> csvRecords = pipeline
  .apply(TextIO.read().from("gs://bucket/folder/*.csv"));
 // Apply the CSV records PCollection<String> to the CsvIOParse transform instantiated using CsvIO.parseRows.
 CsvIOParseResult<Row> result = csvRecords.apply(CsvIO.parseRows(
      schema,
      CSVFormat.DEFAULT.withHeader("someString", "someInteger")
 ));
 // Acquire any processing errors to either write to logs or apply to a downstream dead letter queue such as BigQuery.
 result.getErrors().apply(BigQueryIO.<CsvIOParseError>write()
  .to("project:dataset.table_of_errors")
  .useBeamSchema()
  .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
  .withWriteDisposition(WriteDisposition.WRITE_APPEND));
 // Acquire the successful PCollection<Row> output.
 PCollection<Row> output = result.getOutput();
 // Do something with the output such as write to BigQuery.
 output.apply(BigQueryIO.<Row>write()
  .to("project:dataset.table_of_output")
  .useBeamSchema()
  .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
  .withWriteDisposition(WriteDisposition.WRITE_APPEND));