java.lang.Object
org.apache.beam.runners.dataflow.internal.IsmFormat

public class IsmFormat extends Object
An Ism file is a prefix-encoded key-value file with composite keys, broken into shards. Each composite key is made up of a fixed number of component keys, and a fixed number of those component keys form the shard key portion; see IsmFormat.IsmRecord and IsmFormat.IsmRecordCoder for further details on the data format. In addition to the data, the file contains a bloom filter and multiple indices that allow for efficient retrieval.

An Ism file is composed of these high-level sections (in order):

  • shard block
  • bloom filter (See ScalableBloomFilter for details on encoding format)
  • shard index
  • footer (See IsmFormat.Footer for details on encoding format)

The shard block is composed of multiple copies of the following:

  • data block
  • data index

The data block is composed of multiple copies of the following:

  • key prefix (See IsmFormat.KeyPrefix for details on encoding format)
  • unshared key bytes
  • value bytes
  • optional 0x00 0x00 bytes followed by metadata bytes (if the 0x00 0x00 bytes are absent, there are no metadata bytes)

Keys must be written into the data block in increasing unsigned lexicographical order, and the shard portion of each key must hash to the same shard id as that of every other key within the same data block. The hash function is the 32-bit murmur3 algorithm, x86 variant (little-endian variant), with 1225801234 as the seed value.
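
The two constraints above can be illustrated with a minimal sketch. It is not the code Beam uses: the class and method names (IsmHashSketch, murmur3_32, compareUnsigned) are hypothetical, and it only shows a plain 32-bit murmur3 (x86, little-endian) hash seeded with the value quoted above, plus an unsigned lexicographical byte comparison. Which encoded bytes make up the shard portion of a key, and how the hash is mapped to a shard id, are defined by IsmFormat.IsmRecordCoder and are not shown here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of the Beam API.
final class IsmHashSketch {
    // Seed value quoted in the class description above.
    static final int ISM_SEED = 1225801234;

    // 32-bit murmur3, x86 variant; input is read as little-endian 4-byte blocks.
    static int murmur3_32(byte[] data, int seed) {
        final int c1 = 0xcc9e2d51;
        final int c2 = 0x1b873593;
        int h = seed;
        int len = data.length;
        int i = 0;
        // Body: mix each complete 4-byte block into the hash state.
        for (; i + 4 <= len; i += 4) {
            int k = (data[i] & 0xff)
                | ((data[i + 1] & 0xff) << 8)
                | ((data[i + 2] & 0xff) << 16)
                | ((data[i + 3] & 0xff) << 24);
            k *= c1;
            k = Integer.rotateLeft(k, 15);
            k *= c2;
            h ^= k;
            h = Integer.rotateLeft(h, 13);
            h = h * 5 + 0xe6546b64;
        }
        // Tail: mix in the remaining 1 to 3 bytes, if any.
        int k = 0;
        int remaining = len - i;
        if (remaining >= 3) k ^= (data[i + 2] & 0xff) << 16;
        if (remaining >= 2) k ^= (data[i + 1] & 0xff) << 8;
        if (remaining >= 1) {
            k ^= (data[i] & 0xff);
            k *= c1;
            k = Integer.rotateLeft(k, 15);
            k *= c2;
            h ^= k;
        }
        // Finalization: fold in the length and avalanche the bits.
        h ^= len;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Unsigned lexicographical comparison: bytes are compared as values 0..255,
    // and a key that is a proper prefix of another key sorts first.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // "shard-key" is a made-up stand-in for the encoded shard portion of a key.
        byte[] shardBytes = "shard-key".getBytes(StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(murmur3_32(shardBytes, ISM_SEED)));
        // Positive result: 0x80 sorts after 0x7f when bytes are treated as unsigned.
        System.out.println(compareUnsigned(new byte[] {(byte) 0x80}, new byte[] {0x7f}));
    }
}
```

Comparing bytes as unsigned values matters because Java's byte type is signed: a signed comparison would place 0x80 before 0x7f and break the required ordering.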

The data index is composed of N copies of the following:

  • key prefix (See IsmFormat.KeyPrefix for details on encoding format)
  • unshared key bytes
  • byte offset to key prefix in data block (variable-length long encoding)
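
As a rough illustration of how a key prefix and the unshared key bytes fit together, the sketch below assumes that the prefix tells the reader how many leading bytes a key shares with the previous key, so that only the remaining bytes need to be stored. That reading is an assumption based on the term "prefix encoded"; the actual layout is documented in IsmFormat.KeyPrefix. All names here (PrefixEncodingSketch, sharedPrefixLength, unsharedBytes, reconstruct) are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only; the real layout is in IsmFormat.KeyPrefix.
final class PrefixEncodingSketch {
    // Number of leading bytes that the current key has in common with the previous key.
    static int sharedPrefixLength(byte[] previous, byte[] current) {
        int n = Math.min(previous.length, current.length);
        int i = 0;
        while (i < n && previous[i] == current[i]) {
            i++;
        }
        return i;
    }

    // The bytes of the current key that follow the shared prefix.
    static byte[] unsharedBytes(byte[] previous, byte[] current) {
        int shared = sharedPrefixLength(previous, current);
        return Arrays.copyOfRange(current, shared, current.length);
    }

    // Rebuilds a full key from the previous key, the shared length, and the unshared bytes.
    static byte[] reconstruct(byte[] previous, int sharedLength, byte[] unshared) {
        byte[] key = Arrays.copyOf(previous, sharedLength + unshared.length);
        System.arraycopy(unshared, 0, key, sharedLength, unshared.length);
        return key;
    }
}
```

Because keys are written in sorted order, neighbouring keys tend to share long prefixes, which is what makes this kind of encoding compact.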

The shard index consists of a variable-length-encoded integer giving the number of shard index records, followed by that many shard index records. See IsmFormat.IsmShardCoder for further details on the encoding scheme.
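
For readers unfamiliar with variable-length integers, here is a minimal sketch of one common scheme: base 128, seven payload bits per byte, least significant group first, with the high bit of each byte marking that more bytes follow. That this is the exact scheme used for the shard index count is an assumption; the class and method names (VarIntSketch, encode, decode) are hypothetical.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, for illustration only; assumes a base-128 varint layout.
final class VarIntSketch {
    // Encodes the value as 1 to 5 bytes, least significant 7-bit group first.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int v = value;
        do {
            int bits = v & 0x7f;
            v >>>= 7;
            // Set the high bit when further groups follow.
            out.write(v != 0 ? (bits | 0x80) : bits);
        } while (v != 0);
        return out.toByteArray();
    }

    // Decodes a value from the start of the array; trailing bytes are ignored.
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            if (shift >= 35) {
                throw new IllegalArgumentException("varint too long for an int");
            }
            result |= (b & 0x7f) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
        throw new IllegalArgumentException("truncated varint");
    }
}
```

Under this scheme a count of 300 takes two bytes (0xAC 0x02), while any count below 128 takes a single byte.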

  • Constructor Details

    • IsmFormat

      public IsmFormat()
  • Method Details

    • validateCoderIsCompatible

      public static void validateCoderIsCompatible(IsmFormat.IsmRecordCoder<?> coder)
      Validates that the key portion of the given coder is deterministic.
    • isMetadataKey

      public static boolean isMetadataKey(List<?> keyComponents)
      Returns true if and only if at least one of the given key components represents a metadata key.
    • getMetadataKey

      public static Object getMetadataKey()
      Returns an object representing a wildcard for a key component. The object is encoded using IsmFormat.MetadataKeyCoder.