apache_beam.io.vcfio module

A source for reading from VCF files (version 4.x).

The 4.2 spec is available at https://samtools.github.io/hts-specs/VCFv4.2.pdf.

class apache_beam.io.vcfio.VariantInfo(data, field_count)

Bases: tuple

Create new instance of VariantInfo(data, field_count)


data – Alias for field number 0


field_count – Alias for field number 1
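Because VariantInfo is a tuple subclass, each field can be read by name or by position. The sketch below illustrates this with a stand-in built from collections.namedtuple rather than the Beam class itself; the values passed for data and field_count are made up for illustration.

```python
from collections import namedtuple

# Stand-in with the same field order as the documented class.
# This is NOT imported from apache_beam; it only mirrors the signature.
VariantInfo = namedtuple('VariantInfo', ['data', 'field_count'])

# Illustrative values only.
info = VariantInfo(data=[0.25, 0.75], field_count=2)

# Named access and positional access return the same objects.
assert info.data is info[0]          # field number 0
assert info.field_count is info[1]   # field number 1

# Like any tuple, the instance can be unpacked.
data, field_count = info
```

The same pattern applies to MalformedVcfRecord, whose fields are file_name (field number 0) and line (field number 1).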

class apache_beam.io.vcfio.MalformedVcfRecord(file_name, line)

Bases: tuple

Create new instance of MalformedVcfRecord(file_name, line)


file_name – Alias for field number 0


line – Alias for field number 1

class apache_beam.io.vcfio.Variant(reference_name=None, start=None, end=None, reference_bases=None, alternate_bases=None, names=None, quality=None, filters=None, info=None, calls=None)

Bases: object

A class to store info about a genomic variant.

Each object corresponds to a single record in a VCF file.

Initialize the Variant object.

  • reference_name (str) – The reference on which this variant occurs (such as chr20 or X).
  • start (int) – The position at which this variant occurs (0-based). Corresponds to the first base of the string of reference bases.
  • end (int) – The end position (0-based) of this variant. Corresponds to the first base after the last base in the reference allele.
  • reference_bases (str) – The reference bases for this variant.
  • alternate_bases (List[str]) – The bases that appear instead of the reference bases.
  • names (List[str]) – Names for the variant, for example a RefSNP ID.
  • quality (float) – Phred-scaled quality score (-10 log10 of the probability that the call is wrong). Higher values imply better quality.
  • filters (List[str]) – A list of filters (normally quality filters) this variant has failed. PASS indicates this variant has passed all filters.
  • info (dict) – A map of additional variant information. The key is specified in the VCF record and the value is of type VariantInfo.
  • calls (list of VariantCall) – The variant calls for this variant. Each one represents the determination of genotype with respect to this variant.
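The start, end, and quality fields can be illustrated with a little arithmetic. The following is a minimal sketch, not the module's own code: it assumes the usual VCF convention that POS is 1-based, derives end from the length of the reference allele only (ignoring any END value a record may carry in its INFO column), and uses the hypothetical helper names vcf_to_zero_based and phred_to_error_prob.

```python
def vcf_to_zero_based(pos, reference_bases):
    """Map a 1-based VCF POS to a 0-based (start, end) pair.

    start is the first base of the reference allele; end is the first
    base after the last base of the reference allele.
    """
    start = pos - 1
    end = start + len(reference_bases)
    return start, end


def phred_to_error_prob(quality):
    """Invert quality = -10 * log10(p) to recover p, the error probability."""
    return 10 ** (-quality / 10.0)


# A three-base reference allele at VCF POS 100 covers 0-based positions 99..101.
start, end = vcf_to_zero_based(100, 'ACG')

# A quality of 30 corresponds to roughly a 1-in-1000 chance the call is wrong.
p_wrong = phred_to_error_prob(30.0)
```

With this convention, end - start always equals the length of reference_bases.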
class apache_beam.io.vcfio.VariantCall(name=None, genotype=None, phaseset=None, info=None)

Bases: object

A class to store info about a variant call.

A call represents the determination of genotype with respect to a particular variant. It may include associated information such as quality and phasing.

Initialize the VariantCall object.

  • name (str) – The name of the call.
  • genotype (List[int]) – The genotype of this variant call as specified by the VCF schema. Each value is either 0, representing the reference, or a 1-based index into alternate bases. Ordering is only important if phaseset is present. If a genotype is not called (that is, a . is present in the GT string), -1 is used.
  • phaseset (str) – If this field is present, this variant call’s genotype ordering implies the phase of the bases and is consistent with any other variant calls in the same reference sequence which have the same phaseset value. If the genotype data was phased but no phase set was specified, this field will be set to *.
  • info (dict) – A map of additional variant call information. The key is specified in the VCF record and the type of the value is specified by the VCF header FORMAT.
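The genotype encoding described above can be sketched as follows. This is an illustrative stand-alone example, not the parser used by the module: parse_genotype and called_bases are hypothetical helpers, and the sketch assumes GT strings separate alleles with / (unphased) or | (phased).

```python
import re


def parse_genotype(gt):
    """Turn a GT string such as '0|1' or './.' into (genotype, phased)."""
    alleles = re.split(r'[|/]', gt)
    # An uncalled allele ('.') is encoded as -1.
    genotype = [-1 if allele == '.' else int(allele) for allele in alleles]
    phased = '|' in gt
    return genotype, phased


def called_bases(genotype, reference_bases, alternate_bases):
    """Resolve genotype indices to bases; None marks an uncalled allele."""
    resolved = []
    for index in genotype:
        if index == -1:
            resolved.append(None)
        elif index == 0:
            resolved.append(reference_bases)
        else:
            # Non-zero values are 1-based indices into alternate_bases.
            resolved.append(alternate_bases[index - 1])
    return resolved


genotype, phased = parse_genotype('0|2')
bases = called_bases(genotype, 'A', ['C', 'T'])
```

Here the call carries the reference allele A and the second alternate allele T, and the | separator marks the genotype as phased.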
class apache_beam.io.vcfio.ReadFromVcf(file_pattern=None, compression_type='auto', validate=True, allow_malformed_records=False, **kwargs)

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform for reading VCF files.

Parses VCF files (version 4) using the PyVCF library. If file_pattern matches multiple files, the header of each file is used separately to parse that file's content. Regardless of the number of files, the output is a single PCollection of Variant objects (or MalformedVcfRecord objects for records that failed to parse, when allow_malformed_records is set).

Initialize the ReadFromVcf transform.

  • file_pattern (str) – The file path to read from either as a single file or a glob pattern.
  • compression_type (str) – Used to handle compressed input files. Typical value is CompressionTypes.AUTO, in which case the extension of each input file's path is used to detect the compression.
  • validate (bool) – Flag to verify that the files exist at pipeline creation time.
  • allow_malformed_records (bool) – Determines whether failed VCF record reads are tolerated. If so, a failed record read yields a MalformedVcfRecord instead of a Variant.
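When allow_malformed_records is enabled, downstream code sees a mix of Variant and MalformedVcfRecord elements and usually needs to separate them. The sketch below shows one way to do that on a plain Python list. It does not import Beam: MalformedVcfRecord is a stand-in namedtuple with the documented fields, split_records is a hypothetical helper, and the sample records are invented.

```python
from collections import namedtuple

# Stand-in mirroring the documented fields; not the Beam class itself.
MalformedVcfRecord = namedtuple('MalformedVcfRecord', ['file_name', 'line'])


def split_records(records):
    """Separate successfully parsed records from malformed ones."""
    variants, malformed = [], []
    for record in records:
        if isinstance(record, MalformedVcfRecord):
            malformed.append(record)
        else:
            variants.append(record)
    return variants, malformed


# Invented sample data: two placeholder variants and one failed read.
records = [
    {'reference_name': 'chr20', 'start': 99},
    MalformedVcfRecord(file_name='sample.vcf', line='chr20\tbad\tline'),
    {'reference_name': 'X', 'start': 5},
]
variants, malformed = split_records(records)
```

Inside a pipeline, the same isinstance check could serve as the predicate of a filtering or partitioning step, so that malformed lines are logged or written out separately from the parsed variants.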