Regex

Filters input string elements based on a regex. May also transform them based on the matching groups.

Examples

In the following examples, we create a pipeline with a PCollection of text strings. Then, we use the Regex transform to search, replace, and split through the text elements using regular expressions.

Tools such as regex101 can help you create and test your regular expressions. Make sure to select the Python flavor in the left sidebar.

Let's look at the regular expression (?P<icon>[^\s,]+), *(\w+), *(\w+) as an example. It first matches one or more characters that are neither whitespace \s ([ \t\n\r\f\v]) nor a comma, and stores them in the named group icon. This part also matches non-ASCII text such as emoji. It then matches a comma followed by any number of spaces, and at least one word character \w (letters, digits, and underscore; [a-zA-Z0-9_] for ASCII text), which is stored in the second group for the name. It does the same with the third group for the duration.
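
To make the group structure concrete, here is a minimal sketch that applies the same pattern with Python's standard re module, without Beam. The sample string is taken from the examples on this page; the variable names are only illustrative.

```python
import re

# The pattern discussed above: a named group 'icon' plus two numbered groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'

# re.match tries to match at the start of the string.
match = re.match(regex, '🍓, Strawberry, perennial')

icon = match.group('icon')  # named group (also reachable as group 1)
name = match.group(2)       # second group
duration = match.group(3)   # third group
print(icon, name, duration)
```

A named group is also numbered, so match.group('icon') and match.group(1) refer to the same text.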

Note: To avoid unexpected string escaping in your regular expressions, it is recommended to use raw strings such as r'raw-string' instead of 'escaped-string'.
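
As an illustration of why raw strings matter, the following sketch uses only Python's re module. The pattern and text are made-up examples, not part of the Beam samples.

```python
import re

# In a normal string literal, '\b' is the backspace character (one character).
escaped = '\bcat\b'
# In a raw string, r'\b' stays as a backslash plus 'b', the regex word boundary.
raw = r'\bcat\b'

text = 'a cat sat'
found_with_raw = re.search(raw, text) is not None
found_with_escaped = re.search(escaped, text) is not None
print(len(escaped), len(raw), found_with_raw, found_with_escaped)
```

The escaped version silently looks for backspace characters around "cat", so it finds nothing in ordinary text.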

Example 1: Regex match

Regex.matches keeps only the elements that match the regular expression, returning the matched group. The argument group is set to 0 (the entire match) by default, but can be set to a group number like 3, or to a named group like 'icon'.

Regex.matches starts to match the regular expression at the beginning of the string. To match until the end of the string, add '$' at the end of the regular expression.

To start matching at any point instead of the beginning of the string, use Regex.find(regex).

import apache_beam as beam

# Matches a named group 'icon', and then two comma-separated groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
with beam.Pipeline() as pipeline:
  plants_matches = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '🍓, Strawberry, perennial',
          '🥕, Carrot, biennial ignoring trailing words',
          '🍆, Eggplant, perennial',
          '🍅, Tomato, annual',
          '🥔, Potato, perennial',
          '# 🍌, invalid, format',
          'invalid, 🍉, format',
      ])
      | 'Parse plants' >> beam.Regex.matches(regex)
      | beam.Map(print)
  )

Output PCollection after Regex.matches:

plants_matches = [
    '🍓, Strawberry, perennial',
    '🥕, Carrot, biennial',
    '🍆, Eggplant, perennial',
    '🍅, Tomato, annual',
    '🥔, Potato, perennial',
]
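
The anchoring behavior described above for Regex.matches and Regex.find can be reproduced with Python's re module. This sketch assumes the two transforms behave like re.match and re.search respectively, which is consistent with the descriptions on this page but is not shown here from the Beam source.

```python
import re

regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
line = '# 🍓, Strawberry, perennial'

# Anchored at the start: '#' is not followed by a comma, so nothing matches.
anchored = re.match(regex, line)

# Searching anywhere: the leading '# ' is skipped and the plant is found.
anywhere = re.search(regex, line)

# Adding '$' requires the match to reach the end of the string.
trailing = '🥕, Carrot, biennial ignoring trailing words'
loose = re.match(regex, trailing)
strict = re.match(regex + '$', trailing)

print(anchored, anywhere.group(0), loose.group(0), strict)
```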


Example 2: Regex match with all groups

Regex.all_matches keeps only the elements that match the regular expression, returning all groups as a list. The groups are returned in the order encountered in the regular expression, including group 0 (the entire match) as the first group.

Regex.all_matches starts to match the regular expression at the beginning of the string. To match until the end of the string, add '$' at the end of the regular expression.

To start matching at any point instead of the beginning of the string, use Regex.find_all(regex, group=Regex.ALL, outputEmpty=False).

import apache_beam as beam

# Matches a named group 'icon', and then two comma-separated groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
with beam.Pipeline() as pipeline:
  plants_all_matches = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '🍓, Strawberry, perennial',
          '🥕, Carrot, biennial ignoring trailing words',
          '🍆, Eggplant, perennial',
          '🍅, Tomato, annual',
          '🥔, Potato, perennial',
          '# 🍌, invalid, format',
          'invalid, 🍉, format',
      ])
      | 'Parse plants' >> beam.Regex.all_matches(regex)
      | beam.Map(print)
  )

Output PCollection after Regex.all_matches:

plants_all_matches = [
    ['🍓, Strawberry, perennial', '🍓', 'Strawberry', 'perennial'],
    ['🥕, Carrot, biennial', '🥕', 'Carrot', 'biennial'],
    ['🍆, Eggplant, perennial', '🍆', 'Eggplant', 'perennial'],
    ['🍅, Tomato, annual', '🍅', 'Tomato', 'annual'],
    ['🥔, Potato, perennial', '🥔', 'Potato', 'perennial'],
]
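
The shape of each output list (group 0 first, then the groups in order) can be sketched with a plain re match object. This is an illustration of the group ordering only, not Beam's implementation.

```python
import re

regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
match = re.match(regex, '🍅, Tomato, annual')

# Group 0 is the entire match; groups 1..n follow in order of appearance.
# match.re.groups is the number of capturing groups in the pattern.
all_groups = [match.group(i) for i in range(match.re.groups + 1)]
print(all_groups)
```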


Example 3: Regex match into key-value pairs

Regex.matches_kv keeps only the elements that match the regular expression, returning a key-value pair using the specified groups. The argument keyGroup is set to a group number like 3, or to a named group like 'icon'. The argument valueGroup is set to 0 (the entire match) by default, but can be set to a group number like 3, or to a named group like 'icon'.

Regex.matches_kv starts to match the regular expression at the beginning of the string. To match until the end of the string, add '$' at the end of the regular expression.

To start matching at any point instead of the beginning of the string, use Regex.find_kv(regex, keyGroup).

import apache_beam as beam

# Matches a named group 'icon', and then two comma-separated groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
with beam.Pipeline() as pipeline:
  plants_matches_kv = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '🍓, Strawberry, perennial',
          '🥕, Carrot, biennial ignoring trailing words',
          '🍆, Eggplant, perennial',
          '🍅, Tomato, annual',
          '🥔, Potato, perennial',
          '# 🍌, invalid, format',
          'invalid, 🍉, format',
      ])
      | 'Parse plants' >> beam.Regex.matches_kv(regex, keyGroup='icon')
      | beam.Map(print)
  )

Output PCollection after Regex.matches_kv:

plants_matches_kv = [
    ('🍓', '🍓, Strawberry, perennial'),
    ('🥕', '🥕, Carrot, biennial'),
    ('🍆', '🍆, Eggplant, perennial'),
    ('🍅', '🍅, Tomato, annual'),
    ('🥔', '🥔, Potato, perennial'),
]
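
To see how keyGroup and valueGroup select the two halves of each pair, here is a small sketch with Python's re module. The helper to_kv is hypothetical and exists only for this illustration; it is not part of Beam.

```python
import re

regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'

def to_kv(line, key_group, value_group=0):
  """Hypothetical helper: returns (key, value), or None when nothing matches."""
  match = re.match(regex, line)
  if match is None:
    return None
  return match.group(key_group), match.group(value_group)

default_value = to_kv('🥔, Potato, perennial', 'icon')  # value is group 0
custom_groups = to_kv('🥔, Potato, perennial', 2, 3)    # name -> duration
no_match = to_kv('invalid line', 'icon')
print(default_value, custom_groups, no_match)
```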


Example 4: Regex find

Regex.find keeps only the elements that match the regular expression, returning the matched group. The argument group is set to 0 (the entire match) by default, but can be set to a group number like 3, or to a named group like 'icon'.

Regex.find matches the first occurrence of the regular expression in the string. To start matching at the beginning, add '^' at the beginning of the regular expression. To match until the end of the string, add '$' at the end of the regular expression.

If you need to match from the start only, consider using Regex.matches(regex).

import apache_beam as beam

# Matches a named group 'icon', and then two comma-separated groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
with beam.Pipeline() as pipeline:
  plants_matches = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '# 🍓, Strawberry, perennial',
          '# 🥕, Carrot, biennial ignoring trailing words',
          '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',
          '# 🍅, Tomato, annual - 🍉, Watermelon, annual',
          '# 🥔, Potato, perennial',
      ])
      | 'Parse plants' >> beam.Regex.find(regex)
      | beam.Map(print)
  )

Output PCollection after Regex.find:

plants_matches = [
    '🍓, Strawberry, perennial',
    '🥕, Carrot, biennial',
    '🍆, Eggplant, perennial',
    '🍅, Tomato, annual',
    '🥔, Potato, perennial',
]


Example 5: Regex find all

Regex.find_all returns a list of all the matches of the regular expression, returning the matched group. The argument group is set to 0 by default, but can be set to a group number like 3, to a named group like 'icon', or to Regex.ALL to return all groups. The argument outputEmpty is set to True by default, but can be set to False to skip elements where no matches were found.

Regex.find_all matches the regular expression anywhere it is found in the string. To start matching at the beginning, add '^' at the start of the regular expression. To match until the end of the string, add '$' at the end of the regular expression.

If you need to match all groups from the start only, consider using Regex.all_matches(regex).

import apache_beam as beam

# Matches a named group 'icon', and then two comma-separated groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
with beam.Pipeline() as pipeline:
  plants_find_all = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '# 🍓, Strawberry, perennial',
          '# 🥕, Carrot, biennial ignoring trailing words',
          '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',
          '# 🍅, Tomato, annual - 🍉, Watermelon, annual',
          '# 🥔, Potato, perennial',
      ])
      | 'Parse plants' >> beam.Regex.find_all(regex)
      | beam.Map(print)
  )

Output PCollection after Regex.find_all:

plants_find_all = [
    ['🍓, Strawberry, perennial'],
    ['🥕, Carrot, biennial'],
    ['🍆, Eggplant, perennial', '🍌, Banana, perennial'],
    ['🍅, Tomato, annual', '🍉, Watermelon, annual'],
    ['🥔, Potato, perennial'],
]
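
Collecting every match in a string, and choosing which group to keep, can be sketched with re.finditer. This is a plain-Python illustration of the idea; the empty list in the last line corresponds to the case that outputEmpty is described as controlling.

```python
import re

regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
line = '# 🍆, Eggplant, perennial - 🍌, Banana, perennial'

# One entry per non-overlapping match, keeping group 0 (the entire match).
whole = [m.group(0) for m in re.finditer(regex, line)]

# The same matches, keeping only the named group 'icon'.
icons = [m.group('icon') for m in re.finditer(regex, line)]

# A string without any match yields an empty list.
none_found = [m.group(0) for m in re.finditer(regex, 'no plants here')]
print(whole, icons, none_found)
```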


Example 6: Regex find as key-value pairs

Regex.find_kv finds every match of the regular expression in the string and returns a key-value pair for each one, using the specified groups. The argument keyGroup is set to a group number like 3, or to a named group like 'icon'. The argument valueGroup is set to 0 (the entire match) by default, but can be set to a group number like 3, or to a named group like 'icon'.

Regex.find_kv matches the regular expression anywhere it is found in the string, so an element with several matches produces several key-value pairs. To start matching at the beginning, add '^' at the beginning of the regular expression. To match until the end of the string, add '$' at the end of the regular expression.

If you need to match as key-value pairs from the start only, consider using Regex.matches_kv(regex, keyGroup).

import apache_beam as beam

# Matches a named group 'icon', and then two comma-separated groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
with beam.Pipeline() as pipeline:
  plants_matches_kv = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '# 🍓, Strawberry, perennial',
          '# 🥕, Carrot, biennial ignoring trailing words',
          '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',
          '# 🍅, Tomato, annual - 🍉, Watermelon, annual',
          '# 🥔, Potato, perennial',
      ])
      | 'Parse plants' >> beam.Regex.find_kv(regex, keyGroup='icon')
      | beam.Map(print)
  )

Output PCollection after Regex.find_kv:

plants_matches_kv = [
    ('🍓', '🍓, Strawberry, perennial'),
    ('🥕', '🥕, Carrot, biennial'),
    ('🍆', '🍆, Eggplant, perennial'),
    ('🍌', '🍌, Banana, perennial'),
    ('🍅', '🍅, Tomato, annual'),
    ('🍉', '🍉, Watermelon, annual'),
    ('🥔', '🥔, Potato, perennial'),
]


Example 7: Regex replace all

Regex.replace_all returns the string with all the occurrences of the regular expression replaced by another string. You can also use backreferences in the replacement.

import apache_beam as beam

with beam.Pipeline() as pipeline:
  plants_replace_all = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '🍓 : Strawberry : perennial',
          '🥕 : Carrot : biennial',
          '🍆\t:\tEggplant\t:\tperennial',
          '🍅 : Tomato : annual',
          '🥔 : Potato : perennial',
      ])
      | 'To CSV' >> beam.Regex.replace_all(r'\s*:\s*', ',')
      | beam.Map(print)
  )

Output PCollection after Regex.replace_all:

plants_replace_all = [
    '🍓,Strawberry,perennial',
    '🥕,Carrot,biennial',
    '🍆,Eggplant,perennial',
    '🍅,Tomato,annual',
    '🥔,Potato,perennial',
]
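
The example above uses a plain replacement string. As a sketch of what a backreference in the replacement looks like, the following uses re.sub from Python's standard library, assuming the replacement string follows the same \1, \2 syntax; the swap pattern is a made-up illustration.

```python
import re

line = '🍓 : Strawberry : perennial'

# Plain replacement: every ':' with its surrounding whitespace becomes ','.
csv_line = re.sub(r'\s*:\s*', ',', line)

# Backreferences: \1 and \2 refer to the captured groups, swapped here.
swapped = re.sub(r'(\w+) : (\w+)', r'\2 : \1', line)
print(csv_line)
print(swapped)
```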


Example 8: Regex replace first

Regex.replace_first returns the string with the first occurrence of the regular expression replaced by another string. You can also use backreferences in the replacement.

import apache_beam as beam

with beam.Pipeline() as pipeline:
  plants_replace_first = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '🍓, Strawberry, perennial',
          '🥕, Carrot, biennial',
          '🍆,\tEggplant, perennial',
          '🍅, Tomato, annual',
          '🥔, Potato, perennial',
      ])
      | 'As dictionary' >> beam.Regex.replace_first(r'\s*,\s*', ': ')
      | beam.Map(print)
  )

Output PCollection after Regex.replace_first:

plants_replace_first = [
    '🍓: Strawberry, perennial',
    '🥕: Carrot, biennial',
    '🍆: Eggplant, perennial',
    '🍅: Tomato, annual',
    '🥔: Potato, perennial',
]
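
The difference between replacing the first occurrence and replacing all of them can be sketched with the count argument of Python's re.sub. This illustrates the behavior described above without using Beam.

```python
import re

line = '🍆,\tEggplant, perennial'

# count=1 stops after the first occurrence.
first_only = re.sub(r'\s*,\s*', ': ', line, count=1)

# Without count, every occurrence is replaced.
every = re.sub(r'\s*,\s*', ': ', line)
print(first_only)
print(every)
```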


Example 9: Regex split

Regex.split returns the list of strings that were delimited by the specified regular expression. The argument outputEmpty is set to False by default, but can be set to True to keep empty items in the output list.

import apache_beam as beam

with beam.Pipeline() as pipeline:
  plants_split = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '🍓 : Strawberry : perennial',
          '🥕 : Carrot : biennial',
          '🍆\t:\tEggplant : perennial',
          '🍅 : Tomato : annual',
          '🥔 : Potato : perennial',
      ])
      | 'Parse plants' >> beam.Regex.split(r'\s*:\s*')
      | beam.Map(print)
  )

Output PCollection after Regex.split:

plants_split = [
    ['🍓', 'Strawberry', 'perennial'],
    ['🥕', 'Carrot', 'biennial'],
    ['🍆', 'Eggplant', 'perennial'],
    ['🍅', 'Tomato', 'annual'],
    ['🥔', 'Potato', 'perennial'],
]
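
None of the inputs above produce empty items, so the effect of outputEmpty is not visible in the output. The following sketch uses re.split from Python's standard library on a made-up string to show where empty items come from and what dropping them looks like.

```python
import re

# Two adjacent delimiters and a trailing delimiter both produce empty items.
line = 'Tomato::annual:'

with_empty = re.split(r'\s*:\s*', line)

# Dropping empty strings, comparable to the default described above.
without_empty = [item for item in with_empty if item]
print(with_empty)
print(without_empty)
```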

