Pattern Induction: Best Practices for Extracting Text Patterns

A user guide on how best to utilize Pattern Induction for your text extraction task

Maeda Hanafi
7 min read · Jul 24, 2022

Pattern Induction is a feature in IBM Watson Discovery that helps users quickly and accurately extract text patterns from documents, based on user-provided text examples. In a previous blog post, we wrote an introductory piece on how to extract text patterns. In this post, we outline general guidelines to follow to get the most out of Pattern Induction for your extraction task.

Highlight examples with relatively few (at most 6) tokens.

The first thing to understand is that the tool doesn’t handle patterns of arbitrary length. The time the system needs to learn depends heavily on the length of the pattern, and longer patterns slow the process down. As a result, the tool currently supports only patterns of up to 6 tokens. For this beta version, we recommend working with patterns that have at most 6 tokens to get the best results.

Additionally, it is important to understand Pattern Induction’s tokenization behavior, that is, how it defines the boundaries of a token or a word. Like other text-based AI tools, Pattern Induction has its own definition of a token. In this tool, token boundaries depend on the language of the input documents. For this release, the tool focuses on English, so we provide a few examples for English documents. In general, token boundaries are determined by whitespace. For instance, the phrase “revenue: 10.5 million dollars” is composed of five tokens: |revenue| |:| |10.5| |million| |dollars|.

Everyday English words are separated by whitespace. The numeric amount is its own token. Symbols, such as the colon, are also treated as their own tokens.

Consider the following phrase: AMD32: Performance Review

There are five tokens in total, since “AMD32:” will be split into three tokens by the system: |AMD| |32| |:|. Pattern Induction breaks tokens up according to consecutive numerical characters, consecutive alphabetical characters, and individual symbol characters. Note that “32” here is a plain run of digits; it is not part of a numeric amount written with a decimal point or with commas (the separators used for readability in large numbers, e.g., “4,927,535”).
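The tokenization rules described above can be approximated with a short regular expression. This is only a sketch for counting tokens in your own examples before highlighting them, not Pattern Induction’s actual tokenizer. The names `TOKEN_RE`, `tokenize`, and `within_limit` are hypothetical, and treating a comma-separated number such as “4,927,535” as a single token is an assumption based on the note above.

```python
import re

# Hypothetical approximation of the tokenization described in this guide
# (English text only). This is NOT the tokenizer used by Pattern Induction.
TOKEN_RE = re.compile(
    r"\d+(?:,\d{3})*(?:\.\d+)?"  # numeric amount, e.g. 32, 10.5 or 4,927,535
    r"|[A-Za-z]+"                # run of consecutive alphabetical characters
    r"|[^\sA-Za-z0-9]"           # any other single non-space character (a symbol)
)

MAX_TOKENS = 6  # limit stated in this guide for the beta version


def tokenize(text: str) -> list[str]:
    """Split text into approximate tokens."""
    return TOKEN_RE.findall(text)


def within_limit(text: str) -> bool:
    """Return True if the example has at most MAX_TOKENS approximate tokens."""
    return len(tokenize(text)) <= MAX_TOKENS


print(tokenize("revenue: 10.5 million dollars"))
print(tokenize("AMD32: Performance Review"))
print(within_limit("revenue: 10.5 million dollars"))
```

Running a candidate example through `within_limit` before highlighting it is a quick way to spot examples that are likely too long for the tool.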

Highlight examples that belong to the same concept.

For instance, when you highlight the following examples:

  • revenue: 10 million dollars
  • income: 3.2 thousand dollars

We do not recommend highlighting examples of other concepts at the same time, such as the following: 5 December 2025

If you want to capture a different concept (e.g., dates vs revenue in this case), you can start a new Pattern Induction session and create a separate extractor for the new concept by providing appropriate examples.

Highlight examples with similar patterns.

To illustrate pattern similarity, suppose you highlight the following two examples:

  • revenue: 10 million dollars
  • revenue: $15.5 thousand

The pattern can be described at a high level: extract the token “revenue” followed by a colon and a currency amount. Pattern Induction will learn a rule that captures such texts in the rest of the documents. However, you must avoid highlighting texts that do not match the patterns of the already-highlighted examples, such as:

  • revenue and income in 2010: an estimate of 10.5 million dollars

In this example, the underlying pattern is a lot more complex and longer than the patterns derived from the shorter examples. The longer pattern requires a set of tokens, including both “revenue” and “income”, followed by the year and a colon. Afterwards, there is a set of tokens indicating “an estimate of” and finally, the example ends with the currency amount.

Providing such examples makes it challenging for Pattern Induction to create a rule that generalizes both shorter examples and longer examples. Moreover, the number of tokens is significantly larger than in the other two shorter examples. It is important to note that the system works best when the highlighted examples have roughly the same length (in terms of tokens). Although the system employs heuristics to learn rules that capture examples of slightly different lengths, it is generally difficult to infer rules if there is a big difference in the length of examples.
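To make the length guideline concrete, the sketch below counts approximate tokens for each example and flags the ones that are too long or that differ too much from the rest. It is an illustration only: the tokenizer regex is an approximation of the behavior described earlier, and the function name `length_outliers` and the `max_gap` threshold are hypothetical choices, not values used by Pattern Induction.

```python
import re
from statistics import median

# Approximate tokenizer (an assumption, not Pattern Induction's own):
# numeric amounts, runs of letters, and single symbols.
TOKEN_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?|[A-Za-z]+|[^\sA-Za-z0-9]")


def token_count(text: str) -> int:
    return len(TOKEN_RE.findall(text))


def length_outliers(examples: list[str], max_tokens: int = 6, max_gap: int = 2) -> list[str]:
    """Return examples that exceed max_tokens or whose token count
    differs from the median count by more than max_gap (hypothetical threshold)."""
    counts = [token_count(e) for e in examples]
    typical = median(counts)
    return [
        e for e, n in zip(examples, counts)
        if n > max_tokens or abs(n - typical) > max_gap
    ]


examples = [
    "revenue: 10 million dollars",
    "revenue: $15.5 thousand",
    "revenue and income in 2010: an estimate of 10.5 million dollars",
]
for e in examples:
    print(token_count(e), e)
print(length_outliers(examples))
```

Under this approximation, the two short examples have five tokens each, while the long one has twelve, which is why it is flagged.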

While we advise highlighting examples with similar patterns, we would also like to emphasize that Pattern Induction is capable of learning variations of a pattern, that is, small tweaks to a pattern. Variations are mostly at the token level. For instance, the following is a variation of the above pattern with the token “income” in place of “revenue”:

income: 3.2 thousand dollars

In cases where examples vary in pattern similarity, you must highlight an example for each variation.

Pattern Induction is capable of only learning patterns of text according to the examples you provide to it. So, if you wish to extract texts containing either “revenue” or “income”, then you must highlight one for each variation:

  • revenue: 10 million dollars
  • income: 3.2 thousand dollars

Note that this set of examples follows a similar pattern, where the currency amount appears after the token “revenue” or “income”.

If your documents contain another variation that follows a different pattern, e.g., the currency amount appears before the token “revenue” or “income”, then you must also highlight examples of it:

  • 5 million dollars in revenue
  • $90 million in income

Another important thing to remember is that Pattern Induction attempts to generate a generalized rule that captures all the highlighted examples. In this case, the patterns are different for each variation, and so one rule cannot capture all examples. Pattern Induction can split the examples according to pattern similarity and pattern variation and learn a rule for each group. Thus, Pattern Induction will learn one rule for examples ending with the currency amount (the first variation) and another rule for examples beginning with the currency amount (the second variation).

Highlight examples that are missing from the extracted examples by Pattern Induction.

After Pattern Induction extracts texts, you should inspect the extracted examples in the preview pane. If you notice that an intended extraction is missing, you must highlight it in your document so that Pattern Induction learns not to miss it in the following iterations. For instance, if Pattern Induction extracted all the texts with the token “income” but none with the token “revenue”, then you must double-check whether you have highlighted an example containing “revenue”. Pattern Induction lists all the examples it learns from in the review pane, where you can confirm whether an example variation is missing.

Patterns that require understanding word meanings or semantics can be challenging for Pattern Induction to learn.

To illustrate a challenging case, suppose you wish to extract company performance measurements and you highlight two three-word examples, such as “gross profit margin” and “monthly recurring revenue”.

In such a task, you expect Pattern Induction to learn to extract other measures, such as “net profit margin” or “operating cash flow”. However, this is not possible, because Pattern Induction’s understanding of meaning is limited to its predefined named entity recognition (NER) extractors (which include extractors for common-sense entities).

The key takeaway from this is that most of the generated rules do not make use of the semantic meanings of the highlighted words, unless these words can be extracted by the predefined NER extractors. For this release, Pattern Induction learns to extract mostly syntactic patterns (such as regular expressions).

So, if the highlighted words do not belong to one of the concepts supported in Pattern Induction, then they will be extracted by regular expressions learned on the fly. Regular expressions do not care about the meaning of the words, but only the syntactic structures.

In the case illustrated above, since none of the predefined NER extractors can capture any of the words in the user-provided examples, Pattern Induction will attempt to learn a rule containing regular expressions that extract three consecutive alphabetical words. However, such a pattern is too general and extracts any three consecutive words, such as “certification of sustainable” and “thinking about buying”.
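A small experiment shows why a purely syntactic rule over-generalizes here. The regular expression below is a hypothetical stand-in for “three consecutive alphabetical words”; the exact rule Pattern Induction would learn is not shown in this post.

```python
import re

# Hypothetical stand-in for a learned syntactic rule:
# three consecutive alphabetical words separated by single spaces.
THREE_WORDS = re.compile(r"[A-Za-z]+ [A-Za-z]+ [A-Za-z]+")

phrases = [
    "gross profit margin",           # intended: a performance measure
    "monthly recurring revenue",     # intended: a performance measure
    "certification of sustainable",  # unintended
    "thinking about buying",         # unintended
]

for phrase in phrases:
    # Every phrase matches, because the rule ignores what the words mean.
    print(f"{phrase!r}: {bool(THREE_WORDS.fullmatch(phrase))}")
```

The expression accepts all four phrases, since it only checks the shape of the text and has no notion of what a performance measure is.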

Alternatively, learning to capture dataset-specific vocabularies such as gross profit margin and monthly recurring revenue is best suited for the Domain Vocabulary Induction tool, which is also included in IBM Watson Discovery under the Dictionary tab.

Below, we describe a scenario in which Pattern Induction can extract accurately: the user highlights two examples that are both names of countries.

Given these two user-highlighted examples, the tool learns rules that accurately capture the user’s intent (i.e., extracting countries) and will not extract examples like “A CEO”. This works because both highlighted examples can be extracted by one of the predefined NER extractors (i.e., the one for countries) in Pattern Induction.

Given such nuances, it is advised that once the tool determines a pattern, the user goes through some of the highlighted examples to make sure that the pattern accurately captures their intent.

Conclusion

We understand that patterns that require understanding word meanings or semantics can be challenging for our current version of Pattern Induction to learn. Currently, Pattern Induction creates dictionaries based on similar patterns in the tokens. In the future, we plan to enable users to upload their own dictionaries, which Pattern Induction can then use when generating patterns. Stay tuned!

Authors: Dr. Maeda Hanafi, Dr. Yannis Katsis, Dr. Yunyao Li, Dr. Bikalpa Neupane
