
Common Issues In Auto-Tagging Thesauri

Thesauri are powerful knowledge organization systems that allow knowledge engineers to describe the world by expanding hierarchical taxonomies to create rich relationships between concepts. Thesauri offer many benefits for both indexers and information seekers who wish to find and discover content related to a specific domain. Expanding queries by specifying relationships between concepts in thesauri creates unique opportunities to enhance findability and discoverability by controlling for lexical variants and identifying related concepts. The benefits of thesauri are often reduced by improper application of thesaurus terms. Even the greatest thesaurus is only effective if properly applied.

Auto-tagging is a popular method of applying taxonomy and thesaurus concepts to a large corpus of content. The rich relationships that define a thesaurus are often ambiguous to auto-tagging tools. This means that terms are frequently incorrectly applied to content when performing auto-tagging, reducing the effectiveness of thesauri. PoolParty is one popular auto-tagging tool that offers two complementary capabilities, disambiguation and negation, to overcome this common issue. These capabilities allow PoolParty knowledge engineers to easily enhance auto-tagging rules without complicated coding or scripting languages.

Disambiguation

Every so often, when using the Text Mining and auto-tagging tools in PoolParty, incorrect terms will appear. When wrong annotations keep appearing during text extraction, you can specify the context in which a term should be applied. This is done through disambiguation. Disambiguation lets knowledge engineers help the system identify the correct thesaurus term during auto-tagging, even when that term shares Simple Knowledge Organization System (SKOS) relationships with other thesaurus terms.

For example, the thesaurus terms “Tata Consulting Group” and “Tata Motors Inc.” both share the SKOS alternative label “Tata.” As a result, both terms are frequently applied together during auto-tagging, regardless of context, even though they are distinct organizations. Disambiguation is a method that can be used to correct this issue. In this example, the thesaurus term “Tata Motors Inc.” has a related concept “clean energy.”

When the “has related” relationship is selected in the disambiguation settings, PoolParty’s text mining feature can determine which concept to apply in a given context, based on related terms in the thesaurus.

Once activated, the PoolParty extractor will annotate the thesaurus term “Tata Motors Inc.” when it is found near the term “clean energy”.
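
The idea can be illustrated outside PoolParty with a minimal Python sketch. This is not PoolParty's implementation or API: the CONCEPTS dictionary, the disambiguate function, the plain substring matching, and the related concept given to “Tata Consulting Group” are all assumptions made for illustration. A candidate that carries the ambiguous label is kept only if one of its related concepts also occurs in the text.

```python
# Toy thesaurus: hypothetical data, not a PoolParty export.
CONCEPTS = {
    "Tata Motors Inc.": {"alt_labels": {"Tata"}, "related": {"clean energy"}},
    # "IT services" is an invented related concept, used only for illustration.
    "Tata Consulting Group": {"alt_labels": {"Tata"}, "related": {"IT services"}},
}


def disambiguate(text, label, concepts=CONCEPTS):
    """Return concepts carrying `label` whose related concepts also occur in `text`."""
    lowered = text.lower()
    if label.lower() not in lowered:
        return []
    # Every concept that has the ambiguous label as an alternative label.
    candidates = [name for name, c in concepts.items() if label in c["alt_labels"]]
    # Keep a candidate only when at least one related concept is mentioned.
    return [
        name
        for name in candidates
        if any(rel.lower() in lowered for rel in concepts[name]["related"])
    ]


print(disambiguate("Tata announced new clean energy vehicles.", "Tata"))
```

In this sketch a text that mentions “Tata” with no supporting context returns no concept at all. A real extractor may behave differently in that case.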

To access Disambiguation in PoolParty, select Corpora and navigate to Disambiguation Settings.

From the Disambiguation Settings menu, knowledge engineers can specify which SKOS relationships and custom relationships should be taken into account when terms are calculated during text extraction. These options are found in the Distance Calculation tab.

In Distance Calculation, you can decide which graph traversal patterns are allowed in principle to calculate the “closeness” between thesaurus terms and all SKOS labels. This allows users to improve the effectiveness of auto-classification and enhance information retrieval by shaping how PoolParty weighs specific terms.
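
To make the effect of allowing or disallowing a relationship type concrete, here is a small Python sketch. It only illustrates the general idea of measuring closeness as a hop count over permitted relations; it does not reproduce PoolParty's actual calculation. The EDGES list, the distance function, and the concepts “Automotive companies” and “Energy” are hypothetical.

```python
from collections import deque

# Hypothetical edges: (source, relation type, target).
EDGES = [
    ("Tata Motors Inc.", "skos:related", "clean energy"),
    ("Tata Motors Inc.", "skos:broader", "Automotive companies"),
    ("clean energy", "skos:broader", "Energy"),
]


def distance(start, goal, allowed, edges=EDGES):
    """Hop count between two concepts using only `allowed` relation types, or None."""
    neighbours = {}
    for src, rel, dst in edges:
        if rel in allowed:
            # For simplicity, every permitted relation can be walked in both directions.
            neighbours.setdefault(src, set()).add(dst)
            neighbours.setdefault(dst, set()).add(src)
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None  # The goal cannot be reached with the permitted relations.


print(distance("Tata Motors Inc.", "Energy", {"skos:related", "skos:broader"}))  # 2
print(distance("Tata Motors Inc.", "Energy", {"skos:broader"}))  # None
```

With skos:related switched off, the two concepts are no longer connected, so “Energy” would stop contributing context for “Tata Motors Inc.” in this toy model.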

Indirect Relations and Disambiguation

Disambiguation can also be used to eliminate ambiguity when auto-tagging through indirect relations by assessing the entire graph and thesaurus structure. One example of this is when two different concepts share the same preferred label. In this example, “Europe” is the preferred label of two concepts: a physical location and a character from Greek mythology.

If the phrase “Skiing is popular in Europe” appears in a piece of content included in the auto-tagging process, the system should apply the thesaurus term “Europe” that represents a physical location. In the thesaurus, there is only one path between Skiing and the correct thesaurus term “Europe”:

Skiing –> 1976 Winter Olympics –> Innsbruck –> Austria –> Europe

By specifying the entire graph and thesaurus structure in Disambiguation, knowledge engineers can ensure the correct term will be selected when auto-tagging in any scenario.
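
The path-based reasoning in this example can be sketched in Python as follows. This is an illustration under stated assumptions, not PoolParty's algorithm: the LINKS list, the find_path and pick_sense functions, the suffixes “(place)” and “(mythology)”, and the link from the mythological concept to “Greek mythology” are all hypothetical. The sketch searches for the shortest path from a context term to each candidate concept and picks the candidate that can be reached.

```python
from collections import deque

# Links taken from the example path; the "Greek mythology" link is an assumption.
LINKS = [
    ("Skiing", "1976 Winter Olympics"),
    ("1976 Winter Olympics", "Innsbruck"),
    ("Innsbruck", "Austria"),
    ("Austria", "Europe (place)"),
    ("Europe (mythology)", "Greek mythology"),
]


def find_path(start, goal, links=LINKS):
    """Return the shortest undirected path from start to goal, or None."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    queue, parent = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent pointers back to the start to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def pick_sense(context_term, candidates):
    """Choose the candidate concept with the shortest path to the context term."""
    paths = {c: find_path(context_term, c) for c in candidates}
    reachable = {c: p for c, p in paths.items() if p is not None}
    if not reachable:
        return None
    return min(reachable, key=lambda c: len(reachable[c]))


print(pick_sense("Skiing", ["Europe (place)", "Europe (mythology)"]))
print(" –> ".join(find_path("Skiing", "Europe (place)")))
```

Because no path connects Skiing to the mythological concept in this toy graph, only the physical location remains as a candidate.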

Negation

It is also possible to improve PoolParty’s auto-tagging capabilities through Negation. Negation allows users to define when a specific term should or should not appear based on thesaurus relations. In this example, the thesaurus term “Jaguar” is used poly-hierarchically to represent both a brand of car and an animal. The car-brand term is frequently applied out of context during auto-tagging when content refers to the jaguar as an animal. Using negation, knowledge engineers can prevent the brand “Jaguar” from being applied when the word “cat” is present.
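
A negation rule of this kind can be sketched in a few lines of Python. The rule table, the apply_negation function, and the tag names “Jaguar (car brand)” and “Jaguar (animal)” are hypothetical and only illustrate the principle; in PoolParty the rules are derived from thesaurus relations rather than written by hand.

```python
import re

# Hypothetical rule table: concept -> words whose presence suppresses the tag.
NEGATION_RULES = {"Jaguar (car brand)": {"cat"}}


def apply_negation(text, tags, rules=NEGATION_RULES):
    """Drop any tag whose negation words appear as whole words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [tag for tag in tags if not (rules.get(tag, set()) & words)]


tags = ["Jaguar (car brand)", "Jaguar (animal)"]
print(apply_negation("The jaguar is the largest cat in the Americas.", tags))
print(apply_negation("Jaguar unveiled a new sedan.", tags))
```

Matching whole words rather than substrings keeps a word such as “category” from triggering the “cat” rule. The second call leaves both tags in place, since this sketch covers negation only and not disambiguation.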

To access Negation, select Corpora and navigate to Disambiguation Settings.

From the Disambiguation Settings menu, select the Negation tab.

In the Negation tab, knowledge engineers can select any applicable SKOS or custom relationship to enhance auto-tagging through negation.

Ben White is an information and knowledge management expert. Ben is enthused about improving the flow of information and knowledge through Communities of Practice, taxonomy design, and information management strategies.