Many of our clients come to Enterprise Knowledge (EK) looking for help defining a metadata schema for their search or content management initiatives. Too often they tell us that they do not have the budget or team to keep up with manually tagging their content. Our consultants regularly show our clients how to make metadata management manageable. I thought I would share some of these ideas with you.
There are a lot of methods to simplify or automate metadata management. Six of the most common approaches we use are:
- Implied metadata
- Linked or distant metadata
- Entity extraction tools
- Auto-categorization tools
- Pattern matching
- Batch metadata management
Implied metadata is metadata that can be derived from existing information about the content. A simple example is the document type (Word, Excel, PowerPoint, or HTML). A more interesting example is using the folder where the content resides to identify a topic or the business unit that owns the content. A quick up-front analysis can often identify numerous pieces of implied metadata that can be managed at little or no cost.
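To make this concrete, here is a minimal sketch of deriving implied metadata from a file path. The lookup tables, the folder names, and the `implied_metadata` function are all hypothetical; in practice they would come out of your own up-front analysis of where content lives.

```python
from pathlib import PurePosixPath

# Hypothetical lookup tables; real ones would come from analyzing your own repository.
EXTENSION_TO_TYPE = {
    ".docx": "Word",
    ".xlsx": "Excel",
    ".pptx": "PowerPoint",
    ".html": "HTML",
}
FOLDER_TO_BUSINESS_UNIT = {
    "hr": "Human Resources",
    "finance": "Finance",
}


def implied_metadata(path: str) -> dict:
    """Derive metadata from the file extension and the folders in the path."""
    p = PurePosixPath(path)
    metadata = {}

    # Document type is implied by the file extension.
    doc_type = EXTENSION_TO_TYPE.get(p.suffix.lower())
    if doc_type:
        metadata["document_type"] = doc_type

    # Business unit is implied by the first folder that matches a known unit.
    for folder in p.parts[:-1]:
        unit = FOLDER_TO_BUSINESS_UNIT.get(folder.lower())
        if unit:
            metadata["business_unit"] = unit
            break

    return metadata


print(implied_metadata("/shared/HR/policies/leave.docx"))
```

Nothing here requires a person to tag anything: once the tables are agreed on, every new document picks up these values automatically.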
Linked or distant metadata is metadata that is related to existing metadata. For example, an organization can identify the group that owns a piece of content based on the group to which the author belongs. A similar example for more public content is identifying the topic based on the content the author typically writes about. One of the biggest advantages of linked metadata is its ability to scale. The metadata team need only manage the relationships between metadata elements, so large increases in content do not necessitate additional tagging efforts.
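As a rough illustration, linked metadata can be as simple as a lookup from one metadata value to another. The author IDs, group names, and the `add_linked_metadata` function below are made up for the example; in a real system the relationship would more likely come from a directory service or a taxonomy tool.

```python
# Hypothetical author-to-group relationship maintained by the metadata team.
AUTHOR_TO_GROUP = {
    "jsmith": "Legal",
    "mlee": "Marketing",
}


def add_linked_metadata(doc: dict) -> dict:
    """Return a copy of the document metadata with the owning group filled in."""
    enriched = dict(doc)
    group = AUTHOR_TO_GROUP.get(doc.get("author"))
    if group is not None:
        enriched["owning_group"] = group
    return enriched


print(add_linked_metadata({"title": "Standard NDA", "author": "jsmith"}))
```

The only thing that needs maintenance is the `AUTHOR_TO_GROUP` table. Adding ten thousand documents by the same authors adds no tagging work.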
Entity extraction or content enrichment tools use natural language processing and dictionaries to flag or extract people, places, and things from text documents. The extracted information can then be used as metadata for your documents. Entity extraction tools are very popular and maturing rapidly. If you have used these tools before and were disappointed in the results, it is worth taking another look at some of the industry leaders like SmartLogic and Temis.
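To give a feel for the dictionary half of this approach, here is a deliberately simple sketch that looks up known names in a piece of text. It does no natural language processing and is not how any particular commercial product works; the dictionary contents and the `extract_entities` function are assumptions for illustration only.

```python
import re

# Hypothetical dictionary of known entities, grouped by type.
ENTITY_DICTIONARY = {
    "person": ["Ada Lovelace", "Alan Turing"],
    "place": ["London", "Manchester"],
}


def extract_entities(text: str) -> dict:
    """Return the dictionary entries that appear in the text, grouped by type."""
    found = {}
    for entity_type, names in ENTITY_DICTIONARY.items():
        for name in names:
            # Match whole words only, ignoring case.
            pattern = r"\b" + re.escape(name) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.setdefault(entity_type, []).append(name)
    return found


print(extract_entities("Alan Turing worked in Manchester."))
```

Real tools go well beyond this by recognizing entities that are not in any dictionary, which is where the natural language processing comes in.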
Auto-categorization tools categorize content into folders using one of three methods: rules-based, pattern matching, or natural language processing. These tools are very useful for grouping or tagging content based on topic, though they do require some training and ongoing management to be effective. That management effort is minimal compared to the level of effort required for manual tagging. As with entity extraction tools, these products have improved dramatically over the past couple of years.
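The rules-based method is the easiest of the three to picture. The sketch below assigns a document to every category whose keywords appear in its text. The categories, keywords, `min_hits` threshold, and `categorize` function are hypothetical, and commercial tools support much richer rules than simple keyword overlap.

```python
import re

# Hypothetical rules: each category is triggered by a set of keywords.
RULES = {
    "Travel": {"flight", "hotel", "itinerary"},
    "Procurement": {"invoice", "vendor", "purchase"},
}


def categorize(text: str, min_hits: int = 1) -> list:
    """Return the categories whose keywords appear at least min_hits times."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {category: len(words & keywords) for category, keywords in RULES.items()}
    return sorted(category for category, hits in scores.items() if hits >= min_hits)


print(categorize("Please attach the vendor invoice for the hotel."))
```

The "training or management" in this case is simply keeping the keyword sets current, and raising `min_hits` is one way to make the rules stricter.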
Pattern matching is a more customized solution, but it can offer great value in the right situations. Content like forms or reports tends to have a consistent structure. In these cases, code can be developed using tools like regular expressions (regex) to extract metadata from the documents. In some cases these solutions can be simple to build and provide a great source of metadata at a very low cost.
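Here is a small sketch of what that code might look like for a report with labeled header lines. The report layout, the field names, and the `extract_fields` function are invented for the example; your own patterns would be written against the structure of your actual forms.

```python
import re

# Hypothetical patterns for a report whose header lines look like "Label: value".
FIELD_PATTERNS = {
    "report_id": re.compile(r"^Report ID:[ \t]*(\S+)[ \t]*$", re.MULTILINE),
    "date": re.compile(r"^Date:[ \t]*(\d{4}-\d{2}-\d{2})[ \t]*$", re.MULTILINE),
    "department": re.compile(r"^Department:[ \t]*(.+?)[ \t]*$", re.MULTILINE),
}


def extract_fields(text: str) -> dict:
    """Pull each field out of the report text; skip fields that are not found."""
    metadata = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            metadata[field] = match.group(1)
    return metadata


sample_report = (
    "Report ID: R-1042\n"
    "Date: 2015-03-09\n"
    "Department: Field Operations\n"
    "\n"
    "Summary of site visits for the quarter.\n"
)
print(extract_fields(sample_report))
```

Fields that do not match are simply left out, so a malformed report produces partial metadata rather than an error.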
Finally, batch metadata management is a search-based solution that allows the content manager to apply tags to content in batches rather than individually. Content managers use a custom search interface to select a group of related content. The interface then allows the manager to tag all of that content at once. One of the best examples of this is the way labels are managed in Google Mail. Batch metadata management is a great way to tag large amounts of content that cannot easily be tagged using the other approaches I previously described.
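Stripped of the user interface, the idea comes down to two steps: search, then tag everything that came back. The sketch below uses an in-memory list and a simple title match as stand-ins for a real search engine; the document records and the `search` and `apply_tag` functions are hypothetical.

```python
def search(documents: list, query: str) -> list:
    """Return the documents whose title contains the query (case-insensitive)."""
    q = query.lower()
    return [doc for doc in documents if q in doc["title"].lower()]


def apply_tag(documents: list, tag: str) -> int:
    """Add the tag to every document in the batch and return how many were tagged."""
    for doc in documents:
        doc.setdefault("tags", set()).add(tag)
    return len(documents)


library = [
    {"title": "2015 Budget Forecast"},
    {"title": "Team Offsite Agenda"},
    {"title": "Budget Variance Report"},
]

# Select a group of related content, then tag all of it in one action.
batch = search(library, "budget")
tagged_count = apply_tag(batch, "Finance")
print(tagged_count, "documents tagged")
```

Because `search` returns the same records held in `library`, tagging the batch updates the underlying documents directly.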
Are you struggling to tag your content today? Do you need more metadata in order to improve search or better manage records? Enterprise Knowledge can provide a metadata strategy that will help solve these issues.