Structuring Unstructured Content: The Power of Knowledge Graphs and Content Deconstruction

May 19, 2020

EK Team

Unstructured content is ubiquitous in today’s business environment. In fact, IDC estimates that 80% of the world’s data will be unstructured by 2025, with many organizations already at that proportion. Every organization possesses libraries, shared drives, and content management systems full of unstructured data contained in Word documents, PowerPoint presentations, PDFs, and more. Documents like these often contain pieces of information that are critical to business operations, but these “nuggets” of information can be difficult to find when they’re buried within lengthy volumes of text. For example, legal teams may need information that is hidden in process and policy documents, and call center employees might require fast access to information in product guides. Users search for and use the information found in unstructured content all the time, but its management and retrieval can be quite challenging when content is long, text-heavy, and has few descriptive attributes (metadata) associated with it.

What is Unstructured Content?

Unstructured content (also called unstructured data; the two terms are used interchangeably here) is content that does not have any data model or other defining structure applied to it. This makes it difficult for information systems to ingest and manage, which in turn makes search applications less accurate than they could be. Unstructured content is typically textual in nature, but it can also include names, dates, and other data.

Common Unstructured Content Dilemmas

At EK, we see two dilemmas that users commonly encounter when dealing with unstructured content.

The “Search Again” Dilemma

Imagine that you’re trying to find your organization’s process for submitting a travel request. After searching through a shared drive of company information, you find an HR PDF called “Employee Handbook.” You open the file and see that it is 40 pages long, so you use Ctrl + F (or Edit -> Find) to search for the phrase “travel request.” This takes you to the portion of the handbook that you needed. In this scenario, you had to search twice: once for the document, and again for the actual information you needed. If you were using a system that was underpinned by deconstructed content, your initial search for “travel request” would have returned what you needed, saving you time and effort. This is because each content piece, rather than only the longer document, would be tagged with metadata and indexed by the search application. Search can then treat each content chunk as its own result, surfacing more specific answers.
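To make the contrast concrete, here is a minimal sketch that assumes a toy in-memory index rather than a real search application. The handbook sections, their text, and the search helper are all hypothetical and exist only for illustration.

```python
# Hypothetical handbook content; headings and text are illustrative only.
HANDBOOK = {
    "title": "Employee Handbook",
    "sections": [
        {"heading": "Time Off", "text": "Submit leave requests two weeks ahead."},
        {"heading": "Travel", "text": "To submit a travel request, complete the travel form."},
        {"heading": "Dress Code", "text": "Business casual is expected in the office."},
    ],
}


def search(index, query):
    """Return the labels of all index entries whose text contains the query."""
    q = query.lower()
    return [label for label, text in index if q in text.lower()]


# Document-level index: a single entry covering the entire handbook.
doc_index = [
    (HANDBOOK["title"], " ".join(s["text"] for s in HANDBOOK["sections"]))
]

# Chunk-level index: one entry per section of the handbook.
chunk_index = [
    (f'{HANDBOOK["title"]} > {s["heading"]}', s["text"])
    for s in HANDBOOK["sections"]
]

doc_hits = search(doc_index, "travel request")
chunk_hits = search(chunk_index, "travel request")
```

With the document-level index, the only hit is the whole handbook, so the reader still has to search inside it. With the chunk-level index, the hit is the Travel section itself.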

The “I Didn’t Know I Was Looking for That” Dilemma

Users often embark on their search for content with a certain document or type of content in mind. They may think “I need the manual for this procedure,” or “I need this specific form.” However, users don’t always have access to, or awareness of, the full breadth of content an organization has stored in its systems. For example, imagine that you’re a lawyer looking for an example of a Licensing Contract that you can use for the project you’re working on. You may enter your company’s intranet and search for “Licensing Contract,” which returns dozens of contracts that you can scroll through. Further down in the search results you see a document titled “Licensing Contract Template” and realize this is what you need, as opposed to a completed example. In this instance, you had to scroll through search results only to realize that what you were looking for was actually a template. When data is unstructured, search results become unstructured. Systems cannot derive meaning from volumes of text, so they cannot reflect search results back in a meaningful way.

These scenarios should be familiar to almost anyone working in an organization with lots of unstructured data. Many users become accustomed to stumbling through virtual stacks of documents until they strike the right piece of information they need. However, this doesn’t need to be the status quo.

Creating Structure

There are two practices that, when combined, result in a robust, efficient system for managing and searching unstructured content. Before I talk about them, though, I should mention a critical part of information architecture: taxonomy. A precursor to more complex content management efforts should be the design of a user-centric business taxonomy that satisfactorily encompasses the range of information being stored in a system. Taxonomy terms will be the glue that holds together the solutions I discuss in the rest of this post. Once a taxonomy is in place, content deconstruction and a knowledge graph can be used to create a sophisticated content management solution.

Content deconstruction, which is explained in more depth in a separate blog post, breaks longer documents into smaller chunks so that more pointed metadata can be applied to each section. This creates more relevant search results that are consumable by systems and users alike. In the context of the “Search Again” problem, a deconstructed approach would eliminate the need to dig through longer documents to get to the right piece of information. Applying a knowledge graph to content “chunks” results in an even more sophisticated solution in which these chunks can be related to each other.
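As a rough sketch of what deconstruction can look like in practice, the example below splits a document on heading lines and tags each chunk with taxonomy terms. It assumes that headings are marked with a leading "# " and that tagging is a simple keyword match. Both are simplifications, and the taxonomy, sample document, and function names are hypothetical.

```python
import re

# Hypothetical taxonomy: each term maps to keywords that signal it.
TAXONOMY = {
    "Travel": ["travel", "trip"],
    "Time Off": ["leave", "vacation"],
    "Dress Code": ["uniform", "attire"],
}

# Hypothetical source document; "# " marks a section heading.
DOCUMENT = """\
# Requesting Leave
Vacation requests go to your manager.

# Booking a Trip
Submit a travel request before booking.

# What to Wear
Store staff wear the issued uniform.
"""


def deconstruct(text):
    """Split text on '# ' heading lines into chunks with a heading and a body."""
    chunks = []
    for block in re.split(r"(?m)^# ", text):
        if not block.strip():
            continue  # skip the empty piece before the first heading
        heading, _, body = block.partition("\n")
        chunks.append({"heading": heading.strip(), "body": body.strip()})
    return chunks


def tag(chunk):
    """Attach every taxonomy term whose keywords appear in the chunk."""
    content = f'{chunk["heading"]} {chunk["body"]}'.lower()
    chunk["tags"] = sorted(
        term for term, words in TAXONOMY.items()
        if any(w in content for w in words)
    )
    return chunk


chunks = [tag(c) for c in deconstruct(DOCUMENT)]
```

Each resulting chunk carries its own heading, body, and tags, so a search application could index it as a separate result. A production system would more likely rely on the document's real structure and an auto-tagging tool than on keyword matching.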

Knowledge graphs create and manage meaningful relationships between content, breaking the constraints of keyword search and enabling advanced discovery. Creating knowledge graphs is a complex endeavor, one which my colleagues at EK have written about extensively elsewhere. For this particular use case, knowledge graphs can relate structured content and data associated with content (like author, business area, and topic), so that relevant information can be quickly surfaced in search results. This drives users to discover content they may not have been aware of, preventing the second dilemma I discussed above and adding significantly more value to an organization’s content.
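One simple way to picture this is a set of subject, predicate, object statements about content, where two items are related whenever they share an attribute. The sketch below assumes exactly that. The content titles, predicates, and the related function are hypothetical, and a real knowledge graph would normally live in a graph database or triple store rather than a Python list.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) statements describing content.
TRIPLES = [
    ("Travel Policy", "hasTopic", "Travel"),
    ("Travel Policy", "ownedBy", "HR"),
    ("Expense Form", "hasTopic", "Travel"),
    ("Expense Form", "ownedBy", "Finance"),
    ("Dress Code", "hasTopic", "Uniforms"),
    ("Dress Code", "ownedBy", "HR"),
]


def related(item, triples):
    """Map each other content item to the attributes it shares with `item`."""
    # Group content items by the attribute (predicate, object) they carry.
    by_attribute = defaultdict(set)
    for subject, predicate, obj in triples:
        by_attribute[(predicate, obj)].add(subject)

    # Collect every other item that carries one of `item`'s attributes.
    shared = defaultdict(list)
    for (predicate, obj), subjects in by_attribute.items():
        if item in subjects:
            for other in sorted(subjects - {item}):
                shared[other].append(f"{predicate}={obj}")
    return dict(shared)


links = related("Travel Policy", TRIPLES)
```

Here the travel policy is linked to the expense form through a shared topic and to the dress code through a shared owner, even though no keyword connects them.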

Putting it All Together

To show how these two solutions can work together to create a seamless content consumption experience for users, consider a project I worked on for an international grocery store chain. This organization had an intranet that stored all employee handbooks and HR policies, amounting to long lists of links to download even longer PDFs on topics like Time Off, Dress Code, Pay, and Travel. If an employee wanted to find information about what uniform they are required to wear, they would first have to search the intranet using the term “uniform,” which would return, among other things, a 30-page PDF titled “Employee Dress Code.” Then they would have to download that PDF and take the extra step of scrolling or using Ctrl + F to find information specifically about uniforms. This should sound familiar, as it is an example of the “Search Again” dilemma.

What we did in this scenario was take each of the long policy documents and “chunk” them, breaking each into segments that addressed one topic or subject. A taxonomy was designed so that each segment was tagged with topical and departmental information. For the “Employee Dress Code” document, there were segments like “Store Uniform,” “Office Uniform,” and “Warehouse Uniform,” each with specific rules and expectations around these policies. Now, when an employee searches for the keyword “uniform,” they will be able to quickly assess which segment they need based on its content and tags.

To take this one step further, the use of a knowledge graph surfaces content related to the segment a user is viewing. For example, if the user searched for “uniform” and clicked on the “Store Uniform” segment, they might be shown the related content: “Uniform Order Form” and “Dress Code Violations.” In the course of finding this information, the user may realize that they need to place an order, and be able to do so efficiently because the order form link is readily available to them. This demonstrates a solution to the “I Didn’t Know I Was Looking for That” dilemma.
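The walkthrough above can be sketched end to end as follows. This is an illustration under stated assumptions, not the implementation used on the project: the department tag values, the dictionary layout, and the function names are hypothetical, and only the segment and related content titles come from the example.

```python
# Segments from the chunked "Employee Dress Code" document.
# The department values are hypothetical taxonomy tags.
SEGMENTS = {
    "Store Uniform": {"department": "Stores", "topic": "Uniform"},
    "Office Uniform": {"department": "Corporate", "topic": "Uniform"},
    "Warehouse Uniform": {"department": "Warehouse", "topic": "Uniform"},
}

# Hypothetical graph edges from a segment to its related content.
RELATED = {
    "Store Uniform": ["Uniform Order Form", "Dress Code Violations"],
}


def search_segments(keyword):
    """Return segment titles whose title or tags mention the keyword."""
    k = keyword.lower()
    return [
        title for title, tags in SEGMENTS.items()
        if k in title.lower() or any(k in v.lower() for v in tags.values())
    ]


def view(title):
    """Return a segment's tags plus the related content linked to it."""
    return {
        "title": title,
        "tags": SEGMENTS[title],
        "related": RELATED.get(title, []),
    }


results = search_segments("uniform")
page = view("Store Uniform")
```

The search returns the three uniform segments directly instead of one long PDF, and opening the store segment surfaces the order form and violations content alongside it.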

Parting Thoughts

Deconstructing content and creating a knowledge graph for that content is no small feat, but it is a realistic and achievable approach to content management. At the end of the day, the goal is to build a system that stores and manages deconstructed “chunks” of tagged content that are related using a knowledge graph. If you would like guidance on where to begin, here is how Enterprise Knowledge experts can help.
