How to Ensure Your Data is AI Ready

Artificial intelligence has the potential to be a game-changer for organizations looking to empower their employees with data at every level. However, as business leaders look to initiate projects that incorporate data as part of their AI solutions, they frequently look to us to ask, “How do I ensure my organization’s data is ready for AI?” In the first blog in this series, we shared ways to ensure knowledge assets are ready for AI. In this follow-on article, we will address the unique challenges that come with connecting data—one of the most unique and varied types of knowledge assets—to AI. Data is pervasive in any organization and can serve as the key feeder for many AI use cases, so it is a high priority knowledge asset to ready for your organization.

The question of data AI readiness stems from the very real concern that when AI is pointed at data that isn’t correct or that doesn’t have the right context associated with it, organizations could face risks to their reputation, their revenue, or their customers’ privacy. With the additional nuance that data brings by often being presented in formats that require transformation, lacking in context, and frequently containing multiple duplicates or near-duplicates with little explanation of their meaning, data (although seemingly already structured and ready for machine consumption) requires greater care than other forms of knowledge assets to comprise part of a trusted AI solution. 

This blog focuses on the key actions an organization needs to perform to ensure their data is ready to be consumed by AI. By following the steps below, an organization can use AI-ready data to develop end-products that are trustworthy, reliable, and transparent in their decision making.

1) Understand What You Mean by “Data” (Data Asset and Scope Definition)

Data is more than what we typically picture it as. Broadly, data is any raw information that can be interpreted to garner meaning or insights on a certain topic. While the typical understanding of data revolves around relational databases and tables galore, often with esoteric metrics filling their rows and columns, data takes a number of forms, which can often be surprising. In terms of format, while data can be in traditional SQL databases and formats, NoSQL data is growing in usage, in forms ranging from key-value pairs to JSON documents to graph databases. Plain, unstructured text such as emails, social media posts, and policy documents are also forms of data, but traditionally not included within the enterprise definition. Finally, data comes from myriad sources—from live machine data on a manufacturing floor to the same manufacturing plant’s Human Resources Management System (HRMS). Data can also be categorized by its business role: operational data that drives day-to-day processes, transactional data that records business exchanges, and even purchased or third-party data brought in to enrich internal datasets. Increasingly, organizations treat data itself as a product, packaged and maintained with the same rigor as software, and rely on data metrics to measure quality, performance, and impact of business assets.

All these forms and types of data meet the definition of a knowledge asset—information and expertise that an organization can use to create value, which can be connected with other knowledge assets. No matter the format or repository type, ingested, AI-ready data can form the backbone of a valuable AI solution by allowing business-specific questions to be answered reliably in an explainable manner. This raises the question to organizational decision makers—what within our data landscape needs to be included in our AI solution? From your definition of what data is, start thinking of what to add iteratively. What systems contain the highest priority data? What datasets would provide the most value to end users? Select high-value data in easy-to-transform formats that allows end users to see the value in your solution. This can garner excitement across departments and help support future efforts to introduce additional data into your AI environment. 

2) Ensure Quality (Data Cleanup)

The majority of organizations we’ve worked with have experienced issues with not knowing what data they have or what it’s intended to be used for. This is especially common in large enterprise settings as the sheer scale and variety of data can breed an environment where data becomes lost, buried, or degrades in quality. This sprawl occurs alongside another common problem, where multiple versions of the same dataset exist, with slight variations in the data they contain. Furthermore, the issue is exacerbated by yet another frequent challenge—a lack of business context. When data lacks context, neither humans nor AI can reliably determine the most up-to-date version, the assumptions and/or conditions in place when said data was collected, or even if the data warrants retention.

Once AI is introduced, these potential issues are only compounded. If an AI system is provided data that is out of date or of low quality, the model will ultimately fail to provide reliable answers to user queries. When data is collected for a specific purpose, such as identifying product preferences across customer segments, but not labeled for said use, and an AI model leverages that data for a completely separate purpose, such as dynamic pricing models, harmful biases can be introduced into the results that negatively impact both the customer and the organization.

Thankfully, there are several methods available to organizations today that allow them to inventory and restructure their data to fix these issues. Examples include data dictionaries, master data (MDM data), and reference data that help standardize data across an organization and help point to what is available at large. Additionally, data catalogs are a proven tool to identify what data exists within an organization, and include versioning and metadata features that can help label data with their versions and context. To help populate catalogs and data dictionaries and to create MDM/reference data, performing a data audit alongside stewards can help rediscover lost context and label data for better understanding by humans and machines alike. Another way to deduplicate, disambiguate, and contextualize data assets is through lineage. Lineage is a feature included in many metadata management tools that stores and displays metadata regarding source systems, creation and modification dates, and file contributors. Using this lineage metadata, data stewards can select which version of a data asset is the most current or relevant for a specific use case and only expose said asset to AI. These methods to ensure data quality and facilitate data stewardship can aid in action towards a larger governance framework. Finally, at a larger scale, a semantic layer can unify data and its meaning for easier ingestion into an AI solution, assist with deduplication efforts, and break down silos between different data users and consumers of knowledge assets at large. 

Separately, for the elimination of duplicate/near-duplicate data, entity resolution can autonomously parse the content of data assets, deduplicate them, and point AI to the most relevant, recent, or reliable data asset to answer a question. 

3) Fill Gaps (Data Creation or Acquisition)

With your organization’s data inventoried and priorities identified, it’s time to start identifying what gaps exist in your data landscape in light of the business questions and challenges you are looking to address. First, ask use case-based questions. Based on your identified use cases, what data would an AI model need to answer topical questions that your organization doesn’t already possess?

At a higher level, gaps in use cases for your AI solution will also exist. To drive use case creation forward, consider the use of a data model, entity relationship diagram (ERD), or ontology to serve as the conceptual map on which all organizational data exists. With a complete data inventory, an ontology can help outline the process by which AI solutions would answer questions at a high level, thanks to being both machine and human-readable. By traversing the ontology or data model, you can design user journeys and create questions that form the basis of novel use cases.

Often, gaps are identified that require knowledge assets outside of data to fill. A data model or ontology can help identify related assets, as they function independently of their asset type. Moreover, standardized metadata across knowledge assets and asset types can enrich assets, link them to one another, and provide insights previously not possible. When instantiated in a solution alongside a knowledge graph, this forms a semantic layer where data assets, such as data products or metrics, gain context and maturity based on related knowledge assets. We were able to enhance the performance of a large retail chain’s analytics team through such an approach utilizing a semantic layer.

To fill these gaps, organizations can look to collect or create more data, as well as purchase publicly available/incorporate open-source datasets (build vs. buy). Another common method of filling identified organizational gaps is the creation of content (and other non-data knowledge assets) to identify a gap via the extraction of tacit organizational knowledge. This is a method that more chief data officers/chief data and AI officers (CDOs/CDAOs) are employing, as their roles expand and reliance on structured data alone to gather insights and solve problems is no longer feasible.

As a whole, this process will drive future knowledge asset collection, creation, and procurement efforts and consequently is a crucial step in ensuring data at large is AI ready. If no such data exists for AI to rely on for certain use cases, users will be presented unreliable, hallucination-based answers, or in a best-case scenario, no answer at all. Yet as part of a solid governance plan as mentioned earlier, the continuation of the gap analysis process post-solution deployment can empower organizations to continually identify—and close—knowledge gaps, continuously improving data AI readiness and AI solution maturity.

4) Add Structure and Context (Semantic Components)

A key component of making data AI-ready is structure—not within the data per se (e.g., JSON, SQL, Excel), but the structure relating the data to use cases. As a term, ‘structure’ added meaning to knowledge assets in our previous blog, but can introduce confusion as a misnomer in this section. Consequently, ‘structure’ will refer to the added, machine-readable context a semantic model adds to data assets, rather than the format of the data assets themselves, as data loses meaning once taken out of the structure or format it is stored in (e.g., as takes place when retrieved by AI).

Although we touched on one type of semantic model in the previous step, there are three semantic models that work together to ensure data AI readiness: business glossaries, taxonomies, and ontologies. Adding semantics to data for the purpose of getting it ready for AI allows an organization to help users understand the meaning of the data they’re working with. Together, taxonomies, ontologies, and business glossaries imbue data with the context needed for an AI model to fully grasp the data’s meaning and make optimal use of it to answer user queries. 

Let’s dive into the business glossary first. Business glossaries define business context-specific terms that are often found in datasets in a plaintext, easy-to-understand manner. For AI models which are often trained generally, these glossary terms can further assist in the selection of the correct data needed to answer a user query. 

Taxonomies group knowledge assets into broader and narrower categories, providing a level of hierarchical organization not available with traditional business glossaries. This can help data AI readiness in manifold ways. By standardizing terminology (e.g., referring to “automobile,” “car,” and “vehicle” all as “Vehicles” instead of separately), data from multiple sources can be integrated more seamlessly, disambiguated, and deduplicated for clearer understanding. 

Finally, ontologies provide the true foundation for linking related datasets to one another and allow for the definition of custom relationships between knowledge assets. When combining ontology with AI, organizations can perform inferences as a way to capture explicit data about what’s only implied by individual datasets. This shows the power of semantics at work, and demonstrates that good, AI-ready data enriched with metadata can provide insights at the same level and accuracy as a human. 

Organizations who have not pursued developing semantics for knowledge assets before can leverage traditional semantic capture methods, such as business glossaries. As organizations mature in their curation of knowledge assets, they are able to leverage the definitions developed as part of these glossaries and dictionaries, and begin to structure that information using more advanced modeling techniques, like taxonomy and ontology development. When applied to data, these semantic models make data more understandable, both to end users and AI systems. 

5) Semantic Model Application (Labeling and Tagging) 

The data management community has more recently been focused on the value of metadata and metadata-first architecture, and is scrambling to catch up to the maturity displayed in the fields of content and knowledge management. Through replicating methods found in content management systems and knowledge management platforms, data management professionals are duplicating past efforts. Currently, the data catalog is the primary platform where metadata is being applied and stored for data assets. 

To aggregate metadata for your organization’s AI readiness efforts, it’s crucial to look to data stewards as the owners of, and primary contributors to, this effort. Through the process of labeling data by populating fields such as asset descriptions, owner, assumptions made upon collection, and purposes, data stewards help to drive their data towards AI readiness while making tacit knowledge explicit and available to all. Additionally, metadata application against a semantic model (especially taxonomies and ontologies) contextualizes assets in business context and connects related assets to one another, further enriching AI-generated responses to user prompts. While there are methods to apply metadata to assets without the need for as much manual effort (such as auto-classification, which excels for content-based knowledge assets), structured data usually dictates the need for human subject matter experts to ensure accurate classification. 

With data catalogs and recent investments in metadata repositories, however, we’ve noticed a trend that we expect will continue to grow and spread across organizations in the near future. Data system owners are more and more keen to manage metadata and catalog their assets within the same systems that data is stored/used, adopting features that were previously exclusive to a data catalog. Major software providers are strategically acquiring or building semantic capabilities for this purpose. This has been underscored by the recent acquisition of multiple data management platforms by the creators of larger, flagship software products. With the features of the data catalog being adapted from a full, standalone application that stores and presents metadata to a component of a larger application that focuses as a metadata store, the metadata repository is beginning to take hold as the predominant metadata management platform.

6) Address Access and Security (Unified Entitlements)

Applying semantic metadata as described above helps to make data findable across an organization and contextualized with relevant datasets—but this needs to be balanced alongside security and entitlements considerations. Without regard to data security and privacy, AI systems risk bringing in data they shouldn’t have access to because access entitlements are mislabeled or missing, leading to leaks in sensitive information.

A common example of when this can occur is with user re-identification. Data points that independently seem innocuous, when combined by an AI system, can leak information about customers or users of an organization. With as few as just 15 data points, information that was originally collected anonymously can be combined to identify an individual. Data elements like ZIP code or date of birth would not be damaging on their own, but when combined, can expose information about a user that should have been kept private. These concerns become especially critical in industries with small population sizes for their datasets, such as rare disease treatment in the healthcare industry.

EK’s unified entitlements work is focused on ensuring the right people and systems view the correct knowledge assets at the right time. This is accomplished through a holistic architectural approach with six key components. Components like a policy engine capture can enforce whether access to data should be given, while components like a query federation layer ensure that only data that is allowed to be retrieved is brought back from the appropriate sources. 

The components of unified entitlements can be combined with other technologies like dark data detection, where a program scrapes an organization’s data landscape for any unlabeled information that is potentially sensitive, so that both human users and AI solutions cannot access data that could result in compliance violations or reputational damage. 

As a whole, data that exposes sensitive information to the wrong set of eyes is not AI-ready. Unified entitlements can form the layer of protection that ensures data AI readiness across the organization.

7) Maintain Quality While Iteratively Improving (Governance)

Governance serves a vital purpose in ensuring data assets become, and remain, AI-ready. With the introduction of AI to the enterprise, we are now seeing governance manifest itself beyond the data landscape alone. As AI governance begins to mature as a field of its own, it is taking on its own set of key roles and competencies and separating itself from data governance. 

While AI governance is meant to guide innovation and future iterations while ensuring compliance with both internal and external standards, data governance personnel are taking on the new responsibility of ensuring data is AI-ready based on requirements set by AI governance teams. Barring the existence of AI governance personnel, data governance teams are meant to serve as a bridge in the interim. As such, your data governance staff should define a common model of AI-ready data assets and related standards (such as structure, recency, reliability, and context) for future reference. 

Both data and AI governance personnel hold the responsibility of future-proofing enterprise AI solutions, in order to ensure they continue to align to the above steps and meet requirements. Specific to data governance, organizations should ask themselves, “How do you update your data governance plan to ensure all the steps are applicable in perpetuity?” In parallel, AI governance should revolve around filling gaps in their solution’s capabilities. Once the AI solutions launch to a production environment and user base, more gaps in the solution’s realm of expertise and capabilities will become apparent. As such, AI governance professionals need to stand up processes to use these gaps to continue identifying new needs for knowledge assets, data or otherwise, in perpetuity.

Conclusion

As we have explored throughout this blog, data is an extremely varied and unique form of knowledge asset with a new and disparate set of considerations to take into account when standing up an AI solution. However, following the steps listed above as part of an iterative process for implementation of data assets within said solution will ensure data is AI-ready and an invaluable part of an AI-powered organization.

If you’re seeking help to ensure your data is AI-ready, contact us at info@enterprise-knowledge.com.

Kyle Garcia Kyle Garcia is a Senior Technical Analyst at EK and part of the Semantic Engineering and Enterprise AI Practice. Kyle is experienced in data engineering, semantic technologies, and applying large language models (LLMs) to real-world business challenges. A published thought leader in AI, Kyle is passionate about integrating generative AI, data science and engineering, and machine learning into the field of knowledge management. More from Kyle Garcia »
Thomas Mitrevski Thomas is a Principal Data Management Consultant who has demonstrated the ability to consistently foster business engagement and accountability for data. He possesses an uncanny ability to solve business data problems through policy development, process reengineering, and technology solutions. He has extensive experience deploying enterprise metadata management programs, with solutions in both multi-cloud and on-premise environments, including natural language search and machine learning in support of taxonomy and ontology development. Finally, Thomas is knowledgeable in program development in areas such as data architecture, metadata management, and data analytics. More from Thomas Mitrevski »