Data Catalog Evaluation Criteria

Data catalogs have grown in adoption and popularity over the past several years, and it’s no coincidence. The amount of data, and therefore metadata, is growing rapidly and shows no sign of slowing down, which increases the need for a cloud solution that serves as a source of truth for data and information. It is difficult to manage and make sense of all of it, and many organizations are unsure how best to use this data for their businesses. Many data catalog vendors appear to share the same message: that they are the right choice for you. That isn’t always the case. Choosing the right data catalog for your business depends on several criteria. Before looking at vendors and selection criteria, let’s narrow down what is important for your data catalog solution to have.

Know Your Use Cases and Users

Before delving into the criteria and vendor you want for your data catalog, thoroughly consider the use cases and users of your business, because they are the main drivers of getting the most out of your data catalog solution.

Use Cases: Consider the root problem that led your business to decide it needs a data catalog solution. Beyond the fact that you have siloed data sources that you want to bring together in one centralized location, what are the true needs behind this? Are you trying to enable discovery, governance, data quality, analytics and/or delivery of your data assets? While all data catalog vendors share the common goal of merging your siloed data sources, each vendor will have tailored functionality that addresses one or more of these questions.

Users: Who will be accessing your data catalog? Your users should align with your use cases, and knowing who they are will help you focus on the most pertinent criteria for your data catalog. Do you need a platform for data scientists and engineers to build and monitor ETL processes? Are business users using the data catalog as a go-to discovery platform for insights and answers? Some example users of your data catalog might be:

  • Casual Users: Conduct broad searches and perform data discovery.
  • Data Stewards: Make governance decisions about data within their domain.
  • Data Analysts: Analyze data sets to generate insights and trends.
  • Data Architects/Engineers: Build data transformation pipelines (ETL).
  • System Administrators: Monitor system performance, use and other metrics.
  • Mission Enablers: Transform data and information into insights within analysis and reports to support objectives.

Selection Criteria

In the previous section, I listed some potential use cases your organization may be focused on, depending on the root cause of your need for a data catalog and the users you identified. Let’s dive deeper into the six criteria that you should prioritize when evaluating your data catalog solution.

Availability & Discovery

To maximize the value of your data, you need to understand what you have and how it relates to other data. Increased availability means catalog users spend less time looking for data, which reduces time to insight and analysis. Discovery gives your data professionals room for greater creativity and innovation with the data and metadata in your infrastructure, making your business more efficient. For example, a client I am supporting in implementing a data catalog solution needs its casual end users to search for keywords and documents across separate databases and see all related results in one place, reducing the time spent searching multiple databases for the same information.

Interoperability

Interoperability pertains to the data catalog’s ability to integrate with your siloed information platforms and aggregate them into one centralized location. Data catalog vendors do not serve every database, data warehouse or data lake on the market. Rather, they often target one or a few particular business software suites. Integration compatibility across your current environment is necessary both to make the data catalog usable at all and to maximize your user experience. In addition to system interoperability, evaluate the data interoperability of the catalog. I recommend using a data catalog that stores and relates your data using graphs and Semantic Web standards, which help transform unstructured and semi-structured data at scale into meaningful, human-readable relationships. Before choosing your catalog, assess how easy it is to configure and link to your current environment. For example, if your current environment spans multiple data storage providers such as AWS, Google or Microsoft, it’s important that your data catalog can aggregate information from all mission-critical sources.
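To make the graph idea concrete, here is a minimal sketch of catalog entries stored as subject-predicate-object triples, the shape that Semantic Web standards such as RDF use. It is plain Python rather than an RDF library. The dataset names, the `ex:` prefix, the `ex:storedIn` property, and the helper functions are all hypothetical. The `dcat:`, `dct:`, and `prov:` terms are borrowed from real vocabularies only as labels.

```python
# Hypothetical catalog entries as (subject, predicate, object) triples.
# Prefixed names imitate DCAT / Dublin Core / PROV terms for readability;
# nothing here is validated against those vocabularies.
TRIPLES = [
    ("ex:sales_2023", "rdf:type", "dcat:Dataset"),
    ("ex:sales_2023", "dct:title", "Sales 2023"),
    ("ex:sales_2023", "ex:storedIn", "ex:aws_s3"),
    ("ex:customers", "rdf:type", "dcat:Dataset"),
    ("ex:customers", "dct:title", "Customer Master"),
    ("ex:customers", "ex:storedIn", "ex:azure_sql"),
    ("ex:sales_report", "rdf:type", "dcat:Dataset"),
    ("ex:sales_report", "dct:title", "Quarterly Sales Report"),
    ("ex:sales_report", "ex:storedIn", "ex:gcp_bigquery"),
    ("ex:sales_report", "prov:wasDerivedFrom", "ex:sales_2023"),
    ("ex:sales_report", "prov:wasDerivedFrom", "ex:customers"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def sources_behind(triples, dataset):
    """Storage platforms of the datasets that `dataset` was derived from."""
    parents = [o for _, _, o in match(triples, dataset, "prov:wasDerivedFrom")]
    platforms = set()
    for parent in parents:
        platforms.update(o for _, _, o in match(triples, parent, "ex:storedIn"))
    return sorted(platforms)


if __name__ == "__main__":
    # Which platforms feed the report, even though it lives somewhere else?
    print(sources_behind(TRIPLES, "ex:sales_report"))
```

Because every fact has the same three-part shape, data from different providers can sit in one graph and be related without a shared table schema. That is the kind of data interoperability worth probing during a vendor demo.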

Governance

Businesses wrap their data in complicated security processes and rules, typically enforced by a specialized data governance team. These security processes and rules are enforced with a top-down approach and slow down your work. The modern and rising data framework highlights the need for governance to be a bottom-up approach to reduce bottlenecks in discovery and analysis. Choose a data catalog with governance features that prioritize catalog setup, data quality, data access and end-to-end data lineage. A few key governance features to consider are data ownership, curation, workflow management, and policy/usability controls. These governance features streamline and consolidate efforts to provide proper data access and management for users through an easy-to-use interface that spans all data within the catalog. The right data catalog solution for your business will specialize in the governance features needed by your user personas, such as letting system administrators control data access for users based on role, responsibility and sensitivity. For more information regarding metadata governance, check out my colleague’s post on the Best Practices for Successful Metadata Governance.

Analytics & Reporting

Analytics and reporting pertain to the ability to develop, automate and deliver analytical summaries and reports about your data. Internally or through integration, your data catalog needs to expand beyond being a centralized repository for your assets and provide analytical insights about how your data is being consumed and what business outcomes it is helping to drive. Some insights that are of interest to many organizations are which datasets are most popular, which users are consuming particular datasets, and the overall quality of the data contained within your data catalog. The most sought-after insight I see in client implementations concerns data usage by user type: analyzing which users consume particular data sets to better understand which data has the most business impact.
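As a rough illustration of the usage insights described above, the sketch below counts dataset popularity and breaks consumption down by user type. The access-log rows, user names, and function names are invented for the example. A real catalog would expose this information through its own reporting features or API.

```python
from collections import Counter, defaultdict

# Hypothetical access-log rows: (user, user_type, dataset).
ACCESS_LOG = [
    ("ana", "Data Analyst", "sales_2023"),
    ("ben", "Data Analyst", "sales_2023"),
    ("ana", "Data Analyst", "customers"),
    ("cho", "Casual User", "sales_2023"),
    ("dev", "Data Steward", "customers"),
    ("cho", "Casual User", "sales_report"),
]


def dataset_popularity(log):
    """Count accesses per dataset, most popular first."""
    return Counter(dataset for _, _, dataset in log).most_common()


def usage_by_user_type(log):
    """Map each user type to a Counter of the datasets it accessed."""
    usage = defaultdict(Counter)
    for _, user_type, dataset in log:
        usage[user_type][dataset] += 1
    return dict(usage)


if __name__ == "__main__":
    print(dataset_popularity(ACCESS_LOG))
    for user_type, counts in usage_by_user_type(ACCESS_LOG).items():
        print(user_type, dict(counts))
```

When evaluating a vendor, it is worth checking whether the catalog captures the fields this sketch assumes (who accessed what, and in which role), since the report cannot be built without them.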

Metadata Management

Metadata often outlasts the data itself, persisting after the data is deprecated, replaced, or deleted. Some of the key components of metadata management are availability, quality, lineage, and licensing.

  • Availability: Metadata needs to be stored where it can be accessed, indexed, and discovered in a timely manner.
  • Quality: Metadata needs to have consistency in its quality so that the consumers of the data know it can be trusted.
  • Historical Lineage: Metadata needs to be kept over time to be able to track data curation and deprecation.
  • Proper Licensing: Metadata needs to contain proper licensing information to ensure proper use by the appropriate users.

Depending on your use cases and personas, some of the key components above will take priority over others. Ensure that your data catalog contains, collects and analyzes the metadata your business needs. During data catalog implementations, one feature I notice clients usually need is data lineage. If historical lineage of your data is a dealbreaker, this will help narrow down your data catalog search.
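The components above can be pictured as fields on a metadata record. The following is a minimal sketch, not any vendor’s data model: the `MetadataRecord` class, its field names, and the sample datasets are all hypothetical. It shows how lineage can still be traced after an upstream dataset has been deprecated, because the metadata is kept.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetadataRecord:
    name: str
    license: str                      # proper licensing
    quality_checked: bool             # quality
    derived_from: List[str] = field(default_factory=list)  # historical lineage
    deprecated_on: Optional[str] = None  # ISO date; record stays after retirement


def build_index(records):
    """Index records by name so they can be looked up quickly (availability)."""
    return {record.name: record for record in records}


def lineage(index, name):
    """All upstream dataset names, nearest ancestors first (breadth-first)."""
    seen = []
    queue = list(index[name].derived_from)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.append(current)
        if current in index:
            queue.extend(index[current].derived_from)
    return seen


RECORDS = [
    MetadataRecord("raw_orders", "internal", True, deprecated_on="2022-06-30"),
    MetadataRecord("clean_orders", "internal", True, ["raw_orders"]),
    MetadataRecord("customers", "internal", False),
    MetadataRecord("sales_report", "internal", True, ["clean_orders", "customers"]),
]

if __name__ == "__main__":
    index = build_index(RECORDS)
    print(lineage(index, "sales_report"))
    # Deprecated datasets remain traceable because their metadata is retained.
    print([r.name for r in RECORDS if r.deprecated_on])
```

If lineage is a dealbreaker for you, a useful demo question is whether the catalog can answer the equivalent of `lineage(...)` for a dataset whose sources have already been retired.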

Enterprise Scale

Enterprise scale is the capability for widespread use across multiple organizations, domains, and projects. Your data catalog will need to scale vertically with the amount of data that is ingested, as well as horizontally to continually serve new business ventures within your roadmap. Evaluate how you foresee your data catalog growing in the coming years. Vertical scaling reflects a need to continually add more data to the catalog, whereas horizontal scaling reflects a need to extend the reach of your data catalog to more users.

Visual diagram comparing vertical vs. horizontal scaling

Conclusion

Now that you have an idea of the criteria that are most important when selecting your data catalog vendor, it’s time to explore your options further. Take advantage of demos offered by data catalog vendors to get a feel for which catalogs are the right fit for your use cases and users. Carefully consider the pros and cons of each vendor’s platform and how it can meet the goals of your business. If a data catalog is the right fit for your business and you’re still not sure which one is right for you, reach out to us at Enterprise Knowledge and we can help you evaluate your use cases and recommend the right data catalog solution for you!

 

Ian Thompson is a Data Engineer with 4+ years of experience in data management, analysis, visualization and graph solutions development using Python, SQL, R, and Tableau. Ian specializes in relational and graph database management, system integrations, ETL, and problem solving.