The Glossary Is the Start: Building the Context Layer That Makes AI Work in Financial Services

One of the most common conversations I have with teams at financial institutions is whether they should bother with a glossary. These conversations happen across the spectrum of firms. In the past, functioning without shared semantic knowledge was a nuisance. In the AI era, it is a disqualifier. A glossary is the essential first step in building a context layer that transforms raw data into something AI systems and analysts can actually trust and reason over.

Let’s explore the value of glossaries and how companies can progress from simple beginnings to substantial value.

Dictionaries: Eliminating the cost of ambiguity

The financial services industry loves its arcane terminology. Every business line has developed its own dialect over decades. Understanding that terminology unambiguously is critical for doing business. That’s a big challenge for anyone outside of or new to a business.

A dictionary fixes that. It ensures everyone is speaking the same language, accelerates onboarding, and prevents the miscommunication that quietly costs firms time and money every day.

Data dictionaries: Turning language into discoverable, contextual knowledge

A dictionary becomes significantly more valuable when connected to data. New hires work with data in many forms: input fields in a trading app, tables on a screen, columns in a spreadsheet. They’ll be ineffective if they can’t relate that data back to the language of their business. Making that connection promotes a dictionary into a data dictionary.

The deeper value is in the databases. I can say from experience that huge portions of databases are completely undocumented, known only through tribal knowledge. The amount of time and effort wasted discussing exactly what data is stored in a table or column is truly staggering. If companies understood just how much time and money those conversations cost, they’d be breaking down doors to solve the issue. Linking each field to its term in the data dictionary eliminates that waste and sets the firm up for efficient discovery.
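
To make this concrete, here is a minimal sketch of what linking a physical column to a glossary term might look like. The class names, database, and column are purely illustrative, not any particular product’s model.

```python
from dataclasses import dataclass

@dataclass
class GlossaryTerm:
    name: str
    definition: str
    domain: str

@dataclass
class ColumnBinding:
    database: str
    table: str
    column: str
    term: GlossaryTerm

# Hypothetical example: a cryptically named column gains meaning the moment
# it is bound to the business term it carries.
fico = GlossaryTerm(
    name="FICO Score",
    definition="Consumer credit score produced by the FICO scoring model.",
    domain="Retail Credit",
)
binding = ColumnBinding(database="loans_db", table="applications", column="cr_scr_1", term=fico)
```

Once every column carries a binding like this, “what exactly is in this column?” becomes a lookup instead of a meeting.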

Governed terms: Enforcing quality and managing risk at scale

Linking terms to data is just the beginning. The next level of value comes from associating constraints, rules, and logic with those terms, creating a single source of truth that actively enforces quality and reduces risk across the entire data estate.

Many terms carry clear constraints. Currency has a finite set of values defined by ISO 4217. FICO Score runs from 300 to 850. Whether a rate is stored as a decimal or a percentage is a technical decision with real consequences. I’ve seen that discrepancy cause serious problems that went unnoticed for years. It’s common to see business teams build elaborate override systems because they have no trust in the data they’re being given.
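
As a rough sketch of what associating constraints with terms can mean in practice, here is one way to attach a validation rule to each governed term. The terms, ranges, and four-currency subset are illustrative; a real deployment would source these rules from the glossary itself.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List

@dataclass
class GovernedTerm:
    name: str
    check: Callable[[Any], bool]  # rule every value claiming this term must satisfy

TERMS = {
    # Subset of ISO 4217 currency codes, for illustration only.
    "Currency": GovernedTerm("Currency", lambda v: v in {"USD", "EUR", "GBP", "JPY"}),
    # FICO scores are defined on a 300-850 scale.
    "FICO Score": GovernedTerm("FICO Score", lambda v: isinstance(v, int) and 300 <= v <= 850),
    # Rates stored as decimals, not percentages, so 0.05 passes and 5.0 fails.
    "Interest Rate": GovernedTerm("Interest Rate", lambda v: isinstance(v, float) and 0.0 <= v <= 1.0),
}

def violations(term_name: str, values: Iterable[Any]) -> List[Any]:
    """Return the values that break the rule attached to a governed term."""
    term = TERMS[term_name]
    return [v for v in values if not term.check(v)]
```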

With that information centralized, the payoffs compound:

  • Data quality at scale: A single source of truth makes inconsistencies detectable across the entire landscape. That decimal vs. percentage problem becomes a scan, not a years-long mystery (see the sketch after this list).
  • Breach response you can execute: Regulations require firms to know precisely what data was exposed in a security incident. When sensitivity classifications and PII designations are centralized in governed terms, that question has a fast, accurate answer. Without it, breach response is a frantic manual exercise across systems no one fully understands.
  • CDE discovery without the inventory: Critical Data Elements are business designations, meaning the data your firm has formally identified as most important to its operations and regulatory obligations. Link them to governed terms and every field carrying a CDE surfaces automatically; the set can be refined further to only those flowing into critical reports.
  • Regulatory proof on demand: Many terms have specific calculations associated with them. Centralizing that logic lets you verify that implementations across stored procedures and downstream systems match the business specification, and produce that evidence for regulators without a fire drill.
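
Continuing the sketch above, the landscape-wide scan mentioned in the first bullet is little more than applying each column’s governed-term rule to the data behind it. The catalog entries and the read_column callable below are placeholders for whatever bindings and access layer a firm actually has.

```python
# Hypothetical term bindings for a handful of columns across two databases.
CATALOG = {
    ("loans_db", "applications", "fico"): "FICO Score",
    ("loans_db", "pricing", "rate"): "Interest Rate",
    ("payments_db", "wires", "ccy"): "Currency",
}

def scan(read_column):
    """read_column(db, table, column) -> iterable of values, supplied by the caller.

    Reuses violations() from the earlier sketch to flag every column whose
    contents break the rule attached to its governed term.
    """
    report = {}
    for (db, table, column), term_name in CATALOG.items():
        bad = violations(term_name, read_column(db, table, column))
        if bad:
            report[(db, table, column)] = bad
    return report
```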

Taxonomies: Organizing knowledge within and across domains

A dictionary becomes more powerful when terms are organized into hierarchies. A securities lending trader and a mortgage broker both have Rate terms in their domains. The terms are conceptually related but not interchangeable. A taxonomy establishes that both are narrower instances of a broader Interest Rate concept. This lets each business line manage its own language independently while making firm-wide relationships explicit.
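
A taxonomy doesn’t require exotic machinery. Even a simple broader-than mapping, sketched below with illustrative term names, makes the firm-wide relationship explicit while each business line keeps its own term.

```python
# Illustrative broader/narrower relationships between business terms.
BROADER = {
    "Mortgage Rate": "Interest Rate",
    "Securities Lending Rate": "Interest Rate",
    "Interest Rate": "Rate",
}

def ancestors(term: str):
    """Walk up the hierarchy so firm-wide questions can match domain-specific terms."""
    while term in BROADER:
        term = BROADER[term]
        yield term

# list(ancestors("Mortgage Rate")) -> ["Interest Rate", "Rate"]
```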

The same applies across external terminology. Feeds from Bloomberg, Refinitiv, and FactSet each arrive with terminology that overlaps with internal definitions. A taxonomy maps incoming vendor terms to your own systematically, resolving ambiguity at ingestion rather than leaving it to individual developers.
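
The same structure handles vendor vocabularies. A mapping like the sketch below, in which the vendor field names are invented for illustration, resolves each incoming field to an internal term once, at ingestion, instead of in every developer’s head.

```python
from typing import Optional

# Hypothetical vendor-field-to-internal-term mapping, applied once at ingestion.
VENDOR_TO_INTERNAL = {
    ("Bloomberg", "CPN"): "Coupon Rate",
    ("Refinitiv", "COUPON_RATE"): "Coupon Rate",
    ("FactSet", "coupon"): "Coupon Rate",
}

def resolve(vendor: str, field: str) -> Optional[str]:
    """Return the internal term for a vendor field, or None if it still needs mapping."""
    return VENDOR_TO_INTERNAL.get((vendor, field))
```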

A firm’s Balance Sheet under US GAAP and Statement of Financial Position under IFRS are the same concept. A taxonomy that says so makes it possible to calculate the same values consistently across regional regulatory reports, like LCR, NSFR, and others, without rebuilding the translation layer for each jurisdiction.

This is precisely what BCBS 239 demands. A firm with a common taxonomy maps once and reports many times. A firm without it rebuilds that translation layer for every new requirement. A well-structured taxonomy is foundational context infrastructure: it ensures that the same question asked across two business lines, two vendors, or two regulatory regimes returns a consistent answer.

A different kind of step

What we’ve discussed so far is semantic work that aims to make humans more effective. The value is communicative and organizational. Machines benefit indirectly, but the primary beneficiaries are people. The distinction between this and what follows is formal inference. A taxonomy documents that Mortgage Rate and Sec Lending Rate are both instances of Interest Rate. An ontology specifies that relationship with enough logical precision that a reasoner can derive conclusions that were not explicitly asserted.

It’s fair to ask whether that’s still needed when LLMs can traverse taxonomic relationships and answer many questions that previously required formal ontologies. But LLM inference is probabilistic, not guaranteed, so it doesn’t produce the kind of auditable chain of reasoning that regulated industries often require. Formal ontologies produce conclusions that are deterministic and reproducible every time.

Ontologies and knowledge graphs: Enabling machine reasoning

Formal ontologies encode what things mean in relation to each other with enough precision to support automated inference. That elevates a semantic layer from a reference tool to a reasoning engine. Even better, these ontologies can be layered on top of the existing semantic taxonomies without disturbing them at all.

A few examples of the concrete business value:

  • Firm-wide exposure analysis on demand: When rate terms are formally linked to benchmarks and customer positions are connected to those terms, “which customers are exposed to a SOFR event?” becomes a single query, not a weeks-long manual analysis across siloed systems. The answer is derived directly from relationships asserted in the ontology (see the sketch at the end of this section).
  • Proactive regulatory risk scoring: Infer which regulatory reports are most at risk by recording which systems are the source of record for key terms and connecting those systems to their data quality metrics. That lets you see which systems are creating the most risk before regulators do.
  • Automatic scope determination: Identify which instruments, entities, and data elements fall within the scope of a regulation based on their formal relationships rather than relying on cumbersome and error-prone manual tagging.
  • Auditability on demand: Every conclusion is traceable to the rules and data that produced it, giving regulators a clear chain of reasoning on demand.
  • Data estate intelligence: Identify redundant datasets, trace data flows, and surface coverage gaps automatically across the entire landscape.

It is this kind of cross-entity, rules-driven inference that makes knowledge graphs central to Anti-Money Laundering programs, where connecting relationships across customers, accounts, transactions, and entities at scale is the difference between detecting a pattern and missing it entirely.
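
To make the exposure example concrete, here is a minimal sketch using rdflib. The namespace, properties, and the single customer are invented for illustration; the point is the query shape, which walks customer to position to instrument to benchmark in one pass.

```python
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.com/finance#")  # illustrative namespace
g = Graph()

# A tiny, hypothetical slice of the knowledge graph.
g.add((EX.cust_42, RDF.type, EX.Customer))
g.add((EX.cust_42, EX.holdsPosition, EX.pos_1))
g.add((EX.pos_1, EX.onInstrument, EX.loan_7))
g.add((EX.loan_7, EX.referencesBenchmark, EX.SOFR))

# "Which customers are exposed to a SOFR event?" as a single query.
EXPOSURE = """
SELECT DISTINCT ?customer WHERE {
  ?customer a ex:Customer ;
            ex:holdsPosition ?position .
  ?position ex:onInstrument ?instrument .
  ?instrument ex:referencesBenchmark ex:SOFR .
}
"""
for row in g.query(EXPOSURE, initNs={"ex": EX}):
    print(row.customer)
```

Layering ontology axioms on top, for example declaring several rate-linkage properties as sub-properties of referencesBenchmark, would let a reasoner broaden that same query without rewriting it.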

The AI imperative

Every business-specific large language model your firm deploys will eventually hit the same wall: it knows language, but it doesn’t know how you run your business. It cannot reliably reason about your specific data, your specific terms, or your specific regulatory obligations without that precise foundation. So while it’s tempting to skip the ontology and lean on LLMs alone, they are a poor substitute where precision and auditability are mandatory.

The good news is that it’s not a binary choice, because LLMs are very good at helping build and maintain ontologies. Extracting concepts from years of accumulated documentation, proposing relationships, and resolving cross-domain ambiguities into a formal specification are exactly the tasks that made ontologies so difficult to scale. LLMs bring all of this into the realm of the practical. So the answer isn’t replacing ontologies with LLMs; it’s using LLMs to do the heavy lifting of building and maintaining them. That lets firms build their own ontological identity at enterprise scale, bringing the agents closer to the firm instead of the firm closer to the agents.
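
As a sketch of what using LLMs to do the management can look like, the function below drafts candidate terms and relationships from raw documentation and leaves the final say to a human steward. The prompt and the expected JSON shape are illustrative, and the model call is injected as a callable so the sketch isn’t tied to any particular SDK.

```python
import json

PROMPT_TEMPLATE = """You are helping build a financial-services business glossary.
From the documentation below, propose candidate terms as a JSON list of objects
with keys "term", "definition", and "broader" (the parent concept, if any).

Documentation:
{doc}
"""

def propose_terms(doc_text, complete):
    """Draft glossary terms and relationships from accumulated documentation.

    `complete` is any callable that sends a prompt to an LLM and returns its text
    response. The proposals are drafts only: a data steward reviews them before
    anything enters the governed glossary or ontology.
    """
    raw = complete(PROMPT_TEMPLATE.format(doc=doc_text))
    return json.loads(raw)
```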

Wrapping up: The knowledge ladder

Semantics

Dictionaries

  • Miscommunication costs money; shared language eliminates it.
  • People spend less time learning and more time producing.

Data dictionaries

  • Stop paying people to repeat the same arguments about data.
  • Data becomes findable and trustworthy.

Governed terms

  • Bad data gets caught before it becomes a bad decision.
  • Compliance evidence exists for audits.

Taxonomies

  • The same question gets the same answer across different business units.
  • Vendor and regulatory data stops requiring manual reconciliation.

Machine reasoning

Ontologies and knowledge graphs

  • Complex inference questions are directly queryable.
  • The firm becomes understandable as a whole, not just as a collection of documented silos.

Context: The AI imperative

  • LLMs build ontologies at scale. Ontologies make LLM reasoning trustworthy. Neither is sufficient alone.
  • The firms that understand this relationship are ahead.

What’s next

Now that we’ve established the value, the next step is implementation. A good starting point is understanding where your firm sits on this ladder today: dictionary, data dictionary, governed terms, taxonomy, or knowledge graph. That assessment will be our starting point for the next article.

FAQs

What is the context layer in financial services?

The context layer in financial services is the governed semantic foundation that makes data interpretable by humans and machines. It includes the glossary, data dictionary, governed terms, taxonomy, and ontology that together let humans and AI agents reason over the firm’s data with consistent understanding. It is the layer regulators implicitly assume exists and that most firms are still building.

How is a context layer different from a data catalog or a semantic model?

A traditional data catalog inventories what data exists and who owns it. Semantic models typically sit inside a BI or analytics tool and standardize metric definitions for that tool’s users. A context layer is broader than either: it unifies technical metadata, business glossary, governed terms, lineage, and the relationships between them into a single foundation that serves every downstream consumer, including AI agents. A data catalog is one input to a context layer, not a substitute for one.

Why does financial services need a context layer more than other industries?

Financial services combines three pressures that magnify the cost of ambiguous context: dense industry-specific terminology that varies by business line; strict regulatory requirements for traceability and explainability; and high-stakes decisions where wrong answers carry direct financial and legal consequences. Most other industries can absorb a certain amount of semantic drift. Financial services usually cannot.

What role does a knowledge graph play in the context layer?

A knowledge graph encodes the relationships between entities (customers, accounts, transactions, instruments, regulations, terms) and lets machines reason over those relationships at scale. In financial services, this is what makes firm-wide exposure analysis, AML pattern detection, and automatic regulatory scope determination possible. It is the rung of the context layer that turns a structured glossary into actual contextual intelligence.

How does a context layer support BCBS 239 and regulatory reporting?

BCBS 239 and similar regulations require firms to demonstrate control over data lineage, definitions, and quality across reports. A governed context layer makes that demonstrable rather than aspirational. Once business terms, calculations, and source-system relationships are centralized, the same definitions can be mapped once and reused across LCR, NSFR, and other jurisdictional reports without rebuilding the translation layer for each requirement.

Is a retrieval engine the same thing as a context layer?

No. A retrieval engine is a runtime component that pulls data from somewhere and hands it to a model. It does not define what the data means, understand the historical context, govern who can access it, or enforce the constraints that regulators expect. Without an upstream context layer, retrieval over an ungoverned data estate produces fluent answers about ambiguous fields, which is exactly the failure mode financial services firms cannot afford.

Why does enterprise AI fail in financial services?

Enterprise AI fails in financial services for the same reason it fails everywhere else, only with higher stakes: the model has no reliable way to reason about what the firm’s data actually means. Without a governed semantic foundation upstream, retrieval pulls back ungoverned fields, the model produces fluent but ambiguous answers, and the output cannot be defended to a regulator or an auditor. Solving the model is not the fix. Solving the context layer is.

How does DataHub provide the context layer?

DataHub unifies technical metadata, business glossary, governed terms, documentation, and lineage into a single context graph that serves humans and AI agents from the same source of truth. Its event-driven architecture keeps the layer synced with 100+ data sources, and its Model Context Protocol (MCP) server and native integrations expose the governed context graph directly to the AI tools financial services teams are already deploying. The result is one governed foundation underneath every downstream initiative, instead of five disconnected ones.