The universal semantic layer may be the missing piece from your analytics puzzle you overlooked

Balázs Bertold, Havas Levente

What is a semantic layer? Why is it necessary for successful enterprise-level reporting and analytics? Why don’t you buy your gold jewelry from the goldmine? This last question may seem out of trend, but we promise it will make sense when you read on.

You may have already heard the phrase “data is the new gold”. It is a common phrase of webinars, analytics software sales pitches or basically any communication efforts which try to sell you the idea that 1) data is the answer to all your challenges and 2) and you are only a few easy clicks away from having perfectly shaped data work for you and your business. Even better, they say anyone can do it, no extensive data skills or statistical knowledge is required because self-service business intelligence can effortlessly be done by its users. You can be a business line employee, mid-manager or even a C-level business leader and you can still easily create your own data analyses…. it’s almost too good to be true, isn’t it?

Well, it is not that simple unfortunately, so we will break it down a bit. Let’s stick to the “data is gold” analogy for the rest of the article which fits quite well as you may see in the following paragraphs! Firstly, we would argue that data is not yet gold, at least not in a sense that we would like it to be. Data is just….well, data. Unfiltered, raw, and often way too technical for the human eye. Much like gold, without extracting it, refining it, transforming it, and then transporting it to the people, data won’t be able to create much value for your business. In its raw form, data is seldom usable by business users. So, data is rather like the ore / the rock which contains some gold, but I doubt you would buy a large piece of rock to wear it on your finger. Ideally, you want business insights (not just data), which are actionable, accurate, trusted and very importantly easy to understand by the business even without the help of IT or data engineers. That is the real gold pursued by firms nowadays.

Alright, so you need insight, not just data, which shouldn’t be that hard to obtain, right? After all, what would you do if you wanted to purchase a gold ring or any other gold jewelry? Would you go to the gold mine yourself? Of course not! You would normally go to a jewelry store, a place where the products are already prepared, cleaned, categorized etc. Okay, but obviously in the jewelry store you would like to learn the details of how your selected gold ring was transported there, get to know the complex chemical reactions involved in the making, like cyanidation, and you can’t wait to read some technical information, such as “75% Au, 25% Ag”. No? In our experience, most business users are not that interested in the sophisticated ETL (extract, transform, load) processes, the technical field names in the SQL database or the script level details of the corporate data pipeline. What they rather appreciate is a unified layer, an abstraction of complex data where they can find trusted, cleaned, and already modeled KPIs and metrics. Metrics which follow the logic of the company’s business operations and reduce the complicated data terms into common business terms like net profit, G&A costs, or Year-over-Year Revenue Growth. You simply want an 18-carat ring and the feeling that you can trust the jewelry’s source and quality without the need to know all the details and processes behind.

Hence, what is needed is a single and(!) simple source of truth, a jewelry store if you like, where you walk in as a regular, non-expert business user and find your insights based on metrics and dimensions which fit your thinking and everyday terminology. This is where Universal Semantic Layer (USL) enters the scene! But what is it anyway? Although, there is no single best definition (ironically), Kyligence’s blog post¹ sums it up quite well when stating:

„A semantic layer is a business abstraction derived from the technical implementation layer to uniformly maintain business logic”

In a 2020 report, Gartner⁶ highlights that in today’s complex business environments firms have a difficult time sharing their data assets and key metrics, which undermines the trustworthiness and development speed of reporting systems. Moreover, the sheer volume and variety of data makes it increasingly cumbersome to maintain a single platform for company data assets with common business definitions. This leads to data silos at the distinct business units / functions⁶, creating little isolated report islands. The problem is, in the absence of commonly agreed and shared definitions, the KPIs of these silos won’t be comparable or suitable for aggregation. This phenomenon is only accelerated by the presence of multiple self-service BI tools (mentioned earlier) and desktop version dashboards in organizations. At the same time, too much control and centralization would lead to an overwhelmed IT/BI department and growth in development times as not only would the company’s IT experts encounter capacity issues, but they would also lack the specific business knowledge to promptly create sophisticated reports.

Now we see the challenges. Let’s summarize them before trying to overcome them. In today’s chaotic and swiftly changing business landscape, firms face difficulties in BI and analytics because:

The volume and variety of available data assets in a company reaches vast amounts
Data in many cases is too technical and distant from the operations’ logic for business users to utilize it easily
The lack of common definitions slows the integration of business unit data, creating silos in the analytics, thus making it hard to maintain a single source of truth for decision making
The uncontrolled use of self-service BI undermines aspirations for integrated company-wide reporting

So how can a universal semantic layer, overarching the entire company, help us to solve these problems? During our research, we analyzed numerous semantic layer tools (along with data catalog and metadata management solutions) , read articles & reports regarding the topic, and having already worked with most of these solutions, we identified 7 key capabilities of USLs. In our opinion, these are the requirements a solution should meet to be called a universal (enterprise-wide) semantic layer. Now, we will continue with these key functions one-by-one to see how they assist business users and BI departments in tackling the aforementioned obstacles, which stand between them and their desired data-driven operations.

1. CONNECT “ANYTHING”, FROM UPSTREAM DATA SOURCES TO BI SOLUTIONS

A true USL should be able to gather data from virtually any mainstream data source such as relational databases, flat files, cloud-based sources, or big data processing applications. Yet, this is hardly a new requirement. Major BI platforms have been able to satisfy this need for ages and their list of data connectors is continuously growing. On the other hand, so far there have been relatively few platforms which allowed other BI applications to connect to their semantic layers (if they had one in the first place). This is fairly understandable from the perspective of BI vendors who probably don’t want their competitors’ front-end dashboard apps to connect to their model. Some BI platforms such as MicroStrategy or Oracle BI are more willing to open their “built-in” semantic models for other analytics tools such as Tableau, Qlik and so on. This way, distinct dashboard tools can connect to a single, centrally governed data model. We distinguish these solutions from independent semantic layer vendors like AtScale, which do not have their own front-end. These “independent” firms can focus solely on the semantics and virtual modeling aspect and aim to make their product usable by as many dashboard and analytics apps as possible. In conclusion, the biggest added value of a USL in terms of connectivity is the ability to serve multiple downstream applications, if necessary, while providing the same semantic model and governance for all of them.

2. PROVIDE ADVANCED MODELING FEATURES TO UNIFY DISTINCT DATA SOURCES

Occasionally, a central DWH (Data Warehouse) can accumulate all our company data and serve as a single source for the reporting; however, more often there are numerous different sources where our data assets are created and stored. ERP, CRM, IoT sensor data and many other types of sources can be scattered around the firm in cloud-based or on-premise systems. But what happens for example if you want to see a complex analysis regarding the efficiency of your manufacturing plant, which involve cost data from ERP and IoT data from factory line sensors? Well, even if the systems of our interest are well established and coherent per se, chances are high that they have substantial difference when compared to each other. Maybe the data types or field names that have the same business meaning are different, what is called “Factory_Line_ID” in one source could be “lineID” in another. Or it can be that their level of data aggregation differs, e.g.: cost data might be available only daily or monthly while you have near real time sensor data. To solve these issues and bring all data to a common denominator, a USL must have sophisticated modeling capabilities including JOIN, UNION operators to combine data from distinct sources and transforming functions which help map together attribute elements which have the same meaning but are named differently in the sources. Not only should the USL have advanced virtual modeling capabilities, but it also has to be flexible to adapt quickly to changing business rules or newly integrated data sources. If, instead of parameters and easily changeable conditions, your model contains hard coding, you risk the building of a solution that is difficult to maintain or further develop. So, USL is beneficial because users are able to access data from several sources but in a modeled and consolidated manner.

3. USE SHARED DEFINITIONS BASED ON COMMON BUSINESS TERMS AND THE REAL LOGIC OF OPERATIONS

Based on our BI consulting experience, one of the most common challenges we tackle when assessing a BI environment at a partner company is finding out what KPIs and metrics mean and how are they calculated. As a matter of fact, it’s not only a problem for the external parties, but often people who actually work for that company can have a hard time figuring out what 5,4% “YoY_oper_profit_var_dw2” means on the dashboard. This is due to the following factors:

KPI and metric calculations are often carried out by different employees and divisions over time and each time is a chance of failing to replicate the calculation logic perfectly
In more robust BI environments, there are hundreds of metrics and it’s simply hard to keep track of them without a data dictionary of some sort
Siloed operations can mean siloed definitions among business units for KPIs if there is no mutual platform to share the business rules
Use of technical names of objects (originating from the DWH or transactional sources) makes it hard for business users to understand their data as these field names don’t necessarily describe the business term they cover

A semantic layer mitigates these issues by creating a shared platform where all definitions and calculations of metrics are transparent. Objects can be grouped in subject area folders based on their business relevance instead of their physical source. Furthermore, the naming convention of metrics and dimensions in the USL follows the nomenclature of the business itself and not the ones of the storage systems. Using everyday business terms, object descriptions, and aliases, the semantic layer translates the complexity into familiar definitions for the end users.

So instead of this à “YoY_oper_profit_var_dw2”

You get this

Metric Name: “Year-over-Year Operating Profit Change”
Definition: Compares the operating profit of the chosen period to the same period in previous year.
Unit of measurement: %
Formula: (([Actual Op Profit] – [Last Year Op Profit]) / [Last Year Op Profit]) * 100

4. ENSURE REUSABILITY OF OBJPREVIOUSLY USED OBJECTS

This one may seem obvious, yet there countless counterexamples we have seen in corporate BI environments. Building the same items repeatedly, creating the same KPI 4 times by 4 different people, effectively increasing the number of objects, increasing redundancy, and finally increasing the chances of making mistakes in the process. So, let’s imagine Kate, one of the analysts in your organization who made an amazing business metric calculation where everything adds up, numbers are good, the calculation is efficient. Unfortunately, she did this calculation within a particular dashboard and not in a central repository where it is accessible to others. Therefore, 3 months later when Bob, another analyst needs such a KPI, unaware of the KPI’s existence, he develops it; again, only in his own BI report. Can you spot the risks and inefficiencies here?

The metric calculation had to be figured out again, which takes time, incurring extra costs
Nothing guarantees that the 2 KPIs will have the same results, as they may use different sources or different methods of calculation

A universal semantic layer mitigates this risk by keeping objects in a central, accessible platform so it is ready to be used in not only one but many reports and dashboards, hence decreasing development times. In addition, it ensures that all published BI content uses the same definitions and formulas as the objects themselves (the building blocks of reports like dimensions, metrics, filters, prompts, etc.) are technically the same in the repository. AtScale CTO Matthew Baird said:

“Without some level of abstraction, business is beholden to IT to generate and run reports or risk making big, costly, and worst of all, hidden mistakes. Can you afford to have each of your employees independently trying to replicate this logic correctly in their spreadsheets and reports?”

All in all, reusability is not only about efficiency, but also about accuracy and consistency. Just think about how frustrated you would be to see 42 million in one of your sales reports and 44,5 million in a different one! It may be a justifiable difference, but presumably in some C-level executives it would raise a little doubt about the credibility of the company’s BI system.

5. CONTROL DATA ACCESS AND SECURITY ON THE DATA LEVEL, RATHER THAN AT THE ANALYTICS

Who has the rights to see into our numbers? Can we show the users only the data which is necessary for their job? But what if “necessary” differs from user to user? Who should be able to open our dashboards anyway?

Similar questions will come up sooner or later if an enterprise reaches a certain analytics maturity and volume of BI content. Good news is, the USL can be a major contributor to a centralized, yet flexible access control system. As mentioned before, large firms can have several applications and business intelligence tools utilizing the data assets of the company. Constraining user access rights in each of those individual tools independently can result in a complex and unsustainable governance model. Yet, if the data used by those platforms are in a central virtual model, it is feasible to control data usage from that single central repository. Either if you wish to “hide” certain KPIs or datasets from particular users, or you want to apply Row Level Security (RLS) so people see the same dataset but different segments of it, a USL can assist the organization to do so.

A simple example could be that the financial dataset and its KPIs should only be accessible to the finance department and the C level executives. This can be achieved either by granting access to only those users or by creating a new user group for those people while leaving any other user out of the group.

But what if we have more sophisticated access control requirements? Imagine we have a detailed sales dataset with many relevant KPIs for every business region! We would like the executive management to see the sales report for every region, but the regional managers to have access only to the figures of their own region. Should we create multiple focused datasets and make them available to the different managers? No, this wouldn’t be efficient or sustainable. More dataset objects would be used and any change in the report or data structure must be addressed in several datasets, which takes more effort and leaves more room for mistakes. Instead, only 1 dataset and 1 report / dashboard are enough if we apply Row Level Security on the data itself in the USL. This way, the region dimension acts as a dynamic filter, which is applied based on who is opening the report. If the CFO opens it, all data is available for every region, but if the southern region manager opens it, only the southern region sales data is displayed. This is all controlled in one place, in the semantic layer so it doesn’t matter what dashboard tool sits on top of it.

6. OFFER HIGH-PERFORMANCE WITH QUERY OPTIMIZATION

Naturally, no one likes slow reports and lengthy loading times; therefore, a USL must ensure that all report and dataset refresh runs are optimized at the query level. As mentioned before, one of the core strengths of a semantic layer is to connect to multiple data sources and sort them into integrated logical models. Therefore, when these logical models encounter a request to create a dataset that contains financial data from the ERP and some operational data from another data source, the USL should launch an SQL or SQL-like query to obtain the data from these sources. This query puts together the complex dataset we need so we can use it in our dashboard reports. Since data sources are heterogeneous, a complex query may be needed to bring together all the data with the appropriate filter conditions, suitable aggregation level, and resource-intensive JOIN conditions; hence, it is important that all this is done in a fast and cost efficient manner.

This can be achieved in multiple ways ranging from simple query optimization (preferring certain less resource intensive operations over others) all the way to machine learning driven optimization, which analyzes the sources and the usage patterns of report consumers to make data loading smarter and faster.

7. ENABLE TRANSPARENCY, CLARITY, AND SERVE AS THE SINGLE SOURCE OF THE TRUTH WITHIN THE ORGANIZATION

Finally, the USL must be an ambassador of transparency regarding the data assets. It is only half the job to have shared and central definitions. People need to trust those metrics, filters, and dimensions before they can use them. A good semantic layer therefore satisfies two criteria to enable trust and transparency:

It logs every event and action affecting the data objects. If a metric calculation is changed, a log is created. If a dimension gets a new alias, a log is created. If the “Product Family” dimension object is renamed to “Product Category”, a log is created. It is important to show business users that changes are tracked and traceable, so they are presented with a system where admins and architects are accountable for their actions in the schema.
A data stream helps trace any object to its dependents and components. If the “Where is this number coming from?” question arises in a manager looking at a dashboard, it is vital that the figure can be easily tracked all the way back to its original source.

In an enterprise BI environment where several analytics tools, data sources, and KPIs are present, offering this kind of transparency enables data consumers to trust the figures, while also helping the analysts and content creators to understand and reuse their own and each other’s work.

CONCLUSION

So, can a corporate BI function without a universal semantic layer? Yes, it can. We have seen numerous examples of this. Yet, people should bear in mind that after surpassing a certain complexity and size, it becomes increasingly more difficult to maintain consistency and order in such a BI environment. Surely, constructing a good semantic layer takes planning, time, and costs, but so does everything else. In our opinion, it doesn’t make much sense to spend enormous amounts on databases and front-end tools if the outcome is not user friendly and trusted by the organization. So, walk that extra mile, and put that last piece of the puzzle into your BI picture! If implemented correctly, a universal semantic layer may bridge the gap between business and technology for everyone’s benefit. Don’t forget, even if it isn’t obvious yet, that what you need is not raw gold from the mine, but a quality assured and refined jewelry, a quality assured and refined Business Insight!