When you think of your data warehouse, the "semantic layer" may not be the first thing that pops into your mind. Prior to reading Frances O'Rafferty's blog post on this topic, I didn't even know this was a concept that mattered in the data stack. To be honest, the concept is still a bit confusing to me since I'm not building data warehouses and data products all day. Frances grew up in northern England and studied mathematics during the recession. The decision to jump into data was a function of what jobs happened to be available at the time. Frances worked through a variety of data warehousing, BI, and ETL roles before moving into the data management space: data modeling, cataloging, and more. This conversation is a deep dive into the world of data warehousing, data catalogs, and, of course, the semantic layer.
Enforcing data warehouse conformity for an insurance company
Imagine an insurance company where the policies are in two different systems. Which database contains the "right" policy for a customer? This is the mess Frances had to deal with when she helped build out the insurance company's data warehouse. What I found interesting is that Frances' team looked at the source data and then interviewed people in the business to understand how the data is generated and how it is used. The questions she asked were pretty high-level:
- What do you do on a day-to-day basis?
- What works well and doesn’t work well?
- What would you like the data to do?
Data quality validation checks and global lookups were set up so that if a new piece of data entered the warehouse and didn't match, an administrator would get an alert. They would then have to figure out how to make that rogue piece of data fit the rules that had been set up.
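Here's a minimal sketch of what that kind of check might look like. The lookup values, field names, and alerting are all hypothetical (the episode doesn't describe the actual implementation); the point is just that incoming records are compared against agreed rules and anything that doesn't match gets flagged for a human.

```python
# Hypothetical global lookup of allowed values (not from the episode)
VALID_POLICY_TYPES = {"AUTO", "HOME", "LIFE"}

def validate_record(record: dict) -> list[str]:
    """Return a list of rule violations for one incoming record."""
    issues = []
    if record.get("policy_type") not in VALID_POLICY_TYPES:
        issues.append(f"Unknown policy_type: {record.get('policy_type')!r}")
    if record.get("premium", 0) < 0:
        issues.append("Premium cannot be negative")
    return issues

def load_batch(records: list[dict]) -> None:
    for record in records:
        issues = validate_record(record)
        if issues:
            # A real warehouse would likely write to a quarantine table or
            # send an alert to the administrator rather than print.
            print(f"ALERT for record {record.get('policy_id')}: {issues}")
        # else: insert the record into the warehouse

load_batch([
    {"policy_id": 1, "policy_type": "AUTO", "premium": 500},
    {"policy_id": 2, "policy_type": "BOAT", "premium": -10},
])
```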
A methodology Frances brought up that I'd never heard of before is the Kimball methodology for setting up a data warehouse or BI system. The main tenets of the methodology are basically how modern data warehouses are set up: add business value, structure data with dimensions, and develop the warehouse iteratively. This is an image of the lifecycle from their website:
Focusing on different layers of the warehouse “stack”
Frances' team first focused on the data source layer and tried to figure out where all the data came from. After that came the consolidation layer, which is where the data gets split into facts and dimensions.
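To make the facts-and-dimensions idea concrete, here's a small sketch in pandas. The table and column names are made up for illustration; it just shows a flat feed being split into a customer dimension (descriptive attributes, one row per customer) and a premium fact table (measures keyed back to that dimension).

```python
import pandas as pd

# Hypothetical flat feed from a source system
transactions = pd.DataFrame([
    {"policy_id": "P1", "customer": "Acme Ltd", "region": "North", "premium": 500.0},
    {"policy_id": "P2", "customer": "Acme Ltd", "region": "North", "premium": 750.0},
    {"policy_id": "P3", "customer": "Beta plc", "region": "South", "premium": 300.0},
])

# Dimension: one row per customer, with a surrogate key
dim_customer = (
    transactions[["customer", "region"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
dim_customer["customer_key"] = dim_customer.index + 1

# Fact: one row per transaction, referencing the dimension by its key
fact_premium = transactions.merge(dim_customer, on=["customer", "region"])[
    ["policy_id", "customer_key", "premium"]
]

print(dim_customer)
print(fact_premium)
```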
I figured even for a data warehouse project, Excel must come into play at some point. Excel was used for all the modeling to figure out what the dimensions and facts were. It wasn't a core part of the warehouse itself, but it served as a one-time tool during the development of the warehouse.
The final layer is the target layer, where we get more into the business intelligence realm. The insurance company wanted to see the data in different ways, so Frances' team had to create different views of the data to answer questions like: What premiums have we received? What transactions have come through? The actuarial team wanted to see the balance on an account, so another view was created for them.
Frances noticed that different regions would call the data different things even though they were all referring to the same concept. There wasn't a system to translate what a metric like gross revenue meant in one region into what it meant in another. This foreshadows the semantic layer concept Frances wrote about.
Cataloging 5,000 data attributes for an investment bank
Data catalog tools can get expensive (and rightly so) if you are managing thousands of data attributes and definitions. As I discussed in episode #129 with Jean-Mathieu, if you only have a handful of attributes, using Excel or Google Sheets is completely doable as a data catalog.
The investment bank Frances was working for had many different source systems, KPIs, and measures that the entire bank was trying to get alignment on. Spread this across a variety of financial products and the team came back with 5,000 attributes to put into a data catalog. The challenge was understanding the requirements from the finance, risk, and treasury departments to create a catalog that could be shared internally across the entire bank.
Frances' team looked at the taxonomy first across loans, customers, and risk. They had an existing glossary and compared it with the new taxonomy. The main tool they used for the data catalog was Collibra. With this new catalog, new publishers of data had to abide by a strict format dictated by the catalog.
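For illustration, here's a hypothetical sketch of what a "strict format" for a catalog entry could look like. The fields and example values below are assumptions for this post, not Collibra's actual schema or the bank's real attributes.

```python
from dataclasses import dataclass, fields

# Illustrative catalog entry structure -- field names are hypothetical
@dataclass
class CatalogEntry:
    attribute_name: str
    business_definition: str
    source_system: str
    data_type: str
    owner: str
    example_value: str

    def validate(self) -> None:
        """Enforce the 'strict format': every field must be filled in."""
        for f in fields(self):
            if not getattr(self, f.name).strip():
                raise ValueError(f"Catalog entry is missing: {f.name}")

entry = CatalogEntry(
    attribute_name="gross_loan_exposure",
    business_definition="Total outstanding loan balance before collateral offsets.",
    source_system="loans_core",
    data_type="decimal(18,2)",
    owner="Risk",
    example_value="1250000.00",
)
entry.validate()
```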
After one year and talking with 150 people, they finally launched the data catalog to the entire investment bank. I asked Frances how her team was able to best understand what different data attributes meant. The answer is just as you would expect: she asked people within the bank to send an example of the data attribute and how it’s being used.
Translating data for enterprise consumers with the semantic layer
Back to Frances' original post about the semantic layer: historically it is "trapped" in a BI tool, according to Frances. When Frances first started using SAP, there was a BusinessObjects universe in SAP which allowed you to create joins between tables and define data attributes. But these rules and definitions only existed in SAP.
Today, the semantic layer can show up in places like dbt, Collibra, graph databases, and more. There isn’t a “semantic layer vendor” that does it all (which is the first question I asked Frances about her blog post). The key takeaway is that the raw data in the data warehouse needs to be translated/converted into something usable by consumers within the enterprise. Frances said this translation is usually needed with legacy applications.
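At its simplest, that translation is a mapping from business-friendly names to physical tables and expressions. The sketch below uses hypothetical table and column names to show the idea; real semantic layers (dbt metrics, BI tool universes, and so on) express this declaratively rather than in Python.

```python
# Hypothetical mapping from business metric names to physical warehouse objects
SEMANTIC_LAYER = {
    "Gross Written Premium": {
        "table": "fact_premium",
        "expression": "SUM(premium_amount)",
    },
    "Policy Count": {
        "table": "fact_policy",
        "expression": "COUNT(DISTINCT policy_id)",
    },
}

def to_sql(metric_name: str) -> str:
    """Translate a business metric name into a physical query."""
    metric = SEMANTIC_LAYER[metric_name]
    return f"SELECT {metric['expression']} FROM {metric['table']}"

print(to_sql("Gross Written Premium"))
# -> SELECT SUM(premium_amount) FROM fact_premium
```

The consumer only ever sees "Gross Written Premium"; the semantic layer decides which table and calculation that actually points to.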
This is a good diagram of where the semantic layer sits within the data stack:
The next question to answer is: who owns the semantic layer? According to Frances, this also depends on your data team. It could be the data governance team, data management team, or even the data visualization team. If you’re looking at ownership from the perspective of product management, it would be the product owner. At the end of the day, it’s the team that is working with the people who are consuming your organization’s data.
Other Podcasts & Blog Posts
No other podcasts or blog posts mentioned in this episode!