The hub-and-spoke data warehouse model uses a centralized warehouse feeding dependent data marts.

Conventional wisdom states that you should build a data warehouse (DW) and then populate data marts from it. This data architecture is called hub and spoke. Its central premise is that the heavy-duty data integration work is completed as you move data from source systems into the DW. Once that integration is done, you can propagate, or "franchise," the data from the DW into the data marts. Data franchising takes DW data and packages it, via filtering, aggregation and transformations, into data usable by business intelligence (BI) tools and understandable to business users (see Figure 1). This approach works extremely well, but lately I am concerned that it is being abandoned. I think this is happening because people have forgotten its fundamental aspects and why it works better than the alternatives. Due to advances in databases, storage, networks and servers, it is often more effective to deploy the hub and spoke on a single database or instance rather than store the data mart "spokes" on separate servers and databases.

Figure 1: Hub and Spoke

Physically storing all the data on the same database instance works well, but when companies deploy this architecture, they often forget why they split the data into a DW and data marts. What often happens is that the DW tables and data mart-oriented tables get mixed together or, worse, the data mart tables or cubes never get built at all. Instead, people start using the DW tables for all reporting and analysis needs. This gets us back to the old-school idea of a central DW trying to be all things for all purposes and generally getting none of it right.

Data warehousing has evolved, with the DW becoming the data integration and distribution hub that feeds all the downstream spokes supporting BI applications. The BI applications need the data they use to be recast from the DW schema into a dimensional model, OLAP cube or denormalized structure. These downstream feeds from the hub into the spokes, or data marts, also apply the filtering, aggregation and business transformations that a business group or business process might otherwise need repeated across many reports or analytical applications; I'll refer to this as data franchising. It's far more efficient to perform this data franchising once per business group or business process. Not only does this reduce the time to develop and maintain individual reports, it also significantly reduces errors and report-processing time because the franchising is prebuilt into the data marts.
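
To make data franchising concrete, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for the warehouse database. The table and column names (`dw_sales`, `mart_sales_by_region`, the `booked` status filter) are illustrative assumptions, not part of any specific product; the point is that the filtering and aggregation happen once, in the load into the mart, instead of in every report query.

```python
import sqlite3

# Sketch of "data franchising": the heavy transformation work (filtering,
# aggregation, business rules) is done once on the way from the warehouse
# into a mart table, instead of in every BI report.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Warehouse-side fact table (already integrated from source systems).
cur.execute("CREATE TABLE dw_sales (region TEXT, product TEXT, status TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO dw_sales VALUES (?, ?, ?, ?)",
    [
        ("East", "widget", "booked", 100.0),
        ("East", "widget", "cancelled", 40.0),
        ("West", "gadget", "booked", 250.0),
        ("West", "widget", "booked", 75.0),
    ],
)

# Franchising step: filter out cancelled orders, aggregate to the grain the
# sales team actually reports on, and persist the result as a mart table.
cur.execute(
    """
    CREATE TABLE mart_sales_by_region AS
    SELECT region, SUM(amount) AS total_amount, COUNT(*) AS order_count
    FROM dw_sales
    WHERE status = 'booked'
    GROUP BY region
    """
)

# Every downstream report now runs a trivial query against the mart.
totals = dict(
    cur.execute("SELECT region, total_amount FROM mart_sales_by_region").fetchall()
)
print(totals)  # {'East': 100.0, 'West': 325.0}
```

Each report that reads `mart_sales_by_region` inherits the same business rules, which is where the reduction in errors and report-processing time comes from.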

People are overusing materialized views instead of creating the data marts of the hub-and-spoke model. Yes, it's tempting to skip the extract, transform and load (ETL) step and let the database create the equivalent of a data mart via materialized views, but this is a simplistic approach and it doesn't work. Too often, developers do not really understand the data transformation needs of the business and assume that what they can do in materialized views is sufficient. When they oversimplify this way, they pay the price of a much longer and more expensive reporting and BI tool development cycle, because each BI report has to implement the transformations that the data franchising process would have done, and has to do so every time a report is developed. It also adds to maintenance costs. When developers do try to implement the full data franchising process in materialized views, the views take too long to create. Just because you can do something technically doesn't mean that it's what your business needs.

Lastly, I'm concerned about lost opportunities. The data marts franchised from the DW start the process of packaging data for business consumption. But why end there? Why not extend this approach and have the data marts become hubs for creating OLAP cubes or submarts that support performance management, reporting and business analytics?

Figure 2: Hub and Spoke Extended

Further extending the hub-and-spoke model (Figure 2) can be an effective method to shift the filtering, aggregation and business transformations that occur in BI reports into the data franchising or ETL process. Reusing transformations, as with hub and spoke, is more efficient and reduces errors. Hub and spoke still has a lot to offer, so let’s not allow it to die out yet.

In a market dominated by big data and analytics, data marts are one key to efficiently transforming information into insights. Data warehouses typically deal with large data sets, but data analysis requires easy-to-find and readily available data. Should a business person have to perform complex queries just to access the data they need for their reports? No—and that's why smart companies use data marts.

A data mart is a subject-oriented database that is often a partitioned segment of an enterprise data warehouse. The subset of data held in a data mart typically aligns with a particular business unit like sales, finance, or marketing. Data marts accelerate business processes by allowing access to relevant information in a data warehouse or operational data store within days, as opposed to months or longer. Because a data mart only contains the data applicable to a certain business area, it is a cost-effective way to gain actionable insights quickly.

Data Mart vs Data Warehouse

Data marts and data warehouses are both highly structured repositories where data is stored and managed until it is needed. However, they differ in the scope of data stored: data warehouses are built to serve as the central store of data for the entire business, whereas a data mart fulfills the request of a specific division or business function. Because a data warehouse contains data for the entire company, it is best practice to strictly control who can access it. Additionally, querying for the specific data you need in a data warehouse can be a difficult task for business users. Thus, the primary purpose of a data mart is to isolate—or partition—a smaller set of data from the whole to provide easier data access for the end consumers.


A data mart can be created from an existing data warehouse—the top-down approach—or from other sources, such as internal operational systems or external data. Similar to a data warehouse, it is a relational database that stores transactional data (a time, numerical values, and references to one or more objects) in columns and rows, making it easy to organize and access.

On the other hand, separate business units may create their own data marts based on their own data requirements. If business needs dictate, multiple data marts can be merged together to create a single data warehouse. This is the bottom-up development approach.

3 Types of Data Marts

There are three types of data marts: dependent, independent, and hybrid. They are categorized based on their relation to the data warehouse and the data sources that are used to create the system.

1. Dependent Data Marts

A dependent data mart is created from an existing enterprise data warehouse. It is the top-down approach that begins with storing all business data in one central location, then extracts a clearly defined portion of the data when needed for analysis.

To form a dependent data mart, a specific set of data is aggregated from the warehouse, restructured, then loaded into the data mart, where it can be queried. The mart can be a logical view or a physical subset of the data warehouse:

  • Logical view - A virtual table/view that is logically—but not physically—separated from the data warehouse
  • Physical subset - Data extract that is a physically separate database from the data warehouse

Granular data—the lowest level of data in the target set—in the data warehouse serves as the single point of reference for all dependent data marts that are created.
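
The logical-view versus physical-subset distinction can be sketched with Python's built-in `sqlite3`. The names here (`dw_orders`, a finance mart) are illustrative assumptions; the behavior difference, however, is real: a view always reflects the current warehouse, while a physically extracted table is frozen until its next reload.

```python
import sqlite3

# Sketch of a dependent data mart built from a warehouse table, in the two
# forms described above: a logical view and a physical subset.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE dw_orders (dept TEXT, amount REAL)")
cur.executemany("INSERT INTO dw_orders VALUES (?, ?)",
                [("finance", 10.0), ("sales", 20.0), ("finance", 5.0)])

# Logical view: no data is copied; the mart is just a saved query.
cur.execute("""CREATE VIEW finance_mart_view AS
               SELECT dept, amount FROM dw_orders WHERE dept = 'finance'""")

# Physical subset: the rows are extracted into a separate table
# (in practice, often a separate database).
cur.execute("""CREATE TABLE finance_mart AS
               SELECT dept, amount FROM dw_orders WHERE dept = 'finance'""")

view_rows = cur.execute("SELECT COUNT(*) FROM finance_mart_view").fetchone()[0]
table_rows = cur.execute("SELECT COUNT(*) FROM finance_mart").fetchone()[0]

# A new warehouse row shows the difference: the view reflects it
# immediately, the physical subset does not until it is reloaded.
cur.execute("INSERT INTO dw_orders VALUES ('finance', 7.0)")
view_after = cur.execute("SELECT COUNT(*) FROM finance_mart_view").fetchone()[0]
table_after = cur.execute("SELECT COUNT(*) FROM finance_mart").fetchone()[0]
```

After the new row arrives, the view sees three finance rows while the physical mart still holds two, which is why physical subsets need a scheduled refresh.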

2. Independent Data Marts

An independent data mart is a stand-alone system—created without the use of a data warehouse—that focuses on one subject area or business function. Data is extracted from internal or external data sources (or both), processed, then loaded to the data mart repository where it is stored until needed for business analytics.

Independent data marts are not difficult to design and develop. They are beneficial to achieve short-term goals but may become cumbersome to manage—each with its own ETL tool and logic—as business needs expand and become more complex.
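
Since an independent mart carries its own ETL logic, a minimal sketch of one might look like the following, with the source extract, file layout, and table names all being illustrative assumptions. The mart is loaded straight from a source extract (a CSV here), with no enterprise warehouse in the middle.

```python
import csv
import io
import sqlite3

# Sketch of an independent data mart: data comes straight from a source
# extract rather than from an enterprise data warehouse, so the mart's own
# ETL must do the integration work.
source_extract = io.StringIO("customer,spend\nacme,120\nglobex,80\nacme,30\n")

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE marketing_mart (customer TEXT, total_spend REAL)")

# Extract and transform: aggregate spend per customer in Python, since
# there is no warehouse performing that integration for us.
totals = {}
for row in csv.DictReader(source_extract):
    totals[row["customer"]] = totals.get(row["customer"], 0.0) + float(row["spend"])

# Load the transformed rows into the mart repository.
cur.executemany("INSERT INTO marketing_mart VALUES (?, ?)", sorted(totals.items()))
loaded = cur.execute(
    "SELECT customer, total_spend FROM marketing_mart ORDER BY customer"
).fetchall()
```

This is quick to stand up, which is the appeal; the drawback named above is that every such mart repeats this extract-and-aggregate logic in its own way.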

3. Hybrid Data Marts

A hybrid data mart combines data from an existing data warehouse and other operational source systems. It unites the speed and end-user focus of a top-down approach with the benefits of the enterprise-level integration of the bottom-up method.

Structure of a Data Mart

Similar to a data warehouse, a data mart may be organized using a star, snowflake, vault, or other schema as a blueprint. IT teams typically use a star schema consisting of one or more fact tables (set of metrics relating to a specific business process or event) referencing dimension tables (primary key joined to a fact table) in a relational database.

The benefit of a star schema is that fewer joins are needed when writing queries, as there is no dependency between dimensions. This simplifies the ETL request process, making it easier for analysts to access and navigate the data.

In a snowflake schema, dimensions are normalized into multiple related tables, which helps reduce data redundancy and protect data integrity. It takes less space to store the dimension tables, but it is a more complicated structure (multiple tables to populate and synchronize) that can be difficult to maintain.
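
A star schema query can be sketched in a few lines with `sqlite3`. The fact and dimension tables here (`fact_sales`, `dim_product`) are illustrative assumptions; note that the query needs exactly one join per dimension, with no dimension-to-dimension joins.

```python
import sqlite3

# Sketch of a star schema: one fact table whose rows reference a dimension
# table by key, queried with a single join.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
cur.execute("CREATE TABLE fact_sales (product_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO dim_product VALUES (?, ?)",
                [(1, "hardware"), (2, "software")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [(1, 100.0), (2, 60.0), (1, 40.0)])

# One join from fact to dimension is all a typical report needs -- this is
# what keeps star-schema queries simple compared to a snowflake.
rows = cur.execute(
    """
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
    """
).fetchall()
```

In a snowflake variant, `dim_product` would itself be split (for example into product and category tables), adding a second join to the same report.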

Advantages of a Data Mart

Managing big data—and gaining valuable business insights—is a challenge all companies face, and one that most are answering with strategic data marts.

  • Efficient access — A data mart is a time-saving solution for accessing a specific set of data for business intelligence.
  • Inexpensive data warehouse alternative — Data marts can be an inexpensive alternative to developing an enterprise data warehouse, where required data sets are smaller. An independent data mart can be up and running in a week or less.
  • Improve data warehouse performance — Dependent and hybrid data marts can improve the performance of a data warehouse by taking on the burden of processing, to meet the needs of the analyst. When dependent data marts are placed in a separate processing facility, they significantly reduce analytics processing costs as well.

Other advantages of a data mart include:

  • Data maintenance — Different departments can own and control their data.
  • Simple setup — The simple design requires less technical skill to set up.
  • Analytics — Key performance indicators (KPIs) can be easily tracked.
  • Easy entry — Data marts can be the building blocks of a future enterprise data warehouse project.

The Future of Data Marts is in the Cloud

Even with the improved flexibility and efficiency that data marts offer, big data—and big business—is still becoming too big for many on-premises solutions. As data warehouses and data lakes move to the cloud, so too do data marts.

With a shared cloud-based platform to create and house data, access and analytics become much more efficient. Transient data clusters can be created for short-term analysis, or long-lived clusters can come together for more sustained work. Modern technologies are also separating data storage from compute, allowing for ultimate scalability for querying data.

Other advantages of cloud-based dependent and hybrid data marts include:

  • Flexible architecture with cloud-native applications.
  • Single repository containing all data marts.
  • Resources consumed on-demand.
  • Immediate real-time access to information.
  • Increased efficiency.
  • Consolidation of resources that lowers costs.
  • Real-time, interactive analytics.

Getting Started With Data Marts

Companies are faced with an endless amount of information and an ever-changing need to parse that information into manageable chunks for analytics and insights. Data marts in the cloud provide a long-term, scalable solution. To create a data mart, find an ETL tool that can connect to your existing data warehouse or to the other essential data sources your business users need to draw insights from. In addition, make sure that your data integration tool can regularly update the data mart, so that your data—and the resulting analytics—stay up to date.
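
The "regularly update the data mart" requirement is usually implemented as an incremental refresh. Here is a minimal sketch using a high-water mark on a load timestamp; the schema and the `loaded_at` watermark column are assumptions for illustration, and real integration tools handle the scheduling around this pattern.

```python
import sqlite3

# Sketch of keeping a mart current with an incremental refresh: only
# warehouse rows newer than the mart's high-water mark are copied over.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dw_events (event_id INTEGER, loaded_at INTEGER)")
cur.execute("CREATE TABLE mart_events (event_id INTEGER, loaded_at INTEGER)")
cur.executemany("INSERT INTO dw_events VALUES (?, ?)", [(1, 10), (2, 20)])

def refresh_mart(cur):
    """Copy warehouse rows newer than the mart's latest loaded_at value."""
    watermark = cur.execute(
        "SELECT COALESCE(MAX(loaded_at), 0) FROM mart_events").fetchone()[0]
    cur.execute(
        "INSERT INTO mart_events SELECT * FROM dw_events WHERE loaded_at > ?",
        (watermark,))

refresh_mart(cur)
first_count = cur.execute("SELECT COUNT(*) FROM mart_events").fetchone()[0]

# New warehouse data arrives; the next scheduled refresh picks up only it.
cur.execute("INSERT INTO dw_events VALUES (3, 30)")
refresh_mart(cur)
second_count = cur.execute("SELECT COUNT(*) FROM mart_events").fetchone()[0]
```

The first refresh copies both existing rows; the second copies only the new one, which is what keeps scheduled updates cheap as the warehouse grows.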

Talend Data Management Platform helps teams work smarter with an open, scalable architecture and simple, graphical tools to help transform and load applicable data sources to create a new data mart. Additionally, Talend Data Management Platform simplifies maintaining existing data marts by automating and scheduling integration jobs needed to update the data mart.

With Talend Open Studio for Data Integration, you can connect to technologies like Amazon Web Services Redshift, Snowflake, and Azure Data Warehouse to create your own data marts, leveraging the flexibility and scalability of the cloud.

Are data marts subsets of data warehouses?

Yes. A data mart is a subset of a data warehouse, so businesses often use data marts to give access to users who could not otherwise reach the data. Data marts can also be less expensive to store and faster to analyze, given their smaller, specialized designs.

Which warehouse structure uses several dimension tables that are each connected only to a fact table?

A star schema. When every dimension table joins directly to a central fact table, with no joins between dimensions, the warehouse or data mart is using a star schema.

In which stage of extract, transform, and load (ETL) are data aggregated?

In the transformation stage. In the extraction step, data is pulled from the source systems into a staging area; in the transformation step, the extracted data is cleansed, transformed, and aggregated; loading the data into the target data warehouse is the last step of the ETL process.

What type of data mart is created separately from the enterprise data warehouse by a department and not reliant on it for updates?

An independent data mart. It is built by a department directly from internal or external sources, without using the enterprise data warehouse, and does not depend on the warehouse for updates.