A data warehouse is a system
commonly used to connect and analyze business data from disparate sources and
to help an organization make decisions.
Data warehouses are central
repositories of integrated data from one or more heterogeneous sources.
A data warehouse is considered a core component of business intelligence.
Enterprise data warehouse architecture. Data warehouse architectural concepts and components. Image Credit: Wikimedia Commons, I have made some modifications in size and color. |
Data Warehouse (DW) - Characteristics,
Architecture and Principles
What is a Data Warehouse (DW)?
In computing, a data
warehouse (DW) is often considered a business intelligence system used
for data analysis and for the day-to-day reporting needs of a business unit.
A data warehouse is a central
repository of integrated data from one or more unequal sources.
The data warehouse contains all the
data related to the business. Information systems can be categorized in many
aspects that can be long or limited. These include classification by purpose of
such systems in enterprises.
The focus here will be on both
operational information systems and decision support systems.
Operational information systems are
concerned with the management of the day-to-day operations of organizations and
are the cornerstone of modern enterprises.
Examples of such systems are Accounting
systems, hospital patient management systems, application management systems on
a commercial website and student registration systems at colleges and
institutes.
All these systems are concerned with
the day-to-day operations of enterprises and indispensable in modern
institutions.
Obviously, operational information
systems store and process a large amount of data every day, and that data
increases overtime to reach in some institutions to millions of gigabytes and
terabytes.
Sometimes such data exist in
insurance companies, universities, hospitals, and public government
sectors.
Such data is a real asset and is
invaluable if it is exploited by enterprises to study and analyze and then
presented to the decision-makers in the establishments in a way that enables
them to identify the performance of these facilities and formulate long-term
strategies for them.
Hence, there was a need only for
Decision Support Systems (DSS). One
of the most important applications of these systems, which is the main pillar of
these systems, is Data Warehousing.
Characteristics
of Data warehouse
Data warehouses are systems that are
concerned with studying, analyzing and presenting enterprise data in a way that
enables senior management to make decisions.
The data warehouses have some
characteristics that distinguish them from any other data and these
characteristics are as follows:
Subject-Oriented: For example, we define subjects in a medical facility as
doctors, nurses, medicines, diseases, and so on. They are sensual and moral
things.
Integrated: The relationship between the data and the way in which the
data is extracted and modified is the same regardless of the original source.
None-Volatile: None-Volatile: The data warehouse is not editable when loaded
in repositories and not updated in real-time.
This data is updated periodically by
data uploading, protecting it from the effect of instant change and only used
for analysis, study, and presentation.
Time-Variant: This is the most important point that this data is related to
the time factor and organized via time periods (weekly, monthly, yearly,
etc.), for example, Student Registration Data at the university for over
forty years.
5 Steps to Data Warehousing
Building a data warehouse in an organization is not an easy task, and requires a professional team in business
information systems analysis; it is also important to have professional
technical skills in that team.
Now, we will present the technical
steps to build data warehouses, far from the analysis and design phase that is
very close to the business view. These steps are arranged as follows:
1. Creating
Data Preparation Area: a
database with very high storage capacity; It stores all data coming from
different Operational Systems so that data is purified and modified before
being loaded into the data warehouse.
At this stage, the design of the database should be compatible with the design of the data warehouse.
2. Creating
Data Warehouse: A data warehouse is created where data will be uploaded after it is extracted and
purified.
Data warehouses are always designed
to allow relationships of different dimensions, such as the relationship
between the admissions of students to a specialization in a given year.
3. Data Marts: Some large organizations divide the data warehouse into a set
of data marts.
There is a Data Mart for financial
management data and another for human resources management or division based on
the branches of the institution.
4. Extract,
Transform, and Load (ETL) Process: At
this stage, we extract data from different sources to the Data Preparation
Area and we convert it from one image to another if required.
We also often integrate some data
with each other or define new data that did not exist before, in addition to
clearing the wrong data and deleting unimportant ones.
At the stage of Loading Data, The
data is loaded from the Data Preparation Area to Data Warehouse. During this
phase, the data is tested.
5. Creating
Decision Support Systems (DSS) Applications: At
this stage, implement DSS applications for displaying and analyzing the data
warehouse, called Online Analytical Processing (OLAP) Systems.
These applications display data in
several dimensions and use complex algorithms to analyze data.
At this stage, data mining tools are
also used to recognize patterns in the data sets and analyze big data.
Three-Tier
Data Warehouse Architecture
Most organizations adopt Three-Tier
Architecture in the work of data warehouses.
In this Architecture, the data
warehouse system is divided into three tiers (levels); Bottom Tier, Middle
Tier, and Top-Tier.
1. Bottom Tier: It is the data warehouse database server and an RDBMS
(Relational Database Management System). This may include many specific
data marts and metadata repositories.
You can use back end tools and
application program interfaces called a gateway to feed data at the bottom
level.
The gateways and back end tools
perform the extraction, cleaning, loading, and updating functions.
Examples of gateways
contain Open-Linking and Embedding for Databases (OLE-DB), Open
Database Connection (ODBC) and Java Database Connection (JDBC).
2. Middle Tier: A middle-tier consists of an online analytical processing
(OLAP) server which can be executed for a quick query of the data
warehouse. The OLAP server is implemented in the following ways:
I. Using the ROLAP
(Relational online analytical processing) model, i.e., an extended relational
database management system (DBMS) that assigns tasks to multidimensional data
for standard relational operations.
II. Using the
MOLAP (multidimensional online analytics processing) model, ie a
special-purpose server that directly implements multidimensional databases and
operations.
3. Top-Tier: A top-tier consists of front-end tools that contain reporting
tools analysis tools and query tools to display the results provided by online
analytical processing systems as well as additional tools for data mining.
Principles
of Data Warehousing
There are some data warehouse
principles that may include:
Load
Processing
In a data warehouse, several phases
and steps must be taken to load new data and process it including filtering,
reformatting, data conversion, indexing, and metadata updates.
Load
Performance
Data warehouses need to increase the
loading of new data from time to time within narrow time windows.
Load performance should be measured
in gigabytes per hour and in hundreds of millions of rows and should not
artificially disrupt the data business volume.
Data
Quality Management
The data warehouse verifies
contextual integrity, local stability, and global consistency despite
"dirty" sources and large-scale database size.
This is only useful and valuable to
the extent if business stakeholders trust the data and its resources as a
fact-based management system requires the highest quality of the data.
Strategic
Adaptability
Adaptability is critical to the
development of business requirements.
The business intelligence tools
available in the market must be taken into account to adapt to often unexpected
changes in business demands.
It is also very important to keep in
mind the pace at which BI tools evolve to include additional functions and
features.
In data warehouses, adaptability
requires a principle and method to use alternative BI tools in the future such
as various back-end or visualization tools.
Query
Performance
Reliable database management and the fact-based direction should not be decelerated by the performance of RDBMS
(relational database management system). Massive and complex queries should be
completed in seconds, not hours or days.
Terabyte
Scalability
A Terabyte contains more than a
trillion bytes of digital information - 1015 bytes.
Today, the size of the data
warehouses is evolving at staggering rates. This ranges from a few bytes to
hundreds of terabytes and gigabytes sized data warehouses.
So scalability should be taken into
account in measurement.
Conclusion
Finally, there are many companies
that provide systems and tools for creating data warehouses and extracting data
from their sources, most notably Oracle and its Architectures and Solutions
such as Oracle Data Warehouse Builder and Oracle Data Mining, etc.
Tags
BI tools
big data applications
business development
data analytics
data science
data warehouse
technological advances
Warehouse