Surrogate Keys Generated During ETL Process

9/20/2020

The final reason I can think of for surrogate keys is one that I strongly suspect but have never proven. Replacing big, ugly natural keys and composite keys with beautiful, tight integer surrogate keys is bound to improve join performance. The storage requirements are reduced, and the index lookups would seem to be simpler.

Surrogate keys may or may not be supplied on update and delete transactions. But in terms of the semantics that have to be carried out, these distinctions don't matter. Regardless of how we design the process of assigning surrogate keys, the important point is that it is not just a matter of requesting a unique key value for an insert transaction.
That's a great definition for the surrogate keys we use in data warehouses. A surrogate key is an artificial or synthetic key that is used as a substitute for a natural key. Actually, a surrogate key in a data warehouse is more than just a substitute for a natural key. In a data warehouse, a surrogate key is a necessary generalization of the.

The Multidimensional Warehouse is the third data structure in EPM.

Image: Multidimensional Warehouse (MDW)

The following graphic illustrates the MDW component of the EPM architecture and the target tables that are present in the MDW.

The MDW stores dimensionalized data that is grouped into one or more business processes, better known as a dimensional schema, used for business intelligence and ad hoc reporting. The data is stored in a star schema (a fact table associated with a series of dimension tables) and generally contains data loaded from the OWS.

The star schema arrangement depends entirely on primary key and foreign key relationships. A primary key is a column (or columns) in a dimension table whose values uniquely identify each row in the table. Primary keys enforce entity integrity by uniquely identifying entity instances. A foreign key is a column or columns in a fact table whose values match the primary key values of a given dimension table. This way references can be made between a fact and dimension table. Foreign keys enforce referential integrity by completing an association between two entities.

Note: MDW dimensions use a surrogate key, a unique key generated from production keys by the ETL process. The surrogate key is not derived from any data in the EPM database and acts as the primary key in an MDW dimension. See the next topic for more information on surrogate keys in the MDW.

Image: Dimensional Model Example

The following graphic provides an example of a star schema and its primary and foreign key relationships:

Although data loaded into the MDW is primarily derived from the OWS, there are exceptions to this rule. Profitability and Global Consolidations data for the Financial Management Solutions (FMS) Warehouse is loaded into the MDW from the OWE.

External survey data for the HCM Warehouse is loaded into the MDW from the OWE.

Online Marketing data is loaded into the MDW directly from the source system, and bypasses the Operational Warehouse entirely.

Surrogate Keys

Surrogate keys provide a means of defining unique keys whose values, with the exception of the Time and Calendar dimensions, are anonymous; that is, the value of a surrogate key has no significance to the application using it and is strictly an artificial value. The system uses surrogate keys specifically as a means of joining structures. To speed up query access, the MDW resolves PeopleSoft-specific programming constructs, such as SetIDs and effective dates, and replaces them with surrogate IDs as key columns. Surrogate keys have no relationship to the business or production key. Surrogate keys are present in dimension tables as the primary key and in fact tables as foreign keys to dimensions. However, the dimension record retains the business key as an alternate-key attribute. Surrogate keys are four-byte integers and their size does not change even when the production key changes in size.

Although surrogate keys usually do not have any 'intelligence,' that is, their value has no meaning, in certain situations, such as the Gregorian Calendar and Time dimensions, intelligent surrogate keys are used. These intelligent keys enable the ETL process to run more quickly by providing the option of avoiding a lookup on corresponding dimensions.
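As a minimal sketch of why an intelligent key can avoid a dimension lookup, the example below derives a calendar key directly from the date value. The YYYYMMDD integer layout and the function names date_to_sid and sid_to_date are assumptions made for illustration; the source does not state how the Calendar and Time keys are actually encoded.

```python
from datetime import date


def date_to_sid(d: date) -> int:
    """Encode a date as an integer key in an assumed YYYYMMDD layout."""
    return d.year * 10000 + d.month * 100 + d.day


def sid_to_date(sid: int) -> date:
    """Decode a key produced by date_to_sid back into a date."""
    return date(sid // 10000, (sid // 100) % 100, sid % 100)


# The fact load can compute the key from the transaction date itself,
# so no lookup against the calendar dimension is needed.
order_date_sid = date_to_sid(date(2020, 9, 20))
```

Because the key is a pure function of the date, the same value is produced in every job that needs it, which is what makes skipping the lookup safe.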

Surrogate key fields usually have the suffix _SID (Surrogate ID).

Surrogate Keys and the ETL Process

Surrogate keys are generated from production keys using the DataStage routine KeyMgtNextValueConcurent(), which receives an input parameter and a name identifying the sequence. The surrogate key can be unique per single dimension target (D) or unique across the whole (W) multidimensional warehouse. This process is enabled by the environment parameter named SID_UNIQUENESS. The value for this parameter is provided at run time. If the value is D, then this routine is called with a dimension job name for which a surrogate key must be assigned and it returns the next available number. If not, the routine is called with EPM as the sequence identifier.
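The effect of the SID_UNIQUENESS setting can be sketched with a small in-memory stand-in. This is not the DataStage routine: the class KeySequences, its method next_value, the job names, and the choice to start numbering at 1 are all assumptions for illustration, and the sketch makes no attempt at concurrency control.

```python
from collections import defaultdict
from itertools import count


class KeySequences:
    """Hypothetical stand-in for a named-sequence key generator."""

    def __init__(self, sid_uniqueness: str = "D"):
        if sid_uniqueness not in ("D", "W"):
            raise ValueError("sid_uniqueness must be 'D' or 'W'")
        self.sid_uniqueness = sid_uniqueness
        # One independent counter per sequence name, created on first use.
        self._counters = defaultdict(lambda: count(1))

    def next_value(self, dimension_job_name: str) -> int:
        # D: each dimension job has its own sequence.
        # W: every job shares the single sequence named "EPM".
        if self.sid_uniqueness == "D":
            sequence_name = dimension_job_name
        else:
            sequence_name = "EPM"
        return next(self._counters[sequence_name])


per_dimension = KeySequences("D")
warehouse_wide = KeySequences("W")
```

With "D", two different dimension jobs can both receive the value 1; with "W", values never repeat across dimensions because all jobs draw from the same sequence.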

You do not have to take any action to create surrogate keys; they are generated during the ETL process within the aforementioned DataStage routine. The DataStage routine retrieves the next surrogate key value and assigns it to the surrogate key that it is currently creating.

When the ETL process copies a dimension row from the source system into the MDW, the ETL process performs a lookup on the dimension table. If the dimension row (with the same business keys) does not exist in the dimension table, the process inserts a row with a new surrogate key value. If the dimension row already exists in the dimension table, the process updates the existing row with the incoming row value.

When the ETL process copies a fact row from the source system into the MDW, for each dimension key in the fact row, the system performs a lookup on the dimension table and retrieves the corresponding surrogate key value. This surrogate key is the foreign key value in the fact row in the MDW. If a dimension value in the fact row cannot be found in the dimension table, that is a data exception and an error results.
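The dimension and fact load steps described above can be sketched as follows. This is an in-memory illustration under stated assumptions, not the delivered ETL jobs: DimensionTable, load_fact_row, the column names CUST_ID and CUST_SID, and the sample business keys are all hypothetical.

```python
class DimensionTable:
    """Hypothetical in-memory dimension keyed by business key."""

    def __init__(self):
        self._rows = {}      # business key -> {"sid": int, "attributes": dict}
        self._next_sid = 1   # assumed starting value

    def upsert(self, business_key, attributes):
        """Insert with a new surrogate key, or update the existing row."""
        row = self._rows.get(business_key)
        if row is None:
            row = {"sid": self._next_sid, "attributes": dict(attributes)}
            self._rows[business_key] = row
            self._next_sid += 1
        else:
            row["attributes"].update(attributes)
        return row["sid"]

    def lookup_sid(self, business_key):
        """Return the surrogate key, or raise if the dimension row is missing."""
        if business_key not in self._rows:
            raise LookupError(f"no dimension row for business key {business_key!r}")
        return self._rows[business_key]["sid"]


def load_fact_row(source_row, key_columns):
    """Replace each business key column with its surrogate key column.

    key_columns maps a source column name to (sid_column_name, DimensionTable).
    """
    fact_row = {}
    for column, value in source_row.items():
        if column in key_columns:
            sid_column, dimension = key_columns[column]
            fact_row[sid_column] = dimension.lookup_sid(value)
        else:
            fact_row[column] = value
    return fact_row


customer = DimensionTable()
customer.upsert("C001", {"name": "Acme"})
customer.upsert("C002", {"name": "Globex"})
customer.upsert("C001", {"name": "Acme Corp"})  # update, keeps its key

fact = load_fact_row({"CUST_ID": "C002", "QTY": 5},
                     {"CUST_ID": ("CUST_SID", customer)})
```

Loading a fact row whose CUST_ID is absent from the dimension raises LookupError, which corresponds to the data exception mentioned in the text.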

Surrogate Key Benefits

Surrogate keys provide benefits such as:

The ability to easily and structurally conform a dimension when being sourced from multiple systems.
Disassociation from operational system changes. Because surrogate key generation is controlled by the warehouse, it is not influenced by operational system changes.
The ability to handle unspecified or missing key values.
A graceful mechanism to handle changes in history. Multiple versions of a dimension can be maintained with different surrogate (primary) keys, yet with the same business (identifying) key.
Performance enhancement of queries: because a surrogate key is a single-column numeric key, joins using surrogate keys are faster than joins using multi-column business keys.
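The point about handling changes in history can be sketched as a versioned dimension in which each change to a business key produces a new row with a new surrogate key. The class names, the valid_from and valid_to fields, and the sample data are assumptions for illustration; the source does not describe how versions are tracked.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DimensionVersion:
    sid: int                         # surrogate (primary) key, unique per version
    business_key: str                # business (identifying) key, shared by versions
    attributes: dict
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


class VersionedDimension:
    """Hypothetical dimension that keeps every historical version of a row."""

    def __init__(self):
        self.rows = []

    def current(self, business_key):
        for row in self.rows:
            if row.business_key == business_key and row.valid_to is None:
                return row
        return None

    def apply_change(self, business_key, attributes, as_of):
        """Close the current version if attributes changed and add a new one."""
        existing = self.current(business_key)
        if existing is not None:
            if existing.attributes == attributes:
                return existing.sid      # nothing changed, keep the same key
            existing.valid_to = as_of    # retire the old version
        new_row = DimensionVersion(len(self.rows) + 1, business_key,
                                   dict(attributes), as_of)
        self.rows.append(new_row)
        return new_row.sid


dim = VersionedDimension()
first_sid = dim.apply_change("P001", {"city": "Austin"}, date(2020, 1, 1))
second_sid = dim.apply_change("P001", {"city": "Boston"}, date(2020, 9, 20))
```

Fact rows loaded before the change keep pointing at the first surrogate key, so historical reports still see the attributes that were in effect at the time.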

Audit Fields

Audit fields track extract, transform, and load (ETL) loading information, such as when the row was loaded or last modified or the batch in which the row was loaded. This information is included in a subrecord. The subrecord added to MDW tables is called LOAD_MDW_SBR. Subrecords are always added at the end of a record; no fields exist after this subrecord in any table.

Image: LOAD_MDW_SBR record example

The following example shows a typical LOAD_MDW_SBR subrecord.

Data Aggregation

Tables in the MDW contain source data at the same granularity as the source system. Required data aggregation is carried out at run time by the business intelligence tool. This allows for better control of aggregation strategies by the business intelligence tool, because aggregation requirements vary from customer to customer.

MDW Dimension Tables

Dimensions are sets of related attributes that you use to group or constrain detailed information that you measure in your data mart. Dimensions are usually text (in character data type), relatively static, and often hierarchical.

Dimension tables use a surrogate key as the primary key: a single-column key containing only the surrogate key column. Surrogate keys usually have _SID (surrogate ID) appended to the field name. Dimension tables retain source system business key fields as non-key attribute columns in the dimension table. However, these are not used for joins with fact tables. For example, in the Customer dimension, the original business key field CUST_ID is retained, if it exists in the source table, but is no longer included in the key. The SetID is also retained, if it exists in the source table, as a non-key attribute; the value contained in the SetID is the same as in the source system.

If a dimension is SetID-based, the MDW table contains the source SetID and the performance (PF) SetID, which is named SETID.

If a dimension contains a description text, a related language table is often defined for this dimension. The ETL process populates this table if a customer requires multilanguage processing. The key for this table is the surrogate key ID, plus the language code field, LANGUAGE_CD, which contains the code for the additional language.

Note: You can find more information about multilanguage processing for the multidimensional warehouse in your EPM Warehouse specific documentation (for example, the PeopleSoft EPM: Campus Solutions Warehouse).

Shared Dimensions

Dimensions such as Account, Customer, Department, or Person are examples of shared dimensions. Shared dimensions are either exactly the same (including key structure) or an exact subset of another dimension; that is, shared dimensions are structurally identical every place in which they are used. Shared dimensions are used across all EPM warehouse products, such as the Campus Solutions Warehouse and the Financial Management Solutions Warehouse.

When using a shared dimension, the system consistently interprets attributes; hence rollups across data marts are possible and consistent. When a warehouse is provided data from multiple sources, a shared dimension is typically (but not always) built from multiple source structures.

Image: EPM conformed dimension

The following is a sample MDW shared dimension shown in Application Designer.

MDW Dimension Table Naming Convention

MDW dimension tables use the following naming convention: D_[table name].

MDW Fact Tables

MDW fact tables (F_*) contain numeric performance measurement data (such as quantity, sales, and revenue) that is used to build a data warehouse and its related reports. Facts help to quantify a company's activities. A fact is typically an additive business performance measurement. That is, you can usually perform arithmetic functions on facts.

In a star schema, a fact table is the central table, each element of which is a foreign key derived from a dimension table. Dimension tables have a surrogate ID column that is the primary key of that dimension. A fact table may use these dimension surrogate IDs as foreign keys to the dimension table. In the dimensional model example graphic presented previously, the Sales fact table contains six foreign keys, each one matching a dimension surrounding the fact table.

Periodic Snapshot Fact Tables

Periodic snapshots provide a view of the cumulative performance of the business at regular, predictable time intervals. Unlike a transaction fact table that loads a row of data for each event occurrence, the periodic snapshot fact table captures the event at the interval of a day, week, or month, and another capture at the interval of the next period, and so on. These periodic snapshots are stacked consecutively into the fact table. The periodic snapshot fact table often is the only place to easily retrieve a regular, predictable, trend view of the key business performance metrics.

Accumulating Fact Tables

Accumulating snapshots represent an indeterminate time span, covering the complete life of a transaction or discrete product. Accumulating snapshots almost always have multiple date stamps, representing the predictable major events or phases that take place during the course of a lifetime. Since many of these dates are not known when the fact row is first loaded, we must use surrogate date keys to handle undefined dates.
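One way to picture surrogate date keys for undefined dates is a reserved key value that stands for "not yet known". The sketch below assumes a placeholder value of 0, a YYYYMMDD layout for known dates, and hypothetical column names; none of these come from the source.

```python
from datetime import date
from typing import Optional

UNKNOWN_DATE_SID = 0  # assumed placeholder key for a date that is not yet known


def date_sid(d: Optional[date]) -> int:
    """Return the date key, or the placeholder when the date is undefined."""
    if d is None:
        return UNKNOWN_DATE_SID
    return d.year * 10000 + d.month * 100 + d.day


def build_order_row(order_id, ordered, shipped=None, delivered=None):
    """Build an accumulating snapshot row with one date key per milestone."""
    return {
        "ORDER_ID": order_id,
        "ORDER_DT_SID": date_sid(ordered),
        "SHIP_DT_SID": date_sid(shipped),
        "DELIVERY_DT_SID": date_sid(delivered),
    }


# First load: only the order date is known.
row = build_order_row("SO-1", date(2020, 9, 1))
# Later load: the same row is rebuilt once the shipment date arrives.
row_after_shipping = build_order_row("SO-1", date(2020, 9, 1),
                                     shipped=date(2020, 9, 5))
```

Because the placeholder is a real key value rather than NULL, the fact row can still join to a corresponding "unknown" row in the date dimension.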

MDW Fact Table Naming Convention

MDW fact tables use the following naming convention: F_[table name].

If you are working on a data warehouse project, then you have probably heard a lot about surrogate keys. Surrogate keys are a widely accepted data warehouse design standard. In this article, we will look at data warehouse surrogate key design, along with its advantages and disadvantages.

What are surrogate keys in Data warehouse?

If you are a data warehouse developer, you might be wondering what a surrogate key is, and how and where it is used. You will find answers to those questions here.

Data warehouse surrogate keys are sequentially generated meaningless numbers associated with each and every record in the data warehouse. These surrogate keys are used to join dimension and fact tables.

Usually, database sequences are used to generate surrogate keys, so each key is always a unique number.
Surrogate keys cannot be NULL; they are never populated with NULL values.
A surrogate key does not hold any meaning in the data warehouse, which is why surrogate keys are often called meaningless numbers. It is just a sequentially generated INTEGER used for better lookups and faster joins.

Why surrogate keys are used in Data warehouse?

Basically, a surrogate key is an artificial key that is used as a substitute for the natural key (NK) defined in data warehouse tables. We can use natural keys or business keys as the primary key for tables. However, this is not recommended for the following reasons:

Natural keys (NK) or business keys are generally alphanumeric values, which are less suitable for indexing because traversal becomes slower. For example, prod123, prod231, etc.
Business keys are often reused after some time. This causes problems because in a data warehouse we maintain historic data as well as current data.

For example, product codes can be revised and reused after a few years. It then becomes difficult to differentiate current products from historic products. To avoid such a situation, surrogate keys are used.

Data Warehouse Surrogate Key examples

Surrogate keys are integers that are assigned sequentially in the dimension table and can be used as the primary key. The surrogate key column can be an identity column, or it can be populated from a database sequence.

Below is an example of a surrogate key:

PATIENT_SK	PATIENT_ID	PATIENT_NAME	PATIENT_AGE
1	P001	ABC	20
2	P002	BCD	25
3	P003	CDE	19
4	P004	DEF	45
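A minimal sketch of the table above, using an in-memory SQLite database so that the surrogate key is assigned by an identity-style column. The table names DIM_PATIENT and FACT_VISIT, the VISIT_COST column, and the sample visit are hypothetical additions for illustration; only the patient columns and values come from the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DIM_PATIENT (
    PATIENT_SK   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    PATIENT_ID   TEXT NOT NULL,                      -- business key
    PATIENT_NAME TEXT,
    PATIENT_AGE  INTEGER
);
CREATE TABLE FACT_VISIT (
    PATIENT_SK INTEGER NOT NULL REFERENCES DIM_PATIENT (PATIENT_SK),
    VISIT_COST REAL
);
""")

# The surrogate key is not supplied; the database assigns it on insert.
conn.executemany(
    "INSERT INTO DIM_PATIENT (PATIENT_ID, PATIENT_NAME, PATIENT_AGE) VALUES (?, ?, ?)",
    [("P001", "ABC", 20), ("P002", "BCD", 25), ("P003", "CDE", 19), ("P004", "DEF", 45)],
)

# Loading a fact row: resolve the business key to its surrogate key first.
(patient_sk,) = conn.execute(
    "SELECT PATIENT_SK FROM DIM_PATIENT WHERE PATIENT_ID = ?", ("P003",)
).fetchone()
conn.execute(
    "INSERT INTO FACT_VISIT (PATIENT_SK, VISIT_COST) VALUES (?, ?)", (patient_sk, 120.0)
)

# Queries join fact and dimension on the single integer column.
visits = conn.execute("""
    SELECT d.PATIENT_ID, f.VISIT_COST
    FROM FACT_VISIT f
    JOIN DIM_PATIENT d ON d.PATIENT_SK = f.PATIENT_SK
""").fetchall()
```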

Advantages of Surrogate Key

Below are some of the advantages of using surrogate keys in a data warehouse:

With the help of surrogate keys, you can integrate heterogeneous data sources into the data warehouse even if they don't have natural or business keys.
Joining tables (facts and dimensions) on a surrogate key is faster, which gives better performance.
Surrogate keys are very helpful for ETL transformations.
Data warehouse surrogate keys are usually small integers, which makes for smaller indexes and better performance.
Surrogate keys are required if you are implementing a slowly changing dimension (SCD).

Disadvantages of Surrogate Key

Below are some of the disadvantages of using surrogate keys in a data warehouse:

Surrogate key generation and assignment place an additional burden on the ETL framework.
You should not overuse surrogate keys, as they don't carry any meaning in data warehouse tables.
Data migration becomes difficult if you have a database sequence associated with surrogate key columns. You must take care that surrogate key generation in the new database continues from the existing values; otherwise you may end up with duplicate surrogate keys.
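The migration concern can be sketched as follows: after copying existing rows into a new database, the key generator should resume above the highest key already in use. The function name resume_sequence and the sample keys are assumptions for illustration; a real database would do this by resetting its own sequence or identity counter.

```python
from itertools import count


def resume_sequence(existing_keys):
    """Return a counter that starts just above the largest migrated key."""
    start = max(existing_keys, default=0) + 1
    return count(start)


migrated_keys = [1, 2, 3, 4]           # surrogate keys copied from the old database
sequence = resume_sequence(migrated_keys)
new_key = next(sequence)               # first key issued in the new database
```

Starting the new sequence at 1 instead would reissue keys that already exist, which is the duplicate-key problem described above.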
