Real-time Transactional Replication for Big Data A...

arun_sitaraman · ‎04-01-2015

Today’s enterprise is transforming from being reactive to proactive and predictive, with Big Data applications. The challenge is to bring together and make available a high volume, of a variety of data, at high velocity, for active insights. With SAP® Replication Server®, your disjointed data from disparate sources can be aggregated into Hadoop Hive Warehouse, in real time to insure that business data is always available, for your Big Data Applications.

You Can
Aggregate and synchronize high volume of data across geographies from multiple systems Provide application data in real time for predictive analytics Enable active decision support from disparate and disjointed data and schemas Enable high availability and disaster recovery for global enterprises

SAP REPLICATION SERVER

SAP Replication Server supports your Big Data enterprise data management environment by replicating and synchronizing transactions originating in SAP Adaptive Server® Enterprise (SAP ASE) and Oracle databases into Hadoop Hive File System (HDFS), as shown in Figure 1 and Table 1 below. It provides the high-performance, secure, reliable, and consistent delivery of data across the enterprise, addressing:

Data movement, aggregation, synchronization, and migration
Predictive analytics
Real-time reporting
Disaster recovery and high availability

Figure 1

DELIVERING KEY FUNCTIONALITY

With SAP Replication Server, you gain all the benefits of:

High-performance data movement
Nonintrusive transaction capture without degrading transactional database source
Dependable transaction application to target data-store, for zero data loss
Real-time synchronization across heterogeneous data sources
Centralized modeling and administration of complex deployment architectures
Flexible data translations
Distributed architecture with reliable and consistent data delivery

A key characteristic of the SAP Replication Server is high-performance. This performance can be attributed to the following features and functionality:

Continuous real-time, log-based transactional data capture
Optimized transaction performance to reduce latency between the source database and Big Data target data-store

Database Replication Matrix for Big Data Environments is shown in Table 1, below.

Table 1

Sources

SAP® Adaptive
Server® Enterprise (SAP ASE)
Oracle

Targets

Hadoop Hive
(HDFS)

SUPPORTING MISSION-CRITICAL ENTERPRISE DATA MOVEMENT NEEDS

SAP Replication Server provides the continuous movement of mission-critical application data. Once established, this environment can be automated to help ensure that information is replicated to meet changing business demands. Whatever the environment, no matter how complex or broadly distributed, and whatever the time constraints, SAP Replication Server can meet your organization’s most demanding data movement requirements. SAP Replication Server provides high-performance, secure, reliable, and consistent delivery of data across the enterprise.

BIG DATA REPLICATION ARCHITECTURE

Supported Capabilities

Table level replication from SAP® Adaptive Server® Enterprise (SAP ASE) and Oracle to Hadoop
- Insert/Update/Delete operations, with additional columns for operation type and timestamp
Manual and Direct Load Materialization from Oracle and SAP® Adaptive Server® Enterprise (SAP ASE) to Hadoop Hive
All data types from Oracle and SAP® Adaptive Server® Enterprise (SAP ASE) to Hadoop, except LOB
Anonymous and LDAP authentication when connecting from Oracle and SAP® Adaptive Server® Enterprise (SAP ASE) to Hadoop Hive
SSL secure communication from SAP® Replication Server to Hadoop Hive
Transaction recovery from SAP® Replication Server side
On-premise environments

IMPLEMENTATION MODEL

The replication model into Hadoop Hive follows a warehouse approach of writing all operations as INSERTS. So DELETES and UPDATES are also written as INSERTS, with all rows containing two special columns OP_TYPE (INT Datatype, a 1 indicates an INSERT, a 2 an UPDATE, and a 3 a DELETE) and OP_TIME (Timestamp Datatype) to indicate the type of operation that created that row and a timestamp to capture the source operation timestamp (NOTE: These columns have to be created manually). This is illustrated in Figure 2 below.

Figure 2

Based on this approach, two combinations of data representations are possible. One as shown above in Figure 2, a log of all operations on both the source tables with each operation in each table being represented as a row in the target Hadoop Hive warehouse. The replication architecture has to take into consideration the datatype mappings between the sources and the target so as to insure that all source datatypes are adequately represented in the target. A second model allows users to replicate each source table into a Hive Target (as shown in Figure 3 below).

Figure 3

In Figure 3 we see the source table T1 is mapped to target Hive table T1 and source table T2 is mapped to target Hive table T4. In this model all changes in source T1 are initially materialized into target T1 (#rows_T3 = #rows_t1) and then every operation on the source table T1 will be represented as an insert in the target table T1 (similarly with T2).

ASE TO HIVE DATATYPE MAPPING

This table identifies the ASE datatype, the Replication Definition datatype, and the Hive datatype.

ASE datatype	Replication Definition datatype	Hive datatype
bigint	bigint	bigint
int	Integer	int
smallint	smallint	smallint
unsigned bigint	unsigned bigint	varchar
unsigned int	unsigned int	bigint
unsigned smallint	unsigned smallint	int
tinyint	tinyint	smallint
int identity	integer	int
numeric	numeric	decimal
decimal	decimal	decimal
float	float	double
real	real	float
double precision	float	double
money	money	decimal
smallmoney	smallmoney	decimal
datetime	datetime	timestamp
smalldatetime	smalldatetime	timestamp
bigdatetime	bigdatetime	timestamp
date	date	date
time	time	timestamp
char	char	char
varchar	varchar	varchar
unichar	unichar	varchar
univarchar	univarchar	varchar
nchar	char	varchar
nvarchar	varchar	varchar
image	image	No Default
text	text	No Default
unitext	unitext	No Default
binary	binary	binary
varbinary	varbinary	binary
bit	tinyint	tinyint
timestamp	timestamp	No Default
sysname	varchar	varchar
longsysname	varchar(255)	varchar
bigtime	bigtime	timestamp

Note: When no default value is provided, the successful replication of the ASE type requires you to choose the customization tool of your choice to map to the Hive datatype based on your business needs.

ORACLE TO HIVE DATAYPE MAPPING

This table identifies the the Oracle datatype, Replication Definition datatype, and the Hive datatype.

Oracle datatype	Replication Definition datatype	Hive datatype
Integer	rs_oracle_decimal	double
Number	rs_oracle_decimal	double
Number(10) - Number(18)	rs_oracle_decimal	Bigint
Number(5) - Number(9)	rs_oracle_decimal	int
Number(2) - Number(4)	rs_oracle_decimal	smallint
Number(P,S)	rs_oracle_decimal/varchar	decimal/varchar
Number(1)	rs_oracle_decimal	tinyint
FLOAT	rs_oracle_float	double
BINARY_FLOAT	rs_oracle_float	double
BINARY_DOUBLE	rs_oracle_decimal	double
DATE	rs_oracle_datetime	timestamp
TIMESTAMP(N)	rs_oracle_timestamp9	timestamp
CHAR	Char	varchar
NCHAR	Unichar	varchar
VARCHAR2	Varchar	varchar
NVARCHAR	univarchar	varchar
NVARCHAR2	univarchar	varchar
BLOB	image	No Default
BFILE	image	No Default
NCLOB	unitext	No Default
INTERVAL	rs_oracle_interval	No Default
ROWID	rs_oracle_rowid	No Default
UROWID	rs_oracle_rowid	No Default
TIMESTAMP with time zone	rs_oracle_timestamptz	No Default
TIMESTAMP with local time zone	rs_oracle_timestamp9	No Default
ANYDATA	opaque	No Default
VARRAY	opaque	No Default
NESTEDTAB	opaque	No Default
Object	opaque	No Default
REF	rs_oracle_binary	No default
RAW	rs_oracle_binary	binary

Note: When no default value is provided, the successful replication of the Oracle type requires you to choose the customization tool of your choice to map to the Hive datatype based on your business needs. Refer to Oracle Datatype Compatibility and Oracle Datatype Restrictions of the Replication Agent Primary Database Guide for additional Oracle datatype information.

ENVIRONMENT ENHANCEMENTS FOR REPLCATION INTO HADOOP HIVE

Besides the usual environment variables including LD_LIBRARY_PATH or SHLIB_PATH or LIBPATH (depending on the platform), new variables introduced for replication into Hive are:

RS_HIVE_TEMP_FILE_PATH:
The value in this variable represents the directory where the staging files are written into by the replication server. This value is also used by the Hive service to load the data into the warehouse. Appropriate permissions and share/export must be enabled to insure that both the replication server can write into this directory and the Hive service can access and read from this directory.

RS_HIVE_AUTH_TYPE:
The value in this variable informs the replication server of the authentication mode to use – NoSASL or SASL (LDAP and anonymous are supported today).

TUNING

MapReduce overhead can be managed by increasing the values in dsi_compile_max_cmds and dsi_cdb_max_size properties, to stage more commands for each load data operation.

TROUBLESHOOTING

ECON trace dsi_buf_dump can be enabled to trace the Hive query command that is being sent to Hive by Hive Connector

OPERATING ENVIRONMENTS

SAP® Replication Server® platform/Operating System certification can be located at https://websmp203.sap-ag.de/~sapidb/012002523100008725382014E

Hive on Linux AMD64/SUSE 11.3 & RHEL 5.5 has been certified, but in general Linux on X86_64 (all versions of SUSE/RHEL) can be supported.

LEARN MORE

For more information about how SAP Replication Server can help meet your data challenges, visit http://scn.sap.com/community/sap-replication-server.

To learn more about the Hive replication capability and usage, please refer the New Features Guide below:

http://help.sap.com/Download/Multimedia/zip-rs1571204/Replication_Server_New_Features_Guide_en.pdf

Real-time Transactional Replication for Big Data Analytics

Get Your SAP HANA Idea Incubator Badge Today!

SCN Mission - SAP HANA Quiz Challenge is now retired

Share your #HANAStory and Win