Identify clearing and settlement activities in support of critical financial markets
Determine appropriate recovery and resumption objectives for clearing and settlement activities in support of critical markets
Core clearing and settlement organizations should develop the capacity to recover and resume clearing and settlement activities within the business day on which the disruption occurs, with the overall goal of achieving recovery and resumption within two hours after an event
Maintain sufficient geographically dispersed resources to meet recovery and resumption objectives
Back-up arrangements should be as far away from the primary site as necessary to avoid being subject to the same set of risks as the primary location
The effectiveness of back-up arrangements in recovering from a wide-scale disruption should be confirmed through testing
Routinely use or test recovery and resumption arrangements
One of the lessons learned from September 11 is that testing of business recovery arrangements should be expanded
Ensuring Business Continuity:
Disaster Recovery
Restore business after an unplanned outage
High Availability
Meet Service Availability objectives, e.g., 99.9% availability, or roughly 8.8 hours of downtime a year
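As a quick check on that figure, the downtime budget implied by an availability target is simple arithmetic (a minimal sketch; 99.9% of an 8,760-hour year leaves 8.76 hours, rounded to 8.8 above):

```python
# Back-of-the-envelope downtime budget implied by an availability target.
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year

def downtime_hours_per_year(availability_pct: float) -> float:
    """Allowed downtime per year, in hours, for a given availability %."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

print(downtime_hours_per_year(99.9))   # ~8.76 hours -- the "8.8 hours" above
print(downtime_hours_per_year(99.99))  # ~0.88 hours, i.e. about 53 minutes
```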
Shift focus from failover model to near-continuous availability model (RTO near zero)
Access data from any site (unlimited distance between sites)
Multi-sysplex, multi-platform solution
“Recover my business rather than my platform technology”
Ensure successful recovery via automated processes (similar to GDPS technology today)
Can be handled by less-skilled operators
Provide workload distribution between sites (route around failed sites, dynamically select sites based on ability of site to handle additional workload)
Provide application level granularity
Some workloads may require immediate access from every site, other workloads may only need to update other sites every 24 hours (less critical data)
Current solutions employ an all-or-nothing approach (complete disk mirroring, requiring extra network capacity)
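The routing behavior described above (route around failed sites, prefer the site with spare capacity) can be pictured with a minimal sketch; the names and fields are illustrative, not the Lifeline API:

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    healthy: bool
    used_capacity_pct: float  # current utilization, 0-100

def pick_site(sites: list[Site]) -> Site:
    """Route around failed sites; prefer the healthy site with most headroom."""
    candidates = [s for s in sites if s.healthy]
    if not candidates:
        raise RuntimeError("no healthy site available for this workload")
    return min(candidates, key=lambda s: s.used_capacity_pct)

sites = [Site("SITE1", healthy=False, used_capacity_pct=40.0),
         Site("SITE2", healthy=True, used_capacity_pct=65.0)]
print(pick_site(sites).name)  # SITE2 -- SITE1 is unhealthy and routed around
```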
Software: user-written applications (e.g., COBOL programs) and the middleware run-time environment (e.g., CICS regions, InfoSphere Replication Server instances, and DB2 subsystems)
Data: a related set of objects that must preserve transactional consistency and, optionally, referential integrity constraints (such as DB2 Tables, IMS Databases, and VSAM Files)
Network connectivity: one or more TCP/IP addresses & ports (e.g., 10.10.10.1:80)
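One way to picture that aggregation is a simple record type (a hypothetical representation, not a GDPS definition format; the member names are invented):

```python
from dataclasses import dataclass

@dataclass
class Workload:
    """A workload aggregates software, data, and network connectivity."""
    name: str
    software: list[str]   # e.g., COBOL applications, CICS regions, DB2 subsystems
    data: list[str]       # e.g., DB2 tables, IMS databases, VSAM files
    endpoints: list[str]  # e.g., "10.10.10.1:80"

banking = Workload(
    name="BANKING",
    software=["CICSPROD", "DB2PROD"],
    data=["DB2.ACCOUNTS", "DB2.LEDGER"],  # names invented for illustration
    endpoints=["10.10.10.1:80"],
)
```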
Two Production Sysplex environments (also referred to as sites) in different locations
Two Production Sysplex environments (also referred to as sites) in different locations
One active and one standby site for each defined update workload, with potential query workloads active in both sites
Software-based replication between the two sysplexes/sites
However, GDPS requires that the replication capture and apply engines be members of the Workload
Rationale
The A/A Controller needs to know which capture and apply engines belong to a Workload in order to:
Quiesce work properly, including replication
Send commands to them (a sketch follows below)
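To make the rationale concrete, here is a minimal sketch of a member registry in which the replication engines are explicit workload members, so a controller can quiesce and command them (all names are invented for illustration):

```python
# Hypothetical member registry: the capture and apply engines are listed
# as members of the workload alongside the application components.
workload_members = {
    "BANKING": {
        "application": ["CICSPROD", "DB2PROD"],
        "replication": ["QCAPTURE_SITE1", "QAPPLY_SITE2"],
    },
}

def quiesce_workload(workload: str) -> None:
    """Stop application members first, then the replication engines."""
    members = workload_members[workload]
    for component in members["application"] + members["replication"]:
        print(f"sending STOP to {component}")  # stand-in for a real command

quiesce_workload("BANKING")
```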
Provide DR for the whole production sysplex (A/A workloads & non-A/A workloads)
Restore A/A Sites capability for A/A Sites workloads after a planned or unplanned region switch
Restart batch workloads after the prime site is restarted and re-synced
The disk replication integration is optional
Option 1 – create new sysplex environments for active/active workloads
Simplifies operations, as the scope of the Active/Active environment is confined to just the specific workload(s) and the Active/Active managed data
Option 2 – Active/Active workload and traditional workload co-exist within the same sysplex
A new active sysplex will still be needed for the second site
Increased complexity: the Active/Active workload must be recovered to one place, and the remaining systems to a different environment, from within the same sysplex
Existing GDPS/PPRC customers will have to implement GDPS co-operation support between GDPS/PPRC and GDPS/Active-Active
Active/Query configuration
Fulfills the Statement of Direction (SoD) made when the Active/Standby configuration was announced
Requires either CICS TS V5 for CICS/VSAM applications or CICS VR V5 for logging of non-CICS workloads
Support for IIDR for DB2 (Qrep) Multiple Consistency Groups
Enables support for massive replication scalability
Workload switch automation
Avoids manually checking that replication updates have drained as part of the switch process
GDPS/PPRC Co-operation support
Enables GDPS/PPRC and GDPS/A-A to coexist without contention over which product manages the systems
Disk replication integration
Provides tight integration with GDPS/MGM for GDPS/A-A to be able to manage disaster recovery for the entire sysplex
Large Chinese financial institution
Several critical workloads
Self-service (ATMs)
Internet banking
Internet banking (query-only)
Workloads access data from DB2 tables through CICS
Planned outages
Minor application upgrades (as needed)
Often included DB2 table schema changes
Quarterly application version upgrades
Other planned maintenance activities such as software infrastructure
Critical workloads were down for three to four hours
Scheduled for third shift local time on weekends to limit the impact on banking customers
Still affected customers accessing accounts from other worldwide locations
Site taken down for application upgrades, possible database schema changes, and scheduled maintenance
All business stopped
Required manual coordination across geographic locations to block and resume routing of connections into the data center
Reloading DB2 data lengthened the outage period
Goal was to reduce planned outage time for these workloads down to minutes
Solution provides
A transactionally consistent copy of DB2 data at a remote site
IBM InfoSphere Data Replication for DB2 for z/OS (IIDR) - provides a high-performance replication solution for DB2
A method to easily switch selected workloads to a remote site without any application changes
IBM Multi-site Workload Lifeline (Lifeline) - facilitates planned outages by rerouting workloads from one site to another without disruption to users
A centralized point of control to manage the graceful switch
GDPS Active/Active Sites - coordinates interactions between IIDR and Lifeline to enable a non-disruptive switch of workloads without loss of data
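Put together, the graceful switch these three components perform looks roughly like the following sketch; the function names are ours, not product commands, and the real coordination is done by GDPS:

```python
# Stand-ins for product actions: GDPS drives Lifeline (routing) and
# IIDR (replication); these function names are invented for illustration.
def stop_routing(workload, site):  print(f"stop routing {workload} into {site}")
def start_routing(workload, site): print(f"start routing {workload} into {site}")
def drain_replication(workload):   print(f"{workload}: captured updates applied")
def stop_replication(workload):    print(f"{workload}: replication stopped")
def reverse_replication(workload): print(f"{workload}: replication reversed")

def graceful_switch(workload: str, from_site: str, to_site: str) -> None:
    """Illustrative planned-switch sequence from one point of control."""
    stop_routing(workload, from_site)  # quiesce new connections at the old site
    drain_replication(workload)        # ensure no update is left behind
    stop_replication(workload)
    reverse_replication(workload)      # replicate back, for later fail-back
    start_routing(workload, to_site)   # users now land on the new site

graceful_switch("BANKING", "SITE1", "SITE2")
```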
Reduced impact to their banking customers!
Total outage time for update workloads was reduced from 3-4 hours down to about 2 minutes
Total outage time for the query workload was reduced from 3-4 hours down to under 2 minutes
IBM Multi-site Workload Lifeline v2.0
Advisor – runs on the Controllers and provides information to the external load balancers on where to send connections, and information to GDPS on the health of the environment
There is one primary and one secondary advisor
Agent – runs on all production images with active/active workloads defined and provides information to the Lifeline Advisor on the health of that system
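A minimal sketch of that Advisor/Agent split (the report fields and system names are illustrative, not the Lifeline protocol):

```python
# Agents report per-system health; the Advisor turns the reports into
# routing recommendations for the external load balancers.
agent_reports = {
    "SYS1@SITE1": {"workload": "BANKING", "healthy": True},
    "SYS2@SITE2": {"workload": "BANKING", "healthy": True},
}

def advise(workload: str) -> list[str]:
    """Advisor view: which systems may the load balancers target?"""
    return [system for system, report in agent_reports.items()
            if report["workload"] == workload and report["healthy"]]

print(advise("BANKING"))  # ['SYS1@SITE1', 'SYS2@SITE2']
```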
IBM Tivoli NetView Monitoring for GDPS v6.2 or higher
Runs on all systems and provides automation and monitoring functions. This new product requires IBM Tivoli NetView for z/OS at the same version/release as a prerequisite. The NetView Enterprise Master runs on the Primary Controller
IBM Tivoli Monitoring v6.3 FP1
Can run on zLinux or distributed servers – provides the monitoring infrastructure and portal, plus alerting/situation management, via Tivoli Enterprise Portal, Tivoli Enterprise Portal Server, and Tivoli Enterprise Monitoring Server
If running NetView Monitoring for GDPS v6.2.1 and NetView for z/OS v6.2.1, ITM v6.3 FP3 is required.
IBM InfoSphere Data Replication for DB2 for z/OS v10.2
Runs on production images where required to capture (active) and apply (standby) data updates for DB2 data. Relies on MQ as the data transport mechanism (QREP)
IBM InfoSphere Data Replication for IMS for z/OS v11.1
Runs on production images where required to capture (active) and apply (standby) data updates for IMS data. Relies on TCP/IP as the data transport mechanism
IBM InfoSphere Data Replication for VSAM for z/OS v11.1
Runs on production images where required to capture (active) and apply (standby) data updates for VSAM data. Relies on TCP/IP as the data transport mechanism. Requires CICS TS or CICS VR
System Automation for z/OS v3.4 or higher
Runs on all images. Provides a number of critical functions:
BCPii for GDPS
Remote communications capability to enable GDPS to manage sysplexes from outside the sysplex
System Automation infrastructure for workload and server management
Optionally, the OMEGAMON XE products can provide additional insight into the underlying components of Active/Active Sites, such as z/OS, DB2, IMS, the network, and storage
There are two “suite” offerings that include the OMEGAMON XE products (OMEGAMON Performance Management Suite and Service Management Suite for z/OS).
Active/Active Sites
This is the overall concept of the shift from a failover model to a continuous availability model.
Often used to describe the overall solution, rather than any specific product within the solution.
GDPS/Active-Active
The name of the GDPS product which provides, along with the other products that make up the solution, the capabilities mentioned in this presentation such as workload, replication and routing management and so on.
Update Workloads
Currently run only in what is defined as an active/standby configuration
Perform updates to the data associated with the workload
Have a relationship with the data replication component
Not all transactions within this workload will necessarily be update transactions
Query Workloads
Run in what is defined as an active/query configuration
Must not perform any updates to the data associated with the workload
Must be associated with an update workload (the rules are sketched below)
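Those rules are easy to state as a small validation routine (an illustrative encoding, not a GDPS configuration check):

```python
from typing import Optional

def validate_workload(kind: str, performs_updates: bool,
                      associated_update_workload: Optional[str]) -> None:
    """Encode the update/query workload rules listed above."""
    if kind == "update":
        return  # update workloads run active/standby and drive replication
    if kind == "query":
        if performs_updates:
            raise ValueError("a query workload must not update its data")
        if associated_update_workload is None:
            raise ValueError("a query workload needs an update workload")
        return
    raise ValueError(f"unknown workload kind: {kind}")

validate_workload("query", performs_updates=False,
                  associated_update_workload="BANKING")  # passes silently
```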
A Consistency Group (CG) corresponds to a set of DB2 tables for which the replication apply process maintains transactional consistency - by applying data-dependent transactions serially, and other transactions in parallel
Multiple Consistency Groups (MCGs) are primarily used to provide scalability
if and when one CG (Single Consistency Group) cannot keep up with all transactions for one workload
query workloads can tolerate data replicated with eventual consistency
Q Replication (V10.2.1) can coordinate the Apply programs across CGs to guarantee that a time-consistent point across all CGs can be established at the standby site, following a disaster or outage, before switching workloads to this standby side
GDPS operations on a workload control and coordinate replication for all CGs that belong to that workload
For example, 'STOP REPLICATION' for a workload stops replication in a coordinated manner for all CGs (all queues and Capture/Apply programs)
GDPS supports up to 20 consistency groups for each workload
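The serial-versus-parallel apply idea behind a Consistency Group can be sketched as wave planning: transactions touching common rows apply in commit order, everything else in parallel. This illustrates the concept only; it is not the actual Q Replication algorithm:

```python
def plan_apply(transactions):
    """transactions: (tx_id, row_keys) pairs in source-commit order.
    Returns waves; transactions within one wave are mutually independent."""
    waves, wave_rows = [], []  # wave_rows[i] = rows touched by wave i
    for tx_id, rows in transactions:
        rows = set(rows)
        last_conflict = max((i for i, touched in enumerate(wave_rows)
                             if rows & touched), default=-1)
        target = last_conflict + 1  # apply after every tx it depends on
        if target == len(waves):
            waves.append([])
            wave_rows.append(set())
        waves[target].append(tx_id)
        wave_rows[target] |= rows
    return waves

txns = [("T1", {"acct1"}), ("T2", {"acct2"}), ("T3", {"acct1", "acct3"})]
print(plan_apply(txns))  # [['T1', 'T2'], ['T3']] -- T3 depends on T1
```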
A workload is the aggregation of these components
Software: user-written applications (e.g., COBOL programs) and the middleware run-time environment (e.g., CICS regions, InfoSphere Replication Server instances, and DB2 subsystems)
Data: a related set of objects that must preserve transactional consistency and, optionally, referential integrity constraints (e.g., DB2 Tables, IMS Databases, VSAM Files)
Network connectivity: one or more TCP/IP addresses and ports (e.g., 10.10.10.1:80)
In DB2 Replication, the mapping between a table at the source and a table at the target is called a subscription
The example shows two subscriptions, for tables T1 and T2
A subscription belongs to a QMap, which defines the sendq used to send data for that subscription
The example shows that both subscriptions use the same QMap (SQ1)
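The T1/T2 example can be pictured as follows (an illustrative model; the queue name is invented, and real subscriptions live in replication control tables):

```python
# Two subscriptions sharing one QMap: one send queue carries both
# tables' changes from the source to the target site.
qmaps = {"SQ1": {"sendq": "REPL.SITE1.TO.SITE2.DATA"}}  # invented queue name

subscriptions = [
    {"name": "T1SUB", "source": "SCHEMA.T1", "target": "SCHEMA.T1", "qmap": "SQ1"},
    {"name": "T2SUB", "source": "SCHEMA.T2", "target": "SCHEMA.T2", "qmap": "SQ1"},
]

for sub in subscriptions:
    sendq = qmaps[sub["qmap"]]["sendq"]
    print(f"{sub['source']} -> {sub['target']} via {sendq}")
```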
In IMS Replication, a subscription is a combination of a source server and a target server
The subscription is the object that is started/stopped by GDPS/A-A.
This corresponds to the QMap in Q Replication
Each IMS Replication subscription contains a list of replication mappings
There is one replication mapping for each IMS database being replicated
This corresponds to a subscription in Q Replication
Automation code is an extension of many of the techniques tried and tested in other GDPS products and in many client environments for managing their mainframe CA & DR requirements
Control code runs only on Controller systems
Workload management - start/stop components of a workload in a given Sysplex
Software Replication management - start/stop replication for a given workload between sites
Disk Replication management – ability to manipulate GDPS/MGM from GDPS/A-A
Routing management - start/stop routing of connections to a site
System and Server management - STOP (graceful shutdown) of a system, LOAD, RESET, ACTIVATE, DEACTIVATE the LPAR for a system, and capacity on demand actions such as CBU/OOCoD
Monitoring the environment and alerting for unexpected situations
Planned/Unplanned situation management and control - planned or unplanned site or workload switches; automatic actions such as automatic workload switch (policy dependent)
Powerful scripting capability for complex/compound scenario automation
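To show how these functions compose, here is a sketch of a compound scripted scenario; the action strings are stand-ins, not actual GDPS script syntax:

```python
# An unplanned-site-failure scenario chaining several of the functions
# listed above: capacity on demand, LPAR activation, workload switch.
UNPLANNED_SITE1_FAILURE = [
    "CAPACITY ACTIVATE CBU SITE2",         # bring extra capacity online
    "SYSTEM   ACTIVATE LPAR PRODB SITE2",  # activate a backup LPAR
    "WORKLOAD SWITCH BANKING SITE2",       # move routing to the surviving site
    "ALERT    NOTIFY OPERATIONS",          # raise an operator alert
]

def run_script(script: list[str]) -> None:
    """A real controller would drive product APIs; we just trace the steps."""
    for statement in script:
        print("executing:", statement)

run_script(UNPLANNED_SITE1_FAILURE)
```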