Disaster Recovery Procedures

Introduction

Prerequisites for the application: The application is fully functional and running; automatic reconnection and session replication work in case of problems.
Goal of the document: The primary goal of this document is to describe the procedure by which IT Ops can restore full application functionality after various types of outages. The secondary goal is to record how the application behaves during those outages.
Abbreviations: AS - Application server, DB - Database, FS - Filesystem, OS - Operating system, P - Primary, S - Secondary
The document is supplied by: Delivery manager
The document is accepted by: IT Ops

Application Server Outage

For all AS platforms

Step No. | Activity 01: AS - CPU at 100% on the OS where the AS runs | Procedure to restore full application functionality
1 | Check that the application is running and monitor logs or servlets. The application runs on both sides of the AS. | After the OS resources are released, the application's responsiveness and functionality should be restored. If they are not, the application must be restarted.
2 | Load the CPU to 100% on the OS where all AS instances are running. Cooperation of the OS admins is required. |
3 | Testers log in to the application. |
4 | Test and evaluate the application's behavior. |
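
Step 2 of this activity needs the CPU driven to 100% for a controlled period. The following is a minimal sketch of one way to generate that load, assuming Python is available on the AS host. The function names `burn_cpu` and `load_all_cores` are illustrative and not part of the application or platform tooling; the duration is something to agree on with the OS admins.

```python
import multiprocessing
import os
import time


def burn_cpu(seconds: float) -> int:
    """Busy-loop on one core for roughly `seconds`; return the loop count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def load_all_cores(seconds: float) -> None:
    """Start one busy-loop process per logical CPU and wait for all of them."""
    workers = [
        multiprocessing.Process(target=burn_cpu, args=(seconds,))
        for _ in range(os.cpu_count() or 1)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```

Calling `load_all_cores(60.0)` from a script guarded by `if __name__ == "__main__":` would keep every logical CPU busy for about a minute. Because the load ends on its own, the OS resources are released without manual cleanup.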

Step No. | Activity 02: AS - Filesystem full where the AS runs | Procedure to restore full application functionality
1 | Check that the application is running and monitor logs or servlets. The application runs on both sides of the AS. | After space on the FS is released, the application must be restarted. External components reconnect automatically after the AS recovers.
2 | Testers log in to the application. |
3 | Fill the FS to 100% where all AS instances are running. /tmp, /logs, /opt, and /data should be considered. |
4 | Test and evaluate the application's behavior. |
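
While the filesystem is being filled, and again after space is released, it helps to confirm how full each mount point actually is. This is a minimal sketch using the standard library's `shutil.disk_usage`. The mount point list is taken from step 3; the helper names and the 95% threshold are assumptions made for illustration.

```python
import os
import shutil

# Mount points named in the activity; adjust to the actual AS layout.
MOUNT_POINTS = ["/tmp", "/logs", "/opt", "/data"]


def usage_percent(path: str) -> float:
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # some pseudo filesystems report no capacity
        return 0.0
    return 100.0 * usage.used / usage.total


def full_mount_points(paths, threshold: float = 95.0):
    """Return (path, percent) pairs at or above `threshold`; skip missing paths."""
    result = []
    for path in paths:
        if not os.path.isdir(path):
            continue
        percent = usage_percent(path)
        if percent >= threshold:
            result.append((path, percent))
    return result
```

Running `full_mount_points(MOUNT_POINTS)` before the restart gives a quick check that the space really has been released.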

Step No. | Activity 03: AS - Full disaster (more time-consuming) | Procedure to restore full application functionality
1 | Stop all application components, nodes, and admin servers of the platform. | The AS starts automatically at OS startup.
2 | Archive the whole environment in full: standard backup plus a tar of the important mount points. |
3 | Simulate an OS crash. |
4 | Restore the OS. |
5 | Restore the backup. |
6 | Start the environment, check the connection to the admin console, and start the application. |
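
The tar of important mount points in step 2 could be scripted roughly as follows. This is a sketch only, using Python's standard `tarfile` module. The function name `archive_paths` is hypothetical, and it complements rather than replaces the standard backup.

```python
import os
import tarfile


def archive_paths(paths, archive_file):
    """Write every existing path into a gzip-compressed tar archive.

    Returns the list of paths that were actually archived, so missing
    mount points are visible to the operator instead of failing silently.
    """
    archived = []
    with tarfile.open(archive_file, "w:gz") as tar:
        for path in paths:
            if os.path.exists(path):
                # Store members without the leading slash so the archive
                # can be extracted into any restore directory.
                tar.add(path, arcname=path.lstrip("/"))
                archived.append(path)
    return archived
```

Comparing the returned list with the requested paths before the crash simulation confirms that nothing needed for the restore was skipped.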

Database Outage

Step No. | Scenario 01: DB - Database outage (single-instance DB) | Procedure to restore full application functionality
1 | Check that the application is running and monitor logs or servlets. The application runs on both sides of the AS. | The application can reconnect to the database automatically.
2 | Restart the DB. |
3 | Test the basic functionality of the application after the DB has started. |
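
The scenario relies on the application reconnecting to the database by itself. How the application does this internally is not described here; the sketch below only illustrates the general pattern of retrying a connection with an increasing delay. The `connect` callable and all parameter values are placeholders.

```python
import time


def reconnect(connect, attempts=5, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call `connect()` until it succeeds or `attempts` are used up.

    Waits `delay` seconds after the first failure and multiplies the wait
    by `backoff` after each further failure. The last error is re-raised
    if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff
```

When testing the scenario, the application log should show a comparable sequence: failed attempts while the DB restarts, then a successful connection without manual intervention.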

Note

For applications that also run in a so-called "closing" mode (End-of-Day processing and similar), scenario 01 DB must also be executed during that processing.

LDAP Outage

Step No. | Activity 01: LDAP - DNS failover | Procedure to restore full application functionality
1 | Check that the application is running and monitor logs. | Excalibur Facade can reconnect to LDAP automatically.
2 | Fail the primary LDAP server while the application is running and test functionality. The application should re-resolve the LDAP DNS record. |
3 | Test the basic functionality of the application. |
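
DNS failover only works if the client resolves the LDAP host name again on reconnect instead of reusing a cached address. The sketch below shows that idea with plain TCP sockets; it is not how Excalibur Facade is implemented, and `connect_with_fresh_dns` is a hypothetical helper. The operating system's resolver may still cache answers, which is outside the control of this code.

```python
import socket


def connect_with_fresh_dns(host, port, timeout=3.0):
    """Resolve `host` on every call and connect to the first address that accepts.

    Returns the connected socket. Raises ConnectionError if no resolved
    address accepts the connection.
    """
    last_error = None
    for family, socktype, proto, _name, address in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(address)
            return sock
        except OSError as error:
            sock.close()
            last_error = error
    raise ConnectionError(f"no reachable address for {host}:{port}") from last_error
```

Because resolution happens inside the function, each reconnect attempt sees the DNS record as it is at that moment, including a record that now points at the surviving LDAP server.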

Step No. | Activity 02: LDAP - Complete outage | Procedure to restore full application functionality
1 | Check that the application is running and monitor logs. | Excalibur Facade can reconnect to LDAP automatically.
2 | Disable all LDAP servers while the application is running and test functionality. Then start the LDAP servers again. |
3 | Test the basic functionality of the application. |

Other Outages

Step No. | Activity 01: Excalibur Facade (gMSA) - Graceful stop of the secondary server (the AS is connected to the secondary server before tracking) | Procedure to restore full application functionality
1 | Check that the application is running and monitor logs or servlets. | Excalibur Facade can reconnect to the AS automatically.
2 | The administrator stops the primary server. |
3 | Testers log in to the application. |
4 | Start the primary server. |
5 | Test, ideally with an activity that lasts longer. |
6 | Stop the secondary server gracefully. |
7 | Requests should be redirected to the primary server. |
8 | Test and evaluate the application. |
9 | Repeat the entire test, but reverse the server tracking order. |

Step No. | Activity 02: Excalibur Facade (gMSA) - Killing the secondary server (the AS is connected to the secondary server before tracking) | Procedure to restore full application functionality
1 | Check that the application is running and monitor logs or servlets. | Excalibur Facade can reconnect to the AS automatically.
2 | The administrator stops the primary server. |
3 | Testers log in to the application. |
4 | Start the primary server. |
5 | Test, ideally with an activity that lasts longer. |
6 | Stop the secondary server forcibly (KILL). |
7 | Requests should be redirected to the primary server. |
8 | Test and evaluate the application. |
9 | Repeat the entire test, but reverse the server tracking order. |
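
In both secondary-server scenarios, step 7 expects requests to move to the server that is still up. During testing, a small probe can show which server is currently accepting connections. The sketch below is an assumption-laden illustration: `tcp_up` and `pick_server` are hypothetical helpers, and the real redirection logic of Excalibur Facade is not documented here.

```python
import socket


def tcp_up(address, timeout=2.0):
    """Return True if a TCP connection to the (host, port) pair can be opened."""
    try:
        with socket.create_connection(address, timeout=timeout):
            return True
    except OSError:
        return False


def pick_server(servers, is_up=tcp_up):
    """Return the first server, in preference order, that `is_up` reports as up.

    Returns None when no server is available.
    """
    for server in servers:
        if is_up(server):
            return server
    return None
```

Calling `pick_server` with the secondary listed first, then again after the secondary is stopped, should switch the answer from the secondary to the primary, which mirrors the behavior the test expects from the application.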

Step No. | Activity 03: Excalibur Facade (gMSA) - Complete outage | Procedure to restore full application functionality
1 | Check that the application is running and monitor logs. | Excalibur Facade can reconnect to the AS automatically.
2 | Shut down the Excalibur Facade servers while the application is running and test functionality. Then start the Excalibur Facade servers again. |
3 | Test the basic functionality of the application. |

Step No. | Activity 04: Excalibur Facade (ADDC) - CPU utilization at 100% on the OS where the facade service is running | Procedure to restore full application functionality
1 | Check that the application is running and monitor logs. | After the OS resources are released, the component's responsiveness and functionality should be restored. If they are not, the service must be restarted.
2 | Load the CPU to 100%. Cooperation of the OS admins is required. |
3 | Test and evaluate the application. |

Step No. | Activity 05: Excalibur Facade (ADDC) - Filesystem full where the facade service is running | Procedure to restore full application functionality
1 | Check that the application is running and monitor logs. | After space on the FS is released, the service must be restarted.
2 | Fill the FS to 100%. Cooperation of the OS admins is required. |
3 | Test and evaluate the application. |

Further materials

Excalibur Application Monitoring

Excalibur Administrator Manual