What is the maximum tolerable period in which data might be lost from an IT service due to a major incident?

Domain 6

Eric Conrad, ... Joshua Feldman, in CISSP Study Guide, 2010

Self test

1. Maximum Tolerable Downtime (MTD) is also known as what?

A. Maximum Allowable Downtime (MAD)

B. Mean Time Between Failures (MTBF)

C. Mean Time to Repair (MTTR)

D. Minimum Operating Requirements (MOR)

2. What is the primary goal of disaster recovery planning (DRP)?

A. Integrity of data

B. Preservation of business capital

C. Restoration of business processes

D. Safety of personnel

3. What business process can be used to determine the outer bound of a Maximum Tolerable Downtime?

A. Accounts receivable

B. Invoicing

C. Payroll

D. Shipment of goods

4. Your Maximum Tolerable Downtime is 48 hours. What is the most cost-effective alternate site choice?

A. Cold

B. Hot

C. Redundant

D. Warm

5. A structured walkthrough test is also known as what kind of test?

A. Checklist

B. Simulation

C. Tabletop Exercise

D. Walkthrough Drill

6. Which Plan provides the response procedures for occupants of a facility in the event a situation poses a threat to the health and safety of personnel?

A. Business resumption/recovery plan (BRP)

B. Continuity of Operations Plan (COOP)

C. Crisis Management Plan (CMP)

D. Occupant Emergency Plan (OEP)

7. Which type of tape backup requires a maximum of two tapes to perform a restoration?

A. Differential backup

B. Electronic vaulting

C. Full backup

D. Incremental backup

8. What statement regarding the Business Continuity Plan is true?

A. BCP and DRP are separate, equal plans

B. BCP is an overarching “umbrella” plan that includes other focused plans such as DRP

C. DRP is an overarching “umbrella” plan that includes other focused plans such as BCP

D. COOP is an overarching “umbrella” plan that includes other focused plans such as BCP

9. Which HA solution involves multiple systems all of which are online and actively processing traffic or data?

A. Active-active cluster

B. Active-passive cluster

C. Database shadowing

D. Remote journaling

10. What plan is designed to provide effective coordination among the managers of the organization in the event of an emergency or disruptive event?

A. Call tree

B. Continuity of Support Plan

C. Crisis Management Plan

D. Crisis Communications Plan

11. Which plan details the steps required to restore normal business operations after recovering from a disruptive event?

A. Business Continuity Planning (BCP)

B. Business Resumption Planning (BRP)

C. Continuity of Operations Plan (COOP)

D. Occupant Emergency Plan (OEP)

12. What metric describes how long it will take to recover a failed system?

A. Minimum Operating Requirements (MOR)

B. Mean Time Between Failures (MTBF)

C. The Mean Time to Repair (MTTR)

D. Recovery Point Objective (RPO)

13. What metric describes the moment in time in which data must be recovered and made available to users in order to resume business operations?

A. Mean Time Between Failures (MTBF)

B. The Mean Time to Repair (MTTR)

C. Recovery Point Objective (RPO)

D. Recovery Time Objective (RTO)

14. Maximum Tolerable Downtime (MTD) is comprised of which two metrics?

A. Recovery Point Objective (RPO) and Work Recovery Time (WRT)

B. Recovery Point Objective (RPO) and Mean Time to Repair (MTTR)

C. Recovery Time Objective (RTO) and Work Recovery Time (WRT)

D. Recovery Time Objective (RTO) and Mean Time to Repair (MTTR)

15. Which draft Business Continuity guideline ensures business continuity of the Information and Communications Technology (ICT), as part of the organization's Information Security Management System (ISMS)?

A. BCI

B. BS-7799

C. ISO/IEC-27031

D. NIST Special Publication 800-34

URL: https://www.sciencedirect.com/science/article/pii/B978159749563900007X

Domain 7: Security Operations (e.g., Foundational Concepts, Investigations, Incident Management, Disaster Recovery)

Eric Conrad, ... Joshua Feldman, in CISSP Study Guide (Third Edition), 2016

Backups and Availability

Although backup techniques are also reviewed as part of the Fault Tolerance section discussed previously in this chapter, discussions of Business Continuity and Disaster Recovery Planning would be remiss if attention were not given to backup and availability planning techniques. To recover critical business operations successfully, the organization needs to be able to back up and restore both systems and data effectively and efficiently. Though many organizations are diligent about creating backups, verifying that those backups can actually be recovered is at least as important and is often overlooked. When the detailed recovery process for a given backup solution is thoroughly reviewed, some specific requirements will become obvious. One of the most important points to make when discussing backup with respect to disaster recovery and business continuity is ensuring that critical backup media is stored offsite. Further, that offsite location should be situated such that, during a disaster event, the organization can efficiently access the media with the purpose of taking it to a primary or secondary recovery location.

A further consideration beyond efficient access to the backup media being leveraged is the ability to actually restore said media at either the primary or secondary recovery facility. Quickly procuring large high-end tape drives for reading special-purpose, high-speed, high-capacity tape solutions is untenable during most disasters. Yet many recovery solutions either simply ignore this fact or erroneously build the expectation of prompt acquisition into their MTTR calculations.

Because MTD targets keep shrinking at many organizations, with some systems now requiring Continuous Availability (an MTD of zero), organizations often must review their existing backup paradigms to determine whether the MTTR of the standard solution exceeds the MTD for the systems covered. If the MTTR is greater than the MTD, then an alternate backup or availability methodology must be employed. Although traditional tape solutions keep getting faster and hold more data, tape-oriented backup and recovery might not be viable for some critical systems because of the protracted recovery time associated with acquiring the necessary tapes and pulling the associated system image and/or data from them.
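The MTTR-versus-MTD review described above can be sketched as a simple comparison. The following Python is only an illustration: the system names and hour figures are invented, and `SystemRecoveryProfile` and `needs_alternate_method` are hypothetical names, not part of any standard.

```python
from dataclasses import dataclass


@dataclass
class SystemRecoveryProfile:
    name: str
    mtd_hours: float   # Maximum Tolerable Downtime
    mttr_hours: float  # Mean Time to Repair with the current backup solution


def needs_alternate_method(profile: SystemRecoveryProfile) -> bool:
    """Return True when the current solution's MTTR exceeds the MTD."""
    return profile.mttr_hours > profile.mtd_hours


# Hypothetical systems and figures, for illustration only.
profiles = [
    SystemRecoveryProfile("payroll", mtd_hours=72, mttr_hours=24),
    SystemRecoveryProfile("order-entry", mtd_hours=4, mttr_hours=12),
    SystemRecoveryProfile("trading", mtd_hours=0, mttr_hours=1),
]

# Systems whose backup or availability approach should be revisited.
flagged = [p.name for p in profiles if needs_alternate_method(p)]
print(flagged)
```

A system with an MTD of zero is flagged by any nonzero MTTR, which matches the point that Continuous Availability cannot be met by a restore-based approach.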

Note

When considering the backup and availability of systems and data, be certain to address software licensing considerations. Though some vendors only require licenses for the total number of their product actively being used at one time, which could accommodate some recovery scenarios involving failover operations, others would require a full license for each system that might be used. Also, when recovering back to the primary computing facility, it is common to have both the primary and secondary systems online simultaneously; even if that is not typically the case, consider whether the vendor expects a full license for both systems. Another point regarding licensing and recovery is that many vendors will allow cheaper licenses to cover the hot spare, hot standby, failover, or passive system in an active-passive cluster as long as only one of those systems will be processing at any given time. The complexities and nuances of individual vendors’ licensing terms are well beyond the scope of both this book and the CISSP® exam, but be certain to determine what the actual licensing needs are in order to legally satisfy recovery.

Hardcopy Data

If a disruptive event such as a natural disaster disables the local power grid, and power dependency is problematic, the organization may be able to operate its most critical functions using only hardcopy data. Hardcopy data is any data accessed by reading or writing on paper rather than by processing through a computer system.

In weather-emergency-prone areas such as Florida, Mississippi, and Louisiana, many businesses develop a “paper only” DRP, which allows them to operate key critical processes with just hard copies of data, battery-operated calculators, and other small electronics, as well as pens and pencils. One such organization is the Lynx transit system, responsible for public bus operations in the Orlando, Florida area. If a natural disaster disables utilities and power, the system has a process in place under which all bus operations move to paper-and-pencil record keeping until power can be restored.

Electronic Backups

Electronic backups are archives that are stored electronically and can be retrieved in case of a disruptive event or disaster. Choosing the correct data backup strategy is dependent upon how users store data, the availability of resources and connectivity, and what the ultimate recovery goal is for the organization.

Preventative restoration is a recommended control: restore data to test the validity of the backup process. If a reliable system (such as a mainframe) copies data to tape every day for years, what assurance does the organization have that the process is working? Do the tapes (and data they contain) have integrity?

Many organizations discover backup problems at the worst time: after an operational data loss. A preventative restoration can identify problems before any data is lost.
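A preventative restoration can be approximated in a few lines: restore the backup to a scratch location and compare a checksum of the restored copy against the original. This is a minimal sketch under stated assumptions. The file names are hypothetical, `shutil.copyfile` stands in for whatever restore step the real backup product performs, and a production check would compare against a checksum recorded at backup time rather than against a live file that may have changed since.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source: Path, backup: Path, restore_dir: Path) -> bool:
    """Restore the backup to a scratch directory and compare checksums."""
    restored = restore_dir / source.name
    shutil.copyfile(backup, restored)  # stand-in for the real restore step
    return sha256_of(restored) == sha256_of(source)


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    source = tmp_path / "ledger.dat"          # hypothetical production file
    source.write_bytes(b"critical business records")
    backup = tmp_path / "ledger.bak"          # hypothetical backup copy
    shutil.copyfile(source, backup)
    restore_dir = tmp_path / "restore"
    restore_dir.mkdir()

    good_restore = restore_matches_source(source, backup, restore_dir)

    # Simulate silent corruption of the backup media.
    backup.write_bytes(b"corrupted")
    bad_restore = restore_matches_source(source, backup, restore_dir)

print(good_restore, bad_restore)
```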

Full Backups

A full system backup means that every piece of data is copied and stored on the backup repository. Conducting a full backup is time consuming, bandwidth intensive, and resource intensive. However, a full backup ensures that all necessary data is captured in a single backup set.

Incremental Backups

Incremental backups archive data that have changed since the last full or incremental backup. For example, a site performs a full backup every Sunday, and daily incremental backups from Monday through Saturday. If data are lost after the Wednesday incremental backup, four tapes are required for restoration: the Sunday full backup, as well as the Monday, Tuesday, and Wednesday incremental backups.

Differential Backups

Differential backups operate in a similar manner to incremental backups, with one key difference: differential backups archive data that have changed since the last full backup.

For example, the same site in our previous example switches to differential backups. They lose data after the Wednesday differential backup. Now only two tapes are required for restoration: the Sunday full backup and the Wednesday differential backup.
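The tape counts in the two examples can be reproduced with a short sketch. It assumes the schedule used in the text (a full backup on Sunday, then one backup per day), and the function name `tapes_for_restore` is hypothetical.

```python
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]


def tapes_for_restore(failure_day: str, scheme: str) -> list[str]:
    """List the tapes needed to restore after the backup taken on failure_day.

    Assumes a full backup on Sunday and one backup per day afterwards.
    """
    index = DAYS.index(failure_day)
    tapes = ["Sun full"]
    if index == 0:
        return tapes  # only the full backup exists so far
    if scheme == "incremental":
        # Every incremental since the full backup is needed, in order.
        tapes += [f"{day} incremental" for day in DAYS[1:index + 1]]
    elif scheme == "differential":
        # Only the most recent differential is needed.
        tapes.append(f"{failure_day} differential")
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return tapes


print(tapes_for_restore("Wed", "incremental"))
print(tapes_for_restore("Wed", "differential"))
```

The trade-off is that each differential backup grows through the week, while each incremental stays small but lengthens the restore chain.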

Tape Rotation Methods

A common tape rotation method is called FIFO (First In First Out). Assume you are performing full daily backups, and have 14 rewritable tapes total. FIFO (also called round robin) means you will use each tape in order, and cycle back to the first tape after the 14th is used. This ensures 14 days of data is archived. The downside of this plan is you only maintain 14 days of data: this schedule is not helpful if you seek to restore a file that was accidentally deleted 3 weeks ago.

Grandfather-Father-Son (GFS) addresses this problem. There are 3 sets of tapes: 7 daily tapes (the son), 4 weekly tapes (the father), and 12 monthly tapes (the grandfather). Once per week a son tape graduates to father. Once every 5 weeks a father tape graduates to grandfather. After running for a year this method ensures there are backup tapes available for the past 7 days, weekly tapes for the past 4 weeks, and monthly tapes for the past 12 months.
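The FIFO limitation and the GFS tape counts can be checked with a small sketch. The function names are hypothetical, day numbering starts at zero, and `days_ago` is counted back from the most recent backup.

```python
def fifo_tape(day_number: int, tape_count: int = 14) -> int:
    """Return the 1-based tape used on a 0-based day under FIFO rotation."""
    return day_number % tape_count + 1


def fifo_can_restore(days_ago: int, tape_count: int = 14) -> bool:
    """A daily full backup is still available only until its tape is reused."""
    return 0 <= days_ago < tape_count


# Tape sets for Grandfather-Father-Son as described in the text.
GFS_SETS = {"son (daily)": 7, "father (weekly)": 4, "grandfather (monthly)": 12}
total_gfs_tapes = sum(GFS_SETS.values())

print(fifo_tape(14))          # the 15th backup reuses the first tape
print(fifo_can_restore(21))   # a file deleted 3 weeks ago is out of reach
print(total_gfs_tapes)
```

With 23 tapes instead of 14, GFS trades daily granularity in older periods for a retention window of about a year.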

Electronic Vaulting

Electronic vaulting is the batch process of electronically transmitting data to be backed up at a regularly scheduled interval. It is used to transfer bulk information to an offsite facility. There are a number of commercially available tools and services that can perform electronic vaulting for an organization. Electronic vaulting is a good tool for data that need to be backed up on a daily or possibly even hourly basis. It solves two problems at the same time: it stores sensitive data offsite, and it can perform the backup at very short intervals to ensure that the most recent data is backed up.

Because electronic vaulting occurs across the Internet in most cases, it is important that the information sent for backup be sent via a secure communication channel and protected through a strong encryption protocol.

Remote Journaling

A database journal contains a log of all database transactions. Journals may be used to recover from a database failure. Assume a database checkpoint (snapshot) is saved every hour. If the database loses integrity 20 minutes after a checkpoint, it may be recovered by reverting to the checkpoint, and then applying all subsequent transactions described by the database journal.

Remote Journaling saves the database checkpoints and database journal to a remote site. In the event of failure at the primary site, the database may be recovered.
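Checkpoint-plus-journal recovery can be sketched as follows. This is a toy model, not how any particular database implements it: the database is a plain dictionary, each journal entry is a hypothetical `(sequence, key, value)` tuple, and the account names are invented.

```python
def recover(checkpoint: dict, journal: list, checkpoint_seq: int) -> dict:
    """Revert to the checkpoint, then replay journal entries made after it."""
    state = dict(checkpoint)  # copy so the saved checkpoint is not modified
    for seq, key, value in journal:
        if seq > checkpoint_seq:
            state[key] = value
    return state


# Hypothetical checkpoint taken after journal entry 2.
checkpoint = {"acct-1": 100, "acct-2": 50}
checkpoint_seq = 2

# Hypothetical journal: entries 3 and 4 occurred after the checkpoint.
journal = [
    (1, "acct-1", 100),
    (2, "acct-2", 50),
    (3, "acct-1", 80),
    (4, "acct-3", 20),
]

recovered = recover(checkpoint, journal, checkpoint_seq)
print(recovered)
```

With remote journaling, both `checkpoint` and `journal` would be read from the copies held at the remote site.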

Database Shadowing

Database shadowing uses two or more identical databases that are updated simultaneously. The shadow database(s) can exist locally, but it is best practice to host one shadow database offsite. The goal of database shadowing is to greatly reduce the recovery time for a database implementation. Database shadowing allows faster recovery when compared with remote journaling.

HA Options

Increasingly, systems are being required to have effectively zero downtime, an MTD of zero. Recovery of data on tape is certainly ill equipped to meet these availability demands. The immediate availability of alternate systems is required should a failure or disaster occur. A common way to achieve this level of uptime requirement is to employ a high availability cluster.

Note

Different vendors use different terms for the same principles of having a redundant system actively processing or available for processing in the event of a failure. Though the particular implementations might vary slightly, the overarching goal of continuous availability typically is met with similar though not identical methods, if not terms.

The goal of a high availability cluster is to decrease the recovery time of a system or network device so that the availability of the service is less impacted than it would be by having to rebuild, reconfigure, or otherwise stand up a replacement system. Two typical deployment approaches exist:

Active-active cluster involves multiple systems all of which are online and actively processing traffic or data. This configuration is also commonly referred to as load balancing, and is especially common with public facing systems such as Web server farms.

Active-passive cluster involves devices or systems that are already in place, configured, powered on, and ready to begin processing network traffic should a failure occur on the primary system. Active-passive clusters are often designed such that any configuration changes made on the primary system or device are replicated to the standby system. Also, to expedite the recovery of the service, many failover cluster devices will automatically, with no required user interaction, have services begin being processed on the secondary system should a disruption impact the primary device. It can also be referred to as a hot spare, standby, or failover cluster configuration.
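The difference between the two approaches comes down to which nodes receive work. The sketch below is a deliberately simplified model with hypothetical node names and function names. Real clusters add health checking, state replication, and failback logic that are omitted here.

```python
def route_active_active(request_id: int, healthy_nodes: list[str]) -> str:
    """All nodes process traffic; spread requests round-robin.

    Raises ZeroDivisionError if no healthy nodes remain.
    """
    return healthy_nodes[request_id % len(healthy_nodes)]


def route_active_passive(primary: str, standby: str, primary_healthy: bool) -> str:
    """Only the primary processes traffic; the standby takes over on failure."""
    return primary if primary_healthy else standby


# Active-active: both web servers share the load.
print([route_active_active(i, ["web-1", "web-2"]) for i in range(4)])

# Active-passive: the standby is used only after the primary fails.
print(route_active_passive("fw-a", "fw-b", primary_healthy=True))
print(route_active_passive("fw-a", "fw-b", primary_healthy=False))
```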

Software Escrow

With the ubiquity of the outsourcing of software and application development to third parties, organizations must be sure to maintain the availability of their applications even if the vendor that developed the software initially goes out of business. Vendors who have developed products on behalf of other organizations might well have intellectual property and competitive advantage concerns about disclosing the source code of their applications to their customers. A common middle ground between these two entities is for the application development company to allow a neutral third party to hold the source code. This approach is known as software escrow. Should the development organization go out of business or otherwise violate the terms of the software escrow agreement, the third party holding the escrow will provide the source code and any other information to the purchasing organization.

URL: https://www.sciencedirect.com/science/article/pii/B9780128024379000084

Business Impact Analysis

Susan Snedaker, Chris Rima, in Business Continuity and Disaster Recovery Planning for IT Professionals (Second Edition), 2014

Recovery time requirements

Related to impact criticality are recovery time requirements. Let’s define a few terms here that will make it easier throughout the rest of the analysis to talk in terms of recovery times. As you read through these definitions, you can refer to Figure 5.3 for a representation of the relationship of these elements.

Figure 5.3. Business recovery timeline.

Maximum tolerable downtime (MTD). This is just as it sounds—the maximum time a business can tolerate the absence or unavailability of a particular business function. (Note: The BCI in the United Kingdom uses the phrase MTO instead.) Different business functions will have different MTDs. If a business function is categorized as mission-critical, or Category 1, it will have the shortest MTD. There is a correlation between the criticality of a business function and its maximum downtime. The higher the criticality, the shorter the MTD is likely to be. Downtime consists of two elements, the systems recovery time and the work recovery time. Therefore, MTD = RTO + WRT.

Recovery time objective (RTO). The time available to recover disrupted systems and resources (systems recovery time). It is typically one segment of the MTD. For example, if a critical business process has a 3-day MTD, the RTO might be 1 day (Day 1). This is the time you will have to get systems up and running. The remaining 2 days will be used for work recovery (see “work recovery time”). The RTO is a measure of when the system will be available to begin processing recovery work before being put back into a normalized production mode.

Work recovery time (WRT). The second segment that comprises the MTD. If your MTD is 3 days, Day 1 might be your RTO and Days 2-3 might be your WRT. It takes time to get critical business functions back up and running once the systems (hardware, software, and configuration) are restored. Upstream and downstream systems or interfaces need to be synchronized, data need to be tested to ensure backups are correct and in sequence, data captured manually during a downtime needs to be input, validated, and integrated into existing data. This is an area that some planners overlook, especially from IT. If the systems are back up and running, they’re all set from an IT perspective. From a business function perspective, there are additional steps that must be undertaken before it’s back to business. These are critical steps and that time must be built into the MTD. Otherwise, you’ll miss your MTD requirements and potentially put your entire business at risk.

Recovery point objective (RPO). The amount or extent of data loss that can be tolerated by your critical business systems. For example, some companies perform real-time data backup, some perform hourly or daily backups, some perform weekly backups. They may be full backups, incremental, differential, mirrored, local, remote, or cloud-based. If you perform weekly backups, someone made a decision that your company could tolerate the loss of a week’s worth of data (which should be validated during the BIA process). If backups are performed on Saturday evenings and a system fails on Saturday afternoon, you’ve lost the entire week’s worth of data. This is the RPO. In this case, the RPO is 1 week. If this is not acceptable, your current backup processes must be reviewed and revised. The RPO is based both on current operating procedures and your estimates of what might happen in the event of a business disruption. For example, if a tornado touches down in your town and your data center is without power, you may implement your BC/DR plan. If you have an alternate computing location, you may transfer operations to that location.

Your next step would be to determine the status of the data. Are you attempting to update systems using backups or were these alternate locations kept up to date? When was the last data backup performed relative to business operations? What do you need to bring systems up to date? These are the questions you’d need to answer after a business disruption. Therefore, it’s important to define your RPO beforehand and ensure your recovery processes address these timelines.
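The weekly-backup example can be worked through with dates. The sketch below assumes a backup on Saturday evening and a failure the following Saturday afternoon. The specific dates, times, and function names are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Data created in this window exists only on the failed system."""
    return failure - last_backup


def meets_rpo(last_backup: datetime, failure: datetime, rpo: timedelta) -> bool:
    """True when the data at risk stays within the tolerated loss."""
    return data_loss_window(last_backup, failure) <= rpo


# Hypothetical dates: both fall on a Saturday.
last_backup = datetime(2024, 1, 6, 20, 0)   # Saturday evening backup
failure = datetime(2024, 1, 13, 14, 0)      # next Saturday afternoon

loss = data_loss_window(last_backup, failure)
print(loss)
print(meets_rpo(last_backup, failure, rpo=timedelta(weeks=1)))
print(meets_rpo(last_backup, failure, rpo=timedelta(days=1)))
```

If the business can tolerate only a day of lost data, the weekly schedule fails the check, which is the signal to revise the backup process.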

Let’s look at how these elements interact. Figure 5.3 graphically depicts the interplay between MTD, RTO, WRT, and RPO. If your company has mission-critical and vital business processes that do not interact with computer systems of any kind, you still need to perform a BIA in order to understand how these manual systems may be impacted by a business disruption, especially natural disasters. At the end of this chapter, we walk through an example to help illustrate these concepts. Most companies use technology and computer systems to some extent and the graphic in Figure 5.3 shows how the recovery time is impacted by a business disruption.

Point 1: RPO—The maximum sustainable data loss based on backup schedules and data needs.

Point 2: RTO—The duration of time required to bring critical systems back online.

Point 3: WRT—The duration of time needed to recover lost data (based on RPO) and to enter data resulting from work backlogs (manual data generated during system outage that must be entered).

Points 2 and 3: MTD—The duration of the RTO plus the WRT.

Point 4: Test, verify, and resume normal operations.
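The relationship MTD = RTO + WRT from the points above can be expressed directly. The sketch reuses the 3-day example from the definitions, and the function names are hypothetical.

```python
from datetime import timedelta


def maximum_tolerable_downtime(rto: timedelta, wrt: timedelta) -> timedelta:
    """MTD is the systems recovery time plus the work recovery time."""
    return rto + wrt


def work_recovery_budget(mtd: timedelta, rto: timedelta) -> timedelta:
    """Time left for work recovery once systems are back online."""
    if rto > mtd:
        raise ValueError("RTO cannot exceed MTD")
    return mtd - rto


# Example from the text: 1 day to restore systems, 2 days of work recovery.
mtd = maximum_tolerable_downtime(timedelta(days=1), timedelta(days=2))
print(mtd)
print(work_recovery_budget(timedelta(days=3), timedelta(days=1)))
```

Note that RPO does not appear in the sum: it determines how much data must be re-entered during the WRT, not how long the outage may last.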

During normal operations, there is usually some gap between the last backup performed and the current state of the data. In some operations, this may be minutes or hours; in most organizations it is hours or days. This time frame is the RPO. In most organizations, this is the same as the period of time between backups. We see at circle 1 that there is a gap showing the point of the last backup and the state of current data, just before the disruption occurs. That’s the point at which one or more critical systems become unavailable and BC/DR planning activities are initiated. The first phase of the MTD is the RTO. This is the time frame during which systems are assessed, repaired, replaced, and reconfigured. The RTO ends when systems are back online and data are recovered to the last good backup. The second phase of the MTD, the WRT, then begins.

This is the phase when data are recovered through automated and manual data collection processes. There are two elements of WRT. The first is the manual collection and entry of lost data, typically lost because systems went down between backups. The second addresses the backlog of work that may have built up while systems were down. Most companies try to recover the data up to the disruptive event to bring the systems current and then address the backlog, but your business processes may dictate a different recovery order. The key is to understand that there is a delay between the time the systems are back online and the time when normal operations can resume. During the periods indicated by circles 2 and 3, emergency work-arounds and manual processes are being used. These are processes that will be developed later in your BC/DR planning process. For example, if a CRM system is down, what processes will your sales, marketing, and customer sales service teams use to interface with and manage customer service delivery? You’ll define that in the planning process. Circle 4 indicates the transition from DR and BC back to normal operations. There may be some overlap as manual processes are turned back over to automated processes, and you may choose to do it in a rolling fashion, perhaps by department or geographic region.

As you collect your impact data, you’ll also need to begin determining the RTOs. You may choose to create a rating system, so you can quickly determine RTOs. For example, you might determine that mission-critical business systems or functions should have recovery windows as follows:

Category 1: Mission-critical—0-12 hours

Category 2: Vital—13-24 hours

Category 3: Important—1-3 days

Category 4: Minor—more than 3 days
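A rating system like this one can be captured as a lookup from a required recovery window to a category. The sketch below is one possible reading of the bands. The text leaves a gap between 12 and 13 hours, and treating anything over 12 and up to 24 hours as Category 2 is an assumption, as is the function name.

```python
def recovery_category(window_hours: float) -> str:
    """Map a required recovery window (in hours) to the example categories."""
    if window_hours < 0:
        raise ValueError("recovery window cannot be negative")
    if window_hours <= 12:
        return "Category 1: Mission-critical"
    if window_hours <= 24:   # assumption: covers the 12-to-13-hour gap
        return "Category 2: Vital"
    if window_hours <= 72:   # 1-3 days
        return "Category 3: Important"
    return "Category 4: Minor"


print(recovery_category(8))
print(recovery_category(20))
print(recovery_category(48))
print(recovery_category(120))
```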

You and your team, with input from the SMEs, can determine the appropriate MTD requirements. For some companies, a mission-critical business function could have an MTD of a week. For others, it might be 0-2 hours. There is an inverse correlation between the amount of time you can tolerate an outage and the cost of setting up systems that allow you to recover in that time frame. If you can’t afford much downtime, you’ll clearly have to invest more in preventing downtime and in having systems in place that allow fast recovery times. If you’re a small company and can afford a longer MTD, you can spend less on preventing or recovering from outages.

Let’s look at an example. In a small company, you may very well be able to do without even mission-critical systems for a couple of days or a week if you really had to. It’s possible that you contract with an outside IT service provider to maintain, troubleshoot, and repair your computer systems. If you want a guaranteed 2-hour response time, your monthly maintenance costs will be significantly higher than if you sign up for a guaranteed next business day response. So, if you really can’t afford to be without that mission-critical business function for more than about 8 hours (2-hour response time and 6-hour repair time), you’ll have to pay more to your service company and you’ll probably also have to purchase additional computer equipment to provide some redundancy to prevent extended downtime. These costs add up and the less disruption your business can afford, the more it will cost you to prevent or mitigate those risks. We’ll discuss this in more detail in Chapter 6, but it’s within the BIA segment where you have to begin making these kinds of assessments.

Let’s look at another example on the other end of the spectrum. Suppose you manage the centralized IT department for a multihospital, multistate healthcare system, and a serious power surge in the data center causes the hardware cluster running your electronic medical record system to fail. The potential cost to patients, providers, and the hospital system for that system to be down is enormous, so the investment the organization should be willing to make to ensure that data are always available (highly available, highly redundant, and no single point of failure) should also be large. In general, there should be a direct correlation between the criticality of the data and the investment the organization is willing to make to protect it.

It’s important to note during your impact analysis and subsequent mitigation planning phases that there is an optimal recovery point. Figure 5.4 shows the inverse relationship between the cost of disruption and the cost of recovery. Earlier in this book, we discussed the fact that any BC/DR plan had to be tailored to the unique needs and constraints of the organization. This is particularly true when it comes to the financial costs involved with disruption and recovery.

Figure 5.4. Relationship between cost of disruption and cost of recovery.

You can see that the longer you allow a disruption to go on, the more expensive it becomes to the business. Conversely, the longer you have to recover, the less expensive recovery itself becomes. This makes sense when you understand that the longer a business disruption goes on, the more lost revenues, lost sales, and lost customers you accumulate. At the same time, if you need to recover your systems immediately, it’s going to cost more to implement things such as zero downtime solutions and hot sites. If you can afford to take a bit more time to recover you have more options, and these options are typically less expensive. If you start plotting these points, you will find an optimal point between these two costs, shown in Figure 5.4 by point A. Each company’s intersecting points (point A) will be different based on your company’s financial constraints and operating requirements.
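To make the idea of point A concrete, the sketch below finds where two cost curves cross. Both curves are invented for illustration and are not taken from Figure 5.4: disruption cost is assumed to grow linearly with outage length, and recovery cost is assumed to fall in inverse proportion to the time allowed. Real curves would come from the BIA data.

```python
def disruption_cost(hours: float) -> float:
    """Hypothetical: losses accumulate at 5,000 per hour of outage."""
    return 5_000 * hours


def recovery_cost(hours: float) -> float:
    """Hypothetical: faster recovery targets cost disproportionately more."""
    return 200_000 / hours


def crossover_hour(candidates: range) -> int:
    """Return the candidate recovery time where the two costs are closest."""
    return min(candidates, key=lambda h: abs(disruption_cost(h) - recovery_cost(h)))


point_a = crossover_hour(range(1, 73))  # whole-hour targets from 1 to 72
print(point_a, disruption_cost(point_a), round(recovery_cost(point_a)))
```

Changing either assumed curve moves the crossover, which mirrors the statement that each company's point A depends on its own financial constraints and operating requirements.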

Looking Ahead…

Making the Business Case Makes Your Life Easier

During the assessment and implementation of IT systems over the course of the past few years, you may already have addressed (and invested in) some of the elements needed to reduce the time to recover or to reduce the cost of a disruption. If so, be sure to make note of these systems or investments and be sure to include them in your planning. One way to help make the business case for continued investment is to show how the systems already implemented have made an impact or have contributed to your BC/DR plan. For example, suppose you implemented a mirrored site to allow users to gain access to key data more quickly. That mirrored site also serves as a backup and reduces the cost of disruption to a single site. It also reduces the amount of time it takes to recover, thereby pulling point A (Figure 5.4) down and to the left (toward lower cost and less time). This investment, then, has contributed to optimizing your balance between cost of disruption and cost to recover while also improving user productivity. Being able to establish and articulate these kinds of IT benefits within your organization will help you win support for your BC/DR plan and create a solid foundation for future investment decisions.

Next, let’s look at what the entire analysis process looks like, as shown in Figure 5.5. After we explore this, we take a look at the specific data required for inputs and outputs to this process.

Figure 5.5. End-to-end business impact analysis.

In this segment of BC/DR planning, we’re looking at business functions, processes, and IT systems to determine criticality. Business functions can be defined as activities such as sales, marketing, or manufacturing. Business processes can be defined as how those activities occur. How are sales or revenues generated? How are orders processed? How are services delivered? How are employees hired? How is payroll paid? These are business processes; they describe how the functions get done. By first identifying business functions, you then can focus on the key processes in each function to develop a comprehensive view of your company. The third input area, shown in Figure 5.5, is IT systems. In most companies, the business processes are carried out in part through computer systems, applications, and other automated systems. Identifying mission-critical business functions and processes and how they intersect with IT systems will help you map out your BC/DR strategies.

Once you have compiled that data, you’ll perform the analysis to generate the needed outputs, including the criticality assessment, the impact assessments (financial and operational), required recovery objectives, dependencies, and work-around procedures. The work-around procedures will enable you to get critical business functions back up and running as quickly as possible. These work-around procedures may be used during the RTO and WRT periods discussed earlier and shown in Figure 5.3. As you can see, the output is a comprehensive corporate impact analysis. This is the same output shown in Figure 5.2 and is the end of the larger risk assessment phase in our overall BC/DR planning process. The impact analysis will be used as input to the risk mitigation planning segment of the BC/DR project and we’ll discuss that in Chapter 6.

URL: https://www.sciencedirect.com/science/article/pii/B9780124105263000052

Security Component Fundamentals for Assessment

Leighton Johnson, in Security Controls Evaluation, Testing, and Assessment Handbook, 2016

Numbers that Matter – Critical Recovery Numbers

The assessor always needs to keep in mind the numbers that matter to the business objectives and mission when reviewing the CP and COOP documentation, evidence, and testing results. So, what are these numbers?

Maximum tolerable downtime (MTD)

Recovery time objective (RTO)

Recovery point objective (RPO)

The ISCP Coordinator should next analyze the supported mission/business processes and, with the process owners, leadership, and business managers, determine the acceptable downtime if a given process or specific system data were disrupted or otherwise unavailable. Downtime can be identified in several ways.

Maximum Tolerable Downtime (MTD). The MTD represents the total amount of time the system owner/authorizing official is willing to accept for a mission/business process outage or disruption and includes all impact considerations. Determining the MTD is important because an undetermined MTD could leave contingency planners with imprecise direction on (1) selection of an appropriate recovery method, and (2) the depth of detail which will be required when developing recovery procedures, including their scope and content.

Recovery Time Objective (RTO). RTO defines the maximum amount of time that a system resource can remain unavailable before there is an unacceptable impact on other system resources, supported mission/business processes, and the MTD. Determining the information system resource RTO is important for selecting appropriate technologies that are best suited for meeting the MTD. When it is not feasible to immediately meet the RTO and the MTD is inflexible, a Plan of Action and Milestone should be initiated to document the situation and plan for its mitigation.

Recovery Point Objective (RPO). The RPO represents the point in time, prior to a disruption or system outage, to which mission/business process data can be recovered (given the most recent backup copy of the data) after an outage. Unlike RTO, RPO is not considered as part of MTD. Rather, it is a factor of how much data loss the mission/business process can tolerate during the recovery process.

Because the RTO must ensure that the MTD is not exceeded, the RTO must normally be shorter than the MTD. For example, a system outage may prevent a particular process from being completed, and because it takes time to reprocess the data, that additional processing time must be added to the RTO to stay within the time limit established by the MTD.10

Example:

COOP Versus ISCP – The Basic Facts

Recovery times

COOP functions must be sustained within 12 h and for up to 30 days from an alternate site; ISCP RTOs are determined by the system-based BIA.

Information systems that support COOP functions must have an RTO that meets COOP requirements.

Information systems that do not support COOP functions do not require alternate sites as part of the ISCP recovery strategy, but may have an alternate site security control requirement.

URL: https://www.sciencedirect.com/science/article/pii/B9780128023242000117

Security component fundamentals for assessment

Leighton Johnson, in Security Controls Evaluation, Testing, and Assessment Handbook (Second Edition), 2020

C Numbers that matter—critical recovery numbers

The assessor always needs to keep in mind the numbers that matter to the business objectives and mission when reviewing the contingency planning and COOP documentation, evidence, and testing results. So, what are these numbers?

Maximum Tolerable Downtime (MTD)

Recovery Time Objective (RTO)

Recovery Point Objective (RPO)

“The ISCP Coordinator should next analyze the supported mission/business processes and with the process owners, leadership and business managers determine the acceptable downtime if a given process or specific system data were disrupted or otherwise unavailable. Downtime can be identified in several ways.

Maximum Tolerable Downtime (MTD). The MTD represents the total amount of time the system owner/authorizing official is willing to accept for a mission/business process outage or disruption and includes all impact considerations. Determining MTD is important because it could leave contingency planners with imprecise direction on (1) selection of an appropriate recovery method, and (2) the depth of detail which will be required when developing recovery procedures, including their scope and content.

Recovery Time Objective (RTO). RTO defines the maximum amount of time that a system resource can remain unavailable before there is an unacceptable impact on other system resources, supported mission/business processes, and the MTD. Determining the information system resource RTO is important for selecting appropriate technologies that are best suited for meeting the MTD. When it is not feasible to immediately meet the RTO and the MTD is inflexible, a Plan of Action and Milestone should be initiated to document the situation and plan for its mitigation.

Recovery Point Objective (RPO). The RPO represents the point in time, prior to a disruption or system outage, to which mission/business process data can be recovered (given the most recent backup copy of the data) after an outage. Unlike RTO, RPO is not considered as part of MTD. Rather, it is a factor of how much data loss the mission/business process can tolerate during the recovery process.

Because the RTO must ensure that the MTD is not exceeded, the RTO must normally be shorter than the MTD. For example, a system outage may prevent a particular process from being completed, and because it takes time to reprocess the data, that additional processing time must be added to the RTO to stay within the time limit established by the MTD.”10

URL: https://www.sciencedirect.com/science/article/pii/B9780128184271000112

Preparing the Business Impact Analysis

Laura P. Taylor, in FISMA Compliance Handbook, 2013

Terminology

When it comes to system outages, there are different ways to represent downtime. NIST SP 800-34, Revision 1, Contingency Planning Guide for Federal Information Systems, recognizes three key terms that help organizations plan for outages.

Maximum Tolerable Downtime (MTD) represents the total amount of time the system owner/authorizing official is willing to accept for a mission/business process outage or disruption and includes all impact considerations.

Recovery Time Objective (RTO) represents the maximum amount of time that a system resource can remain unavailable before there is an unacceptable impact on other system resources, supported mission/business processes, and the MTD.

Last, Recovery Point Objective (RPO) represents the point in time, prior to a disruption or system outage, to which mission/business process data can be recovered (given the most recent backup copy of the data) after an outage.

RTO must ensure that the MTD is never exceeded and is therefore a shorter period of time than the MTD. A system with a short RTO likely has in place expensive recovery solutions such as highly available servers and network devices. RPO represents how much data loss the organization can tolerate from any given outage. You’ll want to come up with estimates for each of these downtime and recovery objectives in your BIA.
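As a rough illustration of how these estimates relate, the sketch below records MTD, RTO, and RPO for one system and checks that the RTO is shorter than the MTD. The class name, field names, and sample figures are assumptions made for this example; they do not come from NIST SP 800-34.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Hypothetical record of BIA estimates for a single system."""
    system: str
    mtd: timedelta  # maximum tolerable downtime
    rto: timedelta  # recovery time objective
    rpo: timedelta  # recovery point objective (tolerable data loss)

    def problems(self) -> list[str]:
        """Return any inconsistencies; an empty list means the estimates fit together."""
        issues = []
        if self.rto >= self.mtd:
            issues.append("RTO must be shorter than MTD")
        if self.rpo < timedelta(0):
            issues.append("RPO cannot be negative")
        return issues


# Illustrative figures only, not taken from the text.
payroll = RecoveryObjectives(
    "payroll",
    mtd=timedelta(hours=48),
    rto=timedelta(hours=24),
    rpo=timedelta(hours=4),
)
print(payroll.problems())  # an empty list means no inconsistencies were found
```

A check like this could be run over every system listed in the BIA to flag estimates that contradict one another before they feed into the choice of recovery solution.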

URL: https://www.sciencedirect.com/science/article/pii/B9780124058712000130

Domain 7

Eric Conrad, ... Joshua Feldman, in Eleventh Hour CISSP® (Third Edition), 2017

Conduct BIA

BIA is the formal method for determining how a disruption to the IT system(s) of an organization will impact the organization's requirements, processes, and interdependencies with respect to the business mission.2 It is an analysis to identify and prioritize critical IT systems and components. It enables the BCP/DRP project manager to fully characterize the IT contingency requirements and priorities.2 The objective is to correlate the IT system components with the critical service it supports. It also aims to quantify the consequence of a disruption to the system component and how that will affect the organization. The primary goal of the BIA is to determine the Maximum Tolerable Downtime (MTD) for a specific IT asset. This will directly impact what disaster recovery solution is chosen.

Identify critical assets

The critical asset list is a list of those IT assets that are deemed business-essential by the organization. These systems' DRP/BCP must have the best available recovery capabilities assigned to them.

Conduct BCP/DRP-focused risk assessment

The BCP/DRP-focused risk assessment determines what risks are inherent to which IT assets. A vulnerability analysis is also conducted for each IT system and major application. This is done because most traditional BCP/DRP evaluations focus on physical security threats, both natural and human.

Determine MTD

The primary goal of the BIA is to determine the MTD, which describes the total time a system can be inoperable before an organization is severely impacted. MTD comprises two metrics: the Recovery Time Objective (RTO) and the Work Recovery Time (WRT) (see later).

Alternate terms for MTD

Depending on the business continuity framework that is used, other terms may be substituted for MTD. These include Maximum Allowable Downtime, Maximum Tolerable Outage, and Maximum Acceptable Outage.

Failure and recovery metrics

A number of metrics are used to quantify how frequently systems fail, how long a system may exist in a failed state, and the maximum time to recover from failure. These metrics include the Recovery Point Objective (RPO), RTO, WRT, Mean Time Between Failures (MTBF), Mean Time to Repair (MTTR), and Minimum Operating Requirements (MOR).

Recovery point objective

The RPO is the amount of data loss or system inaccessibility (measured in time) that an organization can withstand. “If you perform weekly backups, someone made a decision that your company could tolerate the loss of a week's worth of data. If backups are performed on Saturday evenings and a system fails on Saturday afternoon, you have lost the entire week's worth of data. This is the RPO. In this case, the RPO is 1 week.”3

The RPO represents the maximum acceptable amount of data/work loss for a given process because of a disaster or disruptive event.

Recovery time objective and work recovery time

The RTO describes the maximum time allowed to recover business or IT systems. RTO is also called the systems recovery time. This is one part of MTD; once the system is physically running, it must be configured.

Crunch Time

WRT describes the time required to configure a recovered system. “Downtime consists of two elements, the systems recovery time and the WRT. Therefore, MTD = RTO + WRT.”3
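The relationship MTD = RTO + WRT can be sketched in a few lines of Python. The function name and the figures below (a 36-hour RTO and a 12-hour WRT) are hypothetical, chosen only for illustration.

```python
from datetime import timedelta


def maximum_tolerable_downtime(rto: timedelta, wrt: timedelta) -> timedelta:
    """Return MTD as the sum of the systems recovery time (RTO) and the WRT."""
    if rto < timedelta(0) or wrt < timedelta(0):
        raise ValueError("RTO and WRT must be non-negative durations")
    return rto + wrt


# Hypothetical figures for illustration only.
rto = timedelta(hours=36)  # time to get the system physically running
wrt = timedelta(hours=12)  # time to configure the recovered system
mtd = maximum_tolerable_downtime(rto, wrt)
print(mtd.total_seconds() / 3600)  # MTD expressed in hours
```

Read the other way, if the MTD is fixed by the business, any time reserved for WRT has to come out of the time available for systems recovery.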

Mean time between failures

MTBF quantifies how long a new or repaired system will run before failing. It is typically generated by a component vendor and is largely applicable to hardware as opposed to applications and software.

Mean time to repair

The MTTR describes how long it will take to recover a specific failed system. It is the best estimate for reconstituting the IT system so that business continuity may occur.

Minimum operating requirements

MOR describe the minimum environmental and connectivity requirements needed to operate computer equipment. It is important to determine and document the MOR for each IT-critical asset so that, in the event of a disruption or disaster, analysis can be conducted quickly to determine whether the IT assets will be able to function in the emergency environment.

URL: https://www.sciencedirect.com/science/article/pii/B9780128112489000073

Business Continuity/Disaster Recovery Plan Development

Susan Snedaker, Chris Rima, in Business Continuity and Disaster Recovery Planning for IT Professionals (Second Edition), 2014

Selection criteria

Selection criteria are the factors you develop to help you determine how to select the best alternate site solution for your company. This includes cost, technical and functional requirements, timelines, quality, availability, location, and more. Be sure to consider connectivity and communications requirements in this section along with your recovery requirements such as MTD. Remember that you need to find the best balance between risk and mitigation—so your selection criteria should not be so rigorous as to exclude all but the most expensive, iron-clad options. You may choose to use prioritization in your selection criteria language. For example, availability and technical requirements might be your first priority; location might be second or even third. By prioritizing, you can ensure you don’t box yourself into a solution that is overengineered (or under-engineered) for your needs.

URL: https://www.sciencedirect.com/science/article/pii/B9780124105263000076

Emergency Response and Recovery

Susan Snedaker, Chris Rima, in Business Continuity and Disaster Recovery Planning for IT Professionals (Second Edition), 2014

Summary

In this chapter, you learned about emergency plans and emergency responses that should be included in your BC/DR plan. Emergency response is the initial response to a disaster or disruption. The first response should be to get people out of harm’s way and to determine if there are fatalities or injuries. Secondary efforts should be to stop the source of the problem whether that’s through calling civil emergency responders (fire, bomb squad, police) or through attempting to address the problem with an ERT (fighting a fire, turning off gas or electric sources, containing hazardous spills, etc.). Emergency responders should be trained in appropriate skills such as safe building evacuation methods, CPR and first aid, firefighting, hazardous material containment, and others. Emergency plans should be well conceived and well rehearsed because people will fall back on their training in an emergency.

The CMT may activate the emergency response or the emergency responders may notify the CMT of an event. In any case, the CMT coordinates emergency efforts and activates the BC/DR plan based on the specifics of the situation. The CMT is also responsible for coordinating recovery efforts and should manage these activities through the business continuity stage. Roles and responsibilities should be well defined to avoid confusion or working at cross-purposes. Activities the CMT typically manages can include the emergency and disaster response, activating alternate work sites and facilities, managing corporate communications, interfacing with insurance and legal representatives, and working with the finance department. You can define other appropriate activities for your CMT to reflect the specifics of your business.

Because disasters are by their very nature chaotic events, it helps to have checklists you and your team can use to manage activities in the aftermath of a major disaster or disruption. We’ve included several checklists in the Appendix of this book, so you can easily refer to them and use them in your planning activities. DR tasks fall into two major categories: activation and recovery. Activation includes all activities related to assessing a situation and determining what recovery plans should be implemented as well as taking initial steps toward that end.

Within DR, there are specific IT recovery tasks that should be performed as well. Separate IT recovery checklists should be created so that you have a clear plan about how to recover from various events. These checklists should include information regarding the MTD and other recovery metrics that have been established. The lists also should include timelines, milestones, and dependencies that need to be addressed. Some companies form CIRTs or computer emergency response teams (CERTs) to respond quickly and effectively to computer-based incidents. The activities of the CIRT occur in the day-to-day operations of the company (outside the BC/DR domain) and are also part of BC/DR activities. Defining how the CIRT should operate and interact with your BC/DR plan is vital to ensure an effective response.

Business continuity activities begin after recovery efforts have concluded, though there is usually some overlap. Business continuity activities include the limited resumption of business operations, typically in manual or work-around mode. These activities pose a unique set of challenges from an IT and operations perspective because data must be managed differently until IT systems are fully back online and normal operations can resume. The business continuity checklist should include steps needed to resume limited operations; it should identify requirements and dependencies; and it should include timelines, milestones, and checkpoints. The resumption of normal business operations typically occurs when the company either reoccupies its original facility and all equipment is back up and running or when the company decides on a permanent business location (which may be the alternate site or newly acquired site). Criteria for determining the cutover to “normal operations” should be developed and the CMT should hand over operations to the management team toward the end of the business continuity phase. Clearly defining this cutover as well as roles and responsibilities will help prevent confusion during this last phase of activity.

Key concepts

Emergency management overview

Emergencies are chaotic events that require a coordinated response.

Lack of a coordinated response after Hurricane Katrina exacerbated the problems.

Contact emergency responders first but understand what their priorities will be in the aftermath of a serious event.

Companies should be prepared to be somewhat self-sufficient in the immediate aftermath of an event.

Emergency response plans

Emergency response plans deal with protecting people first, property second.

Emergency responses should attempt to contain, control, or end the emergency. This includes evacuating buildings, fighting fires, turning off utilities, and other response activities.

ERTs should have the skills required to address the specific needs of your company’s operations.

Training is imperative for ERT members. Training should be refreshed and tested periodically.

Training for ERT members may include firefighting, CPR, first aid, hazardous material containment, and other skills appropriate to the location and nature of your business.

Emergency response checklists help keep people calm and focused on next steps. Develop emergency response checklists in conjunction with expertise from your ERT and local civil emergency responders (fire, police, hazmat, bomb squad, etc.).

Crisis management team

The CMT may activate the emergency response or it may be activated by the ERT.

The CMT manages, directs, and oversees the DR efforts.

CMT responsibilities include emergency and disaster response as well as coordinating efforts related to alternate facilities and work sites, communications, human resources, insurance, legal, and finance.

CMT roles and responsibilities should be clearly delineated.

MTD and other recovery metrics should be well understood by the CMT and addressed by recovery plans.

Disaster recovery

Activation checklists can be used to determine if, how, and when to activate the BC/DR plan. In some cases, activation of part of the plan may be warranted.

Clear activation checklists help responders understand what steps to take and help them make better decisions in the confusion that surrounds major disasters or disruptions.

DR checklists should include MTD and other recovery metrics, so the CMT can make decisions appropriate to these requirements.

DR checklists should address the safety and well-being of personnel first, then address physical facilities, buildings, equipment, and other business assets.

IT recovery

Clear and concise service levels and step-by-step application recovery procedures expedite recovery operations according to business requirements.

Review your IT recovery checklists to determine what pre-IT recovery information must be readily accessible and part of the BC/DR plan, prior to actual IT recovery procedures.

Every company should have a CIRT that responds to incidents related to IT equipment.

Incidents may be unusual activity, intentional or unintentional breaches, hardware failures, and so on.

CIRT activities are both day-to-day and part of BC/DR activities.

The responsibilities of the CIRT include monitoring, alerting, mobilizing, assessing, stabilizing, resolving, and reviewing all IT-related incidents (incidents as defined by the team).

CIRT members' skills should be kept up to date so that they are aware of and can respond to the latest threats, vulnerabilities, and issues in the IT realm.

Business continuity

Business continuity activities typically involve the resumption of limited business operations.

These activities typically involve manual and work-around systems, while equipment and IT systems are being fully restored.

The decision to move to a permanent facility, whether returning to the original location, staying at the alternate site, or acquiring a new location, typically triggers the final stage of business continuity and signals the resumption of normal operations.

Business continuity checklists should be used to ensure that required systems are in place and functional. Checklists should also contain references to timelines, milestones, dependencies, and other business metrics.

Once business continuity activities end and normal business resumes, the BC/DR teams should review lessons learned so they can be incorporated into the BC/DR plan.

URL: https://www.sciencedirect.com/science/article/pii/B9780124105263000088

What is the maximum tolerable time up to which one can withstand loss of data?

Your Recovery Point Objective (RPO) determines the point in time to which you will recover. This is defined by the maximum acceptable amount of data loss measured in time. For example, having a maximum tolerable data loss of 20 minutes will set your RPO to 20 minutes.

What is the maximum tolerable downtime?

Definition(s): The amount of time a mission/business process can be disrupted without causing significant harm to the organization's mission.

What is an acceptable amount of data loss?

Any significant packet loss on a voice call can create a significant distraction and the same goes for video. Thus, with voice and video calls, 3-5% packet loss could be considered “acceptable”.

What defines the maximum time period an organization is willing to lose data during a major IT outage event?

RPO is the maximum acceptable time between backups. If data backups are performed every 6 hours, and a disaster strikes 1 hour after the backup, you will lose only one hour of data.
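The backup arithmetic above can be sketched as follows. The function names and timestamps are hypothetical; the sketch assumes backups run at a fixed interval and that everything written since the last completed backup is lost.

```python
from datetime import datetime, timedelta


def data_lost(last_backup: datetime, failure: datetime) -> timedelta:
    """Data lost is everything written between the last backup and the failure."""
    if failure < last_backup:
        raise ValueError("failure cannot precede the last backup")
    return failure - last_backup


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """In the worst case a failure hits just before the next backup,
    so the potential loss equals the full backup interval."""
    return backup_interval <= rpo


# Illustrative timestamps: disaster strikes one hour after the backup.
last_backup = datetime(2024, 1, 1, 6, 0)
failure = last_backup + timedelta(hours=1)
print(data_lost(last_backup, failure))  # 1:00:00
print(meets_rpo(timedelta(hours=6), rpo=timedelta(minutes=20)))  # False
```

The second call shows why the backup interval, rather than the loss in any one incident, is what has to be compared against the RPO: a 6-hour schedule can lose up to 6 hours of data, which would not satisfy a 20-minute RPO.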