Which one of the following measures the average amount of time that it takes to repair a system?

Reliability Theory

Harry F. Martz, in Encyclopedia of Physical Science and Technology (Third Edition), 2003

I.C.2 Mean Time to Failure

The mean time to failure (MTTF) is the expected (or average) time for which the device performs successfully and is given by

(3) E(T) = ∫_0^∞ t f(t) dt.

If lim_{t→∞} t R(t) = 0, then E(T) alternatively becomes

(4) E(T) = ∫_0^∞ R(t) dt.

If the device is renewed through maintenance and repair, E(T) is also known as the mean (operating) time between failures (MTBF).
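As a quick numeric check of Eqs. (3) and (4), here is a minimal sketch (not from the chapter) that integrates both forms for an assumed exponential failure density f(t) = λe^{−λt}, for which the closed-form answer is MTTF = 1/λ:

```python
import math

lam = 1e-4  # assumed constant failure rate (failures per hour), for illustration only

f = lambda t: lam * math.exp(-lam * t)   # failure density f(t)
R = lambda t: math.exp(-lam * t)         # reliability R(t) = 1 - F(t)

def integrate(g, upper, steps=200_000):
    """Trapezoidal approximation of the integral of g over [0, upper]."""
    h = upper / steps
    return h * (0.5 * g(0.0) + sum(g(i * h) for i in range(1, steps)) + 0.5 * g(upper))

horizon = 20 / lam  # far enough out that the neglected tail is negligible

mttf_eq3 = integrate(lambda t: t * f(t), horizon)  # Eq. (3): E(T) = integral of t*f(t)
mttf_eq4 = integrate(R, horizon)                   # Eq. (4): E(T) = integral of R(t)

print(mttf_eq3, mttf_eq4, 1 / lam)  # all three come out at approximately 10,000 hours
```

With the assumed rate of one failure per 10,000 hours, both integral forms agree with the closed-form value.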

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B0122274105006591

Video Management Systems

Vlado Damjanovski, in CCTV (Third Edition), 2014

Hard disk failures

Since a hard disk drive is an electromechanical device, wear and tear will cause it to eventually fail.

It is not a question of whether a drive will fail; the question is when it will fail.

When analyzing hard disk life expectancy, there are three common parameters that manufacturers provide.

The Annualized Failure Rate (AFR) is the percentage of a given population of hard disks that fails, extrapolated to one year based on expectation values.

The Mean Time Between Failures (MTBF) specifies the expected operating time between two consecutive failures of a device type in hours (definition according to IEC 60050 (191)). The MTBF considers the life cycle of a device that fails repeatedly, then is repaired and returned to service again. Because repair of hard drives rarely happens, we don’t really have mean time between failures. We only have mean time to a failure, after which the drive is discarded. Therefore, Mean Time To Failure (MTTF) is used, which specifies the expected operating time until the failure of a device type in hours (definition according to IEC 60050 (191)).

The acronyms MTBF and MTTF are often used synonymously for drives. Some manufacturers, for example, estimate the MTBF as the number of operating hours per year divided by the projected failure rate. This view is based on failure without repair. As such, MTTF would be the more practical parameter, but it still yields an unrealistically high number of hours of life expectancy, something that is not the case with the standard electronic definition of MTBF.

An extreme case of a broken hard drive

For example, typical hard disk MTBF numbers range between 300,000 and 1,000,000 hours. These are quite high numbers, equivalent to 34 to 114 years. They are far too high, and practical experience shows that the realistic lifetime of a drive is more likely one tenth of that, typically 3~5 years. In addition, technological progress, new standards, and increased capacity do not allow hard drives to be used for more than a few years. Moore's law can easily be applied to hard disk capacity too: capacities double almost every year or two.

The stated high MTBF/MTTF numbers are a result of a specific definition used by hard disk manufacturers, which refers to testing a number of drives over a selected test period (in hours) and counting the number of drives that failed in that period. This can be written as the following formula:

(47) MTTF = (test period × number of drives) / number of failed drives

For example, if testing has been conducted over one month (720 hrs), and out of 1,000 drives, two have failed, the MTTF will be:

MTTF = 720 × 1,000 / 2 = 360,000 hrs.

In reality, this does not mean that a drive is expected to fail at around 360,000 hrs of operation. The better interpretation of our example is that out of 1,000 drives tested for one month, two have failed, which is equivalent to one drive out of 500 failing each month (1,000 / 2). This is 0.2% per month. Over a period of one year, as the annualized failure rate (AFR) defines it, this is equivalent to a 2.4% probability of a drive failing. So, following our example, in a system with 1,000 drives with an MTTF of 360,000 hrs, there will statistically be 24 failed hard disk drives per year. If the MTTF were 1,000,000 hrs, which is typically quoted for enterprise drives, the same statistics would mean roughly 9 failed drives among 1,000 over a period of 1 year.
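The arithmetic above can be reproduced in a few lines; the following sketch (using the example figures from the text) computes the MTTF of Eq. (47) and the linear extrapolation to an annual failure rate:

```python
# Example figures from the text: 1,000 drives tested for one month (720 hrs), 2 failures.
test_hours = 720
drives = 1_000
failures = 2

mttf = test_hours * drives / failures  # Eq. (47): 360,000 hours

monthly_rate = failures / drives       # 0.2% of drives fail per month
annual_rate = monthly_rate * 12        # ~2.4% per year (linear extrapolation, i.e. the AFR)

# -> 360,000 hrs, 0.2% per month, ~2.4% per year, ~24 failed drives per 1,000 per year
print(mttf, monthly_rate, annual_rate, annual_rate * drives)
```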

This same calculation can be generalized using the common exponential model of distribution:

(48) Failure Probability = F(t) = 1 − R(t) = 1 − e^{−t/M} = 1 − e^{−λt}

where e is the base of the natural logarithm (e ≈ 2.718), t is the time for which this probability is calculated, M is the MTBF, and λ = 1/M is the failure rate.

So, if we do the same calculation for the previous example, for a drive with 360,000 hour MTBF drive (M = 360,000), we could calculate the failure probability for 1 year (t = 8,760 hrs) to be:

F(t) = 1 − e^{−8,760/360,000} = 0.024 = 2.4%.
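A minimal sketch of Eq. (48), using the MTBF values discussed in this section:

```python
import math

def failure_probability(t_hours, mtbf_hours):
    """Failure probability over t hours under the exponential model of Eq. (48)."""
    return 1 - math.exp(-t_hours / mtbf_hours)

YEAR = 8_760  # hours in one year

print(failure_probability(YEAR, 360_000))    # ~0.024, i.e. 2.4% per year
print(failure_probability(YEAR, 1_000_000))  # ~0.009, i.e. roughly 9 failures per 1,000 drives per year
```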

Clearly, the above numbers are statistical only and depend very much on the environment in which the tests were conducted. Most notable are certainly temperature, humidity, mechanical shocks, and static electricity during handling and installation. It is quite understandable that manufacturers try to conduct such tests in conditions as close to ideal as possible. This means that, in practice, we can only expect higher failure numbers than the above statistical calculation drawn from manufacturers' tests.

An interesting study was conducted by Carnegie Mellon University, which confirmed the empirical knowledge from mass-produced consumer electronics: if a new product does not fail shortly after its initial usage, it will serve its purpose until approximately its MTBF time.

This study evaluated data from about 100,000 hard drives used in several large-scale systems and found a large deviation from the manufacturers' information. The average failure rate of all hard drives was six times higher for systems with an operating time of less than three years, and even 30 times higher for systems with an operating time of 5–8 years. These statistics led to the conclusion that in the early "infancy" period there is a much higher rate of failure; the rate then settles down for a longer period, which is the expected useful working life. After that, drives start failing due to age, wear, and tear, which coincides with the practical experience of around five years (60 months) before an increased failure rate is experienced.

The Carnegie Mellon bath-tub curve

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780124045576500094

Integrated Dependability and Security Evaluation Using Game Theory and Markov Models

Bjarne E. Helvik, ... Svein J. Knapskog, in Information Assurance, 2008

8.2.5 Model Parametrization

In order to obtain measures (MTFF, MTTF), the stochastic model has to be parametrized (i.e., the elements qij ∈Q need to be evaluated). The procedure of obtaining accidental failure and repair rates has been practiced for many years in traditional dependability analysis, and will therefore not be discussed in this chapter. However, choosing the accumulated attack intensities λij(a) remains a challenge. One solution is to let security experts assess the intensities based on subjective expert opinion, empirical data, or a combination of both. An example of empirical data is historical attack data collected from honeypots. The data can also be based on intrusion experiments performed by students in a controlled environment. Empirical data from such an experiment conducted at Chalmers University of Technology in Sweden [26] indicates that the time between successful intrusions during the standard attack phase is exponentially distributed. Another ongoing project at the Carnegie Mellon CyLab in Pittsburgh, PA [27] aims to collect information from a number of different sources in order to predict attacks. Even though the process of assessing the attack intensities is crucial, and an important research topic in itself, it is not the primary focus of this chapter.

Obtaining realistic πi(a) (i.e., the probabilities that an attacker chooses particular attack actions in certain system states) may be more difficult. In this chapter, we use game theory as a means for computing the expected attacker behavior. The procedure is summarized in Section 8.3.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780123735669500100

Internet of Things—robustness and reliability

S. Sarkar, in Internet of Things, 2016

11.1 Introduction

Building a reliable computing system has always been an important requirement for the business and the scientific community. By the term reliability, we mean how long a system can operate without any failure. Along with reliability, there is another closely related quality attribute, called availability. Informally, availability is the percentage of time that a system is operational to the user. An internet of things (IoT) system deploys a massive number of network aware devices in a dynamic, error-prone, and unpredictable environment, and is expected to run for a long time without failure. To commission such a system and to keep it operational, it is therefore essential that the system is designed to be reliable and available. Let us understand these two attributes in detail.

Since the exact time of a failure of any operational system is not known a priori, it is appropriate to model the time for a system to fail as a (continuous) random variable. Let f(t) be the failure probability density function, which denotes the instantaneous likelihood that the system fails at time t. Next, we would like to know the probability that the system will fail within a time t, denoted by F(t). Let T be the time for the system to fail. The function F(t) = Pr{T ≤ t} = ∫_0^t f(u) du, also known as the failure function, is the cumulative probability distribution of T. Given this distribution function, we can predict the probability of the system failing within a time interval (a, b] to be Pr{a < T ≤ b} = ∫_a^b f(t) dt = F(b) − F(a). The reliability of a system, R(t), can be formally defined as the probability that the system will not fail up to time t. It is expressed as R(t) = Pr{T > t} = 1 − F(t).

The mean time to failure (MTTF) for the system is the expected value E[T] of the failure density function, E[T] = ∫_0^∞ t f(t) dt, which (using f(t) = −R′(t) and integration by parts) can be rewritten as −∫_0^∞ t R′(t) dt = −[t R(t)]_0^∞ + ∫_0^∞ R(t) dt.

When t approaches ∞, it can be shown that t R(t) tends to 0. Therefore, MTTF, which intuitively is the long-run average time to failure, is expressed as MTTF = ∫_0^∞ R(τ) dτ.

With this MTTF value, availability A can be computed as A = MTTF / (MTTF + MTTR), where MTTR denotes the average time the system takes to become operational again after a failure. Thus, the definition of availability also takes reliability into account.
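As an illustrative sketch of this relationship (the MTTF and MTTR values below are assumptions chosen for demonstration, not figures from the chapter), a modest-MTTF system with fast repair can reach the same availability as a high-MTTF system with slow repair:

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Two assumed, illustrative designs:
print(availability(mttf_hours=10_000, mttr_hours=1.0))  # very reliable, slow repair  -> ~0.9999
print(availability(mttf_hours=1_000, mttr_hours=0.1))   # less reliable, fast repair  -> ~0.9999
```

Both assumed designs land at roughly 99.99% availability, which is exactly the trade-off exploited by the recovery-oriented computing approach discussed below.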

Availability has been one of the most important quality attributes to measure the extent of uninterrupted service that a distributed and more recently a cloud-based system provides. It has also been an important metric to define the service level agreement (SLA) between the service provider (a SaaS or an IaaS provider) and the service consumer.

From the definition, it is obvious that a system that is highly reliable (high MTTF) will tend to be highly available as well. However, the mean time to recover, or MTTR, provides an alternative means of achieving high availability. One can design a highly available system even with components having relatively poor reliability (not very large MTTF), provided that the system takes very little time to recover when it fails. Although the hardware industry has always strived to make the infrastructure reliable (i.e., increase MTTF), today it has possibly reached its limit. Increasing MTTF beyond a certain point is extremely costly, and sometimes impossible. In view of this, it becomes quite relevant to design a system equipped with faster recovery mechanisms. This observation has led to the emergence of the recovery oriented computing (ROC) [20] paradigm, which is now considered to be a more cost-effective approach to ensure service continuity for distributed and cloud-based systems. The fundamental principle of ROC is to make MTTR as small as possible. For an IoT-based system, the participating components can have high failure probabilities. In order to ensure that an IoT system always remains operational, ROC becomes an attractive and feasible approach.

Along with reliability and availability, the term serviceability coined by IBM (https://en.wikipedia.org/wiki/Serviceability_(computer)) is frequently used to indicate the ease with which the deployed system can be repaired and brought back to operation. Thus, serviceability implies reduction of the MTTR using various failure-prevention methods (prediction, preventive maintenance), failure detection techniques (through monitoring), and failure handling approaches (by masking the impact of an error, recovery). The goal of serviceability is obviously to have a zero repair time, thereby achieving a near 100% availability.

In the remainder of this chapter we will discuss suitable serviceability techniques to improve the reliability and availability of IoT systems.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128053959000113

Introduction

In Architecture Design for Soft Errors, 2008

1.9.2 SDC and DUE Budgets

Typically, silicon chip vendors have market-specific SDC and DUE budgets that they require their chips to meet. This is similar in some ways to a chip's power budget or performance target. The key point to note is that chip operation is not error free. The soft error budgets for a chip would be set sufficiently low for a target market such that the SDC and DUE from alpha particles and neutrons would be a small fraction of the overall error rate. For example, companies could set an overall target (for both soft and other errors) of 1000 years MTTF or 114 FIT for SDC and 25 years MTTF or about 4500 FIT for DUE for their systems [16]. The SDC and DUE due to alpha particle and neutron strikes are supposed to be only a small fraction of this overall budget.
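The FIT figures quoted above follow from the usual conversion between MTTF and FIT (failures per 10^9 device-hours); a short sketch of that conversion, with the hours-per-year constant as the only assumption:

```python
HOURS_PER_YEAR = 8_760

def mttf_years_to_fit(mttf_years):
    """FIT = failures per 10^9 device-hours, assuming a constant failure rate."""
    return 1e9 / (mttf_years * HOURS_PER_YEAR)

print(mttf_years_to_fit(1_000))  # ~114 FIT  (the SDC target above)
print(mttf_years_to_fit(25))     # ~4566 FIT (the "about 4500 FIT" DUE target)
```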

Table 1.1 shows examples of SDC and DUE tolerance in sample application servers. For example, databases often have error recovery mechanisms (via their logs) and can often tolerate and recover from detected errors (see Log-Based Backward Error Recovery in Database Systems, p. 317, Chapter 8). But they are often not equipped to recover from an SDC event due to a particle strike. In contrast, in the desktop market, software bugs and device driver crashes often account for a majority of the errors. Hence, processors and chipsets in such systems can tolerate more errors due to particle strikes and may not need as aggressive a protection mechanism as those used in mission-critical systems, such as airplanes. Mission-critical systems, on the other hand, must have extremely low SDC and DUE because people's lives may be at stake.

TABLE 1.1. SDC and DUE Tolerance in Different Application Segments

Application segment | Data integrity requirement | Availability requirement
Mission-critical applications | Extremely low SDC | Extremely low DUE
Web-server applications | Moderate SDC tolerated | Low DUE
Back-end databases | Very low SDC | Moderate DUE tolerated
Desktop applications | Higher SDC tolerated | Higher DUE tolerated

EXAMPLE

A system is to be composed of a number of silicon chips, each with an SDC MTTF of 1000 years and DUE MTTF of 10 years (both from soft errors only). The system MTTF budgets are 100 years for SDC and 5 years for DUE. What is the maximum number of chips that can fit into the overall soft error budget?

SOLUTION 10 chips can fit under the SDC budget (= 1000/100) and two chips under the DUE budget (= 10/5). Hence, the maximum number of chips that can be accommodated is two chips.

EXAMPLE

If the effective FIT rate of a latch is 0.1 milliFIT, then how many latches can be accommodated in a microprocessor with a latch SDC budget of 10 FIT?

SOLUTION The total number of latches that can be accommodated is 100,000 (= 10/0.0001). The Fujitsu SPARC64 V processor (announced in 2003) had 200,000 latches [2]. Modern microprocessors can have as many as 10 times the number of latches in the Fujitsu SPARC64 V. Consequently, it becomes critical to protect these latches to allow the processor to meet its SDC budget.
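Both worked examples reduce to dividing a per-component error rate into the available budget; the sketch below (illustrative only) reproduces the arithmetic:

```python
# Example 1: how many chips fit the system soft-error budget (MTTFs in years)?
chip_sdc_mttf, chip_due_mttf = 1_000, 10
system_sdc_mttf, system_due_mttf = 100, 5

max_chips = min(chip_sdc_mttf // system_sdc_mttf,  # 10 chips allowed by the SDC budget
                chip_due_mttf // system_due_mttf)  # 2 chips allowed by the DUE budget
print(max_chips)  # -> 2 (the DUE budget is the binding constraint)

# Example 2: how many latches fit a 10 FIT latch SDC budget at 0.1 milliFIT per latch?
latch_fit = 0.1 / 1_000   # 0.1 milliFIT expressed in FIT
budget_fit = 10
print(round(budget_fit / latch_fit))  # -> 100,000 latches
```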

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780123695291500033

Literature review

Wenhao Li, ... Dong Yuan, in Reliability Assurance of Big Data in the Cloud, 2015

2.1.1.2 Disk reliability metrics

In general, there are two metrics that are widely used for describing permanent disk failure rates: the mean time to failure (MTTF) and the annualized failure rate (AFR). MTTF is the length of time that a device or other product is expected to last in operation. It indicates how long the disk can reasonably be expected to work. In industry, the MTTF of disks is obtained by running many, or even many thousands of, units for a specific number of hours and checking the number of disks that failed permanently. Instead of using MTTF for describing disk reliability, some hard drive manufacturers now use AFR [34]. AFR is the estimated probability that the disk will fail during a full year of use. Essentially, AFR can be seen as another form of MTTF expressed in years, which can be obtained according to equation (2.1) [35]:

(2.1) AFR = 1 − exp(−8760 / MTTF)

where 8760 converts the time unit from hours to years (1 year = 8760 hours). The advantage of using AFR as the disk reliability metric is that it is more intuitive and easier for non-specialists to understand. For example, for a disk with an MTTF of 300,000 hours, the AFR is 2.88% per year, that is, there is a 2.88% probability that the disk will fail during one year of use.
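Equation (2.1) is straightforward to evaluate; a short sketch using the example MTTF values from this section:

```python
import math

HOURS_PER_YEAR = 8_760

def afr(mttf_hours):
    """Annualized failure rate from MTTF, per equation (2.1)."""
    return 1 - math.exp(-HOURS_PER_YEAR / mttf_hours)

print(afr(300_000))    # ~0.0288 -> 2.88% per year
print(afr(1_000_000))  # ~0.0087 -> roughly the "at most 0.88%" figure cited below
```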

However, in practice, the AFR value is sometimes not consistent with the MTTF value specified in the disks' datasheets [3,18]. Because of a variety of factors, such as working temperature, workload, and so forth, actual disk drive reliability may differ from the manufacturer's specification and vary from user to user [18]. MTTF and AFR values of disks were comprehensively investigated using records and logs collected from several large production systems for every disk that was replaced in the system [3]. According to these records and logs, the AFR of disks typically exceeds 1%, with 2–4% as the norm, and values of more than 10% are sometimes observed. The datasheet MTTF of those disks, however, ranges from 1,000,000 to 1,500,000 hours (i.e., an AFR of at most 0.88%). Disk reliability analysis based on Google's more than 100,000 ATA disks also observed an average AFR value higher than 1%, ranging from 1.7% for disks in their first year of operation to as high as 8.6% for 3-year-old disks [17].

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128025727000026

Risk management

Anthony Scott Brown, in Clinical Engineering (Second Edition), 2020

Risk analysis

Theoretical perspectives on risk will not be elaborated on here as they have already been discussed in a recent Institute of Physics and Engineering in Medicine (IPEM) report entitled Risk Management and its Application to Medical Device Management (Brown and Robbins, 2007). Risk analysis requires the identification of key safety characteristics of the device, and identification of the hazards. For each of the hazards an estimation of the risk should be made. Essentially this is about considering the consequences (severity) and likelihood (probability) of a hazard becoming a risk. This is a systems approach to failure as conceptualized by Reason's (2000) “Swiss cheese” model. “The systems approach concentrates on the conditions under which individuals work and tries to build defenses to avert errors or mitigate their effects” (Reason, 2000, p. 768).

To ensure consistency, each of these terms should be defined or quantified in the risk management file. Typical definitions are given in Tables 5.1 and 5.2.

Table 5.2. Typical definitions for likelihood.

Improbable (rare) | Rarely occurs, >1–5 years
Remote (unlikely) | Not expected to happen more than yearly
Occasional (possible) | May reoccur occasionally, >6 monthly
Probable (likely) | Likely to reoccur > monthly
Frequent (almost certain) | Frequently reoccurs > weekly

When considering medical device design, a risk may be the failure of a safety-critical component, in which case the likelihood may be estimated as a mean time to failure (MTTF). For risks where it is not possible to estimate the likelihood, for example software failure or malicious tampering, the worst-case scenario should be used. ISO 14971 gives some useful examples of sources of information or data for estimating risks (BSI, 2007, p. 10):

Published standards

Scientific technical data

Field data from similar devices already in use, including published reported incidents

Usability tests employing typical users

Clinical evidence

Results of appropriate investigations

Expert opinion

External quality assessment schemes

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B978008102694600005X

Generic data reliability model in the cloud

Wenhao Li, ... Dong Yuan, in Reliability Assurance of Big Data in the Cloud, 2015

4.1.1 Reliability metrics

As mentioned in Section 2.1, there are two fundamental disk reliability metrics that are currently used for describing the permanent disk failure rates, which are the mean time to failure (MTTF) and annualized failure rate (AFR). In this book, we apply the AFR as the disk reliability metric to our research because of the following two reasons.

First, AFR is easier for nonexpert readers to understand. The representation of MTTF is a time, calculated according to the equation MTTF = (TestDiskNumber × TestHours) / DiskFailures. For example, a disk manufacturer tested a sample of 1,000 disks for a period of 1,000 hours (i.e., about 42 days), and within that period of time, one disk failure occurred. According to the equation, the MTTF value is 1,000,000 hours. From the reader's point of view, an MTTF value that equals 114 years is hard to grasp, because no single disk could survive for that long. In contrast, the representation of AFR is a percentage, which indicates the expected probability of a disk failure occurring during 1 year of usage. For the MTTF value of 1,000,000 hours, according to equation (2.1) in Section 2.1, the equivalent AFR value is 0.87%, meaning that 0.87% of all the disks are expected to fail during 1 year of usage. Compared with MTTF, the advantage of AFR in readability is easy to see.
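A brief sketch of this round trip, using the example test sample above and the equation (2.1) conversion:

```python
import math

# The example test sample described in the text:
disks, test_hours, failures = 1_000, 1_000, 1

mttf_hours = disks * test_hours / failures   # 1,000,000 hours
mttf_years = mttf_hours / 8_760              # ~114 years

afr = 1 - math.exp(-8_760 / mttf_hours)      # equation (2.1): ~0.87% per year

print(mttf_hours, round(mttf_years), f"{afr:.2%}")  # 1000000.0 114 0.87%
```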

Second, as mentioned in Section 2.1, MTTF is obtained in industrial tests by running many disks for a specific period of time. In contrast, AFR is obtained from the real scenario by checking the running history of disks in the system via system logs. Therefore, the AFR value better reflects the actual reliability level of disks in a real storage system. In addition, much existing research conducted by industry researchers applies AFR for disk reliability evaluation. In this book, results from existing industrial research are thoroughly investigated and applied in our evaluation as well.

Based on the AFR disk reliability metric, data reliability is presented in a similar style. In our novel reliability model, data reliability is described in the form of an annual survival rate, which indicates the proportion of the data that survives 1 year of storage.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B978012802572700004X

Hardware Fault Tolerance

Israel Koren, C. Mani Krishna, in Fault-Tolerant Systems (Second Edition), 2021

2.3.2 Stress Migration

Metal interconnects are deposited on a silicon substrate; the two materials expand at different rates when heated. This causes mechanical stress to the interconnects, which triggers migration among the metal atoms. The mean time to failure caused by stress migration of metal interconnect is often modeled by the following expression:

(2.15) MTTF_SM = A_SM · σ^(−m) · exp(E_a(SM) / (kT)),

where A_SM is a constant of proportionality, σ is the mechanical stress, E_a(SM) is the activation energy for stress migration (often taken as between 0.6 and 1.0 eV), k is the Boltzmann constant, and T is the absolute temperature. The exponent m is usually taken as between 2 and 4 for soft metals like Al and Cu; it rises to between 6 and 9 for strong, hardened materials.
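As a hedged illustration of Eq. (2.15) (the parameter values below are arbitrary assumptions, not values from the chapter), the sketch shows how, with σ held fixed, the exponential term alone drives MTTF_SM down as temperature rises:

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def mttf_stress_migration(a_sm, sigma, m, ea_ev, temp_k):
    """Eq. (2.15): MTTF_SM = A_SM * sigma**(-m) * exp(Ea / (k*T))."""
    return a_sm * sigma ** (-m) * math.exp(ea_ev / (K_BOLTZMANN_EV * temp_k))

# Purely illustrative (assumed) parameter values; only the relative trend is meaningful here.
base = mttf_stress_migration(a_sm=1.0, sigma=1.0, m=3, ea_ev=0.8, temp_k=350)
hot = mttf_stress_migration(a_sm=1.0, sigma=1.0, m=3, ea_ev=0.8, temp_k=400)

print(hot / base)  # < 1: with sigma held fixed, a higher temperature shortens MTTF_SM
```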

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128181058000127

Reliable and power-aware architectures

A. Vega, ... R.F. DeMara, in Rugged Embedded Systems, 2017

3.2 Effectiveness Metrics

Effectiveness metrics quantify the benefit in system resilience provided by a given technology or set of resilience techniques. Such metrics tend to be measured as probabilistic figures that predict the expected resilience of a system, or that estimate the average time before an event expected to affect a system’s normal operating characteristics is likely to occur. These include:

Mean Time to Failure (MTTF)—Indicates the average amount of time before the system degrades to an unacceptable level, ceases expected operation, and/or fails to produce the expected results.

Mean Time to Repair (MTTR)—When a system degrades to the point at which it has “failed” (this can be in terms of functionality, performance, energy consumption, etc.), the MTTR provides the average time it takes to recover from the failure. Note that a system may have different MTTRs for different failure events as determined by the system operator.

Mean Time Between Failures (MTBF)—The mean time between failures gives an average expected time between consecutive failures in the system. MTBF is related to MTTF as MTBF = MTTF + MTTR.

Mean Time Between Application Interrupts (MTBAI)—This measurement gives the average time between application level interrupts that cause the application to respond to a resilience-related event.

Probability of Erroneous Answer—This metric measures the probability that the final answer is wrong due to an undetected error.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128024591000026

Which one of the following measures the average amount of time that it takes to repair a system or application?

MTTR (mean time to repair) is the average time it takes to repair a system (usually technical or mechanical).

What is MTTR stand for?

Mean time to repair (MTTR) is a maintenance metric that measures the average time required to troubleshoot and repair failed equipment. It reflects how quickly an organization can respond to unplanned breakdowns and repair them.

How is a system level mean time to repair calculated?

How is MTTR calculated? MTTR is calculated by dividing the total downtime caused by failures by the total number of failures. If, for example, a system fails three times in a month, and the failures resulted in a total of six hours of downtime, the MTTR would be two hours.
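A one-line sketch of that calculation, using the example's numbers:

```python
def mttr(total_downtime_hours, failure_count):
    """Mean time to repair: total downtime divided by the number of failures."""
    return total_downtime_hours / failure_count

print(mttr(total_downtime_hours=6, failure_count=3))  # -> 2.0 hours, matching the example above
```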

What does a high MTTR mean?

The MTTR is an indicator of maintainability (how easily a piece of equipment can be repaired). A higher Mean Time to Repair may indicate that replacing a given asset is cheaper or preferable to repairing it.