The Systems Engineering Relationship between Qualification, Environmental Stress Screening and Reliability

James A. Robles
The Boeing Company

Copyright 2009 SAE International

SAE Int. J. of Aerosp. | Volume 2 | Issue 1 268

ABSTRACT

The Systems Engineering relationship between Qualification, Environmental Stress Screening (ESS), and Reliability is often poorly understood; as a consequence, resources are expended on efforts that degrade inherent hardware reliability and vitiate reliability predictions. This article elaborates on the Systems Engineering relationship between Qualification and ESS, and on how their proper application enhances inherent reliability and supports credible reliability predictions. Examples are presented of how their uninformed application degrades inherent hardware reliability and vitiates reliability predictions, and of how program/equipment managers can avoid this.

INTRODUCTION

There is a problem with the reliability of recently fielded systems: Department of Defense (DoD) concerns have been widely reported.

"Emerging data shows that a significant number of U.S. Army systems are failing to demonstrate established reliability requirements during operational testing and many of these are falling well short of their established requirement. . . . Enclosure 1 outlines the process for establishing and reporting the new reliability threshold, as well as a mechanism for detecting and reporting threshold breaches. The routine use of this process and the implementation of reliability best practices (Enclosure 2) will help the Army achieve its reliability requirements." [1]

"Ensure programs are formulated to execute a viable systems engineering strategy from the beginning, including a RAM growth program, as an integral part of design and development." [2]

"I share your concerns regarding the recent downward trend of reliability, availability, and maintainability (RAM) test results, and agree with your assessment that RAM considerations must be strengthened as our weapons systems move through the development and production phases and into operational service." [3]

This is not, as discussed below, a commercial-off-the-shelf (COTS) vs. custom or military specification design issue: the focus of the above references is on Program Management and System Engineering processes and best practices. This focus on processes and practices is a positive development; however, it is essential that we get the content right. Two areas where there seems to be widespread failure to do so are the definition of durability environments and the application of ESS. Showing why this is so requires some review of fatigue engineering principles, the bathtub curve, and the limitations of our reliability prediction methods.

FATIGUE ENGINEERING PRINCIPLES

Materials will fatigue [4, 5, 6, 7] under the repeated application of stresses and strains that do not cause failure on the first application: imagine bending a paper clip back and forth. For engineering materials the stress level (SL) vs. cycles to failure typically plots as a straight line on a log-log scale: this is a power law relationship (see ARP 5890 [8], Equation B-9). The stress level may be expressed as pounds per square inch, as strain (inch per inch), as power spectral density (PSD) for random vibration, as the magnitude of a temperature cycle, etc. The cycles to failure may be expressed as cycles or as time (assuming a consistent cyclic rate).

There is scatter in the data. The time or cycles to failure for a number of identical samples tested at the same stress level will form a Gaussian distribution. This scatter is due to the variety of defects in each sample: as a consequence, larger samples, with a greater probability of containing a more severe defect, will fail sooner.

We use Miner's Rule [9] to determine fatigue damage accumulation and a Composite Damage Index (CDI) for items subjected to combinations of different stress levels.
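The straight line on a log-log plot mentioned above can be written out explicitly. The following is a minimal sketch, assuming the conventional Basquin-type power law; the constant $C$ and exponent $b$ are placeholders for material- and loading-specific values and are not taken from ARP 5890:

```latex
\log N_x = \log C - b\,\log S_x
\quad\Longleftrightarrow\quad
N_x = C\,S_x^{-b}
```

Here $S_x$ is the stress level and $N_x$ the cycles to failure at that level. Under this assumption, $n_x$ cycles applied at level $S_x$ use up the fraction $n_x / N_x = n_x S_x^{b} / C$ of the available life, which is the quantity summed in the damage index defined next.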
$$\mathrm{CDI} = \sum_{x} \frac{n_x}{N_x}$$

where
$n_x$ = number of applied stress cycles at stress level $x$
$N_x$ = number of cycles to failure at stress level $x$

Failure is expected to occur when the CDI is approximately equal to one.

THE BATHTUB CURVE

The bathtub curve [10], which describes failure rate as a function of time, is described in a number of sources and shown in Figure 1. It can be used to describe a range of phenomena, including human death rates as a function of age and electronic failure rates as a function of time.

The Infant Mortality portion of the curve is the initial section, for which the failure (death) rate decreases with time (age). For military electronics this higher initial failure rate is purported to be due to latent manufacturing defects. Environmental Stress Screening (ESS), comprising random vibration and temperature cycling, is used to precipitate these defects as failures so that they can be repaired to produce items without infant mortality defects.

The Constant Failure Rate portion of the curve is the section after Infant Mortality defects have been eliminated, but before Wearout has begun to occur. Failures are random. This is the period for which Constant Failure Rate statistical prediction techniques (MIL-HDBK-217 [11], VITA 51.1 [12], etc.) have some validity.

The Wearout portion of the curve is the last section and has a Gaussian distribution that goes to zero when the last item in a starting population has failed. Failures in this portion of the curve are due to fatigue, and follow the Gaussian distribution that was previously discussed for fatigue phenomena. Durability [8] verification (analysis and/or test) during item Qualification, as shown in Topic 2.2 Bathtub Curve [10], is commonly used to demonstrate that wearout will not occur during the planned life of the item.

"Typically, reliability analysis is aimed at assessing the random failures that will occur in the equipment during its useful life. These failures are usually assumed to be repairable, and may be due to a variety of causes, such as defects in the equipment, improper use, damage due to unusual conditions, inadequate maintenance, etc. Durability analysis, on the other hand, assesses failures due to wearout of certain elements of the design." [8]

LIMITATIONS OF THE RELIABILITY PREDICTION PROCESS

MIL-HDBK-217, like most similar analysis techniques, relies on a number of assumptions, two of which are germane here: 1) infant mortality failures have been eliminated by good process control, or screened out by an effective ESS program that consumes a relatively small tranche of demonstrated life, and 2) the period of performance, after ESS, is within the demonstrated life of the item, so that wearout failures will not occur. These analysis techniques typically do not model wearout mechanisms [13].

Selection of the appropriate MIL-HDBK-217 PiE-factor will not remotely compensate for the failure to adequately specify durability environments. The PiE-factor ratios assume, as does everything in the MIL-HDBK-217 methodology, that durability has been demonstrated and that the item is in the constant failure rate portion of the bathtub curve: they do not account for limited life due to wearout.

DURABILITY ENVIRONMENTS

The salient contributors to equipment durability environments are vibration (high cycle fatigue) and temperature (low cycle fatigue). The deleterious effects of these environments are widely understood, and have been thoroughly investigated in a number of venues.

"AVIP also broadens the tools and focus of electronic packaging design to address the life cycle issues through fatigue analysis" [14]

"Complexities include 1) Surface Mount Technology, . . . Thermal cycling fatigue life of electronics was improved through 1) Coefficient of thermal expansion (CTE) matching, 2) Omega and other strain relieving lead wire designs for large devices, 3) Plated Thru hole improvements . . . The F-22 program's typical durability life test requires from 500 to 1500 thermal cycles on one unit. . . . The design analysis included consideration of the damaging effects on electronics from thermal fatigue." [15]

Figure 1 The Bathtub Curve [axes: Failure Rate vs. Time; labels: Infant Mortality, Constant Failure Rate, Wearout, ESS, Qualification, Margin on Elimination of Infant Mortality Failures, Margin Against Wearout]

Figure 2 The Bathtub Curve with 95% to 100% of Demonstrated Life consumed by ESS [axes: Failure Rate vs. Time; labels: ESS, Wearout]

Table 1 -- Vibration Durability Life Consumed by ESS

ESS: Duration 10 minutes; PSD 0.04 g²/Hz; Duration x PSD^4 = 2.56E-05
Durability: Duration 300 minutes at each PSD level below

Durability PSD (g²/Hz)   Duration x PSD^4   Percent of Demonstrated Life Consumed by ESS
0.002                    4.8E-09            99.98%
0.004                    7.68E-08           99.70%
0.008                    1.2288E-06         95%
0.016                    1.96608E-05        57%
0.032                    0.000314573        8%
0.040                    0.000768           3%
0.064                    0.005033165        1%
0.128                    0.080530637        0.03%
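The entries in Table 1 can be reproduced with a few lines of code. The sketch below assumes that relative fatigue damage is duration times PSD to the fourth power (the weighting in the table's column heading) and that the percentage is the ESS damage divided by the total demonstrated damage (ESS plus durability test). The function and variable names are illustrative and do not come from any standard.

```python
# Sketch: reproduce Table 1 under the assumed "duration x PSD^4" damage weighting.

PSD_EXPONENT = 4  # assumption taken from the table's "Duration x PSD^4" column


def damage(minutes, psd):
    """Relative fatigue damage for a random vibration exposure."""
    return minutes * psd ** PSD_EXPONENT


def ess_share(ess, durability):
    """Fraction of demonstrated life (ESS + durability test) consumed by ESS.

    Both arguments are (minutes, psd) tuples.
    """
    d_ess = damage(*ess)
    d_dur = damage(*durability)
    return d_ess / (d_ess + d_dur)


ESS = (10, 0.04)            # 10 minutes at 0.04 g^2/Hz
DURABILITY_MINUTES = 300    # five-hour durability test
ZONE_PSDS = [0.002, 0.004, 0.008, 0.016, 0.032, 0.040, 0.064, 0.128]


def table_rows():
    """Return (psd, durability damage, ESS share) for each installation zone."""
    rows = []
    for psd in ZONE_PSDS:
        durability = (DURABILITY_MINUTES, psd)
        rows.append((psd, damage(*durability), ess_share(ESS, durability)))
    return rows


if __name__ == "__main__":
    for psd, dur_damage, share in table_rows():
        print(f"{psd:<8.3f}{dur_damage:<16.6g}{share:.2%}")
```

Changing `ESS` or `DURABILITY_MINUTES` shows how sensitive the consumed fraction is to the ratio of screening level to durability level, because that ratio enters at the fourth power.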
The Bolton Memorandum [1], Enclosure 2, Reliability Best Practices, also confirms the need to address both thermal and vibration fatigue.

"The supplier routinely conducts thermal and vibration analyses to address potential failure mechanisms and failure sites (i.e., a physics-of-failure approach to reliable design). These analyses would likely include the use of fatigue analysis tools, finite element modeling, dynamic simulation, heat transfer analyses, etc." [1]

Appendix 1.4 Standard Evaluation Criteria, Reliability Analysis: "Comprehensive Thermal and Vibration analyses and/or Finite Element Analyses (FEA) are conducted to address potential failure mechanisms and failure sites."

From ANSI/GEIA-STD-0009, Figure 1: "Engineering analysis and test data identifying the system/product failure modes and distributions that will result from the life-cycle loads." [16]

Ideally the durability environments should be derived from the planned usage of the item. The Bolton Memorandum [1], Enclosure 2, Reliability Best Practices, also affirms that:

"The supplier has characterized the critical loads and stresses. A good design team will characterize the life cycle environment and operational duty cycle stresses that their components will see."

From the Halpin Highlights [14]:

a. Realistic systems requirements derived from the user's intended application.
b. Thorough understanding of operational usage and environments.

Enclosure 1. Section C Statement of Work Reliability Language and Tailoring Instructions, 4. System-Level Operational & Environmental Life-Cycle Loads: "The contractor shall estimate and periodically update the operational & environmental loads (e.g., mechanical shock, vibration, and temperature cycling) that the system is expected to encounter in actual usage throughout the life cycle."
From ANSI/GEIA-STD-0009, Figure 2: "User and environmental profile that defines the system/product's life cycle (operating and non-operating environments, expected operating and non-operating times, etc.)." [16]

From ANSI/GEIA-STD-0009, 4.5.1.4: "The developer shall estimate the user and environmental loads (e.g., mechanical shock, vibration, and temperature/humidity cycles . . .)"

The temperature cycling fatigue environment is usually the result of the combination of diurnal nighttime low temperatures and the maximum temperature achieved at each potential failure site (solder joint, component lead, etc.) as a result of diurnal daytime high temperatures, cooling system performance, operational cycles and equipment power on-off cycles. Experience on programs where Durability fatigue analyses have been conducted, and validated, shows that the temperature cycling fatigue contribution is typically eighty percent (80%) to ninety percent (90%) of the Composite Damage Index (CDI): this is true even for platforms with relatively severe vibration environments.

Vibration and Temperature Cycling Environments are Orthogonal to Each Other - In the case of a circuit card assembly (CCA), vibration fatigue (primarily of component leads and solder joints) is typically due to the flexure (Figure 3) of the CCA, perpendicular to the plane of the CCA: as the CCA flexes repeatedly, the strains imposed on the component leads and solder joints lead to the accumulation of fatigue damage. Again in the case of a CCA, temperature cycling fatigue (again primarily of component leads and solder joints) is due to coefficient of thermal expansion (CTE) mismatch (Figure 4) between the component and the CCA in the plane of the CCA: as the CCA goes through repeated thermal cycles, the strains imposed on the component leads and solder joints lead to the accumulation of fatigue damage.

Figure 3 CCA Flexure in Vibration

Figure 4 CTE Mismatch Strains Leads and Solder Joints

Changes to improve performance in one durability environment can degrade performance in the other durability environment. For example, stiffening the card to improve vibration performance could degrade performance in temperature cycling. It follows that long life in one durability environment does not imply any life in the other.

ENVIRONMENTAL STRESS SCREENING

As noted above, the intent of ESS is to precipitate infant mortality (latent manufacturing) flaws so that they can be repaired, and the fielded item will be at the beginning of the flat (Constant Failure Rate) portion of the bathtub curve.

A long-standing industry rule of thumb holds that power spectral density (PSD) levels below 0.04 g²/Hz [17] are insufficient to precipitate flaws: vibration at lower levels is otiose. Another industry rule of thumb holds that ESS should not consume more than five percent (5%) of the demonstrated durability life of the item: this is to increase the probability that the item remains on the flat portion of the Bathtub Curve (Figure 1) for its planned useful life.

Table 1 uses the equation from MIL-STD-810F [18], Paragraph 2.2 Fatigue Relationship, to determine the percentage of demonstrated durability life consumed by ESS on a hypothetical program. For this hypothetical program ESS is performed for 10 minutes at 0.04 g²/Hz. Durability vibration testing is conducted for five (5) hours (300 minutes) at different levels depending on the item installation zone.

In this hypothetical case, conducting ESS for items installed in installation zones with PSDs of 0.04 g²/Hz or above may make sense, assuming that the items do have infant mortality defects. For items installed in the zones with lower PSDs, the conduct of ESS is non-value added (the field/durability vibration level is too low to precipitate any infant mortality defects), and deleterious (an excessive portion of demonstrated durability vibration life is consumed) to the item's reliability.
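The five percent rule of thumb can be turned into a minimum durability test duration. The following is a sketch, assuming the same duration x PSD^4 damage weighting used in Table 1; the symbols $t$ (duration), $W$ (PSD) and $f$ (fraction of demonstrated life consumed by ESS) are introduced here for illustration only:

```latex
f = \frac{t_{ESS}\,W_{ESS}^{4}}{t_{ESS}\,W_{ESS}^{4} + t_{dur}\,W_{dur}^{4}} \le 0.05
\quad\Longrightarrow\quad
t_{dur} \ge 19\,t_{ESS}\left(\frac{W_{ESS}}{W_{dur}}\right)^{4}
```

For the 10 minute, 0.04 g²/Hz screen, a zone with $W_{dur}$ = 0.04 g²/Hz needs at least 190 minutes of durability testing, which the 300 minute test satisfies. A zone with $W_{dur}$ = 0.008 g²/Hz would need 19 x 10 x 5^4 = 118,750 minutes (roughly 1,980 hours), far beyond the five hour test, which is another way of seeing why the low-PSD rows of Table 1 show nearly all of the demonstrated life consumed.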
ESS is an attempt to inspect in quality for low production rate equipment. Defects in high production rate equipment can be reduced or eliminated by the application of statistical process control and automation. High production rate equipment is far more likely to be COTS than custom military specification design. It follows that COTS is far more likely to be defect free (at least prior to ESS) than custom military specification design.

One way to decompose reliability is into two questions. First, is the item inherently robust enough (durability environments address this)? Second, is the item defect free (ESS is intended to address this)?

We have experience flying COTS items such as Ricoh Printers, Sony Satellite Dish Receivers, and HP Servers (without conducting ESS) on military derivative aircraft: in this relatively benign environment (commercial aircraft converted to a military application) these COTS items have proven to be considerably more reliable than the military specification Government Furnished Equipment (GFE). These COTS items are clearly not robust enough for severe environment platforms, such as fighter aircraft, but their reliable performance on military derivative aircraft confirms that ESS would be non-value added, since field experience has shown these items to be relatively free of infant mortality defects. In addition, given that they were not designed for the flight environment, ESS would be more likely to degrade reliability by consuming an excessive portion of the item's durability life.

ESS vs. Burn-In - ESS is distinct from "Burn-In", which is typically applied to components and subassemblies to accelerate/screen time-temperature dependent, thermally activated failure mechanisms that can be modeled using the Arrhenius relationship: this includes solid state reactions such as diffusion, grain growth, etc. As with ESS, Burn-In is intended to ensure that components or subassemblies start off in the constant failure rate region.
ESS focuses on thermo-mechanical mechanisms that are related more to component assembly (leads, solder joints, etc.), but does not necessarily drive solid state (thermally activated) mechanisms to the constant failure rate region.

ESS DEGRADING INHERENT RELIABILITY AND VITIATING RELIABILITY PREDICTIONS

Consider a hypothetical example from the hypothetical program described above. As noted above, the validity of our reliability predictions for fielded items rests on two assumptions: 1) infant mortality defects have been eliminated; and 2) the item does not enter the wearout portion of the Bathtub Curve during its planned useful life.

In this hypothetical case:

The durability vibration requirement (Table 1) is five (5) hours at 0.008 g²/Hz.

A durability temperature cycling requirement is not specified.

The ESS requirement is five (5) minutes of random vibration, followed by twelve thermal cycles, and then another five (5) minutes of random vibration. The last five (5) thermal cycles and the final five (5) minutes of vibration must be failure free: there is no limit on the number of repeats allowed to achieve the required five thermal cycles and five minutes of vibration failure free.

An item that is no better than required by this set of requirements would be inherently unreliable the moment it was fielded.

In this hypothetical case, the item went through ESS, prior to qualification, without having to repeat the last five (5) temperature cycles or the final five (5) minutes of vibration. The total demonstrated vibration durability life is the sum of five (5) hours (300 minutes) at 0.008 g²/Hz and ten minutes at 0.040 g²/Hz: as discussed above, assuming that a production unit went through ESS without having to repeat the last five (5) minutes of vibration, ninety-five percent (95%) of the demonstrated useful life would have been consumed before the item was fielded.
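The ninety-five percent figure can be checked, and the effect of repeated screening passes explored, with a short sketch. It assumes the duration x PSD^4 damage weighting from Table 1, assumes that each repeat reruns exactly the final five thermal cycles and the final five minutes of vibration, and assumes all thermal cycles are equally damaging. The function names are hypothetical.

```python
# Sketch: demonstrated life consumed by production ESS in the hypothetical case,
# as a function of how many times the failure-free segment has to be repeated.

PSD_EXPONENT = 4  # assumed, matching Table 1's "Duration x PSD^4" weighting


def damage(minutes, psd):
    """Relative vibration fatigue damage for one exposure."""
    return minutes * psd ** PSD_EXPONENT


# Demonstrated by the qualification unit: one clean ESS pass (10 minutes at
# 0.040 g^2/Hz) followed by the durability test (300 minutes at 0.008 g^2/Hz).
DEMONSTRATED_VIBRATION = damage(10, 0.040) + damage(300, 0.008)

# No durability temperature cycling requirement exists, so the only
# demonstrated thermal life is the twelve ESS cycles themselves.
DEMONSTRATED_THERMAL_CYCLES = 12


def vibration_life_consumed(repeats):
    """Fraction of demonstrated vibration life used by a production unit's ESS."""
    ess_minutes = 10 + 5 * repeats  # assumed: each repeat adds 5 minutes
    return damage(ess_minutes, 0.040) / DEMONSTRATED_VIBRATION


def thermal_life_consumed(repeats):
    """Fraction of demonstrated thermal cycling life used by a production unit's ESS."""
    ess_cycles = 12 + 5 * repeats  # assumed: each repeat adds 5 cycles
    return ess_cycles / DEMONSTRATED_THERMAL_CYCLES


if __name__ == "__main__":
    for k in range(3):
        print(f"repeats={k}: vibration {vibration_life_consumed(k):.0%}, "
              f"thermal {thermal_life_consumed(k):.0%}")
```

A value above 1.0 means the production unit has been exposed to more damage than was ever demonstrated on the qualification unit.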
If the unit had to repeat (again, there is no limit to how many times this could happen) the last five minutes of vibration, after correction of a failure, then well over 100% of demonstrated useful life would have been consumed.

For temperature cycling, even assuming that there are no repeated cycles following correction of a failure, at least 100% of demonstrated useful life has been consumed when ESS is completed, since in the absence of a durability temperature cycling requirement one pass through ESS is all that is included in the demonstrated temperature cycling durability life. If there are repeat ESS cycles then the situation would be considerably worse.

In this case, the bathtub curve would be as shown in Figure 2: the actual item might be better than the requirements, but there would be no evidence or data to show that this is the case. The flat (constant failure rate) portion of the Bathtub Curve, where our reliability predictions have some validity, does not exist, so our reliability prediction is vitiated. The inherent reliability of the unit has been degraded by the fatigue damage it has accumulated. In the case of vibration, this was done in the attempt to eliminate latent defects that the field level is too low to precipitate, thus artificially activating failure mechanisms not relevant to the field environment.

HOW PROGRAM/EQUIPMENT MANAGERS CAN AVOID THESE PITFALLS

1. Durability environments must include vibration and temperature cycling requirements that are consistent with the planned usage and the planned useful life. Note: the temperature cycling verification does not have to be an expensive test; in many cases it may be accomplished by analysis or similarity.
2. ESS vibration and temperature cycling must be limited, in each case, to some small portion (typically five percent [5%]) of demonstrated useful life, including a specified number of allowed repeat/repair cycles.
3. Vibration ESS should not be conducted when the durability vibration level is too low to precipitate Infant Mortality (latent manufacturing) defects.
4. ESS should not be conducted on items (typically COTS) that have been shown to be free of infant mortality (latent manufacturing) defects.

These measures can enhance reliability while reducing cost.

CONCLUSION

Finally, Program/Equipment Managers should have a Useful Life Strategy that reflects the expected field fatigue life of each class of items, and the customer's desire for technology insertion/refresh. For example, if the item can only be expected to survive three to five years in the field and the customer desires technology insertion (how often do you replace your laptop?) every three years, then attempting to ruggedize/qualify/ESS the item for a longer life will only add cost while degrading reliability. The proper application of qualification, ESS and reliability prediction methods to determine a useful life strategy, while avoiding the system engineering pitfalls described herein, will minimize total ownership cost while enhancing effectiveness for the war fighter.

"All activities, methods and tools used should be evaluated and applied in a manner that adds demonstrated value to the program, at optimized life cycle cost and utilization of resources" [16]

". . . . (which may include COTS, NDI, and CFI, as well . . .) shall identify and confirm through analysis, test, or accelerated test, the failure modes and distributions that will result when these life-cycle loads are imposed on these items."

REFERENCES

1. MEMORANDUM FOR SEE DISTRIBUTION; SUBJECT: Reliability of U.S. Army Materiel Systems; 06 DEC 2007; Claude M. Bolton Jr.; Department of the Army, Assistant Secretary of the Army, Acquisition Logistics and Technology.
2. MEMORANDUM FOR DIRECTOR, OPERATIONAL TEST AND EVALUATION, DEPUTY UNDER SECRETARY OF DEFENSE FOR ACQUISITION AND TECHNOLOGY; SUBJECT: Report of Reliability Improvement Working Group; Office of the Secretary of Defense.
3. MEMORANDUM FOR UNDER SECRETARY OF DEFENSE (ACQUISITION, TECHNOLOGY, AND LOGISTICS); SUBJECT: Reliability, Availability, and Maintainability Policy; Department of the Air Force.
4. Robert C. Juvinall; Stress, Strain, and Strength; McGraw-Hill.
5. Joseph Edward Shigley; Mechanical Engineering Design; McGraw-Hill.
6. Joseph H. Faupel; Engineering Design; John Wiley and Sons, Inc.
7. Rao R. Tummala and Eugene J. Rymaszewski (editors); Microelectronics Packaging Handbook; Van Nostrand Reinhold.
8. Guidelines for Preparing Reliability Assessment Plans for Electronic Engine Controls; ARP 5890.
9. Dave S. Steinberg; Vibration Analysis for Electronic Equipment; John Wiley & Sons; Chapter 10, Structural Fatigue.
10. RAC (Reliability Analysis Center); Reliability Toolkit: Commercial Practices Edition, A Practical Guide for Commercial Products and Military Systems Under Acquisition Reform.
11. Military Handbook, Reliability Prediction of Electronic Hardware; MIL-HDBK-217F, Notice 2; Rome Air Development Center; 28 February 1995.
12. Reliability Prediction, MIL-HDBK-217, Subsidiary Specification; VITA 51.1; June 2008.
13. Lori Bechtold; Physics of Failure in Handbook Reliability Predictions; Components for Military & Space Electronics (CMSE), 2009.
14. Halpin, Dr. J. C.; Avionics/Electronics Integrity (AVIP) Highlights.
15. Glista, Stefan; Lessons Learned from the F-22 Avionics Integrity Program; 0-7803-5086-3/98, IEEE.
16. ITAA Standard, Reliability Program Standard for Systems Design, Development, and Manufacturing; ANSI/GEIA-STD-0009-2008; November 13, 2008.
17. Navy Manufacturing Screening Program, Decrease Corporate Costs, Increase Fleet Readiness; Department of the Navy; NAVMAT P-9492; May 1979.
18. Department of Defense Test Method Standard for Environmental Engineering Considerations and Laboratory Tests; MIL-STD-810F; 1 January 2000.

CONTACT

james.a.robles@boeing.com