Root Cause Analysis Handbook 3e by Heuvel, Lorenzo, Jackson, Hanson, Rooney, Walker

Root Cause Analysis Handbook 3e by Heuvel, Lorenzo, Jackson, Hanson, Rooney, Walker is the 3rd edition of Root Cause Analysis Handbook: A Guide to Efficient and Effective Incident Investigation, authored by Lee N. Vanden Heuvel, Donald K. Lorenzo, Laura O. Jackson, Walter E. Hanson, James J. Rooney, and David A. Walker, branded together as ABS Consulting. It was published in 2008 by Philip Jan Rothstein, FBCI, Rothstein Associates Inc., The Rothstein Catalog On Service Level Management, Brookfield, Connecticut, U.S.A.

The copyright belongs to ABSG Consulting Inc., Houston, TX USA.

  • Accident. An incident with unexpected or undesirable consequences. The consequences can be related to personnel injury or fatality, property loss, environmental impact, business loss, harm to the company's reputation, or a combination of these.
  • Apparent Cause Analysis (ACA). An analysis that identifies the causal factors for an incident and develops recommendations to address them, but does not necessarily identify the root causes of the incident.
  • Causal Factor. Equipment performance gaps (EPGs) and front-line personnel performance gaps (FLPPGs) that caused an incident, allowed an incident to occur, or allowed the consequences of the incident to be worse than they might have been.
    • Causal factors are gaps in the performance of equipment or front-line personnel. A performance gap is the difference between the desired performance of the equipment or human and the actual performance of the equipment or human.
    • For a typical incident, there are multiple causal factors. Causal factors are identified during the first stage of the analysis. Each causal factor is an event or condition that we want to eliminate. For each causal factor, underlying causes are identified and recommendations are developed.
  • Condition. A state of being.
    • Includes process states, such as pressure, temperature, composition, and level. Also includes the state of training of an employee, the condition of supplies, and the state of equipment. If it describes a performance gap, then it may be a causal factor, intermediate cause, or root cause.
    • Condition descriptions usually include passive verbs such as "was" and "were." Time is not typically associated with a condition.
  • Consequences. Undesirable or unexpected outcomes that may result in negative effects for an organization.
    • These consequences range from minor injuries to major events involving loss of life, extensive property loss, environmental damage, and breaches of security.
      • Negative effects can include property damage or loss, personnel injury or illness, spills, loss of sales, loss of reputation, etc. Consequences can differ in magnitude. For example, a spill could be small but have the potential to be much larger. Another spill could be large and result in environmental damage. The same level of effort might be put into investigating these two incidents: the first based on the potential consequences (a near miss) and the second based on the actual consequences (an accident).
      • The consequences and potential consequences of the incident should determine the level of effort invested in the analysis.
    • Equipment Performance Gap (EPG). Equipment performance that deviates from the desired performance of the item.
      • EPGs are one of the two types of causal factors.
      • EPGs include equipment failure. For a failure, the performance gap is simply the difference between the equipment functioning (the desired performance) and the failure of the equipment (the actual performance).
      • The definition does not indicate failure to perform as designed, but failure to perform as desired. This means that items can perform as designed and still fail to perform as desired. For example, a pump is designed to deliver 100 gallons per minute (gpm); however, during emergency conditions 150 gpm is needed. Since the pump cannot deliver 150 gpm (the desired performance), there is an EPG. By defining failures in this way, equipment design issues can contribute to EPGs.
    • Event. A happening caused by humans, automatically operating equipment/components, external events, or natural phenomena. Event descriptions typically include action verbs, such as "walked," "turned," "opened," "said," "radioed," "discovered," "decided," "saw," etc. If it describes a performance gap, then the event may also be a causal factor, intermediate cause, or root cause.
    • External Factors. Issues outside the direct control of the organization. Examples include actions of the public, weather conditions, suicides or homicides, and operations of facilities not owned or controlled by the organization.
    • Front-line Personnel. Personnel in an organization who are directly involved in producing or providing the organization's final product or service.
      • For a typical manufacturing operation, front-line personnel would include operational and maintenance personnel.
      • For a sales firm, front-line personnel would include the sales force.
      • For an engineering firm, front-line personnel would include the design engineers.
      • For a department store, front-line personnel would include those who interact directly with the customers.
    • Front-line Personnel Performance Gaps (FLPPGs). Performance of front-line personnel that deviates from the desired performance.
      • The performance gap is the difference between the desired performance of front-line personnel and their actual performance.
      • This definition is not failure to perform as directed, but failure to perform as desired. An individual can follow a procedure precisely and still create an FLPPG because the individual does not perform as desired. This can happen when the procedure specifies the incorrect method for performing a task.
      • Human errors that are causal factors (FLPPGs) are committed by front-line personnel (operators, mechanics, electricians, technicians, etc.).
    • Incident. An unplanned sequence of actions and conditions that results, or could have reasonably resulted, in undesirable consequences for a system stakeholder.
      • This definition includes both accidents (defined above) and near misses (defined below).
      • Incidents are a series of actions and conditions that contain a number of EPGs and/or FLPPGs, as well as positive actions and conditions. An incident can be depicted using a timeline that includes the actions and conditions that occurred and existed during the incident. The timeline also includes information about the context in which the actions are performed and the conditions exist.
    • Intermediate Cause. An underlying reason why a causal factor occurred that is not deep enough to be a root cause. Intermediate causes are underlying causes that link causal factors and items of note to root causes.
    • Item of Note (ION). A deficiency, error, failure, or performance gap that is not directly related to the incident sequence but which is discovered during the course of the investigation. IONs are performance gaps, like causal factors. However, elimination of IONs would not have altered the outcome of the incident (i.e., the magnitude of the loss event). IONs are similar to audit findings. If left uncorrected, these IONs may become causes of future incidents. Underlying causes and recommendations can be developed for IONs as part of the investigation. However, most organizations assign responsibility for causal analysis of IONs to the departments responsible for the activity and not to the incident investigation team.
    • Loss Event. The specific statement of the resulting loss experienced by the system stakeholder.
      • Loss events are the specific statements of loss that appear on a causal factor chart, timeline, and/or cause and effect tree. They are developed by the investigator/investigation team to define the scope of the investigation or analysis.
      • The loss can be expressed as either an event or a condition. When expressed as an event, it describes the occurrence of the loss. When expressed as a condition, it describes the end result of a series of events. The term "loss event" is used generically to refer to either an event or a condition.
      • The loss event selected will define the scope of the analysis. For example, selecting "Valve failure" as the loss event will result in focusing on the valve failure. Selecting "One-thousand-gallon spill of chemical X" as the loss event will result in focusing on the valve failure as well as the spill. Selecting "Three personnel exposed to chemical X" as the loss event will result in the investigation of all three aspects of the incident (the valve failure, the spill, and the personnel exposure). Because of this, the loss event should be selected carefully and defined precisely.
      • A loss event definition that only includes the immediate consequences results in recommendations that are fairly narrow in scope. A loss event definition that also includes the subsequent consequences of the incident results in recommendations that are broader in scope.
      • Multiple loss events may be identified as part of a single investigation. Multiple loss events are usually needed when there are different types of consequences and/or the consequences affect different stakeholders.
      • Loss events can occur in the past or in the future. For example, a loss event can be "The chemical storage facility was destroyed by fire" (has already occurred). However, a loss event can also be "The chemical storage facility will be unusable for a period of 6 months" (a future event).
      • Finally, loss events can be actual or potential losses. For accidents, the actual losses are stated. For example, "One thousand gallons of chemical X spilled into the containment dike for tank S10." For near misses, the loss event is a statement of the potential loss. For example, "Potential spill of up to 1,200 gallons of chemical X into the containment dike of tank S10."
    • Management System. A system put in place by management to encourage desirable behaviors and discourage undesirable behaviors.
      • Examples of management system elements include policies, procedures, training, communications protocols, acceptance testing requirements, incident investigation processes, design methods, and codes and standards.
      • Management systems strongly influence the behavior of personnel in an organization.
    • Near Miss. (1) An incident with no consequences that could have reasonably resulted in adverse consequences to a system stakeholder or (2) an incident that had some consequences that could have reasonably resulted in much more severe adverse consequences to a system stakeholder.
      • An incident can be both an accident and a near miss. If the incident has immediate consequences, it is an accident; it is also a near miss if it could have reasonably resulted in much more severe consequences.
      • Everyone in the organization needs to have an understanding of how near misses are defined by the organization so that they can report appropriate incidents that meet the definition. You cannot investigate what is not reported. Examples of what are and what are not reportable near misses help employees determine what to report. This is a key element in a successful reporting process.
    • Recommendation. A suggestion to develop, modify, or enhance management systems or safeguards. Recommendations can be made to address the causal factor, intermediate cause, and/or root cause levels of the incident. Recommendations are the most important product of the analysis. They state what will be done to change the organization's behavior to (1) prevent recurrence of the incident or (2) minimize the consequences of the incident.
    • Resolution. The disposition of a recommendation. Recommendation resolution often results in implementation of the recommendation. However, resolution can also result in implementing an alternate recommendation or deciding to take no action at all.
    • Root Causes. Deficiencies of management systems that allow causal factors to occur or exist.
      • Performance gaps of support or management personnel are classified as root causes (as opposed to causal factors, which are performance gaps of equipment or front-line personnel).
      • Root causes must be within the control of management to address.
      • There are one to four root causes for a typical causal factor.
      • Root causes are usually as deep as a typical investigation will go in attempting to identify the underlying causes of an incident. As discussed in Section 5, organizational culture issues could also be identified and addressed, but most investigations do not go to this level because developing and implementing recommendations at the organizational culture level is too difficult. However, changes at the organizational culture level can be the most effective.
    • Root Cause Analysis (RCA). An analysis that identifies the causal factors, intermediate causes, and root causes of an incident and develops recommendations to address each of these causes. "RCA" is also used as a generic term to describe the process of performing any type of formal investigation.
    • Safeguard. A physical, procedural, or administrative control that prevents or mitigates the consequences associated with an incident. Safeguards are physical, procedural, and administrative systems controlled by the organization's management systems. For example, a design process (the management system) will result in installation of dual electric generators (the safeguard). As another example, the procedure development process (the management system) will generate a procedure on how to fill a vessel (the safeguard).
    • Stakeholder. Anyone who is interested in the performance of the system. Stakeholders can be interested in safety, quality, reliability, environmental, and/or financial performance.
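
Two of the definitions above lend themselves to a small illustration: a performance gap as the difference between desired and actual performance (the 100 gpm pump that needs to deliver 150 gpm), and the rule that one incident can be both an accident and a near miss. The following is only a minimal sketch under stated assumptions. The names PerformanceGap and classify_incident, and the two yes/no inputs, are hypothetical and do not come from the handbook, which does not prescribe any code or numeric model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerformanceGap:
    """Hypothetical model of an EPG or FLPPG as desired minus actual performance."""
    item: str
    desired: float  # performance that is desired (not merely designed)
    actual: float   # performance actually delivered

    @property
    def size(self) -> float:
        # Size of the gap, in the same units as the inputs.
        return self.desired - self.actual

    @property
    def exists(self) -> bool:
        # Assumption: higher numbers mean better performance.
        return self.actual < self.desired


def classify_incident(has_consequences: bool, could_have_been_worse: bool) -> list:
    """Return the labels that apply under the glossary definitions.

    Both arguments are judgment calls made by the investigator; reducing
    them to booleans is a simplification made for this sketch.
    """
    labels = []
    if has_consequences:
        labels.append("accident")
    if could_have_been_worse:
        labels.append("near miss")
    return labels


if __name__ == "__main__":
    # Pump designed for 100 gpm, but 150 gpm is needed in an emergency.
    pump = PerformanceGap("emergency pump", desired=150.0, actual=100.0)
    print(pump.exists, pump.size)
    # A small spill that could reasonably have been much larger.
    print(classify_incident(has_consequences=True, could_have_been_worse=True))
```

Comparing against desired rather than designed performance is what lets the pump example register as a gap even though the pump works exactly as designed.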