Risk management of a manufacturing process requires a deep dive into the functional safety, cybersecurity and alarm management lifecycles. Each of these lifecycles is dictated by a different safety standard and traditionally carried out by different teams within an organization. With little communication between the groups, it is a challenge to account for all risks and create a comprehensive event response plan during plant operation.
By integrating the three automation lifecycles, it is possible to ensure awareness of all potential hazards and required risk reduction, improve efficiency and communication, and achieve a complete plant and enterprise view of risk management in an organization.
IEC 61511 FUNCTIONAL SAFETY LIFECYCLE
The international functional safety standard IEC 61511 provides the safety lifecycle as an established guideline to assess and mitigate risk in the process industries, including refineries and chemical, petrochemical, pulp and paper, and power plants.
Over time, the tasks of the functional safety lifecycle have been adopted internationally by the top companies in the process industry, creating a well-defined, streamlined work process meant to address process hazards. Traditionally, this work is carried out by the engineering team and is essential to implement a functionally safe system (Figure 1).
However, we now know that properly managing risk at a facility and throughout the company requires careful consideration of cyberattacks as well as process hazards. Indeed, the new revision of IEC 61511 released in 2016 highlights the need for a Cyber Risk Assessment, emphasizing the responsibility of the owner/operating company to identify the threat, likelihood and consequences of cybersecurity events. They must also determine requirements for additional risk reduction and implement measures to reduce or remove threats.
IEC 62443 CYBERSECURITY LIFECYCLE
It is no longer adequate for plant operators, engineers, design and support personnel to only be aware of process hazards and risk. Cyberattacks not only impact business from a financial perspective; they can also initiate process safety incidents. In response, IEC 62443 has presented a cybersecurity lifecycle.
The scope includes assessment of a system for inherent risk and subsequent design, implementation and maintenance of countermeasures against cyber threats. Traditionally, this work is carried out by operational technology (OT) teams, with help from information technology (IT) teams within an organization (Figure 2).
ISA 18.2 ALARM MANAGEMENT LIFECYCLE
Both safety and cyber lifecycles include implementation of safeguards or countermeasures against a hazard scenario. In many cases, these include alarms. The identification and rationalization of alarms are addressed in the Alarm Management lifecycle as defined by ISA 18.2 and IEC 62682. The full scope of this lifecycle also includes design and implementation of alarms and operation, maintenance, monitoring and management of change of the master alarm database for a system. Traditionally, this is carried out by the engineering and operations teams within an organization.
INTEGRATED FUNCTIONAL SAFETY, CYBERSECURITY AND ALARM MANAGEMENT LIFECYCLES
Each of these lifecycles has a similar structure that includes analysis or assessment of the system for inherent risk, and subsequent design, implementation and operation of safeguards or countermeasures against that risk. These similarities provide opportunities to leverage best practices to create one integrated work process that addresses functional safety, cybersecurity and alarm management. Integrating the lifecycles, and opening the lines of communication between the engineering, operations, operational technology and information technology teams, results in awareness of all potential hazards and required risk reduction as well as a comprehensive event response plan.
The lifecycles overlap for each of the following tasks:
- Hazard Identification
- Process Hazard Data to Alarm Rationalization
- Cyber Hazard Data to Alarm Rationalization
- Alarm Rationalization Process
- Process Hazard Data to Cyber Risk Assessment, SIL and SL Verification Process
- Event Response Management
- Certification of Devices, Elements, Systems and Personnel
Hazard Identification
In the functional safety lifecycle, Process Hazard Analysis (PHA) is often done using the HAZOP methodology. Here the process is divided into smaller parts called units and nodes. Any challenge to the process is a deviation. The cause and consequence of that deviation are documented, and risk is determined by the frequency of the cause and the severity of the consequence. For high-risk scenarios, safeguards are implemented to mitigate that risk. These safeguards may include alarms with operator intervention, pressure relief devices and safety instrumented functions.
The Cyber Risk Assessment is similar, with the system divided into smaller parts called cyber zones and cyber nodes. Any path that can be used to gain access is called a threat vector. The cause and consequence of the threat must be documented. Risk is determined by the likelihood of the threat and the severity of the consequence.
For high-risk scenarios, countermeasures can be implemented to mitigate the risk. These countermeasures may include alarms with operator intervention, network devices such as firewalls and switches with access controls, and physical security of engineering workstations, among others. Best practices are leveraged here by using the same methodology for assessment, as well as sharing findings and recommendations between the safety and cyber teams.
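The shared structure of the two assessments can be illustrated with a short Python sketch. It is only an illustration: the 1-to-5 category scales, the product-based risk score, the high-risk threshold and the example scenarios are all assumptions, not values from either standard. A real site would use its own calibrated risk matrix.

```python
from dataclasses import dataclass

# Assumed cut-off for "high risk" on a hypothetical 1-5 x 1-5 scale.
HIGH_RISK_THRESHOLD = 10


@dataclass
class Scenario:
    """One worksheet row from a PHA or a Cyber Risk Assessment."""
    source: str       # "PHA" or "CYBER"
    location: str     # HAZOP node, or cyber zone
    cause: str        # cause of the deviation, or threat vector
    consequence: str
    likelihood: int   # 1 (rare) .. 5 (frequent), assumed scale
    severity: int     # 1 (minor) .. 5 (catastrophic), assumed scale

    @property
    def risk_score(self) -> int:
        # Simple product ranking; many risk matrices use a lookup table instead.
        return self.likelihood * self.severity

    @property
    def needs_risk_reduction(self) -> bool:
        return self.risk_score >= HIGH_RISK_THRESHOLD


# Hypothetical scenarios, one worksheet format for both teams.
scenarios = [
    Scenario("PHA", "Node 3: reactor feed", "Control valve fails open",
             "Reactor overpressure", likelihood=3, severity=5),
    Scenario("CYBER", "Zone 2: control network",
             "Unauthorized access to engineering workstation",
             "Setpoint changed, reactor overpressure", likelihood=2, severity=5),
    Scenario("PHA", "Node 7: storage tank", "Level transmitter drifts",
             "Minor product loss", likelihood=2, severity=2),
]

high_risk = [s for s in scenarios if s.needs_risk_reduction]
for s in high_risk:
    print(f"{s.source:5} {s.location}: risk score {s.risk_score}")
```

Keeping process and cyber scenarios in one record type is what lets the safety and cyber teams rank, review and share findings in the same worksheet.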
Hazard Data to Alarm Rationalization
Each alarm safeguard or countermeasure accounted for in hazard identification must be included in the alarm rationalization process. The master alarm database design basis includes documentation of the cause, consequence, corrective action and time to respond for each alarm. Much of this information is already documented in the PHA or Cyber Risk Assessment. Cross-referencing the alarm rationalization with Safety and Cyber Analysis tasks improves traceability and clearly communicates design criteria.
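As a minimal sketch of that cross-reference, the Python below pre-fills an alarm rationalization record from an assessment row and keeps the scenario ID for traceability. The class names, field names, tag and scenario ID are hypothetical and do not come from ISA 18.2 or any particular alarm database product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HazardScenario:
    """A row already documented in the PHA or Cyber Risk Assessment."""
    scenario_id: str   # hypothetical worksheet row identifier
    source: str        # "PHA" or "CYBER"
    cause: str
    consequence: str


@dataclass(frozen=True)
class AlarmRecord:
    """Design-basis entry in the master alarm database."""
    tag: str
    cause: str
    consequence: str
    corrective_action: str
    time_to_respond_min: float
    source_scenario_id: str  # cross-reference back to the assessment


def rationalize_alarm(tag: str, scenario: HazardScenario,
                      corrective_action: str,
                      time_to_respond_min: float) -> AlarmRecord:
    """Reuse cause and consequence from the assessment instead of retyping them."""
    if time_to_respond_min <= 0:
        raise ValueError("time to respond must be positive")
    return AlarmRecord(
        tag=tag,
        cause=scenario.cause,
        consequence=scenario.consequence,
        corrective_action=corrective_action,
        time_to_respond_min=time_to_respond_min,
        source_scenario_id=scenario.scenario_id,
    )


# Hypothetical example data.
scenario = HazardScenario("PHA-N3-02", "PHA",
                          "Control valve fails open", "Reactor overpressure")
alarm = rationalize_alarm("PAH-101", scenario,
                          "Close manual block valve on reactor feed", 10.0)
print(alarm)
```

Only the corrective action and time to respond are entered during rationalization; the cause and consequence are inherited, so a later change to the assessment can be traced to every alarm that depends on it.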
Hazard Data to SIL and SL Verification
For safety and cyber, the next step uses frequency-based targeting to determine the design criteria for safeguards and countermeasures, respectively. For safety, a Layer of Protection Analysis (LOPA) is performed. In the LOPA, the frequency of each cause is multiplied by the probability of failure for each independent layer of protection, resulting in the mitigated frequency of the hazard scenario. This is compared to the tolerable frequency. If the mitigated frequency exceeds the tolerable frequency, the ratio between the two is the amount of additional risk reduction needed, which sets the Safety Integrity Level (SIL) required to design the Safety Instrumented System (SIS). SIL Verification calculations then confirm the conceptual design by ensuring the Safety Instrumented Functions (SIFs) meet the target SIL.
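The LOPA arithmetic can be sketched in a few lines of Python. The cause frequency, layer PFD and tolerable frequency below are hypothetical illustration values. The SIL bands are the low-demand risk reduction ranges (SIL 1 covers a risk reduction factor above 10 up to 100, and so on); treating a factor of 10 or less as "no SIL requirement" is a simplifying assumption of this sketch.

```python
import math

# Low-demand SIL bands as (exclusive lower RRF, inclusive upper RRF, SIL),
# where RRF = 1 / PFDavg.
SIL_BANDS = [
    (10, 100, 1),
    (100, 1_000, 2),
    (1_000, 10_000, 3),
    (10_000, 100_000, 4),
]


def mitigated_frequency(cause_frequency: float, layer_pfds) -> float:
    """Cause frequency (per year) times the PFD of each independent layer."""
    return cause_frequency * math.prod(layer_pfds)


def required_rrf(mitigated: float, tolerable: float) -> float:
    """Risk reduction factor still needed; 1 or less means no gap remains."""
    return mitigated / tolerable


def target_sil(rrf: float) -> int:
    """Map a required RRF to a SIL; 0 means below the SIL 1 band (assumed)."""
    if rrf <= 10:
        return 0
    for low, high, sil in SIL_BANDS:
        if low < rrf <= high:
            return sil
    raise ValueError("required risk reduction exceeds the SIL 4 band")


# Hypothetical scenario: cause occurs 0.1 times per year and one
# independent layer (operator response to an alarm) has a PFD of 0.1.
layers = {"operator response to alarm": 0.1}
mitigated = mitigated_frequency(0.1, layers.values())
rrf = required_rrf(mitigated, tolerable=2e-5)
print(f"mitigated frequency: {mitigated:.1e} per year")
print(f"required RRF: {rrf:.0f}, target SIL: {target_sil(rrf)}")
```

With these assumed numbers the mitigated frequency is about 0.01 per year, the remaining gap is a factor of about 500, and the SIF would need to be designed to SIL 2.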
A similar method is used for Security Level (SL) Verification of cyber countermeasures. In this case, the likelihood of each cyber threat is multiplied by the probability of failure of each countermeasure, resulting in the mitigated likelihood of the cyber event scenario. The intention is to close the gap between the mitigated likelihood and the target likelihood. This methodology is meant to ensure the countermeasures implemented can provide the required amount of risk reduction. Using a method similar to LOPA makes this a straightforward, efficient way to verify that the countermeasures meet the target security level.
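A minimal Python sketch of closing that gap follows. The threat likelihood, target likelihood and countermeasure failure probabilities are hypothetical placeholders chosen only to show the arithmetic; the mapping from a likelihood to a specific security level is not shown because the text does not define one.

```python
from math import prod


def mitigated_likelihood(threat_likelihood: float, failure_probs) -> float:
    """Threat likelihood (per year) times each countermeasure's failure probability."""
    return threat_likelihood * prod(failure_probs)


def meets_target(threat_likelihood: float, failure_probs, target: float) -> bool:
    return mitigated_likelihood(threat_likelihood, failure_probs) <= target


# Hypothetical scenario and countermeasures (all values assumed).
threat_likelihood = 0.5    # attempts per year via one threat vector
target_likelihood = 1e-2   # target for the cyber event scenario, per year
firewall_only = {"firewall with access controls": 0.1}
with_physical = {**firewall_only, "workstation physical security": 0.1}

for name, measures in [("firewall only", firewall_only),
                       ("firewall + physical security", with_physical)]:
    value = mitigated_likelihood(threat_likelihood, measures.values())
    ok = meets_target(threat_likelihood, measures.values(), target_likelihood)
    print(f"{name}: {value:.1e} per year, target met: {ok}")
```

In this assumed case the firewall alone leaves the scenario above the target likelihood, and adding the second countermeasure closes the gap.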
Event Response Management
Each lifecycle requires testing of safeguards, countermeasures and alarms prior to start-up. The Factory Acceptance Test (FAT) involves testing of equipment prior to field installation and includes verification that the application programs for SIF logic solvers and alarms, as well as the cybersecurity countermeasures, are implemented correctly.
Alarms are usually separate safeguards from the SIFs, which would include sensors, logic solvers and final elements that are usually remote-actuated valves. For a specific hazard scenario, an alarm could be the first layer of protection, and if the operator fails to respond, the SIF could open or close a specific valve to mitigate the hazard.
The Site Acceptance Test (SAT) involves testing of equipment after installation in the field. It includes verification that all safeguards and countermeasures are implemented correctly, that alarms trigger and are annunciated on the HMI, and that the operator has the means to respond successfully. It is more efficient to do this testing together, saving engineering hours while assuring all safeguards and countermeasures work.
Operation and maintenance of safeguards, countermeasures and alarms all include monitoring during operation, routine maintenance and testing, periodic assessment and the potential for modification. In all cases, demonstrating compliance with the applicable standard requires that data be collected throughout the life of the plant to validate the conceptual design. Storing all data in one centralized database will streamline evaluation of safeguard and countermeasure health, and validation of the design.
Finally, during operation of the plant, operators must have a comprehensive event-response plan. Their duties include keeping the plant online, maintaining physical security of the site and engineering station, responding to process hazards (including any demands on the process, proof testing and device failures) and responding to cyber hazards (cyber alarms, active and passive diagnostics). Integration of the lifecycles and communication between groups will give a full picture of the operator’s responsibilities and help ensure they are manageable. Since operator response is key to alarm layers of protection, this is of utmost importance.
Certification of Devices, Elements, Systems and Personnel
Personnel competency is a priority for all three lifecycles and is meant to ensure the right people are carrying out each task. Personnel capable of performing work may be identified through a documented competency assessment process, including auditable records of training and qualification assessments, witness of suitable performance by experienced individuals or external accreditation by a recognized professional body.
Functional Safety professionals can become Certified Functional Safety Experts (CFSE), and Cybersecurity professionals can become Certified Automation Cybersecurity Experts (CACE).
Devices, elements and systems should be certified by an accredited, independent third party. For functional safety, compliance with IEC 61508 is evaluated, including review of the development process, product testing and SIL capability. For cybersecurity, compliance with IEC 62443 is evaluated, including review of the development process, product testing and SL capability. In both cases, it is important to confirm that the certification body itself is accredited.
With an integrated automation lifecycle, each area of overlap represents an opportunity to leverage best practices from established work processes to improve efficiency and drive communication between different teams. This method promotes awareness of all potential hazards and required risk reduction, increases project velocity, and reduces project cost and schedule duration. Operational benefits include increased availability, reduced operation and maintenance cost and a comprehensive event response plan.
1. IEC 61511 (2.1 edition), Functional Safety: Safety Instrumented Systems for the process industry sector, 2017, International Electrotechnical Commission, Geneva, Switzerland.
2. IEC 62443, Security for industrial automation and control systems, 2018, International Electrotechnical Commission, Geneva, Switzerland.
3. ANSI/ISA-18.2-2016, Management of Alarm Systems for the Process Industries, 2016, International Society of Automation, Research Triangle Park, NC.
4. Hildenbrandt, K.M., van Beurden, I., Integration of Automation Lifecycles; How Functional Safety, Cybersecurity, and Alarm Management Work Together, presented at ISA Process Control and Safety Symposium, November 8, 2017, Houston, TX.