This is a writeup of a talk given during the Real-Time workshop at ROSCon 2019. The complete slides of the talk are available at https://aliasrobotics.com/realtimesecurity.pdf.
The focus of the presentation was to share a few of the lessons our team learned over the past years on real-time and security when applied to robotics.
1. Real-time or real-fast
The first point involved challenging our understanding of real-time. In control theory, we usually associate real-time with meeting specific deadlines and ensuring that latencies remain under those deadlines, whatever they might be.
More formally, the literature defines real-time as:
(A) Real-time control system means that the control system must provide the control responses or actions to the stimulus or requests within specific times, ...
Surprisingly (or maybe not?), when looking for a definition and common understanding of "real-time" in the security field, something quite different comes up. Security researchers often assume that real-time implies extremely low latencies, or what people in the control community more commonly refer to as "real-fast". Here's a definition taken from the web page of Veracode (a security company):
... Real-time, zero-latency technologies capable of detecting attacks ...
This misconception is somewhat concerning. Having security researchers focus on protecting low latencies, while interesting, is far from enough for robotic systems. Real-time performance is tightly bound to meeting deadlines. To us, real-time security or "security applied to real-time" should focus on employing a series of security practices that ensure robotic systems meet their deadlines regardless of the activity of malicious actors. This implies analyzing robotic systems from the ground up (refer to our threat modeling article), understanding the attack vectors, applying mitigations and making compromises when necessary.
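The difference between real-time and real-fast can be illustrated with a minimal sketch. The two latency traces and the 1 ms deadline below are hypothetical values chosen for illustration, not measurements from the talk: one trace has a very low average but a single large outlier, while the other is slower on average but never exceeds the deadline.

```python
from statistics import mean


def deadline_report(latencies_ms, deadline_ms):
    """Summarize a latency trace against a deadline (all values in ms)."""
    misses = sum(1 for latency in latencies_ms if latency > deadline_ms)
    return {
        "mean_ms": mean(latencies_ms),
        "worst_ms": max(latencies_ms),
        "misses": misses,
        "meets_deadline": misses == 0,
    }


# Hypothetical traces, for illustration only.
real_fast = [0.1] * 99 + [5.0]  # low average latency, one large outlier
real_time = [0.8] * 100         # higher average latency, but bounded

fast = deadline_report(real_fast, deadline_ms=1.0)
bounded = deadline_report(real_time, deadline_ms=1.0)

print("real-fast:", fast)
print("real-time:", bounded)
```

Under these assumptions the "real-fast" trace wins on average latency yet misses the deadline once, while the "real-time" trace never does. It is the worst case, not the average, that decides whether a system is real-time.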
2. The relevance of hardware for real-time
Our second lesson learned refers to the importance of hardware. We've often seen robotics startups build their solutions on top of platforms and technology not meant for safety-critical scenarios, which in most cases also demand security. Understanding the criticality of your application is a must, and mapping it to the right hardware architecture and configuration should be the starting point of any embedded project.
Real-time systems need to be approached bottom-up, covering both hardware and software. The same applies to security: one needs to ensure end-to-end compliance across all layers.
3. The link layer
The third lesson shared refers to the Link Layer (OSI stack layer 2). Roboticists tend to focus on layers 3-7 of the OSI stack and disregard the channel. This has been the case for several robots that hit the market over the last decade and that use Ethernet (both MAC and PHY).
Ethernet is not real-time capable. It wasn't designed to provide responses within specific times. It's important to acknowledge this fact.
Many of us disregard this for a variety of reasons, most often "because it is not the main bottleneck". Is this true? Can we disregard the latencies caused by the link layer? Are they negligible compared to the ones caused by the upper layers of the OSI stack?
We presented some of the experimental findings we obtained over the past year:
The picture above shows that a saturated 1 Gbps Ethernet link can lead to latencies above 2 ms. This is undesirable and, depending on the environment where your robot operates, might render your communications channel unusable for real-time communications.
From a security perspective, an attacker with access to the local network might easily render your Ethernet-based robotic system unusable by dumping traffic into the channel and breaking real-time. It does not even need to be a malicious attacker. Sudden bursts of traffic caused by workers or other departments connected to the same network might have a similar effect.
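A back-of-the-envelope calculation helps explain how a saturated link reaches millisecond latencies. The sketch below is an illustrative model, not the methodology behind the measurements above: it assumes a single FIFO egress queue filled with full-size frames, and all function names are hypothetical.

```python
import math

# Bytes a full-size Ethernet frame occupies on the wire:
# 1500 payload + 14 header + 4 FCS + 8 preamble/SFD + 12 inter-frame gap.
WIRE_BYTES_FULL_FRAME = 1500 + 14 + 4 + 8 + 12


def serialization_time_us(wire_bytes, link_bps):
    """Time in microseconds needed to put wire_bytes onto the link."""
    return wire_bytes * 8 / link_bps * 1e6


def queueing_delay_us(frames_ahead, link_bps=1e9,
                      wire_bytes=WIRE_BYTES_FULL_FRAME):
    """Delay seen by a frame that waits behind frames_ahead queued frames."""
    return frames_ahead * serialization_time_us(wire_bytes, link_bps)


def frames_to_reach_us(delay_us, link_bps=1e9,
                       wire_bytes=WIRE_BYTES_FULL_FRAME):
    """Smallest number of queued frames that adds at least delay_us."""
    per_frame = serialization_time_us(wire_bytes, link_bps)
    return math.ceil(delay_us / per_frame)


per_frame_us = serialization_time_us(WIRE_BYTES_FULL_FRAME, 1e9)
frames_for_2ms = frames_to_reach_us(2000.0)
print(f"{per_frame_us:.3f} us per full-size frame at 1 Gbps")
print(f"{frames_for_2ms} queued frames add at least 2 ms")
```

Under these assumptions each full-size frame takes about 12.3 microseconds at 1 Gbps, so a backlog of roughly 160 frames (about 250 KB of buffered traffic) in front of a control message is already enough to add 2 ms.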
In past work, we instead made use of TSN standards, particularly the Time-Aware Shaper (TAS), which allowed us to mitigate congestion points even with the network 90% saturated.
The figure above shows how the latency is contained at the microsecond level using the TSN TAS. This makes it possible to bound latencies and mitigate the impact of undesired traffic or malicious attackers.
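The idea behind a time-aware shaper can be sketched as a cyclic gate schedule in which each traffic class may only transmit during its own time windows. The sketch below is a simplified model: the 1 ms cycle, the window sizes and the class names are hypothetical (not the configuration used in the experiments), and guard bands and frame transmission times are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    start_us: float
    end_us: float
    traffic_class: str


# Hypothetical 1 ms cycle: 200 us reserved for critical traffic,
# the remaining 800 us left to best-effort traffic.
CYCLE_US = 1000.0
SCHEDULE = [
    Window(0.0, 200.0, "critical"),
    Window(200.0, 1000.0, "best_effort"),
]


def wait_for_gate_us(arrival_us, traffic_class,
                     schedule=SCHEDULE, cycle_us=CYCLE_US):
    """Time until the gate for traffic_class is open for a frame
    arriving at arrival_us (0.0 if the gate is already open)."""
    t = arrival_us % cycle_us
    waits = []
    for window in schedule:
        if window.traffic_class != traffic_class:
            continue
        if window.start_us <= t < window.end_us:
            return 0.0
        # Distance to the next opening of this window, wrapping the cycle.
        waits.append((window.start_us - t) % cycle_us)
    if not waits:
        raise ValueError(f"no window for traffic class {traffic_class!r}")
    return min(waits)


worst_wait_us = max(
    wait_for_gate_us(float(t), "critical") for t in range(int(CYCLE_US))
)
print(f"worst-case gate wait for critical traffic: {worst_wait_us} us")
```

In this model the worst-case wait for critical traffic depends only on the schedule (here, the 800 us best-effort window), not on how much best-effort traffic is injected. That is the property that keeps latency bounded under saturation.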
Wrapping up lesson 3, link layer optimization can be critical and, if not handled, could lead to latencies above 2 ms in saturated Ethernet-based networks. Selecting a good hardware platform that allows you to use real-time oriented link layers (such as TSN) is important. One such platform offering mixed-criticality capabilities is Xilinx's Zynq SoC family, which offers flexibility and adaptability across a variety of robotics application scenarios.
4. RTOS and networking stack
The fourth lesson learned referred to the Real-Time Operating System (RTOS) and its networking stack. It's common knowledge that the vanilla Linux kernel is not real-time capable. Having an RTOS, running either directly on the microprocessor or microcontroller or through a layer of abstraction (such as a hypervisor), becomes critical. Yet this is not enough for meeting deadlines in communication across different machines. One also needs to care about the networking stack that runs on top of the RTOS.
The default networking stack configuration of the Linux kernel does little to help achieve bounded latencies. During this short talk, we shared some of the lessons learned while fine-tuning the Linux kernel and its networking stack for real-time. After some optimizations (described in the paper) we show empirically how one can obtain latencies below 1 ms, even in scenarios with background traffic of up to 100 Mbps.
Ensuring resilience of communications, even when faced with bursts of background traffic, is critical to secure our (robotic) distributed systems.
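Claims about bounded latency are only as good as the measurements behind them. As a rough sketch of this kind of empirical check (this is not the setup used in the paper), the snippet below times UDP datagrams that a socket sends to itself over the loopback interface and compares the worst sample against a bound. The function name, sample count and 1 ms bound are assumptions for illustration. A meaningful test would run between separate machines, with background traffic and a much larger sample count.

```python
import socket
import time


def udp_loopback_latencies_us(samples=100, payload=b"x" * 64):
    """Time, in microseconds, from sending a UDP datagram to receiving it
    back on the same socket over the loopback interface."""
    latencies = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("127.0.0.1", 0))   # let the OS pick a free port
        sock.settimeout(1.0)          # never block forever on a lost datagram
        address = sock.getsockname()
        for _ in range(samples):
            start = time.perf_counter_ns()
            sock.sendto(payload, address)
            sock.recvfrom(2048)
            latencies.append((time.perf_counter_ns() - start) / 1000.0)
    return latencies


BOUND_US = 1000.0  # hypothetical 1 ms bound
measured = udp_loopback_latencies_us(samples=50)
worst = max(measured)
print(f"worst-case: {worst:.1f} us, within bound: {worst <= BOUND_US}")
```

What matters is the worst-case sample, not the average. On a non-tuned kernel the worst case typically grows with system load, which is exactly what kernel and networking stack tuning aims to contain.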
5. Communications middleware and ROS 2
On top of the RTOS and its networking stack, it's worth discussing the communications and robotics middleware. I briefly touched on ROS 2 and DDS and showed a little of how we managed to obtain 2 ms (end-to-end, two-way) bounded latencies for distributed communications, enabling real-time communications on top of ROS 2.
6. Synchronization
Synchronization is often disregarded, yet it's critical for achieving bounded latencies. This is the topic of the sixth lesson learned.
The figures below show how unsynchronized systems might lead to arrival time offsets (from the expected period) of up to 100 ms. That's huge for robotic systems!
In the paper described below, we show how, using PTP (Precision Time Protocol), we managed to obtain sub-microsecond synchronization, leading to arrival time offsets that are almost negligible compared to the robotics middleware latencies in the figures above.
Synchronization is therefore critical and should be secured. As shown experimentally, attacks on the synchronization system could lead to increased latencies of up to 100 ms, causing robotic systems to behave erratically due to unmet deadlines in their control systems. It's critical, then, to maintain an up-to-date, bug-free and attack-resilient version of PTP while keeping in mind the overall attack vectors that could affect synchronization.
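To see why PTP is sensitive to attacks, it helps to look at the standard delay request-response computation, which assumes that the path delay is the same in both directions. The sketch below uses hypothetical timestamps in nanoseconds and is not the attack evaluated in our experiments. It first recovers a 100 ns clock offset over a symmetric 500 ns path, then shows what happens when an attacker delays only the master-to-slave direction by 200 ns.

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Standard PTP delay request-response computation.

    t1: Sync sent by the master (master clock)
    t2: Sync received by the slave (slave clock)
    t3: Delay_Req sent by the slave (slave clock)
    t4: Delay_Req received by the master (master clock)
    Returns (offset of the slave clock from the master, one-way path delay),
    assuming the path delay is symmetric.
    """
    master_to_slave = t2 - t1
    slave_to_master = t4 - t3
    offset = (master_to_slave - slave_to_master) / 2
    delay = (master_to_slave + slave_to_master) / 2
    return offset, delay


# Hypothetical scenario (ns): slave clock is 100 ns ahead, path delay 500 ns.
honest = ptp_offset_and_delay(t1=1000, t2=1600, t3=2000, t4=2400)

# Same scenario, but the Sync message is held back 200 ns by an attacker.
attacked = ptp_offset_and_delay(t1=1000, t2=1800, t3=2000, t4=2400)

print("honest   (offset, delay):", honest)
print("attacked (offset, delay):", attacked)
```

With a symmetric path the true offset and delay are recovered. With the asymmetric delay, half of the injected delay ends up in the offset estimate, so the slave "corrects" its clock away from the master without any message being modified.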
7. Real-time security
Last but certainly not least are the lessons learned on security. A real-time system that is left unprotected offers a wide attack surface to malicious actors. Ensuring availability, integrity and confidentiality is key to maintaining a real-time system's behavior over time.
A quick literature search shows some prior results that illustrate the impact of introducing several security mechanisms on top of ROS 2. Not surprisingly, applying these security mechanisms has a significant impact on latency. One of the sources estimates it at up to an order of magnitude.
This simply means that roboticists aiming to secure their real-time systems will need to reach a compromise between meeting their deadlines and applying the protection mechanisms that ensure these real-time systems remain secure over time.
Historically, it's been shown that industrial systems not designed to be secure can hardly be patched appropriately for security. Examples include buses such as MODBUS, CAN or even EtherCAT, where adding a security layer would disrupt latencies significantly.
Moreover, there's no single security solution that fits all use cases. What works (security-wise) for a robotic manipulator might not apply to a self-driving car. Each system should be analyzed and the requirements judged for each of its subsystems, reaching compromises between "mitigating", "removing" and "accepting" threats.
The way forward requires one to analyze systems bottom-up from the very beginning with a security mindset. This is often called threat modeling (see our previous article on threat modeling).
A threat model identifies threats that apply to the robot and/or its components (both software and hardware) and provides recommendations to mitigate them. A threat model should focus on identifying attack vectors and is often a precondition for performing a full pentesting assessment involving red teams (teams that attack the system).
Once you have a good understanding of your system's security landscape (through a threat model), you could then allocate resources and budget to specific attack vectors in order to mitigate those attack avenues that you most care about.
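As a toy illustration of allocating a limited budget across attack vectors, the sketch below ranks threats by a simple likelihood-times-impact score and mitigates them greedily until the budget runs out. The threat entries, the scoring scale, the costs and the greedy strategy are all hypothetical simplifications. A real threat model involves far more context than three numbers per threat.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threat:
    name: str
    likelihood: int         # 1 (rare) to 5 (expected); hypothetical scale
    impact: int             # 1 (minor) to 5 (safety-critical); hypothetical
    mitigation_cost: float  # arbitrary budget units

    @property
    def risk(self):
        return self.likelihood * self.impact


def plan(threats, budget):
    """Greedy plan: visit threats from highest to lowest risk, mitigate
    while the budget allows, and accept whatever cannot be afforded."""
    decisions = {}
    for threat in sorted(threats, key=lambda t: t.risk, reverse=True):
        if threat.mitigation_cost <= budget:
            decisions[threat.name] = "mitigate"
            budget -= threat.mitigation_cost
        else:
            decisions[threat.name] = "accept"
    return decisions


# Hypothetical entries, loosely based on the lessons above.
threats = [
    Threat("link-layer traffic flooding", likelihood=4, impact=5, mitigation_cost=3.0),
    Threat("PTP delay attack", likelihood=2, impact=5, mitigation_cost=4.0),
    Threat("eavesdropping on ROS 2 topics", likelihood=3, impact=2, mitigation_cost=2.0),
]

decisions = plan(threats, budget=5.0)
for name, decision in decisions.items():
    print(f"{name}: {decision}")
```

With a budget of 5 units, this toy plan mitigates the flooding and eavesdropping threats and accepts the PTP delay attack. Accepted threats stay documented rather than forgotten, which keeps the assessment repeatable when the budget or the system changes.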
From our experience so far, complete security isn't achievable. Security is not a product, it's a process. It needs to be continuously and periodically assessed.
Alias' mission is to remove 0-days from robotics. We focus on collaborating with manufacturers and end-users to ensure their robotic systems remain secure.