Memory Protection with Dynamic Authentication Trees

Matthew Millar  
Department of Computer Engineering  
Rochester Institute of Technology  
xmx1898@rit.edu

Marcin Łukowiak  
Department of Computer Engineering  
Rochester Institute of Technology  
xmleec@rit.edu

Stanisław Radziszowski  
Department of Computer Science  
Rochester Institute of Technology  
spr@cs.rit.edu

Abstract—As embedded devices increase in use and handle more critical information and functionalities, the importance of security grows even greater. Defense against bus attacks such as spoofing, splicing, and replay attacks is of particular concern. Traditional memory authentication techniques, such as hashes and message authentication codes, require significant amounts of on-chip memory and introduce a large performance impact when protecting off-chip memory during run-time. Balanced authentication trees such as the well-known Merkle tree or TEC-Tree can be used to reduce this cost. This work proposes a new method of dynamic authentication trees, which updates a tree structure based on a processor's memory access pattern. An HDL model for use in an FPGA has been developed as a transparent and highly customizable AXI-4 memory controller. The performance of our tree design is comparable to that of the TEC-Tree in several memory access patterns. Speedup over the TEC-Tree is possible to achieve when applied in scenarios that frequently access previously processed data.

Index Terms—embedded security, memory authentication, FPGA

I. INTRODUCTION

Embedded devices that process sensitive data and provide essential services have become increasingly common as modern digital infrastructures have grown at a rapid pace [1]. This raises security concerns as there are more reasons than ever for a malicious party to try to exploit these systems. In an attempt to provide security, encryption methods are oftentimes applied to any sensitive data to provide confidentiality [2]. However, data confidentiality itself is not enough to fully protect a system from an attacker. For example, erroneous data may be injected into the system in an attempt to disrupt the normal functionality of the device. In order to protect against such attacks, it is necessary to verify that any data the device processes is provided by the expected source. In addition, there needs to be a method of ensuring that the data has not been tampered with. Methods of authentication are then used to confirm the integrity of data processed in the system. Existing authentication methods, such as hashes and message authentication codes (MACs), are able to provide the intended protection; however, some of these methods are costly in terms of the device resources and performance overhead required to implement them.

In this paper we present a new method of dynamic authentication trees, which update a tree structure based on a processor’s memory access patterns. An AXI-4 based framework is developed as a transparent and highly customizable memory controller. This design is then synthesized onto an FPGA and verified. While this design is motivated by vulnerabilities in embedded systems, the method presented is scalable and may be applied to any computing system. Different configuration options are provided allowing for the design to be used in applications with different constraints and requirements.

The remainder of this paper is organized as follows. Section II details background information on the threat model and current authentication techniques. Section III outlines the proposed design. Section IV analyses the design and presents the results of the implementation. Section V provides overall conclusions.

II. BACKGROUND

A. Threat Model

In many applications, the physical security of the computing device is either too costly or too difficult to properly achieve. This is especially true of applications that require the use of small-scale embedded systems. In our threat model, it is assumed that the CPU in such a system is secure and trusted as presented in Figure 1. Additionally, it is expected that the OS running on the processor is trusted and that the on-chip caches cannot be monitored or tampered with.

Attacks on memory are performed when an adversary has the capability to read or modify any data traveling between the processor and off-chip memory. All data and addresses sent on this memory bus are exposed to examination and modification. Of particular concern are hardware man-in-the-middle (MITM) attacks such as spoofing, splicing, and replay attacks [3]. In the simplest of attacks, an adversary may probe the memory bus, reading and writing data as it is transmitted to and from RAM.

Figure 1. Region of Protection
B. Authentication

Protection against the previously mentioned MITM attacks requires the additional step of data authentication as well as encryption. A summary of a few existing authentication techniques is presented in Table I.

<table>
<thead>
<tr>
<th>Performance Overhead</th>
<th>Hashes</th>
<th>MACs</th>
<th>AREA</th>
<th>Auth Trees</th>
</tr>
</thead>
<tbody>
<tr>
<td>Off-chip Overhead</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Medium</td>
</tr>
<tr>
<td>On-chip Overhead</td>
<td>High</td>
<td>High</td>
<td>None</td>
<td>Low</td>
</tr>
<tr>
<td>Provides Encryption</td>
<td>No</td>
<td>No</td>
<td>Yes</td>
<td>Varies</td>
</tr>
</tbody>
</table>

**Hashes.** One of the simplest forms of data authentication can be performed using cryptographic hashes. A hash function creates a unique fixed-size output given a unique variable-sized input [4]. A block of data can be hashed, and that hash can be stored securely on-chip in order to prevent attackers from accessing it. Every time data is read, a newly computed hash of the data can then be compared with the hash stored on-chip.

**Message Authentication Codes.** Message Authentication Codes are an authentication technique that utilizes a keyed hash function. This is very similar to the previous technique, with the difference that only a secret key needs to be stored securely from the attacker [5].

**Block-level AREA Authentication.** Using the diffusion properties of block-level encryption [6], the block-level Added Redundancy Explicit Authentication (AREA) scheme is able to achieve encryption and authentication [7]. In the AREA technique, a unique nonce is appended to the end of a plaintext block, and the entire block is encrypted. On decryption, this added nonce is checked in order to determine if the data block has been corrupted. Traditionally, in implementations of a block-level AREA authentication technique, the nonce and the secret key must be stored securely while the ciphertext is visible to an attacker. An encryption mode that has an error propagation property, such as the AES in the cipher block chaining (CBC) mode, must be used.

**Authentication Trees.** Authentication trees are used as an attempt to limit authentication overhead compared to other techniques. The fundamental design of this method is a tree structure that stores additional encrypted metadata [8].

**TEC-Trees.** A TEC-Tree is a balanced authentication tree that protects data nodes using the Block-level AREA method. To ensure data integrity, each node contains a nonce that is verified after decryption. These nonces are generated using a secure on-chip counter [9].

**Dynamic Authentication Trees.** Most authentication tree methods rely on static balanced tree structures. A balanced tree approach may result in excessive overhead for access patterns that access the same address space frequently. A Dynamically Skewed Authentication Tree attempts to increase the performance of traditional authentication tree methods by reorganizing the structure of the tree at runtime. Shifting frequently accessed nodes to higher levels allows for less tree traversal time by reducing the number of verification computations and intermediate node accesses [10].

III. OUR DESIGN

A. Authentication

Modified block-level added redundancy explicit authentication (block-level AREA) scheme is employed in our ordered Dynamic Authentication Tree (DAT) method. Utilizing this approach has the benefit of providing encryption inherently with the authentication operations [7].

B. Tree Nodes

The leaf nodes of the authentication tree structure contain the data blocks that are directly protected by the tree. These nodes are referred to as “data nodes,” and are separated into two parts: the data that is being protected and the nonce. The nonce contains the tree metadata required for tree traversal and a count of the number of times the node has been accessed. Each data node protects a block of data that can be any length specified by the application. Intermediate tree nodes contain the same metadata used for tree traversal, as well as the counts of each of their child nodes. These intermediate nodes are referred to as counter nodes. Bottom part of Figure 2 shows the general structure of a data node and a counter node as used by our ordered dynamic authentication tree (DAT).

C. Tree Structure

The structure of the tree depends on the weighted frequency of accesses to each data node. When a memory access is performed, the entire data block including the nonce, is first read and decrypted. The corresponding parent node is then also read and decrypted. This parent node holds the access count of its children for comparison to apply the block-level AREA scheme and provide authentication. This authentication process is repeated recursively until the final root of the tree is reached. Secure on-chip storage is used to store a master counter that is incremented on every write operation and used
to generate new nonces. For authentication purposes, the root counter node’s count is compared to this secure root counter. If at any point, the authentication fails due to a counter mismatch, the processor can be alerted to an authentication error. Otherwise, the requested data operation is performed as usual. To prevent additional unnecessary performance overhead, data access frequencies are only updated on a write operation. In a write operation, the data node’s frequency count itself must first be updated. As the tree is traversed upwards, each parent node’s stored left and right counts are updated accordingly to their child nodes’ weights. In addition, the parent’s frequency access count is incremented. An example of a tree structure with additional metadata information is presented in Figure 2.

On a read operation for every level higher a data node is stored, it requires one less counter node to be read and compared before it is fully authenticated. This differs from other authentication tree methods which rely upon a balanced tree that is agnostic to the number of times a data node has been accessed. Given most memory access patterns, the more often a data block is accessed, it is more likely that the node will be accessed again in the future. Thus weighting data nodes and restructuring the tree as such should provide a substantial improvement in performance for data blocks that are accessed often.

D. Dynamic Restructuring

As memory blocks are accessed, the tree structure is dynamically updated to relocate memory blocks with a higher frequency of access closer to the tree’s root. To ensure that authentication checks function properly, the count of each node isincremented, and its parent node’s left or right counter is similarly updated as each node is accessed. When a node is accessed, it is accessed along with its parent, sibling, and uncle in order to perform the updates and navigate the tree. An uncle node refers to a parent’s sibling node. The weight of the uncle node is compared to the new weight of the current node. The tree is flagged for rebalancing if the current node’s new weight is greater than the uncle node’s weight. There are two approaches to rebalancing. The first method permits the tree’s leaf nodes to be rearranged, but their order is not preserved. The second approach adds the restriction of requiring the leaf node order to remain constant. The ordered tree design’s rebalancing procedure consists of three different possible cases that are described below. The first rebalancing case is described by Algorithm 1.

By shifting the parent node to the side and the uncle node down a level, the current node being accessed is shifted up a level. To perform these operations, the parent and sibling relationship stored in each node is able to be updated with the new case. It is necessary to read and decrypt the parent, uncle, sibling, and current node data from memory to access all of the information required to perform these operations. If it is not desired to retain the order of the data nodes, this rebalancing algorithm is the only one necessary to use. However, if the order of the nodes must remain the same, this case is only valid if the current node’s left/right position is different than its uncle node’s left/right position. If the first rebalancing method were to be used, the leaf node order would be modified, thus invalidating the ordered design. Algorithm 2 shows the second rebalancing method used for handling this additional scenario. The second rebalancing method requires the information from the grand uncle and grand parent nodes in addition to the information from the first method. When the current node is shifted upwards, the parent node is updated with the grand

---

**Algorithm 1: Ordered Rebalancing Method #1**

**Data:** Parent, Uncle, Sibling, Current

if Current’s LR ≠ Uncle’s LR then
  // Shift Uncle down
  Uncle’s Parent ← Parent;
  Uncle’s Sibling ← Sibling;
  // Rotate Parent
  Parent’s sibling ← Current;
  Parent’s LR ← ¬ Parent’s LR;
  // Update and Rotate Sibling
  Sibling’s sibling ← Uncle;
  Sibling’s LR ← ¬ Sibling’s LR;
  // Shift Current Node Up
  Current’s Parent ← Parent’s Parent;
  Current’s Sibling ← Parent;
end

---

**Algorithm 2: Ordered Rebalancing Method #2**

**Data:** Parent, Uncle, Sibling, Current, Grand Uncle, Grand Parent

if (Current’s LR = Uncle’s LR) ∧ (Current’s LR ≠ Grand Uncle’s LR) then
  // Shift Parent Up with Current Node
  Parent’s Parent ← Grand Uncle’s Parent;
  Parent’s Sibling ← Grand Uncle’s Sibling;
  // Update Uncle
  Uncle’s Sibling ← Current Node;
  // Shift Grand Uncle Down
  Grand Uncle’s Parent ← Parent;
  Grand Uncle’s Sibling ← Sibling;
  // Update Grand Parent
  Grand Parent’s sibling ← Parent;
  // Shift Sibling Up with Current Node
  Sibling’s sibling ← Grand Uncle;
  Sibling’s LR ← ¬ Sibling’s LR;
  // Shift Current Node Up
  Current’s Parent ← Grand Parent;
  Current’s Sibling ← Uncle;
  Current’s LR ← ¬ Current’s LR;
end
the tree is checked to see if the data node has been accessed and a write request pipeline. The Tree Request Generator component that generates requests. Figure 3 depicts the result, the pipeline must return the data read from memory to the information that the TEC-Tree requires. Generally, a dynamic authentication tree requires additional computational overhead on each write transaction compared to the TEC-Tree as a recursive check for rebalancing needs to be applied. A counter node must be read for each level in addition to the information that the TEC-Tree requires. 

The root is stored in BRAM, which is embedded in the programmable logic, and can be considered secure against an attacker with physical access. If the root has not yet been accessed, the tree’s data must first be initialized. To correctly populate the initial tree data, a Tree Initializer component has been added.

### IV. Analysis & Results

In this section, we compare the results of the proposed design’s hardware implementation and two similar designs: the TEC-Tree and the Dynamically Skewed Tree. All three were integrated within the previously stated pipeline framework described by Figure 3.

#### A. Memory Overhead

The off-chip memory overhead is presented in Equation (1), where the variables $l_p$, $n$, and $A$ are the data payload size in bits, the bit length of nonces, and the arity of the tree respectively. For the arity of 2, the relationship between off-chip memory costs for our method, TEC-Tree, and Dynamically Skewed Tree is identical apart from the variation in nonce sizes [9], [12].

$$O = \frac{l_p + n \times A}{l_p(A - 1)}$$

(1)

The nonce sizes $n$ for each method are shown below using Equations (2), (3), and (4). Where $c$ is the bit length of the counters used for authentication, and $D$ is the total number of data nodes used in the tree.

$$n_{TEC} = c + \log_2 N$$

(2)

$$n_{skewed} = c + 3 \times \log_2 N$$

(3)

$$n_{DAT} = c + 3 \times \log_2 N$$

(4)

The proposed method’s off-chip memory cost is slightly increased relative to the TEC-Tree, but remains the same as the Dynamically Skewed Tree. This increase is due to additional stored metadata that is needed in order to traverse an unbalanced tree structure. The size of the counter value used for authentication and the root of the tree are customizable. Only a single counter per tree is required to be stored in on-chip memory. The on-chip memory cost is the same for all three methods.

#### B. Dynamic Tree Performance

A dynamic authentication tree requires additional computational overhead on each write transaction compared to the TEC-Tree as a recursive check for rebalancing needs to be applied. A counter node must be read for each level in addition to the information that the TEC-Tree requires. Generally, a dynamic tree rebalance does not happen frequently, especially with memory access patterns that focus on particular ranges of the whole memory address space. Due to the number of extra computations performed, it is expected that a memory access pattern that utilizes writes to memory more frequently

#### E. Encryption and Authentication Pipeline

Aiming to reduce design complexity, our implementation utilizes a modified version of the AXI-4 transparent memory encryption pipeline provided in [11]. In order to generate appropriate requests for new intermediate nodes in dynamic trees, tree metadata must first be read from memory. As a result, the pipeline must return the data read from memory to the component that generates requests. Figure 3 depicts the modified pipeline, which includes authentication components. The pipeline is divided into two paths: a read request pipeline and a write request pipeline. The Tree Request Generator component possesses a master state machine to determine which tree nodes must be accessed and updated. The root of the tree is checked to see if the data node has been accessed
Copyright © 2023, IEEE. All rights reserved.

Performance speedup occurs when a read or write transaction occurs on an address that has been shifted higher in the dynamic tree than it would be in the corresponding balanced tree structure. In order for a dynamic tree to perform better than the TEC-Tree design, this speedup must outweigh the slowdown caused by the additional tree calculations.

C. Implementation Results

The TEC-Tree, the Dynamically Skewed Tree, and our design were all implemented targeting the ZedBoard development kit with Xilinx Zynq xc7z020clg484-1 part, and Xilinx Vivado version 2020.1 tools. The TEC-Tree design was provided by [11]. The Dynamically Skewed Tree was manually developed and modeled in VHDL based on the framework outlined in [12]. The performance was evaluated with a clock frequency of 50 MHz. For programmable logic utilization comparisons, the total number of resources available for various Xilinx devices is displayed in Table II. Part xc7z2020 is the SoC used by the ZedBoard. Compared to some other FPGA parts available, the resources provided by the Zedboard are relatively small. The resulting utilization is presented in Table III.

As the simplest design, the TEC-Tree uses the fewest number of each component. Most of the differences in utilization can be accounted for by the added components for the dynamic tree request generation and initialization. While the TEC-Tree design can simply generate all necessary requests without reading data from memory, the other two designs require logic to both store read data and translate the read data to appropriate requests for tree traversal and modification. The Dynamically Skewed Tree design also requires fewer resources to implement than our ordered DAT. This can be attributed to the additional tree structure and rebalancing complexity that comes with the added condition of maintaining terminal nodes’ ordering. While there is only one rebalancing algorithm used in the Dynamically Skewed Tree design, the ordered design requires three different algorithms to be implemented based on the current state of the tree to be restructured. However, the overall usage for the ordered DAT is still notably low if compared to any of the FPGA models listed in Table II.

D. Memory Controller Performance

All three designs were evaluated for correctness and performance using the Xilinx AXI Verification IP (AXI VIP). The verification IPs ensure signal behavior and timing are appropriate for the AXI-4 interface. They do not, however, ensure data written to and read from memory represent proper values. A wrapper testbench was written in SystemVerilog to both control the AXI VIP’s and ensure the memory contained the correct data on a read and write operation. A master AXI VIP was used to simulate the processor sending initial read or write requests to the memory controller. A second AXI VIP was configured for slave responses in memory-mapped mode, allowing the IP to simulate a BRAM with appropriate timings. The designs for all three methods were evaluated for performance in various conditions with different memory access patterns. Each designs’ customizations were standardized to make direct performance comparisons easier. The master interface of the pipeline was configured to accept a

<table>
<thead>
<tr>
<th>Part</th>
<th>LUTs</th>
<th>Flip Flops</th>
<th>DSPs</th>
<th>BRAM Tiles</th>
</tr>
</thead>
<tbody>
<tr>
<td>xc7z2020</td>
<td>46,200</td>
<td>92,400</td>
<td>160</td>
<td>95</td>
</tr>
<tr>
<td>xczu9eg</td>
<td>274,080</td>
<td>548,160</td>
<td>2,520</td>
<td>912</td>
</tr>
<tr>
<td>xcvu9p</td>
<td>1,182,240</td>
<td>2,364,480</td>
<td>6,840</td>
<td>2,160</td>
</tr>
<tr>
<td>xc250</td>
<td>1,341,000</td>
<td>2,749,000</td>
<td>11,508</td>
<td>1,766</td>
</tr>
</tbody>
</table>

Table II

REFERENCE RESOURCES AVAILABLE FOR A SAMPLE SET OF XILINX FPGAS

<table>
<thead>
<tr>
<th>Part</th>
<th>LUTs</th>
<th>Flip Flops</th>
<th>DSPs</th>
<th>BRAM Tiles</th>
</tr>
</thead>
<tbody>
<tr>
<td>TEC-Tree</td>
<td>3270</td>
<td>8394</td>
<td>11649</td>
<td></td>
</tr>
<tr>
<td>Dynamically Skewed Tree</td>
<td>2893</td>
<td>5651</td>
<td>7582</td>
<td></td>
</tr>
<tr>
<td>Ordered DAT</td>
<td>0</td>
<td>15</td>
<td>21</td>
<td></td>
</tr>
</tbody>
</table>

Table III

AUTHENTICATION TREE IMPLEMENTATION RESULTS

<table>
<thead>
<tr>
<th></th>
<th>TEC-Tree</th>
<th>Dynamically Skewed Tree</th>
<th>Ordered DAT</th>
</tr>
</thead>
<tbody>
<tr>
<td>LUTs (Encryption)</td>
<td>9018</td>
<td>11251</td>
<td>14506</td>
</tr>
<tr>
<td>Flip Flops (Encryption)</td>
<td>4122</td>
<td>6419</td>
<td>8350</td>
</tr>
<tr>
<td>DSPs (Encryption)</td>
<td>0</td>
<td>15</td>
<td>21</td>
</tr>
<tr>
<td>BRAM Tiles (Encryption)</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>

Figure 3. Modified Memory Encryption and Authentication Pipeline [11]
data length of 32 bits, while the slave interface was configured for a data length of 64 bits. The size of the memory to be protected was configured for 256 MB and a data block size of 64 bytes for each design.

Because of the dynamic nature of the proposed tree designs, multiple scenarios were required to thoroughly investigate the various performances of the design. First, the designs were tested utilizing a fully random test suite. As each memory address is random, there is a randomly distributed spread of accesses throughout the entire tree. While this type of memory access pattern is very unrealistic in practical applications, this type of access pattern is the worst-case scenario for a dynamic tree which is important to note.

The second test applied to each authentication tree was designed to test the best-case scenario for a dynamic tree. For this test, only a single data node is accessed to allow the dynamic tree to restructure it to the top of the tree.

The third test was run to provide a realistic memory access pattern. In this test, contiguous memory was accessed in order to simulate reading and writing to a large array in memory. The previous test may provide an accurate representation of accessing a smaller array in memory, but it would require the array to remain within the size range of the tree’s data block configuration. Figures 4 and 5 contain the timing results of each test.

The necessity of embedded security grows as devices with crucial functions become more common. In this paper we presented a new ordered Dynamic Authentication Tree (DAT) method, which updates the tree structure based on memory access patterns. As expected, initial performance tests with simple synthetic benchmarks show that our method is comparable to TEC-Tree in some access patterns and it has potential to outperform it when it is necessary to frequently access previously processed data. Our future work will focus on adding tree caches to avoid high cost computations and memory accesses for frequently accessed memory cells, development and adopting verity of additional benchmark applications, optimizing the design to allow for using higher frequency clock, and to allow the framework to work with trees that have arities larger than 2.

**REFERENCES**


