A Formal Model for Performance and Energy Evaluation of Embedded Systems
© Bruno Nogueira et al. 2011
Received: 2 June 2010
Accepted: 21 September 2010
Published: 29 September 2010
Embedded systems designers need to verify their design choices to find the proper platform and software that satisfy a given set of requirements. In this context, it is essential to adopt formal-based techniques to evaluate the impact of design choices on system requirements. To be useful, such techniques must produce accurate results with minimal computation time. This paper proposes an approach based on Coloured Petri Nets for evaluating embedded systems performance and energy consumption. In particular, this work presents a method for specifying and evaluating the workload and the platform components, such as processors and shared or private memories. The method is applied to model single processor and multiprocessor platforms. Experimental results demonstrate an average accuracy of 96% in comparison with the respective measures assessed from the real hardware platform.
The design of embedded systems usually must take into account several nonfunctional constraints, such as performance, size, weight, cost, reliability, and durability. The rapid growth of embedded systems in new application domains introduces new restrictions, which in turn raise new research and technical challenges. One prominent research area is related to battery-operated devices, in which energy consumption plays an important role. Low-power design has grown in importance with the proliferation of such devices. The main challenge is to reduce energy consumption without jeopardizing the performance requirements.
Modern embedded systems are composed of a set of interconnected processing, communication, and storage elements. Very often, these elements are integrated into a single circuit (System-on-Chip). Software (instructions streams/workload) executing on the processing elements drives the behavior of the system. In contrast to a desktop system, which executes a variety of workloads, normally embedded systems execute only one workload, repeatedly. The characteristics of the workload and the processing elements dictate the usage of communication and storage elements. In turn, the characteristics of the communication and storage elements influence the rate at which the workload is executed. Therefore, energy consumption and performance are a function of the characteristics of the workload and the architectural elements, and thus, estimating these metrics is not an ordinary task.
Given the wide range of platform options and software optimizations, designers need to verify their design choices to find the proper platform and software that satisfy a given set of requirements. Measurement of the actual performance and energy consumption characteristics on real hardware is often not feasible, since this would require the construction of a large number of hardware prototypes. In this context, many model-based approaches for estimating energy consumption and performance have been developed over the last years (e.g., [1–4]). Some of these model the energy consumption adopting cycle-level simulators (also known as architecture level model approach) [2, 5]. Despite providing very accurate estimates, the low abstraction level adopted by current approaches demands an enormous computational effort, which restricts the applicability for large codes.
This work presents a discrete event modeling strategy, based on the Coloured Petri Net (CPN) formalism, for performance and energy consumption evaluation of embedded systems using the architecture level model approach. In particular, this paper presents a novel method for specifying and evaluating the performance and energy consumption of embedded systems considering different configurations for the workload and the platform components, such as processors and memories. The method is applied to model a real platform, namely, the NXP LPC2106, and a theoretical multiprocessor platform. The high level of abstraction of the proposed models allows for fast but accurate estimates. Additionally, although specific platforms have been considered, the modeling approach can be easily applied to other architectures.
Petri Nets (PNs) are well suited to model computer architectures, since both parallelism and conflict, two important characteristics present in modern computer systems, are easily modeled using this formalism. Besides, PN extensions, such as CPN, have proven to be a powerful technique to evaluate performance indices in computer systems.
This paper is organized as follows. Section 2 presents related work. Section 3 introduces the concepts required for a better understanding of this work. Section 4 presents the proposed approach. Section 5 presents some experiments and Section 6 concludes the paper.
Many approaches have been conceived to model energy consumption in embedded systems. However, few consider multiprocessor architectures. The approaches can be generally classified into two main categories: (i) architecture level (or hardware level) models and (ii) instruction level models. Architecture level models calculate power and energy from detailed descriptions that may comprise circuit level, gate level, and register transfer (RT) level. Instruction level models deal only with instructions and functional units from the software point of view and without knowledge of the underlying hardware organization.
The first energy instruction model was introduced in [1, 10]. These works assign an energy cost to each instruction (or sequence of instructions). The cost per instruction is assessed by measuring the average current of the processor when it executes that instruction. Interinstruction effects are also considered. However, the time required to characterize an architecture is a great issue, since the number of measurements grows exponentially with the number of instructions in the Instruction Set Architecture (ISA).
Oliveira et al. proposed a simulation approach based on Coloured Petri Nets. That work proposed a stochastic model for the 8051 microcontroller instruction set. The method adopted CPN to model the control flow of a given application and assigned probabilities to conditional branch instructions, which were translated to CPN transition guard expressions. The main drawback of that strategy is the model complexity, which grows with the application size and therefore has a considerable negative impact on simulation time. Such an approach does not allow the evaluation of real-life complex applications or even reasonably sized programs. That method was later extended to simplify the model. Although the simulation time is significantly reduced, it is still heavily affected by the code size.
Another instruction level approach, known as functional-level power analysis (FLPA), has been introduced and subsequently extended. In this method, the processor is separated into functional blocks (such as fetch unit, processing unit, and internal memory). The power consumption of each block is characterized through mathematical functions obtained from several measurements and/or simulations. Thus, the power consumption is obtained by adding up the consumption of all blocks. Although it is very fast and has relatively good accuracy for estimating power consumption, the proposed analytical modeling presents some limitations for estimating execution time, which in turn affects the energy consumption estimation, as shown in the experimental results reported for that method.
Since existing approaches work at a very low level of abstraction (e.g., [2, 5]), architecture level models are known to be very time consuming. Besides, those approaches also need a low-level representation (such as RTL level) of the architecture to allow the power characterization. However, these details of implementation are rarely available for most commercial processors.
A stochastic discrete event system (SDES) is a system which occupies a single state for some duration of time, after which an atomic event causes an instantaneous state transition to occur. Such systems are called discrete event systems because their state does not change between subsequent events, whereas state changes occur continuously in a continuous event system. In SDES, stochastic delays (described by probability distribution functions) and probabilistic choices are used to model uncertainties in the system, which may be introduced by many factors such as unpredictable human actions and machine failures. Many SDES models have been developed, for instance, stochastic automata, queuing models, and stochastic Petri nets. In this work, Coloured Petri Nets (CPNs) and Discrete Time Markov Chains (DTMCs) are adopted to model, respectively, the platform and the workload. A comprehensive overview of the modeling possibilities with SDES is out of scope for this paper, but basic concepts are sketched. A much more thorough description of SDES is available in [13–15].
A DTMC is said to be time homogeneous if the one-step transition probability $p_{ij}(n) = P(X_{n+1} = j \mid X_n = i)$ is independent of the step $n$. In this work, we consider only time homogeneous DTMCs.
DTMCs can be represented by a directed graph, known as the state-transition diagram. The nodes represent the states of the DTMC and the edges, the transitions between the states labeled by the respective one-step transition probabilities.
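As an illustration of the state-transition diagram view, the following minimal sketch stores a time homogeneous DTMC as a matrix of one-step transition probabilities and lists the labeled edges of its diagram. The three-state chain and the function names (`is_stochastic`, `distribution_after`, `edges`) are hypothetical and are not taken from the paper.

```python
# Hypothetical 3-state DTMC: row i holds the one-step transition
# probabilities out of state i (values chosen only for illustration).
P = [
    [0.0, 1.0, 0.0],
    [0.3, 0.0, 0.7],
    [1.0, 0.0, 0.0],
]

def is_stochastic(matrix, tol=1e-9):
    """Check that all entries are non-negative and every row sums to 1."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )

def distribution_after(initial, matrix, steps):
    """State distribution after `steps` transitions of a time homogeneous DTMC."""
    n = len(matrix)
    dist = list(initial)
    for _ in range(steps):
        dist = [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return dist

def edges(matrix):
    """Edges of the state-transition diagram as (source, target, probability)."""
    return [
        (i, j, p)
        for i, row in enumerate(matrix)
        for j, p in enumerate(row)
        if p > 0
    ]
```

Because the chain is time homogeneous, the same matrix is applied at every step, so the distribution after several steps only requires repeated multiplication by `P`.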
To evaluate DTMC models, this work adopts the SHARPE tool.
A CPN is a bipartite directed graph, consisting of two types of vertices: (i) places (drawn as circles) and (ii) transitions (drawn as bars). Places model the states, and transitions represent the events of the system. In CPN, a transition is enabled (able to fire) when (i) it has one token of the proper type on each of its input arcs, and (ii) the guard (Boolean expression) attached to the transition holds. An enabled transition can fire, thereby removing tokens from its input places and generating tokens for its output places.
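The enabling and firing rule can be sketched in a few lines. This is a simplified illustration, not the semantics implemented by any CPN tool: arc expressions are reduced to fixed token colours, each input arc is assumed to consume a distinct token, and the data layout (a marking as a dictionary of `Counter` objects, a transition as a dictionary with `inputs`, `outputs`, and `guard`) is an assumption made for the example.

```python
from collections import Counter

def is_enabled(marking, transition):
    """Enabled if every input arc finds a token of the required colour
    and the guard evaluates to True on the current marking."""
    has_tokens = all(
        marking[place][colour] >= 1 for place, colour in transition["inputs"]
    )
    return has_tokens and transition["guard"](marking)

def fire(marking, transition):
    """Return the marking reached by firing; the input marking is not modified."""
    if not is_enabled(marking, transition):
        raise ValueError("transition is not enabled")
    new_marking = {place: Counter(tokens) for place, tokens in marking.items()}
    for place, colour in transition["inputs"]:
        new_marking[place][colour] -= 1   # consume one token per input arc
    for place, colour in transition["outputs"]:
        new_marking[place][colour] += 1   # produce one token per output arc
    return new_marking

# Hypothetical two-place net: move an instruction token from one stage
# to the next, but only if the next stage is currently empty (the guard).
marking = {"stage_a": Counter({"instr": 1}), "stage_b": Counter()}
move = {
    "inputs": [("stage_a", "instr")],
    "outputs": [("stage_b", "instr")],
    "guard": lambda m: sum(m["stage_b"].values()) == 0,
}
```

Calling `fire(marking, move)` yields a marking with the token in `stage_b`, after which `move` is no longer enabled because its input place is empty.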
The concept of hierarchical design is supported by CPN. The basic idea is to allow the construction of a large model by using a number of smaller models. These small models are called pages and are connected to each other by places called ports. Such places can be input or output types. It is also possible to use time in CPN models. Time is handled by introducing a global clock and allowing each token to carry a time stamp. A token cannot be used unless the value of the clock has passed or is equal to the value of the time stamp. Intuitively, each time stamp indicates the earliest time at which the token may be used.
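The time stamp rule can be illustrated with a minimal sketch. The function names and the default delay of one time unit are assumptions made for the example; they do not come from CPN Tools.

```python
def is_usable(timestamp, clock):
    """A token may be used once the global clock has reached its time stamp."""
    return clock >= timestamp

def advance_clock(clock, input_timestamps):
    """Earliest clock value at which all input tokens of a transition are
    usable; the clock never moves backwards."""
    return max([clock, *input_timestamps])

def output_timestamp(clock, delay=1):
    """Time stamp given to tokens created by a firing that occurs at `clock`."""
    return clock + delay

# Example: at clock 0, tokens stamped 0 and 3 are waiting on the input arcs.
clock = advance_clock(0, [0, 3])      # the transition can fire at time 3
stamp = output_timestamp(clock)       # new tokens become usable at time 4
```

In a simulation loop, the clock would be advanced to the smallest value at which some transition becomes enabled, which is how delays such as pipeline stage latencies are expressed.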
When the transition f2 is fired (see Figure 1(b)), a token is removed from place fetching and two tokens are created in places control and fd. The new tokens get a time stamp equal to the current time plus one. At this moment, transition f1 is enabled as well as transition d1. The simulation continues as long as enabled transitions can be found. As can be seen, the model structure makes it impossible for two instructions to occupy the places fetching or decoding at the same time. Additionally, the function dec() in the arc (d2, execute) generates instructions and puts them in the execute stage. This function will be explained in more detail in Section 4.1.
To assist our modeling we use CPN Tools, a mature and well-tested tool that supports editing, simulation, and analysis of CPN.
In this section, the proposed method is presented and applied to evaluating software applications running on the NXP LPC2106, an ARM7TDMI-S-based architecture.
The LPC2106 has 128 kB of on-chip FLASH and 64 kB of on-chip SRAM. It has an ARM7TDMI-S processor which enables system designers to build embedded devices requiring small size, low power, and high performance. This processor has a 32-bit RISC architecture that consists of a program control unit, an address generator, an integer data path, a general-purpose register bank, and a 3-stage pipeline. An important characteristic of the LPC2106 is an instruction prefetch module, known as the Memory Accelerator Module (MAM). The MAM is connected to the local bus and is placed between the FLASH memory and the ARM7TDMI-S core. Like a cache, the MAM attempts to prefetch the next instruction from the FLASH memory in time to prevent CPU fetch stalls.
In order to model the LPC2106 architecture, a library of generic blocks of CPN models has been constructed. These blocks can be combined in a bottom-up manner to model sophisticated behaviors. Modeling a complex architecture thus becomes a relatively simple process. The proposed CPN models are high-level representations that focus on what the architecture should perform instead of on how it is implemented. Moreover, it is important to stress that once constructed, a building block can be reused in other platform models.
The second difference is that there are additional transitions in the fetch/decode block that are responsible for exchanging the instructions in the fetch and decode stages for bubbles. These transitions become enabled when place control receives a token with colour flush, generated by the execute block when it simulates a branch instruction.
As stated earlier, the dec function generates instructions according to the frequency in which each instruction class is executed in the application under evaluation. Since this frequency distribution is dependent on a given software and input data, we devised a method for capturing this information. The method consists in mapping the application code (with annotations) into a DTMC. More specifically, the Control Flow Graph (CFG) of the application is mapped into an irreducible DTMC.
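The behaviour attributed to the dec function, drawing an instruction class according to its execution frequency, can be mimicked outside the CPN model as follows. This is only a Python analogue: the class names and frequencies are hypothetical, and the actual function in the model is part of the CPN description rather than Python code.

```python
import random

# Hypothetical frequency distribution over instruction classes (sums to 1).
CLASS_FREQUENCIES = {"alu": 0.55, "load_store": 0.30, "branch": 0.15}

def make_dec(frequencies, seed=None):
    """Build a generator function that returns one instruction class per call,
    drawn with probability equal to the class's execution frequency."""
    rng = random.Random(seed)
    classes = list(frequencies)
    weights = [frequencies[c] for c in classes]

    def dec():
        return rng.choices(classes, weights=weights, k=1)[0]

    return dec

dec = make_dec(CLASS_FREQUENCIES, seed=1)
sample = [dec() for _ in range(20000)]
observed_alu_share = sample.count("alu") / len(sample)
```

Over a long run, the observed share of each class approaches its configured frequency, which is what lets the platform model reproduce the instruction mix of the application without simulating its control flow.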
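Each edge of the CFG from a basic block $B_i$ to a basic block $B_j$ is labeled with a transition probability $p_{ij}$, which defines the probability of executing $B_j$ immediately after $B_i$. Such probabilities are obtained from annotations in the application code.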
Given the average number of times each basic block executes, the frequency with which each instruction is executed can be obtained, and hence the execution frequency of each class.
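The chain from DTMC to class frequencies can be sketched as below. Several things here are assumptions made for illustration: the three-block CFG, its branch probabilities, the per-block instruction counts, the edge from the exit block back to the entry block (used to make the chain irreducible), and the normalization of visits by the entry block's stationary probability. The paper evaluates the DTMC with SHARPE; the iteration on the lazy chain used here is merely a simple stand-in that has the same stationary distribution.

```python
from collections import Counter

# Hypothetical CFG: B0 (entry) -> B1 (loop body, repeats with prob. 0.9) -> B2 (exit).
# The edge B2 -> B0 restarts the program, which makes the chain irreducible.
P = [
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [1.0, 0.0, 0.0],
]
# Hypothetical number of instructions of each class in each basic block.
BLOCK_CLASSES = [
    {"alu": 2, "load_store": 1},
    {"alu": 3, "load_store": 2, "branch": 1},
    {"alu": 1, "branch": 1},
]

def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Stationary probabilities of an irreducible DTMC, found by iterating
    the lazy chain (I + P) / 2, which converges even for periodic chains."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        moved = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        new_pi = [0.5 * pi[j] + 0.5 * moved[j] for j in range(n)]
        if max(abs(a - b) for a, b in zip(new_pi, pi)) < tol:
            return new_pi
        pi = new_pi
    raise RuntimeError("stationary distribution did not converge")

def visits_per_run(pi, entry=0):
    """Average executions of each block per program run, assuming the
    entry block executes exactly once per run."""
    return [p / pi[entry] for p in pi]

def class_frequencies(visits, block_classes):
    """Execution frequency of each instruction class, normalized to sum to 1."""
    totals = Counter()
    for count, classes in zip(visits, block_classes):
        for name, instructions in classes.items():
            totals[name] += count * instructions
    grand_total = sum(totals.values())
    return {name: value / grand_total for name, value in totals.items()}

pi = stationary_distribution(P)
visits = visits_per_run(pi)
frequencies = class_frequencies(visits, BLOCK_CLASSES)
```

For this example the loop body is visited ten times per run, so its instructions dominate the resulting class frequencies.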
The evaluation is made by means of simulation. The facilities of CPN Tools have been adopted to define analysis functions and to perform data collection. Basically, two performance metrics were defined: (i) the average execution time per instruction and (ii) the average energy consumption per instruction. Given these metrics, the number of executed instructions in the application, and the processor's operating frequency, the overall energy consumption and execution time of an application are obtained.
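The final calculation can be sketched as follows, under the assumption that the average execution time per instruction is expressed in clock cycles (so that the operating frequency converts it to seconds) and the average energy per instruction in joules. The function name and the numeric values are hypothetical.

```python
def application_estimates(instructions, cycles_per_instruction,
                          energy_per_instruction_j, clock_hz):
    """Overall execution time (s) and energy (J) of an application, derived
    from per-instruction averages and the number of executed instructions."""
    execution_time_s = instructions * cycles_per_instruction / clock_hz
    energy_j = instructions * energy_per_instruction_j
    return execution_time_s, energy_j

# Hypothetical values: 1,000,000 instructions, 1.5 cycles and 2 nJ per
# instruction on average, with a 60 MHz clock.
time_s, energy_j = application_estimates(1_000_000, 1.5, 2e-9, 60e6)
average_power_w = energy_j / time_s
```

With these illustrative numbers the estimate is 25 ms of execution time and 2 mJ of energy.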
Firstly, a breakpoint monitor was defined and assigned to the last transition in the execute block. This transition is fired by all instruction classes. The breakpoint monitor collects data and tests whether the metrics satisfy the stop criterion. If so, the simulation stops; otherwise, the simulation continues. To calculate the metrics, two values are collected on each firing of the transition linked with the breakpoint monitor: (i) the interval firing time, that is, the current time minus the last firing time, and (ii) the interval energy consumption, that is, the current global energy consumption minus the global energy consumption at the last firing. We designed a set of statistical functions so that a confidence interval for the metrics could be constructed. The stop criterion states that the simulation stops when the confidence intervals of both metrics satisfy the specified precision. The precision is specified by two parameters: (i) the confidence level and (ii) the relative error. This work adopted a confidence level of 95% and a maximum relative error of 2%.
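A stop criterion of this kind can be sketched as below. The paper does not detail its statistical functions, so this example assumes a normal-approximation confidence interval for the mean, treats the collected intervals as independent samples, and defines the relative error as the interval half-width divided by the mean. The names `relative_error`, `should_stop`, and `min_samples` are hypothetical.

```python
import math
import statistics

Z_95 = 1.96  # normal quantile for a 95% confidence level (approximation)

def relative_error(samples, z=Z_95):
    """Half-width of the confidence interval for the mean, relative to the mean."""
    mean = statistics.mean(samples)
    if mean == 0:
        return math.inf
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return half_width / abs(mean)

def should_stop(time_samples, energy_samples, max_rel_error=0.02, min_samples=30):
    """Stop only when both metrics reach the requested precision."""
    if min(len(time_samples), len(energy_samples)) < min_samples:
        return False  # too few observations for the approximation to be sensible
    return (relative_error(time_samples) <= max_rel_error
            and relative_error(energy_samples) <= max_rel_error)
```

The breakpoint monitor would call such a check after each firing of the monitored transition, appending the new interval firing time and interval energy consumption to the two sample lists beforehand.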
This section describes the measuring method adopted to obtain the energy consumption and execution time values employed in the proposed models. To capture the average energy consumption of each functional unit defined in the model, assembly codes that separately stimulate each functional unit of the LPC2106 were implemented, uploaded to the platform, executed, and measured; the data obtained were then statistically analyzed. For example, to capture the average power consumption when a MAM miss occurs, an assembly code that forces MAM misses was designed.
The AMALGHMA (Advanced Measurement Algorithms for Hardware Architectures) tool has been implemented for automating the measuring activities. AMALGHMA adopts a set of statistical methods, such as bootstrap and parametric methods, which are important in the measurement process because of several sources of error, for instance, (i) oscilloscope resolution and (ii) resistor error. Besides, the results estimated by AMALGHMA were compared with and validated against the LPC2106 datasheet as well as the ARM7TDMI-S reference manual.
An additional contribution of this work was the development of a computational tool to automate some steps of the proposed methodology. The tool was named PECES (Performance and Energy Consumption Evaluation of Embedded Systems). It receives the annotated source code and the architecture model as input and returns the average execution time and energy consumption as output.
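To give an idea of how a bootstrap method handles noisy measurements, the sketch below computes a percentile bootstrap confidence interval for a mean. It is a generic textbook procedure, not a description of AMALGHMA's algorithms, and the power readings are made-up values.

```python
import random
import statistics

def bootstrap_mean_ci(samples, confidence=0.95, resamples=2000, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.mean(rng.choices(samples, k=n)) for _ in range(resamples)
    )
    alpha = 1.0 - confidence
    lower = means[int((alpha / 2) * resamples)]
    upper = means[min(resamples - 1, int((1 - alpha / 2) * resamples))]
    return lower, upper

# Hypothetical power readings in mW for one stimulated functional unit.
readings_mw = [99.2, 100.1, 100.4, 99.8, 100.9, 99.5, 100.2, 100.0, 99.7, 100.6]
low, high = bootstrap_mean_ci(readings_mw)
```

The bootstrap makes no assumption about the shape of the measurement error distribution, which is useful when instrument resolution and component tolerances distort the readings.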
The following steps are performed by PECES to evaluate a code.
(1) It compiles the application source code using the option to generate intermediate assembly code. GCC (arm-uclibc-gcc) has been adopted as the compiler.
(2) PECES builds the Control Flow Graph (CFG) using the intermediate code generated in the previous step.
(3) It uses the CFG and the annotations from the source code to generate the corresponding irreducible DTMC.
(4) The DTMC is numerically evaluated in SHARPE, so as to obtain the stationary probabilities.
(5) It uses the stationary probabilities to calculate the average number of executions of each basic block and, then, the number of times each instruction is executed. Next, PECES clusters instructions from the same class and calculates the frequency with which each class executes.
(6) The frequency distribution is written into the architecture model.
(7) PECES invokes the Access/CPN tool to simulate the architecture model.
(8) Finally, the tool uses the average execution time per instruction and the average energy consumption per instruction obtained from the previous step to calculate the average execution time and the energy consumption.
Several case studies were conducted to evaluate the proposed estimation methods. The case studies consist of (i) Motorola's Powerstone benchmark suite codes (adpcm, bcnt, and fdct), (ii) common search/ordering/signal processing algorithms (binarysearch, bubblesort, and convolution), (iii) a customized example, and (iv) a real-world biomedical application (a pulse oximeter). The pulse oximeter case study is composed of three concurrent tasks; hence it has been divided into three separate experiments. All experiments were performed on an Intel Core 2 Duo 1.67 GHz with 2 GB RAM running Windows Vista.
[Table columns: Execution Time (μs), Energy Consumption (μJ). Table body not recoverable.]
[Caption: Simulation time comparison.]
Code optimizations, such as loop unrolling and function inlining, have proven to be successful techniques to improve the system performance. A very useful application for the proposed method is to verify the effect of these common code optimizations on system energy consumption. The bubblesort experiment has been used to demonstrate how such what-if analysis may be carried out.
The proposed method is also useful when it comes to evaluating code operation scenarios, such as best-case, average-case, and worst-case scenarios. The bubblesort code has been used to evaluate such application.
[Caption: Bubblesort typical scenarios results. Columns: Execution Time (μs), Energy Consumption (μJ).]
[Caption: Multiprocessor evaluation results. Columns: Evaluation time (s), Energy consumption (μJ), Execution time (μs).]
This work presented a method for evaluating energy consumption and performance in embedded systems. The proposed method adopts Coloured Petri Nets for modeling the functional behavior of processors and memory architectures at a high level of abstraction. Further, the workload under evaluation is mapped into the hardware model to carry out the performance and energy consumption estimation. A tool, named PECES, was implemented for automating the method. Additionally, a measuring platform, named AMALGHMA, was constructed for characterizing the platform and for comparing the measured values with the results provided by the proposed method.
This work adopted a real-world embedded platform as case study, and the experimental results show that the proposed approach may be used to ensure a rapid and reliable feedback to the designer. Besides, applications of the method, such as the modeling of multiprocessor architectures, were demonstrated. As future work, we plan to improve PECES for helping the designer in the platform model construction and to validate the method in other architectures.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.