Virtual Prototyping and Performance Analysis of Two Memory Architectures
© H.S. Muhammad and A. Sagahyroon. 2009
Received: 26 February 2009
Accepted: 24 December 2009
Published: 31 March 2010
The gap between CPU and memory speed has always been a critical concern that motivated researchers to study and analyze the performance of memory hierarchical architectures. In the early stages of the design cycle, performance evaluation methodologies can be used to leverage exploration at the architectural level and assist in making early design tradeoffs. In this paper, we use simulation platforms developed using the VisualSim tool to compare the performance of two memory architectures, namely, the Direct Connect architecture of the Opteron, and the Shared Bus of the Xeon multicore processors. Key variations exist between the two memory architectures and both design approaches provide rich platforms that call for the early use of virtual system prototyping and simulation techniques to assess performance at an early stage in the design cycle.
Due to the rapid advances in circuit integration technology, and to optimize performance while maintaining acceptable levels of energy efficiency and reliability, multicore technology, or the Chip-Multiprocessor, is becoming the technology of choice for microprocessor designers. Multicore processors provide increased total computational capability on a single chip without requiring a complex microarchitecture. As a result, simple multicore processors have better performance per watt and area characteristics than complex single core processors.
A multicore architecture has a single processor package that contains two or more processors. All cores can execute instructions independently and simultaneously. The operating system will treat each of the execution cores as a discrete processor. The design and integration of such processors with transistor counts in the millions poses a challenge to designers given the complexity of the task and the time to market constraints. Hence, early virtual system prototyping and performance analysis provides designers with critical information that can be used to evaluate various architectural approaches, functionality, and processing requirements.
In these emerging multicore architectures, the ability to analyze (at an early stage) the performance of the memory subsystem is of extreme importance to designers. The latency resulting from accesses to the different levels of memory reduces processing speed, causing more processor stalls while the data or instruction is being fetched from the main memory. The way in which multiple cores send data to and receive data from the main memory greatly affects the access time and thus the processing speed. In multicore processors, two approaches to memory subsystem design have emerged in recent years, namely, the AMD DirectConnect architecture and the Intel Shared Bus architecture [2–5]. In the DirectConnect architecture, a processor is directly connected to a pool of memory using an integrated memory controller. A processor can access the other processors' memory pool via a dedicated processor-to-processor interconnect. On the other hand, in Intel's dual-core designs, a single shared pool of memory is at the heart of the memory subsystem. All processors access the pool via an external front-side bus and a memory controller hub.
In this work, virtual system prototyping is used to study the performance of these alternatives. A virtual systems prototype is a software-simulation-based, timing-accurate, electronic systems level (ESL) model, used first at the architectural level and then as an executable golden reference model throughout the design cycle. Virtual systems prototyping enables developers to accurately and efficiently make the painful tradeoffs between that quarrelling family of design siblings: functionality, flexibility, performance, power consumption, quality, cost, and so forth.
Virtual prototyping can be used early in the development process to better understand hardware and software partitioning decisions and determine throughput considerations associated with implementations. Early use of functional models to determine microprocessor hardware configurations and architectures, and the architecture of an ASIC in development, can aid in capturing requirements and improving functional performance and expectations.
In this work, we explore the performance of the two memory architectures introduced earlier using virtual prototyping models built from parameterized library components which are part of the VisualSim environment. Essentially, VisualSim is a modeling and simulation CAD tool used to study, analyze, and validate specifications and verify implementations at early stages of the design cycle.
This paper is organized as follows: in Section 2 we provide an overview of the two processors and the corresponding memory architectures. Section 3 introduces the VisualSim environment as well as the creation of the platform models for the processors. Simulation results and their analysis form Section 4 of this paper. Conclusions are summarized in Section 5.
AMD's Direct Connect Architecture, used in the design of the dual core AMD Opteron, consists of three elements:
(i) an integrated memory controller within each processor, which connects the processor cores to dedicated memory,
(ii) a high-bandwidth HyperTransport Technology link which goes out to the computer's I/O devices, such as PCI controllers,
(iii) coherent HyperTransport Technology links which allow one processor to access another processor's memory controller and HyperTransport Technology links.
The Crossbar switch and the SRQ are connected to the cores directly and run at the processor core frequency. After an L1 cache miss, the processor core sends a request to the main memory and the L2 cache in parallel. The main memory request is discarded in case of an L2 cache hit. An L2 cache miss results in the request being sent to the main memory via the SRQ and the Crossbar switch. The SRQ maps the request to the nodes that connect the processor to the destination. The Crossbar switch routes the request/data to the destination node or the HyperTransport port in case of an off chip access.
Each Opteron core has a local on-chip L1 and L2 cache and is connected to the memory controller via the SRQ and the Crossbar switch. Apart from these external components, the core consists of 3 integer and 3 floating point units along with a load/store unit that executes any load or store microinstructions sent to the core. Direct Connect Architecture can improve overall system performance and efficiency by eliminating traditional bottlenecks inherent in legacy architectures. Legacy front-side buses restrict and interrupt the flow of data. Slower data flow means slower system performance. Interrupted data flow means reduced system scalability. With Direct Connect Architecture, there are no front-side buses. Instead, the processors, memory controller, and I/O are directly connected to the CPU and communicate at CPU speed.
Since the L3 cache is shared, each core is able to access almost all of the cache and thus has access to a larger amount of cache memory. The shared L3 cache provides better efficiency than a split cache since each core can now use more than half of the total cache. It also avoids the coherency traffic between caches in a split approach.
At the heart of the simulation environment is the VisualSim Architect tool. It is a graphical modeling tool that allows the design and analysis of "digital, embedded, software, imaging, protocols, analog, control-systems, and DSP designs". It has features that allow quick debugging with a GUI and a software library that includes various tools to track the inputs/stimuli and enable a graphical and textual view of the results. It is based on a library of parameterized components including processors, memory controllers, DMA, buses, switches, and I/Os. The blocks included in the library reduce the time spent on designing the minute details of a system and instead provide a user friendly interface where these details can be altered by just changing their values and not the connections. Using this library of building blocks, a designer can, for example, construct a specification level model of a system containing multiple processors, memories, sensors, and buses.
Once a model is constructed, various scenarios can be explored using simulation. Parameters such as inputs, data rates, memory hierarchies, and speed can be varied and by analyzing simulation results engineers can study the various trade-offs until they reach an optimal solution or an optimized design.
The key advantage of the platform model is that the behavior algorithms may be upgraded without affecting the architecture they execute on. In addition, the architecture could be changed to a completely different processor to see the effect on the user's algorithm, simply by changing the mapping of behavior to architecture. The mapping is just a field name (string) in a data structure transiting the model.
Models of computation in VisualSim support block-oriented design. Components called blocks execute and communicate with other blocks in a model. Each block has a well-defined interface. This interface abstracts the internal state and behavior of a block and restricts how a block interacts with its environment. Central to this block-oriented design are the communication channels that pass data from one port to another according to some messaging scheme. The use of channels to mediate communication implies that blocks interact only with the channels they are connected to and not directly with other blocks.
In VisualSim, the simulation flow can be explained as follows: the simulator translates the graphical depiction of the system into a form suitable for simulation execution and executes simulation of the system model, using user specified model parameters for simulation iteration. During simulation, source modules (such as traffic generators) generate data structures. The data structures flow along to various other processing blocks, which may alter the contents of the data structures and/or modify their path through the block diagram. In VisualSim, simulation continues until there are no more data structures in the system or the simulation clock reaches a specified stop time. During a simulation run, VisualSim collects performance data at any point in the model using a variety of prebuilt probes to compute a variety of statistics on performance measures.
This project uses the VisualSim (VS) Architect tool to carry out all the simulations and run the benchmarks on the modeled architectures. The work presented here utilizes the hardware architecture library of VS that includes the processor cores, which can be configured as per our requirements, as well as bus ports, controllers, and memory blocks.
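The simulation flow described above can be illustrated with a minimal discrete-event loop. This is a sketch in plain Python and not VisualSim's API: the function `simulate`, its parameters, and the delay values are hypothetical, chosen only to show data structures (here called tokens) leaving a source, passing through processing blocks, and the run ending when no tokens remain or the stop time is reached. Contention between tokens is deliberately not modeled.

```python
import heapq

def simulate(source_times, stage_delays, stop_time):
    """Run a tiny discrete-event simulation (illustrative only).

    source_times: times at which a source block emits a token.
    stage_delays: delay added by each processing block, in order.
    stop_time:    simulation clock limit.
    Returns a list of (token_id, completion_time) for finished tokens.
    """
    # Event queue entries: (time, token_id, index of the next block to visit).
    events = [(t, i, 0) for i, t in enumerate(source_times)]
    heapq.heapify(events)
    finished = []
    while events:  # stop when no data structures remain in the system
        time, token, stage = heapq.heappop(events)
        if time > stop_time:
            break  # the simulation clock reached the stop time
        if stage == len(stage_delays):
            finished.append((token, time))  # token has left the last block
            continue
        # The block "processes" the token by delaying it, then forwards it.
        heapq.heappush(events, (time + stage_delays[stage], token, stage + 1))
    return finished

if __name__ == "__main__":
    print(simulate([0.0, 1.0, 2.0], [0.5, 1.5], stop_time=3.5))
```

In this toy run the third token is still in flight when the stop time is reached, so only the first two are reported as finished.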
Simulation model parameters (table only partially recovered).
Bus/clock entries: 1066 MHz (Width = 4 B); 2 GHz (Width = 4 B)
Rows whose values did not survive: L1 Cache Speed; L2 Cache Speed; I1 Cache Size; D1 Cache Size
L2 Cache Size: 4 MB (2 MB per core); 4 MB (shared cache)
The basic architecture of the simulated AMD dual core Opteron contains two cores with three integer execution units, three floating point units, and two load/store and branch units to the data cache. Moreover, the cores contain 2 cache levels with 64 kB of L1 data cache, 64 kB of L1 instruction cache, and 1 MB of L2 cache.
In the above model, the two large blocks numbered 4 and 5, respectively, are the Processor cores connected via bus ports (blocks 6) to the System Request Queue (block 7), and then to the Crossbar switch (block 8). The Crossbar switch connects the cores to the RAM (block 9) and is programmed to route the incoming data structure to the specified destination and then send the reply back to the requesting core.
On the left, the block 2 components contain the input tasks for the two cores. These blocks define software tasks (benchmarks represented as a certain mix of floating point, integer, and load/store instructions) that are input to both processors (Opteron and Xeon) in order to test their memory hierarchy performance. The following subsections give a detailed description of each of the blocks, their functionalities, and any simplifying assumptions made to model the memory architecture.
The architecture setup block configures the complete set of blocks linked to a single Architecture_Name parameter found in most blocks. The architecture setup block of the model (block 1) contains the details of the connections between the field mappings of the Data Structure attributes as well as the routing table that contains any of the virtual connections not wired in the model. The architecture setup block also keeps track of all the units that are part of the model, and its name has to be entered into each block that is part of the model.
Core and Cache
Each core of Opteron implemented in the project using VS is configured to a frequency of 2 GHz and has 128 kB of L1 cache (64 kB data and 64 kB instruction cache), 2 MB of L2 cache, and the floating point, integer, and load/store execution units. This 2 MB of L2 cache per core is compatible with the 4 MB of shared cache used in the Intel Xeon memory architecture. The instruction queue length is set to 6 and instructions are included in the instruction set of both the cores, so as to make the memory access comparison void of all other differences in the architectures. These instructions are defined in the instruction block that is further described in a later section.
Certain simplifications have been made to the core of the Opteron in order to focus the analysis entirely on the memory architecture of the processor. These assumptions include the change of the variable length instructions to fixed length micro-ops. Another assumption made is that an L1 cache miss does not result in a simultaneous request being sent to the L2 cache and the RAM. Instead, the requests are sent sequentially: an L1 cache miss results in an L2 cache access, and an L2 cache miss then results in a DRAM access.
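The sequential lookup assumed in the model can be sketched as a small latency calculation. The function names and the cycle counts below (3, 12, and 100 cycles) are hypothetical placeholders, not values taken from the VisualSim models; the sketch only shows how the latencies of the levels accumulate when each level is tried in turn.

```python
def access_latency(l1_hit, l2_hit, l1_cycles=3, l2_cycles=12, dram_cycles=100):
    """Cycles for one access under sequential lookup: L1, then L2, then DRAM.

    The default cycle counts are illustrative assumptions only.
    """
    cycles = l1_cycles
    if l1_hit:
        return cycles
    cycles += l2_cycles          # L1 miss: the request goes to L2 next
    if l2_hit:
        return cycles
    return cycles + dram_cycles  # L2 miss: the request finally goes to DRAM

def average_latency(l1_hit_ratio, l2_hit_ratio, **lat):
    """Expected cycles per access given L1 and local L2 hit ratios."""
    l1_miss = 1.0 - l1_hit_ratio
    l2_miss = 1.0 - l2_hit_ratio
    return (l1_hit_ratio * access_latency(True, True, **lat)
            + l1_miss * l2_hit_ratio * access_latency(False, True, **lat)
            + l1_miss * l2_miss * access_latency(False, False, **lat))

if __name__ == "__main__":
    print(average_latency(0.5, 0.5))
```

Under the parallel scheme of the real Opteron, the DRAM request would already be in flight during the L2 lookup, so the sequential model can only overestimate the latency of an L2 miss.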
Crossbar Switch and the SRQ Blocks
Main Memory and the Memory Controller
In the simulation model, the RAM has a capacity of 1 GB and has an in-built memory controller configured to run at a speed of 1638.4 MHz. At this speed and a block width of 4 bytes, the transfer of data from the memory to the cache takes place at a speed of 6.4 GB/s. This rate is commonly used in most of the AMD Opteron processors but can be different depending on the model of the processor. The same rate is also used in the model of the Xeon processor. Each instruction that the RAM executes is translated into delay specified internally by the memory configurations. These configurations are seen in Figure 7 in the Access_Time field as the number of clock cycles spent on the corresponding task.
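The quoted transfer rate follows from the controller speed and the block width. The short calculation below is a sketch; the function name is hypothetical, and the convention that 1 GB/s equals 1024 MB/s is an assumption inferred from the numbers in the text, since it is the convention under which 1638.4 MHz and 4 bytes give 6.4 GB/s.

```python
def transfer_rate_gb_per_s(clock_mhz, width_bytes):
    """Peak transfer rate in GB/s, assuming 1 GB/s = 1024 MB/s."""
    # MHz times bytes per transfer gives millions of bytes per second (MB/s).
    mb_per_s = clock_mhz * width_bytes
    return mb_per_s / 1024.0

if __name__ == "__main__":
    # Memory controller speed and block width as stated in the text.
    print(transfer_rate_gb_per_s(1638.4, 4))
```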
The basic architecture of the Intel Dual Core Xeon is illustrated in Figure 2. The corresponding platform model is depicted in Figure 8. The two cores of the processor are connected to the shared L2 cache and then via the Front-Side-Bus (FSB) interface to the SDRAM. The modeled Intel Xeon processor consists of two cores with three integer execution units, three floating point units, and two load/store and branch units to the data cache. The same specifications used to model the Opteron cores in VisualSim are used here as well. In addition, each core is configured with 64 kB of L1 data cache and 64 kB of L1 instruction cache, whereas the L2 cache is a unified cache and is 4 MB in size. The FSB interface, as seen in Figure 8, was constructed using the Virtual Machine block in VS and is connected to the internal bus which links the two cores to the RAM via the FSB. The software generation block (block 2 on the left side of the model) contains the same tasks as the Opteron model.
The architecture setup block of the Xeon model (Figure 8, block 1) is the same as the one implemented in the Opteron model, and the field mappings of the Data Structure attributes are copied from the Opteron model to ensure that no factors other than the memory architecture affect the results.
Core and Cache
The core implementation of the Xeon is configured using VS to operate at a frequency of 2 GHz and has 128 kB of L1 cache (64 kB data and 64 kB instruction cache), 4 MB of unified and shared L2 cache, and floating point, integer, and load/store execution units. Here as well, the instruction queue length is set to 6 and instructions are included in the instruction set of both the cores, so as to make the memory access comparison void of all other differences in the architectures. These instructions are defined in the instruction block that is described in a later section.
Certain simplifications have been made to the core of the Xeon in order to focus the analysis entirely on the memory architecture of the processor. The assumption made in accessing the memory is that an L1 cache miss does not result in a simultaneous request being sent to the L2 cache and the RAM. Instead, the requests are sent sequentially: an L1 cache miss results in an L2 cache access, and an L2 cache miss then results in a RAM access.
To simplify the model and the memory access technique, the process of snooping is not implemented in this simulation, and similar to the Opteron, no parallel requests are sent to two memories.
The pipeline of the modeled Xeon consists of four stages (similar to the Opteron model): prefetch, decode, execute, and store. The prefetch of each instruction begins from the L1 cache and ends in a RAM access in case of L1 and L2 cache misses. The second stage in the pipeline is the decode stage, which is mainly translated into a wait stage. The third stage, the execution stage, takes place in the five execution units that are present in the cores. Finally, after execution, the store (write-back) stage writes the specified data back to the memory, mainly the L1 cache.
Caching Bridge Controller (CBC)
The CBC, block 7 of the model, is simply a bridge that connects the shared L2 cache to the FSB. This FSB then continues the link to the RAM (block 9), from which accesses are made, and the data or instruction read is sent to the core that requested it. The CBC model is developed using the VisualSim scripting language and simulates the exact functionality of a typical controller.
Main Memory and the Memory Controller
Following a series of experimental tests and numerical measurements using benchmarking software, published literature [14–16] discusses the performance of the AMD Opteron when compared to the Xeon processor using physical test beds comprised of the two processors. These three references provide the reader with a very informative and detailed comparison of the two processors when subjected to various testing scenarios using representative loads.
In this work, we are trying to make the case for an approach that calls for early performance analysis and architectural exploration (at the system level) before committing to hardware. The memory architectures of the above processors were used as a vehicle. We were tempted to use these architectures by the fact that there were published results that clearly show the benefits of the Opteron memory architecture when compared to the Xeon FSB architecture and this would no doubt provide us with a reference against which we can validate the simulation results obtained using VisualSim.
Additionally, and to the best of our knowledge, we could not identify any published work that discusses the performance of the two memory architectures at the system level using an approach similar to the one facilitated by VisualSim.
Using VisualSim, a model of the system can be constructed in a few days. All of the system design aspects can be addressed using validated parametric library components. All of the building blocks, simulation platforms, analysis, and debugging required to construct a system are provided in a single framework.
Synopsys integrated Cossap (dynamic data flow) and SystemC (digital) into System Studio, while VisualSim combines SystemC (digital), synchronous data flow (DSP), finite state machine (FSM), and continuous time (analog) domains. Previous system level tools typically supported a single, specific modeling domain. Furthermore, relative to prior generations of graphical modeling tools, VisualSim integrates as many as thirty bottom-up component functions into single system level, easy to use, reusable blocks or modules.
In the work reported here, simulation runs are performed using a Dell GX260 machine with a P4 processor running at 3.06 GHz and 1 GB of RAM.
Benchmark tasks (table columns: model task name, actual task name).
Task latencies and Cycles/Task.
Figures 11 and 12 show a graph of processors' stall times. In both cases, 20 samples are taken during the entire simulation period and the data collected is used in the depicted graphs. During the execution of all the tasks, the maximum time for which the processors stalled was different for each kind of architecture. The maximum stall time for the DirectConnect architecture was 2.3 microseconds whereas for the Shared Bus architecture the maximum stall time was 2.9 microseconds. Due to the shared bus in the Xeon's architecture, delays were greater than the DirectConnect approach of the Opteron, and thus the difference in the stall time.
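To put the two maximum stall times in proportion, the relative difference can be computed directly from the reported figures. The helper name below is hypothetical, and the snippet is only worked arithmetic on the two values quoted above.

```python
def relative_increase(baseline, value):
    """Fractional increase of `value` over `baseline`."""
    return (value - baseline) / baseline

direct_connect_us = 2.3  # maximum stall time reported for the DirectConnect model
shared_bus_us = 2.9      # maximum stall time reported for the Shared Bus model

if __name__ == "__main__":
    # Prints the increase of the Shared Bus maximum over the DirectConnect one.
    print(f"{relative_increase(direct_connect_us, shared_bus_us):.1%}")
```

With these figures, the maximum stall time of the Shared Bus model is roughly a quarter longer than that of the DirectConnect model.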
As the models described earlier suggest, the Opteron follows the split cache approach where each core in the processor has its own L1 and L2 cache; thus no part of the cache is shared between the two cores. In contrast, the Xeon processor employs the shared cache technique, and thus both cores have access to a larger amount of cache than those in the Opteron. Whenever one of the cores in the Xeon is not accessing the shared cache, the other core has complete access to the entire cache, which results in a higher hit ratio.
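The effect of a shared cache on the hit ratio when one core is idle can be illustrated with a toy cache model. This is a sketch under strong assumptions that are not taken from the paper: a fully associative LRU cache, a capacity measured in blocks, and a hypothetical workload in which one core loops over six blocks while the other core issues no accesses.

```python
from collections import OrderedDict

def lru_hit_ratio(accesses, capacity):
    """Hit ratio of a fully associative LRU cache holding `capacity` >= 1 blocks."""
    cache = OrderedDict()
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return hits / len(accesses) if accesses else 0.0

# Hypothetical workload: core 0 loops over 6 blocks; core 1 is idle.
per_core_capacity = 4
trace = [i % 6 for i in range(600)]

split = lru_hit_ratio(trace, per_core_capacity)       # private cache only
shared = lru_hit_ratio(trace, 2 * per_core_capacity)  # the whole shared cache

if __name__ == "__main__":
    print(split, shared)
```

In this toy case the working set does not fit in the private cache, so the looping access pattern evicts each block before it is reused, whereas the shared cache holds the whole working set after the first pass. Real caches are set associative and both cores are usually active, so the gap in practice depends on the workload.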
Hit Ratios (%)
In this work, we utilized a system modeling methodology above the detailed chip implementation level that allows one to explore different designs without having to write Verilog, VHDL, SystemC, or simply C/C++ code. This approach contributes to a considerable saving in time and allows for the exploration and assessment of different designs prior to implementation.
Since predictability of performance is critical in microprocessor design, simulation models can be used to evaluate architectural alternatives and assist in making informed decisions. Simulation is an acceptable performance modeling technique that can be used to evaluate architectural alternatives and features. In this work, we used virtual system prototyping and simulation to investigate the performance of the memory subsystems of both the Opteron and the Xeon dual core processors.
Simulation results indicate that the Opteron has exhibited better latency than the Xeon for the majority of the tasks. In all cases, it either outperformed the Xeon or at least had similar latency. This demonstrates that using an FSB as the only means of communication between cores and memory has resulted in an increase in stalls and latency. On the other hand, in the DirectConnect Architecture, the cores being directly connected to the RAM, via the crossbar switch and the SRQ which were running at processor speed, had minimal delays. Each RAM request from either of the cores was sent individually to the SRQ blocks and they were routed to the RAM that had its memory controller on-chip and the cores did not have to compete for a shared resource.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.