Introduction to MIMD Architectures :
Multiple instruction stream, multiple data stream (MIMD) machines have a number of processors that function asynchronously and independently. At any time, different processors may be executing different instructions on different pieces of data. MIMD architectures may be used in a number of application areas such as computer-aided design/computer-aided manufacturing, simulation, modeling, and as communication switches. MIMD machines can be of either shared memory or distributed memory categories. These classifications are based on how MIMD processors access memory. Shared memory machines may be of the bus-based, extended, or hierarchical type. Distributed memory machines may have hypercube or mesh interconnection schemes.
A type of multiprocessor architecture in which several instruction cycles may be active at any given time, each independently fetching instructions and operands into multiple processing units and operating on them in a concurrent fashion. Acronym for multiple-instruction-stream.multiple-data-stream.
Bottom of Form
(Multiple Instruction stream Multiple Data stream) A computer that can process two or more independent sets of instructions simultaneously on two or more sets of data. Computers with multiple CPUs or single CPUs with dual cores are examples of MIMD architecture. Hyperthreading also results in a certain degree of MIMD performance as well. Contrast with SIMD.
In computing, MIMD (Multiple Instruction stream, Multiple Data stream) is a technique employed to achieve parallelism. Machines using MIMD have a number of processors that function asynchronously and independently. At any time, different processors may be executing different instructions on different pieces of data. MIMD architectures may be used in a number of application areas such as computer-aided design/computer-aided manufacturing, simulation, modeling, and as communication switches. MIMD machines can be of either shared memory or distributed memory categories. These classifications are based on how MIMD processors access memory. Shared memory machines may be of the bus-based, extended, or hierarchical type. Distributed memory machines may have hypercube or mesh interconnection schemes.
Multiple Instruction – Multiple Data
MIMD architectures have multiple processors that each execute an independent stream (sequence) of machine instructions. The processors execute these instructions by using any accessible data rather than being forced to operate upon a single, shared data stream. Hence, at any given time, an MIMD system can be using as many different instruction streams and data streams as there are processors.
Although software processes executing on MIMD architectures can be synchronized by passing data among processors through an interconnection network, or by having processors examine data in a shared memory, the processors’ autonomous execution makes MIMD architectures asynchronous machines.
Shared Memory: Bus-based
MIMD machines with shared memory have processors which share a common, central memory. In the simplest form, all processors are attached to a bus which connects them to memory. This setup is called bus-based shared memory. Bus-based machines may have another bus that enables them to communicate directly with one another. This additional bus is used for synchronization among the processors. When using bus-based shared memory MIMD machines, only a small number of processors can be supported. There is contention among the processors for access to shared memory, so these machines are limited for this reason. These machines may be incrementally expanded up to the point where there is too much contention on the bus.
Shared Memory: Extended
MIMD machines with extended shared memory attempt to avoid or reduce the contention among processors for shared memory by subdividing the memory into a number of independent memory units. These memory units are connected to the processsors by an interconnection network. The memory units are treated as a unified central memory. One type of interconnection network for this type of architecture is a crossbar switching network. In this scheme, N processors are linked to M memory units which requires N times M switches. This is not an economically feasible setup for connecting a large number of processors.
Shared Memory: Hierarchical
MIMD machines with hierarchical shared memory use a hierarchy of buses to give processors access to each other’s memory. Processors on different boards may communicate through inter nodal buses. Buses support communication between boards. We use this type of architecture , the machine may support over a thousand processors.
In computing, shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Depending on context, programs may run on a single processor or on multiple separate processors. Using memory for communication inside a single program, for example among its multiple threads, is generally not referred to as shared memory
In computer hardware, shared memory refers to a (typically) large block of random access memory that can be accessed by several different central processing units (CPUs) in a multiple-processor computer system.
A shared memory system is relatively easy to program since all processors share a single view of data and the communication between processors can be as fast as memory accesses to a same location.
The issue with shared memory systems is that many CPUs need fast access to memory and will likely cache memory, which has two complications:
- CPU-to-memory connection becomes a bottleneck. Shared memory computers cannot scale very well. Most of them have ten or fewer processors.
- Cache coherence: Whenever one cache is updated with information that may be used by other processors, the change needs to be reflected to the other processors, otherwise the different processors will be working with incoherent data (see cache coherence and memory coherence). Such coherence protocols can, when they work well, provide extremely high-performance access to shared information between multiple processors. On the other hand they can sometimes become overloaded and become a bottleneck to performance.
The alternatives to shared memory are distributed memory and distributed shared memory, each having a similar set of issues. See also Non-Uniform Memory Access.
In computer software, shared memory is either
- A method of inter-process communication (IPC), i.e. a way of exchanging data between programs running at the same time. One process will create an area in RAM which other processes can access, or
- A method of conserving memory space by directing accesses to what would ordinarily be copies of a piece of data to a single instance instead, by using virtual memory mappings or with explicit support of the program in question. This is most often used for shared libraries and for Execute in Place.
Shared Memory MIMD Architectures:
The distinguishing feature of shared memory systems is that no matter how many memory blocks are used in them and how these memory blocks are connected to the processors and address spaces of these memory blocks are unified into a global address space which is completely visible to all processors of the shared memory system. Issuing a certain memory address by any processor will access the same memory block location. However, according to the physical organization of the logically shared memory, two main types of shared memory system could be distinguished:
Physically shared memory systems
Virtual (or distributed) shared memory systems
In physically shared memory systems all memory blocks can be accessed uniformly by all processors. In distributed shared memory systems the memory blocks are physically distributed among the processors as local memory units.
The three main design issues in increasing the scalability of shared memory systems are:
- Organization of memory
- Design of interconnection networks
- Design of cache coherent protocols
Cache memories are introduced into computers in order to bring data closer to the processor and hence to reduce memory latency. Caches widely accepted and employed in uniprocessor systems. However, in multiprocessor machines where several processors require a copy of the same memory block.
The maintenance of consistency among these copies raises the so-called cache coherence problem which has three causes:
- Sharing of writable data
- Process migration
- I/O activity
From the point of view of cache coherence, data structures can be divided into three classes:
- Read-only data structures which never cause any cache coherence problem. They can be replicated and placed in any number of cache memory blocks without any problem.
- Shared writable data structures are the main source of cache coherence problems.
- Private writable data structures pose cache coherence problems only in the case of process migration.
There are several techniques to maintain cache coherence for the critical case, that is, shared writable data structures. The applied methods can be divided into two classes:
- hardware-based protocols
- software-based protocols
Software-based schemes usually introduce some restrictions on the cachability of data in order to prevent cache coherence problems.
Hardware-based protocols provide general solutions to the problems of cache coherence without any restrictions on the cachability of data. The price of this approach is that shared memory systems must be extended with sophisticated hardware mechanisms to support cache coherence. Hardware-based protocols can be classified according to their memory update policy, cache coherence policy, and interconnection scheme. Two types of memory update policy are applied in multiprocessors: write-through and write-back. Cache coherence policy is divided into write-update policy and write-invalidate policy.
Hardware-based protocols can be further classified into three basic classes depending on the nature of the interconnection network applied in the shared memory system. If the network efficiently supports broadcasting, the so-called snoopy cache protocol can be advantageously exploited. This scheme is typically used in single bus-based shared memory systems where consistency commands (invalidate or update commands) are broadcast via the bus and each cache ‘snoops’ on the bus for incoming consistency commands.
Large interconnection networks like multistage networks cannot support broadcasting efficiently and therefore a mechanism is needed that can directly forward consistency commands to those caches that contain a copy of the updated data structure. For this purpose a directory must be maintained for each block of the shared memory to administer the actual location of blocks in the possible caches. This approach is called the directory scheme.
The third approach tries to avoid the application of the costly directory scheme but still provide high scalability. It proposes multiple-bus networks with the application of hierarchical cache coherence protocols that are generalized or extended versions of the single bus-based snoopy cache protocol.
In describing a cache coherence protocol the following definitions must be given:
- Definition of possible states of blocks in caches, memories and directories.
- Definition of commands to be performed at various read/write hit/miss actions.
- Definition of state transitions in caches, memories and directories according to the commands.
- Definition of transmission routes of commands among processors, caches, memories and directories.
Although hardware-based protocols offer the fastest mechanism for maintaining cache consistency, they introduce a significant extra hardware complexity, particularly in scalable multiprocessors. Software-based approaches represent a good and competitive compromise since they require nearly negligible hardware support and they can lead to the same small number of invalidation misses as the hardware-based protocols. All the software-based protocols rely on compiler assistance.
The compiler analyses the program and classifies the variables into four classes:
- Read-only for any number of processes and read-write for one process
- Read-write for one process
- Read-write for any number of processes.
Read-only variables can be cached without restrictions. Type 2 variables can be cached only for the processor where the read-write process runs. Since only one process uses type 3 variables it is sufficient to cache them only for that process. Type 4 variables must not be cached in software-based schemes. Variables demonstrate different behavior in different program sections and hence the program is usually divided into sections by the compiler and the variables are categorized independently in each section. More than that, the compiler generates instructions that control the cache or access the cache explicitly based on the classification of variables and code segmentation. Typically, at the end of each program section the caches must be invalidated to ensure that the variables are in a consistent state before starting a new section.
shared memory systems can be divided into four main classes:
Uniform Memory Access (UMA) Machines:
Contemporary uniform memory access machines are small-size single bus multiprocessors. Large UMA machines with hundreds of processors and a switching network were typical in the early design of scalable shared memory systems. Famous representatives of that class of multiprocessors are the Denelcor HEP and the NYU Ultracomputer. They introduced many innovative features in their design, some of which even today represent a significant milestone in parallel computer architectures. However, these early systems do not contain either cache memory or local main memory which turned out to be necessary to achieve high performance in scalable shared memory systems
Non-Uniform Memory Access (NUMA) Machines:
Non-uniform memory access (NUMA) machines were designed to avoid the memory access bottleneck of UMA machines. The logically shared memory is physically distributed among the processing nodes of NUMA machines, leading to distributed shared memory architectures. On one hand these parallel computers became highly scalable, but on the other hand they are very sensitive to data allocation in local memories. Accessing a local memory segment of a node is much faster than accessing a remote memory segment. Not by chance, the structure and design of these machines resemble in many ways that of distributed memory multicomputers. The main difference is in the organization of the address space. In multiprocessors, a global address space is applied that is uniformly visible from each processor; that is, all processors can transparently access all memory locations. In multicomputers, the address space is replicated in the local memories of the processing elements. This difference in the address space of the memory is also reflected at the software level: distributed memory multicomputers are programmed on the basis of the message-passing paradigm, while NUMA machines are programmed on the basis of the global address space (shared memory) principle.
The problem of cache coherency does not appear in distributed memory multicomputers since the message-passing paradigm explicitly handles different copies of the same data structure in the form of independent messages. In the shard memory paradigm, multiple accesses to the same global data structure are possible and can be accelerated if local copies of the global data structure are maintained in local caches. However, the hardware-supported cache consistency schemes are not introduced into the NUMA machines. These systems can cache read-only code and data, as well as local data, but not shared modifiable data. This is the distinguishing feature between NUMA and CC-NUMA multiprocessors. Accordingly, NUMA machines are closer to multicomputers than to other shared memory multiprocessors, while CC-NUMA machines look like real shared memory systems.
In NUMA machines, like in multicomputers, the main design issues are the organization of processor nodes, the interconnection network, and the possible techniques to reduce remote memory accesses. Two examples of NUMA machines are the Hector and the Cray T3D multiprocessor.