Question
There are N blocks of storage. Initially, these blocks are empty and linked together on a "freelist" (shown in the table below as --- for N=4). Three processes (Process 1, Process 2, and Process 3) communicate through the use of these blocks, as depicted in the code below. Note that all three processes execute infinite loops. Denote the execution of one iteration of the loop in Process 1, Process 2, and Process 3 as P1, P2, and P3, respectively. Suppose we have observed the following order of arrivals: P3, P1, P2, P1, P1, P1, and let N=4. Show in the table below all changes to the listed variables as the iterations execute, and the final values of those variables when all processing due to the above iterations completes.
| Process being executed | freelist | nFL (value) | nFL (queue) | list1 | nL1 (value) | nL1 (queue) | list2 | nL2 (value) | nL2 (queue) |
|---|---|---|---|---|---|---|---|---|---|
| Initially | --- | 4 | empty | empty | 0 | empty | empty | 0 | empty |
| P3 | --- | | | | | | | | |
/* Three processes communicate through blocks of data. Initially “freelist” contains N free blocks, while “list1” and “list2” are both empty. */
/* Shared Variables */
var mutexFL, mutexL1, mutexL2 : semaphore := 1, 1, 1;
nFL, nL1, nL2 : semaphore := N, 0, 0;
freelist, list1, list2 : pointer to block := head_of_list_of_N_blocks, empty, empty;
/* Process 1 */
var b : block;
while true do
begin
SP (mutexFL,1,1; nFL,2,1); /* make sure at least 2 blocks in freelist */
b := unlink (freelist); /* remove a block from freelist */
SV (mutexFL,1);
produce info in block b;
SP (mutexL1,1,1);
link (b, list1); /* link block b to list1 */
SV (mutexL1,1; nL1,1);
end
/* Process 2 */
var x, y : block;
while true do
begin
SP (mutexL1,1,1; mutexFL,1,1; nL1,1,1; nFL,1,1); /* make sure blocks exist in both list1 & freelist */
x := unlink (list1); y := unlink (freelist); /* remove a block from each of list1 & freelist */
SV (mutexL1,1; mutexFL,1);
use block x & produce info in block y;
SP (mutexFL,1,1);
link (x, freelist); /* link block x to freelist */
SV (mutexFL,1; nFL,1);
SP (mutexL2,1,1);
link (y, list2); /* link block y to list2 */
SV (mutexL2,1; nL2,1);
end
/* Process 3 */
var c : block;
while true do
begin
SP (mutexL2,1,1; nL2,1,1); /* make sure blocks exist in list2 */
c := unlink (list2); /* remove a block from list2 */
SV (mutexL2,1);
consume info in block c;
SP (mutexFL,1,1);
link (c, freelist); /* link block c to freelist */
SV (mutexFL,1; nFL,1);
end
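As used above, SP and SV are simultaneous (generalized) semaphore primitives: SP(s1,t1,d1; s2,t2,d2; ...) atomically waits until every listed semaphore si has value at least ti, then decrements each si by di in one step, while SV(s1,i1; ...) atomically increments each listed semaphore. The following is a minimal Python sketch of that semantics, offered only to clarify how the primitives behave; the SemaphoreGroup class and its interface are invented for illustration and are not part of the problem.

```python
import threading

class SemaphoreGroup:
    """Named counters on which SP/SV act atomically as a group (sketch)."""
    def __init__(self, **initial):
        self._values = dict(initial)
        self._cond = threading.Condition()

    def SP(self, *reqs):
        """reqs are (name, threshold, decrement) triples, e.g. ("nFL", 2, 1)."""
        with self._cond:
            # Wait until EVERY listed semaphore meets its threshold, then
            # decrement all of them in a single atomic step.
            while any(self._values[n] < t for n, t, _ in reqs):
                self._cond.wait()
            for n, _, d in reqs:
                self._values[n] -= d

    def SV(self, *incs):
        """incs are (name, increment) pairs, e.g. ("nL1", 1)."""
        with self._cond:
            for n, i in incs:
                self._values[n] += i
            self._cond.notify_all()

# Mirrors Process 1's entry condition: at least 2 blocks on the freelist.
sems = SemaphoreGroup(mutexFL=1, mutexL1=1, mutexL2=1, nFL=4, nL1=0, nL2=0)
sems.SP(("mutexFL", 1, 1), ("nFL", 2, 1))   # like SP(mutexFL,1,1; nFL,2,1)
sems.SV(("mutexFL", 1))                     # like SV(mutexFL,1)
```

Because the threshold tests and the decrements happen under one lock, a process waiting on several semaphores at once (as Process 2 does) either passes all of its conditions together or blocks without taking anything, which is why a blocked iteration leaves the listed values unchanged and only adds the process to a semaphore's queue.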
Explanation / Answer
1.1 Critique of Past Multiprocessors

One fundamental issue that must be addressed by any processor design concerns program locality. By properly exploiting locality through the use of a storage hierarchy (e.g., registers and caches), performance can be improved and cost lowered. Multiprocessing introduces two additional issues [AI87]: latency and synchronization. Latency arises from the need to communicate among physically separate processors (e.g., remote memory reads); this latency should either be reduced or masked by doing other useful work. Synchronization is needed because the work associated with a given program must be broken down and distributed among processors, and these activities must be coordinated. In this section, the von Neumann and dataflow models are evaluated based on their handling of the above issues.

von Neumann Model. The implicit, sequential execution of the von Neumann model can effectively exploit program locality. In addition, sophisticated hardware mechanisms have been developed to dynamically exploit parallelism at the instruction level (e.g., pipelining, delay slots, cache prefetching, and multiple instruction issue) or, in a limited way, at the data-stream level (e.g., vector processing). Some studies suggest a reasonable amount of significant instruction-level parallelism may be available [Wal93].

A great variety of von Neumann-based multiprocessor execution models have been built or proposed. The SIMD model (such as MasPar, DAP, CM2, and IBM GF11) has one control unit which executes the program and issues commands to a number of identical data processors. Programming is relatively simple due to the synchronous nature of execution, and the model is most suitable for data-parallel applications in which operations are performed across whole sets of data. In the traditional distributed-memory MIMD model (such as Ncube and Intel Paragon), each processing node contains its own local memory, which is inaccessible from other nodes, and each node executes an independent instruction stream. Nodes communicate via message passing, which has relatively high communication overhead and latency, so this model is most suitable for executing task-parallel or coarse-grain tasks.

[Figure 1.1: Generic MPP architecture — nodes 1..n, each with local memory and processors 1..m on a bus, connected by an interconnection network.]

The distributed-memory MIMD with shared address space model (such as CM5, KSR1, T3D, and IBM SP2) is gaining in popularity. This model provides a direct hardware method to access another node's local memory, and some of the newer machines, such as the T3D and IBM SP2, are characterized by relatively low network latency. The main advantage of these machines is their ability to support a greater number of programming models compared to the other two von Neumann models.

In summary, the von Neumann execution paradigm is characterized by a single thread of control per processor. When an operation of potentially long and unpredictable latency takes place, such as a remote memory read or a synchronization, the processor pipeline halts and the processor remains idle while the operation completes. The traditional techniques for covering latency, such as caches and out-of-order execution, fail in the presence of long and unpredictable latency, and switching to another task is usually cost-prohibitive for a simple remote memory read. However, in executing instruction sequences that have no long latencies, the von Neumann model's efficiency and use of locality give it an important advantage.
Fine-grain Dataflow Model. Dataflow is an inherently parallel execution model which represents computations by directed graphs. Highlights of the model are only briefly treated here; for further details, see [AC86, GE91]. In the fine-grain dataflow model, nodes of the dataflow graph represent instructions and edges represent data paths through which operands flow. A node can execute when all its operands arrive, and all such nodes can "fire" concurrently. The only constraint on the execution order of instructions is their true data dependencies; thus the model exposes maximal parallelism at all levels, including the function, loop, and instruction levels.

The dataflow architecture consists of a number of processing nodes that are interconnected via a network. Each processing node consists of an execution unit that executes the instructions and a matching unit that stores incoming operands and enables ready instructions to be executed later. Instruction-level scheduling decisions are left entirely to the run-time system and the operand arrival order. Operands are carried in packets known as tokens; in dynamic dataflow, each token carries its own context information. An instruction can be executed on any processor, so the model is naturally scalable. Some of the more notable fine-grain dataflow designs include MIT's Tagged-Token Dataflow Architecture (TTDA) [ABM87, AN90], the Manchester Dataflow Machine [GKW85], and SIGMA-1 [YSHK85].

The problems associated with pure fine-grain dataflow have been well treated in the literature [Ian88, MNB91, Cul89, PT91]. The important points can be summarized as follows:

- Resource management is a problem due to the large potential parallelism. Because maximal parallelism has been exposed, the machine is forced to handle a large peak demand for resources [Cul89]. For example, matching operations implicitly allocate resources within the matching store; a deadlock may occur if the resource requirement ever exceeds the capacity.
- Synchronization at the instruction level causes two main problems:
  - The overhead of token matching is incurred for every instruction. This overhead is especially noticeable in the execution of inherently sequential threads of code and in those cases where the compiler can statically schedule the instructions with only a small loss of parallelism.
  - The dynamic instruction selection tends to mask inherent instruction-level locality [Ian88, MNB91]. This is a major drawback of the model because it cannot take advantage of the storage hierarchy.
- Application domains are limited because certain traditional concepts are difficult to represent. For example, critical sections (no interruptions) are necessary to support low-level operating systems and to handle certain kinds of resource management, but they cannot easily be represented in a dataflow graph [PT91].

Several designs improve upon the first-generation fine-grain dataflow machines; for example, RMIT [AE90] borrows features of static dataflow to eliminate unnecessary tagging operations, and EPSILON-1 [GDHH89] provides direct matching in order to reduce the matching-store resource requirements. However, these are evolutionary improvements that do not directly address the main shortcomings listed above. In spite of the shortcomings, the dataflow model has an important advantage over the von Neumann model in that it addresses the two key multiprocessor issues well: it tolerates latency by switching among a set of ready instructions, and it has relatively cheap fine-grain synchronization.
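As a toy illustration of the firing rule described above (not modeled on TTDA, the Manchester machine, or any other cited design; the graph, names, and matching-store structure here are invented):

```python
# Each node fires as soon as all of its input operands (tokens) arrive;
# only true data dependencies constrain execution, as in the text above.
from collections import defaultdict

# node -> (operation, number of inputs, downstream node or None)
graph = {
    "add1": (lambda a, b: a + b, 2, "mul1"),
    "mul1": (lambda a, b: a * b, 2, None),
}
waiting = defaultdict(list)          # matching store: node -> arrived tokens

def send_token(node, value):
    """Deliver an operand; fire the node once every operand is present."""
    waiting[node].append(value)
    op, arity, succ = graph[node]
    if len(waiting[node]) == arity:  # firing rule: all operands arrived
        result = op(*waiting.pop(node))
        if succ is None:
            print("result:", result)
        else:
            send_token(succ, result)

send_token("add1", 2)
send_token("mul1", 10)               # tokens may arrive in any order
send_token("add1", 3)                # add1 fires now; mul1 fires after
```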
The latency tolerance is possible due to the dataflow model's ability to expose maximal parallelism at the function, loop, and instruction levels.

1.2 Multithreaded Execution Model

To summarize the discussion so far, the key advantage of dataflow execution is the ability to provide fine-grain synchronization (as opposed to higher-level synchronization such as barriers) and task switching at a low cost, so that latency can be usefully masked. The key advantage of von Neumann execution is the ability to exploit locality via the storage hierarchy through compile-time, static scheduling. Multithreading attempts to combine the two approaches to reap the benefits of each. Some multithreading approaches start with a pure dataflow model, extend it so that the granularity of tasks is increased from one to many instructions, and statically schedule the instructions within each task (or thread). Threads themselves can be dynamically scheduled as in the fine-grain dataflow model. Thus, the benefits of sequential execution (e.g., taking advantage of locality, reducing overhead, and the ability to write critical sections) can be reaped within a thread, and cheap fine-grain synchronization among threads can exploit useful parallelism.

The guiding principle of the above approach, also referred to as a von Neumann-dataflow hybrid, is a shifting of responsibility from the hardware to the compiler. By giving low-level scheduling decisions to the compiler, the hardware is simplified. However, the compiler has the burden of providing a proper mapping to the hardware so that latency can be masked while locality is exploited at the same time.

The chief feature of multithreading, its ability to cover latency by switching among a set of ready threads, provides another view of multithreading: one that starts with a conventional von Neumann processor and adds the capability of inexpensive context switches. In this view, there exist a number of ready tasks that execute in an interleaved fashion. For example, when a currently executing task makes a remote access, the processor switches to another ready task, saving the current task's state. This process requires some hardware support and involves replicating resources, such as register sets. It is of paramount importance to multithreading that the cost of thread switching be much smaller than the latency: only when the cost of thread generation and switching is small can the increased parallelism usefully mask latencies with other computations. It is possible to emulate multithreading on the traditional von Neumann multiprocessor models discussed in the previous section (à la TAM); however, without efficient hardware support, the thread switching and creation cost may defeat the usefulness of multithreading.

1.2.1 Variants and Issues in Multithreading

In this section, various approaches to multithreading are classified in several ways.

The Storage Model. Most multithreaded models associate a basic unit of storage (often called a frame) with an activation of a thread or a group of related threads. The frame can contain slots to perform synchronizations, to store input values, and to store temporary values. The allocation of slots in a frame is usually statically determined.
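A minimal sketch of such a frame, assuming a thread-based frame with one synchronization counter plus statically allocated input and temporary slots; the layout and names are invented for illustration and do not reproduce any particular machine's frame format:

```python
from dataclasses import dataclass, field

@dataclass
class ActivationFrame:
    """One activation's storage: sync slot, input slots, temporaries."""
    sync_count: int                                 # inputs still awaited
    inputs: list = field(default_factory=list)      # slots for input values
    temps: dict = field(default_factory=dict)       # slots for temporaries

    def deposit(self, value):
        """Store one input; report whether the activation is now enabled."""
        self.inputs.append(value)
        self.sync_count -= 1
        return self.sync_count == 0                 # all inputs present

frame = ActivationFrame(sync_count=2)
frame.deposit(7)            # first operand arrives; not ready yet
ready = frame.deposit(8)    # second operand arrives; ready == True
```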
Such a frame can be compared to the traditional von Neumann stack frame associated with a function activation; however, whereas the von Neumann calling structure is such that only the top frame is active while the rest lie dormant, the multithreading "calling" structure has the form of a tree, with all the "frames" concurrently active. The inherent ability of a given activation to generate descendant activations, all of which may be concurrently active, gives rise to a powerful compilation model that can exploit parallelism. The challenge for compiler writers, however, is finding a proper mapping from the programming model to this execution model. In most models, a frame does not span multiple processors, although some models may allow frames to migrate for load-balancing purposes.

Models such as TAM [CSS+91], *T [NPA92], EM-4 [SYH+89a, SYH+91], and Iannucci's Hybrid Architecture (IHA) [Ian88] use activation frames that are based on code blocks. A code block is associated with a program structure such as a function or a loop body; for example, when a function is invoked, an activation frame is allocated explicitly as part of the calling convention. In general, a code block may consist of several threads; however, a thread is never larger than a code block. Alternatively, instead of associating a frame with an activation of a code block, it may be associated with an activation of a single thread. In this storage model, the frame size is typically much smaller, but more frequent allocation and deallocation operations are required. The Pebbles model developed in this thesis uses these thread-based frames.

The state of computations is stored in various storage elements, including registers, caches, frame memory, and global heap memory. The globally distributed heap space is used for long-term storage needs (e.g., data with lifetimes greater than a single frame activation), which typically consist of data structures such as arrays that may be accessed by several threads located in different processors. The processor state is usually linked to an activation frame. In general, a smaller activation frame increases the number of synchronizations but decreases the synchronization cost.

Blocking versus Non-blocking Thread Models. Broadly, the proposed multithreaded models can be classified into blocking and non-blocking variants; the classification is based on how thread switching and synchronization are reconciled. In the blocking thread model, a thread can be suspended (or blocked) and then resumed later, requiring a mechanism to save state and to move threads between the different states. A thread may be suspended when either a remote memory request or a synchronizing operation is made. The hardware complexity and cost may range from simply queueing the suspended thread in a hardware queue to saving an entire set of registers. The size of blocking threads may range from tens to millions of instructions. Examples include HEP [Smi81], Tera MTA [ACC+90], Dash [WG89], IHA [Ian88], and the J-Machine [DFK+92].

In the non-blocking thread model, once a thread starts executing, it runs to completion without blocking. This model therefore places two restrictions on each thread: all of its inputs must be present before execution starts, and all instructions must take a bounded amount of time. The second requirement implies the use of split-phase transactions for long-latency operations; for example, the request and its reply represent separate transactions.
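A rough sketch of the split-phase idea under a non-blocking model (the ready queue, the stand-in "remote memory," and all names are invented; a real machine overlaps the request and reply with other work rather than answering immediately):

```python
from collections import deque

ready_queue = deque()                 # threads whose inputs are all present
remote_memory = {"x": 42}             # stand-in for another node's memory

def remote_read(addr, continuation):
    # Phase 1: issue the request; the issuing thread does NOT wait.
    reply = remote_memory[addr]       # (a real reply arrives much later)
    # Phase 2: the reply enables a separate continuation thread.
    ready_queue.append(lambda: continuation(reply))

def requester():
    remote_read("x", lambda v: print("continuation thread got", v))
    print("requester ran to completion without blocking")

ready_queue.append(requester)
while ready_queue:                    # trivial FIFO scheduler
    ready_queue.popleft()()
```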
The non-blocking model requires a simpler processor architecture, but the thread size is necessarily smaller than in the blocking model; examples include Monsoon [PC90], *T [NPA92], EM-4 [SYH+89a, SYH+91], TAM [CSS+91], EARTH [HG93, HTG94], S-TAM [Vas94], and Pebbles. A machine that supports the blocking thread model can obviously support non-blocking threads, albeit not as efficiently as a machine specifically designed to run them. Since non-blocking threads are smaller and more numerous, different hardware requirements exist; for example, a larger queue may be needed to handle the larger number of ready threads. Similarly, a non-blocking machine can support blocking threads via a programming convention: a set of non-blocking threads can be considered one longer blocking thread (e.g., EM-4), with "suspensions" between the boundaries. However, the software must explicitly manage such a "macro-thread."

The choice of a thread model also interacts closely with the chosen storage model. Typically, values in registers are not carried across threads; in some cases, notably IHA, registers are not even valid across potentially suspensive points within a thread. Also, the expected lifetime of frames may affect the design of the frame memory. In general, the lifetime of a frame is expected to be longest for the blocking, code-block-based frame model, shortest for the non-blocking, thread-based frame model, and somewhere in the middle for the other two combinations.

Synchronization. A synchronization requires a meeting point such as a register, a memory word, or an interrupt level. The number of these meeting points and their costs greatly affect the ability to tolerate latency through task switching. Most von Neumann machines either provide a limited number of meeting points (e.g., the registers) or implement them expensively (e.g., software-implemented semaphores). In contrast, multithreaded machines provide virtually unlimited meeting points, since any memory location can be used as one, and the cost is low for both the implicit and explicit synchronizations that they provide.

In the implicit case, a synchronization operation is performed as part of a normal instruction execution sequence. Fine-grain dataflow is an extreme example of this: every operation requires an implicit synchronization. Most often, implicit synchronizations are implemented by encoding synchronization functions in tag bits associated with each memory word. For example, when an instruction accesses a word that has not yet been written, as indicated by its tag bits, the synchronization fails and the thread either terminates or becomes blocked. Examples include Monsoon, IHA, and Tera.

In the explicit case, the instruction set contains specific synchronization instructions[1] to perform the operations. Usually, specific memory locations are reserved as meeting points (typically within an activation frame or the heap memory), and the synchronization instructions manipulate their contents. In general, these locations operate as counters; for example, when a certain number of synchronization events has occurred, as indicated by the count, an associated thread can be enabled. Examples include P-RISC, TAM, and *T. Synchronization involves first associating the "signal" with the appropriate thread, then enabling (or creating) it, and later executing it.
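A sketch of the counter-based explicit scheme just described, assuming one reserved counter slot per meeting point; the structure is illustrative and is not the actual instruction-level mechanism of P-RISC, TAM, or *T:

```python
from collections import deque

ready_queue = deque()

class SyncSlot:
    """A reserved memory location acting as a synchronization counter."""
    def __init__(self, count, thread):
        self.count = count            # events still awaited
        self.thread = thread          # thread to enable at zero

    def signal(self):
        self.count -= 1
        if self.count == 0:
            ready_queue.append(self.thread)   # enable the waiting thread

slot = SyncSlot(2, lambda: print("both events arrived; thread runs"))
slot.signal()                         # first event: not yet enabled
slot.signal()                         # second event: thread enabled
while ready_queue:
    ready_queue.popleft()()
```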
The cost of the first two steps, association and enabling, can be reduced by special hardware such as synchronization units and hardware queues. In addition to cheap synchronization, a cheap dynamic thread-creation (forking) mechanism is also important. In general, a thread is created as a result of certain synchronization operations or through a specific call, but the actual creation can be handled implicitly or explicitly. In some cases, such as Monsoon, the thread is created automatically as part of the execution of a single instruction; in other cases, such as P-RISC and IHA, explicit instructions are executed to generate and enable threads.

Scheduling. Given a set of ready threads in the system, an interesting challenge lies in selecting which thread to execute next. The choice can affect performance greatly and involves several tradeoffs. The scheduling aspect of multithreading has received scant attention so far, probably due to the difficulty of implementing a proper scheduling policy: scheduling is controlled locally but has global effects, so global knowledge about the program and the interaction of system components is generally required. Consider an example of such a problem: it may seem sensible to give high priority to a thread that lies along the program's critical path (if that path can be determined), since many other threads may depend on it. On the other hand, its execution might disrupt the existing locality so much that little or no overall benefit results. Since optimal scheduling is an NP-complete problem, a set of good heuristics which work well over a range of problems is the best we can hope for.

[1] In some cases, a synchronization instruction may actually be composed of a sequence of fundamental operations, as is the case in EARTH and TAM.

| Models    | Storage    | Thread      | Synch.   | Forking  | Scheduling       |
|-----------|------------|-------------|----------|----------|------------------|
| Monsoon   | Code block | Nonblocking | Implicit | Implicit | FIFO             |
| IHA       | Code block | Blocking    | Implicit | Explicit | FIFO             |
| TAM       | Code block | Nonblocking | Explicit | Explicit | Quantum          |
| EM-4      | Function   | Nonblocking | Implicit | Implicit | FIFO             |
| P-RISC    | Code block | Nonblocking | Explicit | Explicit | FIFO             |
| J-Machine | Code block | Blocking    | Implicit | Explicit | FIFO             |
| *T        | Code block | Nonblocking | Explicit | Explicit | FIFO             |
| Tera      | Function   | Blocking    | Implicit | Explicit | FIFO             |
| EARTH     | Function   | Nonblocking | Explicit | Explicit | FIFO             |
| Pebbles   | Thread     | Nonblocking | Implicit | Implicit | FIFO/Depth-first |

Table 1.1: Classification of Proposed or Existing Models

A related difficult question is when to do scheduling: at run-time, at compile-time, or through a combination of the two. A combination of compile-time and run-time scheduling would seem to make the most sense, but much research remains to be done. Most multithreaded models introduced so far rely on strict FIFO run-time scheduling; examples include *T, Monsoon, the J-Machine, and Tera. Some models do use a simple priority-based scheme; for example, TAM tries to schedule together threads that belong to the same activation frame in order to exploit locality. Even for this fairly simple local scheduling policy, the implementation cost may be such that a simple FIFO policy ends up being better [NWD93].

Implementation. In Table 1.1, various proposed and actual models are classified according to the above criteria. Just about every design alternative discussed can be implemented in hardware, in software, or in some combination of both. The tradeoff lies mainly in implementation cost versus speed; there should be some optimal price/performance point, and this is the subject of current research.
For example, EARTH uses dedicated hardware to perform synchronization, whereas *T uses the main execution processor for that duty. As another example, some machines have hardware scheduling queues (e.g., *T), whereas others implement the queues in software (e.g., TAM). Each tradeoff may significantly affect the goals and strategies of code generation by providing different cost figures. In Figure 1.2, the variations in the multithreading model are (roughly) mapped into the design space. Along the x axis lie the two thread models: blocking versus non-blocking. Along the y axis, the models are classified according to the cost of performing various multithreading operations such as synchronization and thread creation. As more hardware support is provided, the run-time cost decreases.

1.3 Compiling for Multithreaded Models

Compiling for a multithreaded processor is a challenge, especially since the multithreaded processor affords a wide range of design parameters, as described in the previous section. The goal is to generate code that makes effective use of the resources of a given multithreaded machine. The most significant aspect of code generation for any multiprocessor is the proper partitioning of programs so that hardware overhead for operations such as synchronization is minimized and the ability to exploit parallelism is maximized. Partitioning determines the unit of computation (tasks or threads) in a program; for a given multiprocessor architecture, there should be some optimal granularity. Generally, the partitioning process generates threads at compile time, and the scheduling process determines how to schedule those threads at either run-time or compile-time. Several methods that generate multithreaded code have been developed by various researchers [GN88, EG91, Ian88, SCvE91, HDGS92, Tra91, RNB93]. These methods generally fall into two categories. In the top-down strategy, code is generated directly from program data-dependence graphs. In the bottom-up strategy, fine-grain dataflow graphs are first generated and then partitioned into a set of threads.
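To make the bottom-up strategy concrete, here is a toy partitioner that merges linear chains of a fine-grain dataflow graph into threads. The chain heuristic and every name here are invented for illustration; real partitioners must honor many more constraints (for example, long-latency edges must separate non-blocking threads).

```python
def chain_partition(graph):
    """Group nodes into threads by following single-successor chains."""
    preds = {n: 0 for n in graph}
    for succs in graph.values():
        for s in succs:
            preds[s] += 1
    threads, placed = [], set()
    for node in graph:
        if node in placed:
            continue
        chain, cur = [node], node
        placed.add(node)
        # Extend the chain while cur has exactly one successor and that
        # successor depends on nothing else (a statically orderable pair).
        while len(graph[cur]) == 1 and preds[graph[cur][0]] == 1 \
                and graph[cur][0] not in placed:
            cur = graph[cur][0]
            chain.append(cur)
            placed.add(cur)
        threads.append(chain)
    return threads

# a -> b -> {c, d}: a and b form one thread; c and d become single-node
# threads that can be scheduled in parallel.
graph = {"a": ["b"], "b": ["c", "d"], "c": [], "d": []}
print(chain_partition(graph))   # [['a', 'b'], ['c'], ['d']]
```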