nach oben

The Journal of Supercomputing

Erschienen in:

Open Access 16.09.2022

Using software visualization to support the teaching of distributed programming

verfasst von: Lorenzo Di Rocco, Umberto Ferraro Petrillo, Francesco Palini

Erschienen in: The Journal of Supercomputing | Ausgabe 4/2023

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Patentsuche

Aus

Abstract

In this paper, we introduce MARVEL, a system designed to simplify the teaching of MapReduce, a popular distributed programming paradigm, through software visualization. At its core, it allows a teacher to describe and recreate a MapReduce application by interactively requesting, through a graphical interface, the execution of a sequence of MapReduce transformations that target an input dataset. Then, the execution of each operation is illustrated on the screen by playing an appropriate graphical animation stage, highlighting aspects related to its distributed nature. The sequence of all animation stages, played back one after the other in a sequential order, results in a visualization of the whole algorithm. The content of the resulting visualization is not simulated or fictitious, but reflects the real behavior of the requested operations, thanks to the adoption of an architecture based on a real instance of a distributed system running on Apache Spark. On the teacher’s side, it is expected that by using MARVEL he/she will spend less time preparing materials and will be able to design a more interactive lesson than with electronic slides or a whiteboard. To test the effectiveness of the proposed approach on the learner side, we also conducted a small scientific experiment with a class of volunteer students who formed a control group. The results are encouraging, showing that the use of software visualization guarantees students a learning experience at least equivalent to that of conventional approaches.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

The recent availability of unprecedented amounts of data (i.e., Big Data) requires the development of new paradigms, methods, and technologies that enable efficient storage, management, processing, and analysis of such data in a reasonable amount of time.

One possible direction is to solve these problems through distributed computing. In this case, one computational approach that has attracted much attention in recent years is that based on MapReduce [1]. This is a general-purpose distributed programming paradigm for big data processing, first developed by Google for internal purposes and made publicly available through the Apache Hadoop [2] and Apache Spark [3] distributed computing frameworks.

One of the main reasons for the popularity of this paradigm is that it allows users to focus on the conceptual development of a distributed solution to a problem, rather than dealing with all the complex issues that typically arise when implementing a distributed algorithm. This simplicity, along with robust market demand, has made MapReduce the paradigm of choice for introducing distributed programming in undergraduate computer science courses [4].

We note that, despite its apparent simplicity, it is not easy to fully understand the MapReduce paradigm and how it can be used to leverage the computational resources of a distributed framework. A common approach to teaching this topic is to use visualizations to illustrate the behavior of MapReduce algorithms. However, this solution can be cumbersome and quite limiting when implemented with traditional teaching tools such as whiteboards or electronic slides.

In this paper, we explore the possibility of using software visualization techniques to support the teaching of MapReduce. To this end, we present MARVEL¹ (MAp Reduce Visual Environment Laboratory), an interactive environment to introduce and explain MapReduce. This environment provides users with a visual representation of a distributed system and allows them to interactively compose and execute a MapReduce algorithm step-by-step through a graphical interface. Each user-requested step of a MapReduce algorithm is effectively executed on the current input and its effects are represented by rendering a corresponding graphical animation. The output resulting from the execution of each step is displayed on the screen and provided as input to the following step. Thus, by executing the correct steps in the correct order, it is possible to emulate and display the behavior of a variety of MapReduce algorithms without writing a single line of code.

A further contribution of this work lies in the results of the experimental analysis we performed, in which we compared the performance of two groups of students attending a lecture on MapReduce. In one case, the lecture was given using electronic slides and, in the other case, it was taught using MARVEL. The results show that the use of MARVEL has no negative impact on student performance.

1.1 Organization of the paper

In Sect. 2, we briefly introduce distributed programming, the MapReduce paradigm, and Spark, one of its major implementations. In Sect. 3, we discuss the issues that can arise when teaching MapReduce and focus on two important learning tasks that we believe can be addressed using software visualization. In Sect. 4, we introduce and describe our proposal. In Sects. 5 and 6, we outline the architecture of MARVEL. In Sect. 7, we briefly review the existing literature on the use of software visualization to support the teaching of parallel and distributed algorithms. In Sect. 8, we discuss the experiment we conducted to evaluate the impact of our proposed system on the learning experience of a class of students. Finally, in Sect. 9 we draw some concluding remarks on our work.

2 Distributed programming

At its core, distributed programming is about the development of applications where two or more processes cooperate to solve a given task, where the processes may exist or not on a same computer [6]. This allows to leverage the computational capabilities of several computers, for solving problems that would be, otherwise, out of grasp for a single workstation.

Although distributed programming can usually be done using traditional programming languages, there are several aspects the developer has to care of and that are missing while developing non-distributed applications. Most of these concern with the need of exchanging data among the different processes running a same distributed application, synchronizing their execution, guaranteeing a balanced workload distribution and coping with network faults. Consequently, writing distributed applications has never been an easy task, as it required non-trivial technical skills.

This problem has been alleviated thanks to the introduction in the past decades of several software frameworks and libraries, such as MPI [7] or PVM [8], able to take charge of some of these issues, thus simplifying the development of complex distributed applications. Despite this, distributed programming remained a complex activity, and so it has been its teaching. Situation changed in the last decade, thanks to the introduction of MapReduce.

2.1 MapReduce

The MapReduce paradigm is a programming model designed to simplify the development of algorithms useful in a wide range of application domains, such as [9‐12], for processing very large amounts of data [1]. Starting from the availability of an input dataset organized as a collection of key-value pairs, it is based on the execution of a sequence of two types of functions:

map functions. They take as input a key-value pair returning, as output, a (possibly empty) set of key-value pairs.
reduce functions. They take as input a collection of key-value pairs sharing the same key, so to produce, as output, a (possibly empty) set of key-value pairs.

Thus, distributed algorithms of different complexities can be modeled in MapReduce by defining proper map and reduce functions, so that the algorithm itself is a sequence of these functions, where each function is executed on the input dataset, with its output used as input for the next one.

MapReduce applications are typically executed on distributed systems. In this case, the input collection of key-value pairs is partitioned into data blocks, with each block assigned to one computing node of the distributed system. Then, two cases can occur. If a map function has to be executed, each node of the distributed system will execute this function independently of the others, on each key-value pair of its data blocks. If a reduction function needs to be executed instead, for each different key, a node is selected to be responsible for processing the key-value pairs having that key (i.e., partitioning). Then, the input collection of key-value pairs is reordered so that all elements having the same key are moved to the node responsible for that key (i.e., shuffle). Finally, the reduce function is executed by this node on the values associated with that key.

Notice that all communications required to transfer the elements returned by a function to the node where they will be further processed, as well as the handling of possible errors, are automatically managed by the underlying distributed computing framework. The user only needs to define the map and reduce functions required to implement the algorithm.

2.2 Apache spark

Spark is the de-facto standard framework that supports the MapReduce paradigm. It is capable of leveraging the computational capabilities of a distributed system using a high-level distributed programming model.

Its architecture is based on a driver/worker model (see Fig. 1). The developer is required to code the driver application. This will be executed on the master node and will be responsible for orchestrating a distributed execution using a collection of tasks. The worker nodes perform the distributed computations, following the instructions received from the driver program. Each worker node hosts one or more executors, which are the processes actually responsible for executing the tasks issued by the driver program. Input data is maintained and processed in a distributed manner using Resilient Distributed Datasets (RDD). Communication between the driver and the worker nodes occurs with the mediation of a Spark cluster manager.

2.2.1 Resilient distributed datasets

The term Resilient Distributed Dataset (RDD) refers to the standard data structure available in Spark for developing distributed applications. It is essentially a distributed collection of customizable objects, modeled as key-value pairs. When instantiated during the execution of a Spark application on a distributed system, their content is automatically partitioned in small data blocks, with each block associated with a particular worker node. Spark also provides a rich library of distributed high-level operations that can be used to process the contents of these data structures. These operations can be either transformations or actions. Transformations process one or more input RDD data structures, by means a MapReduce-based function supplied by the user. This is in turn applied to each object of the target RDD. The result is a new RDD containing the processed objects. Actions process the objects in an input RDD data structure, and return to the driver program a non-distributed collection of objects.

3 Motivations

The MapReduce paradigm is often introduced using very simple distributed algorithms, such as word counting [1] or line counting, whose abstract execution is simulated by the teacher using graphical metaphors. More complex problems and algorithms are then discussed. We find that such introductory problems can be solved elegantly with very few map and reduce functions and provide a smooth introduction to the subject. However, it may not be clear to a student how the MapReduce approach can be generalized to more complex problems starting from these simple examples. Moreover, it may not be easy to understand how the same algorithm executed on a distributed system requires an execution time that is only a fraction of the original one. Therefore, we focus on the following learning objectives:

Solving non-trivial problems with MapReduce Solving a general problem with MapReduce will likely require running several simple map and reduce functions. It is important to understand how executing these simple functions in a particular order can lead to the solution of a more general and complex problem.
Exploiting distributed computing When introducing a MapReduce algorithm, the focus is often on explaining the conceptual behavior of the algorithm rather than how it is implemented in a real distributed system. This can lead to a lack of understanding of how a MapReduce algorithm can take advantage of a distributed system to significantly reduce its execution time.

3.1 Visualization as a solution

Following the idea that it is easier to understand the logic of an algorithm if its behavior is represented and illustrated using appropriate graphical metaphors, it is common to introduce MapReduce and its applications using visualizations. This is usually done in two ways. The first approach is to illustrate on a whiteboard during a lesson the graphical simulation of the steps required by a selected algorithm in the execution of an input dataset. The second approach is to prepare a set of electronic slides prior to the lesson that illustrate a simulation of the algorithm executing on an input data set.

Both teaching approaches can have some significant drawbacks. First, it can be very time consuming to illustrate the simulation of a MapReduce algorithm on a whiteboard, since the layout of these algorithms is usually complicated and takes time to display. By using electronic slides, on the other hand, the simulation of a MapReduce algorithm can be displayed immediately on a screen.

On the other hand, the content of an electronic presentation is difficult to change during the lesson, so the teacher cannot easily deviate from the particular presentation he prepared at an earlier time. Instead, if the teacher simulates the algorithm using a whiteboard, he has the possibility to change the content of the visualization. This can be useful, for example, to respond to a particular question from the class or to highlight some steps of the algorithm that might be unclear at first.

So, there is no clear choice between the two teaching methods, as each of them has some advantages and disadvantages over the other.

4 Our contribution

The idea we propose is the proper use of interactive visualizations automatically generated by software to get the most out of the teaching approaches discussed in Sect. 3 while avoiding their drawbacks. Specifically, we provide a teacher with a framework capable of executing a MapReduce algorithm in reality and illustrating its behavior using appropriate graphical metaphors and animations. Such a framework should also include a representation of a real-distributed environment and be able to show details about the physical execution of an algorithm, such as the communication between different nodes and the parallel execution of the algorithm.

The availability of such a framework could enable a new teaching approach that has several advantages over traditional approaches. Compared to using a whiteboard, it requires almost no time investment, either in preparing a lesson or during the lesson, since the visualization of a target algorithm is automatically generated on-the-fly by closely following its execution. As for the electronic slides, our approach is not only much more efficient in terms of preparation time for a lesson, but also much more flexible, as the teacher has the possibility to compose the presented algorithm interactively, instead of being bound to a specific presentation. As another advantage, we mention the possibility for students to recreate and experiment on their own the algorithm taught by the teacher using the same framework.

4.1 MARVEL

We have developed a software framework called MARVEL (MAp Reduce Visual Environment Laboratory) to evaluate the effectiveness of our approach, along with the primitives used for the examples and experiments described in this paper². It is an application that allows a teacher to interactively define and run a MapReduce algorithm on an input dataset, automatically illustrating its behavior with appropriate graphical metaphors and animations (see Fig. 2 for an example).

A typical teaching session with MARVEL is organized in three phases:

Phase 1: Environment Initialization

The teacher sets the initial configuration of the simulated distributed environment to be used during the teaching session. The most important parameters are the number of computing nodes to be used and the size of the data blocks used to partition RDD data structures (see Sect. 2.2.1). Once this information is specified, the user interface of the framework is rendered on the screen, including a set of graphical entities representing the different nodes of the simulated system.
Phase 2: Input Initialization

The teacher introduces the input dataset to be processed by the presented algorithm. The dataset can either be loaded from a file or generated on-the-fly, using a provided random data generator. After instantiation, the input dataset is visually divided into blocks and assigned to the different nodes of the simulated system.
Phase 3: Algorithm execution

The teacher interacts with the Graphical User Interface (GUI) of the framework to replicate the MapReduce algorithm presented. Specifically, the teacher selects from a drop-down list the map or the reduce transformations to be performed on the dataset currently displayed on the screen. After this choice is made, an animation stage is automatically played to illustrate the behavior of the selected operation (see Fig. 2 for an example). The animation is played to reflect the distributed nature of the transformation, i.e., the different data blocks located on different nodes are processed in parallel. Optionally, the framework can prompt the teacher to provide some input parameters if required for the ongoing transformation. Note that each animation stage is supposed to represent a step of the algorithm under consideration and refers to the output of the previous stage. This means that the sequence of all animation stages played back one after the other will result in a visualization of the whole algorithm.

In addition, the teacher can rewind the current visualization at any time by undoing any number of the last operations. This is useful, for example, to show the behavior of the same algorithm on a different input, or to repeat and focus on a particular step of the algorithm until the class has reached a certain level of confidence.

4.2 Case study

In the remainder of this article, as a reference case study for MapReduce, we will consider a slight reformulation of a paradigmatic example used in the Apache Spark documentation to introduce the use of this framework [13].

One problem that large online social network providers often face is analyzing their huge user base by various factors, including demographic variables, to find the most inclusive age group. For example, let us take the case of a very large group of people whose names and ages are known. We are interested in first extracting the mode of age distribution and then the largest age group in a short time using a distributed approach.

Suppose the input subjects are described as key-value pairs, where the key contains the unique identifier of each subject and the value contains her/his age. It is possible to solve this problem in an efficient and scalable way by running the following MapReduce algorithm.

First, all key-value pairs are changed by swapping their key with their value and vice versa, using a map function. The resulting collection of pairs is indexed by age. Then we aggregate all key-value pairs that have the same age and evaluate the size of the resulting aggregates using a reduce function. This is then the age distribution mode. Finally, all age frequencies are compared using a reduce function to find the largest age group.

5 Architecture of the visualization

In this section, we describe the basic building blocks of the visualizations generated by MARVEL. These are useful for emulating a generic MapReduce algorithm and explaining its behavior when executed on a distributed system.

At startup, the framework draws on the screen a series of round boxes representing the different nodes involved in the current computation³. Each box contains a visual representation of the data blocks managed by the corresponding executor (see Phase 2 in Sect. 4.1), as well as the animation stages describing the operations performed by that executor (see Phase 3 in Sect. 4.1).

Data blocks are represented as groups of graphical records, each one referring to a different partition of the same collection (see Sect. 2.2.1). The execution of a particular algorithm is represented as a sequence of animation stages (see Phase 3 in Sect. 4.1). Each stage corresponds to a map or reduce transformation and consists of one or more processing steps and possibly a communication step.

Processing Steps. These are visualized simultaneously in the box representing each executor by playing an animation representing the way each block of data is processed by a particular operation during that step.
Communication Steps. These are visualized by playing an animation representing the way records are moved from their origin to their destination executor as required by a particular operation.

In the following, we provide details about the map/reduce transformations commonly used in the creation of MapReduce applications and supported by MARVEL. For the sake of clarity, a summarization of these transformations is reported in Table 1.

Table 1

List of the map (M) and reduce (R) transformations supported by MARVEL

Function Name	Type	Description
filterOnKey	M	Filters the key-value pairs whose key satisfies a condition given by input
filterOnValue	M	Filters the key-value pairs whose value satisfies a condition given by input
split	M	Splits a distributed collection of strings into a series of key-value pairs according to a user-specified separator
swap	M	Swaps the key and the value of each element of an input set of key-value pairs
count	R	Counts the element of an input set of key-value pairs
countByKey	R	Evaluates the frequency of each key
groupByKey	R	Groups all elements that have the same key
maxOnKey	R	Returns the element with the largest key
minOnKey	R	Returns the element with the smallest key
maxOnValue	R	Returns the element with the largest value
minOnValue	R	Returns the element with the smallest value
sumOnKey	R	Evaluates the sum of all keys
sumOnValue	R	Evaluates the sum of all values

5.1 Map transformations

We recall from Sect. 2.2 that these transformations process all elements of an input distributed collection, here defined as \({\mathcal {C}}\), partitioned into data blocks, and return a new distributed collection, here defined as \({\mathcal {R}}\). Both collections can store either values or key-value pairs.

5.1.1 filterOnKey

Description Let \({\mathcal {C}}\) be an input distributed collection of key-value pairs. This transformation filters the elements of \({\mathcal {C}}\) that satisfy a condition given by the input, targeting the key of each element. The result is stored in a new distributed collection of key-value pairs \({\mathcal {R}}\).

Visualization A new empty distributed collection of objects representing \({\mathcal {R}}\) is displayed next to \({\mathcal {C}}\). Then, visualization occurs by highlighting each element of \({\mathcal {C}}\) in turn. If its key satisfies the input condition, it is drawn in green and graphically moved to \({\mathcal {R}}\). Otherwise, it is drawn in red and ignored.

5.1.2 filterOnValue

Same as filterOnKey, but targeting the value field of each element of the input collection.

5.1.3 Split

Description Let \({\mathcal {C}}\) be an input distributed collection of strings. This transformation splits each string of \({\mathcal {C}}\) into a set of strings according to a user-specified separator. The result is stored in a new distributed collection of strings \({\mathcal {R}}\). Optionally, it is possible to assign the value 1 to each element of \({\mathcal {R}}\).

Visualization A new empty distributed collection of objects representing \({\mathcal {R}}\) is displayed next to \({\mathcal {C}}\). Then, visualization is performed by highlighting each element e of \({\mathcal {C}}\) in turn. The contents of e are processed, yielding a new collection of elements (i.e., the output of the split operation) that are graphically moved into \({\mathcal {R}}\). Optionally, the collection of elements produced by the split operation is returned as key-value pairs, with the value always set to 1.

5.1.4 Swap

Description Let \({\mathcal {C}}\) be an input distributed collection of key-value pairs. This transformation swaps the key and value field of each element. The result is stored in a new distributed collection of key-value pairs \({\mathcal {R}}\).

Visualization A new empty distributed collection of objects representing \({\mathcal {R}}\) is displayed next to \({\mathcal {C}}\). Then, visualization is performed by highlighting each element e of \({\mathcal {C}}\) in turn. The contents of e are cloned and then edited to visually swap the position of the key and value fields. The resulting element is graphically shifted in \({\mathcal {R}}\).

5.2 Reduce transformations

These transformations process all elements of an input distributed collection of elements, here defined as \({\mathcal {C}}\), and possibly return a new distributed collection of elements, here defined as \({\mathcal {R}}\). Both collections can store either values or key-value pairs. Note that there are also some reduction operations that are defined as actions rather than transformations, because they return a set of scalar values to the driver program rather than a distributed collection of elements (see Sect. 2.2). For simplicity, we will treat them as transformations in the rest of the paper.

5.2.1 Count

Description Let \({\mathcal {C}}\) be an input distributed collection of elements. This transformation counts the number of elements in \({\mathcal {C}}\). The result is shown on screen using a dialog window.

Visualization Let \({\mathcal {E}}\) be a reference to each executor of the considered distributed system. In each \({\mathcal {E}}\), a new numeric value, initially set to zero, is displayed. It is used as a counter to report the number of items of \({\mathcal {C}}\) stored in \({\mathcal {E}}\). Then, visualization is performed by highlighting each element e of \({\mathcal {C}}\) in turn, and increasing the corresponding counter. Once all executors have finished evaluating their counts, these are combined and the resulting global count is shown on screen using a dialog window.

5.2.2 countByKey

Description Let \({\mathcal {C}}\) be an input distributed collection of elements. This transformation counts the number of key-value pairs for each distinct key in \({\mathcal {C}}\). The result is stored in a new distributed collection of key-value pairs, called \({\mathcal {R}}\).

Visualization

Let \({\mathcal {E}}\) be a reference to each executor of the considered distributed system. In each \({\mathcal {E}}\), a new numeric value, initially set to zero, is displayed. Visualization is performed by highlighting each element e of \({\mathcal {C}}\) in turn. If the key of \({\mathcal {C}}\) has been found for the first time in \({\mathcal {E}}\), a new counter is introduced with that key, and set to 1. If the key of \({\mathcal {C}}\) has already been found before, its frequency is increased by 1. Once finished, for each distinct key \({\mathcal {K}}\), all nodes send their local frequencies of \({\mathcal {K}}\) to the node responsible for that key, by means of a network transmission (shuffle phase). Once all local frequencies have been aggregated, the result is stored in a new distributed collection of key-value pairs \({\mathcal {R}}\).

5.2.3 groupByKey

Description Let \({\mathcal {C}}\) be an input distributed collection of key-value pairs. This transformation groups all elements that have the same key and combines their values into a single list. The result is stored in a new distributed collection of key-value pairs, called \({\mathcal {R}}\).

Visualization

A new empty distributed collection of objects representing \({\mathcal {R}}\) is displayed next to \({\mathcal {C}}\). Then, visualization is performed by highlighting each element e \(=(k,v)\) of \({\mathcal {C}}\) in turn. If the partition of \({\mathcal {R}}\) local to this executor already has an element with key k, then v is added to the corresponding list of values by cloning it and moving it from \({\mathcal {C}}\) to \({\mathcal {R}}\). Otherwise, a new element is added to \({\mathcal {R}}\) ’s local partition, by cloning and moving the key-value pair (k, v). After all elements of \({\mathcal {C}}\) are processed, each key in \({\mathcal {R}}\) is assigned to one of the available executors. Consequently, all key-value pairs with this key are cloned and moved from their original location to their reference executor (shuffle phase). During this process, if multiple key-value pairs with the same key k end up in the same executor, they are merged into a single element with key k and with a value corresponding to the concatenation of their original value lists.

5.2.4 maxOnKey

Description Let \({\mathcal {C}}\) be an input distributed collection of key-value pairs. This transformation returns the element of \({\mathcal {C}}\) with the largest key (for string-based elements, the lexicographic order is taken into account). The result is shown on screen using a dialog window.

Visualization Let \({\mathcal {E}}\) be a reference to each executor of the considered distributed system. In each \({\mathcal {E}}\), a new numeric value, initially set to infinity, is displayed. It is used to report the element of \({\mathcal {C}}\) with the current largest key among those stored in \({\mathcal {E}}\). Then, visualization is performed by highlighting each element e of \({\mathcal {C}}\) in turn and possibly updating the corresponding maximum value. Once all executors have finished evaluating their local maximum, they are combined and the resulting element with the overall largest key is shown on screen using a dialog window.

5.2.5 minOnKey

Same as maxOnKey, but returning the element with the smallest key.

5.2.6 maxOnValue

Same as maxOnKey, but considering the value field of each key-value pair in \({\mathcal {C}}\).

5.2.7 minOnValue

Same as maxOnValue, but returning the element with the smallest value field.

Then the visualization is done by highlighting each element e of \({\mathcal {C}}\) in turn and updating the corresponding sum. Once all executors have finished evaluating their local sums, these are combined and the resulting overall sum is shown on screen using a dialog window.

5.2.8 sumOnKey

Same as maxOnKey, but returning the sum of all keys of \({\mathcal {C}}\).

5.2.9 sumOnValue

Same as maxOnKey, but returning the sum of all values of \({\mathcal {C}}\).

5.3 Case study: solution

We describe here a MARVEL session used to recreate the algorithm based on MapReduce for solving the case study introduced in Sect. 4.2.

At the beginning (environment initialization phase), the teacher defines the number of computing nodes and the size of data blocks that will be used during the teaching session. Then, the teacher chooses an input text file containing a set of records (input initialization phase), holding information about the name and the age of a group of subjects. In this phase, the teacher provides also information for the correct parsing of the input file. After fixing these parameters, MARVEL generates a visual representation of the simulated distributed system, including the content of the provided input file, organized as a distributed data structure (see Fig. 3).

The first step of the algorithm is to swap the keys and the values of the input data structure, so to obtain a distributed data structured indexed by age. This operation can be performed by invoking the built-in MARVEL swap map transformation. The resulting visualization shows the creation of a new distributed data structure, obtained by swapping the keys and the values of the original one. The transformation is performed in parallel by the different nodes and requires no communications among them (see Fig. 4).

The second step of the algorithm requires to compute the histogram of the age distribution, by aggregating and counting all the key-value pairs sharing the same age. This operation can be performed by invoking the built-in MARVEL countByKey reduce transformation. In the resulting visualization (see Fig. 5), each node creates first a local histogram of the age distribution, evaluated according to the local key-value pairs. Then, all nodes send their local frequencies about each key to the node responsible for that key, by means of a network transmission (shuffle phase).

As the last step, the algorithm requires to compare all different age frequencies stored over all nodes of the distributed system, to find the highest one. This operation can be performed by invoking the built-in MARVEL maxOnValue reduce transformation. In the resulting visualization (see Fig. 6), each node will find the local record having the maximum age. All the local maximums will be compared with each other and the global maximum will be then displayed on screen using a modal dialog window.

6 System description

Providing a live visualization of what is happening inside a distributed MapReduce execution is not straightforward, due to the intrinsic complexity of distributed computing frameworks such as Hadoop or Spark. There are two possible approaches. The first approach is to build a system capable of introspecting the architecture of the underlying computing framework and the layout of the distributed data structures available within it, while keeping track of the execution and output of the distributed map or reduce operations. The second approach is to simulate a distributed computing framework in the first place, as well as its MapReduce primitives, and then visualize its behavior.

This second approach is easier to follow, since the distributed execution of an MapReduce algorithm would be simulated only for the part necessary to generate a correct visualization. On the other hand, simulating the behavior of a distributed computing framework requires a much longer development time and may fail to describe how an MapReduce algorithm is translated and executed on a real distributed computing framework.

In our case, we opted for a hybrid solution that combines the best of both approaches. We have developed a system where the MapReduce algorithm to be visualized is executed on a real instance of the Spark distributed computing framework, but on a standalone machine. Most of the details about how Spark partitions an input data structure on the nodes of a distributed system, which are useful for creating an appropriate visualization, are gathered by querying the Spark middleware. Instead, the remaining information is obtained through an emulation of the Spark framework.

In summary, our framework is organized in the following three modules (see Fig. 7 for a high-level design representation):

The MapReduce algorithm module. It is a generic interactive distributed Spark application. It works by instantiating a distributed RDD data structure (see Sect. 2.2.1) whose type and size are chosen by the user at startup, and whose contents are randomly generated or loaded from an external file. This application also implements a standard collection of map and reduce transformations (see Sect. 5).
The Orchestrator module. It is a communication broker used by the visualization module to collect information about the MapReduce algorithm to be visualized, and by the MapReduce algorithm to obtain input parameters and other information required for its execution from the GUI. Some of this information is retrieved by the orchestrator directly from the underlying Spark framework. Other information required to simulate the execution of the algorithm MapReduce on a distributed framework is managed internally by the orchestrator. In addition, this module is responsible for receiving the operation requests issued by the visualization module for executing Map or Reduce transformations on the current distributed data structures. Finally, it is in charge of initializing the starting dataset as a local data structure, that is converted into a RDD by the MapReduce algorithm module. The starting dataset can be generated at random or imported from an input file.
The Visualization module. It is responsible for generating the visualization of the proposed MapReduce algorithm, whose behavior is defined by the user’s interaction with the visualization framework. Each time the user selects a transformation to be executed on the current dataset, this request is sent from this module to the algorithm module MapReduce via the orchestrator. Conversely, the actual execution of the transformation by the algorithm module MapReduce triggers the playback of the corresponding animation phases on the visualization module. The visual content of a MARVEL representation, as well as its animation, is delegated to a set of classes representing the different concepts involved in a MapReduce application, such as: NodeFx, RDDPartitionFx, BlockFx, RecordFx and FieldFx. Each of this class is responsible for illustrating on screen the behavior of its MapReduce counterpart, in a hierarchical way. So, as an RDD partition is made of one or more blocks, containing each several records, so an instance of RDDPartitionFx, holds one or more instances of BlockFx, each made of RecordFx and FieldFx objects. This allows to easily implement complex graphical scenes involving whole RDD or parts of them, by interacting with just one object at time rather than animating all the single objects making up these components. The SystemFx class is used to describe, as a whole, the state of a MARVEL execution, just after a transformation is applied. So, after executing a sequence of MapReduce operations, MARVEL will have several instances of this class. Then, the MarvelController class is able to manage all these instances, making it possible both to run new transformations, so creating a new SystemFx instance, or revert the current MARVEL state to a previous SystemFx instance, thus rewinding the algorithm execution back in time.

In recent years, several academic results have been published that address the problem of how to effectively teach topics related to Big Data processing, such as [14‐19]. Distributed computing has become a key element for the Computer Science curriculum. Therefore, some educators [20] argue for the need to include these topics in the early courses. Even during the pandemic period, educators attempted to ensure high-quality remote teaching of these topics. In [21], the authors discuss about the efficacy of the remote material they provided to their student on computational platform.

Most of previous works address the issue of designing a Big Data course or curriculum in which MapReduce and its enabling technologies, Hadoop and/or Spark, typically play a central role. Therefore, their focus is mostly on organizational issues, while the problem of teaching MapReduce is not given special attention.

On the other hand, there are also contributions that are focused on proposing new approaches for supporting the teaching of the MapReduce paradigm. In [22], the authors address the problem of simplifying the implementation of MapReduce algorithms using the Hadoop framework. Their motivation is that this framework requires technical skills that are often lacking among students in introductory computer science courses. For this reason, they present a web-based framework that allows students to perform MapReduce computations by providing the algorithmic code to run, while the definition of the several configuration parameters typically by Hadoop is done with the help of a simplified web-based interface.

A similar goal is pursued by Phoenix++ [23]. It is a framework that introduces undergraduate students to the MapReduce paradigm while avoiding the technical intricacies that usually come into play when using a real distributed system. The students are required only to implement map/reduce functions, while the Phoenix++ automates the parallelization of the jobs. This is an effective tools to introduce the concept of scalability or the decomposition of a problem into independent tasks, but it does not provide deeper insights about the distributed architectures where MapReduce algorithms actually run. Indeed, Phoenix++ is mostly a tool useful to quickly setup a simulated distributed system, run simple examples and evaluate their performance in terms of execution times but it is not aimed to simplify the understanding of the Behavior of MapReduce applications using visualizations or other efficient solutions, whether for teaching or other purposes.

A recent work [24] proposes a gamification approach to make easier the teaching of a distributed architecture. In the authors’ classes, the fundamentals of the Hadoop framework are introduced using a role-play game, where students are divided into teams imitating the behavior of the cluster components.

Notice that all contributions mentioned so far essentially aim to simplify the implementation and the deployment of MapReduce algorithms, but do not tackle at all the problem of supporting the comprehension of this kind of algorithms. Broadening our perspective, we note that, since Stasko’s pioneering work in [25], the use of computer graphics has been seen as a potentially effective medium for illustrating the behavior of an algorithm and helping a user understand its meaning.

Apart from computer algorithms, software visualization has been used as a supportive tool in a variety of cases, from illustrating cryptographic protocols [26] to teaching the database Structured Query Language (SQL) [27].

This approach has also been taken to support the teaching of parallel and distributed algorithms. To name a few, Naps and Chan in [28] proposed two techniques for animating parallel programs that allow students to better understand a parallel target algorithm by also interactively participating in its construction. A similar approach was taken by Ben Ari in [29] and by Carr et al. [30] with the introduction of two systems for the implementation and visualization of interactive parallel programs. In both cases, the user (i.e., the instructor) is provided with a programming library useful for implementing a parallel program. Upon execution, the resulting code is automatically visualized using some appropriate graphical metaphors. The user (i.e., the instructor or the student) can also visualize the network of processes involved in the parallel execution and the flow of communication between these processes. Danner et al. proposed a debugging tool for GPU and CPU programs in [31]. The student first implements a routine and then uses the proposed tool to verify that the parallel execution of his/her code is correct using a computer-generated visualization. In [32], another tool has been proposed for visualizing a set of processes executing a parallel algorithm. It has a rich set of standard 2D computational patterns to visualize the parallelism more easily. Each process is identified by a color. The same color is also used by the graphical objects describing the 2D computational patterns to describe the activity of each process.

All the above systems were designed to support the teaching of explicitly parallel algorithms. In this case, the code to be executed on each node of a parallel computation must use some kind of communication primitives to explicitly exchange data with other nodes (e.g., send-to-one, send-to-may, scatter, gather). For this reason, the focus of these systems is primarily on the communication patterns that occur between the various actors in the parallel execution, rather than on how data is processed.

8 Experimental validation

As noted in Sect. 4, the use of MARVEL brings several advantages to the teacher over the use of electronic slides or a whiteboard. However, it remains to be seen what cost (if any) a student pays in terms of the learning experience when participating in a lesson taught using this approach. For this reason, we conducted an experimental study to determine whether the use of MARVEL provides students with an understanding of MapReduce that is at least equivalent to that achieved using traditional teaching methods. In the remainder of this section, we explain the framework of our experiment and then discuss its results.

8.1 Experimental setting

We selected a group of 26 volunteer statistics students who had started the “Big Data Management and Computation” course, had already taken the “Fundamentals of Computer Science” exam, and had no prior knowledge of MapReduce or distributed computing.

They were randomly divided into two equal groups: Group A and Group B. Then each group received an introductory MapReduce lesson of about one hour duration. The content of the two lessons was identical and is described below:

Topic 1: Brief introduction to distributed computing;
Topic 2: Formalization of the MapReduce paradigm;
Topic 3: Analysis of MapReduce using two example computer security applications on two toy datasets. The example applications are the event counting problem (i.e., counting how many times a given event occurs in a log file) and the maximum number of matches problem (i.e., determining the pair of events that occurs most often in a set of records describing some malicious activity).

The only difference between the two lessons was the approach the teacher used to introduce and discuss MapReduce. The first lesson was taught using MARVEL to explain Topic 2 and Topic 3, while the second lesson was taught entirely with electronic slides, with about 30 minutes of preparation time. The two lessons were held consecutively in the same classroom. At the end of each lesson, participants were asked to complete an anonymous multiple-answer test with five questions arranged in two blocks:

Block 1: Understanding the MapReduce Paradigm. The goal is to assess whether the student understood how MapReduce works and how it can be used to solve specific classes of problems.
Block 2: MapReduce Problem Solving. An algorithmic problem will be used to determine if the student is able to find the correct MapReduce solution from a selection of possible solutions.

In addition, participants were asked to indicate the degree they had received when passing the “Fundamentals of Computer Science” exam. This information was later used to assess whether each participant’s performance correlated with mastery of the major topics in computer science.

8.2 Results and discussions

We summarize the responses of the two groups to the questions in block 1 using the boxplots in Fig. 8. Note that one participant who was assigned to group B did not show up for the test. As can be clearly seen, the performances of the two groups were very similar. However, it can be observed that the median and minimum scores of group A (who participated in the lesson with MARVEL) were slightly smaller than those of group B. In fact, the distribution of scores for the two groups was completely identical, with the only exception of one participant in group A who skipped all the questions. Let us now turn to the answers to the questions from block 2, which are summarized in the boxplots in Fig. 9. Again, the performances of the two groups are almost identical. The only small difference we notice here is due to one participant in Group B, who was the only one not to answer both questions.

These results are encouraging as they suggest that the use of MARVEL does not detract from a class’ learning experience when introduced to the MapReduce paradigm compared to the electronic slide-based approach. However, given the relatively small sample we used, we need to be sure that these results were not affected by sampling bias. In other words, we need to rule out the possibility that the random selection procedure we used caused one group of qualified students to answer the final questions better than the other group, regardless of the method by which the lesson was taught. To investigate this potential problem, we looked at the variable describing participants’ computer science skills as measured by the grade they achieved on the “Fundamentals of Computer Science” exam. First, we tested the correlation \(\rho\) between this variable and participants’ performance in terms of correct answers on the final test using the function cor.test() available in the R statistical framework. Given a significance level set to \(\alpha =5\%\), this function allows to reject the hypothesis \(\rho =0\) if the observed significance (also called p value) is lower than \(\alpha\). Since the p value obtained is lower than \(5\%\), we find that the two variables under consideration are correlated with a significance level of 0.05. In Fig. 10 we also see that the distribution of this variable is very similar in the two groups.

9 Conclusions

Application domains that involve the analysis of large amounts of data often require the use of distributed computing to ensure reasonable execution times. However, learning and teaching distributed programming techniques, such as those based on the MapReduce paradigm, is not easy.

Despite its apparent simplicity, the MapReduce paradigm and its applications can be difficult for students to understand when dealing with complex algorithms that involve a long sequence of Map and/or Reduce operations. Similarly, it can be difficult to understand how an abstract MapReduce algorithm manages to process a given dataset while making full use of the computational capabilities of a distributed system. For these reasons, we have introduced a new software visualization-based approach to support the teaching of the MapReduce paradigm. With this approach, implemented with a software prototype called MARVEL, it is possible to show in a clear and understandable way how a data structure is distributed in blocks across the nodes of a distributed system by resorting to appropriate graphical representations. The same representation is also useful to illustrate how these blocks are processed in a distributed manner and how a MapReduce algorithm manages to transform the input data into the expected output data by means of a sequence of Map and/or Reduce operations interactively selected by the user of the system. From the teacher’s point of view, the use of our approach may allow a reduction of the time needed to prepare a lesson, compared to other approaches based, for example, on whiteboards or electronic slides, but with a similar flexibility. From the students’ point of view, the experimental study we conducted shows that using our approach to understand the MapReduce paradigm is not a disadvantage in the learning experience compared to using electronic slides.

Finally, we see as a future direction for our work the development of appropriate visual metaphors and layout techniques that can describe the execution of MapReduce algorithms at a supercomputing scale, so that we are able to present and explain some phenomena that occur when distributed algorithms are executed in such an environment (e.g., scalability issues, network congestion).

Acknowledgments

All authors are grateful to Prof. Raffaele Giancarlo for helpful discussions. All authors would like to thank the Department of Statistical Sciences of University of Rome - La Sapienza for computing time on the TeraStat Spark cluster [33].

Declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Vorheriger Artikel A novel 2-phase consensus with customized feedback based group decision-making involving heterogeneous decision-makers

Nächster Artikel IoT-fog-based healthcare 4.0 system using blockchain technology

A preliminary version of this paper has been presented in [5].

A copy of our framework is available for download at https://github.com/fpalini/marvel.

For simplicity, we assume that each node of the distributed system hosts only one executor. In the rest of the paper, we will use both terms equally.

Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation (OSDI), vol 6, pp 137–150

O’Malley O (2008) Terabyte Sort on Apache Hadoop. Yahoo, pp 1–3. http://sortbenchmark.org/YahooHadoop.pdf

Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, vol 10. USENIX Association, p 10

Woods P (2012) The New Era of Big Data Security Analytics. https://searchsecurity.techtarget.com/feature/The-new-era-of-big-data-security-analytics

Ferraro Petrillo U (2018) In: Au MH, Yiu SM, Li J, Luo X, Wang C, Castiglione A, Kluczniak K (eds) Network and system security. Springer, Cham, pp 349–360

Hughes C, Hughes T (2004) Parallel and distributed programming using C++. Addison-Wesley ProfessionalMATH

Walker DW, Dongarra JJ (1996) Mpi: a standard message passing interface. Supercomputer 12:56–68

Geist A, Beguelin A, Dongarra J, Jiang W, Manchek R, Sunderam VS (1994) PVM: parallel virtual machine: a users’ guide and tutorial for networked parallel computing. MIT PressCrossRefMATH

Ferraro-Petrillo U, Roscigno G, Cattaneo G, Giancarlo R (2018) Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms. Bioinformatics (Oxford, England) 34(11):1826–1833CrossRef

10.

Ferraro Petrillo U, Sorella M, Cattaneo G, Giancarlo R, Rombo SE (2019) Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics. BMC Bioinform 20(4):138. https://doi.org/10.1186/s12859-019-2694-8CrossRef

11.

Abdolazimi R, Heidari M, Esmaeilzadeh A, Naderi H (2022) In: 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC). IEEE, pp 0112–0118

12.

Verma N, Malhotra D, Singh J (2020) Big data analytics for retail industry using mapreduce—a priori framework. J Manag Analyt 7(3):424–442

13.

Apache Software Foundation (2016). Spark. http://spark.apache.org/

14.

Deb D, Fuad M, Irwin K (2019) In: Proceedings of the 50th ACM Technical Symposium on Computer Science Education, pp 2–8

15.

Ngo LB, Duffy EB, Apon AW (2014) In: 2014 IEEE International on Parallel & Distributed Processing Symposium Workshops (IPDPSW). IEEE, pp 1114–1121

16.

Shamsi JA, Durrani NM, Kafi N (2015) In: 2015 IEEE International Parallel and Distributed Processing Symposium Workshop. IEEE, pp 772–778. https://doi.org/10.1109/IPDPSW.2015.88

17.

Eckroth J (2016) In: Proceedings of the 47th ACM Technical Symposium on Computing Science Education. ACM, pp 175–180

18.

Eckroth J (2018) A course on big data analytics. J Parallel Distrib Comput 118:166–176CrossRef

19.

Hoque E, Lee H, Killian C, Nita-Rotaru C (2016) A testing platform for teaching secure distributed systems programming. Purdue e-Pubs

20.

Ghafoor S, Prasad S, Weems C (2022) In: Proceedings of the 53rd ACM Technical Symposium on Computer Science Education V. 2, pp 1198–1198

21.

Adams JC, Brown R, Matthews SJ, Shoop E (2021) In: 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, pp 342–349

22.

Brown ER, Garrity P, Yates T, Northfield M, Shoop E, Saint Paul M (2011) In: Midwest Instruction and Computing Symposium (Midwest Instruction and Computing Symposium), pp 758–758

23.

Matthews SJ (2017) Using phoenix++ mapreduce to introduce undergraduate students to parallel computing. J Comput Sci Coll 32(6):165–174

24.

Yang Z, Guo X (2020) Teaching Hadoop using role play games. Decis Sci J Innov Educ 18(1):6–21CrossRef

25.

Stasko JT (1990) Tango: S framework and system for algorithm animation. ACM SIGCHI Bull 21(3):59–60CrossRef

26.

Cattaneo G, De Santis A, Ferraro Petrillo U (2008) Visualization of cryptographic protocols with GRACE. J Vis Lang Comput 19(2):258–290. https://doi.org/10.1016/j.jvlc.2007.05.001CrossRef

27.

Cembalo M, De Santis A, Ferraro Petrillo U (2011) In: Goda BS, Sobiesk E, Connolly RW (eds) Proceedings of the 2011 Conference on Information Technology Education, SIGITE ’11. ACM, New York, pp 165–170. https://doi.org/10.1145/2047594.2047641

28.

Naps TL, Chan EE (1999) In: The Proceedings of the Thirtieth SIGCSE Technical Symposium on Computer Science Education, SIGCSE ’99. ACM, New York, pp 232–236. https://doi.org/10.1145/299649.299767

29.

Ben-Ari M (2001) Interactive execution of distributed algorithms. J Educ Resour Comput 10(1145/384055):384057

30.

Carr S, Fang C, Jozwowski T, Mayo J, Shene CK (2003) In: 2003 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), pp 1676–1682

31.

Danner, A, Newhall, T, Webb KC (2019) In: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, pp 326–333

32.

Lasserre A, Namyst R, Wacrenier P (2020) In: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, pp 276–283

33.

Bompiani E, Ferraro Petrillo U, Jona-Lasinio G, Palini F (2020) In: Proceedings of the 2020 IEEE International Conference on Dependable, Autonomic and Secure Computing. IEEE, pp 499–506. https://doi.org/10.1109/DASC-PICom-CBDCom-CyberSciTech49142.2020.00088

Titel: Using software visualization to support the teaching of distributed programming
verfasst von: Lorenzo Di Rocco
Umberto Ferraro Petrillo
Francesco Palini
Publikationsdatum: 16.09.2022
Verlag: Springer US
Erschienen in: The Journal of Supercomputing / Ausgabe 4/2023
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-022-04805-9

Springer Professional

Abstract

Publisher's Note

1 Introduction

1.1 Organization of the paper

2 Distributed programming

2.1 MapReduce

2.2 Apache spark

2.2.1 Resilient distributed datasets

3 Motivations

3.1 Visualization as a solution

4 Our contribution

4.1 MARVEL

4.2 Case study

5 Architecture of the visualization

5.1 Map transformations

5.1.1 filterOnKey

5.1.2 filterOnValue

5.1.3 Split

5.1.4 Swap

5.2 Reduce transformations

5.2.1 Count

5.2.2 countByKey

5.2.3 groupByKey

5.2.4 maxOnKey

5.2.5 minOnKey

5.2.6 maxOnValue

5.2.7 minOnValue

5.2.8 sumOnKey

5.2.9 sumOnValue

5.3 Case study: solution

6 System description

7 Related work

8 Experimental validation

8.1 Experimental setting

8.2 Results and discussions

9 Conclusions

Acknowledgments

Declarations

Conflict of interest

Publisher's Note

Weitere Artikel der Ausgabe 4/2023

A randomized algorithm for the wait-free consensus problem

High-level power estimation techniques in embedded systems hardware: an overview

Architecting threat hunting system based on the DODAF framework

Correction to: ReNo: novel switch architecture for reliability improvement of NoCs

The analysis of agricultural Internet of things product marketing by deep learning

Computer vision-based deep learning for supervising excavator operations and measuring real-time earthwork productivity

Premium Partner