DfAnalyzer is an instantiation of the ARMFUL architecture that provides dataflow analysis for scientific applications. By coupling DfAnalyzer components to the source code of a scientific application, scientific data is extracted and related along a dataflow, ready for queries at runtime. DfAnalyzer supports monitoring, debugging, provenance registration, and steering, all at runtime. Its lightweight components do not degrade the parallel performance of the scientific application. As opposed to SWMSs, DfAnalyzer keeps all parallel execution control in the scientific application code, so applications may use highly efficient libraries or invoke black-box parallel code. DfAnalyzer can also be used by computational and computer specialists to incorporate user steering support. The following table shows how each ARMFUL component maps to a DfAnalyzer component:
| ARMFUL component | DfAnalyzer component |
| --- | --- |
| Provenance Data Gathering | Provenance Data Extractor |
| Raw Data Extraction | Raw Data Extractor |
| Index Generation | Raw Data Indexer |
| Query Processor | Dataflow Viewer and Query Interface |
Table 1. Components of DfAnalyzer
The Provenance Data Extractor (PDE) is a RESTful API that manages the dataflow at the physical (i.e., file flow) and logical (i.e., data flow) levels. By receiving HTTP requests, PDE captures provenance and scientific data from scientific application code or from workflows modeled in an SWMS, generating an in-memory JSON object with the gathered data and their data dependencies. The JSON object generated by PDE has a hierarchical structure that follows the dataflow concepts presented in our CCPE paper published in 2016. Besides capturing provenance and scientific data, PDE also loads these data into DfAnalyzer's provenance database to enable online query processing.
More information about PDE can be found here.
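To make the hierarchical structure concrete, the sketch below builds a JSON object in the spirit of the one PDE keeps in memory: a dataflow groups transformations, each transformation groups tasks, and each task records the datasets and data elements it consumes and produces. All tags, field names, and values here are illustrative assumptions, not the actual PDE schema.

```python
import json

# Hypothetical in-memory dataflow object; names are illustrative only.
dataflow = {
    "tag": "example_dataflow",             # dataflow identifier
    "transformations": [
        {
            "tag": "mesh_generation",      # one step of the dataflow
            "tasks": [
                {
                    "id": 1,
                    "input": {"dataset": "imesh_params",
                              "elements": [[0.01, 1000]]},
                    "output": {"dataset": "omesh",
                               "elements": [["mesh_0001.vtk"]]},
                }
            ],
        }
    ],
}

# PDE would serialize a structure like this and load it into the
# provenance database; here we only show the serialization step.
payload = json.dumps(dataflow)
print(payload)
```

The nesting (dataflow → transformations → tasks → dataset elements) is what allows a later query to relate an output element back to the inputs and transformation that produced it.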
Before sending retrospective provenance data (i.e., scientific data from the execution of a scientific application) with PDE, users have to access and extract scientific data from raw data files. The Raw Data Extractor (RDE) is a binary program that provides raw data extraction from files; to run it, users define a set of input arguments.
Since DfAnalyzer is based on the component-based ARMFUL architecture, users can develop new cartridges (i.e., new algorithms) for extracting scientific data from raw data files. Given these input arguments, the RDE program visits each raw data file in turn and extracts the values of the attributes specified by the user. All extracted raw data (from several files) are stored in a single output file in Comma-Separated Values (CSV) format, whose header lists the selected attributes.
More information about RDE can be found here.
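The extraction loop above can be sketched as follows. This is a minimal illustration, not RDE's actual implementation: `parse_file` stands in for a format-specific cartridge, and the function names are assumptions.

```python
import csv
import io

def extract_to_csv(raw_files, attributes, parse_file):
    """Visit each raw data file in turn, extract the user-selected
    attributes via a cartridge (parse_file), and accumulate everything
    into one CSV whose header lists the selected attributes."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(attributes)                 # header = selected attributes
    for path in raw_files:
        for record in parse_file(path):         # cartridge yields dicts
            writer.writerow([record[a] for a in attributes])
    return out.getvalue()

# Toy cartridge standing in for a real file-format parser.
def toy_cartridge(path):
    yield {"pressure": 101.3, "velocity": 2.5, "source": path}

csv_text = extract_to_csv(["a.raw", "b.raw"],
                          ["pressure", "velocity"],
                          toy_cartridge)
print(csv_text)
```

A new cartridge only has to yield records keyed by attribute name; the surrounding loop and the CSV output stay the same.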
Besides raw data extraction, some approaches index raw data in files to reduce the volume of data to be loaded into an external repository or database, such as FastBit, NoDB, RAW, and SDS/Q. The Raw Data Indexer (RDI) takes the same input arguments as the RDE program, but it generates indexes for the scientific data extracted from the files. The RDI binary program ships with cartridges that use indexing techniques from existing approaches, such as the FastBit tool, and a cartridge based on positional indexing, following RAW's implementation. Users can also develop new cartridges to apply other indexing techniques to their raw data file formats.
More information about RDI can be found here.
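The idea behind a positional-indexing cartridge (in the spirit of RAW's positional maps) can be sketched in a few lines: rather than eagerly extracting values, record the byte offset where each record starts, so a later query can seek directly to the rows it needs. The code below is an illustrative assumption, not RDI's actual cartridge code.

```python
def build_positional_index(lines):
    """Return index where index[i] is the byte offset of record i."""
    index = []
    offset = 0
    for line in lines:
        index.append(offset)
        offset += len(line)
    return index

# Toy raw file content: three fixed-width records.
raw = "0.1 0.2\n0.3 0.4\n0.5 0.6\n"
lines = raw.splitlines(keepends=True)
index = build_positional_index(lines)

# Fetch record 2 via its byte offset, without scanning earlier records.
start = index[2]
record = raw[start:raw.index("\n", start)]
print(record)   # "0.5 0.6"
```

The index is small (one offset per record), so it can be built once and reused by many queries over the same raw file.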
DfAnalyzer delivers a graphical interface, the Dataflow Viewer (DfViewer), which provides a dataset-perspective view of the dataflow specifications registered in DfAnalyzer's database. DfViewer lists the dataflows registered in the provenance database, and users choose which dataflow specification they would like to visualize.
More details about DfViewer can be found here.
Since provenance and raw data are already stored in a relational database, the Query Interface (QI) can be invoked to query them. QI is a RESTful API that receives HTTP requests with input arguments in their messages. These arguments specify the source and target datasets of the dataflow fragment to be analyzed, the attributes to be accessed from the visited datasets, the conditions applied to the dataflow fragment to select specific data elements, and the datasets to be included in or excluded from the dataflow fragment. After the query specification is submitted, QI automatically generates a SQL query against our MonetDB database and runs it. QI then returns a CSV file with the query results.
More details about QI can be found here.
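A minimal sketch of the query-generation step, under stated assumptions: the table names, the `task_id` join key, and the function name below are illustrative; the real QI derives the joins between datasets from the registered dataflow specification.

```python
def build_query(source, target, attributes, conditions=None):
    """Translate a dataflow-fragment specification (source/target
    datasets, projected attributes, selection conditions) into SQL.
    The join key is a simplifying assumption for illustration."""
    sql = "SELECT {} FROM {} JOIN {} ON {}.task_id = {}.task_id".format(
        ", ".join(attributes), source, target, source, target)
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql

query = build_query("imesh_params", "omesh",
                    ["imesh_params.resolution", "omesh.mesh_file"],
                    ["imesh_params.resolution < 0.05"])
print(query)
```

Running such a statement against the provenance database and serializing the result set as CSV yields the file QI returns to the user.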
Besides the description of each DfAnalyzer component (PDE, RDE, RDI, DfViewer, and QI), we submitted a demonstration paper of this tool to the VLDB conference (under review). In this demonstration, we present a Spark application that uses DfAnalyzer to extract provenance and scientific data at runtime, and we show its analytical capabilities using DfViewer (the graphical interface with a dataset-perspective view of the dataflow) and QI (the query interface for analyzing dataflow fragments). More details about this demonstration can be found in our git repository with a Spark application using DfAnalyzer.