The DfAnalyzer tool is an instantiation of the ARMFUL architecture that can be coupled to existing Scientific Workflow Management Systems (SWMS) and scientific applications. DfAnalyzer leaves all parallel execution control in the code of the SWMS or the scientific application. Coupling the Provenance Data Gatherer is recommended only for an SWMS that does not manage provenance data. If the SWMS already manages provenance data, we recommend extending its provenance support to manage raw data from files. Each ARMFUL component maps to a specific program of the DfAnalyzer tool, as follows:
| ARMFUL component | JAR program of the DfAnalyzer tool |
|---|---|
| Provenance Data Gathering | Provenance Data Gatherer |
| Raw Data Extraction | Raw Data Extractor |
| Index Generation | Raw Data Indexer |
| Provenance Database | Data Ingestor |
| Query Processor | Query Processor |
Table 1. JAR programs of the DfAnalyzer tool
The Provenance Data Gatherer (PG) program manages the dataflow at the physical (i.e., file flow) and logical (i.e., data flow) levels. PG captures provenance and domain-specific data from scientific application code or an SWMS and generates a JSON file with the gathered data and their dependencies. The JSON files generated by PG have a hierarchical structure, in accordance with the dataflow concepts presented in our CCPE paper published in 2016.
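To make the idea of a hierarchical JSON file with dependencies more concrete, the following Python sketch builds one such document. It is only an illustration: the field names (`dataflow`, `transformation`, `task`, `dependencies`, `files`, `elements`) and the helper `build_task_document` are hypothetical and are not taken from the actual PG output format.

```python
import json


def build_task_document(dataflow, transformation, task_id, depends_on, files, elements):
    """Nest the gathered data of one task under its transformation and dataflow.

    All keys are hypothetical placeholders, not the real PG schema.
    """
    return {
        "dataflow": dataflow,
        "transformation": {
            "tag": transformation,
            "task": {
                "id": task_id,
                # identifiers of the tasks whose output this task consumed
                "dependencies": list(depends_on),
                # physical level: files read or written by the task
                "files": list(files),
                # logical level: domain-specific values gathered for the task
                "elements": list(elements),
            },
        },
    }


DOCUMENT = build_task_document(
    dataflow="simulation",
    transformation="solver",
    task_id=2,
    depends_on=[1],
    files=["step_2.csv"],
    elements=[{"time": 0.2, "pressure": 100.8}],
)
TEXT = json.dumps(DOCUMENT, indent=2)
print(TEXT)
```

The nesting keeps the file-level and data-level information of a task in one place, together with the dependencies that link it to earlier tasks.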
More information about PG can be found here.
Before gathering domain-specific data with the PG program, users have to access and extract raw data from files. The Raw Data Extractor (RDE) program provides this raw data extraction. To run this program, users have to define the following input arguments:
Since different cartridges can be developed in ARMFUL, users can change the extraction algorithm according to their file format or other requirements. Given these input arguments, the RDE program accesses each raw data file in turn and extracts the values of the selected attributes. All extracted raw data (from several files) are stored in a single output file in Comma-Separated Values (CSV) format, whose header lists the selected attributes.
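The extraction step described above can be sketched in a few lines of Python. This is not the RDE program itself (which is distributed as a JAR) but a minimal illustration under stated assumptions: the raw files are themselves delimited text with a header, and the names `extract_attributes`, `write_csv`, `RAW_FILES`, and `SELECTED` are hypothetical.

```python
import csv
import io


def extract_attributes(raw_files, selected_attributes, delimiter=","):
    """Visit each raw data file in turn and keep only the selected attributes."""
    rows = []
    for content in raw_files.values():
        reader = csv.DictReader(io.StringIO(content), delimiter=delimiter)
        for record in reader:
            rows.append({attr: record[attr] for attr in selected_attributes})
    return rows


def write_csv(rows, selected_attributes):
    """Serialize all extracted rows as one CSV text whose header is the selected attributes."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=selected_attributes, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical in-memory stand-ins for two raw data files
RAW_FILES = {
    "step_1.csv": "time,velocity,pressure\n0.0,1.5,101.3\n0.1,1.7,101.1\n",
    "step_2.csv": "time,velocity,pressure\n0.2,1.9,100.8\n",
}
SELECTED = ["time", "pressure"]

OUTPUT = write_csv(extract_attributes(RAW_FILES, SELECTED), SELECTED)
print(OUTPUT)
```

A different cartridge would correspond to swapping `extract_attributes` for a function that understands another file format while keeping the same CSV output.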
More information about RDE can be found here.
Besides raw data extraction, some approaches, such as FastBit, NoDB, RAW, and SDS/Q, index raw data from files to reduce the volume of data to be loaded into an external repository or a database system. The Raw Data Indexer (RDI) takes the same input arguments as the RDE program, but it generates indexes over the raw data accessed from files. RDI provides cartridges that use existing approaches, such as the FastBit tool, as well as a cartridge based on positional indexing that follows RAW's implementation.
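As an illustration of the general idea behind positional indexing, the sketch below records the byte offset of every record in a raw file so that a record can later be read directly from the file instead of being copied into a database. It is a simplified, assumed example and does not reproduce the RDI cartridge or RAW's implementation; `build_positional_index`, `read_record`, and the sample data are hypothetical.

```python
import io


def build_positional_index(raw_bytes):
    """Return the byte offset at which each data record (line) starts, skipping the header."""
    offsets = []
    position = 0
    for line in io.BytesIO(raw_bytes):
        offsets.append(position)
        position += len(line)
    return offsets[1:]  # the first line is the header, not a record


def read_record(raw_bytes, offsets, record_number):
    """Jump straight to one record using its stored offset and split it into fields."""
    stream = io.BytesIO(raw_bytes)
    stream.seek(offsets[record_number])
    return stream.readline().decode("utf-8").rstrip("\n").split(",")


# Hypothetical raw data file content
RAW = b"time,pressure\n0.0,101.3\n0.1,101.1\n0.2,100.8\n"
INDEX = build_positional_index(RAW)
print(INDEX)
print(read_record(RAW, INDEX, 1))
```

Only the offsets need to be stored alongside the provenance data; the raw values stay in the original file and are fetched on demand.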
More information about RDI can be found here.
Once provenance and raw data have been gathered by the PG program, the Data Ingestor (DI) has to be started to store provenance data and extracted/indexed raw data from files in a relational database. To do so, this program initializes a database server and performs the store operations on the database. This relational database follows our PROV-Df data model, which is PROV-compliant.
More details about DI can be found here.
Once provenance and raw data are stored in the relational database, the Query Processor (QP) can be invoked to query them. This program takes as input arguments the database connection properties and an SQL-based query. QP connects to the database and runs the specified query. To do so, QP has to identify which attributes have to be queried from provenance tables and which attributes are related to domain-specific data (raw data extracted/indexed from files).
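The kind of query QP runs, combining provenance attributes with domain-specific attributes, can be sketched as follows. The example uses an in-memory SQLite database as a stand-in for the database server, and the tables `task` and `extracted_data` with their columns are hypothetical; they are not the PROV-Df schema.

```python
import sqlite3


def run_query(connection, sql, parameters=()):
    """Run one SQL query and return its column names and rows."""
    cursor = connection.execute(sql, parameters)
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()


# Assumption: SQLite in memory replaces the real database connection properties
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE task (id INTEGER PRIMARY KEY, transformation TEXT, status TEXT);
    CREATE TABLE extracted_data (task_id INTEGER REFERENCES task(id), time REAL, pressure REAL);
    INSERT INTO task VALUES (1, 'solver', 'FINISHED'), (2, 'solver', 'RUNNING');
    INSERT INTO extracted_data VALUES (1, 0.0, 101.3), (1, 0.1, 101.1), (2, 0.2, 100.8);
    """
)

# Provenance attributes (t.id, t.status) joined with domain-specific attributes (d.time, d.pressure)
QUERY = """
SELECT t.id, t.status, d.time, d.pressure
FROM task AS t
JOIN extracted_data AS d ON d.task_id = t.id
WHERE d.pressure > ?
ORDER BY d.time
"""

COLUMNS, ROWS = run_query(conn, QUERY, (101.0,))
print(COLUMNS)
print(ROWS)
```

The join condition is where provenance and domain data meet: each extracted row carries the identifier of the task that produced it, so a single query can filter on domain values and report the corresponding execution state.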
More details about QP program can be found here.