Major Pipeline in Multiple Pipelines
Tasks of the Major Pipeline:
The major pipeline is the pipeline that generates minor pipelines. It has the following tasks:
- Investigating all the data sets.
- Determining the settings for each specific data set.
- Making the submission of the minor pipelines possible.
The first task is investigating all the data sets that have to be processed. For each specific data set, the major pipeline creates a minor pipeline as a job for the grid cluster of computers or the supercomputer. The major pipeline puts each minor pipeline in a job, so each job contains exactly 1 minor pipeline. Each minor pipeline is submitted together with the data set that it has to analyze.
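As a minimal sketch of this first task, assuming the data sets live in a directory called datasets/ and the minor pipeline is a script called MinorPipeline.sh (both names are hypothetical), the packaging of one minor pipeline per job could look like this:

#!/bin/bash
# Hypothetical sketch: wrap each data set in its own job script,
# so that each job contains exactly one minor pipeline run.
for dataset in datasets/*; do
    name=$(basename "$dataset")
    {
        echo '#!/bin/bash'
        echo "bash MinorPipeline.sh \"$dataset\""
    } > "Job_${name}.sh"
done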
The second task is determining, for each specific data set, the settings that define on what type of computer the data have to be analyzed. For example, the major pipeline will give a big data set a computer with high speed, much memory, many nodes and a long calculation time. Conversely, it will give a small data set a computer with normal or lower speed, less memory, 1 or a few nodes and a relatively short calculation time. In that way, the data sets are analyzed on the right computers in the grid cluster or supercomputer. It would be inefficient to run a relatively small data set on a very fast, big computer; the major pipeline prevents this inefficient distribution of jobs with minor pipelines.
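To make these settings concrete, here is what the header section of a job script might look like for a large and for a small data set, assuming a SLURM scheduler (the #SBATCH directives and the values are only illustrative; other schedulers such as SGE or PBS use a different syntax):

# Illustrative header for the job of a large data set:
#SBATCH --nodes=4
#SBATCH --mem=64G
#SBATCH --time=48:00:00

# Illustrative header for the job of a small data set:
#SBATCH --nodes=1
#SBATCH --mem=4G
#SBATCH --time=01:00:00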
The third task is making the submission of these minor pipelines with their data sets possible. For this, the major pipeline generates a submit job script. In this submit job script, all the jobs - each with its minor pipeline and its data set - are put into a queue. This submit job script is run after the major pipeline has finished running; after that, the minor pipelines can be submitted and started.
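As a hedged sketch of this third task, the major pipeline could collect one submission command per generated job script into the submit job script (sbatch is the SLURM submission command; on SGE/PBS it would be qsub):

#!/bin/bash
# Hypothetical sketch: write one submission command per job script
# into a single submit job script.
echo '#!/bin/bash' > SubmitJobs.sh
for jobscript in Job_*.sh; do
    echo "sbatch $jobscript" >> SubmitJobs.sh
done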
Detailed description of how the Major Pipeline works:
Figure ... shows the total overview of how a multiple pipeline works. This figure was already shown and explained on the page General Information about Multiple Pipelines; here, the detailed working of the major pipeline is explained.
At the bottom of the figure, the major pipeline is represented by the blue rectangle.
Figure ...: Overview of how a multiple pipeline works.
In this figure, you can see how a multiple pipeline works in combination with its minor pipelines. While the major pipeline is running, it creates minor pipelines in jobs. According to the conditions of each data set, a dedicated minor pipeline is created that is fully suited for analyzing that specific data set. Each minor pipeline is put into a job, and each job can then be submitted.
The main task of the major pipeline is generating minor pipelines according to the data sets that have to be processed. These minor pipelines are saved in jobs, each of which has the right settings for submission. At the end, all these jobs are put into 1 submit job script.
Detailed explanation of the major pipeline with example data sets:
The example data sets in figure ... will be used to explain the working of the major pipeline in more detail.
In figure ..., you can see that there are 3 data sets. Data set number 1 is a medium-sized data set, data set number 2 is a small data set, and data set number 3 is a large data set.
When the major pipeline is started, it begins by scanning the first data set. For data set number 1, the major pipeline determines what type of data it contains and what size it has, and it calculates how much memory the analyses need, how long the calculation will take, etc. In short, the major pipeline first determines the conditions (settings) for processing the data in that data set. These settings are put in the header section of the job script of the minor pipeline. When the job is submitted, the job script and the minor pipeline in it will run on the computer most suitable for processing and analyzing these data. The same is done for data sets 2 and 3: for these data sets, separate minor pipelines are generated with their own settings.
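A minimal sketch of this scanning step, extending the minimal job wrapper sketched earlier with size-based settings (the 10 GB threshold, the SLURM directives and the script names are illustrative assumptions):

#!/bin/bash
# Hypothetical sketch: derive job settings from the data set size
# and write them into the header section of the job script.
dataset=$1                              # path to the data set to scan
size=$(du -sb "$dataset" | cut -f1)     # total size in bytes
if [ "$size" -gt 10000000000 ]; then    # larger than ~10 GB: big job
    nodes=4; mem="64G"; time="48:00:00"
else                                    # smaller data set: modest job
    nodes=1; mem="4G"; time="01:00:00"
fi
{
    echo '#!/bin/bash'
    echo "#SBATCH --nodes=$nodes"
    echo "#SBATCH --mem=$mem"
    echo "#SBATCH --time=$time"
    echo "bash MinorPipeline.sh \"$dataset\""
} > "Job_$(basename "$dataset").sh"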
After the major pipeline has generated all the job scripts and minor pipelines, it creates a submit job script. In this submit job script, each minor pipeline is written as a submission command for its job script, which puts it into a queue on the grid cluster or supercomputer. After that, the major pipeline has finished its work.
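For the three example data sets, the generated submit job script could then look like this (the job script names and the sbatch command are assumptions, as above):

#!/bin/bash
# Hypothetical contents of SubmitJobs.sh for the three example data sets.
sbatch Job_dataset1.sh
sbatch Job_dataset2.sh
sbatch Job_dataset3.sh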
All the minor pipelines can now be submitted and started with this submit job script. This can be done by typing:
bash SubmitJobs.sh
All the jobs are then submitted and put into the queue. When a computer becomes free, the job whose settings best match that computer is taken from the queue, and its minor pipeline processes its data set on that computer.
After all the minor pipelines have done their jobs, all the data sets have been analyzed and processed. Usually, the results of the separate minor pipelines are then joined together; a single pipeline could be programmed for this.
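As a small sketch, assuming each minor pipeline writes its output to a results/ directory (a hypothetical layout), such a joining pipeline could be as simple as:

#!/bin/bash
# Hypothetical joining step: merge the output files of all minor pipelines
# into one combined result file.
cat results/result_dataset*.txt > results/all_results.txt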