This project covers the design and development of a quantitative model and a simulator (see the repository's wiki for details). This repository contains the description and the code of the simulator.
The research focuses on studying and modeling the load on a supercomputer (specifically, the Titan supercomputer). The load on a resource is defined as the number of busy service nodes at a given time; it is determined by the number and parameters of the running computing jobs:
- real execution time per job (a job initially requests the maximum time it may need to execute, i.e., the wall time)
- number of required nodes per job
- job generation rate
An execution strategy is defined as the set of values of these parameters that uniquely defines the group of jobs to be executed.
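The definition of load as the number of busy nodes at a given time can be illustrated with a small, self-contained sketch. It is not part of QSS: the `(start_time, end_time, num_nodes)` job tuples and the helper names `load_at` and `load_profile` are assumptions made for this example only.

```python
from typing import Iterable, List, Tuple

# Each job is (start_time, end_time, num_nodes); illustrative layout only.
Job = Tuple[float, float, int]


def load_at(jobs: Iterable[Job], t: float) -> int:
    """Number of busy nodes at time t (a job holds its nodes on [start, end))."""
    return sum(nodes for start, end, nodes in jobs if start <= t < end)


def load_profile(jobs: Iterable[Job]) -> List[Tuple[float, int]]:
    """Piecewise-constant load as (time, busy_nodes) points at every change."""
    events = []
    for start, end, nodes in jobs:
        events.append((start, nodes))    # nodes become busy
        events.append((end, -nodes))     # nodes are released
    profile: List[Tuple[float, int]] = []
    busy = 0
    for time, delta in sorted(events):
        busy += delta
        if profile and profile[-1][0] == time:
            profile[-1] = (time, busy)   # merge simultaneous events
        else:
            profile.append((time, busy))
    return profile


# Made-up jobs, used only to exercise the helpers.
jobs = [(0.0, 4.0, 100), (1.0, 3.0, 50), (3.0, 6.0, 200)]
print(load_at(jobs, 2.0))    # 150: the first two jobs overlap at t = 2
print(load_profile(jobs))
```

Longer real execution times, larger node requests, or a higher job generation rate each raise the resulting profile, which is why these three parameters determine the load.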
The designed analysis and modeling tool simulates the load on a supercomputer and produces job traces for a given workload. It is characterized by the following features:
- based on queueing theory
- the arrival process is represented by streams that are responsible for job generation; it is described either by a Poisson process or by a deterministic model
- the service (server) process is represented by a set of nodes that simulate job execution; it is likewise described either by a Poisson process or by a deterministic model
- the number of servers/nodes corresponds to the number of computing nodes (in terms of the Titan supercomputer)
- the capacity of the queue (or of the system overall) can be bounded by a queue limit, set either per stream or for the total number of jobs in the queue; the bound applies only when the queue buffer is not used, otherwise the capacity is unlimited
- the queueing discipline is available in two options: FIFO or Priority
- a scheduler can optionally be used to start "small" jobs earlier (i.e., backfill mode in terms of the Titan supercomputer)
- provides the explicit job state model (state transitions)
  - generated
  - holding (Titan notation: blocked) in the buffer
  - pending (Titan notation: eligible-to-run) in the queue
  - starting
  - executing
  - finished
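As a rough illustration of the job state model, the states can be written as an enumeration with a transition check. This is a sketch rather than the QSS implementation: the names `JobState`, `next_state` and `is_valid_transition` are hypothetical, and the strictly linear order (no skipped states) is an assumption.

```python
from enum import Enum
from typing import Optional


class JobState(Enum):
    GENERATED = "generated"
    HOLDING = "holding"      # Titan notation: blocked (in the buffer)
    PENDING = "pending"      # Titan notation: eligible-to-run (in the queue)
    STARTING = "starting"
    EXECUTING = "executing"
    FINISHED = "finished"


# Enum members iterate in definition order, which here is the lifecycle order.
ORDER = list(JobState)


def next_state(state: JobState) -> Optional[JobState]:
    """Return the state that follows `state`, or None for the final state."""
    idx = ORDER.index(state)
    return ORDER[idx + 1] if idx + 1 < len(ORDER) else None


def is_valid_transition(current: JobState, target: JobState) -> bool:
    """Assumption of this sketch: a job may only move to the very next state."""
    return next_state(current) is target


print(is_valid_transition(JobState.PENDING, JobState.STARTING))
print(is_valid_transition(JobState.GENERATED, JobState.EXECUTING))
```

Whether QSS lets a job bypass the holding state when the queue buffer is disabled is not stated here, so the sketch does not model that case.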
The simulator is named Queueing System Simulator (QSS).

    from qss import QSS
    qss_obj = QSS(num_nodes=18688)

The required argument `num_nodes` defines the number of service/computing nodes.
Optional arguments:
- `queue_limit` defines the total limit of the queue (default value is `None`)
- `use_queue_buffer` is a flag to use the queue buffer (default value is `False`)
- `use_scheduler` is a flag to use the scheduler, i.e., backfill mode (default value is `False`)
- `time_limit` provides a timestamp at which processing should stop; if it is not defined, processing continues while the job generators (streams) produce new jobs (default value is `None`)
- `output_file` is the name of a file that stores a record per job once its execution is done; each record includes arrival_timestamp, start_execution_timestamp, end_execution_timestamp, num_nodes, source, label (default value is `None`)
- `trace_file` is the name of a file that stores real-time information about the system state (current_time, num_jobs_in_buffer, num_jobs_in_queue, num_jobs_executing, current_action); the same information can also be printed to the screen if the `verbose` argument is set for the method `run()` (default value is `None`)
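The per-job record fields listed for `output_file` are enough to derive simple metrics such as the mean waiting time. The sketch below assumes, without confirmation from the source, that records are comma-separated with no header line; the sample data is made up and `read_records` is a hypothetical helper.

```python
import csv
import io
from statistics import mean

# Field order as listed for output_file; the delimiter is an assumption.
FIELDS = ["arrival_timestamp", "start_execution_timestamp",
          "end_execution_timestamp", "num_nodes", "source", "label"]

# Made-up sample records standing in for a real output file.
SAMPLE = """\
0.0,0.0,3.0,100,main,a
1.0,2.0,5.0,50,main,b
2.0,5.0,6.0,200,external,c
"""


def read_records(handle):
    """Parse job records from a file-like object into typed dictionaries."""
    records = []
    for row in csv.DictReader(handle, fieldnames=FIELDS):
        records.append({
            "arrival": float(row["arrival_timestamp"]),
            "start": float(row["start_execution_timestamp"]),
            "end": float(row["end_execution_timestamp"]),
            "num_nodes": int(row["num_nodes"]),
            "source": row["source"],
            "label": row["label"],
        })
    return records


records = read_records(io.StringIO(SAMPLE))
# Waiting time: from arrival until execution starts.
mean_wait = mean(r["start"] - r["arrival"] for r in records)
# Consumed node-time: execution duration multiplied by occupied nodes.
node_time = sum((r["end"] - r["start"]) * r["num_nodes"] for r in records)
print(mean_wait, node_time)
```

For a real run, `io.StringIO(SAMPLE)` would be replaced by an opened file, after checking the actual delimiter and field order in the file that QSS writes.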
Job generation is implemented with Python generator functions, which appear as "stream" functions inside the QSS package. Several such functions are predefined (see module qss/stream.py), but the user is free to use one of them or to write a custom one.

    from qss import stream_generator, stream_generator_by_file

    stream_1 = stream_generator(arrival_rate=11./36,
                                execution_rate=1./3,
                                num_nodes=100,
                                source='main',
                                num_jobs=None,
                                time_limit=1000.)
    stream_2 = stream_generator_by_file(file_name='qss_input.txt',
                                        source='external',
                                        time_limit=1000.)

Note the `num_jobs` and `time_limit` arguments of the predefined stream functions: they restrict either the number of generated jobs or the processing time. One of the two should be set.
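To show how such a stream might look, here is a minimal sketch of a generator with Poisson arrivals that honors `num_jobs` and `time_limit`. It is not the code from qss/stream.py: the function name `poisson_stream`, the `seed` argument, and the dictionary it yields are all assumptions, since the object a QSS stream must actually yield is not described here. It is therefore not a drop-in replacement for `stream_generator`.

```python
import random
from typing import Dict, Iterator, Optional


def poisson_stream(arrival_rate: float,
                   execution_rate: float,
                   num_nodes: int,
                   source: str,
                   num_jobs: Optional[int] = None,
                   time_limit: Optional[float] = None,
                   seed: Optional[int] = None) -> Iterator[Dict]:
    """Yield jobs with exponential inter-arrival and execution times."""
    # This check runs on the first next() call: generator bodies are lazy.
    if num_jobs is None and time_limit is None:
        raise ValueError("set num_jobs or time_limit, or the stream never ends")
    rng = random.Random(seed)  # seed: hypothetical extra for repeatability
    current_time, produced = 0.0, 0
    while num_jobs is None or produced < num_jobs:
        # Poisson arrivals mean exponentially distributed gaps between jobs.
        current_time += rng.expovariate(arrival_rate)
        if time_limit is not None and current_time > time_limit:
            break
        yield {
            "arrival_timestamp": current_time,
            "execution_time": rng.expovariate(execution_rate),
            "num_nodes": num_nodes,
            "source": source,
        }
        produced += 1


jobs = list(poisson_stream(11. / 36, 1. / 3, 100, "main", num_jobs=5, seed=1))
print(len(jobs), [round(j["arrival_timestamp"], 2) for j in jobs])
```

Without either restriction the loop would never terminate, which is why the sketch refuses to run when both are `None`.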
The simulation is then run with the created streams, and the statistics are printed:

    qss_obj.run(streams=[stream_1, stream_2],
                verbose=True)
    qss_obj.print_stats()