ParallelDevicePerformanceProfile

class ParallelDevicePerformanceProfile(configuration, processes_and_threads, equilibrium_methods=None, non_equilibrium_methods=None)

Class for performing timing and memory profiles of the different methods available for calculating the Green’s function and lesser Green’s function.

Parameters:
  • configuration (DeviceConfiguration) – The device configuration with an attached calculator to profile.
  • processes_and_threads (list of tuples of int) – The combinations of number of processes and number of threads per process to profile, given as a list of (processes, threads) tuples. E.g., [(1, 1), (2, 4)] will run one calculation with 1 MPI process and 1 thread per process, and one calculation with 2 MPI processes and 4 threads per process.
  • equilibrium_methods (GreensFunction | SparseGreensFunction | sequence of (GreensFunction | SparseGreensFunction)) – The methods benchmarked for the equilibrium calculation. If no methods should be benchmarked for the equilibrium calculation, an empty list can be specified.
    Default: (GreensFunction, SparseGreensFunction)
  • non_equilibrium_methods (GreensFunction | SparseGreensFunction | sequence of (GreensFunction | SparseGreensFunction)) – The methods benchmarked for the non-equilibrium calculation. If no methods should be benchmarked for the non-equilibrium calculation, an empty list can be specified.
    Default: (GreensFunction, SparseGreensFunction)
equilibriumMethods()
Returns: The equilibrium methods profiled.
Return type: tuple of (GreensFunction | SparseGreensFunction)
generateScript(temporary_filename)

Generate a script to run a DevicePerformanceProfile.

nlprint(stream=sys.stdout)

Print out the profiling report.

Parameters: stream (file) – The stream to write to. This stream must support strings being written to it using ‘write’.
Default: sys.stdout
nonEquilibriumMethods()
Returns: The non-equilibrium methods profiled.
Return type: tuple of (GreensFunction | SparseGreensFunction)
processesAndThreads()
Returns: The list of process and thread configurations profiled.
Return type: tuple of (int, int)
runAsSubProcess()

Run a series of DevicePerformanceProfile calculations, one for each combination of processes and threads provided.

Notes

This object extends the functionality of DevicePerformanceProfile by running several profiles for different parallel settings, defined by the input parameter processes_and_threads. The aim is to help the user determine the best combination of parallelization strategy and Green’s function algorithm for the simulation of a given device, without the need to run DevicePerformanceProfile several times.

The input parameters are the same as in DevicePerformanceProfile, except for processes_and_threads. This parameter determines over how many processes the calculation of each contour point is parallelized (mimicking the usage of processes_per_contour_point), and how many threads each process uses.

For each method and each processes and threads configuration the elapsed time and memory usage are reported.

Note

The profiler has to be launched in serial mode (e.g. without using mpiexec, or using mpiexec -n 1). The parallel runs are spawned by the profiler itself. In order to obtain reliable results, there should be enough resources available for the spawned processes, i.e., the number of available physical cores should be at least equal to the largest product of processes and threads per process. For example, if 2 processes and 2 threads are defined in processes_and_threads, then 4 physical cores should be available. If fewer cores are available, the results may be unreliable because the cores will be oversubscribed. Note that some clusters do not support process spawning. On such systems ParallelDevicePerformanceProfile will not work.
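The core requirement described in the note can be checked before launching the profiler. The following is a minimal sketch in plain Python; the helper name required_cores is hypothetical and not part of the QuantumATK API, and it assumes that each configuration occupies processes times threads-per-process cores. Note that os.cpu_count() reports logical CPUs, which can be larger than the number of physical cores when hyper-threading is enabled, so the check is only indicative.

```python
import os


def required_cores(processes_and_threads):
    """Return the largest number of cores any (processes, threads) entry needs.

    Hypothetical helper: assumes each entry occupies processes * threads cores.
    """
    return max(processes * threads for processes, threads in processes_and_threads)


processes_and_threads = [(1, 1), (1, 2), (2, 2)]
needed = required_cores(processes_and_threads)

# os.cpu_count() counts logical CPUs and may return None if undetermined.
available = os.cpu_count()
if available is not None and available < needed:
    print(f"Warning: {needed} cores needed, but only {available} CPUs detected.")
else:
    print(f"Largest configuration needs {needed} cores.")
```
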

Usage Example

The following script will read a configuration from file and run a profile for 3 different combinations of processes and threads:

(1, 1): A single process per contour point and a single thread.

(1, 2): A single process per contour point and two threads per process.

(2, 1): Two processes per contour point and a single thread per process.

device_configuration = nlread('device.hdf5', DeviceConfiguration)[-1]

profile = ParallelDevicePerformanceProfile(
    device_configuration,
    processes_and_threads=[(1, 1), (1, 2), (2, 1)])

nlprint(profile)

Here is an example of the output you might get, divided into sections. First comes a header listing the available profiles, one for each entry of processes_and_threads:

+------------------------------------------------------------------------------+
| Parallel Device Performance Profile                                          |
|                                                                              |
| 3 available device performance profiles:                                     |
|        1 process          1 thread                                           |
|        1 process          2 threads                                          |
|        2 processes        1 thread                                           |
+------------------------------------------------------------------------------+

This is followed by a detailed report of memory and timing for each profile, with the same structure as the output of DevicePerformanceProfile.

+------------------------------------------------------------------------------+
| Device Performance Profile (1)                                               |
|   1 process       1 thread                                                   |
+------------------------------------------------------------------------------+
| Contour point timing (s):                                                    |
|                                   EQ        NEQ                              |
| GreensFunction                 12.46      18.41                              |
| SparseGreensFunction           45.76      32.15                              |
|                                                                              |
| Fastest EQ method (by 3.7 times): GreensFunction                             |
| Fastest NEQ method (by 1.7 times): GreensFunction                            |
+------------------------------------------------------------------------------+
| Peak memory usage/process (MB):                                              |
|                                   EQ        NEQ                              |
| GreensFunction               1901.38    3966.34                              |
| SparseGreensFunction         3536.50    2596.16                              |
|                                                                              |
| Most memory-efficient EQ method (by 1.9 times): GreensFunction               |
| Most memory-efficient NEQ method (by 1.5 times): SparseGreensFunction        |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
| Device Performance Profile (2)                                               |
|   1 process       2 threads                                                  |
+------------------------------------------------------------------------------+
| Contour point timing (s):                                                    |
|                                   EQ        NEQ                              |
| GreensFunction                  8.24      13.19                              |
| SparseGreensFunction           39.78      30.08                              |
|                                                                              |
| Fastest EQ method (by 4.8 times): GreensFunction                             |
| Fastest NEQ method (by 2.3 times): GreensFunction                            |
+------------------------------------------------------------------------------+
| Peak memory usage/process (MB):                                              |
|                                   EQ        NEQ                              |
| GreensFunction               1887.05    3985.32                              |
| SparseGreensFunction         3525.32    2629.77                              |
|                                                                              |
| Most memory-efficient EQ method (by 1.9 times): GreensFunction               |
| Most memory-efficient NEQ method (by 1.5 times): SparseGreensFunction        |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
| Device Performance Profile (3)                                               |
|   2 processes     1 thread                                                   |
+------------------------------------------------------------------------------+
| Contour point timing (s):                                                    |
|                                   EQ        NEQ                              |
| GreensFunction                  9.11      21.22                              |
| SparseGreensFunction           39.42      34.22                              |
|                                                                              |
| Fastest EQ method (by 4.3 times): GreensFunction                             |
| Fastest NEQ method (by 1.6 times): GreensFunction                            |
+------------------------------------------------------------------------------+
| Peak memory usage/process (MB):                                              |
|                                   EQ        NEQ                              |
| GreensFunction               1450.88    3919.13                              |
| SparseGreensFunction         2224.71    1982.50                              |
|                                                                              |
| Most memory-efficient EQ method (by 1.5 times): GreensFunction               |
| Most memory-efficient NEQ method (by 2.0 times): SparseGreensFunction        |
+------------------------------------------------------------------------------+

The report ends with a summary of time and memory consumption for the different process and thread configurations.

+------------------------------------------------------------------------------+
| Summary Report                                                               |
|                                                                              |
| The best and worst case scenario for resource occupation (time and peak      |
| memory) is reported for each method. The resource occupation is normalized   |
| to the number of physical cores utilized, here referred to as Processing     |
| Units (PU).                                                                  |
+------------------------------------------------------------------------------+
| Equilibrium GreensFunction:                                                  |
|                                    Time*PU (s)     Memory/PU (MB)            |
|  1 process    1 thread  (1 PU)    12.46 (best)    1901.38 (worst)            |
|  1 process    2 threads (2 PU)    16.47            943.52 (best)             |
|  2 processes  1 thread  (2 PU)    18.23 (worst)   1450.88                    |
+------------------------------------------------------------------------------+
| Equilibrium SparseGreensFunction:                                            |
|                                    Time*PU (s)     Memory/PU (MB)            |
|  1 process    1 thread  (1 PU)    45.76 (best)    3536.50 (worst)            |
|  1 process    2 threads (2 PU)    79.57 (worst)   1762.66 (best)             |
|  2 processes  1 thread  (2 PU)    78.83           2224.71                    |
+------------------------------------------------------------------------------+
| Non-equilibrium GreensFunction:                                              |
|                                    Time*PU (s)     Memory/PU (MB)            |
|  1 process    1 thread  (1 PU)    18.41 (best)    3966.34 (worst)            |
|  1 process    2 threads (2 PU)    26.38           1992.66 (best)             |
|  2 processes  1 thread  (2 PU)    42.44 (worst)   3919.13                    |
+------------------------------------------------------------------------------+
| Non-equilibrium SparseGreensFunction:                                        |
|                                    Time*PU (s)     Memory/PU (MB)            |
|  1 process    1 thread  (1 PU)    32.15 (best)    2596.16 (worst)            |
|  1 process    2 threads (2 PU)    60.15           1314.88 (best)             |
|  2 processes  1 thread  (2 PU)    68.44 (worst)   1982.50                    |
+------------------------------------------------------------------------------+

In the summary both the memory and the time are normalized to the number of processing units (PU) utilized. The number of PUs is defined as the product of the number of processes and the number of threads per process, and it can be interpreted simply as the number of physical cores utilized.

The left column reports the total CPU time, defined as the wallclock time multiplied by the number of PUs utilized. The right column reports the peak memory per PU.
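The normalization can be reproduced by hand from the per-profile reports. The sketch below is plain Python with a hypothetical helper name, normalize_to_pu. It assumes that the number of PUs is processes times threads per process, which matches the PU counts printed in the summary (for example, 1 process with 2 threads is listed as 2 PU), and that Memory/PU is the peak memory per process divided by the number of threads per process, which is consistent with the sample numbers but is not stated explicitly in the report.

```python
def normalize_to_pu(processes, threads, wallclock_s, peak_memory_per_process_mb):
    """Return (PU count, Time*PU in s, Memory/PU in MB) for one configuration.

    Hypothetical helper; the formulas are inferred from the sample report.
    """
    pu = processes * threads
    time_pu = wallclock_s * pu
    # Each process uses `threads` cores, so its memory is shared among them.
    memory_per_pu = peak_memory_per_process_mb / threads
    return pu, time_pu, memory_per_pu


# Equilibrium GreensFunction values taken from profiles (2) and (3) above.
print(normalize_to_pu(1, 2, 8.24, 1887.05))
print(normalize_to_pu(2, 1, 9.11, 1450.88))
```

The results can differ from the printed summary in the last digit, since the report presumably normalizes unrounded values.
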

Interpreting the Summary Report

The actual wallclock time and memory consumption of a simulation depend on how many contour points can be calculated simultaneously, so it is not possible to give a general estimate upfront. The user should be able to interpret the quantities reported in the summary, especially when comparing different numbers of PUs. As an example, consider an equilibrium calculation with the GreensFunction method. From the summary we read

+------------------------------------------------------------------------------+
| Equilibrium GreensFunction:                                                  |
|                                    Time*PU (s)     Memory/PU (MB)            |
|  1 process    1 thread  (1 PU)    12.46 (best)    1901.38 (worst)            |
|  1 process    2 threads (2 PU)    16.47            943.52 (best)             |
|  2 processes  1 thread  (2 PU)    18.23 (worst)   1450.88                    |
+------------------------------------------------------------------------------+

Assume that 48 contour points will be calculated, and that the calculation will run on 48 physical cores. In this case a contour integration will take roughly the time indicated in the Time*PU field (i.e. 12.46 s for 1 process and 1 thread, 16.47 s for 1 process and 2 threads, etc.). The total memory usage is obtained by multiplying the quantities in the memory column by 48, because all PUs will be engaged in the calculation. In this case the best and worst labels for time and memory are accurate.

But what if, for example, 96 physical cores are available? In this case the calculation launched with 1 process and 1 thread per contour point will leave some resources idle, because it will utilize at most one physical core per contour point. The wallclock time will still be approximately 12.46 s, with no speedup, and similarly the total memory will still be (1901*48) MB.

Both configurations with 2 PUs will be able to run all contour points simultaneously, thus utilizing twice the number of physical cores compared to the case with 1 PU. Therefore the wallclock time will be approximately half the time in the Time*PU column (i.e., 8.2 s for 1 process, 2 threads and 9.1 s for 2 processes, 1 thread). The total memory consumption will be (943*96) MB for the 1 process, 2 threads case, and (1451*96) MB for the 2 processes, 1 thread case. In this case, using 2 physical cores per contour point is advantageous.
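The reasoning above can be condensed into a rough estimator. The following plain-Python sketch uses a hypothetical function, estimate_run, that is not part of QuantumATK. It assumes an idealized model in which contour points are processed in equal batches, each contour point occupies exactly its PUs, and there is no communication overhead, so the numbers are indicative only.

```python
import math


def estimate_run(time_pu_s, memory_per_pu_mb, pu, contour_points, cores):
    """Estimate (wallclock time in s, total memory in MB) for one contour integration.

    Hypothetical, idealized model based on the Time*PU and Memory/PU columns.
    """
    # Number of contour points that fit on the machine at the same time.
    simultaneous = min(contour_points, cores // pu)
    if simultaneous < 1:
        raise ValueError("Not enough cores for a single contour point.")
    batches = math.ceil(contour_points / simultaneous)
    # One contour point takes Time*PU / PU seconds of wallclock time.
    wallclock_s = batches * time_pu_s / pu
    total_memory_mb = memory_per_pu_mb * pu * simultaneous
    return wallclock_s, total_memory_mb


# Equilibrium GreensFunction, 48 contour points, values from the summary.
for cores in (48, 96):
    print(cores, "cores, 1 process, 1 thread: ", estimate_run(12.46, 1901.38, 1, 48, cores))
    print(cores, "cores, 1 process, 2 threads:", estimate_run(16.47, 943.52, 2, 48, cores))
```

With 48 cores the 2 PU configuration needs two batches of 24 contour points, which is why its wallclock time equals the full Time*PU value; with 96 cores a single batch suffices and the time is halved.
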