Training Overview

PyTorch Support

Users can also use FlexFlow Train to optimize the parallelization performance of existing PyTorch models in two steps. First, a PyTorch model can be exported to the FlexFlow model format using flexflow.torch.fx.torch_to_flexflow.

import torch
import flexflow.torch.fx as fx

model = MyPyTorchModule()
fx.torch_to_flexflow(model, "mymodel.ff")

Second, a FlexFlow Train program can directly import a previously saved PyTorch model and autotune the parallelization performance for a given parallel machine.

from flexflow.pytorch.model import PyTorchModel

def top_level_task():
  torch_model = PyTorchModel("mymodel.ff")
  output_tensor = torch_model.apply(ffmodel, input_tensor)
  ## Model compilation
  ffmodel.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
  ## Model training
  (x_train, y_train) = cifar10.load_data(), y_train, epochs=30)

More FlexFlow PyTorch examples: see the pytorch examples folder.

TensorFlow Keras and ONNX Support

FlexFlow Train prioritizes PyTorch compatibility, but also includes frontends for Tensorflow Keras and ONNX models.

C++ Interface

For users that prefer to program in C/C++. FlexFlow Train supports a C++ program inference that is equivalent to its Python APIs.

More FlexFlow C++ examples: see the C++ examples folder.

Command-Line Flags

In addition to setting runtime configurations in a FlexFlow Train Python/C++ program, the FlexFlow Train runtime also accepts command-line arguments for various runtime parameters:

FlexFlow training flags:

  • -e or –epochs: number of total epochs to run (default: 1)

  • -b or –batch-size: global batch size in each iteration (default: 64)

  • -p or –print-freq: print frequency (default: 10)

  • -d or --dataset: path to the training dataset. If not set, synthetic data is used to conduct training.

Legion runtime flags:

  • -ll:gpu: number of GPU processors to use on each node (default: 0)

  • -ll:fsize: size of device memory on each GPU (in MB)

  • -ll:zsize: size of zero-copy memory (pinned DRAM with direct GPU access) on each node (in MB). This is used for prefecthing training images from disk.

  • -ll:cpu: number of data loading workers (default: 4)

  • -ll:util: number of utility threads to create per process (default: 1)

  • -ll:bgwork: number of background worker threads to create per process (default: 1)

Performance auto-tuning flags:

  • --search-budget or –budget: the number of iterations for the MCMC search (default: 0)

  • --search-alpha or –alpha: a hyper-parameter for the search procedure (default: 0.05)

  • --export-strategy or –export: path to export the best discovered strategy (default: None)

  • --import-strategy or –import: path to import a previous saved strategy (default: None)

  • --enable-parameter-parallel: allow FlexFlow Train to explore parameter parallelism for performance auto-tuning. (By default FlexFlow Train only considers data and model parallelism.)

  • --enable-attribute-parallel: allow FlexFlow Train to explore attribute parallelism for performance auto-tuning. (By default FlexFlow Train only considers data and model parallelism.) For performance tuning related flags: see performance autotuning.