DeepRec hangs in distributed mode. #125

@silingtong123

Current behavior

In distributed mode, DeepRec trains without problems on one hour of data, but hangs when training on one day of data or more. Log:
6ca9fe77321c27383b3b3de9bb8fc5d5
nvidia-smi:
a3ee237e24abfd35d1c087126b6331f8
CPU:
071c9938c994a484295fdc3ef25b483d

Expected behavior

DeepRec should train in distributed mode without hanging. Log:
315532d0f8197d279e990d49332c85b3

System information

  • GPU model and memory: two Tesla T4 devices; memory: 15109 MiB
  • OS Platform: x86_64 GNU/Linux
  • Docker version: Docker version 20.10.8, build 3967b7d
  • GCC/CUDA/cuDNN version: CUDA 11.4 / cuDNN 8
  • Python/conda version: Python 3.6
  • TensorFlow/PyTorch version: DeepRec deeprec2302, HybridBackend a832b4e

Code to reproduce

    import tensorflow as tf

    sess_config = tf.ConfigProto(
        # If the requested device doesn't exist, let TF assign a device automatically
        allow_soft_placement=True,
        log_device_placement=False,  # Don't print the device assignment log
    )
    sess_config.gpu_options.force_gpu_compatible = True
    sess_config.gpu_options.allow_growth = True  # Allocate GPU memory on demand

    # self.__ckpt_dir is an attribute of the enclosing class (not shown)
    with tf.train.MonitoredTrainingSession(master="", checkpoint_dir=self.__ckpt_dir, config=sess_config):
        ...  # training loop not included in this report
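
To narrow down where the process is stuck, one option is a watchdog that periodically dumps the Python stack of every thread while training runs. The following is a minimal sketch using only the standard-library `faulthandler` module. The helper names `arm_hang_watchdog` and `disarm_hang_watchdog` are hypothetical and not part of DeepRec or HybridBackend, and the timeout value is an assumption to tune per job.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s, log_file=None):
    """Dump all thread stacks every `timeout_s` seconds until disarmed.

    `log_file` must be an open file with a real file descriptor and must
    stay open while the watchdog is armed. Defaults to sys.stderr.
    """
    target = log_file if log_file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def disarm_hang_watchdog():
    """Cancel the pending periodic dump, e.g. once training has finished."""
    faulthandler.cancel_dump_traceback_later()
```

A possible usage is to call `arm_hang_watchdog(600)` just before entering the `MonitoredTrainingSession` block and `disarm_hang_watchdog()` after leaving it. The dump only shows Python frames, so if the hang is inside a native op it will likely show just the `run` call that never returned. That still identifies which worker and which step is blocked.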

Willing to contribute

Yes
