Description

Context

Random seeds come into play in several scenarios. In some algorithms, randomness is inherent in the training process. In cross-validation during the model evaluation stage, the dataset splits produced by some library APIs can vary depending on the random seed.
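
For instance, a minimal sketch with Scikit-Learn's train_test_split (the toy data is an illustrative assumption, not part of the original text) shows how an unseeded split can change between calls:

from sklearn.model_selection import train_test_split

X = list(range(10))

# Without random_state, the rows assigned to train and test
# can differ from run to run.
X_train1, X_test1 = train_test_split(X)
X_train2, X_test2 = train_test_split(X)
print(X_test1, X_test2)  # often different

# With random_state, the split is identical on every run.
X_train3, X_test3 = train_test_split(X, random_state=0)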

Problem

If the random seed is not set, the results will not be reproducible, which increases the debugging effort and makes it difficult to replicate a study that builds on earlier work. For example, in Scikit-Learn, if the random seed is not set, the random forest algorithm may produce a different result every time it runs, and the dataset splits produced by a cross-validation splitter will also differ between runs.
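
To illustrate the random forest case, here is a minimal sketch on synthetic data (make_classification and the feature-importance comparison are illustrative choices, not from the original):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Unseeded: bootstrap sampling and feature subsampling draw from the
# global RNG, so two identical fits can disagree.
a = RandomForestClassifier().fit(X, y)
b = RandomForestClassifier().fit(X, y)
print(all(a.feature_importances_ == b.feature_importances_))  # often False

# Seeded: the same seed yields the same forest every time.
c = RandomForestClassifier(random_state=0).fit(X, y)
d = RandomForestClassifier(random_state=0).fit(X, y)
print(all(c.feature_importances_ == d.feature_importances_))  # True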

Solution

It is recommended to set a global random seed first for reproducible results in Scikit-Learn, PyTorch, NumPy, and other libraries where a random seed is involved. In particular, the DataLoader in PyTorch needs to be seeded to ensure the data is split and loaded in the same way every time the code runs, as sketched below.
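
A minimal sketch of seeding a DataLoader, following the pattern in PyTorch's reproducibility notes (the TensorDataset here is a placeholder for any dataset):

import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # Re-seed the Python and NumPy RNGs in each worker process so that
    # per-worker randomness (e.g. augmentations) is reproducible too.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)  # fixes the shuffling order across runs

dataset = TensorDataset(torch.arange(10))  # placeholder dataset
loader = DataLoader(dataset, batch_size=2, shuffle=True,
                    num_workers=2, worker_init_fn=seed_worker, generator=g)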

Type

Generic

Existing Stage

Model Training & Model Evaluation

Effect

Reproducibility

Example

### Python in general
import random
+ random.seed(0)

### Tensorflow
import tensorflow as tf
+ tf.random.set_seed(0)

### PyTorch
import torch
+ torch.manual_seed(0)

### Scikit-Learn
from sklearn.model_selection import KFold
+ rng = 0
# random_state only takes effect when shuffle=True
- kf = KFold(shuffle=True)
+ kf = KFold(shuffle=True, random_state=rng)

### NumPy
import numpy as np
+ np.random.seed(0)
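
### Combined helper
The seeds above can also be set from a single entry point at program start. A minimal sketch (set_seed is a hypothetical name, not a library function):

import random
import numpy as np
import torch

def set_seed(seed=0):
    # Seed every library in use at once; extend with
    # tf.random.set_seed(seed) if TensorFlow is used.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

set_seed(0)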

Source:

Paper

Grey Literature

GitHub Commit

Stack Overflow

Documentation