Description

Context

Data structures can be manipulated in two main ways: 1) by applying the changes to a copy of the data structure, leaving the original object intact, or 2) by modifying the existing data structure directly (also known as in-place).
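
As a minimal illustration, Python's built-in lists expose both styles: sorted() follows the first approach and returns a sorted copy, while list.sort() follows the second and sorts in place, returning None.

values = [3, 1, 2]
result = sorted(values)  # approach 1: returns a new sorted list; values is unchanged
values.sort()            # approach 2: sorts values in place and returns None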

Problem

Some methods operate in-place by default, while others return a copy. If developers assume an in-place approach, they will not assign the returned value to any variable. The operation will be executed, but it will not affect the final outcome. For example, when using the Pandas library, a developer may not assign the result of df.dropna() to a variable, assuming the API changes the original DataFrame, while also not setting the inplace parameter to True. The original DataFrame is then never updated. In the "TensorFlow Bugs" replication package, we also found an example where the developer thought np.clip() was an in-place operation and called it without assigning the result to a variable.
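
A minimal sketch of the Pandas case described above (the column values are illustrative):

import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0]})
df.dropna()       # bug: returns a new DataFrame by default; the result is discarded
df = df.dropna()  # fix: assign the result (or set inplace=True instead)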

Solution

We suggest that developers check whether the result of the operation is assigned to a variable or whether the API's inplace parameter is set. Some developers hold the view that in-place operations save memory. In the Pandas library this is a misconception, because a copy of the data is still created internally. In PyTorch, in-place operations do save GPU memory, but they risk overwriting the values needed to compute the gradient.
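
As a minimal sketch of the PyTorch risk (the tensor values are illustrative): torch.sigmoid() saves its output for the backward pass, so overwriting that output in place makes the gradient impossible to compute.

import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = torch.sigmoid(x)  # autograd saves y to compute sigmoid's gradient
y.add_(1.0)           # in-place update avoids allocating a new tensor, but overwrites y
y.sum().backward()    # RuntimeError: a variable needed for gradient computation
                      # has been modified by an inplace operation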

Type

Generic

Existing Stage

Data Cleaning

Effect

Error-prone

Example

### Pandas
import pandas as pd
df = pd.DataFrame([-1])
- df.abs()
+ df = df.abs()

### NumPy
import numpy as np
zhats = [2, 3, 1, 0]
- np.clip(zhats, -1, 1)
+ zhats = np.clip(zhats, -1, 1)
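
Note that np.clip() can also operate truly in-place via its out parameter, e.g. np.clip(zhats, -1, 1, out=zhats), but only when zhats is already a NumPy array rather than a plain list; the explicit assignment above works in both cases.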

Source:

Paper

  • M. P. A. Haakman. 2020. Studying the Machine Learning Lifecycle and Improving Code Quality of Machine Learning Applications.
