Description

Context

Loops are typically time-consuming and verbose, while developers can usually use some vectorized solutions to replace the loops.

Problem

As stated in the Pandas documentation: “Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed and can be avoided”. In EffectiveTensorflow github repository, it is also stated that the slicing operation with loops in TensorFlow is slow, and there is a substitute for better performance.

Solution

Machine learning applications are typically data-intensive, requiring operations on data sets rather than an individual value. Therefore, it is better to adopt a vectorized solution instead of iterating over data. In this way, the program runs faster and code complexity is reduced, resulting in more efficient and less error-prone code. Pandas’ built-in methods (e.g., join, groupby) are vectorized. It is therefore recommended to use Pandas built-in methods as an alternative to loops. In TensorFlow, using the tf.reduce_sum() API to perform reduction operation is much faster than combining slicing operation and loops.

Type

Generic

Existing Stage

Data Cleaning

Effect

Efficiency

Example

### Pandas
import pandas as pd
df = pd.DataFrame([1, 2, 3])
- result = []
- for index, row in df.iterrows():
- 	result.append(row[0] + 1)
- result = pd.DataFrame(result)
+ result = df.add(1)

### TensorFlow 2
import tensorflow as tf
x = tf.random.uniform([500, 10])
- z = tf.zeros([10])
- for i in range(500):
-    z += x[i]
+ z = tf.reduce_sum(x, axis=0)

Source:

Paper

  • MPA Haakman. 2020. Studying the Machine Learning Lifecycle and Improving Code Quality of Machine Learning Application

Grey Literature

GitHub Commit

Stack Overflow

Documentation