📝 The popularity of machine learning has grown rapidly in recent years. Machine learning techniques have been studied intensively in academia and applied in industry to create business value. However, there is a lack of guidelines for code quality in machine learning applications. In particular, code smells have rarely been studied in this domain. Although machine learning code is usually integrated as a small part of an overarching system, it often plays an important role in the system's core functionality. Hence, ensuring code quality is essential to avoid issues in the long run.

📝 Our paper identifies and describes a list of 22 machine learning-specific code smells collected from various sources, including papers, grey literature, GitHub commits, and Stack Overflow posts. For each smell, we describe its context, the potential issues it causes in the long run, and proposed solutions. In addition, we link each smell to its respective pipeline stage and to the supporting evidence from academic and grey literature. The code smell catalog helps data scientists and developers produce and maintain high-quality machine learning application code.

😊 Here are the 22 machine learning-specific code smells described in our paper.

Unnecessary Iteration

Avoid unnecessary iterations. Use vectorized solutions instead of loops.
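A minimal sketch of the difference, using a toy DataFrame (column names are illustrative): the loop walks the rows one at a time, while the vectorized expression computes the whole column in one call.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Smell: iterating row by row to compute a derived column.
totals = []
for _, row in df.iterrows():
    totals.append(row["price"] * row["qty"])

# Preferred: a vectorized expression computes the same column at once.
df["total"] = df["price"] * df["qty"]

assert totals == df["total"].tolist()
```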

NaN Equivalence Comparison Misused

Be careful when using the NaN equivalence comparison in NumPy and Pandas.
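The pitfall is that NaN is not equal to itself, so an `==` comparison silently matches nothing; a short sketch:

```python
import numpy as np
import pandas as pd

# NaN is not equal to itself, so == comparisons silently fail.
assert not (np.nan == np.nan)

s = pd.Series([1.0, np.nan, 3.0])

# Smell: this mask is all False, even though the series contains NaN.
wrong_mask = (s == np.nan)
assert not wrong_mask.any()

# Preferred: use isna()/isnull() to detect missing values.
right_mask = s.isna()
assert right_mask.tolist() == [False, True, False]
```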

Chain Indexing

Avoid using chain indexing in Pandas.
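A sketch of the fix: chained indexing like `df["a"][mask] = value` may write to a temporary copy (and triggers SettingWithCopyWarning), whereas a single `.loc` call always operates on the frame itself.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Smell: chained indexing -- the write may land on a copy and be lost:
# df["a"][df["a"] > 1] = 0

# Preferred: one .loc call with row selector and column label together.
df.loc[df["a"] > 1, "a"] = 0
assert df["a"].tolist() == [1, 0, 0]
```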

Columns and DataType Not Explicitly Set

Explicitly select columns and set DataType in Pandas.
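A small illustration with an in-memory CSV (column names are made up): type inference silently turns a zero-padded code into an integer, while explicit `usecols`/`dtype` keep the schema under control.

```python
import io
import pandas as pd

csv_text = "id,zip,amount,comment\n1,02134,3.5,ok\n2,10001,7.0,fine\n"

# Smell: letting Pandas infer everything -- "zip" is parsed as an
# integer and silently loses its leading zero.
inferred = pd.read_csv(io.StringIO(csv_text))
assert inferred["zip"].iloc[0] == 2134

# Preferred: select the needed columns and fix their dtypes up front.
explicit = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "zip", "amount"],
    dtype={"id": "int64", "zip": "string", "amount": "float64"},
)
assert explicit["zip"].iloc[0] == "02134"
```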

Empty Column Misinitialization

When a new empty column is needed in a Pandas DataFrame, initialize it with the NaN value from NumPy instead of zeros or empty strings.
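The reason, in a short sketch: a column initialized with zeros makes missing entries indistinguishable from real zeros, and `isna()` can no longer find them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Smell: initializing with 0 (or "") hides missingness from isna().
df["bad"] = 0
assert df["bad"].isna().sum() == 0

# Preferred: initialize with np.nan so missing values stay detectable.
df["good"] = np.nan
assert df["good"].isna().all()
```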

Merge API Parameter Not Explicitly Set

Explicitly specify the on, how, and validate parameters of the df.merge() API in Pandas for better readability.
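A sketch with two toy frames (names are illustrative): the implicit call relies on defaults (inner join on all shared columns), while the explicit call states the key, the join type, and the expected cardinality, so schema changes fail loudly instead of silently.

```python
import pandas as pd

users = pd.DataFrame({"id": [1, 2], "name": ["ann", "bob"]})
orders = pd.DataFrame({"id": [1, 1, 3], "amount": [5.0, 7.0, 9.0]})

# Smell: defaults are easy to misread and break silently if columns change.
implicit = users.merge(orders)

# Preferred: spell out key, join type, and expected cardinality.
explicit = users.merge(orders, on="id", how="inner", validate="one_to_many")

assert implicit.equals(explicit)
assert len(explicit) == 2
```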

In-Place APIs Misused

Remember to assign the result of an operation to a variable or set the in-place parameter in the API.
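A minimal sketch of the mistake with dropna(): the method returns a new frame, so without an assignment the cleaned result is silently thrown away.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

# Smell: dropna() returns a new frame; without assignment nothing changes.
df.dropna()
assert len(df) == 3  # still contains the NaN row

# Preferred: assign the result (or pass inplace=True, where supported).
df = df.dropna()
assert len(df) == 2
```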

Dataframe Conversion API Misused

Use df.to_numpy() in Pandas instead of df.values to convert a DataFrame to a NumPy array.
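A short sketch: note that values is an attribute, not a method (calling df.values() raises TypeError), and the Pandas documentation recommends to_numpy() as the explicit conversion API.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Preferred: the explicit conversion API.
arr = df.to_numpy()
assert isinstance(arr, np.ndarray)
assert arr.shape == (2, 2)
```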

Matrix Multiplication API Misused

When multiplication is performed on two-dimensional matrices, use np.matmul() instead of np.dot() in NumPy for better semantics.
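A sketch of the point: for 2-D arrays the two functions agree numerically, but np.matmul() (or the @ operator) states the intent explicitly, and the two APIs diverge for scalars and higher-dimensional inputs.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(6).reshape(3, 2)

# Same result in 2-D, but matmul/@ make the matrix-product intent explicit
# (np.dot also accepts scalars and treats N-D inputs differently).
assert np.array_equal(np.matmul(a, b), np.dot(a, b))
assert np.array_equal(a @ b, np.matmul(a, b))
```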

No Scaling Before Scaling-sensitive Operation

Check whether feature scaling is added before scaling-sensitive operations.
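A NumPy sketch of why this matters (the feature values are made up; in practice a library scaler such as Scikit-Learn's StandardScaler does this step): without standardization, the large-magnitude feature dominates any distance- or variance-based operation such as k-NN, SVM, or PCA.

```python
import numpy as np

# Two features on very different scales: age (~1e1) vs. income (~1e4).
X = np.array([[25.0, 40_000.0],
              [30.0, 42_000.0],
              [60.0, 41_000.0]])

# Standardize each column to zero mean and unit variance before a
# scaling-sensitive operation.
mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / std

assert np.allclose(X_scaled.mean(axis=0), 0.0)
assert np.allclose(X_scaled.std(axis=0), 1.0)
```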

Hyperparameter not Explicitly Set

Hyperparameters should be set explicitly.

Memory not Freed

Free memory in time.

Deterministic Algorithm Option Not Used

Set the deterministic algorithm option to True during the development process, and switch to the option that provides better performance in production.

Randomness Uncontrolled

Set the random seed explicitly during the development process whenever a random procedure is involved in the application.
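A minimal NumPy sketch (the same idea applies to the random, TensorFlow, and PyTorch seed APIs): seeding the generator makes a stochastic step such as sampling, shuffling, or weight initialization reproducible across runs.

```python
import numpy as np

# Two generators seeded identically produce identical draws.
rng1 = np.random.default_rng(seed=42)
rng2 = np.random.default_rng(seed=42)

a = rng1.normal(size=5)
b = rng2.normal(size=5)
assert np.array_equal(a, b)
```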

Missing the Mask of Invalid Value

Add a mask for possibly invalid values. For example, developers should mask the input to the tf.log() API.
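A NumPy sketch of the same hazard (the TensorFlow analog would clip or mask the input before tf.log()): log of zero yields -inf, which propagates to NaN losses and gradients downstream; the epsilon value here is illustrative.

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0])

# Smell: log(0) yields -inf (and a runtime warning).
with np.errstate(divide="ignore"):
    bad = np.log(x)
assert np.isinf(bad[0])

# Preferred: clip/mask the input away from invalid values first.
eps = 1e-12  # illustrative floor
good = np.log(np.clip(x, eps, None))
assert np.isfinite(good).all()
```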

Broadcasting Feature Not Used

Use the broadcasting feature in TensorFlow 2 to be more memory efficient.
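TensorFlow follows NumPy-style broadcasting rules, so the memory argument can be sketched in NumPy: tiling materializes full-size temporary copies of both operands, while broadcasting expands them virtually with no copies.

```python
import numpy as np

row = np.arange(3)                 # shape (3,)
col = np.arange(4).reshape(4, 1)   # shape (4, 1)

# Smell: materializing tiled copies just to add two arrays.
tiled = np.tile(row, (4, 1)) + np.tile(col, (1, 3))  # two (4, 3) temporaries

# Preferred: broadcasting produces the same (4, 3) result without copies.
broadcast = row + col
assert np.array_equal(tiled, broadcast)
assert broadcast.shape == (4, 3)
```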

TensorArray Not Used

Use tf.TensorArray() in TensorFlow 2 if the value of the array will change in the loop.

Training / Evaluation Mode Improper Toggling

Call the training mode in the appropriate place in PyTorch code; do not forget to toggle back to training mode after the inference step.

PyTorch Call Method Misused

Use self.net() in PyTorch to forward the input to the network instead of self.net.forward().

Gradients Not Cleared before Backward Propagation

Use optimizer.zero_grad(), loss_fn.backward(), and optimizer.step() together, in that order, in PyTorch. Do not forget to call optimizer.zero_grad() before loss_fn.backward() to clear the gradients.

Data Leakage

Use Pipeline() API in Scikit-Learn or check data segregation carefully when using other libraries to prevent data leakage.
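A NumPy sketch of the segregation rule that Scikit-Learn's Pipeline() enforces (the data here is a toy example): preprocessing statistics must be computed on the training split only and then applied unchanged to the test split; computing them on all data leaks test-set information into training.

```python
import numpy as np

X = np.arange(10, dtype=float).reshape(-1, 1)
train, test = X[:8], X[8:]

# Smell: statistics computed on ALL data leak test information.
leaky_mean = X.mean()

# Preferred: fit preprocessing on the training split only, then apply
# the SAME statistics to the test split.
train_mean, train_std = train.mean(), train.std()
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std

assert leaky_mean != train_mean
```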

Threshold-Dependent Validation

Use threshold-independent metrics instead of threshold-dependent ones in model evaluation.
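A small NumPy sketch of the contrast (labels and scores are made up): accuracy depends on an arbitrary cutoff, while AUC, computed here in its Mann-Whitney form, summarizes ranking quality over all cutoffs at once.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

def acc_at(threshold):
    """Accuracy of thresholded predictions -- a threshold-dependent metric."""
    return float(np.mean((scores >= threshold).astype(int) == y_true))

# The same scores yield different accuracies at different cutoffs.
assert acc_at(0.5) != acc_at(0.05)

# AUC (Mann-Whitney form): fraction of positive/negative pairs ranked
# correctly -- no cutoff involved.
pos, neg = scores[y_true == 1], scores[y_true == 0]
auc = float(np.mean(pos[:, None] > neg[None, :]))
assert auc == 0.75
```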