dslinter - Linter for Machine-Learning-Specific Code Smells

  • dslinter is a pylint plugin for linting data science and machine learning code. We plan to support the following Python libraries: TensorFlow, PyTorch, Scikit-Learn, Pandas, NumPy and SciPy.
  • dslinter aims to help data scientists and developers produce and maintain high-quality machine learning application code.
  • dslinter does this by checking violations of best coding practices for machine learning libraries.

🔧 To install from source for development purposes: clone this repo and install the plugin with: pip install -e .

🔧 To install from the Python Package Index: pip install dslinter

😎 To only use the checkers implemented in this plugin, run: pylint --load-plugins=dslinter --disable=all --enable=dataframe,nan,hyperparameters,import,data-leakage,controlling-randomness,excessive-hyperparameter-precision,pca-scaler <other_options> <path_to_sources>

😎 To expand a current pylint configuration with the checkers from this plugin, run: pylint --load-plugins=dslinter <other_options> <path_to_sources>

📝 Tests can be run by using the pytest package: pytest .

In-Place APIs Misused

Remember to assign the result of an operation to a variable or set the in-place parameter in the API.
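
A minimal sketch of this smell in Pandas (the DataFrame and values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

# Smell: the result is discarded; df still contains the NaN row.
df.dropna()

# Fix: assign the result (or pass inplace=True where the API offers it).
df = df.dropna()
```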

Unnecessary Iteration

Avoid unnecessary iterations. Use vectorized solutions instead of loops.
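
For example, in Pandas a row-by-row loop can usually be replaced by a single column-wise operation (the data here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": range(1000)})

# Smell: Python-level iteration over rows.
slow = sum(row.x * 2 for row in df.itertuples())

# Fix: one vectorized operation on the whole column.
fast = (df["x"] * 2).sum()
```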

No Scaling Before Scaling-sensitive Operation

Check whether feature scaling is added before scaling-sensitive operations.
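
A sketch with scikit-learn, where PCA is a scaling-sensitive step (the data and component count are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Features on very different scales.
X = np.random.RandomState(0).rand(20, 3) * [1.0, 10.0, 100.0]

# Fix: scale the features before the scaling-sensitive PCA step.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
X_reduced = pipe.fit_transform(X)
```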

Hyperparameter not Explicitly Set

Hyperparameters should be set explicitly.
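
A sketch with scikit-learn (the estimator and the chosen values are illustrative, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier

# Smell: every hyperparameter silently left at the library default.
clf_implicit = RandomForestClassifier()

# Fix: set the important hyperparameters explicitly.
clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
```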

Memory not Freed

Free the memory occupied by large objects, such as models or intermediate results, as soon as they are no longer needed.
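
A framework-free sketch of the idea (in TensorFlow or PyTorch you would release the framework's own objects, e.g. with tf.keras.backend.clear_session(); the function and sizes here are hypothetical):

```python
import gc

def train_step(batch):
    # Hypothetical large intermediate result.
    activations = [0.0] * 1_000_000
    loss = sum(activations)
    # Fix: release the large intermediate as soon as it is no longer
    # needed, instead of keeping it alive for the rest of the step.
    del activations
    gc.collect()
    return loss

result = train_step(None)
```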

Deterministic Algorithm Option Not Used

Set the deterministic algorithm option to True during the development process, and switch to the option that provides better performance in production.
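
In PyTorch this option is a single global switch (a sketch; some operations additionally require environment configuration to run deterministically):

```python
import torch

# During development: force deterministic implementations, so runs are reproducible.
torch.use_deterministic_algorithms(True)

# ... develop and debug ...

# In production: switch back to the faster non-deterministic kernels.
torch.use_deterministic_algorithms(False)
```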

Missing the Mask of Invalid Value

Add a mask for possible invalid values. For example, developers should add a mask to the input of the tf.log() API.
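
The same idea sketched with the NumPy equivalent of the logarithm (the epsilon value is an illustrative choice):

```python
import numpy as np

x = np.array([1.0, 0.0, -1.0])

# Smell: log of zero or negative inputs yields -inf / nan.
# Fix: mask (clip) the input into the valid domain first.
eps = 1e-10
safe_log = np.log(np.clip(x, eps, None))
```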

Randomness Uncontrolled

Set random seed explicitly during the development process whenever a possible random procedure is involved in the application.
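
A sketch with NumPy's random generator (the seed value is illustrative):

```python
import numpy as np

# Fix: seed every source of randomness explicitly during development.
rng = np.random.default_rng(seed=42)
sample_a = rng.standard_normal(5)

# The same seed reproduces the same draws.
rng = np.random.default_rng(seed=42)
sample_b = rng.standard_normal(5)
```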

Data Leakage

Use Pipeline() API in Scikit-Learn or check data segregation carefully when using other libraries to prevent data leakage.
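
A sketch of the Pipeline approach in scikit-learn (the data and estimator are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rs = np.random.RandomState(0)
X, y = rs.rand(100, 4), rs.randint(0, 2, 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fix: the scaler is fitted inside the pipeline on training data only,
# so the test set never influences the preprocessing statistics.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
```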

Threshold-Dependent Validation

Use threshold-independent metrics instead of threshold-dependent ones in model evaluation.
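
For example, with scikit-learn, AUC evaluates the ranking of predictions across all thresholds, whereas accuracy depends on one arbitrary cut-off (the labels and probabilities are illustrative):

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

# Smell: accuracy depends on the arbitrary 0.5 threshold.
acc = accuracy_score(y_true, [int(p >= 0.5) for p in y_prob])

# Fix: AUC is threshold-independent.
auc = roc_auc_score(y_true, y_prob)
```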

NaN Equivalence Comparison Misused

Be careful when using the NaN equivalence comparison in NumPy and Pandas.
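
The pitfall is that NaN never compares equal to itself, so == cannot detect it; a short sketch:

```python
import numpy as np
import pandas as pd

# Pitfall: NaN == NaN is False.
assert not (np.nan == np.nan)

# Fix: use the dedicated checks instead of equivalence comparison.
arr = np.array([1.0, np.nan])
np_mask = np.isnan(arr)

df = pd.DataFrame({"a": [1.0, np.nan]})
pd_mask = df["a"].isna()
```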

Chain Indexing

Avoid using chain indexing in Pandas.
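
A minimal sketch (the DataFrame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Smell: chained indexing such as df["a"][0] = 10 may act on a copy,
# so the assignment can be silently lost.

# Fix: a single .loc call addresses row and column at once.
df.loc[0, "a"] = 10
```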

Dataframe Conversion API Misused

Use df.to_numpy() in Pandas instead of df.values to transform a DataFrame into a NumPy array.
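
A minimal sketch (the DataFrame is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Fix: to_numpy() is the recommended conversion API;
# .values is a legacy attribute (note: not a method).
arr = df.to_numpy()
```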

Matrix Multiplication API Misused

When the multiply operation is performed on two-dimensional matrices, use np.matmul() instead of np.dot() in NumPy for better semantics.
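
A minimal sketch (the matrices are illustrative):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# Fix: matmul (or the @ operator) states the intent clearly for 2-D matrices.
c = np.matmul(a, b)   # equivalently: a @ b
```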

Columns and DataType Not Explicitly Set

Explicitly select the needed columns and set their data types in Pandas.
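
For example, when reading a CSV (the columns and dtypes here are illustrative):

```python
import io
import pandas as pd

csv = io.StringIO("id,price,comment\n1,9.99,ok\n2,19.50,fine\n")

# Fix: select only the needed columns and pin their dtypes up front,
# instead of relying on pandas' type inference.
df = pd.read_csv(csv, usecols=["id", "price"],
                 dtype={"id": "int64", "price": "float64"})
```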

Empty Column Misinitialization

When a new empty column is needed in a Pandas DataFrame, use the NaN value from NumPy instead of zeros or empty strings.
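
A minimal sketch (the DataFrame is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Smell: zeros or empty strings later look like real data.
# df["new"] = 0

# Fix: NaN marks the values as missing and works with isna()/fillna().
df["new"] = np.nan
```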

Merge API Parameter Not Explicitly Set

Explicitly specify the on, how, and validate parameters of the df.merge() API in Pandas for better readability.
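
A minimal sketch (the frames and the one_to_one expectation are illustrative):

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2], "x": ["a", "b"]})
right = pd.DataFrame({"key": [2, 3], "y": ["c", "d"]})

# Fix: state the join column, join type, and expected cardinality explicitly;
# validate= raises if the merge keys violate the stated relationship.
merged = left.merge(right, on="key", how="inner", validate="one_to_one")
```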

Broadcasting Feature Not Used

Use the broadcasting feature in TensorFlow 2 to be more memory efficient.
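
TensorFlow 2 follows NumPy-style broadcasting rules, so the idea can be sketched with NumPy (the shapes are illustrative):

```python
import numpy as np

row = np.arange(4).reshape(1, 4)
col = np.arange(3).reshape(3, 1)

# Fix: broadcasting produces the (3, 4) result without
# materializing tiled copies of either operand.
grid = row + col
```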

TensorArray Not Used

Use tf.TensorArray() in TensorFlow 2 if the value of the array will change in the loop.
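
A minimal sketch (the loop body is illustrative; in graph-traced code a plain Python list could not be written to inside the loop):

```python
import tensorflow as tf

# Fix: tf.TensorArray supports element writes inside a loop.
ta = tf.TensorArray(tf.int32, size=4)
for i in tf.range(4):
    ta = ta.write(i, i * i)
result = ta.stack()
```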

Pytorch Call Method Misused

Use self.net() in PyTorch to forward the input to the network instead of self.net.forward().
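
A minimal sketch (the module and input are illustrative):

```python
import torch
from torch import nn

net = nn.Linear(3, 2)
x = torch.zeros(1, 3)

# Smell: net.forward(x) bypasses hooks registered on the module.
# Fix: call the module itself; __call__ dispatches to forward() plus hooks.
out = net(x)
```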

Training / Evaluation Mode Improper Toggling

Toggle the training/evaluation mode in the appropriate place in PyTorch code, so that the model is not accidentally left in evaluation mode after the inference step.
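
A minimal sketch (the model is illustrative; Dropout behaves differently in the two modes):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))

model.eval()                      # inference: dropout disabled
with torch.no_grad():
    _ = model(torch.zeros(1, 4))

model.train()                     # Fix: toggle back before resuming training
```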

Gradients Not Cleared before Backward Propagation

Use optimizer.zero_grad(), loss.backward(), and optimizer.step() together, in that order, in PyTorch. Do not forget to call optimizer.zero_grad() before loss.backward() to clear the gradients left over from the previous step.
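
A minimal training-loop sketch (the model, optimizer, and data are illustrative):

```python
import torch
from torch import nn

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(4, 2), torch.randn(4, 1)

for _ in range(3):
    optimizer.zero_grad()                          # 1. clear stale gradients
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()                                # 2. accumulate fresh gradients
    optimizer.step()                               # 3. update parameters
```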