Melanoma Tumor Size Prediction

Regressor for Predicting Melanoma Tumor Size

Data Information

  • Datasets provided by Machine Hack, containing tumor sizes and relevant attributes.

  • Training set: 9,146 entries, 10 features

  • Test set: 36,584 entries, 10 features

  • 0 null values

  • Features

    • mass_npea: the mass of the area under study for melanoma tumor.
    • size_npear: the size of the area under study for melanoma tumor.
    • malign_ratio: the ratio of normal to malign surface under study.
    • damage_size: irrecoverable area of skin damaged by the tumor.
    • exposed_area: total area exposed to the tumor.
    • std_dev_malign: standard deviation of malign skin measurements.
    • err_malign: error in malign skin measurements.
    • malign_penalty: penalty imposed due to measurement error in the lab.
    • damage_ratio: the ratio of damage to total spread on the skin.
    • tumor_size: size of melanoma tumor.
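
A quick sanity check of the shape and null counts reported above could look like the sketch below. It is a minimal illustration only: the helper name `summarize` is hypothetical, and the in-memory CSV holds made-up values standing in for the real Machine Hack files, whose paths are not given here.

```python
import io

import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Return row count, column count, and total number of missing values."""
    return {
        "n_rows": df.shape[0],
        "n_columns": df.shape[1],
        "n_nulls": int(df.isnull().sum().sum()),
    }


# Tiny in-memory stand-in for the real CSV; the values are made up.
sample_csv = io.StringIO(
    "mass_npea,size_npear,malign_ratio,tumor_size\n"
    "6930.9,2919.0,0.43,14.0\n"
    "15635.7,4879.4,0.30,8.0\n"
    "10376.2,2613.9,0.24,11.9\n"
)
df = pd.read_csv(sample_csv)
summary = summarize(df)
print(summary)
```

With the real data, the same helper would be called on the DataFrame returned by `pd.read_csv` for each file.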

Exploratory Data Analysis

Distribution of Feature Values

  • Apparent correlations between features are most likely due to the inherent proportionality between size and mass.

  • Mean of malign_ratio ≈ 0.3, indicating high prevalence of malignancy in the dataset.
  • Left-skewed distribution of damage_ratio supports this notion.
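
The mean and skewness observations above can be checked numerically as in the sketch below. The data is synthetic (Beta-distributed stand-ins chosen to mimic a mean near 0.3 and a left skew), so it only illustrates the calculation and does not reproduce the actual dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real columns (assumption, not the actual data):
# Beta(3, 7) has mean 0.3; Beta(5, 2) has a long left tail.
df = pd.DataFrame({
    "malign_ratio": rng.beta(3, 7, size=2000),
    "damage_ratio": rng.beta(5, 2, size=2000),
})

mean_ratio = df["malign_ratio"].mean()
skewness = df["damage_ratio"].skew()  # negative value means left-skewed

print(f"mean malign_ratio: {mean_ratio:.3f}")
print(f"damage_ratio skewness: {skewness:.3f}")
```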

Pearson Correlations

  • Notable tumor_size correlations

    • size_npear
    • malign_ratio
    • damage_size
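
A ranking of features by Pearson correlation with tumor_size can be computed as in the sketch below. The data is synthetic and built so that one feature drives the target, so the resulting ranking says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic example: size_npear drives the target, err_malign is pure noise.
size = rng.normal(size=n)
df = pd.DataFrame({
    "size_npear": size,
    "err_malign": rng.normal(size=n),
    "tumor_size": 2.0 * size + rng.normal(size=n),
})

# Pearson correlation of every feature with the target, strongest first.
corr = df.corr(method="pearson")["tumor_size"].drop("tumor_size")
ranked = corr.sort_values(key=np.abs, ascending=False)
print(ranked)
```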

Modeling

Overview

  • Nature of task

    • Supervised learning
    • Regression to predict numerical value
  • Machine learning tools used

    • Scikit-Learn
    • Keras

Procedure

I. Data Preprocessing

  1. Training/Validation Split (70% / 30%)
  2. Standardization of features with StandardScaler.
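
The two preprocessing steps can be sketched as follows. The data is synthetic, the random seed is arbitrary, and fitting the scaler on the training portion only is an assumption about the workflow (a common practice to avoid leakage), not something stated above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 1000 rows, 9 predictor columns, 1 target.
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 9))
y = rng.normal(size=1000)

# 70% training, 30% validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)

print(X_train_s.shape, X_val_s.shape)
```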

II. Hyperparameter Tuning with Randomized Search

  • cv = 3
  • n_iter = 50
  • scoring = 'neg_mean_squared_error'
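
A randomized search with these three settings might be set up as in the sketch below, shown for a Random Forest. The parameter distributions, the synthetic data, and the seeds are illustrative assumptions; the search spaces actually used for each model are not listed here.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)

# Small synthetic regression problem so the search runs quickly.
X = rng.normal(size=(150, 9))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=150)

# Hypothetical search space (assumption, not the ranges used in this project).
param_distributions = {
    "n_estimators": randint(5, 30),
    "max_depth": randint(2, 8),
    "min_samples_leaf": randint(1, 5),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=50,                          # 50 sampled configurations
    cv=3,                               # 3-fold cross-validation
    scoring="neg_mean_squared_error",   # higher (closer to 0) is better
    random_state=0,
)
search.fit(X, y)

best_mse = -search.best_score_  # flip the sign to recover the MSE
print(search.best_params_, best_mse)
```

Because scikit-learn maximizes scores, the MSE is reported negated; the sign flip at the end recovers the usual positive value.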

III. Training with Tuned Hyperparameters

IV. Performance Evaluation

  • Evaluation metrics

    • MSE
    • R2
  • Models trained and evaluated

    • Multiple Linear Regression
    • Random Forest
    • Support Vector Machine
    • Multi-Layer Perceptron
    • Keras Regression
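
Computing both metrics for several models on a held-out validation split can be sketched as below. Only the four scikit-learn models are shown (the Keras model is omitted), the hyperparameters are untuned placeholders, and the data is synthetic, so the printed numbers are unrelated to the results reported in this write-up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic, mostly linear regression problem.
X = rng.normal(size=(400, 9))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=400)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Placeholder hyperparameters (assumption); tuned values are not given here.
models = {
    "Multiple Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Support Vector Machine": SVR(),
    "Multi-Layer Perceptron": MLPRegressor(
        hidden_layer_sizes=(32,), max_iter=2000, random_state=0
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    results[name] = {
        "MSE": mean_squared_error(y_val, pred),
        "R2": r2_score(y_val, pred),
    }

for name, scores in results.items():
    print(f"{name}: MSE={scores['MSE']:.2f}, R2={scores['R2']:.2f}")
```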

Model Comparison

Model                        MSE     R2
Multiple Linear Regression   26.43   0.29
Random Forest                16.21   0.57
Support Vector Machine       21.53   0.42
Multi-Layer Perceptron       29.62   0.21
Keras Regression             18.69   -

  • Highest-performing models: Random Forest, Keras Regression

Performance on Test Set

Model              MSE     R2
Random Forest      8.25    0.23
Keras Regression   12.12   -

  • Best model: Random Forest

Features of Importance

  • Primary features of importance

    • malign_ratio
    • damage_size
    • malign_penalty
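
One common way to obtain such a ranking is the impurity-based `feature_importances_` attribute of a fitted Random Forest, sketched below. Whether this is the method used here is an assumption, and the data is synthetic with a target built from two of the features, so the ranking it prints is illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: the target depends on malign_ratio and damage_size only.
features = ["malign_ratio", "damage_size", "malign_penalty", "exposed_area"]
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=features)
y = (
    4.0 * X["malign_ratio"]
    + 2.0 * X["damage_size"]
    + rng.normal(scale=0.3, size=500)
)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort from most to least important.
importances = pd.Series(
    forest.feature_importances_, index=features
).sort_values(ascending=False)
print(importances)
```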

Assumptions/Limitations

  • Assumption

    • All measurements have been obtained with sufficient accuracy/precision.
  • Limitations

    • The training set (9,146 entries) is small relative to the test set (36,584 entries).
    • Low number of features.

Conclusion

  • Best model: Random Forest

  • Primary features of importance: malign_ratio, damage_size, malign_penalty

  • Prospective improvements

    • Larger dataset
    • Hyperparameter tuning with different techniques
    • Further experimentation with neural network/deep learning models
Michael Son
Research Scientist | Data Scientist | Data Analyst | ML Engineer

Interests: Biotechnology, Data Science, Machine Learning
