Context
Reducing child mortality is reflected in several of the United Nations' Sustainable Development Goals and is a key indicator of human progress. The UN expects that by 2030, countries will end preventable deaths of newborns and children under 5 years of age, with all countries aiming to reduce under-5 mortality to at least as low as 25 per 1,000 live births.
Parallel to child mortality is, of course, maternal mortality, which accounted for 295,000 deaths during and following pregnancy and childbirth (as of 2017). The vast majority of these deaths (94%) occurred in low-resource settings, and most could have been prevented.
In light of the above, cardiotocograms (CTGs) are a simple and cost-accessible way to assess fetal health, allowing healthcare professionals to act in time to prevent child and maternal mortality. The equipment works by sending ultrasound pulses and reading the response, shedding light on fetal heart rate (FHR), fetal movements, uterine contractions and more.
Project content
- Data Exploration and Visualization
- Data Cleaning
- Data Transformation
- Modeling and Hyperparameter Tuning
- Unsupervised Training
- Evaluation
Target Variable
Let’s talk about our target variable, fetal health. For this project, fetal health was separated into 3 different categories.
- Normal
- Suspect
- Pathological
We will use the remaining variables to predict which category the health of a fetus never before seen by the algorithm falls into. Given the importance of the prediction, we set our goal at a tool with at least 90% accuracy.
Libraries and packages
import warnings
warnings.simplefilter(action="ignore")
warnings.filterwarnings('ignore')
# Import the necessary packages
import numpy as np
import pandas as pd
# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Algorithms
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
Let's load the dataset.
# Load the dataset and separate features from the target
data = pd.read_csv('data/fetal_health.csv')
X = data.drop('fetal_health', axis=1)
y = data['fetal_health']

# Hold out 30% of the rows for final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

# Rejoin the target so exploration is done on the training split only
train = X_train.join(y_train)
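Since the split above is not stratified, it is worth confirming that both splits keep a similar class balance; a quick hedged check (not part of the original notebook):
# Class proportions in the training and test splits
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))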
Data Exploration, Visualization and Analysis
Exploration
train.head(5)
baseline value | accelerations | fetal_movement | uterine_contractions | light_decelerations | severe_decelerations | prolongued_decelerations | abnormal_short_term_variability | mean_value_of_short_term_variability | percentage_of_time_with_abnormal_long_term_variability | ... | histogram_min | histogram_max | histogram_number_of_peaks | histogram_number_of_zeroes | histogram_mode | histogram_mean | histogram_median | histogram_variance | histogram_tendency | fetal_health | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
678 | 140.0 | 0.010 | 0.000 | 0.000 | 0.000 | 0.0 | 0.0 | 58.0 | 1.2 | 0.0 | ... | 62.0 | 188.0 | 4.0 | 0.0 | 176.0 | 164.0 | 169.0 | 27.0 | 1.0 | 1.0 |
512 | 154.0 | 0.003 | 0.003 | 0.003 | 0.000 | 0.0 | 0.0 | 56.0 | 0.6 | 1.0 | ... | 140.0 | 171.0 | 1.0 | 0.0 | 161.0 | 160.0 | 162.0 | 2.0 | 1.0 | 1.0 |
613 | 146.0 | 0.009 | 0.003 | 0.001 | 0.000 | 0.0 | 0.0 | 41.0 | 1.9 | 0.0 | ... | 50.0 | 180.0 | 7.0 | 0.0 | 154.0 | 152.0 | 155.0 | 18.0 | 1.0 | 1.0 |
831 | 152.0 | 0.000 | 0.000 | 0.002 | 0.002 | 0.0 | 0.0 | 61.0 | 0.5 | 61.0 | ... | 99.0 | 160.0 | 4.0 | 3.0 | 159.0 | 155.0 | 158.0 | 4.0 | 1.0 | 2.0 |
369 | 138.0 | 0.000 | 0.008 | 0.001 | 0.002 | 0.0 | 0.0 | 64.0 | 0.4 | 30.0 | ... | 118.0 | 159.0 | 2.0 | 0.0 | 144.0 | 143.0 | 145.0 | 5.0 | 0.0 | 2.0 |
5 rows × 22 columns
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1488 entries, 678 to 1346
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 baseline value 1488 non-null float64
1 accelerations 1488 non-null float64
2 fetal_movement 1488 non-null float64
3 uterine_contractions 1488 non-null float64
4 light_decelerations 1488 non-null float64
5 severe_decelerations 1488 non-null float64
6 prolongued_decelerations 1488 non-null float64
7 abnormal_short_term_variability 1488 non-null float64
8 mean_value_of_short_term_variability 1488 non-null float64
9 percentage_of_time_with_abnormal_long_term_variability 1488 non-null float64
10 mean_value_of_long_term_variability 1488 non-null float64
11 histogram_width 1488 non-null float64
12 histogram_min 1488 non-null float64
13 histogram_max 1488 non-null float64
14 histogram_number_of_peaks 1488 non-null float64
15 histogram_number_of_zeroes 1488 non-null float64
16 histogram_mode 1488 non-null float64
17 histogram_mean 1488 non-null float64
18 histogram_median 1488 non-null float64
19 histogram_variance 1488 non-null float64
20 histogram_tendency 1488 non-null float64
21 fetal_health 1488 non-null float64
dtypes: float64(22)
memory usage: 307.4 KB
train.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
baseline value | 1488.0 | 133.290995 | 9.883484 | 106.0 | 126.000 | 133.000 | 141.000 | 160.000 |
accelerations | 1488.0 | 0.003185 | 0.003862 | 0.0 | 0.000 | 0.002 | 0.006 | 0.018 |
fetal_movement | 1488.0 | 0.008259 | 0.042758 | 0.0 | 0.000 | 0.000 | 0.003 | 0.481 |
uterine_contractions | 1488.0 | 0.004313 | 0.002934 | 0.0 | 0.002 | 0.004 | 0.006 | 0.014 |
light_decelerations | 1488.0 | 0.001862 | 0.002961 | 0.0 | 0.000 | 0.000 | 0.003 | 0.015 |
severe_decelerations | 1488.0 | 0.000003 | 0.000052 | 0.0 | 0.000 | 0.000 | 0.000 | 0.001 |
prolongued_decelerations | 1488.0 | 0.000141 | 0.000535 | 0.0 | 0.000 | 0.000 | 0.000 | 0.005 |
abnormal_short_term_variability | 1488.0 | 47.096102 | 17.085395 | 12.0 | 32.000 | 49.000 | 61.000 | 87.000 |
mean_value_of_short_term_variability | 1488.0 | 1.328831 | 0.886662 | 0.2 | 0.700 | 1.200 | 1.700 | 7.000 |
percentage_of_time_with_abnormal_long_term_variability | 1488.0 | 9.983871 | 18.577807 | 0.0 | 0.000 | 0.000 | 11.000 | 91.000 |
mean_value_of_long_term_variability | 1488.0 | 8.246438 | 5.693201 | 0.0 | 4.600 | 7.500 | 10.900 | 50.700 |
histogram_width | 1488.0 | 69.969758 | 39.372667 | 3.0 | 36.000 | 67.000 | 100.250 | 176.000 |
histogram_min | 1488.0 | 93.774194 | 29.700954 | 50.0 | 67.000 | 93.000 | 120.000 | 155.000 |
histogram_max | 1488.0 | 163.743952 | 18.140627 | 122.0 | 151.000 | 162.000 | 174.000 | 238.000 |
histogram_number_of_peaks | 1488.0 | 4.040995 | 3.005767 | 0.0 | 2.000 | 3.000 | 6.000 | 18.000 |
histogram_number_of_zeroes | 1488.0 | 0.327957 | 0.753364 | 0.0 | 0.000 | 0.000 | 0.000 | 10.000 |
histogram_mode | 1488.0 | 137.543683 | 16.182615 | 60.0 | 129.000 | 139.000 | 148.000 | 187.000 |
histogram_mean | 1488.0 | 134.691532 | 15.432030 | 73.0 | 125.000 | 136.000 | 145.000 | 180.000 |
histogram_median | 1488.0 | 138.201613 | 14.269137 | 79.0 | 129.000 | 139.000 | 148.000 | 183.000 |
histogram_variance | 1488.0 | 18.342742 | 28.499592 | 0.0 | 2.000 | 7.000 | 23.000 | 269.000 |
histogram_tendency | 1488.0 | 0.324597 | 0.603854 | -1.0 | 0.000 | 0.000 | 1.000 | 1.000 |
fetal_health | 1488.0 | 1.301747 | 0.609009 | 1.0 | 1.000 | 1.000 | 1.000 | 3.000 |
Visualization
# Define a helper to visualize the correlation between variables
def plot_correlation(df):
    corr_matrix = df.corr()
    sns.heatmap(corr_matrix, annot=False)
    plt.show()
Let's first visualize our target variable, fetal_health, and understand what it represents.
train.groupby(['fetal_health']).count().plot(kind='pie', y='accelerations', autopct='%1.1f%%',
startangle=90, labels=['Normal', 'Suspect', 'Pathological'], figsize=(7, 7))
The heatmap below shows the correlation between fetal_health and the rest of the variables.
plot_correlation(train)
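To complement the heatmap, the correlations with the target can also be ranked numerically; a small hedged sketch (not in the original notebook):
# Features ranked by absolute correlation with fetal_health
print(train.corr()['fetal_health'].abs().sort_values(ascending=False).head(10))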
The histograms below show the distribution of each variable in the dataset.
data_hist_plot = data.hist(figsize = (20,20), color = "#5F9EA0")
Data Cleaning
We will now remove variables that mainly contribute noise to the machine learning algorithms, keeping only the most informative features.
from sklearn.ensemble import ExtraTreesRegressor

# Fit an extra-trees model and use its feature importances as a relevance score
etr_model = ExtraTreesRegressor()
etr_model.fit(X_train, y_train)

# Plot the importances (labelled with the column names) to see which features stand out
feat_importances = pd.Series(etr_model.feature_importances_, index=X_train.columns)
feat_importances.nlargest(30).plot(kind='barh', figsize=(10, 10))
plt.show()

# Keep only the features whose importance is at least 0.06
selected_features = etr_model.feature_importances_
selected = [index for index in range(
    selected_features.size) if selected_features[index] >= 0.06]
X_train = X_train.iloc[:, selected]
X_test = X_test.iloc[:, selected]
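To see which columns survived the 0.06 importance threshold, a quick hedged check (the exact list can vary because ExtraTreesRegressor is not seeded here):
# Names of the features kept for modeling
print(list(X_train.columns))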
Data Transformation
Now it is time to normalize the data; in this case we use MinMaxScaler with a feature range of 0 to 1.
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
dif_values = y_train.unique()
# Now we create the transformers
t_norm = ("normalizer", MinMaxScaler(feature_range=(0, 1)), X_train.columns)
column_transformer_X = ColumnTransformer(
transformers=[t_norm], remainder='passthrough')
column_transformer_X.fit(X_train);
X_train = column_transformer_X.transform(X_train)
X_test = column_transformer_X.transform(X_test)
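After the transform, X_train and X_test are plain NumPy arrays. A quick hedged check (not in the original notebook) that each scaled training feature now spans the [0, 1] range:
# Column-wise minima and maxima of the scaled training features
print(X_train.min(axis=0), X_train.max(axis=0))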
Modeling and Hyperparameter Tuning
We will run several grid searches (some of them randomized to trade search time for performance).
from sklearn.metrics import make_scorer, accuracy_score
from sklearn import tree

# Carve a validation split out of the training data for model selection
X_val_train, X_val_test, y_val_train, y_val_test = train_test_split(
    X_train, y_train, test_size=0.20, random_state=123)

# Accuracy scorer; it can be passed to the searches via scoring=scorer
scorer = make_scorer(accuracy_score, greater_is_better=True)
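Each model below is evaluated the same way; a small hypothetical helper (not in the original notebook) would keep those calls consistent:
# Hypothetical convenience wrapper around classification_report
def evaluate(model, X_val, y_val):
    print(classification_report(y_val, model.predict(X_val)))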
Decision Tree Classifier
pipe_tree = tree.DecisionTreeClassifier(random_state=123)
criteria = ['gini', 'entropy']
splitter = ['best', 'random']
max_depth = [int(x) for x in np.linspace(10, 110, num=22)]
max_features = ['auto', 'sqrt', 'log2', None]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
param_grid_tree = {'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'criterion': criteria,
'splitter': splitter
}
gs_tree = GridSearchCV(estimator=pipe_tree, param_grid=param_grid_tree,
cv=10, verbose=2, n_jobs=-1)
best_tree = gs_tree.fit(X_val_train, y_val_train)
Fitting 10 folds for each of 3168 candidates, totalling 31680 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done 17 tasks | elapsed: 1.3s
[Parallel(n_jobs=-1)]: Done 492 tasks | elapsed: 1.8s
[Parallel(n_jobs=-1)]: Done 20620 tasks | elapsed: 7.3s
[Parallel(n_jobs=-1)]: Done 31680 out of 31680 | elapsed: 9.8s finished
print(classification_report(y_val_test,best_tree.predict(X_val_test)))
precision recall f1-score support
1.0 0.95 0.96 0.96 240
2.0 0.78 0.73 0.75 44
3.0 0.69 0.79 0.73 14
accuracy 0.92 298
macro avg 0.81 0.82 0.81 298
weighted avg 0.92 0.92 0.92 298
print(best_tree.best_params_)
{'criterion': 'gini', 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_split': 5, 'splitter': 'best'}
Logistic regression
pipe_lr = LogisticRegression()
penalty = ['l1', 'l2', 'elasticnet', 'none']
dual = [True, False]
c = np.linspace(0.0001, 10000, num=10)
tol = np.linspace(1e-4, 1e-2, num=10)
fit_intercept = [True, False]
class_weight = ['balanced', None]
solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
multi_class = ['auto', 'ovr', 'multinomial']
param_grid_lr = {'penalty': penalty,
'dual': dual,
'C': c,
'tol': tol,
'fit_intercept': fit_intercept,
'class_weight': class_weight,
'solver': solver,
'multi_class': multi_class}
gs_lr = GridSearchCV(estimator=pipe_lr, param_grid=param_grid_lr,
cv=10, verbose=2, n_jobs=-1)
gs_lr.fit(X_val_train, y_val_train)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done 17 tasks | elapsed: 0.0s
Fitting 10 folds for each of 48000 candidates, totalling 480000 fits
[Parallel(n_jobs=-1)]: Done 1320 tasks | elapsed: 0.6s
[Parallel(n_jobs=-1)]: Done 14120 tasks | elapsed: 3.8s
[Parallel(n_jobs=-1)]: Done 32232 tasks | elapsed: 12.4s
[Parallel(n_jobs=-1)]: Done 55592 tasks | elapsed: 22.0s
[Parallel(n_jobs=-1)]: Done 73224 tasks | elapsed: 37.5s
[Parallel(n_jobs=-1)]: Done 98172 tasks | elapsed: 52.0s
[Parallel(n_jobs=-1)]: Done 122700 tasks | elapsed: 1.2min
[Parallel(n_jobs=-1)]: Done 164280 tasks | elapsed: 1.6min
[Parallel(n_jobs=-1)]: Done 201336 tasks | elapsed: 2.0min
[Parallel(n_jobs=-1)]: Done 254874 tasks | elapsed: 2.6min
[Parallel(n_jobs=-1)]: Done 298032 tasks | elapsed: 3.0min
[Parallel(n_jobs=-1)]: Done 353130 tasks | elapsed: 3.7min
[Parallel(n_jobs=-1)]: Done 406570 tasks | elapsed: 4.3min
[Parallel(n_jobs=-1)]: Done 469368 tasks | elapsed: 4.9min
[Parallel(n_jobs=-1)]: Done 480000 out of 480000 | elapsed: 5.1min finished
GridSearchCV(cv=10, estimator=LogisticRegression(), n_jobs=-1,
param_grid={'C': array([1.0000000e-04, 1.1111112e+03, 2.2222223e+03, 3.3333334e+03,
4.4444445e+03, 5.5555556e+03, 6.6666667e+03, 7.7777778e+03,
8.8888889e+03, 1.0000000e+04]),
'class_weight': ['balanced', None],
'dual': [True, False], 'fit_intercept': [True, False],
'multi_class': ['auto', 'ovr', 'multinomial'],
'penalty': ['l1', 'l2', 'elasticnet', 'none'],
'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag',
'saga'],
'tol': array([0.0001, 0.0012, 0.0023, 0.0034, 0.0045, 0.0056, 0.0067, 0.0078,
0.0089, 0.01 ])},
verbose=2)
print(classification_report(y_val_test, gs_lr.best_estimator_.predict(X_val_test)))  # arguments: (y_true, y_pred)
precision recall f1-score support
1.0 0.95 0.93 0.94 247
2.0 0.57 0.74 0.64 34
3.0 0.86 0.71 0.77 17
accuracy 0.89 298
macro avg 0.79 0.79 0.79 298
weighted avg 0.90 0.89 0.90 298
print(gs_lr.best_params_)
{'C': 5555.555600000001, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'multi_class': 'multinomial', 'penalty': 'none', 'solver': 'sag', 'tol': 0.0089}
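Note that many of the combinations above are not valid in scikit-learn: 'liblinear' does not support the multinomial setting, the 'l1' penalty only works with 'liblinear' and 'saga', and dual=True applies only to 'liblinear' with an 'l2' penalty. Those fits fail and are typically recorded with a NaN score; the corresponding FitFailedWarning messages are hidden because warnings were silenced at the top of the notebook. A smaller grid restricted to known-valid combinations, as a rough sketch, could look like this:
# Hedged sketch: a reduced grid using only solver/penalty combinations
# that scikit-learn accepts for this multiclass problem
param_grid_lr_valid = {
    'penalty': ['l2'],
    'C': np.logspace(-3, 3, num=7),
    'solver': ['lbfgs', 'newton-cg', 'saga'],
    'class_weight': ['balanced', None],
}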
Gradient Boosting Classifier
pipe_gbc = GradientBoostingClassifier(random_state=123)
learning_rate = np.linspace(0.001, 1, num=10)
n_estimators = [50, 100, 200, 500]
criteria = ['friedman_mse', 'mse', 'mae']
min_samples_split = np.linspace(0, 20, num=10, dtype=int)
min_samples_leaf = np.linspace(1e-2, 0.5, num=10)
min_samples_leaf = np.append(min_samples_leaf, 1)  # np.append returns a new array, so assign the result
min_weight_fraction_leaf = [0, 1]
max_depth = np.arange(1, 21, 2)
max_features = ['auto', 'sqrt', 'log2']
param_grid_gbc = {'learning_rate': learning_rate,
'n_estimators': n_estimators,
'criterion': criteria,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'min_weight_fraction_leaf': min_weight_fraction_leaf,
'max_depth': max_depth,
'max_features': max_features}
gs_gbc = RandomizedSearchCV(estimator=pipe_gbc, param_distributions=param_grid_gbc, n_iter=350,
cv=10, verbose=2, n_jobs=-1)
gs_gbc.fit(X_val_train, y_val_train)
Fitting 10 folds for each of 350 candidates, totalling 3500 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done 17 tasks | elapsed: 2.7s
[Parallel(n_jobs=-1)]: Done 138 tasks | elapsed: 17.9s
[Parallel(n_jobs=-1)]: Done 341 tasks | elapsed: 50.2s
[Parallel(n_jobs=-1)]: Done 788 tasks | elapsed: 1.2min
[Parallel(n_jobs=-1)]: Done 1472 tasks | elapsed: 2.1min
[Parallel(n_jobs=-1)]: Done 2141 tasks | elapsed: 3.5min
[Parallel(n_jobs=-1)]: Done 3012 tasks | elapsed: 4.9min
[Parallel(n_jobs=-1)]: Done 3500 out of 3500 | elapsed: 5.8min finished
RandomizedSearchCV(cv=10,
estimator=GradientBoostingClassifier(random_state=123),
n_iter=350, n_jobs=-1,
param_distributions={'criterion': ['friedman_mse', 'mse',
'mae'],
'learning_rate': array([0.001, 0.112, 0.223, 0.334, 0.445, 0.556, 0.667, 0.778, 0.889,
1. ]),
'max_depth': array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]),
'max_features': ['auto', 'sqrt',
'log2'],
'min_samples_leaf': array([0.01 , 0.06444444, 0.11888889, 0.17333333, 0.22777778,
0.28222222, 0.33666667, 0.39111111, 0.44555556, 0.5 ]),
'min_samples_split': array([ 0, 2, 4, 6, 8, 11, 13, 15, 17, 20]),
'min_weight_fraction_leaf': [0, 1],
'n_estimators': [50, 100, 200, 500]},
verbose=2)
print(classification_report(y_val_test, gs_gbc.predict(X_val_test)))
precision recall f1-score support
1.0 0.96 0.95 0.96 240
2.0 0.77 0.75 0.76 44
3.0 0.76 0.93 0.84 14
accuracy 0.92 298
macro avg 0.83 0.88 0.85 298
weighted avg 0.92 0.92 0.92 298
print(gs_gbc.best_params_)
{'n_estimators': 50, 'min_weight_fraction_leaf': 0, 'min_samples_split': 11, 'min_samples_leaf': 0.01, 'max_features': 'auto', 'max_depth': 3, 'learning_rate': 0.223, 'criterion': 'friedman_mse'}
LightGBM Classifier
pipe_light = LGBMClassifier(random_state=123)
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
bootstrap = [True, False]
param_grid_light = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'bootstrap': bootstrap
}
gs_light = GridSearchCV(estimator=pipe_light, param_grid=param_grid_light, cv=5, verbose=2, n_jobs=-1)
gs_light.fit(X_val_train, y_val_train);
Fitting 5 folds for each of 440 candidates, totalling 2200 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done 17 tasks | elapsed: 1.3s
[Parallel(n_jobs=-1)]: Done 138 tasks | elapsed: 11.8s
[Parallel(n_jobs=-1)]: Done 341 tasks | elapsed: 28.8s
[Parallel(n_jobs=-1)]: Done 624 tasks | elapsed: 52.2s
[Parallel(n_jobs=-1)]: Done 989 tasks | elapsed: 1.4min
[Parallel(n_jobs=-1)]: Done 1434 tasks | elapsed: 2.0min
[Parallel(n_jobs=-1)]: Done 1961 tasks | elapsed: 2.7min
[Parallel(n_jobs=-1)]: Done 2200 out of 2200 | elapsed: 3.1min finished
[LightGBM] [Warning] Unknown parameter: bootstrap
[LightGBM] [Warning] Unknown parameter: max_features
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
print(classification_report(y_val_test, gs_light.predict(X_val_test)))
precision recall f1-score support
1.0 0.95 0.96 0.96 240
2.0 0.79 0.70 0.75 44
3.0 0.72 0.93 0.81 14
accuracy 0.92 298
macro avg 0.82 0.86 0.84 298
weighted avg 0.92 0.92 0.92 298
print(gs_light.best_params_)
{'bootstrap': True, 'max_depth': 30, 'max_features': 'auto', 'n_estimators': 200}
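The warnings above confirm that `bootstrap` and `max_features` are not LightGBM parameters, so they had no effect on the search. A grid built from parameters LightGBM actually recognizes would make the search more meaningful; a rough sketch under that assumption (hypothetical names, not run here):
# Sketch of a grid over native LightGBM hyperparameters
param_grid_light_native = {
    'n_estimators': [200, 500, 1000],
    'max_depth': [10, 30, 50],
    'num_leaves': [31, 63, 127],
    'learning_rate': [0.01, 0.05, 0.1],
}
gs_light_native = GridSearchCV(estimator=LGBMClassifier(random_state=123),
                               param_grid=param_grid_light_native,
                               cv=5, n_jobs=-1)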
Random Forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
rf = RandomForestClassifier(random_state=123)
gs_rf = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
n_iter=100, cv=10, verbose=2, random_state=123, n_jobs=-1)
gs_rf.fit(X_val_train, y_val_train);
Fitting 10 folds for each of 100 candidates, totalling 1000 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done 17 tasks | elapsed: 7.4s
[Parallel(n_jobs=-1)]: Done 138 tasks | elapsed: 33.3s
[Parallel(n_jobs=-1)]: Done 341 tasks | elapsed: 1.2min
[Parallel(n_jobs=-1)]: Done 624 tasks | elapsed: 2.0min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 3.2min finished
print(classification_report(y_val_test, gs_rf.predict(X_val_test)))
precision recall f1-score support
1.0 0.95 0.97 0.96 240
2.0 0.84 0.70 0.77 44
3.0 0.87 0.93 0.90 14
accuracy 0.93 298
macro avg 0.88 0.87 0.87 298
weighted avg 0.93 0.93 0.93 298
print(gs_rf.best_params_)
{'n_estimators': 800, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'auto', 'max_depth': 10, 'bootstrap': True}
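Because refit=True by default, the randomized search has already refit the best configuration on all of X_val_train, so the tuned model can also be pulled straight from the search object instead of being re-instantiated by hand later; a hedged shortcut:
# The search object already holds the best model, refit on X_val_train
best_rf_from_search = gs_rf.best_estimator_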
Modeling Conclusion
Random forest achieved the best overall performance on the validation split, with 93% accuracy and the highest macro-averaged F1 score.
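Summarizing the validation classification reports above:

Model | Validation accuracy | Macro avg F1
---|---|---
Decision Tree | 0.92 | 0.81
Logistic Regression | 0.89 | 0.79
Gradient Boosting | 0.92 | 0.85
LightGBM | 0.92 | 0.84
Random Forest | 0.93 | 0.87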
Unsupervised Training
Now that we have found our best model, we will try adding a KMeans cluster label as an extra feature, to see whether it makes the target classes easier to separate.
from sklearn.cluster import KMeans

# Fit KMeans with one cluster per fetal health class
kmeans = KMeans(n_clusters=3, random_state=123).fit(X_val_train)
test_cluster = kmeans.predict(X_val_test)

# Append the cluster label as an extra feature column
X_val_train = np.append(X_val_train, np.expand_dims(kmeans.labels_, axis=1), axis=1)
X_val_test = np.append(X_val_test, np.expand_dims(test_cluster, axis=1), axis=1)
# Refit a random forest with the best hyperparameters found above
best_rf = RandomForestClassifier(random_state=123, n_estimators=800, min_samples_split=5,
                                 min_samples_leaf=2, max_features='auto', max_depth=10, bootstrap=True)
best_rf.fit(X_val_train, y_val_train)
RandomForestClassifier(max_depth=10, min_samples_leaf=2, min_samples_split=5,
n_estimators=800, random_state=123)
print(classification_report(y_val_test,best_rf.predict(X_val_test)))
precision recall f1-score support
1.0 0.95 0.97 0.96 240
2.0 0.84 0.70 0.77 44
3.0 0.87 0.93 0.90 14
accuracy 0.93 298
macro avg 0.88 0.87 0.87 298
weighted avg 0.93 0.93 0.93 298
Since none of the metrics changed, we will not include the cluster feature, to avoid adding noise to the model.
Evaluation
We will recreate the best model, this time training it on the full training set and evaluating it on the held-out test set.
# Retrain the best random forest on the entire training set
best_rf = RandomForestClassifier(random_state=123, n_estimators=800, min_samples_split=5,
                                 min_samples_leaf=2, max_features='auto', max_depth=10, bootstrap=True)
best_rf.fit(X_train, y_train)
RandomForestClassifier(max_depth=10, min_samples_leaf=2, min_samples_split=5,
n_estimators=800, random_state=123)
print(classification_report(y_test,best_rf.predict(X_test)))
precision recall f1-score support
1.0 0.95 0.95 0.95 497
2.0 0.76 0.74 0.75 84
3.0 0.83 0.86 0.84 57
accuracy 0.92 638
macro avg 0.85 0.85 0.85 638
weighted avg 0.92 0.92 0.92 638
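As an additional hedged check (not in the original notebook), a confusion matrix shows where the remaining errors fall, in particular how many Pathological cases are missed:
from sklearn.metrics import confusion_matrix

# Rows are true classes (1=Normal, 2=Suspect, 3=Pathological), columns are predicted classes
print(confusion_matrix(y_test, best_rf.predict(X_test)))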
Conclusion
We have developed a tool that predicts the fetal health category with 92% accuracy on the held-out test set. Since our goal was at least 90%, we can conclude that the project met its objective.