Journal of Minimally Invasive Surgery 2024; 27(3): 129-137
Published online September 15, 2024
https://doi.org/10.7602/jmis.2024.27.3.129
© The Korean Society of Endo-Laparoscopic & Robotic Surgery
Correspondence to : Youngho Park
Department of Big Data Application, College of Smart Interdisciplinary Engineering, Hannam University, 70 Hannam-ro, Daedeok-gu, Daejeon 34430, Korea
E-mail: yhpark@hnu.kr
https://orcid.org/0000-0002-7096-3967
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Recently, interest in machine learning (ML) has increased as the application fields have expanded significantly. Although ML methods excel in many fields, establishing an ML pipeline requires considerable time and human resources. Automated ML (AutoML) tools offer a solution by automating repetitive tasks, such as data preprocessing, model selection, hyperparameter optimization, and prediction analysis. This review introduces the use of AutoML tools for general research, including clinical studies. In particular, it outlines a simple approach that is accessible to beginners using the R programming language (R Foundation for Statistical Computing). In addition, the practical code and output results for binary classification are provided to facilitate direct application by clinical researchers in future studies.
Keywords Automated machine learning, AutoML, R programming language, Google Colab, Classification
A typical machine learning (ML) analysis process involves several key steps: data collection and preprocessing, model selection and training, performance evaluation, hyperparameter tuning, and model deployment. The field of automated machine learning (AutoML) aims to develop methods for constructing suitable ML models with minimal or no human intervention [1]. Although commercial AutoML systems have recently emerged, research is ongoing to verify that they produce consistent results. A comparative analysis using OpenML demonstrated that the general machine learning (GML) AutoML tool achieved competitive results across seven datasets, surpassing human ML modeling in five of them [2]. A review of 101 papers in the AutoML domain revealed that such automation techniques can match or exceed expert-level performance in specific ML tasks, often within shorter timeframes [4]. AutoML has been applied in diverse fields, such as healthcare [3–5], intrusion detection [6], groundwater level prediction [7], autonomous vehicles [8], and time series forecasting [9].
Recent research has focused on developing tools to fully automate this process. Most previous studies, except for Auto-WEKA, have focused on Python. Although the focus on Python is not problematic, the growing popularity of the R language (R Foundation for Statistical Computing) and its specialization in statistical computing present new research opportunities in R-based automation [10]. This study provides R users with code-based instructions for using AutoML tools and demonstrates the analysis process using example data. In particular, it offers simple methods to make ML more accessible to beginners.
R offers a robust ecosystem of packages to implement and optimize ML models. The following list describes 10 essential R packages widely used in ML applications.
1. caret: The “Classification And REgression Training” (caret) package is integral to the R ML toolkit. It facilitates model training and evaluation and supports a diverse array of algorithms [11]. In addition, it offers extensive functionalities for data preprocessing, feature selection, and hyperparameter tuning, making it a comprehensive solution for developing predictive models.
2. randomForest: This package implements the random forest algorithm, which is a powerful ensemble learning method widely used for both classification and regression tasks [12]. It builds multiple decision trees and merges them to enhance predictive accuracy and control overfitting, making it a staple in ML.
3. xgboost: XGBoost, which stands for Extreme Gradient Boosting, is a high-performance package that implements the gradient boosting framework [13]. Known for its efficiency and accuracy, XGBoost is particularly effective in handling large-scale datasets and complex problem domains, and it often outperforms other models in competitive ML tasks.
4. e1071: The e1071 package provides implementations of several fundamental ML algorithms, including support vector machines (SVM), Naive Bayes classifiers, and clustering methods [14]. It is particularly noteworthy for its support of SVM, which is a powerful method for both classification and regression and is especially effective in high-dimensional spaces.
5. lightgbm: The lightgbm package implements the Light Gradient Boosting Machine (LightGBM), which is an advanced boosting algorithm designed for speed and efficiency [15]. It is optimized for large-scale data and offers higher performance with lower memory consumption and faster training speed than other boosting methods.
6. nnet: nnet is an R package for constructing and training feedforward neural networks [16]. While it is primarily designed for simpler neural network models, it provides essential tools for understanding and applying basic neural network concepts, making it useful for small- to medium-sized tasks.
7. tensorflow: tensorflow is an R package that provides an interface to TensorFlow, an open-source library developed by Google for large-scale ML and deep learning tasks [17]. It is capable of handling complex computations and supports the development of advanced neural network models, including those used in research and production environments.
8. keras: The R interface to Keras allows users to build and train deep learning models with relative ease using TensorFlow as the backend. Keras simplifies the construction of complex neural network architectures by supporting various layers and optimization techniques, making it a popular choice for deep learning applications [18].
9. h2o: The h2o package, developed by H2O.ai, provides a scalable and distributed ML platform [19,20]. It supports a wide array of algorithms and is designed to efficiently handle large datasets. H2O’s integration with R allows users to seamlessly build, train, and deploy models across various computing environments.
10. mlr3: mlr3 is a modern and extensible framework for ML in R, offering a streamlined and modular approach to model training, evaluation, and benchmarking [21]. It supports a broad range of algorithms and evaluation metrics, and its flexible architecture allows for easy integration and customization, making it suitable for research and applied ML tasks.
The analytical framework for applying ML methods to research is centered on addressing specific research questions and ensuring the academic validity of the results. The process typically involves the following systematic steps.
Definition of research questions and objectives
• Problem definition: The problem or hypothesis addressed in the research must be clearly defined. This involves articulating the research question, identifying its scope, and establishing its significance in the context of existing knowledge.
• Establishment of objectives and hypotheses: The research objectives and the corresponding hypotheses must be established and formulated. These may include understanding the relationships between variables, predicting outcomes, and testing theoretical constructs.
Literature review and background research
• Review of related studies: A comprehensive review of the existing literature is essential to identify and analyze previous research that addresses similar problems or uses relevant methodologies. This review positions the current research within the broader academic discourse and highlights research gaps or opportunities for contribution.
• Analysis of ML techniques: A comprehensive examination of contemporary ML methods is crucial, with a particular focus on their applications in similar research contexts. This evaluation informs the selection of models and techniques that yield reliable and valid results.
Data collection and preparation
• Data acquisition: This step involves the identification and acquisition of data from appropriate sources, such as public datasets, experimental results, or survey results. The selection of data sources should align with research objectives, prioritizing relevance, sufficiency, and quality.
• Data cleaning and transformation: Data preprocessing is performed by addressing missing values, eliminating outliers, correcting inaccuracies, and transforming variables into suitable formats. This step ensures that the data are clean, consistent, and ready for analysis, thereby enhancing model accuracy and reliability.
Exploratory data analysis
• Data visualization and statistical summarization: Exploratory data analysis (EDA) is performed using visualization techniques and summary statistics to uncover patterns, distributions, and relationships in the data. The insights provided by EDA serve as a foundation for subsequent modeling decisions.
• Correlation and feature analysis: A thorough analysis of the relationships between variables is essential to identify significant correlations and potential predictors. This analysis facilitates a deeper understanding of the data structure and allows selection of the most relevant features for modeling.
Feature engineering
• Feature selection and creation: The identification and selection of the most informative features and the creation of new features when necessary are crucial steps in enhancing model performance. Feature engineering involves selecting variables that have the greatest impact on the predictive accuracy of the model.
• Feature transformation: To optimize the features of ML models, various transformations, including normalization, encoding, and dimensionality reduction, can be employed. These transformations are crucial for enhancing model efficiency and interpretability.
Model selection, training, and tuning
• Model selection: The selection of ML models is based on the nature of the research problem and data characteristics. This process often requires experimentation with multiple algorithms to determine the most suitable algorithm for achieving the desired research objectives.
• Model training and hyperparameter tuning: The selected model(s) are trained on the prepared dataset with the parameters adjusted to optimize performance. This process includes hyperparameter optimization, which is employed to refine the model and enhance predictive accuracy.
Model evaluation and interpretation
• Evaluation using performance metrics: A thorough evaluation of the model’s performance is essential to assess its effectiveness in addressing the research question. Metrics, such as accuracy, precision, recall, and the F1-score, can provide valuable insights into the model’s performance.
• Cross-validation and generalization assessment: To assess the model’s generalizability and robustness, cross-validation is employed. Cross-validation ensures that the model performs well on unseen data, thus minimizing the risks of overfitting and underfitting.
• Interpretation and discussion of results: A comprehensive analysis of the model’s predictions and performance metrics is essential to draw meaningful conclusions related to the research objectives. By examining the results in the context of the research hypotheses, researchers can highlight the implications, limitations, and potential areas for future research.
This procedure can be adjusted according to specific research objectives and challenges. Through iterative processes, the model can be refined, or new insights can be uncovered.
Google Colab provides an Ubuntu Linux environment in which users can execute Python and R code. Leveraging Google Colab for ML offers several advantages that contribute to efficient execution and development. The advantages of Google Colab include the following.
• Integrated development environment: Google Colab offers a cloud-based Jupyter Notebook interface that seamlessly integrates with Python and R, making it a versatile platform for developing and running ML models.
• Access to free graphics processing units (GPUs) and tensor processing units (TPUs): Google Colab provides access to free hardware accelerators, such as GPUs and TPUs, which significantly enhance computational speed and efficiency when training ML models, especially on large datasets.
• Collaborative features: Google Colab allows real-time collaboration, enabling multiple users to work on the same notebook simultaneously. This feature is particularly useful for team-based projects and academic research.
To use R in Google Colab, the following procedural steps are followed:
1. Access the “Runtime” menu: Navigate to the top of the Google Colab interface and select the “Runtime” menu.
2. Select “Change Runtime Type”: From the dropdown menu, choose the “Change runtime type” option.
3. Configure runtime settings: In the “Notebook settings” window that appears, set the “Runtime type” to “R” from the available options.
By completing these steps, users can configure their Google Colab environment to support the execution of R code.
The dataset used for the ML analysis is the “Heart Failure Prediction” dataset sourced from Kaggle [22]. This dataset consolidates data from five distinct heart disease datasets available in the University of California Irvine Machine Learning Repository and is organized into 12 variables. The final dataset comprises 918 observations, providing a comprehensive basis for the predictive modeling of heart failure.
Attribute information is as follows.
1. Age: patient age (years)
2. Sex: sex of the patient (M: male, F: female)
3. ChestPainType: chest pain type (TA: typical angina, ATA: atypical angina, NAP: non-anginal pain, ASY: asymptomatic)
4. RestingBP: resting blood pressure (mmHg)
5. Cholesterol: serum cholesterol (mg/dL)
6. FastingBS: fasting blood sugar (1: if FastingBS >120 mg/dL, 0: otherwise)
7. RestingECG: resting electrocardiogram results (Normal: normal, ST: having ST-T wave abnormality [T wave inversions and/or ST elevation or depression of >0.05 mV], LVH: showing probable or definite left ventricular hypertrophy according to Estes’ criteria)
8. MaxHR: maximum heart rate achieved (numeric value between 60 and 202)
9. ExerciseAngina: exercise-induced angina (Y: yes, N: no)
10. Oldpeak: ST depression induced by exercise relative to rest (numeric value)
11. ST_Slope: the slope of the peak exercise ST segment (Up: upsloping, Flat: flat, Down: downsloping)
12. HeartDisease: output class (1: heart disease, 0: normal)
Before conducting the analysis, verify the R and Java versions and install the required packages.
Install the packages required for data processing and analysis.
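The exact code appears as figures in the original article and is not reproduced here; a minimal sketch of this setup step, assuming the packages used later in the analysis, could look like the following.

# Confirm the R version and that a Java runtime is available (required by h2o)
R.version.string
system("java -version")

# Install the packages used for data processing and analysis (one-time setup in Colab)
install.packages(c("caret", "dplyr", "ROSE", "h2o"))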
Load the packages required for the analysis and create user-defined functions.
The following commands load the necessary R packages: caret for training and evaluating ML models; dplyr for data manipulation; ROSE for oversampling techniques; h2o for AutoML.
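A minimal sketch of this step is shown below; the user-defined helper functions mentioned above belong to the original code figures and are not reproduced here.

# Load the packages required for the analysis
library(caret)  # training and evaluating ML models
library(dplyr)  # data manipulation
library(ROSE)   # oversampling techniques
library(h2o)    # AutoML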
This step presets the values applied throughout the analysis so that beginners who are uncomfortable modifying R code only need to change a few lines before running it. The sections highlighted in red within the provided code are the parts to modify to suit the analysis; a minimal sketch of these presets follows the variable list below.
The variables are described as follows.
• data_path: defines the path to the CSV file containing the dataset (file uploaded to Colab)
• target_var_name: represents the name of the target variable
• positive_str: represents the value of interest within the target variable
• train_ratios: specifies the proportion of the data to be used for training (range, 0–1)
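As noted above, a sketch of these presets using the variable names listed here might look as follows; the file name and training ratio are illustrative assumptions, not the authors' settings.

# Preset values: edit these to match your own data
data_path       <- "heart.csv"      # CSV file uploaded to Colab (illustrative name)
target_var_name <- "HeartDisease"   # name of the target variable
positive_str    <- "1"              # value of interest (positive class) in the target
train_ratios    <- 0.7              # proportion of data used for training (assumed value)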
Read the data from the CSV file and convert the categorical variables using one-hot encoding. After preprocessing, verify the results and ensure that the data types are accurate. The target variable is converted to a factor.
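A sketch of this preprocessing step, assuming caret's dummyVars() is used for the one-hot encoding, is shown below; the authors' exact encoding code is given in the original figures.

# Read the data from the uploaded CSV file
raw_data <- read.csv(data_path)

# One-hot encode the categorical predictors (fullRank avoids redundant dummy columns)
predictors <- raw_data[, setdiff(names(raw_data), target_var_name)]
dummy_spec <- caret::dummyVars(~ ., data = predictors, fullRank = TRUE)
encoded <- as.data.frame(predict(dummy_spec, newdata = predictors))

# Re-attach the target variable as a factor and check the variable types
encoded[[target_var_name]] <- as.factor(raw_data[[target_var_name]])
str(encoded)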
Initialize the H2O cluster, which is required for running H2O’s ML algorithms.
Convert the processed dataset into an H2O frame, which is compatible with H2O’s functions.
Split the data into training and test sets using the specified ratio. ‘train’ contains the training set, and ‘test’ contains the test set.
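These three steps can be sketched as follows; the seed value is an assumption added for reproducibility.

# Start a local H2O cluster
h2o.init()

# Convert the preprocessed data frame into an H2O frame
data_h2o <- as.h2o(encoded)

# Split into training and test sets using the preset ratio
splits <- h2o.splitFrame(data_h2o, ratios = train_ratios, seed = 1234)
train  <- splits[[1]]
test   <- splits[[2]]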
Perform oversampling on the training data to balance the class distribution. The dataset is increased to twice its original size.
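The article does not show how the oversampling is interleaved with the H2O frames; one plausible sketch converts the training frame back to a plain data frame, applies ROSE::ovun.sample(), and returns the result to H2O.

# Bring the H2O training frame back into R for oversampling
train_df <- as.data.frame(train)

# Random oversampling: grow the training set to twice its original size
balanced_df <- ROSE::ovun.sample(
  formula = as.formula(paste(target_var_name, "~ .")),
  data    = train_df,
  method  = "over",
  N       = 2 * nrow(train_df),
  seed    = 1234
)$data

# Return the balanced data to H2O for model training
train <- as.h2o(balanced_df)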
Run the H2O-AutoML function to train multiple models on the training data. The parameters specify the features ‘x,’ the target variable ‘y,’ the maximum number of models to train ‘max_models,’ and the maximum runtime in seconds ‘max_runtime_secs.’
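A sketch of the AutoML call is shown below; the values of max_models and max_runtime_secs are illustrative, not the authors' settings.

# Define predictor and response columns
y <- target_var_name
x <- setdiff(colnames(train), y)

# Train multiple models with H2O AutoML
aml <- h2o.automl(
  x = x, y = y,
  training_frame   = train,
  max_models       = 20,     # illustrative value
  max_runtime_secs = 300,    # illustrative value
  seed             = 1234
)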
Retrieve the leaderboard of trained models and convert it into a data frame for review.
Evaluate all constructed models on the test data, save the performance in a data frame, and compare the results.
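Both steps can be sketched as follows, using the test-set AUC as the comparison metric.

# Leaderboard computed during AutoML training
lb <- as.data.frame(aml@leaderboard)
print(lb)

# Evaluate every leaderboard model on the held-out test set
test_auc <- sapply(lb$model_id, function(id) {
  h2o.auc(h2o.performance(h2o.getModel(id), newdata = test))
})
test_results <- data.frame(model_id = lb$model_id, test_auc = test_auc)
test_results[order(-test_results$test_auc), ]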
In the leaderboard results calculated from the training data, "GBM_4" (gradient boosting machine) exhibited the highest performance based on the area under the curve (AUC), and "GLM_1" (generalized linear model) exhibited the lowest. However, when verified on the test data, "GLM_1" exhibited the highest performance, followed by "GBM_4." Although drastic changes in performance rankings are uncommon, they can occur occasionally.
ML typically focuses on classification or prediction accuracy. However, understanding the influence of input variables on the target variable is often essential. In such cases, determining the importance of input variables, such as the regression coefficient of the regression model, can provide valuable insights.
Select a model (e.g., GBM, DRF, XGBoost) from the leaderboard for further analysis based on “model_no.”
Select a GLM from the leaderboard for additional examination.
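A sketch of both selections is given below; model_no indexes a leaderboard row, and the GLM is located by matching its model ID.

# Select a tree-based model (e.g., GBM, DRF, XGBoost) by its leaderboard position
model_no  <- 1                                   # row index in the leaderboard
sel_model <- h2o.getModel(lb$model_id[model_no])
h2o.varimp(sel_model)        # variable importance table
h2o.varimp_plot(sel_model)   # variable importance plot

# Select a GLM from the leaderboard and inspect its coefficients
glm_id    <- grep("^GLM", lb$model_id, value = TRUE)[1]
glm_model <- h2o.getModel(glm_id)
h2o.coef(glm_model)            # estimated coefficients
h2o.std_coef_plot(glm_model)   # standardized coefficient plot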
SVMs are a class of supervised learning models used for classification, regression, and outlier detection. SVM can also be extended to nonlinear classification using kernel functions, which map the input data into higher-dimensional spaces where linear separation is possible. Given their versatility and effectiveness, SVMs have been extensively applied in diverse fields, such as bioinformatics, image recognition, and text categorization. However, the H2O-AutoML framework lacks support for constructing SVM models. Therefore, SVM tuning and evaluation using packages such as {caret}, {kernlab}, and {e1071} are required.
In SVM, the hyperparameters of the linear kernel are optimized through random search using 10-fold cross-validation. The search extent can be adjusted by modifying the tuneLength parameter. In addition, a nonlinear SVM with a radial kernel can be constructed by setting ‘method’ to “svmRadial.”
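A sketch of this tuning step with caret is shown below. It reuses the oversampled training data and the H2O test split from the earlier sketches, relabels the factor levels so that class probabilities can be computed, and assumes the pROC package (used internally by caret's twoClassSummary) is installed; the test-set AUC is then computed with ROSE::roc.curve() for comparison with the AutoML leaderboard.

# Prepare caret-friendly copies (factor levels must be valid R names for classProbs)
svm_train <- balanced_df
svm_test  <- as.data.frame(test)
svm_train[[target_var_name]] <- factor(make.names(svm_train[[target_var_name]]))
svm_test[[target_var_name]]  <- factor(make.names(svm_test[[target_var_name]]),
                                       levels = levels(svm_train[[target_var_name]]))

# 10-fold cross-validation with random search over the cost parameter
ctrl <- trainControl(method = "cv", number = 10, search = "random",
                     classProbs = TRUE, summaryFunction = twoClassSummary)

set.seed(1234)
svm_fit <- train(
  as.formula(paste(target_var_name, "~ .")),
  data       = svm_train,
  method     = "svmLinear",          # use "svmRadial" for a nonlinear radial kernel
  metric     = "ROC",
  trControl  = ctrl,
  tuneLength = 10,                   # number of random hyperparameter candidates
  preProcess = c("center", "scale")
)

# Test-set AUC for comparison with the H2O-AutoML results
pred_prob <- predict(svm_fit, newdata = svm_test, type = "prob")
ROSE::roc.curve(response  = svm_test[[target_var_name]],
                predicted = pred_prob[[2]])   # probability of the second (positive-coded) level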
The classification results obtained using linear SVM were moderate compared to the AUC obtained using H2O-AutoML.
In recent years, ML, including deep learning, has been primarily developed and applied in Python environments. However, R also offers various packages to perform AutoML. Despite the availability of numerous AutoML platforms, finding and learning how to use these packages can often be challenging. This study introduces methods to automate repetitive tasks using AutoML tools, which can reduce time and effort. AutoML excels at tasks such as data preprocessing, model selection, hyperparameter optimization, and predictive result analysis. Furthermore, previous research has confirmed that AutoML tools can outperform human modeling efforts. We anticipate that these AutoML tools will be increasingly used in future research, contributing to the enhancement of the efficiency of ML modeling.
The author has no conflicts of interest to declare.
This work was supported by 2023 Hannam University Research Fund.
The data are available from the Heart Failure Prediction Dataset on Kaggle (https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction).