Intro to PyCaret
Author: Pablo J Moreno. Contact: [email protected]
Modules in PyCaret
Supervised learning: pycaret.classification, pycaret.regression
Unsupervised learning: pycaret.clustering, pycaret.anomaly, pycaret.nlp, pycaret.arules
Import pattern: from pycaret.[module name] import *
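A minimal sketch of the import pattern, assuming PyCaret 2.x is installed (shown here for the classification module):

    # The wildcard import exposes setup(), create_model(), etc. for that module
    from pycaret.classification import *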
Machine Learning Pipeline
Get data → Set up algorithm → Compare models → Create model → Tune model → Plot model → Evaluate results → Finalize model → Predict with model → Save / Load model
Unsupervised Learning
Unsupervised Learning Process
[Diagram: organized data → model selected → model trained → validation]
Clustering Pipeline
Get data: import pandas
Set up algorithm: setup()
List models: models()
Create model: create_model()
Assign model: assign_model()
Plot model: plot_model()
Predict with model: predict_model()
Save / Load model: save_model() / load_model()
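A minimal sketch of this pipeline, assuming PyCaret 2.x; the file name 'jewellery.csv' and variable names are illustrative:

    import pandas as pd
    from pycaret.clustering import *

    df = pd.read_csv('jewellery.csv')        # get data
    clu = setup(data=df, session_id=123)     # set up (session_id fixes the random seed)
    models()                                 # list available clustering models
    kmeans = create_model('kmeans')          # create model
    labeled = assign_model(kmeans)           # append cluster labels to the training data
    plot_model(kmeans)                       # plot model (default: cluster PCA plot)
    preds = predict_model(kmeans, data=df)   # predict clusters for (new) data
    save_model(kmeans, 'kmeans_pipeline')    # persist the pipeline as a PKL file
    kmeans2 = load_model('kmeans_pipeline')  # reload it later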
Clustering – List of Models
'kmeans' - K-Means Clustering
'ap' - Affinity Propagation
'meanshift' - Mean Shift Clustering
'sc' - Spectral Clustering
'hclust' - Agglomerative Clustering
'dbscan' - Density-Based Spatial Clustering
'optics' - OPTICS Clustering
'birch' - Birch Clustering
'kmodes' - K-Modes Clustering
Anomaly Detection Pipeline
Get data: import pandas
Set up algorithm: setup()
List models: models()
Create model: create_model()
Assign model: assign_model()
Plot model: plot_model()
Predict with model: predict_model()
Save / Load model: save_model() / load_model()
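The anomaly detection pipeline follows the same pattern; a sketch assuming PyCaret 2.x, with illustrative file and variable names:

    import pandas as pd
    from pycaret.anomaly import *

    df = pd.read_csv('transactions.csv')     # get data
    ano = setup(data=df, session_id=123)     # set up
    knn_det = create_model('knn')            # detector from the model list below
    labeled = assign_model(knn_det)          # adds Anomaly and Anomaly_Score columns
    plot_model(knn_det)                      # default: t-SNE plot
    preds = predict_model(knn_det, data=df)  # score (new) data
    save_model(knn_det, 'knn_detector')      # persist as a PKL file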
Anomaly Detection – List of Models
'abod' - Angle-based Outlier Detection
'cluster' - Clustering-Based Local Outlier
'cof' - Connectivity-Based Outlier Factor
'histogram' - Histogram-based Outlier Detection
'knn' - k-Nearest Neighbors Detector
'lof' - Local Outlier Factor
'svm' - One-class SVM Detector
'pca' - Principal Component Analysis
'mcd' - Minimum Covariance Determinant
'sod' - Subspace Outlier Detection
'sos' - Stochastic Outlier Selection
Natural Language Processing Pipeline
Get data: import pandas
Set up algorithm: setup()
List models: models()
Create model: create_model()
Assign model: assign_model()
Plot model: plot_model()
Evaluate results: evaluate_model()
Predict with model: predict_model()
Save / Load model: save_model() / load_model()
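A sketch of the NLP pipeline, assuming PyCaret 2.x; 'reviews.csv' and the 'review_text' column are illustrative:

    import pandas as pd
    from pycaret.nlp import *

    df = pd.read_csv('reviews.csv')
    nlp = setup(data=df, target='review_text', session_id=123)  # target = text column
    lda = create_model('lda')        # Latent Dirichlet Allocation (see model list below)
    topics = assign_model(lda)       # adds dominant topic and topic weights per document
    plot_model(lda)                  # default: word frequency plot
    evaluate_model(lda)              # interactive plot selector
    save_model(lda, 'lda_topic_model')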
Natural Language Processing – List of Models
'lda' - Latent Dirichlet Allocation
'lsi' - Latent Semantic Indexing
'hdp' - Hierarchical Dirichlet Process
'rp' - Random Projections
'nmf' - Non-Negative Matrix Factorization
Association Rule Mining Pipeline
Get data: import pandas
Set up algorithm: setup()
Create model: create_model()
Plot model: plot_model()
Save / Load model: save_model() / load_model()
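A sketch of this pipeline; note that in PyCaret 2.x the module is imported as pycaret.arules, and the column names 'InvoiceNo' and 'Description' are illustrative assumptions:

    import pandas as pd
    from pycaret.arules import *

    df = pd.read_csv('invoices.csv')
    arl = setup(data=df, transaction_id='InvoiceNo', item_id='Description')
    rules = create_model(metric='confidence')  # returns a DataFrame of association rules
    plot_model(rules, plot='2d')               # 2-D support/confidence plot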
Supervised Learning
Supervised Learning Process
[Diagram: the dataset is split into a training set (70%-80%) and a test set (30%-20%); the selected model is trained on the training set and the result is validated on the test set]
Regression & Classification Pipeline
Get data: import pandas
Set up algorithm: setup()
Compare models: compare_models()
Create model: create_model()
Tune model: tune_model()
Plot model: plot_model()
Evaluate results: evaluate_model()
Finalize model: finalize_model()
Predict with model: predict_model()
Save / Load model: save_model() / load_model()
Classification only*: optimize_threshold(), calibrate_model()
* Classification only
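A minimal end-to-end sketch of this pipeline for classification, assuming PyCaret 2.x; 'churn.csv' and the 'Churn' target are illustrative:

    import pandas as pd
    from pycaret.classification import *

    df = pd.read_csv('churn.csv')
    clf = setup(data=df, target='Churn', session_id=123)  # set up
    best = compare_models()                 # cross-validated leaderboard of all models
    dt = create_model('dt')                 # train one specific model
    tuned = tune_model(dt)                  # hyperparameter tuning (random grid search)
    plot_model(tuned, plot='auc')           # plot model
    evaluate_model(tuned)                   # interactive plot selector
    final = finalize_model(tuned)           # retrain on the full dataset (train + test)
    preds = predict_model(final, data=df)   # predict (here on the same data, for illustration)
    save_model(final, 'final_pipeline')     # save as PKL
    final2 = load_model('final_pipeline')   # load later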
Classification & Regression – List of Models

Classification:
'lr' - Logistic Regression
'knn' - K Neighbors Classifier
'nb' - Naive Bayes
'dt' - Decision Tree Classifier
'svm' - SVM - Linear Kernel
'rbfsvm' - SVM - Radial Kernel
'gpc' - Gaussian Process Classifier
'mlp' - MLP Classifier
'ridge' - Ridge Classifier
'rf' - Random Forest Classifier
'qda' - Quadratic Discriminant Analysis
'ada' - Ada Boost Classifier
'gbc' - Gradient Boosting Classifier
'lda' - Linear Discriminant Analysis
'et' - Extra Trees Classifier
'xgboost' - Extreme Gradient Boosting
'lightgbm' - Light Gradient Boosting Machine
'catboost' - CatBoost Classifier

Regression:
'lr' - Linear Regression
'lasso' - Lasso Regression
'ridge' - Ridge Regression
'en' - Elastic Net
'lar' - Least Angle Regression
'llar' - Lasso Least Angle Regression
'omp' - Orthogonal Matching Pursuit
'br' - Bayesian Ridge
'ard' - Automatic Relevance Determination
'par' - Passive Aggressive Regressor
'ransac' - Random Sample Consensus
'tr' - TheilSen Regressor
'huber' - Huber Regressor
'kr' - Kernel Ridge
'svm' - Support Vector Regression
'knn' - K Neighbors Regressor
'dt' - Decision Tree Regressor
'rf' - Random Forest Regressor
'et' - Extra Trees Regressor
'ada' - AdaBoost Regressor
'gbr' - Gradient Boosting Regressor
'mlp' - MLP Regressor
'xgboost' - Extreme Gradient Boosting
'lightgbm' - Light Gradient Boosting Machine
'catboost' - CatBoost Regressor
[Diagram: i) Training and loading — the model is trained on the training set (*) and saved to a PKL file. ii) Prediction — the PKL file is loaded and the model is applied to a new dataset (**).]
(*) To be refreshed with historical data
(**) To be refreshed with new data
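A sketch of step ii), assuming the pipeline was saved as 'final_pipeline' (producing final_pipeline.pkl) as in the earlier example:

    import pandas as pd
    from pycaret.classification import load_model, predict_model

    new_df = pd.read_csv('new_data.csv')        # illustrative new dataset
    model = load_model('final_pipeline')        # reads final_pipeline.pkl
    scored = predict_model(model, data=new_df)  # appends Label / Score columns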
Special Functions – Model Ensembling
Ensembling a trained model is as simple as writing ensemble_model(). It takes only one mandatory parameter: the trained model object. This function returns a table with k-fold cross-validated scores of common evaluation metrics along with the trained model object. The evaluation metrics used are:
Classification: Accuracy, AUC, Recall, Precision, F1, Kappa, MCC
Regression: MAE, MSE, RMSE, R2, RMSLE, MAPE
Bagging, or bootstrap aggregating, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.
Boosting is an ensemble meta-algorithm primarily for reducing bias, and also variance, in supervised learning. It belongs to the family of machine learning algorithms that convert weak learners into strong ones. A weak learner is a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing); a strong learner is a classifier that is arbitrarily well correlated with the true classification.
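A sketch of both ensembling methods, continuing the classification example above (dt is the trained decision tree from the pipeline sketch):

    from pycaret.classification import ensemble_model

    bagged_dt = ensemble_model(dt, method='Bagging', n_estimators=10)
    boosted_dt = ensemble_model(dt, method='Boosting')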
Special Functions – Blend Models
Blending models is a method of ensembling that uses consensus among estimators to generate the final prediction. The idea behind blending is to combine different machine learning algorithms and use a majority vote (or, in the case of classification, the average of the predicted probabilities) to predict the final outcome. Blending models in PyCaret is as simple as writing blend_models(). Specific trained models can be passed using the estimator_list parameter; if no list is passed, all models in the model library are used. For classification, the method parameter can be set to 'soft' or 'hard': soft voting uses predicted probabilities, hard voting uses predicted labels. This function returns a table with k-fold cross-validated scores of common evaluation metrics along with the trained model object. The evaluation metrics used are:
Classification: Accuracy, AUC, Recall, Precision, F1, Kappa, MCC
Regression: MAE, MSE, RMSE, R2, RMSLE, MAPE
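A sketch of blending within the same classification session; the three base models are illustrative choices:

    from pycaret.classification import create_model, blend_models

    lr = create_model('lr')
    knn = create_model('knn')
    dt = create_model('dt')
    # soft voting averages predicted probabilities; hard voting counts predicted labels
    blender = blend_models(estimator_list=[lr, knn, dt], method='soft')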
Special Functions – Stack Models
Stacking models is a method of ensembling that uses meta-learning. The idea behind stacking is to build a meta model that generates the final prediction from the predictions of multiple base estimators. stack_models() takes a list of trained models via the estimator_list parameter. These models form the base layer of the stack, and their predictions are used as input for a meta model that can be passed via the meta_model parameter. If no meta model is passed, a linear model is used by default. For classification, the method parameter can be set to 'soft' or 'hard': soft voting uses predicted probabilities, hard voting uses predicted labels. The evaluation metrics used are:
Classification: Accuracy, AUC, Recall, Precision, F1, Kappa, MCC
Regression: MAE, MSE, RMSE, R2, RMSLE, MAPE
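A sketch of stacking under the same assumptions, reusing the base models from the blending example; passing meta_model is optional (a linear model is used by default):

    from pycaret.classification import stack_models

    stacker = stack_models(estimator_list=[knn, dt], meta_model=lr)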
pycaret.classification.setup(
    data: pandas.core.frame.DataFrame,
    target: str,
    train_size: float = 0.7,
    test_data: Optional[pandas.core.frame.DataFrame] = None,
    preprocess: bool = True,
    imputation_type: str = 'simple',
    iterative_imputation_iters: int = 5,
    categorical_features: Optional[List[str]] = None,
    categorical_imputation: str = 'constant',
    categorical_iterative_imputer: Union[str, Any] = 'lightgbm',
    ordinal_features: Optional[Dict[str, list]] = None,
    high_cardinality_features: Optional[List[str]] = None,
    high_cardinality_method: str = 'frequency',
    numeric_features: Optional[List[str]] = None,
    numeric_imputation: str = 'mean',
    numeric_iterative_imputer: Union[str, Any] = 'lightgbm',
    date_features: Optional[List[str]] = None,
    ignore_features: Optional[List[str]] = None,
    normalize: bool = False,
    normalize_method: str = 'zscore',
    transformation: bool = False,
    transformation_method: str = 'yeo-johnson',
    handle_unknown_categorical: bool = True,
    unknown_categorical_method: str = 'least_frequent',
    pca: bool = False,
    pca_method: str = 'linear',
    pca_components: Optional[float] = None,
    ignore_low_variance: bool = False,
    combine_rare_levels: bool = False,
    rare_level_threshold: float = 0.1,
    bin_numeric_features: Optional[List[str]] = None,
    remove_outliers: bool = False,
    outliers_threshold: float = 0.05,
    remove_multicollinearity: bool = False,
    multicollinearity_threshold: float = 0.9,
    remove_perfect_collinearity: bool = True,
    create_clusters: bool = False,
    cluster_iter: int = 20,
    polynomial_features: bool = False,
    polynomial_degree: int = 2,
    trigonometry_features: bool = False,
    polynomial_threshold: float = 0.1,
    group_features: Optional[List[str]] = None,
    group_names: Optional[List[str]] = None,
    feature_selection: bool = False,
    feature_selection_threshold: float = 0.8,
    feature_selection_method: str = 'classic',
    feature_interaction: bool = False,
    feature_ratio: bool = False,
    interaction_threshold: float = 0.01,
    fix_imbalance: bool = False,
    fix_imbalance_method: Optional[Any] = None,
    data_split_shuffle: bool = True,
    data_split_stratify: Union[bool, List[str]] = False,
    fold_strategy: Union[str, Any] = 'stratifiedkfold',
    fold: int = 10,
    fold_shuffle: bool = False,
    fold_groups: Optional[Union[str, pandas.core.frame.DataFrame]] = None,
    n_jobs: Optional[int] = -1,
    use_gpu: bool = False,
    custom_pipeline: Union[Any, Tuple[str, Any], List[Any], List[Tuple[str, Any]]] = None,
    html: bool = True,
    session_id: Optional[int] = None,
    log_experiment: bool = False,
    experiment_name: Optional[str] = None,
    log_plots: Union[bool, list] = False,
    log_profile: bool = False,
    log_data: bool = False,
    silent: bool = False,
    verbose: bool = True,
    profile: bool = False,
    profile_kwargs: Dict[str, Any] = None)
Only data and target are mandatory; the slide's color coding further ranks the remaining parameters as Very important, Good to know, or Needed if training within Power BI.
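A minimal sketch of a setup() call: only data and target are mandatory, everything else shown is an optional toggle (df and the 'Churn' target are illustrative):

    from pycaret.classification import setup

    clf = setup(data=df, target='Churn',
                normalize=True,        # z-score scaling of numeric features
                fix_imbalance=True,    # resample the minority class (SMOTE by default)
                session_id=123)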
pycaret.regression.setup(
    data: pandas.core.frame.DataFrame,
    target: str,
    train_size: float = 0.7,
    test_data: Optional[pandas.core.frame.DataFrame] = None,
    preprocess: bool = True,
    imputation_type: str = 'simple',
    iterative_imputation_iters: int = 5,
    categorical_features: Optional[List[str]] = None,
    categorical_imputation: str = 'constant',
    categorical_iterative_imputer: Union[str, Any] = 'lightgbm',
    ordinal_features: Optional[Dict[str, list]] = None,
    high_cardinality_features: Optional[List[str]] = None,
    high_cardinality_method: str = 'frequency',
    numeric_features: Optional[List[str]] = None,
    numeric_imputation: str = 'mean',
    numeric_iterative_imputer: Union[str, Any] = 'lightgbm',
    date_features: Optional[List[str]] = None,
    ignore_features: Optional[List[str]] = None,
    normalize: bool = False,
    normalize_method: str = 'zscore',
    transformation: bool = False,
    transformation_method: str = 'yeo-johnson',
    handle_unknown_categorical: bool = True,
    unknown_categorical_method: str = 'least_frequent',
    pca: bool = False,
    pca_method: str = 'linear',
    pca_components: Optional[float] = None,
    ignore_low_variance: bool = False,
    combine_rare_levels: bool = False,
    rare_level_threshold: float = 0.1,
    bin_numeric_features: Optional[List[str]] = None,
    remove_outliers: bool = False,
    outliers_threshold: float = 0.05,
    remove_multicollinearity: bool = False,
    multicollinearity_threshold: float = 0.9,
    remove_perfect_collinearity: bool = True,
    create_clusters: bool = False,
    cluster_iter: int = 20,
    polynomial_features: bool = False,
    polynomial_degree: int = 2,
    trigonometry_features: bool = False,
    polynomial_threshold: float = 0.1,
    group_features: Optional[List[str]] = None,
    group_names: Optional[List[str]] = None,
    feature_selection: bool = False,
    feature_selection_threshold: float = 0.8,
    feature_selection_method: str = 'classic',
    feature_interaction: bool = False,
    feature_ratio: bool = False,
    interaction_threshold: float = 0.01,
    transform_target: bool = False,
    transform_target_method: str = 'box-cox',
    data_split_shuffle: bool = True,
    data_split_stratify: Union[bool, List[str]] = False,
    fold_strategy: Union[str, Any] = 'kfold',
    fold: int = 10,
    fold_shuffle: bool = False,
    fold_groups: Optional[Union[str, pandas.core.frame.DataFrame]] = None,
    n_jobs: Optional[int] = -1,
    use_gpu: bool = False,
    custom_pipeline: Union[Any, Tuple[str, Any], List[Any], List[Tuple[str, Any]]] = None,
    html: bool = True,
    session_id: Optional[int] = None,
    log_experiment: bool = False,
    experiment_name: Optional[str] = None,
    log_plots: Union[bool, list] = False,
    log_profile: bool = False,
    log_data: bool = False,
    silent: bool = False,
    verbose: bool = True,
    profile: bool = False,
    profile_kwargs: Dict[str, Any] = None)
Only data and target are mandatory; the slide's color coding further ranks the remaining parameters as Very important, Good to know, or Needed if training within Power BI.
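The regression setup() mirrors the classification one; the main additions are transform_target and transform_target_method. A sketch (df and the 'Price' target are illustrative):

    from pycaret.regression import setup

    reg = setup(data=df, target='Price',
                transform_target=True,   # apply the target transform (Box-Cox by default)
                session_id=123)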