Published 14 February 2025
Machine learning

Company:
● http://the.echonest.com/

Good to know:
● Kaggle

Links:
● https://machinelearningmastery.com/clean-text-machine-learning-python/
● https://www.springboard.com/blog/data-mining-python-tutorial/

Prerequisites:
● Knowledge of the Python libraries: SciPy (NumPy, pandas), Matplotlib
● Install Anaconda
● Install Jupyter

Important notions of machine learning

● Data objects are also called data points, samples, examples, vectors, instances, or data tuples. They are entities in a given context in a given dataset, e.g. patients, products, etc.
● Attributes are also called features, variables, or dimensions.
● An attribute vector is the set of attributes used to describe a given data object. E.g., the attribute vector <Name, Disease, Prescription> describes patient data objects.
● (Observed) values for attributes are called observations. E.g., cancer, high blood pressure, and flu may be the observations for the Disease attribute in a given dataset.

Data preprocessing

Central tendency:
● mean
● median
● mode

Data dispersion:
● range
● quantile
● variance
● standard deviation
● covariance
● correlation

Tip: correlation vs. covariance. Because correlation is standardized between -1 and 1, it is a better indicator of how dependent two attributes are on each other. In this way, we can safely compare correlations computed on different attribute pairs.

Displaying data:
● histograms: used to summarize the distribution of observations
● scatter plots: used to observe correlations between pairs of numeric attributes
● heatmaps: visualise data through variations in coloring

Data cleaning

Handling missing data:
● Measure of central tendency: use the attribute mean/median to fill in the missing value
● Use the most probable value to fill in the missing value: in a supervised manner, find the most likely value using inference-based mechanisms such as a Bayesian formula or a decision tree

Unified data format:
● Conversion of data: binary to numeric, categorical to numeric, ordered to numeric

Noisy data, and how to handle it:
● smooth by bin means / bin medians / bin boundaries

Discretization method: binning
● Equi-width (distance) partitioning
● Equi-depth (frequency) partitioning

Data integration

Data transformation
● Smoothing: remove noise from the data
● Discretization: binning, histograms, clusters
● Normalization: scale values to fall within a small, specified range: min-max normalization, z-score normalization, normalization by decimal scaling
● Concept hierarchy generalization: replace a value with a higher-level class
● Aggregation: summarization, data cube construction

Data reduction

Dimensionality reduction: reduce the number of considered attributes
● Automatic feature selection:
Model-based selection: the most important features are selected using a supervised ML algorithm (e.g. a decision tree).
Iterative selection: iteratively, the least important features are discarded (backward elimination) or the most important ones are added (forward selection) until the desired number is reached.
● Principal Component Analysis (PCA)

Numerosity reduction: reduce the volume of data to smaller but representative data representations
● Simple random sample without replacement (SRSWOR) of size s: randomly pick s samples, all with equal probability
● Simple random sample with replacement (SRSWR) of size s: the same item can be picked more than once
● Cluster sample: when data are clustered, randomly pick s clusters, e.g.
data retrieved in memory pages
● Stratified sample: create strata (levels) in the data to represent the different categories

Data compression: compress data in a lossless (the original data can be reconstructed) or lossy (otherwise) manner

Supervised Learning

Problem: given a dataset D = {(x1, y1), …, (xN, yN)}, where xi in X is a data point vector and yi in Y is a (class) label or value, choose a function f(x) = y for x not in X.

k-Nearest Neighbors (k-NN)

Some characteristics:
● Performs well in low-dimensional spaces. Curse of dimensionality!
● Scalability issues.
● Needs attribute reformatting (to numeric) and scaling.

Lazy learner (vs. eager learner):
● Stores the input (training) data.
● Does not construct a learning model.
● 'Learns' the class or the target value of a new data point by comparing it to the stored data.

Distance metrics:
● Numeric: Minkowski distance of order p
● Categorical: Hamming distance on categorical attributes

Selecting k: model parameter tuning
● Use cross-validation (seen later in this course) to tune the model's hyperparameters (here k).

Model Evaluation

The larger the training set, the better the model captures the characteristics of the data. The larger the test set, the more reliable the error estimate.

Learning method:
● Represents a family of learning models, e.g. k-Nearest Neighbors or Linear Regression.
● Is configured using hyperparameters.

Learned model:
● Is a specific instance of a learning method, for example 3-NN, or the linear regression model y = 0.2x + 2.
● A model has parameters that are learned from the training data.
● Note that k-NN has no parameters to be learned from the training data.

Learning objective: generalization. Learn a model that satisfactorily generalizes to previously unseen data, i.e., can successfully predict the respective target.
● Quantify success: prediction error estimation.

Danger 1: Overfitting (high variance)
The model overfits the training dataset, capturing noise and outliers.
It may be caused by a small quantity of training data, overly complex algorithms, or too many features (i.e., not using only informative features).
Solution: split the data into training and test sets in good proportions. Try feature selection or regularisation.

Danger 2: Underfitting (high bias)
The model cannot capture the relationships in the training data. It is typically a problem arising from overly simple models.
Solution: make sure you always hold out a test set and do not use it for training or parameter tuning! Try more complex models.

Danger 3: Error estimation tied to a specific test set
A poor split into training and test sets, e.g. a class existing in the test set but not in the training set!
Solution: iterative methods of computing the error estimate, e.g. k-fold cross-validation.

Learning Curves

How do we graphically observe the impact of the training set size on the error estimate? With learning curves.

Algorithm:
Input: a training/test partition
For each sample size s of the learning curve:
1. randomly select s instances from the training set
2. learn the model
3. evaluate the model on the training and test sets to determine the accuracies a_training and a_test
4. plot (s, a) for both the training and test accuracies
Output: learning curves

Problems:
High bias (underfitting): the model will not benefit from more data; the scores converge to a low value. A more complex model might be needed.
High variance (overfitting): in the beginning there is a large gap between the two scores; the prediction benefits from more data.

What if we do not have enough data to split into training and test sets while still ensuring good error estimates?

Cross-validation: partition the same data into training and test datasets more than once!
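One common way to realize this repeated partitioning, k-fold splitting, can be sketched in a few lines of NumPy. The helper name `k_fold_indices` is illustrative, not a library API:

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k folds.

    Yields (train_idx, test_idx) once per fold, so every sample
    lands in a test set exactly once across the k iterations.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Example: 10 samples, 5 folds -> each test fold holds 2 samples,
# and together the test folds cover all 10 samples exactly once.
all_test = []
for train_idx, test_idx in k_fold_indices(10, 5):
    all_test.extend(test_idx.tolist())
assert sorted(all_test) == list(range(10))
```

In a real run, the loop body would train a model on `train_idx` and score it on `test_idx`, then average the k scores to obtain the error estimate.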
● Random resampling: partition the labeled dataset into training and test sets N times by randomly selecting the data for each set
● Stratified sampling: random resampling + ensure that all the strata (classes) of the data are represented proportionally in the training and test sets
● K-fold cross-validation: create k partitions (k folds) of the labeled data. In each iteration, k-1 folds are used for training and 1 fold as the test set. In this way, every item is in the test set exactly once. The final accuracy is the average over the k iterations.

Cross-validation:
● Informs us about the average prediction quality (error estimate) of a learning method.
● Gives us an indication of the variance (standard deviation) of the prediction quality, given different training and test sets.
● Does not learn a specific model in itself, but provides the supplementary information on the error estimate discussed above.
● Besides model evaluation, cross-validation is also used for model selection (seen later in the slides).

Extras: 5-fold or 10-fold cross-validation is common, because higher values of k render the learning process inefficient.

Cross-validation for model selection: to find candidate hyperparameters of a model
● GridSearch: search all possible combinations in a grid of parameters.
● RandomSearch: within given ranges, randomly pick a number of parameter settings to test.

Metrics
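To close the loop on model selection: grid search is essentially a set of nested loops over every hyperparameter combination, keeping the one with the best cross-validated score. A minimal sketch; the grid, the function names, and the toy `evaluate` stand-in (which a real run would replace with cross-validated accuracy) are all illustrative:

```python
from itertools import product

# Hypothetical hyperparameter grid for a k-NN-style model
# (p is the Minkowski distance order).
param_grid = {"k": [1, 3, 5], "p": [1, 2]}

def evaluate(params):
    # Stand-in for a cross-validated accuracy; a real run would
    # train and score the model with these hyperparameters.
    return 1.0 / (1 + abs(params["k"] - 3) + abs(params["p"] - 2))

def grid_search(grid, score_fn):
    """Try every combination in the grid and return the best one."""
    keys = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best, score = grid_search(param_grid, evaluate)
# With this toy score: best == {"k": 3, "p": 2}
```

Random search differs only in the loop: instead of enumerating every combination, it draws a fixed number of random settings from the given ranges.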