ITEC 4230 Study Guide - Final Guide: Gini Coefficient, Box Plot, Scatter Plot


Document Summary

Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies. Data integration: combine multiple databases, data cubes, or files.

Curse of dimensionality: in high-dimensional data, density and the distance between points, which are critical to clustering and outlier analysis, become less meaningful. The number of possible subspace combinations grows exponentially. Dimensionality reduction reduces the time and space required for data mining and allows easier visualization. 3. Supervised and nonlinear techniques (e.g., feature selection).

Numerosity reduction: reduce data volume by choosing alternative, smaller forms of data representation. Parametric methods assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers). Log-linear models obtain the value at a point in m-dimensional space as a product over appropriate marginal subspaces.

Principal component analysis (PCA): find a projection that captures the largest amount of variation in the data. The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space.
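The projection step described above (eigenvectors of the covariance matrix defining the new space) can be sketched in a few lines. This is only an illustrative sketch, not the course's method: the toy data is made up, the helper names (`center`, `covariance`, `mat_vec`, `top_eigenvector`) are hypothetical, and power iteration is swapped in as a simple way to approximate the leading eigenvector without any external library.

```python
import math

# Made-up 2-D toy data (rows are points); not taken from the study guide.
points = [
    (2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
    (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9),
]

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix (divides by n - 1) of mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_eigenvector(cov, iterations=200):
    """Power iteration: approximate the eigenvector with the largest eigenvalue.

    Assumes cov is nonzero and the start vector is not orthogonal to the answer.
    """
    v = [1.0 / math.sqrt(len(cov))] * len(cov)
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return v, eigenvalue

centered = center(points)
cov = covariance(centered)
direction, eigenvalue = top_eigenvector(cov)

# Project each 2-D point onto the first principal component (2-D -> 1-D).
projected = [sum(a * b for a, b in zip(row, direction)) for row in centered]

# Share of total variance (trace of the covariance matrix) kept by one component.
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = eigenvalue / total_variance
print("first principal direction:", direction)
print("share of variance captured:", explained)
```

Keeping only `projected` halves the stored dimensions while retaining most of the variation in this toy set. In practice a full eigendecomposition or SVD from a numerical library would normally replace the hand-written power iteration, which only recovers the first component.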