ENV330H5 Lecture Notes - Lecture 5: Angina Pectoris, Support Vector Machine, Overfitting
Document Summary
The best way to infer causation: a designed experiment with randomization, replication and blocking.
Replication = our results represent what is typical of the population of interest.
Randomization = results are generalizable and free of bias.
Blocking = control for confounding and lurking variables.

Not every scientific question is about cause and effect. Sometimes we want to understand patterns (to classify observations) or to make predictions. Example: what variables are important in classifying a watershed as degraded or not degraded?

High n (number of observations) and high p (number of variables). Data are collected without a specific hypothesis test in mind.

Machine learning methods have become very popular in environmental science and ecology. There are many, many methods (e.g. random forests, support vector machines, neural networks, k-nearest-neighbour methods).

Simplest (but still powerful): classification and regression trees (decision trees).
Classification tree: the leaves are levels of a categorical variable (the simplest case is binary).
Regression tree: the leaves represent values of a continuous variable.
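To make the classification-tree idea concrete, here is a minimal sketch in plain Python. It is not the method used in the lecture, only an illustration under stated assumptions: splits are chosen by Gini impurity, leaves take the majority class, and the watershed data (percent impervious cover, nitrate in mg/L) are invented for the example. The names `gini`, `best_split`, `build_tree` and `predict` are hypothetical helpers, not part of any library.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature index, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (j, t), score
    return best

def build_tree(rows, labels, max_depth=2):
    """Grow a tree recursively; a leaf is simply the majority class label."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    j, t = split
    left = [i for i, r in enumerate(rows) if r[j] <= t]
    right = [i for i, r in enumerate(rows) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left], max_depth - 1),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right], max_depth - 1),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's class label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Invented data: (percent impervious cover, nitrate in mg/L) per watershed.
rows = [(5, 0.4), (8, 0.6), (12, 0.5), (30, 2.1), (45, 1.8), (60, 3.0)]
labels = ["not degraded"] * 3 + ["degraded"] * 3

tree = build_tree(rows, labels)
print(tree)
print(predict(tree, (50, 2.5)))
```

A regression tree would follow the same recursion, but each leaf would hold the mean of a continuous response and splits would be scored by a variance-type measure instead of Gini impurity. The `max_depth` argument caps how far the tree grows, which is one simple way to limit overfitting.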