CAS MA 113 Lecture Notes - Lecture 1: Cumulative Frequency Analysis, Continuous Or Discrete Variable, Statistical Inference
Chapter 1:
General Definitions
•Statistics - the science of collecting, organizing, summarizing, and analyzing information to draw
conclusions/answer questions
•also about providing a measure of confidence in any conclusion.
•ex: given a set of data, how can we use digital & numeric tools to analyze it and draw broader
inferences from it
•Data - a “fact or proposition used to draw a conclusion or make a decision”
•also can describe the characteristics of an individual
•‘information’ = data
•Margin of Error - a measure of confidence in a result
•*results from a good sample should be reported with a margin of error
•Population - the entire group of individuals who are being studied
•Sample - a subset of the population that is being studied
•statistics usually uses a population sample to make an inference about the general population
•Individual - a person or object who is a member of the population being studied
•Descriptive Statistics - the process of organizing and summarizing data
•describe data through numerical summaries/tables/graphs
•Statistic - a numerical summary based on a sample
•Inferential Statistics - taking results from a sample, extending them to the population, and
measuring the reliability of the result
•Parameter - a numerical summary of a population
•ex: the population mean, the population proportion
•ex: the percentage of all students on campus who have a job is 84.9%
•Statistic - a numerical summary based on a sample
•can help approximate the parameter
•ex: a sample of 250 students is obtained and from that sample, 84.9% have a job
•Process of Statistics
•Step 1: Identify the research objective
•Step 2: Collect the information needed to answer the question
•Step 3: Describe the data
•organize/summarize the information
•Step 4: Draw conclusions from the data
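To make the parameter/statistic distinction above concrete, here is a minimal sketch. The population of 10,000 students, the seed, and all variable names are assumptions made up for illustration; only the 84.9% figure and the sample size of 250 come from the notes.

```python
import random

# Hypothetical population: 10,000 students, exactly 84.9% of whom have a job.
population = [1] * 8490 + [0] * 1510   # 1 = has a job, 0 = does not

# Parameter: a numerical summary of the whole population.
parameter = sum(population) / len(population)

# Statistic: a numerical summary of a sample of 250 students.
rng = random.Random(113)               # fixed seed so the run is repeatable
sample = rng.sample(population, 250)
statistic = sum(sample) / len(sample)

print(f"population proportion (parameter): {parameter:.3f}")
print(f"sample proportion (statistic):     {statistic:.3f}")
```

The statistic will usually land near the parameter but not exactly on it, which is why a sample result is reported together with a measure of confidence.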
Variables
•Variables - the characteristics of the individuals within the population
•attributes that vary between individuals within a population
•ex: age, height, weight, gender
•Qualitative/Categorical Variables - allow for the classification of individuals based on some
attribute or characteristic
•will be a preference/description/characteristic
•Quantitative Variables - provide a numerical measure of individuals
•the values of a quantitative variable can be added or subtracted to provide meaningful results
•will be a number
•Discrete Variable - a quantitative variable that has a finite/countable number of possible values
•cannot take on every possible value between two given values
•Countable - values result from counting such as 0, 1, 2, 3 and so on
•Continuous Variable - a quantitative variable that has an infinite number of possible values that it
can take on
•can take on every possible value between two given values
Data
•Data - the list of observations a variable assumes
•ex: gender = variable, male/female (observations) = data
•Qualitative Data - observations corresponding to a qualitative variable
•Quantitative Data - observations corresponding to a quantitative variable
•Discrete Data - observations corresponding to a discrete variable
•Continuous Data - observations corresponding to a continuous variable
•Bias - the tendency to over-estimate or under-estimate the value of a parameter
•if the results of a sample are not representative of the population, then the sample has a bias
•Raw Data - data that is not organized
•When data is collected from a survey or designed experiment, it must be organized into a manageable
form.
•Ways to organize data
•Tables
•Graphs
•Numerical Summaries (chapter 3)
Three Sources of Bias
•Sampling Bias - the technique used to obtain the individuals used in the sample tends to favor one
part of the population over another
•Under-coverage (a type of sampling bias) - when the proportion of one segment of the population is
lower in a sample than it is in the population
•Nonresponse Bias - when individuals selected for the sample who do not respond to the survey have
different opinions from those who do respond
•can be reduced through the use of callbacks/rewards/incentives
•Response Bias - exists when the answers on a survey do not reflect the true feelings of the respondent
•Types of Response Bias
•Interviewer Error
•Misrepresented Answers
•Wording of Questions
•Order of Questions/Words (can lead people on)
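As a rough illustration of under-coverage from the list above, the sketch below compares one segment's share in a sample with its share in the population. The renter/owner population, the sample counts, and the helper `proportion` are all hypothetical.

```python
# Hypothetical population of 1,000 people: 40% renters, 60% owners.
population = ["renter"] * 400 + ["owner"] * 600

# Hypothetical sample taken from a homeowner mailing list, which reaches
# owners far more easily than renters.
sample = ["renter"] * 10 + ["owner"] * 90

def proportion(group, label):
    """Share of `group` that carries `label`."""
    return group.count(label) / len(group)

pop_share = proportion(population, "renter")
sample_share = proportion(sample, "renter")

# Under-coverage: the segment's share in the sample is lower than in the population.
under_covered = sample_share < pop_share
print(pop_share, sample_share, under_covered)
```

Any estimate built from this sample would lean toward the owners' answers, which is the kind of over- or under-estimation the definition of bias describes.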
Chapter 2:
Frequency
•Frequency Distribution - lists each category of data and the number of occurrences for each
category of data (count how often each category occurs and list the counts)
•Relative Frequency - the proportion (or percent) of observations within a category
•Formula: relative frequency = (frequency)/(sum of all frequencies)
•Relative Frequency Distribution - lists each category of data with the relative frequency
•Cumulative Frequency Distribution - for discrete data, displays the total number of observations
less than or equal to the category (found by adding up the frequencies of that category and all
categories before it); for continuous data, displays the total number of observations
less than or equal to the upper class limit of a class.
•ex (picture): cumulative frequency graph of exam grades
•answer: B.
•why: the vertical axis (y-axis) already gives the cumulative (relative) frequency for each
grade, and for a 70 it reads 37%, meaning
that 37% of the class got a 70 or lower
•Cumulative Relative Frequency Distribution - displays the proportion/
percentage of observations less than or equal to the category for discrete data
and the proportion/percentage of observations less than or equal to the upper
class limit for continuous data.
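The four distributions defined in this section can be built from the same counts. The sketch below is one way to do it with the standard library; the sibling-count data is made up for illustration.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical discrete data: number of siblings reported by 10 students.
data = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]

freq = Counter(data)
categories = sorted(freq)
frequencies = [freq[c] for c in categories]
total = sum(frequencies)

# relative frequency = (frequency)/(sum of all frequencies)
relative = [f / total for f in frequencies]
# cumulative frequency = observations less than or equal to the category
cumulative = list(accumulate(frequencies))
cumulative_relative = [c / total for c in cumulative]

print("value  freq  rel   cum  cum_rel")
for row in zip(categories, frequencies, relative, cumulative, cumulative_relative):
    print("{:<6} {:<5} {:<5.2f} {:<4} {:.2f}".format(*row))
```

The last cumulative frequency always equals the number of observations, and the last cumulative relative frequency always equals 1, which is a quick check on the table.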
Charts
•Pareto Chart - a bar graph where the bars are drawn in decreasing order of frequency or relative
frequency
•Pie Chart - a circle divided into sectors, where each sector represents a category of data
•the area of each sector is proportional to the frequency of the category
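A small sketch of what sits behind the two charts above: a Pareto chart only needs the categories sorted by decreasing frequency, and a pie chart only needs each sector's angle. The commute-mode data and variable names are hypothetical, and no plotting library is assumed.

```python
from collections import Counter

# Hypothetical qualitative data: preferred commute mode of 20 students.
data = ["bus"] * 5 + ["walk"] * 9 + ["bike"] * 2 + ["car"] * 4

freq = Counter(data)
total = sum(freq.values())

# Pareto chart order: categories sorted by decreasing frequency.
pareto_order = sorted(freq.items(), key=lambda item: item[1], reverse=True)

# Pie chart: each sector's central angle (and so its area) is
# proportional to the frequency of the category.
sector_degrees = {category: 360 * count / total for category, count in freq.items()}

print(pareto_order)
print(sector_degrees)
```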
•For large discrete sets, or continuous variables, it is harder to group things individually
•solution = classes
•Classes - categories into which data is grouped
•When a data set consists of a large number of different discrete data values OR when a data set
consists of continuous data, we must create classes by using intervals of numbers
•^classes are used with quantitative data
•*the values of the variable (the classes) go on the horizontal axis
•When organizing data, it is easy to manipulate how we present it (ex: through certain
charts)
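Grouping continuous data into classes can be sketched as follows. The commute times, the lower limit of 0, the class width of 10, and the helper `class_of` are assumptions chosen for illustration; picking a different width changes how the same data looks, which is the presentation point made above.

```python
from collections import Counter

# Hypothetical continuous data: commute times in minutes.
data = [12.5, 7.0, 33.2, 18.9, 25.0, 9.4, 41.7, 15.1, 22.8, 29.9]

lower_limit = 0     # assumed lower limit of the first class
class_width = 10    # assumed class width

def class_of(value):
    """Return the (lower, upper) bounds of the class containing `value`."""
    index = int((value - lower_limit) // class_width)
    lower = lower_limit + index * class_width
    return (lower, lower + class_width)

# Frequency distribution over classes instead of individual values.
freq = Counter(class_of(x) for x in data)
for (lower, upper), count in sorted(freq.items()):
    print(f"[{lower}, {upper}): {count}")
```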
•Stem-and-Leaf Plot - uses the digits to the left of the rightmost digit to form the stem
while each rightmost digit forms a leaf
•Rather than using classes, you list each stem once and write the leaves of its values beside it
•ex: a data value of 147 would have 14 as the stem and 7 as the leaf
•ex: a data value of 4.7 would have 4 be the stem and 7 be the leaf
•ex (picture): how you display the data
•2.8, 3.8, 3.8, 3.8, 3.3, 3.9…etc.
•you must include a key alongside the plot stating what the vertical line represents (ex:
whether 2 | 8 means 2.8 or 28)
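A stem-and-leaf plot for one-decimal data can be sketched as below. Only the six values listed in the example are used, since the rest of the list is elided in the notes; the variable names and the printed key are illustrative.

```python
from collections import defaultdict

# The six values shown in the notes (the remainder of the list is elided there).
data = [2.8, 3.8, 3.8, 3.8, 3.3, 3.9]

stems = defaultdict(list)
for value in data:
    tenths = round(value * 10)        # 2.8 -> 28, sidestepping float digit issues
    stem, leaf = divmod(tenths, 10)   # 28 -> stem 2, leaf 8
    stems[stem].append(leaf)

# One line per stem, leaves in increasing order.
lines = [f"{stem} | {' '.join(str(leaf) for leaf in sorted(leaves))}"
         for stem, leaves in sorted(stems.items())]
print("\n".join(lines))
print("Key: 2 | 8 represents 2.8")
```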
•Common Distribution Shapes
•Uniform (symmetric) - a flat frequency distribution; each value occurs about equally often
•Bell-Shaped - data is symmetric, with the highest frequency in the middle and tails on both sides
•can fold vertically and have the left side data be equal to the right side data
•Skewed Right - more data is clustered on the left side of the chart
•usually results from having outlying/large values on the right side of the chart
•Skewed Left - more data is clustered on the right side of the chart
•usually results from having outlying/small values on the left side of the chart
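One way to see the effect of the outlying values described above is to compare the mean and the median, since the mean is pulled toward the long tail. The two data sets below are made up; the second is simply a mirror image of the first.

```python
from statistics import mean, median

# Hypothetical right-skewed data: most values are small, a few are large.
skewed_right = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20, 30]

# Mirror the data to get a left-skewed set with the same shape reversed.
skewed_left = [31 - x for x in skewed_right]

for name, values in [("skewed right", skewed_right), ("skewed left", skewed_left)]:
    print(f"{name}: mean = {mean(values):.2f}, median = {median(values)}")
```

In the right-skewed set the mean (7) sits above the median (3); in the left-skewed set the mean (24) sits below the median (28).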
Document Summary
•Process of statistics - Step 1: identify the research objective; Step 2: collect the information needed to answer the question; Step 3: describe the data (organize/summarize the information); Step 4: draw conclusions from the data
•Distribution shape - mean v. median
•skewed left - mean is substantially smaller than the median
•symmetric - mean is roughly equal to the median
•skewed right - mean is substantially larger than the median
•Computational formula - an equivalent formula for determining the population standard deviation
•Formula = square root of { [(sum of the squared observations) - (square of the sum of the observations)/(number of observations)] / (number of observations) }
•the deviations about the mean always sum to zero, so once all but one deviation is known, the last must be whatever value forces the sum of the deviations about the mean to be zero
•Variance - the variance of a variable is the square of the standard deviation
•ex: … 14. 99 with one individual of a value 14. 13. Which observation is closer to its population mean?
•answer: population 1.
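The summary's claims about the population standard deviation can be checked numerically. This is a sketch on a made-up population of five observations; the names `sigma_def` and `sigma_comp` are illustrative, and the definitional formula (square root of the mean squared deviation about the population mean) is the standard one, assumed here because the notes do not spell it out.

```python
from math import sqrt, isclose

# Hypothetical population of N = 5 observations.
x = [2, 4, 4, 5, 10]
N = len(x)
mu = sum(x) / N

# Definitional formula: square root of the mean squared deviation about mu.
sigma_def = sqrt(sum((xi - mu) ** 2 for xi in x) / N)

# Computational formula: (sum of squares - (square of the sum)/N) / N, then square root.
sigma_comp = sqrt((sum(xi ** 2 for xi in x) - sum(x) ** 2 / N) / N)

# Variance is the square of the standard deviation.
variance = sigma_def ** 2

# The deviations about the mean sum to zero.
deviation_sum = sum(xi - mu for xi in x)

print(sigma_def, sigma_comp, variance, deviation_sum)
```

Both formulas give the same standard deviation here (the variance works out to 7.2), and the deviations sum to zero.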