Sampling and Statistics
We start the discussion in the natural way. We all have a general feeling about what statistics is. In the course of these lecture notes, we will lay out the details of what statistics is and how it is used. For now we give a quick definition.
Statistics is the study of how to collect, organize, analyze, and interpret numerical information from data.
Sampling and Types of Data
Population vs. Sample
We can classify data into two types:
There is a more refined way to classify data. Data can be put into one of several categories called levels of measurement.
Data is called univariate if it represents one attribute and bivariate if it contains two attributes. Bivariate data is often used to compare and contrast. For example, we may study weight gain and caloric intake.
Numerical data is called discrete if the number of possible values within every bounded range is finite. Examples include: rolling dice, number of times that..., ...
Otherwise, numerical data is called continuous. For example, height, weight, temperature, distance,...
When we conduct a survey we always attempt to achieve a random sample. A simple random sample of size n is one in which every possible subset of size n has an equal chance of being selected. For example, to choose a random sample of 20 people with phone numbers, we can use a random number generator to randomly select 20 phone numbers.
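The phone number example can be sketched in Python. This is only an illustration: the list of 1,000 phone numbers is made up, and the fixed seed is there just to make the run repeatable.

```python
import random

# Hypothetical sampling frame: 1,000 invented phone numbers (an assumption,
# not data from these notes).
phone_numbers = [f"555-{i:04d}" for i in range(1000)]

rng = random.Random(42)  # fixed seed only so the sketch is reproducible

# random.sample draws without replacement, so every subset of size 20
# has the same chance of being selected.
sample = rng.sample(phone_numbers, 20)

print(len(sample))       # 20
print(len(set(sample)))  # 20, no number is selected twice
```

Sampling without replacement is what matches the definition above: a subset cannot contain the same phone number twice.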
Caution: A simple random sample is almost always impossible to achieve in the real world. For example, using the phone number generator, we will only be able to collect data from those who have a phone, pick up the phone, and are willing to participate in the phone survey. Because of this, most surveys have inherent flaws. However, a survey with a small flaw is better than no information. Many surveys are done using convenience sampling. For example, a researcher stands outside a supermarket and interviews anyone eager to respond.
One way to overcome the problem of obtaining a random sample is to use stratified sampling. Stratified sampling ensures that members of each stratum (or type) are included in the survey. For example, we may randomly select 50 Caucasians, 25 Hispanics, and 10 Filipinos from the Lake Tahoe community to ensure that the three main ethnic groups are represented.
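A minimal sketch of stratified sampling in Python follows. The function name stratified_sample, the stratum labels A, B, and C, and the population sizes are all hypothetical; only the sample sizes 50, 25, and 10 come from the example above.

```python
import random

def stratified_sample(population, stratum_of, sizes, seed=None):
    """Draw a simple random sample of the requested size from each stratum."""
    rng = random.Random(seed)
    # Group the population by stratum.
    strata = {}
    for person in population:
        strata.setdefault(stratum_of(person), []).append(person)
    # Sample independently, without replacement, inside each stratum.
    return {name: rng.sample(strata[name], n) for name, n in sizes.items()}

# Hypothetical community: each person is a (stratum label, id) pair.
population = (
    [("A", i) for i in range(600)]
    + [("B", i) for i in range(300)]
    + [("C", i) for i in range(100)]
)

chosen = stratified_sample(
    population, lambda p: p[0], {"A": 50, "B": 25, "C": 10}, seed=1
)
print({name: len(members) for name, members in chosen.items()})
# {'A': 50, 'B': 25, 'C': 10}
```

Because each stratum is sampled separately, even the smallest group is guaranteed its 10 respondents, which a single simple random sample of 85 people would not ensure.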
One problem with sampling is that often the researcher only gets respondents who are eager to be interviewed. One way to combat this is to use cluster sampling. This process involves breaking the population into several groups or clusters. Some of the clusters are randomly selected, and the researcher makes sure that every individual in the selected clusters is surveyed. This usually involves paying the respondents to take the survey.
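Cluster sampling can be sketched the same way. The cluster names, the cluster sizes, and the helper cluster_sample below are invented for illustration.

```python
import random

def cluster_sample(clusters, k, seed=None):
    """Randomly pick k clusters and return every member of each one."""
    rng = random.Random(seed)
    chosen_names = rng.sample(sorted(clusters), k)
    return {name: list(clusters[name]) for name in chosen_names}

# Hypothetical population: 10 neighborhoods with 30 residents each.
clusters = {
    f"block_{b}": [f"resident_{b}_{i}" for i in range(30)] for b in range(10)
}

selected = cluster_sample(clusters, 3, seed=7)
surveyed = [person for members in selected.values() for person in members]
print(len(selected), len(surveyed))  # 3 90
```

Note that the randomness is applied to whole clusters, not to individuals: once a cluster is chosen, everyone in it is surveyed.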
One of the most fatal mistakes a researcher can make is to have faulty experimental design, that is, poor planning. Deep thought needs to go into the design of the experiment before any field work can take place. Below are some guidelines for planning a statistical study.
The coin toss is a type of experimentation. Experimentation is a type of data collection where the researcher creates the data. Usually experimentation answers the question, "If we do this, what happens?" The response variable is the variable being studied by the experimenter. Often the experimenter sets the environment in which to run the experiment. For example, a psychologist may want to determine mood based on weather conditions. She may study several people's moods under various weather conditions. These conditions are called experimental conditions or factors. Sometimes there is a factor that cannot be distinguished from another factor. For example, red wine drinking has been correlated with a low risk of heart disease. But if people with a low stress level tend to drink red wine, then the two factors are confounded. Low stress is said to be an extraneous factor. One way to handle this is called blocking, which means to create groups that are similar in every way except the factor you are trying to test. Another way to handle this is called control, which means to keep all extraneous factors constant.
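One common way to carry out blocking is to group subjects by the extraneous factor and then randomly assign treatments inside each group. The sketch below assumes this approach; the subjects, the stress labels, the treatment names, and the function assign_within_blocks are all hypothetical and are not taken from a real study.

```python
import random

def assign_within_blocks(subjects, block_of, treatments, seed=None):
    """Randomly assign treatments separately inside each block."""
    rng = random.Random(seed)
    # Form blocks: subjects that share the extraneous factor go together.
    blocks = {}
    for subject in subjects:
        blocks.setdefault(block_of(subject), []).append(subject)
    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)  # random order within the block
        for i, subject in enumerate(shuffled):
            # Cycle through the treatments so each block is split evenly.
            assignment[subject] = treatments[i % len(treatments)]
    return assignment

# Hypothetical subjects: (id, stress level); stress is the extraneous factor.
subjects = [(f"s{i}", "low" if i < 6 else "high") for i in range(12)]
plan = assign_within_blocks(
    subjects, lambda s: s[1], ["red wine", "no red wine"], seed=3
)

# Count how many subjects land in each (stress level, treatment) cell.
counts = {}
for (_, stress), treatment in plan.items():
    counts[(stress, treatment)] = counts.get((stress, treatment), 0) + 1
print(counts)
```

Since each stress level receives both treatments in equal numbers, a difference in the response between treatments can no longer be explained by stress level alone.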