Sampling and Statistics

Statistics

We start the discussion in the natural way.  We all have a general feeling for what statistics is.  In the course of these lecture notes, we will lay out in detail what statistics is and how it is used.  For now we give a quick definition.

Statistics is the study of how to collect, organize, analyze, and interpret numerical information from data.

 


Sampling and Types of Data

Population vs. Sample

We define the population to be the total set of individuals that we are interested in, and a sample to be a subset of those individuals selected in a prescribed manner for study.

Typically, population data is very hard or even impossible to gather.  Statisticians and researchers will instead extract data from a sample.  There are several types of data that are of interest.

We can classify data into two types:

  1. Numerical or Quantitative data is data where the observations are numbers.  For example: age, height, "on a scale from one to ten ...", distance, "number of ...", ...

  2. Categorical or Qualitative data is data where the observations are non-numerical.  For example, favorite color, choice of politician, ...

There is a more refined way to classify data.  Data can be put into one of several categories called levels of measurement.

  1. The nominal level is synonymous with qualitative data.  

  2. The ordinal level is data that involves ranking.  For example, Williams took second place in the US Open.  No actual values are assigned to the outcomes, but we can still compare one with another.

  3. The interval level is data such that one outcome can be compared with another outcome by taking differences.  For example, one outcome may be 12 degrees warmer than another, or an outcome may have occurred 35 minutes later than another.

  4. The ratio level is data for which both differences and ratios can be taken.  For example, if the cost of a hamburger is $2 and that of a steak is $12, it makes sense to say either that the steak costs $10 more or that the steak is 6 times as expensive.

  5. Boolean data is data that can take one of two values, such as true or false, yes or no, on or off, etc.  For example, the outcome of a questionnaire asking if you agree with Bush's policy on the Middle East.
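
The difference between the interval and ratio levels can be checked with a little arithmetic.  The sketch below, in Python, reuses the hamburger and steak prices from the list; the two temperatures (10 and 20 degrees Celsius) are assumed values chosen only for illustration.  It suggests why ratios are meaningful for prices but not for temperatures in degrees Celsius: changing the temperature scale preserves the difference but changes the ratio.

```python
# Ratio level: prices have a true zero, so differences and ratios both make sense.
hamburger, steak = 2.0, 12.0
price_difference = steak - hamburger   # the steak costs this many dollars more
price_ratio = steak / hamburger        # the steak costs this many times as much

# Interval level: Celsius has an arbitrary zero, so only differences are meaningful.
# The temperatures below are assumed values, not taken from the notes.
cool_c, warm_c = 10.0, 20.0
diff_c = warm_c - cool_c                         # difference in degrees Celsius
diff_k = (warm_c + 273.15) - (cool_c + 273.15)   # same difference in kelvins (up to rounding)
ratio_c = warm_c / cool_c                        # 2.0, an artifact of the Celsius zero
ratio_k = (warm_c + 273.15) / (cool_c + 273.15)  # about 1.035 on the Kelvin scale

print("price difference:", price_difference, "price ratio:", price_ratio)
print("temperature difference (C, K):", diff_c, round(diff_k, 6))
print("temperature ratio (C, K):", ratio_c, round(ratio_k, 3))
```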

Data is called univariate if it represents one attribute and bivariate if it contains two attributes.  Bivariate data is often used to compare and contrast.  For example, we may study weight gain and caloric intake.  

Numerical data is called discrete if the number of possible values within every bounded range is finite.  Examples include:  rolling dice, number of times that ..., ...

Otherwise, numerical data is called continuous.  For example, height, weight, temperature, distance,...

 


Random Samples

 

When we conduct a survey we always attempt to achieve a random sample.  A simple random sample of size n is one in which every possible subset of size n has an equal chance of being selected.  For example, to choose a random sample of 20 people with phone numbers, we can use a random number generator to randomly select 20 phone numbers.
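
The phone number example can be sketched in Python.  This is only an illustration: the list `phone_numbers` is a made-up sampling frame and `simple_random_sample` is a hypothetical helper name.  The standard library function `random.sample` draws without replacement, so each subset of size n is equally likely to be chosen.

```python
import random

# Hypothetical sampling frame: 1,000 made-up phone numbers (an assumption for illustration).
phone_numbers = [f"555-{i:04d}" for i in range(1000)]

def simple_random_sample(frame, n, seed=None):
    """Return n distinct items from frame; every subset of size n is equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(frame, n)

sample = simple_random_sample(phone_numbers, 20, seed=1)
print(sample)
```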

Caution:  A simple random sample is almost always impossible to achieve in the real world.  For example, using the phone number generator, we will only be able to collect data from those who have a phone, pick up the phone, and are willing to participate in the phone survey.  Because of this, most surveys have inherent flaws.  However, a survey with a small flaw is better than no information.  Many surveys are done using convenience sampling.  For example, a researcher stands outside a supermarket and interviews anyone eager to respond.

One way to overcome the problem of obtaining a random sample is to use stratified sampling.  Stratified sampling ensures that members of each stratum (or type) are included in the survey.  For example, we may randomly select 50 Caucasians, 25 Hispanics, and 10 Filipinos from the Lake Tahoe community to ensure that the three main ethnic groups are represented.
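
A minimal sketch of stratified sampling in Python, using the group sizes 50, 25, and 10 from the example above.  The rosters in `strata` are made-up IDs and `stratified_sample` is a hypothetical helper name; a real study would need an actual list of community members for each stratum.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample of the requested size from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, sizes[name]) for name, members in strata.items()}

# Hypothetical community roster, grouped by stratum (made-up IDs and roster sizes).
strata = {
    "Caucasian": [f"C{i}" for i in range(500)],
    "Hispanic": [f"H{i}" for i in range(200)],
    "Filipino": [f"F{i}" for i in range(60)],
}
sizes = {"Caucasian": 50, "Hispanic": 25, "Filipino": 10}

chosen = stratified_sample(strata, sizes, seed=1)
print({name: len(people) for name, people in chosen.items()})
```

Sampling separately within each stratum is what guarantees that every group appears in the survey; a single simple random sample of 85 people could miss a small group entirely.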

One problem with sampling is that often the researcher only gets respondents who are eager to be interviewed.  One way to combat this is to use cluster sampling.  This process involves breaking the population into several groups or clusters.  Some of the clusters are randomly selected, and the researcher makes sure that every individual in the selected clusters is surveyed.  This usually involves paying the respondents to take the survey.
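
Cluster sampling can be sketched the same way.  In the Python example below, the eight neighborhoods and their residents are made up and `cluster_sample` is a hypothetical helper name.  The random draw selects whole clusters, and then every individual in a selected cluster is included.

```python
import random

def cluster_sample(clusters, k, seed=None):
    """Randomly pick k clusters and return every individual in the chosen clusters."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(clusters), k)  # choose k cluster names at random
    return {name: list(clusters[name]) for name in picked}

# Hypothetical population split into 8 neighborhoods of 5 residents each.
clusters = {
    f"neighborhood_{c}": [f"resident_{c}_{i}" for i in range(5)]
    for c in range(8)
}

surveyed = cluster_sample(clusters, 3, seed=1)
for name, residents in surveyed.items():
    print(name, residents)
```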


 


Experimental Design  

One of the most fatal mistakes a researcher can make is to have faulty experimental design, that is, poor planning.  Deep thought needs to go into the design of the experiment before any field work can take place.  Below are some guidelines for planning a statistical study.

  • Identify the population.

  • Decide what your variables are and how you are going to take measurements.  This can involve lawyers and regulatory agencies.

  • Determine the sample size.

  • Collect the data.

  • Organize the data and use either descriptive or inferential statistical methods to interpret and report on the data.

  • Publish, noting how you need more funding to do a more extensive survey.

 

The coin toss is a type of experimentation.  Experimentation is a type of data collection where the researcher creates the data.  Usually experimentation answers the question, "If we do this, what happens?"  The response variable is the variable being studied by the experimenter.  Often the experimenter sets the environment in which the experiment is run.  For example, a psychologist may want to determine how mood depends on weather conditions.  She may study several people's moods under various weather conditions.  These conditions are called experimental conditions or factors.

Sometimes the effect of one factor cannot be distinguished from that of another factor.  For example, red wine drinking has been correlated with a low risk of heart disease.  But if people with a low stress level tend to drink red wine, then the two factors are confounded.  Low stress is said to be an extraneous factor.  One way to handle this is called blocking, which means to create groups that are similar in every way except for the factor being tested.  Another way to handle this is called control, which means to keep all extraneous factors constant.

