Statistics
Collection of
methods for planning experiments, obtaining data, and then organizing,
summarizing, presenting, analyzing, interpreting, and drawing conclusions.
Variable
Characteristic
or attribute that can assume different values
Population
All subjects
possessing a common characteristic that is being studied.
Sample
A subgroup or
subset of the population.
Parameter
Characteristic
or measure obtained from a population.
Statistic (not to be confused with Statistics)
Characteristic
or measure obtained from a sample.
Descriptive Statistics
Collection,
organization, summarization, and presentation of data.
Inferential Statistics
Generalizing
from samples to populations using probabilities. Performing hypothesis testing,
determining relationships between variables, and making predictions.
Qualitative Variables
Variables which
assume non-numerical values.
Quantitative Variables
Variables which
assume numerical values.
Discrete Variables
Variables which
assume a finite or countable number of possible values. Usually obtained by
counting.
Continuous Variables
Variables which
assume an infinite number of possible values. Usually obtained by measurement.
Raw Data
Data collected in original form.
Frequency
The number of times a certain value or class of values occurs.
Frequency Distribution
The organization of raw data in table form with classes and
frequencies.
Ungrouped Frequency Distribution
A frequency distribution of numerical data. The raw data is not
grouped.
Grouped Frequency Distribution
A frequency distribution where several numbers are grouped into one
class.
Class Limits
Separate one class in a grouped frequency distribution from
another. The limits could actually appear in the data and have gaps between the
upper limit of one class and the lower limit of the next.
Class Boundaries
Separate one class in a grouped frequency distribution from
another. The boundaries have one more decimal place than the raw data and
therefore do not appear in the data. There is no gap between the upper boundary
of one class and the lower boundary of the next class. The lower class boundary
is found by subtracting 0.5 units from the lower class limit and the upper
class boundary is found by adding 0.5 units to the upper class limit.
Class Width
The difference between the upper and lower boundaries of any class.
The class width is also the difference between the lower limits of two
consecutive classes or the upper limits of two consecutive classes. It is not
the difference between the upper and lower limits of the same class.
Class Mark (Midpoint)
The number in the middle of the class. It is found by adding the
upper and lower limits and dividing by two. It can also be found by adding the
upper and lower boundaries and dividing by two.
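The relationships among limits, boundaries, width, and midpoint can be checked with a short sketch. The function name `class_summary` and the `unit` parameter are illustrative assumptions: `unit` stands for the precision of the raw data (1 for whole numbers), so that "0.5 units" becomes `unit / 2`.

```python
def class_summary(lower_limit, upper_limit, unit=1):
    """Return (lower boundary, upper boundary, width, midpoint) for one class.

    `unit` is the precision of the raw data (assumed: 1 for integer data).
    """
    half = unit / 2
    lower_boundary = lower_limit - half      # subtract 0.5 units
    upper_boundary = upper_limit + half      # add 0.5 units
    width = upper_boundary - lower_boundary  # difference of the boundaries
    midpoint = (lower_limit + upper_limit) / 2
    return lower_boundary, upper_boundary, width, midpoint


# Hypothetical class 10-14 with whole-number data.
print(class_summary(10, 14))  # (9.5, 14.5, 5.0, 12.0)
```

Note that the width is 5, while the upper limit minus the lower limit of the same class is only 4.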
Cumulative Frequency
The number of values less than the upper class boundary for the
current class. This is a running total of the frequencies.
Relative Frequency
The frequency divided by the total frequency. This gives the
proportion of values falling in that class (multiply by 100 for the percent).
Cumulative Relative Frequency (Relative Cumulative Frequency)
The running total of the relative frequencies or the cumulative
frequency divided by the total frequency. Gives the proportion (or, multiplied
by 100, the percent) of the values which are less than the upper class boundary.
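As a small illustration, the cumulative, relative, and cumulative relative frequencies can all be derived from the class frequencies alone. The frequencies below are made-up values for four classes, not data from these notes.

```python
from itertools import accumulate

frequencies = [3, 7, 6, 4]   # hypothetical class frequencies
total = sum(frequencies)     # total frequency: 20

# Running total of the frequencies.
cumulative = list(accumulate(frequencies))

# Each frequency divided by the total frequency.
relative = [f / total for f in frequencies]

# Cumulative frequency divided by the total frequency.
cumulative_relative = [c / total for c in cumulative]

print(cumulative)           # [3, 10, 16, 20]
print(relative)             # [0.15, 0.35, 0.3, 0.2]
print(cumulative_relative)  # [0.15, 0.5, 0.8, 1.0]
```

The last cumulative frequency equals the total frequency, and the last cumulative relative frequency equals 1.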
Histogram
A graph which displays the data by using vertical bars of various
heights to represent frequencies. The horizontal axis can be either the class
boundaries, the class marks, or the class limits.
Frequency Polygon
A line graph. The frequency is placed along the vertical axis and
the class midpoints are placed along the horizontal axis. These points are
connected with lines.
Ogive
A frequency polygon of the cumulative frequency or the relative
cumulative frequency. The vertical axis is the cumulative frequency or relative cumulative
frequency. The horizontal axis is the class boundaries. The graph always starts
at zero at the lowest class boundary and will end up at the total frequency
(for a cumulative frequency) or 1.00 (for a relative cumulative frequency).
Pie Chart
Graphical depiction of data as slices of a pie. The frequency
determines the size of the slice. The number of degrees in any slice is the
relative frequency times 360 degrees.
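The slice sizes follow directly from that rule. This is a minimal sketch with invented category labels and frequencies; it writes the product as `360 * f / total`, which equals the relative frequency times 360.

```python
frequencies = {"A": 3, "B": 7, "C": 6, "D": 4}  # hypothetical categories
total = sum(frequencies.values())

# Degrees in each slice = relative frequency * 360.
degrees = {label: 360 * f / total for label, f in frequencies.items()}

print(degrees)  # {'A': 54.0, 'B': 126.0, 'C': 108.0, 'D': 72.0}
```

The slices always add up to the full 360 degrees.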
Percentile
The percent of
the population which lies below that value. The data must be ranked to find
percentiles.
Quartile
Either the
25th, 50th, or 75th percentiles. The 50th percentile is also called the median.
Decile
Either the
10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, or 90th percentiles.
Interquartile Range (IQR)
The difference
between the 3rd and 1st Quartiles.
Outlier
An extremely
high or low value when compared to the rest of the values.
Population vs Sample
The population includes all objects of interest
whereas the sample is only a portion of the population. Parameters are
associated with populations and statistics with samples. Parameters are usually
denoted using Greek letters (mu, sigma) while statistics are usually denoted
using Roman letters (x, s).
There are several reasons why we don't work
with populations. They are usually large, and it is often impossible to get
data for every object we're studying. Sampling does not usually occur without
cost, and the more items surveyed, the larger the cost.
We compute statistics, and use them to estimate
parameters. The computation is the first part of the statistics course
(Descriptive Statistics) and the estimation is the second part (Inferential
Statistics).
Discrete vs Continuous
Discrete variables are usually obtained by
counting. There are a finite or countable number of choices available with
discrete data. You can't have 2.63 people in the room.
Continuous variables are usually obtained by
measuring. Length, weight, and time are all examples of continuous variables.
Since continuous variables are real numbers, we usually round them. This
implies a boundary depending on the number of decimal places. For example: 64
is really anything 63.5 <= x < 64.5. Likewise, if there are two decimal
places, then 64.03 is really anything 64.025 <= x < 64.035. Boundaries
always have one more decimal place than the data and end in a 5.
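The implied boundaries can be computed from the number of decimal places in the reported value. The sketch below is one way to do it, assuming the value is passed as a string so that its decimal places are preserved; the function name `implied_interval` is made up for illustration. It uses Python's `decimal` module to avoid binary floating-point rounding.

```python
from decimal import Decimal


def implied_interval(reported):
    """Return (lower, upper) so that lower <= x < upper rounds to `reported`.

    `reported` is a string such as "64" or "64.03" (an assumption of this
    sketch, so that trailing decimal places are not lost).
    """
    value = Decimal(reported)
    # The exponent is 0 for "64" and -2 for "64.03".
    unit = Decimal(1).scaleb(value.as_tuple().exponent)
    half = unit / 2  # one more decimal place than the data, ending in 5
    return value - half, value + half


print(implied_interval("64"))     # (Decimal('63.5'), Decimal('64.5'))
print(implied_interval("64.03"))  # (Decimal('64.025'), Decimal('64.035'))
```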
Grouped Frequency Distributions
Guidelines for classes
- There should be between 5 and
20 classes.
- The class width should be an
odd number. This will guarantee that the class midpoints are integers
instead of decimals.
- The classes must be mutually
exclusive. This means that no data value can fall into two different
classes.
- The classes must be all
inclusive or exhaustive. This means that all data values must be included.
- The classes must be continuous.
There are no gaps in a frequency distribution. Classes that have no values
in them must be included (unless it is the first or last class, which is
dropped).
- The classes must be equal in
width. The exception here is the first or last class. It is possible to
have a "below ..." or "... and above" class. This is
often used with ages.
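The odd-width guideline is easy to check numerically. The sketch below assumes whole-number data, so each class runs from its lower limit to the lower limit plus the width minus one; the function name `class_midpoints` and the starting point of 10 are illustrative.

```python
def class_midpoints(start, width, num_classes):
    """Midpoints of consecutive classes for integer data (assumed)."""
    midpoints = []
    for lower in range(start, start + width * num_classes, width):
        upper = lower + width - 1  # upper limit of this class
        midpoints.append((lower + upper) / 2)
    return midpoints


print(class_midpoints(10, 5, 3))  # odd width:  [12.0, 17.0, 22.0]
print(class_midpoints(10, 6, 3))  # even width: [12.5, 18.5, 24.5]
```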
Creating a Grouped Frequency Distribution
- Find the largest and smallest
values
- Compute the Range = Maximum -
Minimum
- Select the number of classes
desired. This is usually between 5 and 20.
- Find the class width by
dividing the range by the number of classes and rounding up. There are two
things to be careful of here. You must round up, not off.
Normally 3.2 would round to be 3, but in rounding up, it becomes 4. If the
range divided by the number of classes gives an integer value (no
remainder), then you can either add one to the number of classes or add
one to the class width. Sometimes you're locked into a certain number of
classes because of the instructions. The Bluman text fails to mention the
case when there is no remainder.
- Pick a suitable starting point
less than or equal to the minimum value. You will be able to cover:
"the class width times the number of classes" values. You need
to cover one more value than the range. Follow this rule and you'll be
okay: The starting
point plus the number of classes times the class width must be greater
than the maximum value.
Your starting point is the lower limit of the first class. Continue to add
the class width to this lower limit to get the rest of the lower limits.
- To find the upper limit of the
first class, subtract one from the lower limit of the second class. Then
continue to add the class width to this upper limit to find the rest of
the upper limits.
- Find the boundaries by
subtracting 0.5 units from the lower limits and adding 0.5 units to the
upper limits. The boundaries are also half-way between the upper limit of
one class and the lower limit of the next class. Depending on what you're
trying to accomplish, it may not be necessary to find the boundaries.
- Tally the data.
- Find the frequencies.
- Find the cumulative
frequencies. Depending on what you're trying to accomplish, it may not be
necessary to find the cumulative frequencies.
- If necessary, find the relative
frequencies and/or relative cumulative frequencies.
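The steps above can be sketched in code. This is a minimal version under stated assumptions: the data are whole numbers, the starting point is taken to be the minimum value, and the no-remainder case is handled by adding one to the class width (the other option, adding a class, is not shown). The function name and the sample scores are invented for illustration.

```python
def grouped_frequency_distribution(data, num_classes):
    """Build a grouped frequency distribution for integer data (assumed)."""
    smallest, largest = min(data), max(data)
    data_range = largest - smallest

    # Round range / num_classes UP; if there is no remainder, add one to the
    # width. For integers both cases reduce to floor division plus one.
    width = data_range // num_classes + 1

    start = smallest  # a starting point equal to the minimum (assumption)
    # Rule from the notes: start + classes * width must exceed the maximum.
    assert start + num_classes * width > largest

    rows, running_total = [], 0
    for i in range(num_classes):
        lower = start + i * width
        upper = lower + width - 1  # one less than the next lower limit
        frequency = sum(1 for x in data if lower <= x <= upper)
        running_total += frequency
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),
            "frequency": frequency,
            "cumulative": running_total,
            "relative": frequency / len(data),
        })
    return rows


scores = [12, 15, 17, 21, 22, 22, 25, 28, 30, 31, 34, 35, 38, 40, 41]
table = grouped_frequency_distribution(scores, 5)
for row in table:
    print(row["limits"], row["frequency"], row["cumulative"])
# (12, 17) 3 3
# (18, 23) 3 6
# (24, 29) 2 8
# (30, 35) 4 12
# (36, 41) 3 15
```

Here the range is 29, and 29 / 5 = 5.8 rounds up to a class width of 6.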
Percentiles, Deciles, Quartiles
Percentiles (100 regions)
The kth percentile is the number which has k%
of the values below it. The data must be ranked.
- Rank the data
- Find k% (k /100) of the sample
size, n.
- If this is an integer, add 0.5.
If it isn't an integer round up.
- Find the number in this
position. If your depth ends in 0.5, then take the midpoint between the
two numbers.
It is sometimes easier to count from the high
end rather than counting from the low end. For example, the 80th
percentile is the number which has 80% below it and 20% above it. Rather than
counting 80% from the bottom, count 20% from the top.
Note: The 50th percentile is the
median.
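The procedure for locating the kth percentile can be written out as a short sketch. It assumes k is a whole number strictly between 0 and 100, and it uses integer division so that the "is it an integer?" test is exact. The function name `kth_percentile` and the sample data are made up for illustration.

```python
def kth_percentile(data, k):
    """Locate the kth percentile using the rank-and-count rule in these notes."""
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    ranked = sorted(data)                   # step 1: rank the data
    whole, remainder = divmod(k * len(ranked), 100)  # step 2: k% of n
    if remainder == 0:
        # Integer: add 0.5, so take the midpoint of positions
        # `whole` and `whole + 1` (counting from 1).
        return (ranked[whole - 1] + ranked[whole]) / 2
    # Not an integer: round up to position `whole + 1` (counting from 1).
    return ranked[whole]


values = [2, 4, 5, 7, 8, 10, 12, 15, 18, 20]
print(kth_percentile(values, 50))  # 9.0, the median
print(kth_percentile(values, 25))  # 5
print(kth_percentile(values, 80))  # 16.5
```

Other textbooks and software packages use slightly different interpolation rules, so their answers may differ a little from this one.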
If you wish to find the percentile for a number
(rather than locating the kth percentile), then
- Take the number of values below
the number
- Add 0.5
- Divide by the total number of
values
- Convert it to a percent
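Going the other direction, from a value to its percentile, takes only a couple of lines. The function name `percentile_of` and the data are illustrative assumptions.

```python
def percentile_of(data, value):
    """Percentile for `value`: (number below + 0.5) / n, as a percent."""
    below = sum(1 for x in data if x < value)
    return 100 * (below + 0.5) / len(data)


values = [2, 4, 5, 7, 8, 10, 12, 15, 18, 20]
print(percentile_of(values, 12))  # 65.0 (six values lie below 12)
```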
Deciles (10 regions)
The
percentiles divide the data into 100 equal regions. The deciles divide the data
into 10 equal regions. The instructions are the same for finding a percentile,
except instead of dividing by 100 in step 2, divide by 10.
Quartiles (4 regions)
The
quartiles divide the data into 4 equal regions. Instead of dividing by 100 in
step 2, divide by 4.
Note:
The 2nd quartile is the same as the median. The 1st
quartile is the 25th percentile, the 3rd quartile is the
75th percentile.
The
quartiles are commonly used (much more so than the percentiles or deciles). The
TI-82 calculator will find the quartiles for you. Some textbooks include the
quartiles in the five number summary.
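As a sketch, the quartiles and the interquartile range can be found with the same positional rule as the percentiles, dividing by 4 instead of 100. The function name `quartile` and the data are invented for illustration, and calculators or software may use a slightly different convention.

```python
def quartile(data, k):
    """kth quartile (k = 1, 2, or 3) using the same rule as for percentiles."""
    if k not in (1, 2, 3):
        raise ValueError("k must be 1, 2, or 3")
    ranked = sorted(data)
    whole, remainder = divmod(k * len(ranked), 4)  # k/4 of the sample size
    if remainder == 0:
        # Integer depth: midpoint between the two neighbouring values.
        return (ranked[whole - 1] + ranked[whole]) / 2
    return ranked[whole]  # otherwise round the position up


values = [2, 4, 5, 7, 8, 10, 12, 15, 18, 20]
q1, q2, q3 = (quartile(values, k) for k in (1, 2, 3))
iqr = q3 - q1  # difference between the 3rd and 1st quartiles

print(q1, q2, q3)  # 5 9.0 15
print(iqr)         # 10
```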
Range
The
range is the simplest measure of variation to find. It is simply the highest
value minus the lowest value.
RANGE = MAXIMUM - MINIMUM
Since the range only uses the largest and
smallest values, it is greatly affected by extreme values; that is, it is not
resistant to change.
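A quick illustration of that sensitivity, using made-up numbers: adding a single extreme value changes the range drastically even though the rest of the data are unchanged.

```python
def value_range(data):
    """RANGE = MAXIMUM - MINIMUM."""
    return max(data) - min(data)


typical = [2, 4, 5, 7, 8, 10]     # hypothetical data
with_extreme = typical + [95]     # same data plus one extreme value

print(value_range(typical))       # 8
print(value_range(with_extreme))  # 93
```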