Statistics Toolbox
Methods
Summary Statistics
Introduction
This module computes up to 18 summary statistics for the data. The data may be subsetted using the panel and
group variables. Further discussion of the setting and assumptions is contained in the guidance section of the
module.
Method
PURPOSE: Summary statistics are quantities computed from the data that
describe distributional characteristics of the data and provide estimates of the underlying population. This
module computes many summary statistics that describe distributional characteristics such as location and spread.
Standard measures of location are the mean and median which describe the center of the distribution. Common
measures of distributional spread are the standard deviation, the minimum, and the maximum. Percentiles are
measures of location and also provide insight into the spread of the distribution.
DATA: A simple or systematic random sample, x1,...,xn, from the population of interest.
The sample may or may not contain compositing. Several data sets from separate populations may be
analyzed using the panel and group variables.
ASSUMPTIONS: Inference requires a random sample from the population of
interest and for most applications, the data need to be independent.
LIMITATIONS AND ROBUSTNESS: Interpretation of population characteristics
using summary statistics provides a quantitative picture of the data distribution. A more complete picture of the
data can be seen by combining summary statistics and one or more exploratory data analysis plots such as
boxplots, histograms, and intensity plots.
NON-DETECTS: A non-detect is an analytical sample where the concentration is
deemed to be lower than could be detected using the method employed by the laboratory. The number of non-detects
and the values for the minimum and maximum non-detect are reported in the summary table. When calculating
statistics, such as the mean and standard deviation, non-detects are usually replaced by either zero, half
the detection limit or the detection limit itself for computational purposes. There are some concerns with
each of these substitution methods. When substituting a zero, the estimate of the mean is biased low but the
estimate of the standard deviation is biased high. When non-detects are replaced by the detection limit, the
estimate of the mean is biased high but the estimate of the standard deviation is biased low. Therefore, it
is usually recommended that half the detection limit is used for statistical calculations such as the mean and
standard deviation.
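The effect of the three substitution rules can be checked on a small example. The following is a minimal sketch, not the module's implementation; the data values and detection limits are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev

# Hypothetical data: detected concentrations plus non-detects that are
# known only to lie below their detection limits.
detects = [1.2, 2.5, 3.1, 4.8, 6.0]
nondetect_limits = [1.0, 1.0, 0.5]  # detection limit of each non-detect

def substitute(fraction):
    """Replace each non-detect by fraction * (its detection limit)."""
    return detects + [fraction * dl for dl in nondetect_limits]

results = {}
for label, fraction in [("zero", 0.0), ("half DL", 0.5), ("DL", 1.0)]:
    values = substitute(fraction)
    results[label] = (mean(values), stdev(values))
    print(f"{label:8s} mean={results[label][0]:.3f}  sd={results[label][1]:.3f}")
```

For these values the mean increases and the standard deviation decreases as the substituted value moves from zero to the detection limit, which matches the direction of the biases described above.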
QUANTILES: The kth quantile (percentile) of a set of values divides the values so that a fraction k/100
of the values lie below it and a fraction 1.0-(k/100) of the values lie above it. The 5th, 25th, 75th
and 95th quantiles are reported in the summary table. Note that the 50th quantile is the median.
CENTRAL TENDENCY: The mean and the median are two measures of central tendency
reported in the summary statistics table. Mathematically, the mean is the average of the values in the data
set whereas the median is the value found at the exact midpoint of the ordered data values. While both
measures are used to characterize the center of a data set, there can be differences between the reported mean
and median for a given distribution.
A majority of the differences between the mean and median arise when the distribution is skewed. In general,
as the amount of skewness increases, the mean is pulled further away from the median in the direction of the
tail exhibiting the skewness. Therefore, by examining the relationship between the mean and the median it is
possible to visualize the shape of the distribution. For example, if the mean is much less than the median, the
distribution is likely left skewed. If the mean is much larger than the median, the distribution is likely right
skewed.
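The mean and median will be close to one another when the distribution is approximately symmetric. Moreover,
when the distribution is exactly symmetric, the mean and median are identical. The converse does not always hold:
for example, the values 0, 1, 4, 5, 10 have a mean and a median of 4 but are not symmetric about 4.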
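The comparison of mean and median described above can be sketched as follows. This is an illustrative helper only; the function name skew_hint and the example data are hypothetical, and the returned hint is a rough indication, not a formal test of skewness.

```python
from statistics import mean, median

def skew_hint(values):
    """Compare mean and median and return a rough, non-definitive shape hint."""
    m, med = mean(values), median(values)
    if m > med:
        hint = "mean > median: suggests right skew"
    elif m < med:
        hint = "mean < median: suggests left skew"
    else:
        hint = "mean = median: consistent with symmetry"
    return m, med, hint

right_skewed = [1, 2, 2, 3, 3, 4, 25]      # one large value in the upper tail
left_skewed = [1, 22, 23, 23, 24, 24, 25]  # one small value in the lower tail
symmetric = [2, 4, 6, 8, 10]

for name, data in [("right", right_skewed), ("left", left_skewed), ("sym", symmetric)]:
    print(name, skew_hint(data))
```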
MEASURES OF SPREAD: There is only one measure of spread directly reported in
the summary statistics. That measure is the standard deviation. The standard deviation is a measure of how
far the data observations are from the mean of the distribution. Therefore, a small standard deviation indicates
many of the values are close to the mean (i.e. there is little spread in the data) whereas a large standard
deviation indicates there are values far from the mean (i.e. there is a lot of spread in the data). Since
the standard deviation is computed from deviations about the mean, it can be strongly affected by outliers. Therefore, for data sets with
large outliers, the Inter Quartile Range (described below) is the preferred measure of spread.
Other measures of spread that may be calculated from the summary statistics include the overall range and the
Inter Quartile Range (IQR). The range of the data is calculated as the maximum minus the minimum values in the
data set. This number presents the complete spread of the data. The IQR is calculated as the 75th quantile minus
the 25th quantile. This number presents the spread associated with the middle half of the data.
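The range and IQR can be derived from the reported statistics as in the sketch below. Several quantile conventions exist and the one used by the module is not stated here, so the "inclusive" method of Python's statistics.quantiles is an assumption; the data are hypothetical.

```python
from statistics import quantiles, stdev

def spread_summary(values):
    """Return range, IQR and standard deviation for a list of numbers."""
    # Quartiles by linear interpolation (assumed convention, see lead-in).
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "sd": stdev(values),
    }

clean = [2, 4, 4, 5, 7, 8, 9, 12]
with_outlier = clean + [40]  # same data plus one large outlier

print(spread_summary(clean))
print(spread_summary(with_outlier))
```

In this example the single outlier more than triples the standard deviation and the range, while the IQR changes only slightly, which is why the IQR is preferred when large outliers are present.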
SHAPIRO-WILK: The Shapiro-Wilk test is used to determine if a set of data are
normally distributed. The null hypothesis for this test is that the data are distributed normally, while the
alternative hypothesis is that the data are not distributed normally. The p-value associated with the
Shapiro-Wilk test is provided as one of the statistics in the summary table. Small p-values indicate there
is enough evidence to reject the null hypothesis in favor of the alternative hypothesis.
UPPER CONFIDENCE LIMIT (UCL): A confidence interval is an interval estimate
for the population parameter of interest. Confidence limits for the mean provide an upper and lower estimate
for the mean. The interval estimate also gives an indication of how much uncertainty there is in our estimate
of the true mean. The narrower the interval, the more precise our estimate.
UPPER TOLERANCE LIMIT (UTL): A tolerance interval is an interval estimate for
a certain proportion of the population. The width of the interval is dependent upon the population variance,
sample size, desired proportion of the population and confidence level. More specifically, the width is large
if the variance is large, the sample size is small, the proportion is large or the confidence level is large.
In general, the upper tolerance limit is the value below which a specified percentage of the population is
expected to lie, with the stated level of confidence.
Boxplot
Introduction
This module draws an original-scale boxplot and a log-scale boxplot for each panel-group combination. Separate
plots are drawn for each panel variable while side-by-side boxplots are drawn for each group variable
within a panel plot (see the example plot in the guidance). Further description of the boxplot is given in
the guidance.
Method
Box and Whisker plots, also known as boxplots, are useful in situations where a picture of the distribution
is desired, but it is not necessary or feasible to portray all the details of the data. A boxplot (see the
example figure) displays several percentiles of the data set. It is a simple plot, yet provides insight into
the location, shape, and spread of the data and underlying population. A simple boxplot contains only the
0th (minimum data value), 25th, 50th, 75th and 100th (maximum data value) percentiles. A more complex version
included here plots potential outliers as open circles. These are points that lie further than 1.5 times the
interquartile range (75th percentile minus the 25th percentile) from either end of the central box. Since the
boxplot is compact (essentially one-dimensional), several can be placed on a single graph. This allows the user
to compare the locations, spreads and shapes of several data sets or different groups
within a single data set.
A boxplot divides the data into 4 sections, each containing 25% of the data. The length of the central box
indicates the spread of the central 50% of the data, while the length of the whiskers shows the breadth of the
tails of the distribution. The boxplot also demonstrates the shape of the data in the following manner. If
the upper box and whisker are approximately the same length as the lower box and whisker, then the data are
distributed symmetrically. If the upper box and whisker are longer than the lower box and whisker, then the
data are right-skewed. If the upper box and whisker are shorter than the lower box and whisker, then the data
are left-skewed.
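The quantities drawn in a boxplot can be computed as in the following sketch. It assumes the common convention that whiskers extend to the most extreme data values that are not flagged as potential outliers, and it uses the "inclusive" quartile method of Python's statistics module; both choices are assumptions, and the module may use a different quantile definition.

```python
from statistics import quantiles

def boxplot_stats(values):
    """Return box, whisker and potential-outlier values for one boxplot."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Points further than 1.5 * IQR from either end of the box are flagged.
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1, "median": med, "q3": q3,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }

stats = boxplot_stats([2, 4, 4, 5, 7, 8, 9, 12, 40])
print(stats)
```

Here the upper whisker (from 9 to 12) is longer than the lower whisker (from 4 down to 2) and the only flagged point lies in the upper tail, which is the right-skewed pattern described above.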
Other options that can be displayed are non-detects (plotted as X's at detection limit), detects (plotted as
solid circles), and sample sizes (plotted at the top of each boxplot). The example plot has arsenic as the
only panel variable while there are two group subsets, PrimInv and SedBkg. Also displayed are the sample sizes
and non-detects.
Example Boxplot

Histogram
Introduction
Histograms provide a visual method of assessing the location, shape and spread of the data.
Also, extreme values and multiple modes can be identified. The details of the data are
lost, but an overall picture of the data is obtained.
Method
A histogram is a visual representation of the data collected into groups. This
graphical technique provides a method for identifying the underlying
distribution of the data. The number of bins or classes is determined by the user
or by using the range of data values. Once the number of bins is determined, the
data is sorted into those bins. A histogram is a bar graph conveying the
bins and the frequency of data points in each bin. Other forms of the histogram
use a normalization of the bin frequencies for the heights of the bars. The two
most common normalizations are relative frequencies (frequencies divided by sample
size) and densities (relative frequency divided by the bin width). The figure below is
an example of a histogram using frequencies.
Example Histogram of Frequencies

Histograms not only provide a means of identifying and characterizing the data, but they
also make it possible to identify extreme values and multiple modes. In the process of
creating the histogram, the details of the data are lost, but an overall picture of the data is obtained. As seen in the example
above, individual data points are not apparent. Other kinds of plots, specifically the stem-and-leaf
plot described in the next section, offer the same insights into the data as a histogram while
retaining the individual data values on the plot. Therefore, stem-and-leaf plots can be more
informative than histograms for smaller data sets.
One concern when creating a histogram is that its visual impression is sensitive
to the number of bins chosen for the plot. A large number of bins increases data detail, while fewer
bins increase the smoothness of the histogram. A good starting point when choosing the number of bins
is the square-root of the sample size. Another factor to consider when constructing histogram bins
is the choice of bin endpoints. Using convenient bin endpoints can improve the readability of the
histogram. For an integer k, convenient endpoints would include multiples of 5 x 10^k units. For example,
0 to <5, 5 to <10, etc., as the bins in the figure (k = 0), or 1 to <1.5, 1.5 to <2, etc. (k = -1).
Finally, when plotting a histogram for a continuous variable, e.g., concentration,
it is necessary to decide on an endpoint convention; that is, what to do with data points that
fall on the boundary of a bin. With discrete variables (e.g., family size), the intervals can be
centered on the possible values. For the family size data, the intervals can span between 1.5
and 2.5, 2.5 and 3.5, and so on. Then the whole numbers that relate to the family size are
centered within each bar.
As with all graphical techniques, the histogram is intended to increase understanding of the data.
Since the histogram has subjective options, it is up to the user to select those that convey a clear
picture of the data.
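The binning steps and the two normalizations described in this section can be sketched as follows. This is a minimal illustration, not the module's code: the equal-width bins spanning the data range, the square-root default, and the endpoint convention (each bin includes its left endpoint, and the last bin also includes its right endpoint) are assumptions, and the data are hypothetical.

```python
import math

def histogram(values, n_bins=None):
    """Return bin edges, frequencies, relative frequencies and densities."""
    n = len(values)
    if n_bins is None:
        n_bins = max(1, round(math.sqrt(n)))  # square-root starting point
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; bin width would be zero")
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for x in values:
        # Bins are [left, right); the maximum value goes in the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    rel_freq = [c / n for c in counts]        # frequency / sample size
    density = [r / width for r in rel_freq]   # relative frequency / bin width
    return edges, counts, rel_freq, density

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 9, 10, 12, 15, 17]
edges, counts, rel_freq, density = histogram(data)
print(edges, counts, rel_freq, density)
```

The relative frequencies sum to one, and the densities multiplied by the bin width also sum to one, so the density form has total bar area equal to one.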
Normal QQ-Plot
Introduction
This module draws a Normal QQ-plot for each group-panel combination. The underlying population
distribution of the data is approximately normal if the plotted points approximately follow a
straight line. Further details can be found in the guidance section.
Method
A normal quantile-quantile plot or qq-plot (also known as a normal probability plot) is a
graph of the quantiles of a data set against the quantiles of the standard normal
distribution. This is a technique used to qualitatively determine if the underlying population of
the data is normally distributed. A statistical hypothesis test of normality called the
Shapiro-Wilk test can also be used to evaluate whether the underlying distribution is normal.
This test can be found under the 'Verifying Assumptions' menu option.
If the plotted points in the normal qq-plot are approximately linear, then this is an
indication that the data are normally distributed. If the graph is not linear, then the
departures from linearity give important information about how the data distribution deviates
from normality. A nonlinear normal probability plot may be used to interpret the shape and
tail behavior of the data. The quartile line, the line through the first and third
quartiles, is added to the plot to help determine departures from linearity. Examine the
relationship of the tails of the normal probability plot to the quartile line. If the data are
exactly normal, then the qq-plot will fall exactly upon the quartile line.
If the qq-plot has an upper-tail above the quartile line while the lower-tail is below, then
the distribution is symmetric with tails that are heavier than a normal distribution. If the
qq-plot has an upper-tail below the quartile line while the lower-tail is above, then the
distribution is symmetric with tails that are lighter than a normal distribution. If both tails
of the qq-plot are above the quartile line, then the distribution is right-skewed. If both
tails of the qq-plot are below the quartile line, then the distribution is left-skewed.
A normal probability plot can also be used to identify potential outliers. A data value (or a
few data values) much larger or much smaller than the rest of the data will appear distant
from the other values concentrated in the center of the graph.
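The points of a normal qq-plot and the quartile line can be computed as in the sketch below. The plotting positions (i - 0.5)/n, the "inclusive" quartile method, and the placement of data quantiles on the vertical axis are assumptions, since the module's exact conventions are not stated here; the data are hypothetical.

```python
from statistics import NormalDist, quantiles

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

def normal_qq_points(values):
    """Return (theoretical quantile, data quantile) pairs, sorted by data."""
    data = sorted(values)
    n = len(data)
    theoretical = [STD_NORMAL.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, data))

def quartile_line(values):
    """Slope and intercept of the line through the first and third quartiles."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    z1, z3 = STD_NORMAL.inv_cdf(0.25), STD_NORMAL.inv_cdf(0.75)
    slope = (q3 - q1) / (z3 - z1)
    return slope, q1 - slope * z1

def tail_residuals(values):
    """Vertical distance of the lowest and highest points from the quartile line."""
    points = normal_qq_points(values)
    slope, intercept = quartile_line(values)
    (z_lo, x_lo), (z_hi, x_hi) = points[0], points[-1]
    return x_lo - (intercept + slope * z_lo), x_hi - (intercept + slope * z_hi)

sample = [1, 2, 2, 3, 3, 4, 5, 7, 12, 30]
lower, upper = tail_residuals(sample)
print(lower, upper)
```

For this sample both residuals are positive, meaning both tails lie above the quartile line, which is the right-skewed pattern described above.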
Example Normal QQ-plot

Intensity Plot
Introduction
Intensity plots are two-dimensional displays of the data values according to the spatial coordinates of the
samples. Plots for subsets of the data can be obtained using the panel and group variables. Further details are
available in the guidance section.
Method
Intensity plots offer displays not possible in other EDA plots. These plots allow the user to visualize the
distribution of the data points and can be used to locate extreme values, view overall spatial trends, and view
the degree of continuity among neighboring locations. An intensity plot is a scatterplot of the sample
coordinates where a colored symbol is plotted at the set of coordinates rather than just a point. The color of
the symbol is defined by a color rainbow that runs from green, to yellow, to orange, and finally to red. Green
corresponds to the lowest values and red to the highest. Detects are plotted as circles and non-detects are
plotted as squares.
Advanced options for the intensity plot are the minimum value, maximum value, and a threshold. All of these
options modify the intensity color range. The minimum and maximum values will set the range of color values. This
allows the user to compare more than one intensity plot on the same scale. Setting a threshold value (a UCL or
some action level) modifies the plot in two ways. First, detected values below the threshold are plotted as
circles while detected values above the threshold are plotted as triangles. Non-detects are still plotted as
squares. Secondly, values below the threshold are plotted using a green to yellow palette (yellow being closer to
the threshold) while values above the threshold are plotted using an orange to red palette (orange being closer
to the threshold).
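The symbol and color rules described above can be sketched as a small mapping function. This is an illustration only: the RGB anchor colors, the linear interpolation between them, coloring non-detects by their reported value, and treating a value exactly equal to the threshold as "above" are all assumptions, and the function name intensity_style is hypothetical. The sketch also assumes vmin < threshold < vmax.

```python
GREEN, YELLOW, ORANGE, RED = (0, 128, 0), (255, 255, 0), (255, 165, 0), (255, 0, 0)

def _ramp(colors, t):
    """Piecewise-linear interpolation along a list of RGB colors, t in [0, 1]."""
    t = min(max(t, 0.0), 1.0)
    pos = t * (len(colors) - 1)
    i = min(int(pos), len(colors) - 2)
    frac = pos - i
    return tuple(round(a + (b - a) * frac) for a, b in zip(colors[i], colors[i + 1]))

def intensity_style(value, vmin, vmax, detected=True, threshold=None):
    """Return (marker, rgb) for one sample location."""
    if threshold is None:
        color = _ramp([GREEN, YELLOW, ORANGE, RED], (value - vmin) / (vmax - vmin))
        marker = "circle"
    elif value < threshold:
        # Below the threshold: green to yellow, yellow nearest the threshold.
        color = _ramp([GREEN, YELLOW], (value - vmin) / (threshold - vmin))
        marker = "circle"
    else:
        # At or above the threshold: orange to red, orange nearest the threshold.
        color = _ramp([ORANGE, RED], (value - threshold) / (vmax - threshold))
        marker = "triangle"
    if not detected:
        marker = "square"  # non-detects are always plotted as squares
    return marker, color

print(intensity_style(2.0, 0.0, 10.0))
print(intensity_style(7.5, 0.0, 10.0, threshold=5.0))
```

Passing the same vmin and vmax to several plots keeps their color scales comparable, which is the purpose of the minimum and maximum options.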
Example Intensity Plot

Control Charts
Introduction
Control charts provide a visual display of monitoring statistics that indicate if a
process has changed modes or produced anomalous results.
Method
Control charts are used in monitoring for detecting a variety of behaviors.
The X-bar chart is useful for detecting a batch with an anomalous mean.
The cusum chart is useful for detecting a gradual change in the mean.
The S chart is useful for detecting an anomaly in the variation of a batch.
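The monitoring statistics behind these three charts can be sketched as follows. This is a simplified illustration under stated assumptions, not the module's method: batches are assumed to have equal size, the in-control mean and standard deviation are assumed known, the X-bar limits are the conventional three-standard-error limits, and the cusum is the plain cumulative sum of deviations from the target. The data are hypothetical.

```python
from itertools import accumulate
from math import sqrt
from statistics import mean, stdev

def control_statistics(batches, target_mean, sigma):
    """Return batch means, batch SDs, cusum values and flagged batch indices."""
    n = len(batches[0])                   # assumes equal batch sizes
    xbar = [mean(b) for b in batches]     # plotted on the X-bar chart
    s = [stdev(b) for b in batches]       # plotted on the S chart
    half_width = 3 * sigma / sqrt(n)      # assumed 3-sigma X-bar limits
    flagged = [i for i, m in enumerate(xbar) if abs(m - target_mean) > half_width]
    cusum = list(accumulate(m - target_mean for m in xbar))  # cusum chart
    return xbar, s, cusum, flagged

batches = [
    [10, 9, 11, 10],
    [9, 10, 10, 11],
    [13, 12, 12, 13],   # batch with an anomalous mean
    [10, 11, 10, 11],
    [11, 10, 11, 11],
]
xbar, s, cusum, flagged = control_statistics(batches, target_mean=10, sigma=1)
print(xbar, s, cusum, flagged)
```

In this example only the third batch falls outside the X-bar limits, while the cusum keeps rising over the last two batches even though neither is flagged individually.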
One-Sample t-test
Introduction
This module performs a one-sample t-test comparing the population mean to a fixed threshold. The
data may be subsetted using the panel and grouping variables (note that only one panel subset and
one group subset can be selected). Further discussion of the test setting and assumptions is
contained in the guidance section of the module.
Method
PURPOSE:
To test for a difference between a population mean and a fixed threshold. This test should be used
in conjunction with exploratory data analysis (EDA) so that a complete picture of the data can be
viewed. If the assumptions for this test are not met, then nonparametric alternatives are the Sign
test and the Wilcoxon Signed Rank test. If the assumptions are met, then the t-test has more power
than the nonparametric tests.
DATA:
A simple or systematic random sample, x1,...,xn, from
the population of interest. The sample may or may not contain compositing.
ASSUMPTIONS:
A key assumption for the accurate computation of the t-test p-value is that the sample mean has an
approximate normal distribution. If the data are normal or the sample size is large (approximately 30
or larger), then the distribution of the sample mean is approximately normal (the more skewed or non-normal
the data, the larger the sample size must be). The Shapiro-Wilk test is used to test the normality of the
data. Other key assumptions for the t-test are randomness and independence of the samples. Randomness
means that sampling locations were selected at random from all possible sampling locations. Data
independence means that the value of one sample does not affect the value of another. Review the sampling
plan to ensure these assumptions are met.
LIMITATIONS AND ROBUSTNESS:
One-sample t methods are robust against the population distribution deviating moderately from normality.
However, these methods are not robust against outliers. In either case, nonparametric methods offer an alternative.
Finally, the one-sample t methods have difficulty dealing with non-detects. Substitution methods are an
alternative, but it would be best to use a nonparametric method. In general, very large sample sizes will
produce small p-values. Therefore, a very large sample size may cause a statistically significant result
that is not practically significant.
CHOICE OF SIGNIFICANCE LEVEL:
The significance level (or type I error rate) is the chance of rejecting the null hypothesis given the null
hypothesis is true. Typical values are 0.001, 0.01, 0.05, and 0.1. The specific significance level for a test
is set by assessing the consequences of making this decision error in the context of the specific study.
Generally speaking, the more severe the consequences of making this decision error, the lower the significance
level. For example, suppose we were testing whether a plutonium clean-up level had been reached. Since the
consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the site is clean when
it is in fact dirty, are quite severe, the significance level should be small, say 0.001 or 0.01. On the other
hand, suppose we were testing whether a stream is contaminated with arsenic. Since the consequences of rejecting
the null hypothesis when it is in fact true, i.e., concluding the stream is contaminated when it is not, are
not so severe, the significance level should be set higher, say 0.05 or 0.10. In conclusion, setting the
significance level is somewhat subjective and dependent upon your specific field and study goal.
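The one-sample t statistic and a decision at a chosen significance level can be sketched as below. This is a minimal illustration, not the module's implementation: the data and threshold are hypothetical, the one-sided alternative (mean below the threshold) is an assumption, and the critical value is taken from a standard t table because Python's standard library does not provide the t distribution.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(values, threshold):
    """Return the t statistic and its degrees of freedom."""
    n = len(values)
    standard_error = stdev(values) / sqrt(n)
    return (mean(values) - threshold) / standard_error, n - 1

concentrations = [3.9, 4.4, 4.1, 4.8, 3.7, 4.6, 4.2, 4.0, 4.5, 4.3]
t_stat, df = one_sample_t(concentrations, threshold=5.0)

# H0: mean >= 5.0 versus H1: mean < 5.0 at significance level 0.05.
# Lower-tail critical value for 9 degrees of freedom from a t table.
T_CRITICAL = -1.833
reject_null = t_stat < T_CRITICAL
print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_null}")
```

Choosing a smaller significance level moves the critical value further into the tail, so stronger evidence is needed before the null hypothesis is rejected.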
One-Sample Sign Test
Introduction
This module performs a one-sample Sign test comparing the population median to a fixed threshold. The
data may be subsetted using the panel and grouping variables (note that only one panel subset and
one group subset can be selected). Further discussion of the test setting and assumptions is
contained in the guidance section of the module.
Method
PURPOSE:
To test for a difference between the population median and a fixed threshold. This test should be used
in conjunction with exploratory data analysis (EDA) so that a complete picture of the data can be
viewed. This test has few assumptions; however, it has less power than the parametric t-test or the
nonparametric Wilcoxon Signed Rank test.
DATA:
A simple or systematic random sample, x1,...,xn, from
the population of interest. The sample may or may not contain compositing.
ASSUMPTIONS:
The Sign test can be used for any underlying population distribution. Key assumptions for the Sign test
are randomness and independence of the samples. Randomness means that sampling locations were selected
at random from all possible sampling locations. Data independence means that the value of one sample
does not affect the value of another. Review the sampling plan to ensure these assumptions are met.
LIMITATIONS AND ROBUSTNESS:
The Sign test has less power than the one-sample t-test or the Wilcoxon Signed Rank test. However, the
Sign test makes no distributional assumptions like the other two tests and it can handle non-detects if
the detection limit is below the threshold.
CHOICE OF SIGNIFICANCE LEVEL:
The significance level (or type I error rate) is the chance of rejecting the null hypothesis given the null
hypothesis is true. Typical values are 0.001, 0.01, 0.05, and 0.1. The specific significance level for a test
is set by assessing the consequences of making this decision error in the context of the specific study.
Generally speaking, the more severe the consequences of making this decision error, the lower the significance
level. For example, suppose we were testing whether a plutonium clean-up level had been reached. Since the
consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the site is clean when
it is in fact dirty, are quite severe, the significance level should be small, say 0.001 or 0.01. On the other
hand, suppose we were testing whether a stream is contaminated with arsenic. Since the consequences of rejecting
the null hypothesis when it is in fact true, i.e., concluding the stream is contaminated when it is not, are
not so severe, the significance level should be set higher, say 0.05 or 0.10. In conclusion, setting the
significance level is somewhat subjective and dependent upon your specific field and study goal.
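The Sign test uses only whether each value lies above or below the threshold, so its exact p-value follows from the binomial distribution with success probability 0.5. The sketch below is an illustration under assumptions: values equal to the threshold are dropped (a common convention that may differ from the module's), and the data are hypothetical.

```python
from math import comb

def sign_test(values, threshold):
    """Exact Sign test of H0: median = threshold. Returns counts and p-values."""
    above = sum(1 for x in values if x > threshold)
    below = sum(1 for x in values if x < threshold)
    n = above + below  # values equal to the threshold are dropped

    def binom_cdf(k):
        """P(X <= k) for X ~ Binomial(n, 0.5)."""
        return sum(comb(n, i) for i in range(k + 1)) / 2 ** n

    p_less = binom_cdf(above)      # H1: median < threshold
    p_greater = binom_cdf(below)   # H1: median > threshold
    p_two_sided = min(1.0, 2 * min(p_less, p_greater))
    return above, below, p_less, p_greater, p_two_sided

concentrations = [3.9, 4.4, 5.6, 4.8, 3.7, 4.6, 5.0, 4.0, 4.5, 4.3]
above, below, p_less, p_greater, p_two_sided = sign_test(concentrations, 5.0)
print(above, below, p_less, p_two_sided)
```

Because only the sign matters, a non-detect whose detection limit is below the threshold can be counted as a value below the threshold, which is why the test tolerates such non-detects.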
One-Sample Wilcoxon Signed Rank Test
Introduction
This module performs a one-sample Wilcoxon Signed Rank test that compares the population location (mean
or median) to a fixed threshold. The data may be subsetted using the panel and grouping variables
(note that only one panel subset and one group subset can be selected). Further discussion of the
test setting and assumptions is contained in the guidance section of the module.
Method
PURPOSE:
To test for a difference between the true location (mean or median) of a population and a fixed threshold.
If the underlying population distribution is approximately normal, then the one-sample t-test will have
more power (chance of rejecting the null hypothesis when it is false) than the Wilcoxon Signed Rank test.
For symmetric distributions, the Wilcoxon Signed Rank test will have more power than the Sign test. If
the sample size is small and the data are neither normally distributed nor approximately symmetric, then the
Sign test should be performed. If non-detects are present, then an alternative ranking system called
Gehan ranking is used.
DATA:
A simple or systematic random sample, x1,...,xn, from the
population of interest. The sample may or may not contain compositing.
ASSUMPTIONS:
The data set comes from an approximately symmetric distribution. Other key assumptions are randomness
and independence of the samples. Randomness means that sampling locations were selected at random from
all possible sampling locations. Data independence means that the value of one sample does not affect
the value of another. Review the sampling plan to ensure these assumptions are met.
LIMITATIONS AND ROBUSTNESS:
For large sample sizes (n > 50), the one-sample t-test is more robust against violations of its assumptions
than the Wilcoxon Signed Rank test. The Wilcoxon Signed Rank test may produce misleading results if there
are many tied data values. If possible, measurements should be recorded with sufficient accuracy so that a
large number of tied values do not occur. Estimated concentrations should be reported for data below the
detection limit, even if these estimates are negative, as their relative magnitude is of importance.
CHOICE OF SIGNIFICANCE LEVEL:
The significance level (or type I error rate) is the chance of rejecting the null hypothesis given the null
hypothesis is true. Typical values are 0.001, 0.01, 0.05, and 0.1. The specific significance level for a test
is set by assessing the consequences of making this decision error in the context of the specific study.
Generally speaking, the more severe the consequences of making this decision error, the lower the significance
level. For example, suppose we were testing whether a plutonium clean-up level had been reached. Since the
consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the site is clean when
it is in fact dirty, are quite severe, the significance level should be small, say 0.001 or 0.01. On the other
hand, suppose we were testing whether a stream is contaminated with arsenic. Since the consequences of rejecting
the null hypothesis when it is in fact true, i.e., concluding the stream is contaminated when it is not, are
not so severe, the significance level should be set higher, say 0.05 or 0.10. In conclusion, setting the
significance level is somewhat subjective and dependent upon your specific field and study goal.
Confidence Limits or Intervals for the Population Mean
Introduction
This module computes upper (UCL) or lower (LCL) confidence limits or confidence intervals for the population mean.
A confidence limit or interval specifies a region that covers the population mean with a certain level of
confidence.
Method
PURPOSE:
Student's-t and bootstrap methods are employed to compute confidence
limits or intervals. A confidence limit or interval specifies a region that covers the population mean with a
certain level of confidence. For example, the statement, 'The 95% UCL for the population mean is 7.8.', is
interpreted as, 'I can be 95% confident that the population mean lies below 7.8.'
DATA:
A simple or systematic random sample, x1,...,xn, from the
population of interest. The sample may or may not contain compositing.
ASSUMPTIONS:
A key assumption for the accurate computation of the Student's-t confidence
limit or interval is that the sample mean has an approximately normal distribution. If the data are normal or the
sample size is large (approximately 30 or larger), then the distribution of the sample mean is approximately normal
(the more skewed or non-normal the data, the larger the sample size must be). The bootstrap methods provide an
alternative if it is believed that the sample mean is not normally distributed. For any of the methods, it is assumed
that the data are randomly and independently selected. Randomness means that sampling locations were selected
at random from all possible sampling locations. Data independence means that the value of one sample does not
affect the value of another. Review the sampling plan to ensure these assumptions are met.
LIMITATIONS AND ROBUSTNESS:
The Student's-t method is robust against the distribution of the sample mean
deviating moderately from normality. When the deviation is more severe, the nonparametric bootstrap method offers
an alternative. However, none of the methods is robust against outliers. Finally, all of the methods have difficulty
dealing with non-detects. Substitution methods are the offered alternative.
STUDENT'S-t COMPUTATIONS:
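As a minimal sketch, assuming the usual one-sided Student's-t form UCL = mean + t * s / sqrt(n), the limit could be computed as follows. This is an illustration and not necessarily the module's exact computation. The function name and sample values are hypothetical, and the t critical value is supplied by the caller from a t table because the Python standard library has no t quantile function.

```python
import math
import statistics


def t_upper_confidence_limit(data, t_crit):
    """One-sided Student's-t upper confidence limit for the population mean.

    t_crit is the t critical value for the desired confidence level with
    len(data) - 1 degrees of freedom (assumption: looked up by the caller).
    """
    n = len(data)
    mean = statistics.mean(data)
    std_err = statistics.stdev(data) / math.sqrt(n)  # sample std dev uses n - 1
    return mean + t_crit * std_err


# Hypothetical concentrations; 1.833 is the 95th percentile of t with 9 df.
sample = [5.1, 6.3, 4.8, 7.2, 5.9, 6.1, 5.5, 6.8, 4.9, 6.0]
ucl95 = t_upper_confidence_limit(sample, t_crit=1.833)
```

A lower limit follows by subtracting instead of adding, and a two-sided interval uses the critical value for half the error rate in each tail.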
NONPARAMETRIC BOOTSTRAP COMPUTATIONS:
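A minimal sketch of one common nonparametric bootstrap variant, the percentile method, is given here. It is an assumption that this variant resembles what the module does; the function name, the number of resamples, and the data are hypothetical. The sample is resampled with replacement many times, the mean of each resample is recorded, and the UCL is read off as a percentile of those bootstrap means.

```python
import random
import statistics


def bootstrap_percentile_ucl(data, confidence=0.95, n_resamples=2000, seed=0):
    """Percentile-bootstrap upper confidence limit for the population mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    boot_means = sorted(
        statistics.fmean(rng.choices(data, k=n))  # resample with replacement
        for _ in range(n_resamples)
    )
    # Position of the requested percentile among the sorted bootstrap means.
    idx = min(n_resamples - 1, int(confidence * n_resamples))
    return boot_means[idx]


# Hypothetical right-skewed concentrations.
skewed = [0.4, 0.6, 0.7, 0.9, 1.1, 1.3, 1.8, 2.5, 4.0, 9.5]
ucl = bootstrap_percentile_ucl(skewed)
```

Because the limit is a percentile of resampled means, it can never fall outside the range of the observed data, which is one reason a single large outlier still dominates the result.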
STUDENTIZED BOOTSTRAP COMPUTATIONS:
Tolerance Limit
Introduction
This module computes an upper (UTL) or lower (LTL) tolerance limit. A tolerance limit specifies a region that covers a
certain portion of the population with a certain level of confidence. Tolerance limits can also be viewed as
confidence limits on population percentiles.
Method
PURPOSE:
Normal-based and nonparametric bootstrap methods are employed to compute upper (UTL) or lower (LTL) tolerance
limits. A tolerance limit specifies a region that covers a certain portion of the population with a certain
confidence. For example, the statement, 'The 95% UTL for 90% of the population is 7.8.', is interpreted as,
'I can be 95% confident that 90% of the population lies below 7.8.'
DATA:
A simple or systematic random sample, x1,...,xn, from the
population of interest. The sample may or may not contain compositing.
ASSUMPTIONS:
A key assumption for the accurate computation of the normal-based tolerance limit is that the underlying
distribution is approximately normal. In either the normal or bootstrap case, it is assumed
that the data are randomly and independently selected. Randomness means that sampling locations were selected
at random from all possible sampling locations. Data independence means that the value of one sample does not
affect the value of another. Review the sampling plan to ensure these assumptions are met.
LIMITATIONS AND ROBUSTNESS:
The normal-based method is robust against the population distribution deviating moderately from normality.
This deviation can lead to the normal-based LTL method producing negative estimates. In this case, the
nonparametric bootstrap method offers an alternative. However, neither the normal-based nor the bootstrap method is
robust against outliers. Finally, both methods have difficulty dealing with non-detects. Substitution methods
are the offered alternative.
NORMAL-BASED COMPUTATIONS:
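A normal-based upper tolerance limit is commonly written as mean + k * s, where k is a tolerance factor that depends on the sample size, the coverage, and the confidence level. The sketch below assumes that form and is not necessarily the module's exact computation. The function name and data are hypothetical, and k is passed in by the caller from a published table of one-sided tolerance factors.

```python
import statistics


def normal_upper_tolerance_limit(data, k_factor):
    """Normal-based upper tolerance limit: mean + k * sample standard deviation.

    k_factor is the one-sided tolerance factor for the chosen coverage and
    confidence (assumption: taken from a table by the caller).
    """
    return statistics.mean(data) + k_factor * statistics.stdev(data)


# Hypothetical concentrations; a tabulated factor of about 2.355 applies to
# n = 10, 90% coverage, and 95% confidence.
sample = [5.1, 6.3, 4.8, 7.2, 5.9, 6.1, 5.5, 6.8, 4.9, 6.0]
utl = normal_upper_tolerance_limit(sample, k_factor=2.355)
```

Replacing the plus sign with a minus sign gives the lower tolerance limit, which shows how a large standard deviation relative to the mean can push the LTL below zero.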
NONPARAMETRIC BOOTSTRAP COMPUTATIONS:
Two-Sample t-test
Introduction
This module performs a test for a difference between two population means. Select only a single
'sample 1' subset and a single 'sample 2' subset. The data can be further subsetted using the panel
and group variables. Further discussion of the test setting and assumptions is contained in the
method section of the module.
Method
PURPOSE:
The purpose of this module is to test for a difference between two population means. This test
should be used in conjunction with exploratory data analysis (EDA) so that a complete picture of
the data can be viewed. If the assumptions for this test are not met, then a nonparametric
alternative is the Wilcoxon Rank Sum test. If the assumptions are met, then the t-test has more
power than the nonparametric test.
DATA:
A simple or systematic random sample x1, x2,..., xm from one population, and an independent
simple or systematic random sample y1, y2,..., yn from the second population.
ASSUMPTIONS:
The two populations are independent. If not, then it is possible that a paired method could be used.
Both are approximately normally distributed or the sample sizes are large (m and n both at least 30).
If this is not the case, then a nonparametric procedure is an alternative. Finally, this test does not
assume the populations have equal variances.
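Because the test does not assume equal variances, one standard formulation is Welch's t statistic with Satterthwaite degrees of freedom. The sketch below assumes that formulation; the text does not confirm that the module uses exactly these formulas. The function name and data are hypothetical, and the p-value lookup is omitted because the Python standard library has no t distribution.

```python
import math
import statistics


def welch_t(x, y):
    """Welch's two-sample t statistic and approximate degrees of freedom."""
    m, n = len(x), len(y)
    vx = statistics.variance(x) / m  # squared standard error of mean(x)
    vy = statistics.variance(y) / n  # squared standard error of mean(y)
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation; lies between min(m, n) - 1 and m + n - 2.
    df = (vx + vy) ** 2 / (vx ** 2 / (m - 1) + vy ** 2 / (n - 1))
    return t, df


# Hypothetical samples with clearly unequal spread.
t_stat, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

The statistic is then compared with a t critical value at df degrees of freedom for the chosen significance level.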
LIMITATIONS AND ROBUSTNESS:
The two-sample t-test is robust to moderate violations of the assumption of normality. The t-test is
not robust against outliers because sample means and standard deviations are sensitive to outliers.
CHOICE OF SIGNIFICANCE LEVEL:
The significance level (or type I error rate) is the chance of rejecting the null hypothesis given the null
hypothesis is true. Typical values are 0.001, 0.01, 0.05, and 0.1. The specific significance level for a test
is set by assessing the consequences of making this decision error in the context of the specific study.
Generally speaking, the more severe the consequences of making this decision error, the lower the significance
level. For example, suppose we were testing whether a plutonium clean-up level had been reached. Since the
consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the site is clean when
it is in fact dirty, are quite severe, the significance level should be small, say 0.001 or 0.01. On the other
hand, suppose we were testing whether a stream is contaminated with arsenic. Since the consequences of rejecting
the null hypothesis when it is in fact true, i.e., concluding the stream is contaminated when it is not, are
not so severe, the significance level should be set higher, say 0.05 or 0.10. In conclusion, setting the
significance level is somewhat subjective and dependent upon your specific field and study goal.
Wilcoxon Rank Sum Test
Introduction
This module performs a test for a difference between two population locations (means or medians). This is
a nonparametric test, therefore knowledge of the precise form of the population distributions is not
necessary. Select only a single 'sample 1' subset and a single 'sample 2' subset. The data can be further
subsetted using the panel and group variables. Further discussion of the test setting and assumptions
is contained in the method section of the module.
Method
PURPOSE:
The Wilcoxon Rank Sum test performs a test for a difference between two population locations
(means or medians). This is a nonparametric method that relies on the relative rankings of data values.
Knowledge of the precise form of the population distributions is not necessary. The Wilcoxon Rank Sum (WRS)
test has less power than the two-sample t-test, but the assumptions are not as restrictive. This version
of the WRS test uses the Mantel approach which is equivalent to using the Gehan ranking system.
DATA:
A random sample x1, x2,..., xm from one population, and an independent random sample
y1, y2,..., yn from the second population.
ASSUMPTIONS:
The validity of the random sampling and independence assumptions should be verified by review of the
procedures used to select the sampling points. It is assumed that the two underlying distributions have
approximately the same shape (variance) and that the only difference between them is a shift in location.
A qualitative check of this assumption can be done by comparing boxplots or histograms. A quantitative
test is contained in the 'Compare Variances' module under the 'Verifying Assumptions' menu.
TEST STATISTIC:
The test statistic is computed using Mantel's approach which is equivalent to the Gehan approach. For each data
point in 'sample 1', compute the number of pooled data values that are definitely less than the data point in
question. This value is 0 for all non-detects. Also compute the number of pooled data values that are definitely
greater than the data point in question. Now, for each point in 'sample 1', subtract the greater-than total
from the less-than total. Finally, the test statistic is the sum of all these differences. If this value
is a large positive value, then 'sample 1' is elevated over 'sample 2'. If this value is a large negative value, then
'sample 2' is elevated over 'sample 1'.
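The counting procedure above can be sketched as follows. The data representation is an assumption made for illustration: each observation is a (value, detected) pair, and for a non-detect the value is the detection limit, meaning the true concentration lies somewhere below it. The function names are hypothetical, and the treatment of a detect exactly equal to a detection limit is also an assumption.

```python
def definitely_less(a, b):
    """True if observation a is definitely less than observation b."""
    a_val, a_det = a
    b_val, b_det = b
    if not b_det:
        return False  # nothing is definitely below an unknown value under a limit
    if a_det:
        return a_val < b_val
    return a_val <= b_val  # a non-detect "< limit" lies below any detect >= limit


def mantel_statistic(sample1, sample2):
    """Sum over sample 1 of (count definitely less) - (count definitely greater)."""
    pooled = sample1 + sample2
    total = 0
    for point in sample1:
        less = sum(definitely_less(other, point) for other in pooled)
        greater = sum(definitely_less(point, other) for other in pooled)
        total += less - greater
    return total


# Hypothetical data: (1.0, False) is a non-detect reported as "< 1.0".
site = [(5.0, True), (7.0, True), (1.0, False)]
background = [(2.0, True), (3.0, True), (1.0, False)]
statistic = mantel_statistic(site, background)
```

Swapping the two samples flips the sign of the statistic, which matches the interpretation that a large positive value means 'sample 1' is elevated and a large negative value means 'sample 2' is elevated.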
LIMITATIONS AND ROBUSTNESS:
The Wilcoxon Rank Sum test may produce misleading results if there are many tied data values. When many
ties are present, their relative ranks are the same, and this has the effect of diluting the statistical
power of the Wilcoxon test. If possible, results should be recorded with sufficient accuracy so that a
large number of tied values do not occur. Estimated concentrations should be reported for data below the
detection limit, even if these estimates are negative, as their relative magnitude to the rest of the data
is of importance.
CHOICE OF SIGNIFICANCE LEVEL:
The significance level (or type I error rate) is the chance of rejecting the null hypothesis given the null
hypothesis is true. Typical values are 0.001, 0.01, 0.05, and 0.1. The specific significance level for a test
is set by assessing the consequences of making this decision error in the context of the specific study.
Generally speaking, the more severe the consequences of making this decision error, the lower the significance
level. For example, suppose we were testing whether a plutonium clean-up level had been reached. Since the
consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the site is clean when
it is in fact dirty, are quite severe, the significance level should be small, say 0.001 or 0.01. On the other
hand, suppose we were testing whether a stream is contaminated with arsenic. Since the consequences of rejecting
the null hypothesis when it is in fact true, i.e., concluding the stream is contaminated when it is not, are
not so severe, the significance level should be set higher, say 0.05 or 0.10. In conclusion, setting the
significance level is somewhat subjective and dependent upon your specific field and study goal.
Quantile Test
Introduction
This module performs a test for a difference between the right-tails of two distributions. Further
discussion of the test setting and assumptions is contained in the method section of the module.
Method
PURPOSE:
This method performs a test for a shift to the right in the right-tail of the site or tested population
versus the reference population. This may be regarded as being equivalent to detecting if the values in
the right-tail of the site or tested distribution are generally larger than the values in the right-tail
of the reference distribution. The tested location in the right-tail is determined by the quantile value.
Values closer to 0.5 will result in a test similar to a test of center like the two-sample t-test or the
Wilcoxon Rank Sum test. Values closer to 1 will be testing the extreme right-tails of the distributions
and will be similar to the Slippage test. The desired quantile should be between 0.5 and 1. If the chosen
quantile is 0.5, then the test is equivalent to the Median Test described by Conover.
DATA:
A simple or systematic random sample, x1, x2,..., xn, from the reference population and an independent
simple or systematic random sample, y1, y2,..., ym, from the site population.
ASSUMPTIONS:
It is assumed that the populations have approximately the same shape (variance). The validity of the random
sampling and independence assumptions is assured by using proper randomization procedures, which can be
verified by reviewing the procedures used to select the sampling points.
LIMITATIONS AND ROBUSTNESS:
Since the Quantile test focuses on the right-tail, large outliers will bias results. Also, the Quantile
test says nothing about the center of the two distributions. Therefore, this test should be used in combination
with a location test like the t-test (if the data are normally distributed) or the Wilcoxon Rank Sum test.
If the percentage of non-detects in the pooled data is larger than the desired quantile percentage or if there is a
non-detect larger than the quantile of the pooled data, then the test cannot be performed.
CHOICE OF SIGNIFICANCE LEVEL:
The significance level (or type I error rate) is the chance of rejecting the null hypothesis given the null
hypothesis is true. Typical values are 0.001, 0.01, 0.05, and 0.1. The specific significance level for a test
is set by assessing the consequences of making this decision error in the context of the specific study.
Generally speaking, the more severe the consequences of making this decision error, the lower the significance
level. For example, suppose we were testing whether a plutonium clean-up level had been reached. Since the
consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the site is clean when
it is in fact dirty, are quite severe, the significance level should be small, say 0.001 or 0.01. On the other
hand, suppose we were testing whether a stream is contaminated with arsenic. Since the consequences of rejecting
the null hypothesis when it is in fact true, i.e., concluding the stream is contaminated when it is not, are
not so severe, the significance level should be set higher, say 0.05 or 0.10. In conclusion, setting the
significance level is somewhat subjective and dependent upon your specific field and study goal.
Slippage Test
Introduction
This module performs a test for a difference between the right-tails of two distributions. Further
discussion of the test setting and assumptions is contained in the method section of the module.
Method
PURPOSE:
This method tests for a shift to the right in the extreme right-tail of the site population (population 2) relative
to the reference, or background, population (population 1). This is equivalent to asking whether a set of the largest
values of the site distribution is larger than the maximum value of the background distribution.
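The quantity at the heart of this comparison is a simple count, sketched below. The function name and data are hypothetical, all values are assumed to be detects, and the step that converts the count into a significance statement is not shown.

```python
def slippage_count(reference, site):
    """Number of site values that exceed the largest reference value."""
    reference_max = max(reference)
    return sum(1 for value in site if value > reference_max)


# Hypothetical concentrations: two site values exceed the reference maximum of 3.4.
count = slippage_count(reference=[1.2, 3.4, 2.2], site=[3.0, 3.5, 4.1, 0.9])
```

A single unusually large reference value can drive the count to zero, which is why the test is sensitive to outliers.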
DATA:
A simple or systematic random sample, x1, x2,..., xn, from the reference population and
an independent simple or systematic random sample, y1, y2,..., ym, from the site population.
ASSUMPTIONS:
The validity of the random sampling and independence assumptions is assured by using proper randomization
procedures, which can be verified by reviewing the procedures used to select the sampling points.
LIMITATIONS AND ROBUSTNESS:
Since the Slippage test focuses on the right-tail, large outliers will bias results. Also, the Slippage test
says nothing about the center of the two distributions. Therefore, this test should be used in combination with
a location test like the t-test (if the data are normally distributed) or the Wilcoxon Rank Sum test. Non-detects
larger than the maximum background detect are eliminated from the analysis.
CHOICE OF SIGNIFICANCE LEVEL:
The significance level (or type I error rate) is the chance of rejecting the null hypothesis given the null
hypothesis is true. Typical values are 0.001, 0.01, 0.05, and 0.1. The specific significance level for a test
is set by assessing the consequences of making this decision error in the context of the specific study.
Generally speaking, the more severe the consequences of making this decision error, the lower the significance
level. For example, suppose we were testing whether a plutonium clean-up level had been reached. Since the
consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the site is clean when
it is in fact dirty, are quite severe, the significance level should be small, say 0.001 or 0.01. On the other
hand, suppose we were testing whether a stream is contaminated with arsenic. Since the consequences of rejecting
the null hypothesis when it is in fact true, i.e., concluding the stream is contaminated when it is not, are
not so severe, the significance level should be set higher, say 0.05 or 0.10. In conclusion, setting the
significance level is somewhat subjective and dependent upon your specific field and study goal.
Paired t-test
Introduction
Paired t-tests are used to test for a difference between paired population means.
The data from these studies are often collected on the same sampling units (individuals or objects) at different
times, for example, heart rate measurements taken before and after a stress test. These measurements
performed on the same sampling units are correlated, so they are handled differently than independent
measurements taken from two sampling units.
Method
PURPOSE:
To test for a difference between paired population means.
DATA:
Two data sets x1,...,xn and y1,...,yn that are paired, i.e.,
x1 is paired with y1, x2 is paired with y2, etc.
ASSUMPTIONS:
The main assumption associated with the Paired t-test is that the two data sets
come from approximately normal distributions. As a result of the Central Limit Theorem, this assumption may be relaxed
if the number of paired observations is greater than 30.
LIMITATIONS AND ROBUSTNESS:
Since there is really only one sample, a sample of differences, the limitations for the Paired t-test are the
same as those for the one-sample t-test. These methods are robust against population distributions deviating
moderately from normality. However, they are not robust against outliers. When the distribution is determined to
be distinctly non-normal or there are outliers in the distribution, the nonparametric Sign Test and the Wilcoxon
Signed Rank Test are valid alternative tests for paired data.
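The reduction to a single sample of differences can be sketched as follows, assuming the standard one-sample t statistic applied to the differences with a hypothesized mean difference of zero. The function name and the heart-rate values are hypothetical.

```python
import math
import statistics


def paired_t_statistic(x, y):
    """t statistic and degrees of freedom for the paired differences x - y."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    diffs = [xi - yi for xi, yi in zip(x, y)]
    n = len(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / std_err, n - 1


# Hypothetical heart rates for five individuals before and after a stress test.
before = [72, 75, 68, 80, 77]
after = [78, 79, 70, 85, 80]
t_stat, df = paired_t_statistic(before, after)
```

Only the differences enter the calculation, so the normality assumption and the sensitivity to outliers apply to the differences rather than to the two raw data sets.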
Testing Normality
Introduction
This module uses the Shapiro-Wilk test to determine if a population is normally distributed. While other tests for normality
are available, the Shapiro-Wilk test is one of the most powerful and is therefore the only one presented in this module.
Method
PURPOSE:
To determine if a single population is normally distributed.
SHAPIRO-WILK TEST:
The Shapiro-Wilk W test is one of the most powerful and widely used statistical tests for
normality. This test is similar to computing a correlation between the quantiles of the standard normal distribution and the
ordered values of a data set. If the points on a normal probability plot lie approximately along a straight line, the
W test statistic will be relatively large and the data are considered to be normal. However,
systematic deviations from a straight line indicate the distribution is non-normal. In this case, the W
test statistic will be relatively small. Tables of critical values for sample sizes up to 50 have been developed for determining
the significance of the W test statistic. However, this module can perform the
W test for data sets with sample sizes as large as 5000.
OTHER NORMALITY TESTS:
There are several other statistical tests for normality that are related to the Shapiro-Wilk test. D'Agostino's test for
sample sizes between 50 and 1000 and Royston's test for sample sizes up to 2000 are two such tests that approximate some of
the key quantities or parameters of the Shapiro-Wilk W test. The Filliben statistic,
also called the probability plot correlation coefficient, also is a test for normality and measures the linearity of the
points on the normal probability plot. Similar to the W test, if the normal probability
plot is approximately linear, then the correlation coefficient will be relatively large.
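A minimal sketch of a probability plot correlation coefficient is given below. It uses Blom plotting positions, (i - 0.375) / (n + 0.25), as an assumption; Filliben's statistic uses a slightly different set of positions, so the values will not match published Filliben tables exactly. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def probability_plot_correlation(data):
    """Correlation between the ordered data and standard normal quantiles."""
    n = len(data)
    ordered = sorted(data)
    std_normal = NormalDist()
    # Blom plotting positions (an assumption; see the note above).
    quantiles = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    mean_x = sum(ordered) / n
    mean_q = sum(quantiles) / n
    sxq = sum((x - mean_x) * (q - mean_q) for x, q in zip(ordered, quantiles))
    sxx = sum((x - mean_x) ** 2 for x in ordered)
    sqq = sum((q - mean_q) ** 2 for q in quantiles)
    return sxq / math.sqrt(sxx * sqq)
```

Values near 1 indicate that the normal probability plot is close to a straight line, while strongly skewed data produce a visibly smaller coefficient.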
Background Comparison
Introduction
This module compares the distributions of two data sets to determine if one is significantly elevated
over the other. This problem typically arises when comparing a potentially contaminated site to
background levels. Multiple sites may be compared to a single background data set. Subsets of the data
may be obtained by using the panel and group variables. However, only one panel subset and one
group subset can be analyzed at a time.
Method
A common problem encountered in environmental statistics involves comparing concentrations from a
particular location to background or reference concentrations. The procedure described in this section
is appropriate for characterization purposes. It is not intended to demonstrate a location has met
clean-up standards. This section provides a useful method for determining if the location is different
from background, implying that a contamination source may currently exist or may have existed at one time.
The data consist of simple or systematic random samples from the site populations (there may be more than one site) and
the background population of interest. The samples may or may not contain compositing. While subsets of the
data may be obtained using the panel and group variables, only a single panel subset and a single group
subset can be analyzed at any one time.
The comparison procedure involves simultaneously running four two-sample hypothesis tests each at a
specified significance level. The tests are the t-test, the Wilcoxon Rank Sum test, the Quantile test and
the Slippage test. The t-test and the Wilcoxon Rank Sum test compare the centers of the underlying
populations while the Quantile test and the Slippage test compare the right-tails of the two distributions.
A final conclusion should be made using exploratory data analysis (EDA) along with the results of all the
tests. To that end, summary statistics and boxplots also are presented with the test results.
Care needs to be taken when setting the significance level for an individual test because it is necessary
to control the false rejection rate for the entire procedure. In this case, four conclusions are drawn, each
at the individual significance level, so the overall false rejection rate can be considerably higher than the
individual level. Suppose an overall false rejection rate of at most 0.05 is desired. One method to set the
significance level is to divide 0.05 by the number of tests performed (a Bonferroni correction), which gives
0.05 / 4 = 0.0125. However, since the t-test and the Wilcoxon Rank Sum test are similar and the Quantile test
and the Slippage test are similar, this leads to a conservative overall rate. Therefore, as a compromise
between 0.05 and 0.0125, a significance level of 0.02 is suggested for each test to attain an overall false
rejection rate of approximately 0.05.
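The arithmetic behind this compromise can be sketched as follows. The function names are hypothetical. The rate the procedure actually attains depends on how strongly the four tests are correlated, which this sketch does not model; it only reports the values under perfect dependence, under independence, and the worst-case bound.

```python
def bonferroni_level(overall_rate, n_tests):
    """Individual significance level that guarantees the overall rate."""
    return overall_rate / n_tests


def overall_rate_range(individual_level, n_tests):
    """Overall false rejection rate under three dependence scenarios."""
    perfectly_dependent = individual_level  # all tests reject together
    independent = 1 - (1 - individual_level) ** n_tests
    worst_case = min(1.0, n_tests * individual_level)  # Bonferroni upper bound
    return perfectly_dependent, independent, worst_case


conservative = bonferroni_level(0.05, 4)
low, independent, high = overall_rate_range(0.02, 4)
```

With an individual level of 0.02, the overall rate lies between 0.02 and 0.08, and the pairwise similarity of the tests is what pulls it below the value of roughly 0.078 that would apply if the four tests were independent.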
Time Plot
Introduction
This module generates a time plot of the data. A time plot is simply the data plotted in chronological order. Detects
are plotted as solid circles and non-detects are plotted as open circles. Further discussion of the setting and assumptions
is contained in the guidance section of the module.
Method
PURPOSE: A time plot provides insight into how the data change over time. This plot
type aids in identifying trends or outliers. A trend is an increasing or decreasing change in the data and indicates a
change in the process that creates the data. Detects are plotted as solid circles and non-detects are plotted as open
circles. The module on trend estimation uses a nonparametric method to quantitatively estimate the trend.
DATA: A simple or systematic random sample over n
time points, x1,..., xn, from the population of interest. The sample may or may not contain
compositing. Several data sets from separate populations may be analyzed using the panel and group variables.
Trend Estimation
Introduction
This module estimates a trend in temporal data. Several summary and inferential statistics for
trend are computed and a graphical display of the confidence interval for slope is presented.
Further discussion of the setting and assumptions is contained in the guidance section of the
module.
Method
PURPOSE: Summarize the trend (if any) in monitoring data. Several
nonparametric methods are used. The summary table includes the sample size, number of detects,
Conover's estimate for the intercept, the Theil-Sen estimate of slope, Gilbert's confidence interval
for slope, and the Mann-Kendall test for trend p-value. Note that these are all nonparametric methods.
Sources include Hollander and Wolfe (1973) and Gilbert (1987).
DATA: A simple or systematic random sample over n
time points, x1,..., xn, from the population of interest.
The sample may or may not contain compositing. Several data sets from separate populations may be
analyzed using the panel and group variables.
ASSUMPTIONS: The data do not contain serial correlation. That
is, values close together in time are not correlated.
LIMITATIONS AND ROBUSTNESS: Currently, the nonparametric methods
require the date/time points be equally spaced. The reported value is used for a non-detect.
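Two of the quantities named in the summary table can be sketched with their textbook definitions: the Theil-Sen slope is the median of all pairwise slopes, and the Mann-Kendall S statistic is the number of increasing pairs minus the number of decreasing pairs. The function names and data are hypothetical, and the confidence interval, intercept, and p-value computations are not shown.

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(times, values):
    """Median of the slopes between all pairs of time points."""
    slopes = [
        (values[j] - values[i]) / (times[j] - times[i])
        for i, j in combinations(range(len(values)), 2)
        if times[j] != times[i]
    ]
    return median(slopes)


def mann_kendall_s(values):
    """Count of increasing pairs minus count of decreasing pairs, in time order."""
    s = 0
    for i, j in combinations(range(len(values)), 2):
        diff = values[j] - values[i]
        s += (diff > 0) - (diff < 0)  # +1, -1, or 0 for a tie
    return s


# Hypothetical equally spaced monitoring data with a steady upward trend.
times = [1, 2, 3, 4, 5]
values = [2.0, 4.1, 5.9, 8.2, 9.8]
slope = theil_sen_slope(times, values)
s_stat = mann_kendall_s(values)
```

Because both quantities depend only on pairwise comparisons, a single extreme value changes them far less than it would change a least-squares slope.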
Sampling -- One Sample
Introduction
This module provides guidance in setting up sampling plans for single populations; that is,
when the interest is in comparing a site mean to a pre-defined threshold (action level). An appropriate
sample size is calculated, and samples can be located on a map in ESRI shapefile format.
Method
PURPOSE:
The purpose of this module is to provide guidance in setting up sampling plans for testing the
mean of a single population. This scenario should be utilized when the mean is to be compared
to a pre-determined value (sometimes called a threshold or action level). An appropriate
sample size is returned, as well as sampling locations if a map is provided.
MAP:
If a map is provided, sampling locations will be returned. Two types of maps can be provided.
A map of the sampling region (provided as a set of polygons) will determine the region(s) in which
sampling is of interest. An overlay map can also be provided, to provide a better visual context
for viewing. If only an overlay map is provided, then the extent of the overlay map is considered
the sampling region.
DATA:
Sampling plans can utilize current data, if it is available. If data is available, it can be
used to estimate the site mean, proportion, or standard deviation. Also, site-specific information
may indicate a need to sample certain pre-specified locations. If sampling locations are provided,
then they will be included in the sampling plan.
Sampling -- Two Sample
Introduction
This module provides guidance in setting up sampling plans for two populations; that is,
when the interest is in comparing the means of two sampling regions, typically site versus background. Appropriate
sample sizes are calculated, and samples can be located on a map in ESRI shapefile format.
Method
PURPOSE:
The purpose of this module is to provide guidance in setting up sampling plans for testing the
means of two populations. This scenario should be utilized when the means are to be compared
for two different sampling regions, typically site versus background. Appropriate
sample sizes are returned, as well as sampling locations if appropriate maps are provided.
MAP:
If maps are provided, sampling locations will be returned. Two types of maps can be provided.
A map of a sampling region (provided as a set of polygons) will determine the region(s) in which
sampling is of interest. An overlay map can also be provided, to provide a better visual context
for viewing. At least two maps must be provided. If a single sampling region is supplied, then
the extent of the overlay map (excluding the sampling region) is assumed to define the other
sampling region.
DATA:
Sampling plans can utilize current data, if it is available. If data is available, it can be
used to estimate the site mean, proportion, or standard deviation. Also, site-specific information
may indicate a need to sample certain pre-specified locations. If sampling locations are provided,
then they will be included in the sampling plan.
