Statistics Toolbox

Methods

Summary Statistics

Introduction

This module computes up to 18 summary statistics for the data. The data may be subsetted using the panel and group variables. Further discussion of the setting and assumptions is contained in the guidance section of the module.

Method

PURPOSE: Summary statistics are quantities computed from the data that describe distributional characteristics of the data and provide estimates of the underlying population. This module computes many summary statistics that describe distributional characteristics such as location and spread. Standard measures of location are the mean and median which describe the center of the distribution. Common measures of distributional spread are the standard deviation, the minimum, and the maximum. Percentiles are measures of location and also provide insight into the spread of the distribution.
DATA: A simple or systematic random sample, x1,..., xn, from the population of interest. The sample may or may not contain compositing. Several data sets from separate populations may be analyzed using the panel and group variables.
ASSUMPTIONS: Inference requires a random sample from the population of interest and for most applications, the data need to be independent.
LIMITATIONS AND ROBUSTNESS: Interpretation of population characteristics using summary statistics provides a quantitative picture of the data distribution. A more complete picture of the data can be seen by combining summary statistics and one or more exploratory data analysis plots such as boxplots, histograms, and intensity plots.
NON-DETECTS: A non-detect is an analytical sample where the concentration is deemed to be lower than could be detected using the method employed by the laboratory. The number of non-detects and the values for the minimum and maximum non-detect are reported in the summary table. When calculating statistics, such as the mean and standard deviation, non-detects are usually replaced by either zero, half the detection limit or the detection limit itself for computational purposes. There are some concerns with each of these substitution methods. When substituting a zero, the estimate of the mean is biased low but the estimate of the standard deviation is biased high. When non-detects are replaced by the detection limit, the estimate of the mean is biased high but the estimate of the standard deviation is biased low. Therefore, it is usually recommended that half the detection limit is used for statistical calculations such as the mean and standard deviation.
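The effect of the three substitution rules described above can be seen in a small Python sketch. The data, the detection limit, and the helper name `substitute_non_detects` are hypothetical and only illustrate the direction of the biases; this is not the module's own implementation.

```python
from statistics import mean, stdev

def substitute_non_detects(values, detected, detection_limit, fraction):
    """Replace each non-detect with fraction * detection_limit."""
    return [v if d else fraction * detection_limit
            for v, d in zip(values, detected)]

# Hypothetical concentrations; False marks a non-detect.
values = [0.5, 1.2, 0.5, 2.4, 3.1, 0.5]
detected = [False, True, False, True, True, False]
detection_limit = 0.5

# fraction 0.0 = zero, 0.5 = half the detection limit, 1.0 = the limit itself
for label, fraction in [("zero", 0.0), ("half DL", 0.5), ("DL", 1.0)]:
    filled = substitute_non_detects(values, detected, detection_limit, fraction)
    print(f"{label:8s} mean={mean(filled):.3f} sd={stdev(filled):.3f}")
```

For this data set the mean rises and the standard deviation falls as the substituted value moves from zero up to the detection limit, with the half-limit substitution lying in between.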
QUANTILES: The kth quantile of a set of values divides the values so that a fraction k/100 of the values lies below it and a fraction 1 - (k/100) lies above it. The 5th, 25th, 75th and 95th quantiles are reported in the summary table. Note that the 50th quantile is the median.
CENTRAL TENDENCY: The mean and the median are two measures of central tendency reported in the summary statistics table. Mathematically, the mean is the average of the values in the data set whereas the median is the value found at the exact midpoint of the ordered data values. While both measures are used to characterize the center of a data set, there can be differences between the reported mean and median for a given distribution.
A majority of the differences between the mean and median arise when the distribution is skewed. In general, as the amount of skewness increases, the mean is pulled further away from the median in the direction of the tail exhibiting the skewness. Therefore, by examining the relationship between the mean and the median it is possible to visualize the shape of the distribution. For example, if the mean is much less than the median, the distribution is likely left skewed. If the mean is much larger than the median, the distribution is likely right skewed.
The mean and median will be close to one another when the distribution is approximately symmetric, and they are identical when the distribution is exactly symmetric. The converse does not hold in general: an identical mean and median do not by themselves guarantee symmetry.
MEASURES OF SPREAD: There is only one measure of spread directly reported in the summary statistics. That measure is the standard deviation. The standard deviation is a measure of how far the data observations are from the mean of the distribution. Therefore, a small standard deviation indicates many of the values are close to the mean (i.e. there is little spread in the data) whereas a large standard deviation indicates there are values far from the mean (i.e. there is a lot of spread in the data). Because the standard deviation is computed from deviations about the mean, it is sensitive to outliers. Therefore, for data sets with large outliers the interquartile range (described below) is the preferred measure of spread.
Other measures of spread that may be calculated from the summary statistics include the overall range and the interquartile range (IQR). The range of the data is calculated as the maximum minus the minimum values in the data set. This number presents the complete spread of the data. The IQR is calculated as the 75th quantile minus the 25th quantile. This number presents the spread associated with the middle half of the data.
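As a rough illustration of the quantile, range, and IQR definitions above, the Python sketch below uses linear interpolation between order statistics. The interpolation rule, the function name `percentile`, and the data are assumptions made for illustration; the quantile rule used by the module is not stated here and may differ.

```python
def percentile(data, k):
    """k-th percentile by linear interpolation between order statistics."""
    ordered = sorted(data)
    position = (k / 100) * (len(ordered) - 1)   # fractional index into ordered
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight

# Hypothetical data set of nine values.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0, 15.0]

median = percentile(data, 50)                         # 50th quantile
data_range = max(data) - min(data)                    # maximum minus minimum
iqr = percentile(data, 75) - percentile(data, 25)     # spread of the middle half

print(median, data_range, iqr)
```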
SHAPIRO-WILK: The Shapiro-Wilk test is used to determine if a set of data are normally distributed. The null hypothesis for this test is that the data are distributed normally, while the alternative hypothesis is that the data are not distributed normally. The p-value associated with the Shapiro-Wilk test is provided as one of the statistics in the summary table. Small p-values indicate there is enough evidence to reject the null hypothesis in favor of the alternative hypothesis.
UPPER CONFIDENCE LIMIT (UCL): A confidence interval is an interval estimate for the population parameter of interest. Confidence limits for the mean provide an upper and lower estimate for the mean. The interval estimate also gives an indication of how much uncertainty there is in our estimate of the true mean. The narrower the interval, the more precise our estimate.
UPPER TOLERANCE LIMIT (UTL): A tolerance interval is an interval estimate for a certain proportion of the population. The width of the interval is dependent upon the population variance, sample size, desired proportion of the population and confidence level. More specifically, the width is large if the variance is large, the sample size is small, the proportion is large or the confidence level is large. In general, the upper tolerance limit is the value below which a specified percentage of the population is expected to lie, at the stated confidence level.

Boxplot

Introduction

This module draws an original-scale boxplot and a log-scale boxplot for each panel-group combination. Separate plots are drawn for each panel variable while side-by-side boxplots are drawn for each group variable within a panel plot (see the example plot in the guidance). Further description of the boxplot is given in the guidance.

Method

Box and Whisker plots, also known as boxplots, are useful in situations where a picture of the distribution is desired, but it is not necessary or feasible to portray all the details of the data. A boxplot (see the example figure) displays several percentiles of the data set. It is a simple plot, yet provides insight into the location, shape, and spread of the data and underlying population. A simple boxplot contains only the 0th (minimum data value), 25th, 50th, 75th and 100th (maximum data value) percentiles. A more complex version included here plots potential outliers as open circles. These are points that lie more than 1.5 times the interquartile range (75th percentile minus the 25th percentile) beyond either end of the central box. Since the boxplot is compact (essentially one-dimensional), several can be placed on a single graph. This allows the user to compare the locations, spreads and shapes of several data sets or different groups within a single data set.
A boxplot divides the data into 4 sections, each containing 25% of the data. The length of the central box indicates the spread of the central 50% of the data, while the length of the whiskers shows the breadth of the tails of the distribution. The boxplot also demonstrates the shape of the data in the following manner. If the upper box and whisker are approximately the same length as the lower box and whisker, then the data are distributed symmetrically. If the upper box and whisker are longer than the lower box and whisker, then the data are right-skewed. If the upper box and whisker are shorter than the lower box and whisker, then the data are left-skewed.
Other options that can be displayed are non-detects (plotted as X's at the detection limit), detects (plotted as solid circles), and sample sizes (plotted at the top of each boxplot). The example plot has arsenic as the only panel variable while there are two group subsets, PrimInv and SedBkg. Also displayed are the sample sizes and non-detects.
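The quantities a boxplot draws can be computed directly, as in the Python sketch below. It assumes the common convention that each whisker ends at the most extreme data value within 1.5 times the interquartile range of the box, and it uses the "inclusive" quartile rule of the standard library; the function name `boxplot_summary` and the data are hypothetical, and the module's own conventions may differ.

```python
from statistics import quantiles

def boxplot_summary(data):
    """Box, whisker ends, and potential outliers for one boxplot."""
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)
    return {
        "box": (q1, median, q3),
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,   # drawn as open circles
    }

# Hypothetical data with one unusually large value.
summary = boxplot_summary([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0, 40.0])
print(summary)
```

Here the upper whisker stops at 12.0 and the value 40.0 is flagged as a potential outlier, since it lies more than 1.5 times the IQR above the 75th percentile.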

Example Boxplot

Histogram

Introduction

Histograms provide a visual method of assessing the location, shape and spread of the data. Also, extreme values and multiple modes can be identified. The details of the data are lost, but an overall picture of the data is obtained.

Method

A histogram is a visual representation of the data collected into groups. This graphical technique provides a method for identifying the underlying distribution of the data. The number of bins or classes is determined by the user or by using the range of data values. Once the number of bins is determined, the data is sorted into those bins. A histogram is a bar graph conveying the bins and the frequency of data points in each bin. Other forms of the histogram use a normalization of the bin frequencies for the heights of the bars. The two most common normalizations are relative frequencies (frequencies divided by sample size) and densities (relative frequency divided by the bin width). The figure below is an example of a histogram using frequencies.

Example Histogram of Frequencies

Histograms not only provide a means of identifying and characterizing the data, but they also make it possible to identify extreme values and multiple modes. In the process of creating the histogram, the details of the data are lost, but an overall picture of the data is obtained. As seen in the example above, individual data points are not apparent. Other kinds of plots, specifically the stem-and-leaf plot described in the next section, offer the same insights into the data as a histogram while retaining the individual data values on the plot. Therefore, stem-and-leaf plots can be more informative than histograms for smaller data sets.
One concern when creating a histogram is that its visual impression is sensitive to the number of bins chosen for the plot. A large number of bins increases data detail while fewer bins increase the smoothness of the histogram. A good starting point when choosing the number of bins is the square root of the sample size. Another factor to consider when constructing histogram bins is the choice of bin endpoints. Using convenient bin endpoints can improve the readability of the histogram. For an integer k, convenient endpoints would include multiples of 5 x 10^k units. For example, 0 to <5, 5 to <10, etc., as the bins in the figure or 1 to <1.5, 1.5 to <2, etc. Finally, when plotting a histogram for a continuous variable, e.g., concentration, it is necessary to decide on an endpoint convention; that is, what to do with data points that fall on the boundary of a bin. With discrete variables (e.g., family size), the bin boundaries can be placed halfway between the possible values. For the family size data, the intervals can span 1.5 to 2.5, 2.5 to 3.5, and so on, so that each whole number is centered within its bar.
As with all graphical techniques, the histogram is intended to increase understanding of the data. Since the histogram has subjective options, it is up to the user to select those that convey a clear picture of the data.
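The binning choices discussed in this section can be sketched in Python as follows. The sketch assumes the left-closed endpoint convention suggested by the "0 to <5, 5 to <10" example, so a value on a boundary goes into the bin that starts there; the function name `histogram` and the data are hypothetical.

```python
import math
from bisect import bisect_right

def histogram(data, edges):
    """Frequencies, relative frequencies, and densities for bins
    [edges[i], edges[i+1]); values outside the edges are not counted."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        i = bisect_right(edges, x) - 1      # bin whose left edge is <= x
        if 0 <= i < len(counts):
            counts[i] += 1
    n = len(data)
    relative = [c / n for c in counts]
    density = [r / (edges[i + 1] - edges[i]) for i, r in enumerate(relative)]
    return counts, relative, density

# Hypothetical sample of 16 values.
data = [1, 3, 5, 5, 7, 8, 10, 12, 14, 14, 16, 19, 2, 6, 9, 11]

suggested_bins = round(math.sqrt(len(data)))   # square-root starting point
edges = [0, 5, 10, 15, 20]                     # convenient multiples of 5

counts, relative, density = histogram(data, edges)
print(suggested_bins, counts, relative, density)
```

The densities multiplied by the bin widths sum to one, which is what distinguishes a density histogram from one drawn with raw or relative frequencies.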

Normal QQ-Plot

Introduction

This module draws a Normal QQ-plot for each group-panel combination. The underlying population distribution of the data is approximately normal if the plotted points approximately follow a straight line. Further details can be found in the guidance section.

Method

A normal quantile-quantile plot or qq-plot (also known as a normal probability plot) is a graph of the quantiles of a data set against the quantiles of the standard normal distribution. This is a technique used to qualitatively determine if the underlying population of the data is normally distributed. A statistical hypothesis test of normality called the Shapiro-Wilk test can also be used to evaluate whether the underlying distribution is normal. This test can be found under the 'Verifying Assumptions' menu option.
If the plotted points in the normal qq-plot are approximately linear, then this is an indication that the data are normally distributed. If the graph is not linear, then the departures from linearity give important information about how the data distribution deviates from normality. A nonlinear normal probability plot may be used to interpret the shape and tail behavior of the data. The quartile line, the line through the first and third quartiles, is added to the plot to help determine departures from linearity. Examine the relationship of the tails of the normal probability plot to the quartile line. If the data are exactly normal, then the qq-plot will fall exactly upon the quartile line.
If the qq-plot has an upper tail above the quartile line while the lower tail is below, then the distribution is symmetric with tails that are heavier than a normal distribution. If the qq-plot has an upper tail below the quartile line while the lower tail is above, then the distribution is symmetric with tails that are lighter than a normal distribution. If both tails of the qq-plot are above the quartile line, then the distribution is right-skewed. If both tails of the qq-plot are below the quartile line, then the distribution is left-skewed.
A normal probability plot can also be used to identify potential outliers. A data value (or a few data values) much larger or much smaller than the rest of the data will appear distant from the other values concentrated in the center of the graph.
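The points of a normal qq-plot and its quartile line can be computed as in the Python sketch below. The plotting positions (i - 0.5)/n and the "inclusive" quartile rule are assumptions chosen for illustration, as are the function names and the data; the module may use different conventions.

```python
from statistics import NormalDist, quantiles

def normal_qq_points(data):
    """Pairs (standard normal quantile, ordered data value)."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    # Plotting position (i + 0.5) / n for the zero-based index i.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

def quartile_line(data):
    """Slope and intercept of the line through the first and third quartiles."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    z1 = NormalDist().inv_cdf(0.25)
    z3 = NormalDist().inv_cdf(0.75)
    slope = (q3 - q1) / (z3 - z1)
    return slope, q1 - slope * z1

# Hypothetical data with a long upper tail.
data = [4.1, 5.0, 5.2, 5.9, 6.3, 6.8, 7.4, 9.6, 13.5]
points = normal_qq_points(data)
slope, intercept = quartile_line(data)

for z, x in (points[0], points[-1]):
    side = "above" if x > intercept + slope * z else "below"
    print(f"z={z:+.2f} value={x} lies {side} the quartile line")
```

For this data both the lowest and the highest point lie above the quartile line, which matches the right-skewed pattern described above.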

Example Normal QQ-plot

Intensity Plot

Introduction

Intensity plots are two-dimensional displays of the data values according to the spatial coordinates of the samples. Plots for subsets of the data can be obtained using the panel and group variables. Further details are available in the guidance section.

Method

Intensity plots offer displays not possible in other EDA plots. These plots allow the user to visualize the distribution of the data points and can be used to locate extreme values, view overall spatial trends, and view the degree of continuity among neighboring locations. An intensity plot is a scatterplot of the sample coordinates where a colored symbol is plotted at the set of coordinates rather than just a point. The color of the symbol is defined by a color rainbow that runs from green, to yellow, to orange, and finally to red. Green corresponds to the lowest values and red to the highest. Detects are plotted as circles and non-detects are plotted as squares.
Advanced options for the intensity plot are the minimum value, maximum value, and a threshold. All of these options modify the intensity color range. The minimum and maximum values will set the range of color values. This allows the user to compare more than one intensity plot on the same scale. Setting a threshold value (a UCL or some action level) modifies the plot in two ways. First, detected values below the threshold are plotted as circles while detected values above the threshold are plotted as triangles. Non-detects are still plotted as squares. Secondly, values below the threshold are plotted using a green to yellow palette (yellow being closer to the threshold) while values above the threshold are plotted using an orange to red palette (orange being closer to the threshold).
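One way to picture the symbol and color rules above is the Python sketch below. It is only a sketch under stated assumptions: color is expressed as a position between 0 (green) and 1 (red) with yellow at 1/3 and orange at 2/3, a value exactly equal to the threshold is treated as not above it, and the function name `symbol_style` is hypothetical rather than part of the module.

```python
def clamp01(x):
    """Restrict x to the interval [0, 1]."""
    return max(0.0, min(1.0, x))

def symbol_style(value, detected, vmin, vmax, threshold=None):
    """Return (marker, color position) for one sample location."""
    if not detected:
        marker = "square"                       # non-detects
    elif threshold is not None and value > threshold:
        marker = "triangle"                     # detects above the threshold
    else:
        marker = "circle"                       # all other detects

    if threshold is None:
        color = clamp01((value - vmin) / (vmax - vmin))
    elif value > threshold:
        # orange (2/3) to red (1), orange nearest the threshold
        color = 2 / 3 + clamp01((value - threshold) / (vmax - threshold)) / 3
    else:
        # green (0) to yellow (1/3), yellow nearest the threshold
        color = clamp01((value - vmin) / (threshold - vmin)) / 3
    return marker, color

print(symbol_style(2.0, True, 0.0, 10.0))
print(symbol_style(8.0, True, 0.0, 10.0, threshold=5.0))
print(symbol_style(1.0, False, 0.0, 10.0, threshold=5.0))
```

Passing the same minimum and maximum to every plot keeps the color scale comparable across plots, which is the purpose of those two options.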

Example Intensity Plot

Control Charts

Introduction

Control charts provide a visual display of monitoring statistics that indicate if a process has changed modes or produced anomalous results.

Method

Control charts are used in monitoring for detecting a variety of behaviors. The X-bar chart is useful for detecting a batch with an anomalous mean. The cusum chart is useful for detecting a gradual change in the mean. The S chart is useful for detecting an anomaly in the variation of a batch.
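A minimal Python sketch of the X-bar and cusum ideas follows. It assumes equal batch sizes, three-sigma limits, and a process standard deviation estimated by pooling the within-batch variances; these choices, the function names, and the data are illustrative assumptions and not necessarily how the module builds its charts.

```python
from itertools import accumulate
from math import sqrt
from statistics import mean, variance

def xbar_limits(batches):
    """Lower limit, center line, and upper limit for an X-bar chart."""
    n = len(batches[0])                                # assumes equal batch sizes
    center = mean([mean(b) for b in batches])
    pooled_sd = sqrt(mean([variance(b) for b in batches]))
    half_width = 3 * pooled_sd / sqrt(n)
    return center - half_width, center, center + half_width

def cusum(batch_means, target):
    """Cumulative sum of deviations of the batch means from a target."""
    return list(accumulate(m - target for m in batch_means))

# Hypothetical batches; the last one has a shifted mean.
batches = [
    [10.0, 11.0, 9.0, 10.0],
    [10.0, 10.0, 11.0, 9.0],
    [9.0, 10.0, 10.0, 11.0],
    [14.0, 15.0, 13.0, 14.0],
]
lcl, center, ucl = xbar_limits(batches)
flagged = [i for i, b in enumerate(batches) if not lcl <= mean(b) <= ucl]
print(lcl, center, ucl, flagged)
print(cusum([mean(b) for b in batches], target=10.0))
```

A steadily growing cusum signals a sustained shift in the mean even when no single batch mean falls outside the X-bar limits.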

One-Sample t-test

Introduction

This module performs a one-sample t-test comparing the population mean to a fixed threshold. The data may be subsetted using the panel and grouping variables (note that only one panel subset and one group subset can be selected). Further discussion of the test setting and assumptions is contained in the guidance section of the module.

Method

PURPOSE: To test for a difference between a population mean and a fixed threshold. This test should be used in conjunction with exploratory data analysis (EDA) so that a complete picture of the data can be viewed. If the assumptions for this test are not met, then nonparametric alternatives are the Sign test and the Wilcoxon Signed Ranks test. If the assumptions are met, then the t-test has more power than the nonparametric tests.
DATA: A simple or systematic random sample, x1,...,xn, from the population of interest. The sample may or may not contain compositing.
ASSUMPTIONS: A key assumption for the accurate computation of the t-test p-value is that the sample mean has an approximate normal distribution. If the data are normal or the sample size is large (approximately 30 or larger), then the distribution of the sample mean is approximately normal (the more skewed or non-normal the data, the larger the sample size must be). The Shapiro-Wilk test is used to test the normality of the data. Other key assumptions for the t-test are randomness and independence of the samples. Randomness means that sampling locations were selected at random from all possible sampling locations. Data independence means that the value of one sample does not affect the value of another. Review the sampling plan to ensure these assumptions are met.
LIMITATIONS AND ROBUSTNESS: One-sample t methods are robust against the population distribution deviating moderately from normality. However, these methods are not robust against outliers. In either case, nonparametric methods offer an alternative. Finally, the one-sample t methods have difficulty dealing with non-detects. Substitution methods are an alternative, but it would be best to use a nonparametric method. In general, very large sample sizes will produce small p-values. Therefore, a very large sample size may cause a statistically significant result that is not practically significant.
CHOICE OF SIGNIFICANCE LEVEL: The significance level (or type I error rate) is the chance of rejecting the null hypothesis given the null hypothesis is true. Typical values are 0.001, 0.01, 0.05, and 0.1. The specific significance level for a test is set by assessing the consequences of making this decision error in the context of the specific study. Generally speaking, the more severe the consequences of making this decision error, the lower the significance level. For example, suppose we were testing whether a plutonium clean-up level had been reached. Since the consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the site is clean when it is in fact dirty, are quite severe, the significance level should be small, say 0.001 or 0.01. On the other hand, suppose we were testing whether a stream is contaminated with arsenic. Since the consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the stream is contaminated when it is not, are not so severe, the significance level should be set higher, say 0.05 or 0.10. In conclusion, setting the significance level is somewhat subjective and dependent upon your specific field and study goal.
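For reference, the one-sample t statistic that underlies this test can be computed as in the Python sketch below. The function name `one_sample_t` and the data are hypothetical. The p-value is then read from a Student's t distribution with n - 1 degrees of freedom, which the Python standard library does not provide, so the sketch stops at the statistic.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(data, threshold):
    """t statistic and degrees of freedom for H0: population mean = threshold."""
    n = len(data)
    standard_error = stdev(data) / sqrt(n)     # estimated SD of the sample mean
    t_stat = (mean(data) - threshold) / standard_error
    return t_stat, n - 1

# Hypothetical sample compared with a threshold of 4.0.
t_stat, df = one_sample_t([4.0, 5.0, 6.0, 5.0, 5.0], threshold=4.0)
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

The statistic is the distance between the sample mean and the threshold measured in standard errors, which is why a very large sample can make a practically small difference statistically significant.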

One-Sample Sign Test

Introduction

This module performs a one-sample Sign test comparing the population median to a fixed threshold. The data may be subsetted using the panel and grouping variables (note that only one panel subset and one group subset can be selected). Further discussion of the test setting and assumptions is contained in the guidance section of the module.

Method

PURPOSE: To test for a difference between the population median and a fixed threshold. This test should be used in conjunction with exploratory data analysis (EDA) so that a complete picture of the data can be viewed. This test has few assumptions; however, it has less power than the parametric t-test or the nonparametric Wilcoxon Signed Ranks test.
DATA: A simple or systematic random sample, x1,...,xn, from the population of interest. The sample may or may not contain compositing.
ASSUMPTIONS: The Sign test can be used for any underlying population distribution. Key assumptions for the Sign test are randomness and independence of the samples. Randomness means that sampling locations were selected at random from all possible sampling locations. Data independence means that the value of one sample does not affect the value of another. Review the sampling plan to ensure these assumptions are met.
LIMITATIONS AND ROBUSTNESS: The Sign test has less power than the one-sample t-test or the Wilcoxon Signed Rank test. However, the Sign test makes no distributional assumptions like the other two tests and it can handle non-detects if the detection limit is below the threshold.
CHOICE OF SIGNIFICANCE LEVEL: The significance level (or type I error rate) is the chance of rejecting the null hypothesis given the null hypothesis is true. Typical values are 0.001, 0.01, 0.05, and 0.1. The specific significance level for a test is set by assessing the consequences of making this decision error in the context of the specific study. Generally speaking, the more severe the consequences of making this decision error, the lower the significance level. For example, suppose we were testing whether a plutonium clean-up level had been reached. Since the consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the site is clean when it is in fact dirty, are quite severe, the significance level should be small, say 0.001 or 0.01. On the other hand, suppose we were testing whether a stream is contaminated with arsenic. Since the consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the stream is contaminated when it is not, are not so severe, the significance level should be set higher, say 0.05 or 0.10. In conclusion, setting the significance level is somewhat subjective and dependent upon your specific field and study goal.
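The Sign test only needs the number of observations above and below the threshold, so its exact p-value can be sketched in a few lines of Python. The two-sided form, the dropping of values equal to the threshold, the function name `sign_test`, and the data are assumptions for illustration. A non-detect whose detection limit is below the threshold would simply be counted as below.

```python
from math import comb

def sign_test(data, threshold):
    """Exact two-sided sign test of H0: population median = threshold."""
    above = sum(1 for x in data if x > threshold)
    below = sum(1 for x in data if x < threshold)
    n = above + below                    # values equal to the threshold are dropped
    k = min(above, below)
    # Under H0 each sign is a fair coin flip: Binomial(n, 1/2).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return above, below, min(1.0, 2 * tail)

# Hypothetical sample compared with a threshold of 5.
above, below, p_value = sign_test([6, 7, 8, 9, 10, 11, 12, 3], threshold=5)
print(above, below, p_value)
```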

One-Sample Wilcoxon Signed Rank Test

Introduction

This module performs a one-sample Wilcoxon Signed Rank test that compares the population location (mean or median) to a fixed threshold. The data may be subsetted using the panel and grouping variables (note that only one panel subset and one group subset can be selected). Further discussion of the test setting and assumptions is contained in the guidance section of the module.

Method

PURPOSE: To test for a difference between the true location (mean or median) of a population and a fixed threshold. If the underlying population distribution is approximately normal, then the one-sample t-test will have more power (chance of rejecting the null hypothesis when it is false) than the Wilcoxon Signed Rank test. For symmetric distributions, the Wilcoxon Signed Rank test will have more power than the Sign test. If the sample size is small and the data are neither normally distributed nor approximately symmetric, then the Sign test should be performed. If non-detects are present, then an alternative ranking system called Gehan ranking is used.
DATA: A simple or systematic random sample, x1,...,xn, from the population of interest. The sample may or may not contain compositing.
ASSUMPTIONS: The data set comes from an approximately symmetric distribution. Other key assumptions are randomness and independence of the samples. Randomness means that sampling locations were selected at random from all possible sampling locations. Data independence means that the value of one sample does not affect the value of another. Review the sampling plan to ensure these assumptions are met.
LIMITATIONS AND ROBUSTNESS: For large sample sizes (n > 50), the one-sample t-test is more robust against violations of its assumptions than the Wilcoxon Signed Rank test. The Wilcoxon Signed Rank test may produce misleading results if there are many tied data values. If possible, measurements should be recorded with sufficient accuracy so that a large number of tied values do not occur. Estimated concentrations should be reported for data below the detection limit, even if these estimates are negative, as their relative magnitude is of importance.
CHOICE OF SIGNIFICANCE LEVEL: The significance level (or type I error rate) is the chance of rejecting the null hypothesis given the null hypothesis is true. Typical values are 0.001, 0.01, 0.05, and 0.1. The specific significance level for a test is set by assessing the consequences of making this decision error in the context of the specific study. Generally speaking, the more severe the consequences of making this decision error, the lower the significance level. For example, suppose we were testing whether a plutonium clean-up level had been reached. Since the consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the site is clean when it is in fact dirty, are quite severe, the significance level should be small, say 0.001 or 0.01. On the other hand, suppose we were testing whether a stream is contaminated with arsenic. Since the consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the stream is contaminated when it is not, are not so severe, the significance level should be set higher, say 0.05 or 0.10. In conclusion, setting the significance level is somewhat subjective and dependent upon your specific field and study goal.
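The signed rank statistic itself can be sketched in Python as follows. This sketch ranks the absolute differences from the threshold, gives tied values their average rank, and uses a large-sample normal approximation without tie or continuity corrections for the two-sided p-value; those choices, the function name `signed_rank`, and the data are assumptions for illustration. Gehan ranking for non-detects is not shown.

```python
from math import sqrt
from statistics import NormalDist

def signed_rank(data, threshold):
    """W+ (sum of ranks of positive differences), z score, two-sided p-value."""
    diffs = sorted((x - threshold for x in data if x != threshold), key=abs)
    n = len(diffs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[j + 1]) == abs(diffs[i]):
            j += 1                                   # extend over a run of ties
        for m in range(i, j + 1):
            ranks[m] = (i + j) / 2 + 1               # average of 1-based ranks
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mu = n * (n + 1) / 4                             # mean of W+ under H0
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24)     # SD of W+ under H0, no ties
    z = (w_plus - mu) / sigma
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, z, p_value

# Hypothetical sample compared with a threshold of 5.0.
w_plus, z, p_value = signed_rank(
    [7.0, 3.0, 9.0, 6.5, 8.0, 10.0, 4.0, 11.0], threshold=5.0)
print(w_plus, round(z, 3), round(p_value, 4))
```

For small samples an exact reference distribution is preferable to the normal approximation used here.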

Confidence Limits or Intervals for the Population Mean

Introduction

This module computes upper (UCL) or lower (LCL) confidence limits or confidence intervals for the population mean. A confidence limit or interval specifies a region that covers the population mean with a certain level of confidence.

Method

PURPOSE: Student's-t and bootstrap methods are employed to compute confidence limits or intervals. A confidence limit or interval specifies a region that covers the population mean with a certain level of confidence. For example, the statement, 'The 95% UCL for the population mean is 7.8.', is interpreted as, 'I can be 95% confident that the population mean lies below 7.8.'
DATA: A simple or systematic random sample, x1,..., xn, from the population of interest. The sample may or may not contain compositing.
ASSUMPTIONS: A key assumption for the accurate computation of the Student's-t confidence limit or interval is that the sample mean has an approximately normal distribution. If the data are normal or the sample size is large (approximately 30 or larger), then the distribution of the sample mean is approximately normal (the more skewed or non-normal the data, the larger the sample size must be). The bootstrap methods provide an alternative if it is believed that the sample mean is not normally distributed. For any of the methods, it is assumed that the data are randomly and independently selected. Randomness means that sampling locations were selected at random from all possible sampling locations. Data independence means that the value of one sample does not affect the value of another. Review the sampling plan to ensure these assumptions are met.
LIMITATIONS AND ROBUSTNESS: The Student's-t method is robust against the distribution of the sample mean deviating moderately from normality. When the deviation is more severe, the nonparametric bootstrap method offers an alternative. However, none of the methods is robust against outliers. Finally, all of the methods have difficulty dealing with non-detects. Substitution methods are the offered alternative.
STUDENT'S-t COMPUTATIONS:
NONPARAMETRIC BOOTSTRAP COMPUTATIONS:
STUDENTIZED BOOTSTRAP COMPUTATIONS:
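The computations for these methods are not reproduced in this text. As a rough illustration only, the sketch below computes a Student's-t UCL as the sample mean plus a t quantile times the standard error, and a percentile-type nonparametric bootstrap UCL taken from resampled means. The function names, the example data, the number of resamples, and the use of the percentile bootstrap are assumptions made for illustration; they are not necessarily what the module does. The t quantile is passed in by the caller (1.833 is the tabled 95% value for 9 degrees of freedom).

```python
import math
import random
import statistics


def student_t_ucl(data, t_crit):
    """Upper confidence limit: mean + t_crit * s / sqrt(n).

    t_crit is the Student's-t quantile for the chosen confidence level and
    n - 1 degrees of freedom, supplied by the caller (e.g. from a table).
    """
    n = len(data)
    return statistics.mean(data) + t_crit * statistics.stdev(data) / math.sqrt(n)


def bootstrap_percentile_ucl(data, confidence=0.95, n_boot=2000, seed=0):
    """Percentile-bootstrap UCL: the `confidence` quantile of resampled means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Resample the data with replacement and record the mean of each resample.
    means = sorted(statistics.mean(rng.choices(data, k=n)) for _ in range(n_boot))
    # Position of the empirical quantile among the sorted bootstrap means.
    idx = min(n_boot - 1, math.ceil(confidence * n_boot) - 1)
    return means[idx]


# Hypothetical example data (10 measurements).
sample = [4.1, 5.3, 6.8, 5.9, 7.2, 4.7, 6.1, 5.5, 6.4, 5.0]
t_ucl = student_t_ucl(sample, t_crit=1.833)  # 95% t quantile, 9 d.f.
boot_ucl = bootstrap_percentile_ucl(sample)
print(f"Student's-t 95% UCL: {t_ucl:.3f}")
print(f"Bootstrap 95% UCL:   {boot_ucl:.3f}")
```

Both limits sit above the sample mean; with skewed data the two can differ noticeably, which is the situation in which the bootstrap alternative is of interest.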

Tolerance Limit

Introduction

This module computes an upper (UTL) or lower (LTL) tolerance limit. A tolerance limit specifies a region that covers a certain portion of the population with a certain level of confidence. Tolerance limits can also be viewed as confidence limits on population percentiles.

Method

PURPOSE: Normal-based and nonparametric bootstrap methods are employed to compute upper (UTL) or lower (LTL) tolerance limits. A tolerance limit specifies a region that covers a certain portion of the population with a certain confidence. For example, the statement, 'The 95% UTL for 90% of the population is 7.8.', is interpreted as, 'I can be 95% confident that 90% of the population lies below 7.8.'
DATA: A simple or systematic random sample, x1,..., xn, from the population of interest. The sample may or may not contain compositing.
ASSUMPTIONS: A key assumption for the accurate computation of the normal-based tolerance limit is that the underlying distribution is approximately normal. In either the normal or bootstrap case, it is assumed that the data are randomly and independently selected. Randomness means that sampling locations were selected at random from all possible sampling locations. Data independence means that the value of one sample does not affect the value of another. Review the sampling plan to ensure these assumptions are met.
LIMITATIONS AND ROBUSTNESS: The normal-based method is robust against the population distribution deviating moderately from normality. Such a deviation can, however, lead to the normal-based LTL method producing negative estimates. In this case, the nonparametric bootstrap method offers an alternative. However, neither the normal-based nor the bootstrap method is robust against outliers. Finally, both methods have difficulty dealing with non-detects. Substitution methods are the offered alternative.
NORMAL-BASED COMPUTATIONS:
NONPARAMETRIC BOOTSTRAP COMPUTATIONS:
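The computations themselves are not reproduced in this text. Since a tolerance limit can be viewed as a confidence limit on a population percentile, one plausible reading of the nonparametric bootstrap approach is sketched below: resample the data, compute the coverage percentile of each resample, and take the confidence-level quantile of those percentiles as the UTL. The function names, the inverse-CDF quantile definition, the example data, and the resample count are all assumptions for illustration, not a statement of the module's algorithm.

```python
import math
import random


def empirical_quantile(values, p):
    """Inverse-CDF empirical quantile: the smallest data value with at
    least a fraction p of the data at or below it."""
    ordered = sorted(values)
    idx = max(0, math.ceil(p * len(ordered)) - 1)
    return ordered[idx]


def bootstrap_utl(data, coverage=0.90, confidence=0.95, n_boot=2000, seed=0):
    """Percentile-bootstrap upper tolerance limit (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Coverage percentile of each resample drawn with replacement.
    boot_percentiles = [
        empirical_quantile(rng.choices(data, k=n), coverage)
        for _ in range(n_boot)
    ]
    # Upper confidence limit on that percentile.
    return empirical_quantile(boot_percentiles, confidence)


# Hypothetical example data (20 measurements).
sample = [2.1, 3.4, 2.8, 5.6, 4.4, 3.9, 2.5, 6.1, 3.3, 4.8,
          3.0, 2.9, 5.1, 4.0, 3.6, 7.3, 2.7, 4.2, 3.8, 5.0]
utl = bootstrap_utl(sample)
print(f"95% UTL for 90% of the population: {utl}")
```

A limit built this way can never exceed the largest observed value, so for small samples and high coverage it may understate the upper tail.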

Two-Sample t-test

Introduction

This module performs a test for a difference between two population means. Select only a single 'sample 1' subset and a single 'sample 2' subset. The data can be further subsetted using the panel and group variables. Further discussion of the test setting and assumptions is contained in the method section of the module.

Method

PURPOSE: The purpose of this module is to test for a difference between two population means. This test should be used in conjunction with exploratory data analysis (EDA) so that a complete picture of the data can be viewed. If the assumptions for this test are not met, then a nonparametric alternative is the Wilcoxon Rank Sum test. If the assumptions are met, then the t-test has more power than the nonparametric test.
DATA: A simple or systematic random sample x1, x2, . . . , xm from one population, and an independent simple or systematic random sample y1, y2, . . . , yn from the second population.
ASSUMPTIONS: The two populations are independent. If not, then it is possible that a paired method could be used. Both are approximately normally distributed or the sample sizes are large (m and n both at least 30). If this is not the case, then a nonparametric procedure is an alternative. Finally, this test does not assume the populations have equal variances.
LIMITATIONS AND ROBUSTNESS: The two-sample t-test is robust to moderate violations of the assumption of normality. The t-test is not robust against outliers because sample means and standard deviations are sensitive to outliers.
CHOICE OF SIGNIFICANCE LEVEL: The significance level (or type I error rate) is the chance of rejecting the null hypothesis given the null hypothesis is true. Typical values are 0.001, 0.01, 0.05, and 0.1. The specific significance level for a test is set by assessing the consequences of making this decision error in the context of the specific study. Generally speaking, the more severe the consequences of making this decision error, the lower the significance level. For example, suppose we were testing whether a plutonium clean-up level had been reached. Since the consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the site is clean when it is in fact dirty, are quite severe, the significance level should be small, say 0.001 or 0.01. On the other hand, suppose we were testing whether a stream is contaminated with arsenic. Since the consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the stream is contaminated when it is not, are not so severe, the significance level should be set higher, say 0.05 or 0.10. In conclusion, setting the significance level is somewhat subjective and dependent upon your specific field and study goal.
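Because the two-sample t-test described above does not assume equal variances, its statistic is presumably of the unequal-variance (Welch) form. The sketch below computes that statistic and the Welch-Satterthwaite approximate degrees of freedom. The name "Welch" and the function and data names are not taken from this text and are used here as assumptions; the p-value lookup against the t distribution is left out.

```python
import math
import statistics


def welch_t(x, y):
    """Unequal-variance two-sample t statistic and approximate d.f."""
    m, n = len(x), len(y)
    # Squared standard error contributed by each sample.
    vx = statistics.variance(x) / m
    vy = statistics.variance(y) / n
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (m - 1) + vy ** 2 / (n - 1))
    return t, df


# Hypothetical example data.
sample1 = [1.0, 2.0, 3.0, 4.0, 5.0]
sample2 = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(sample1, sample2)
print(f"t = {t_stat:.4f}, approximate d.f. = {dof:.2f}")
```

The degrees of freedom fall between the smaller of m - 1 and n - 1 and m + n - 2, and are generally not an integer.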

Wilcoxon Rank Sum Test

Introduction

This module performs a test for a difference between two population locations (means or medians). This is a nonparametric test; therefore, knowledge of the precise form of the population distributions is not necessary. Select only a single 'sample 1' subset and a single 'sample 2' subset. The data can be further subsetted using the panel and group variables. Further discussion of the test setting and assumptions is contained in the method section of the module.

Method

PURPOSE: The Wilcoxon Rank Sum test performs a test for a difference between two population locations (means or medians). This is a nonparametric method that relies on the relative rankings of data values. Knowledge of the precise form of the population distributions is not necessary. The Wilcoxon Rank Sum (WRS) test has less power than the two-sample t-test, but the assumptions are not as restrictive. This version of the WRS test uses the Mantel approach, which is equivalent to using the Gehan ranking system.
DATA: A random sample x1, x2,..., xm from one population, and an independent random sample y1, y2, ..., yn from the second population.
ASSUMPTIONS: The validity of the random sampling and independence assumptions should be verified by review of the procedures used to select the sampling points. The two underlying distributions are assumed to have approximately the same shape (variance), so that the only difference between them is a shift in location. A qualitative check of this assumption can be done by comparing boxplots or histograms. A quantitative test is contained in the 'Compare Variances' module under the 'Verifying Assumptions' menu.
TEST STATISTIC: The test statistic is computed using Mantel's approach, which is equivalent to the Gehan approach. For each data point in 'sample 1', compute the number of pooled data values that are definitely less than the data point in question. This value is 0 for all non-detects. Also compute the number of pooled data values that are definitely greater than the data point in question. Now, for each point in 'sample 1', compute the difference of the less than total and the greater than total. Finally, the test statistic is the sum of all the differences. If this value is a large positive value, then 'sample 1' is elevated over 'sample 2'. If this value is a large negative value, then 'sample 2' is elevated over 'sample 1'.
LIMITATIONS AND ROBUSTNESS: The Wilcoxon Rank Sum test may produce misleading results if there are many tied data values. When many ties are present, their relative ranks are the same, and this has the effect of diluting the statistical power of the Wilcoxon test. If possible, results should be recorded with sufficient accuracy so that a large number of tied values do not occur. Estimated concentrations should be reported for data below the detection limit, even if these estimates are negative, as their relative magnitude to the rest of the data is of importance.
CHOICE OF SIGNIFICANCE LEVEL: The significance level (or type I error rate) is the chance of rejecting the null hypothesis given the null hypothesis is true. Typical values are 0.001, 0.01, 0.05, and 0.1. The specific significance level for a test is set by assessing the consequences of making this decision error in the context of the specific study. Generally speaking, the more severe the consequences of making this decision error, the lower the significance level. For example, suppose we were testing whether a plutonium clean-up level had been reached. Since the consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the site is clean when it is in fact dirty, are quite severe, the significance level should be small, say 0.001 or 0.01. On the other hand, suppose we were testing whether a stream is contaminated with arsenic. Since the consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the stream is contaminated when it is not, are not so severe, the significance level should be set higher, say 0.05 or 0.10. In conclusion, setting the significance level is somewhat subjective and dependent upon your specific field and study goal.
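The counting procedure in the TEST STATISTIC paragraph can be sketched as follows. Each observation is represented as a (value, detected) pair, where the value of a non-detect is its detection limit. That representation, the function names, and the handling of ties between a detection limit and a detected value are assumptions made for this illustration. Only the raw statistic is computed; its variance and p-value are not covered.

```python
def definitely_less(a, b):
    """True if observation a is certainly below observation b.

    Each observation is (value, detected). For a non-detect, value is the
    detection limit and the true concentration lies somewhere below it.
    """
    a_val, a_det = a
    b_val, b_det = b
    if not b_det:
        # Nothing is known to lie below a non-detect.
        return False
    if a_det:
        return a_val < b_val
    # a is a non-detect: true value < a_val, so a_val <= b_val suffices
    # (tie convention assumed for this sketch).
    return a_val <= b_val


def gehan_statistic(sample1, sample2):
    """Sum over 'sample 1' of (#pooled definitely less - #definitely greater)."""
    pooled = sample1 + sample2
    total = 0
    for point in sample1:
        less = sum(definitely_less(other, point) for other in pooled)
        greater = sum(definitely_less(point, other) for other in pooled)
        total += less - greater
    return total


# Hypothetical data: all detects, 'sample 1' clearly elevated.
s1 = [(5.0, True), (7.0, True)]
s2 = [(1.0, True), (2.0, True)]
print(gehan_statistic(s1, s2))   # large positive: sample 1 elevated
print(gehan_statistic(s2, s1))   # large negative: sample 2 elevated
```

Comparisons between two points of 'sample 1' cancel in the sum, so only the between-sample comparisons contribute to the final value.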

Quantile Test

Introduction

This module performs a test for a difference between the right-tails of two distributions. Further discussion of the test setting and assumptions is contained in the method section of the module.

Method

PURPOSE: This method performs a test for a shift to the right in the right-tail of the site or tested population versus the reference population. This may be regarded as being equivalent to detecting if the values in the right-tail of the site or tested distribution are generally larger than the values in the right-tail of the reference distribution. The tested location in the right-tail is determined by the quantile value. Values closer to 0.5 will result in a test similar to a test of center like the two-sample t-test or the Wilcoxon Rank Sum test. Values closer to 1 will test the extreme right-tails of the distributions and will be similar to the Slippage test. The desired quantile should be between 0.5 and 1. If the chosen quantile is 0.5, then the test is equivalent to the Median Test described by Conover.
DATA: A simple or systematic random sample, x1, x2,..., xn, from the reference population and an independent simple or systematic random sample, y1, y2,..., ym, from the site population.
ASSUMPTIONS: It is assumed that the populations have approximately the same shape (variance). The validity of the random sampling and independence assumptions is assured by using proper randomization procedures, which can be verified by reviewing the procedures used to select the sampling points.
LIMITATIONS AND ROBUSTNESS: Since the Quantile test focuses on the right-tail, large outliers will bias results. Also, the Quantile test says nothing about the center of the two distributions. Therefore, this test should be used in combination with a location test like the t-test (if the data are normally distributed) or the Wilcoxon Rank Sum test. If the percentage of non-detects in the pooled data is larger than the desired quantile percentage or if there is a non-detect larger than the quantile of the pooled data, then the test cannot be performed.
CHOICE OF SIGNIFICANCE LEVEL: The significance level (or type I error rate) is the chance of rejecting the null hypothesis given the null hypothesis is true. Typical values are 0.001, 0.01, 0.05, and 0.1. The specific significance level for a test is set by assessing the consequences of making this decision error in the context of the specific study. Generally speaking, the more severe the consequences of making this decision error, the lower the significance level. For example, suppose we were testing whether a plutonium clean-up level had been reached. Since the consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the site is clean when it is in fact dirty, are quite severe, the significance level should be small, say 0.001 or 0.01. On the other hand, suppose we were testing whether a stream is contaminated with arsenic. Since the consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the stream is contaminated when it is not, are not so severe, the significance level should be set higher, say 0.05 or 0.10. In conclusion, setting the significance level is somewhat subjective and dependent upon your specific field and study goal.
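The exact computation used by the Quantile test is not given in this text. One common way to set up such a test, sketched below as an assumption, is to find the chosen quantile of the pooled data, count how many of the r pooled values above it come from the site sample, and compare that count with the hypergeometric distribution that applies when both samples come from the same population. The function name, the inverse-CDF quantile definition, and the example data are hypothetical, and non-detects and ties are not handled.

```python
import math


def quantile_test(reference, site, q=0.8):
    """Return (r, k, p_value) for a right-tail comparison at quantile q.

    r: number of pooled values above the pooled q-quantile
    k: number of those values that come from the site sample
    p_value: hypergeometric chance of k or more site values among the r
             largest when the two samples share one distribution
    """
    n, m = len(reference), len(site)
    pooled = sorted(reference + site)
    total = n + m
    # Inverse-CDF empirical quantile of the pooled data (assumed definition).
    cutoff = pooled[max(0, math.ceil(q * total) - 1)]
    r = sum(1 for v in pooled if v > cutoff)
    k = sum(1 for v in site if v > cutoff)
    # Upper tail of the hypergeometric distribution, summed as exact integers.
    favourable = sum(
        math.comb(m, i) * math.comb(n, r - i) for i in range(k, min(m, r) + 1)
    )
    return r, k, favourable / math.comb(total, r)


# Hypothetical data in which every site value exceeds every reference value.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
site = [6.0, 7.0, 8.0, 9.0, 10.0]
r, k, p = quantile_test(reference, site, q=0.5)
print(r, k, p)
```

With q = 0.5 the cutoff is the pooled median, which matches the remark that the test then behaves like a median test.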

Slippage Test

Introduction

This module performs a test for a difference between the right-tails of two distributions. Further discussion of the test setting and assumptions is contained in the method section of the module.

Method

PURPOSE: Test for a shift to the right in the extreme right-tail of the site population (population 2) relative to the reference population (population 1). This is equivalent to asking if a set of the largest values of the site distribution is larger than the maximum value of the reference (background) distribution.
DATA: A simple or systematic random sample, x1, x2,..., xn, from the reference population and an independent simple or systematic random sample, y1, y2,..., ym, from the site population.
ASSUMPTIONS: The validity of the random sampling and independence assumptions is assured by using proper randomization procedures, which can be verified by reviewing the procedures used to select the sampling points.
LIMITATIONS AND ROBUSTNESS: Since the Slippage test focuses on the right-tail, large outliers will bias results. Also, the Slippage test says nothing about the center of the two distributions. Therefore, this test should be used in combination with a location test like the t-test (if the data are normally distributed) or the Wilcoxon Rank Sum test. Non-detects larger than the maximum background detect are eliminated from the analysis.
CHOICE OF SIGNIFICANCE LEVEL: The significance level (or type I error rate) is the chance of rejecting the null hypothesis given the null hypothesis is true. Typical values are 0.001, 0.01, 0.05, and 0.1. The specific significance level for a test is set by assessing the consequences of making this decision error in the context of the specific study. Generally speaking, the more severe the consequences of making this decision error, the lower the significance level. For example, suppose we were testing whether a plutonium clean-up level had been reached. Since the consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the site is clean when it is in fact dirty, are quite severe, the significance level should be small, say 0.001 or 0.01. On the other hand, suppose we were testing whether a stream is contaminated with arsenic. Since the consequences of rejecting the null hypothesis when it is in fact true, i.e., concluding the stream is contaminated when it is not, are not so severe, the significance level should be set higher, say 0.05 or 0.10. In conclusion, setting the significance level is somewhat subjective and dependent upon your specific field and study goal.
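The quantity at the heart of the Slippage test, the number of site values that exceed the largest reference value, is easy to compute directly. The sketch below does so and attaches a p-value derived under simplifying assumptions that are not stated in this text: all values are detects, there are no ties, and under the null hypothesis every ordering of the pooled data is equally likely. In that case the chance that k or more site values exceed the reference maximum is C(m, k) / C(n + m, k). The function name and data are hypothetical.

```python
import math


def slippage_test(reference, site):
    """Return (k, p_value) for the count of site values above max(reference).

    Assumes detects only and no ties; the p-value is the chance that the
    k largest pooled values all come from the site sample when both
    samples share one distribution.
    """
    n, m = len(reference), len(site)
    ref_max = max(reference)
    k = sum(1 for v in site if v > ref_max)
    p_value = math.comb(m, k) / math.comb(n + m, k)
    return k, p_value


# Hypothetical data: both site values exceed the reference maximum.
reference = [1.0, 2.0, 3.0]
site = [4.0, 5.0]
k, p = slippage_test(reference, site)
print(k, p)
```

Because the statistic depends entirely on the single largest reference value, one unusually large reference measurement can drive the count to zero, which illustrates the sensitivity to outliers noted above.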

Paired t-test

Introduction

Paired t-tests are used to test for a difference between paired population means. The data from these studies are often collected on the same sampling units (individuals or objects) at different times, for example, heart rate measurements taken before and after a stress test. These measurements performed on the same sampling units are correlated, so they are handled differently than independent measurements taken from two sampling units.

Method

PURPOSE: To test for a difference between paired population means.
DATA: Two data sets x1,..., xn and y1,..., yn that are paired, i.e. x1 is paired with y1, x2 is paired with y2, etc.
ASSUMPTIONS: The main assumption associated with the Paired t-test is that the two data sets come from approximately normal distributions. As a result of the Central Limit Theorem, this assumption may be relaxed if the number of paired observations is greater than 30.
LIMITATIONS AND ROBUSTNESS: Since there is really only one sample, a sample of differences, the limitations for the Paired t-test are the same as those for the one-sample t-test. These methods are robust against population distributions deviating moderately from normality. However, they are not robust against outliers. When the distribution is determined to be distinctly non-normal or there are outliers in the distribution, the nonparametric Sign Test or the Wilcoxon Signed Rank Test are valid alternative tests for paired data.
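The remark that there is really only one sample, a sample of differences, can be made concrete with a short sketch: form the pairwise differences and apply the one-sample t statistic to them. The function name and example data are hypothetical, and the p-value lookup against the t distribution with n - 1 degrees of freedom is omitted.

```python
import math
import statistics


def paired_t(x, y):
    """Paired t statistic and degrees of freedom from the differences x - y."""
    if len(x) != len(y):
        raise ValueError("paired data sets must have the same length")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # One-sample t statistic on the differences, testing a mean of zero.
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical before/after measurements on the same four units.
after = [10.0, 12.0, 14.0, 16.0]
before = [9.0, 10.0, 11.0, 12.0]
t_stat, dof = paired_t(after, before)
print(f"t = {t_stat:.4f} on {dof} degrees of freedom")
```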

Testing Normality

Introduction

This module uses the Shapiro-Wilk test to determine if a population is normally distributed. While other tests for normality are available, the Shapiro-Wilk test is one of the most powerful and is therefore the only one presented in this module.

Method

PURPOSE: To determine if a single population is normally distributed.
SHAPIRO-WILK TEST: The Shapiro-Wilk W test is one of the most powerful and widely used statistical tests for normality. This test is similar to computing a correlation between the quantiles of the standard normal distribution and the ordered values of a data set. If the points on a normal probability plot lie approximately along a straight line, the W test statistic will be relatively large and the data are considered to be normal. However, systematic deviations from a straight line indicate the distribution is non-normal. In this case, the W test statistic will be relatively small. Tables of critical values for sample sizes up to 50 have been developed for determining the significance of the W test statistic. However, this module can perform the W test for data sets with sample sizes as large as 5000.
OTHER NORMALITY TESTS: There are several other statistical tests for normality that are related to the Shapiro-Wilk test. D'Agostino's test for sample sizes between 50 and 1000 and Royston's test for sample sizes up to 2000 are two such tests that approximate some of the key quantities or parameters of the Shapiro-Wilk W test. The Filliben statistic, also called the probability plot correlation coefficient, also is a test for normality and measures the linearity of the points on the normal probability plot. Similar to the W test, if the normal probability plot is approximately linear, then the correlation coefficient will be relatively large.
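The idea shared by these tests, measuring how close the normal probability plot is to a straight line, can be sketched with a probability plot correlation coefficient. The example below correlates the ordered data with standard normal quantiles at plotting positions (i - 0.375) / (n + 0.25). That plotting-position formula, the function name, and the data are assumptions made for illustration; this is not the Shapiro-Wilk W statistic, and no critical values are included.

```python
import math
from statistics import NormalDist, mean


def normal_ppcc(data):
    """Correlation between ordered data and standard normal quantiles."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    # Normal quantiles at the assumed plotting positions.
    z = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    mx, mz = mean(ordered), mean(z)
    sxz = sum((x - mx) * (q - mz) for x, q in zip(ordered, z))
    sxx = sum((x - mx) ** 2 for x in ordered)
    szz = sum((q - mz) ** 2 for q in z)
    return sxz / math.sqrt(sxx * szz)


# Hypothetical data: roughly symmetric values versus strongly right-skewed values.
symmetric = [4.2, 4.9, 5.1, 5.4, 5.8, 6.0, 6.3, 6.6, 7.0, 7.7]
skewed = [1.0, 1.1, 1.2, 1.3, 1.5, 1.8, 2.5, 4.0, 20.0, 100.0]
print(f"symmetric: {normal_ppcc(symmetric):.4f}")
print(f"skewed:    {normal_ppcc(skewed):.4f}")
```

A coefficient near 1 corresponds to a nearly linear probability plot; the skewed data give a visibly smaller value.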

Background Comparison

Introduction

This module compares the distributions of two data sets to determine if one is significantly elevated over the other. This problem typically arises when comparing a potentially contaminated site to background levels. Multiple sites may be compared to a single background data set. Subsets of the data may be obtained by using the panel and group variables. However, only one panel subset and one group subset can be analyzed at a time.

Method

A common problem encountered in environmental statistics involves comparing concentrations from a particular location to background or reference concentrations. The procedure described in this section is appropriate for characterization purposes. It is not intended to demonstrate a location has met clean-up standards. This section provides a useful method for determining if the location is different from background, implying that a contamination source may currently exist or may have existed at one time.
The data consist of simple or systematic random samples from the site (there may be more than one) and background populations of interest. The samples may or may not contain compositing. While subsets of the data may be obtained using the panel and group variables, only a single panel subset and a single group subset can be analyzed at any one time.
The comparison procedure involves simultaneously running four two-sample hypothesis tests each at a specified significance level. The tests are the t-test, the Wilcoxon Rank Sum test, the Quantile test and the Slippage test. The t-test and the Wilcoxon Rank Sum test compare the centers of the underlying populations while the Quantile test and the Slippage test compare the right-tails of the two distributions. A final conclusion should be made using exploratory data analysis (EDA) along with the results of all the tests. To that end, summary statistics and boxplots also are presented with the test results.
Care needs to be taken when setting the significance level for an individual test because it is necessary to control the false rejection rate for the entire procedure. Because four conclusions are drawn, each at the individual significance level, the overall false rejection rate is higher than the individual level. Suppose an overall false rejection rate of at most 0.05 is desired. One method to set the significance level is to divide 0.05 by the number of tests performed, which gives 0.05/4 = 0.0125. However, since the t-test and the Wilcoxon Rank Sum test are similar and the Quantile test and the Slippage test are similar, this leads to a conservative overall rate. Therefore, as a compromise between 0.05 and 0.0125, a significance level of 0.02 is suggested for each test to attain an overall false rejection rate of approximately 0.05.
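The arithmetic behind this choice can be checked with a short sketch. It computes the divide-by-the-number-of-tests level and two reference values for the overall false rejection rate at a given per-test level: the simple upper bound (number of tests times the per-test level) and the rate that would apply if the four tests were independent. The function names are hypothetical. How far the actual rate falls below these values depends on how strongly the four tests are related, which this sketch does not estimate.

```python
def per_test_level(overall_rate, n_tests):
    """Per-test significance level obtained by dividing the overall rate."""
    return overall_rate / n_tests


def overall_rate_upper_bound(per_test, n_tests):
    """Upper bound on the overall false rejection rate (sum of the levels)."""
    return min(1.0, per_test * n_tests)


def overall_rate_if_independent(per_test, n_tests):
    """Overall false rejection rate if the tests were independent."""
    return 1.0 - (1.0 - per_test) ** n_tests


n_tests = 4
print(per_test_level(0.05, n_tests))               # divide-by-four level
print(overall_rate_upper_bound(0.02, n_tests))     # bound at 0.02 per test
print(overall_rate_if_independent(0.02, n_tests))  # independent tests at 0.02
```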

Time Plot

Introduction

This module generates a time plot of the data. A time plot is simply the data plotted in chronological order. Detects are plotted as solid circles and non-detects are plotted as open circles. Further discussion of the setting and assumptions is contained in the guidance section of the module.

Method

PURPOSE: A time plot provides insight into how the data change over time. This plot type aids in identifying trends or outliers. A trend is an increasing or decreasing change in the data and indicates a change in the process that creates the data. Detects are plotted as solid circles and non-detects are plotted as open circles. The module on trend estimation uses a nonparametric method to quantitatively estimate the trend.
DATA: A simple or systematic random sample over n time points, x1,..., xn, from the population of interest. The sample may or may not contain compositing. Several data sets from separate populations may be analyzed using the panel and group variables.

Trend Estimation

Introduction

This module estimates a trend in temporal data. Several summary and inferential statistics for trend are computed and a graphical display of the confidence interval for slope is presented. Further discussion of the setting and assumptions is contained in the guidance section of the module.

Method

PURPOSE: Summarize the trend (if any) in monitoring data. Several nonparametric methods are used. The summary table includes the sample size, number of detects, Conover's estimate for the intercept, the Theil-Sen estimate of slope, Gilbert's confidence interval for slope, and the Mann-Kendall test for trend p-value. Note that these are all nonparametric methods. Sources include Hollander and Wolfe (1973) and Gilbert (1987).
DATA: A simple or systematic random sample over n time points, x1,..., xn, from the population of interest. The sample may or may not contain compositing. Several data sets from separate populations may be analyzed using the panel and group variables.
ASSUMPTIONS: The data do not contain serial correlation; that is, values close together in time are not correlated.
LIMITATIONS AND ROBUSTNESS: Currently, the nonparametric methods require the date/time points be equally spaced. The reported value is used for a non-detect.
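Two of the quantities named in the PURPOSE paragraph have simple textbook definitions that can be sketched directly: the Theil-Sen slope is the median of the slopes between all pairs of time points, and the Mann-Kendall S statistic is the number of increasing pairs minus the number of decreasing pairs. The function names and data below are hypothetical; the intercept estimate, the confidence interval for slope, and the p-value are not covered, and non-detects are simply entered at their reported values as described above.

```python
import statistics


def theil_sen_slope(times, values):
    """Median of the slopes between all pairs of distinct time points."""
    n = len(values)
    slopes = [
        (values[j] - values[i]) / (times[j] - times[i])
        for i in range(n)
        for j in range(i + 1, n)
        if times[j] != times[i]
    ]
    return statistics.median(slopes)


def mann_kendall_s(values):
    """Increasing pairs minus decreasing pairs, taken in time order."""
    n = len(values)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)  # sign of the later-minus-earlier change
    return s


# Hypothetical equally spaced monitoring data with a steady increase.
times = [1, 2, 3, 4]
values = [2.0, 4.0, 6.0, 8.0]
print(theil_sen_slope(times, values))
print(mann_kendall_s(values))
```

A large positive S points to an increasing trend and a large negative S to a decreasing one; values near zero give no evidence of a monotonic trend.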

Sampling -- One Sample

Introduction

This module provides guidance in setting up sampling plans for single populations; that is, when the interest is comparing a site mean to a pre-defined threshold (action level). An appropriate sample size is calculated, and samples can be located on a map in ESRI shapefile format.

Method

PURPOSE: The purpose of this module is to provide guidance in setting up sampling plans for testing the mean of a single population. This scenario should be utilized when the mean is to be compared to a pre-determined value (sometimes called a threshold or action level). An appropriate sample size is returned, as well as sampling locations if a map is provided.
MAP: If a map is provided, sampling locations will be returned. Two types of maps can be provided. A map of the sampling region (provided as a set of polygons) will determine the region(s) in which sampling is of interest. An overlay map can also be provided, to provide a better visual context for viewing. If only an overlay map is provided, then the extent of the overlay map is considered the sampling region.
DATA: Sampling plans can utilize current data, if available. If data are available, they can be used to estimate the site mean, proportion, or standard deviation. Also, site-specific information may indicate a need to sample certain pre-specified locations. If sampling locations are provided, then they will be included in the sampling plan.

Sampling -- Two Sample

Introduction

This module provides guidance in setting up sampling plans for two populations; that is, when the interest is comparing the means of two sampling regions, typically site versus background. Appropriate sample sizes are calculated, and samples can be located on a map in ESRI shapefile format.

Method

PURPOSE: The purpose of this module is to provide guidance in setting up sampling plans for comparing the means of two populations. This scenario should be utilized when the means of two different sampling regions are to be compared, typically site versus background. Appropriate sample sizes are returned, as well as sampling locations if appropriate maps are provided.
MAP: If maps are provided, sampling locations will be returned. Two types of maps can be provided. A map of a sampling region (provided as a set of polygons) will determine the region(s) in which sampling is of interest. An overlay map can also be provided, to provide a better visual context for viewing. At least two maps must be provided. If a single sampling region is supplied, then the extent of the overlay map (excluding the sampling region) is assumed to define the other sampling region.
DATA: Sampling plans can utilize current data, if available. If data are available, they can be used to estimate the site mean, proportion, or standard deviation. Also, site-specific information may indicate a need to sample certain pre-specified locations. If sampling locations are provided, then they will be included in the sampling plan.