Statistics Toolbox
Data Analysis Process
Typically, the first step in data analysis is to get a picture of the data
both quantitatively and qualitatively. This involves the computation of
summary statistics and plotting of the data. This process is often referred
to as Exploratory Data Analysis, or EDA. Summary statistics provide a
numerical description of the data, using a few numbers to describe the
characteristics of the entire data set. But a picture is often worth
a thousand words (or numbers), so EDA employs many graphical techniques to
depict the data in picture form.
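As a rough illustration of the two sides of EDA described above, the following Python sketch computes a few summary statistics and draws a crude text histogram. It uses only the Python standard library; the function names (summarize, text_histogram) and the sample values are hypothetical and are not part of any toolbox.

```python
import statistics

def summarize(data):
    """Return a small dictionary of summary statistics for a numeric sample."""
    q1, _median, q3 = statistics.quantiles(data, n=4)  # quartile cut points
    return {
        "n": len(data),
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),  # sample standard deviation
        "min": min(data),
        "q1": q1,
        "q3": q3,
        "max": max(data),
    }

def text_histogram(data, bins=5):
    """Return one text line per equal-width bin, as a crude picture of the data."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values are equal
    counts = [0] * bins
    for x in data:
        idx = min(int((x - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return [f"{lo + i * width:8.2f} | {'*' * c}" for i, c in enumerate(counts)]

# Hypothetical measurements, invented for this illustration only.
sample = [2.1, 2.4, 2.4, 2.9, 3.0, 3.3, 3.8, 4.6, 5.2, 9.7]

summary = summarize(sample)
histogram = text_histogram(sample)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
print("\n".join(histogram))
```

In this invented sample the mean sits well above the median, and the histogram shows why: a single large value stretches the right tail. The numbers alone hint at this, but the picture makes it obvious.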
Often, the goal of interest is to assess whether a hypothesis is supported
by the data. The most common hypothesis testing techniques provide a
statistical basis for choosing between a null hypothesis [the hypothesis
to which one would default if no data were available] and a single
alternative hypothesis. Hypothesis testing techniques calculate a
test statistic from the data, then use that statistic to determine
whether to reject the null hypothesis or fail to reject it.
More informative than a simple accept/reject conclusion, however, is the
reported p-value, which measures how strong the evidence against the null
hypothesis is. If the p-value is smaller than a designated significance
level [typical significance levels being 0.01, 0.05, or 0.10], then the
null hypothesis is rejected. A very small p-value thus indicates that the
null hypothesis would be rejected at any of these commonly used
significance levels, and it gives a greater weight of evidence against
the null hypothesis.
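The decision rule described above can be sketched in a few lines of Python. The function name reject_null and the two example p-values are hypothetical; the sketch only shows how one p-value can lead to different conclusions at different significance levels, while a very small one does not.

```python
def reject_null(p_value, alpha=0.05):
    """Return True when the p-value falls below the significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    return p_value < alpha

typical_levels = (0.01, 0.05, 0.10)

# A hypothetical borderline p-value: the conclusion depends on the level chosen.
borderline = {alpha: reject_null(0.03, alpha) for alpha in typical_levels}

# A hypothetical very small p-value: rejected at every typical level.
strong = {alpha: reject_null(0.0004, alpha) for alpha in typical_levels}

print(borderline)
print(strong)
```

Reporting the p-value itself, rather than only the reject decision, lets the reader apply whichever significance level is appropriate.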
There are many different types of hypothesis tests. The most common
are one-sample and two-sample tests. A one-sample test compares (some
attribute of) a set of data against some pre-defined value (e.g. the
mean of a sample versus a regulation threshold). A two-sample
test, on the other hand, compares the attributes of one data set versus
the attributes of another data set (e.g. the mean for a potentially
contaminated site versus the mean of a background data set).
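To make the distinction concrete, the sketch below computes the test statistic for a one-sample t-test (sample mean versus a fixed threshold) and for a two-sample Welch t-test (mean of one data set versus mean of another). The t-test is used here only as a familiar example; the function names and all data values are hypothetical. Converting these statistics to p-values requires the t distribution, which is not shown.

```python
import math
import statistics

def one_sample_t(sample, reference):
    """t statistic comparing the sample mean against a fixed reference value."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - reference) / standard_error

def welch_t(a, b):
    """Welch t statistic comparing the means of two independent samples."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

# Hypothetical concentrations, invented for this illustration only.
site = [12.1, 14.3, 13.8, 15.2, 12.9, 14.7]
background = [10.2, 11.1, 9.8, 10.9, 10.5, 11.4]
threshold = 12.0  # hypothetical regulation threshold

t_one = one_sample_t(site, threshold)   # site mean versus the threshold
t_two = welch_t(site, background)       # site mean versus background mean
print(f"one-sample t = {t_one:.3f}, two-sample t = {t_two:.3f}")
```

The only structural difference is what the sample mean is compared against: a constant in the one-sample case, and another estimated mean (with its own sampling variability) in the two-sample case.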
There is also a distinction between parametric and non-parametric tests.
Parametric tests are generally more powerful when their assumptions hold,
but they rely on knowledge of the statistical distribution of the data
(for example, that the data are normally distributed). Non-parametric
tests are generally less powerful but have less stringent requirements
on the data distribution.
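As one example of a non-parametric test, the sketch below implements an exact two-sided sign test, which checks whether the median of a sample differs from a reference value. It uses only the signs of the differences, so it makes no assumption about the shape of the distribution. The sign test is chosen here purely for illustration; the function name and data are hypothetical.

```python
import math

def sign_test_p_value(sample, reference):
    """Exact two-sided sign test of the hypothesis that the median equals reference."""
    above = sum(1 for x in sample if x > reference)
    below = sum(1 for x in sample if x < reference)
    n = above + below  # observations equal to the reference are dropped
    if n == 0:
        return 1.0
    k = min(above, below)
    # Probability of k or fewer successes in n fair coin flips.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # double the tail for a two-sided test

# Hypothetical measurements, all above a hypothetical reference of 12.0.
site = [12.1, 14.3, 13.8, 15.2, 12.9, 14.7]
p_value = sign_test_p_value(site, 12.0)
print(f"sign test p-value = {p_value}")
```

Because the test discards the magnitudes of the differences and keeps only their signs, it typically needs more data than a parametric test to detect the same effect, which is the loss of power mentioned above.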
When parametric techniques are employed, it is important to check
the assumptions about the data distribution. EDA graphs can be useful
in assessing these assumptions, though more formal tests exist for
some assumptions.
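One graphical check of a normality assumption is the normal quantile-quantile (Q-Q) plot. The sketch below computes the points of such a plot and the correlation between them, which is close to 1 when the points fall near a straight line. The plotting positions (i - 0.5) / n are one common convention and are an assumption here, as are the function names and the data.

```python
import math
from statistics import NormalDist, mean

def normal_qq_points(sample):
    """Pair each sorted observation with the matching standard normal quantile."""
    n = len(sample)
    ordered = sorted(sample)
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

def qq_correlation(sample):
    """Pearson correlation of the Q-Q points; values near 1 suggest normality."""
    points = normal_qq_points(sample)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical samples: one roughly symmetric, one with a single extreme value.
symmetric = [9.1, 9.6, 9.8, 10.0, 10.1, 10.3, 10.4, 10.9]
skewed = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 100.0]

print(f"symmetric sample: r = {qq_correlation(symmetric):.3f}")
print(f"skewed sample:    r = {qq_correlation(skewed):.3f}")
```

Plotting the points returned by normal_qq_points gives the usual Q-Q graph. The correlation is only a numerical summary of how straight that graph is; it does not replace a formal test.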
