Statistics Toolbox

Data Analysis Process

Typically, the first step in data analysis is to get a picture of the data both quantitatively and qualitatively. This involves computing summary statistics and plotting the data, a process often referred to as Exploratory Data Analysis, or EDA. Summary statistics provide a numerical description of the data, using a few numbers to describe the characteristics of the entire data set. But a picture is often worth a thousand words (or numbers), so EDA employs many graphical techniques to depict the data in picture form.
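As an illustration of the quantitative side of EDA, the following minimal sketch computes a handful of common summary statistics using only the Python standard library. The data values are made up for illustration, and the choice of statistics is an assumption rather than a prescribed set.

```python
import statistics

# Hypothetical measurements (made-up values, for illustration only)
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 12.5]

summary = {
    "n": len(data),
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "stdev": statistics.stdev(data),  # sample standard deviation (divides by n - 1)
    "min": min(data),
    "max": max(data),
}

# Quartiles; statistics.quantiles uses the 'exclusive' method by default
q1, q2, q3 = statistics.quantiles(data, n=4)
summary["iqr"] = q3 - q1  # interquartile range

for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")
```

In this made-up sample the single large value (12.5) pulls the mean above the median, which is the kind of feature a histogram or box plot would make visible at a glance.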
Often, the goal of interest is to assess whether a hypothesis is supported by the data. The most common hypothesis testing techniques provide a statistical basis for choosing between a null hypothesis (the hypothesis to which one would default if no data were available) and a single alternative hypothesis. Hypothesis testing techniques calculate a test statistic from the data, then use that statistic to determine whether to reject or fail to reject the null hypothesis.
More informative than a simple reject or fail-to-reject conclusion, however, is the reported p-value: the probability, computed under the assumption that the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed. The p-value gives a notion of how strong the weight of evidence is. If the p-value is smaller than a designated significance level (typical significance levels being 0.01, 0.05, or 0.10), then the null hypothesis is rejected. A very small p-value thus indicates that the null hypothesis would be rejected at any commonly used significance level, and it provides stronger evidence against the null hypothesis.
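The decision rule described above can be sketched in a few lines. The function name `decide` and the p-value of 0.03 are hypothetical, chosen only to show how the same p-value leads to different conclusions at different significance levels.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the test decision for a given p-value and significance level."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    # Reject only when the p-value is smaller than the significance level
    return "reject H0" if p_value < alpha else "fail to reject H0"


p_value = 0.03  # hypothetical p-value reported by some test
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}: {decide(p_value, alpha)}")
```

Reporting the p-value itself, rather than only the decision, lets the reader apply whatever significance level is appropriate to the problem.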
There are many different types of hypothesis tests. The most common are one-sample and two-sample tests. A one-sample test compares some attribute of a data set against a pre-defined value (e.g., the mean of a sample versus a regulation threshold). A two-sample test, on the other hand, compares an attribute of one data set against the same attribute of another data set (e.g., the mean for a potentially contaminated site versus the mean of a background data set).
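To make the distinction concrete, the sketch below computes the test statistic for a one-sample t-test and for a two-sample t-test. It assumes the unequal-variance (Welch) form for the two-sample case, which is one common choice among several. The function names, data values, and threshold are hypothetical. Only the statistic is computed here; converting it to a p-value requires the t distribution, which the Python standard library does not provide.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """t statistic for comparing a sample mean against a fixed value mu0."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - mu0) / standard_error


def welch_t(sample_a, sample_b):
    """t statistic for comparing two sample means (unequal variances allowed)."""
    var_a = statistics.variance(sample_a) / len(sample_a)
    var_b = statistics.variance(sample_b) / len(sample_b)
    diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    return diff / math.sqrt(var_a + var_b)


# Hypothetical concentrations (made-up values, for illustration only)
site = [5.1, 6.3, 5.8, 7.2, 6.6, 5.9]
background = [4.2, 4.8, 5.0, 4.5, 5.3, 4.6]
threshold = 5.0  # hypothetical regulation threshold

print("one-sample t (site vs threshold): ", one_sample_t(site, threshold))
print("two-sample t (site vs background):", welch_t(site, background))
```

A positive statistic in either case indicates that the site mean lies above the comparison value; how far above, relative to the sampling variability, is what the corresponding p-value quantifies.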
There is also a distinction between parametric and non-parametric tests. Parametric tests are generally more powerful when their assumptions hold, but they rely on knowledge of the statistical distribution of the data (for example, that the data are normally distributed). Non-parametric tests are generally less powerful but place less stringent requirements on the data distribution.
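As one example of a non-parametric technique, the following sketch implements an exact two-sided permutation test for a difference in means. It is offered as an illustration only, not as the method used by any particular toolbox; the function name and data are hypothetical. The test makes no assumption about the shape of the data distribution. It asks instead how often a random relabelling of the pooled observations produces a difference at least as large as the one observed.

```python
from itertools import combinations
from statistics import mean


def permutation_test(sample_a, sample_b):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(mean(sample_a) - mean(sample_b))

    extreme = 0
    n_splits = 0
    # Enumerate every way of assigning n_a of the pooled values to group A
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        n_splits += 1
        # Small tolerance guards against floating-point round-off in ties
        if diff >= observed - 1e-12:
            extreme += 1
    return extreme / n_splits


# Hypothetical concentrations (made-up values, for illustration only)
site = [5.1, 6.3, 5.8, 7.2, 6.6, 5.9]
background = [4.2, 4.8, 5.0, 4.5, 5.3, 4.6]
print("permutation p-value:", permutation_test(site, background))
```

Full enumeration is practical only for small samples (two groups of six give 924 splits); for larger samples a random subset of relabellings is usually drawn instead.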
When parametric techniques are employed, it is important to check the assumptions about the data distribution. EDA graphs such as histograms and normal probability plots can be useful in assessing these assumptions, though more formal tests exist for some assumptions (for example, the Shapiro-Wilk test for normality).
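One graphical check of a normality assumption is the normal probability (Q-Q) plot, which pairs the sorted observations with the quantiles expected under a normal distribution. The sketch below computes those plot coordinates and the correlation between them, using the plotting positions (i - 0.5) / n, which is one convention among several. The function names and data are hypothetical. A correlation close to 1 means the points fall close to a straight line, which is consistent with normality; the sketch does not supply a formal cutoff.

```python
import math
from statistics import NormalDist, mean


def normal_qq_points(sample):
    """Pair each sorted observation with its theoretical standard-normal quantile."""
    n = len(sample)
    ordered = sorted(sample)
    # Plotting positions (i - 0.5) / n; other conventions exist
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def qq_correlation(sample):
    """Pearson correlation between theoretical quantiles and sorted data."""
    points = normal_qq_points(sample)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical samples (made-up values, for illustration only)
roughly_symmetric = [4.2, 4.8, 5.0, 4.5, 5.3, 4.6, 4.9, 5.1, 4.4, 4.7]
strongly_skewed = [1.0, 1.1, 1.2, 1.1, 1.3, 1.0, 1.2, 1.4, 1.1, 25.0]

print("Q-Q correlation, symmetric sample:", qq_correlation(roughly_symmetric))
print("Q-Q correlation, skewed sample:   ", qq_correlation(strongly_skewed))
```

The points returned by `normal_qq_points` can be passed to any plotting tool to draw the probability plot itself.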