1. The file PLAY5.SYD contains numeric data on two variables (X, Y) along with an alphabetical label for each case (OBJECT$). Your job is to do a principal components analysis of these data (Statistics | Data Reduction | Factor Analysis) in order to see what happens.
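If you want a rough idea of what a two-variable analysis looks like before opening SYSTAT, here is a minimal Python sketch. It is not SYSTAT output and does not use PLAY5.SYD: the numbers are made up, the function names are hypothetical, and it assumes the analysis is based on the correlation matrix (check which option your own run uses). With two standardized variables the correlation matrix is [[1, r], [r, 1]], so the two eigenvalues are 1 + r and 1 - r, and each loading is an eigenvector element times the square root of its eigenvalue.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def two_variable_pca(xs, ys):
    """Principal components of the 2x2 correlation matrix [[1, r], [r, 1]].

    The eigenvectors are (1, 1)/sqrt(2) and (1, -1)/sqrt(2), with
    eigenvalues 1 + r and 1 - r. Components are returned largest first.
    """
    r = pearson_r(xs, ys)
    pairs = [(1 + r, (1, 1)), (1 - r, (1, -1))]
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    components = []
    for eigenvalue, (a, b) in pairs:
        # loading = eigenvector element * sqrt(eigenvalue);
        # max() guards against a tiny negative value from rounding
        scale = math.sqrt(max(eigenvalue, 0.0) / 2)
        components.append({
            "eigenvalue": eigenvalue,
            "share_of_variance": eigenvalue / 2,
            "loadings": (a * scale, b * scale),
        })
    return components


# Made-up, strongly correlated data (NOT the contents of PLAY5.SYD)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

components = two_variable_pca(x, y)
for number, component in enumerate(components, start=1):
    print(number, component)
```

The two eigenvalues always sum to 2 (the number of variables), so the stronger the correlation between X and Y, the more of the total variance ends up in the first component.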
2. Now let's look at some real archaeological data. The file
GSTONE.SYD contains chemical data on greenstones from Alabama (the data
are from Dan Gall's Ph.D. dissertation). The file contains quantitative
data on 13 elements: sodium (Na), calcium (Ca), scandium (Sc), chromium
(Cr), iron (Fe), cobalt (Co), zinc (Zn), lanthanum (La), cerium (Ce), samarium
(Sm), europium (Eu), ytterbium (Yb), and lutetium (Lu). (The last
six elements in the list, the so-called "rare earths," tend to be strongly
associated when they occur in rocks.) SAMPLE$ contains a unique catalog
number for each specimen. AREA$ codes the provenience of the specimen
as follows: "Moundville" indicates the specimen is a greenstone celt found
at Moundville; "Northern," "Central," and "Southern" indicate geological
samples from three distinct source areas. Your mission is to determine
the source areas from which the raw materials for Moundville celts were
obtained.
Here are some helpful hints for Part 2:
All the information necessary to determine sources is contained in the first two principal components. Your interpretations of sources should focus only on these two.
In order to produce a biplot, you'll have to run the same principal components analysis twice, saving scores the first time (Save | Factor scores), and saving loadings the second time (Save | Factor loadings).
You'll want to set AREA$ as your ID variable before each principal components analysis. That way, you can use the SYMBOL=AREA$ option when you plot your principal component scores.
In examining the scores on the first two principal components, it might help to plot only the geological specimens first, then to look at a plot showing both the geological and the Moundville specimens. Both plots should be scaled in the same way so that the patterns can be compared directly. Specify the scale limits by using the XMIN, XMAX, YMIN, and YMAX options. You can select what to plot by using the SELECT command. For example, you can select only the geological specimens by typing SELECT AREA$<>"Moundville" (or clicking on Data | Select Cases).
In order to plot the loadings on the first two components, first you'll have to TRANSPOSE the loadings file saved in the FACTOR procedure. Then, when you plot the transposed data, be sure to use the LABEL=LABEL$ option so that the element names appear next to each plotting symbol on the graph.
In the biplot, the scores plot should be scaled from -4 to +4 on both axes in order to include all the data points; the loadings plot should be scaled from -1 to +1 on both axes. Each plot should have dotted lines indicating the origin. The scores plot and loadings plot should appear on the same page, the former above the latter. (A sample command file that creates the two graphs needed in a biplot, called BIPLOT.SYC, can be found on the course web site.)
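The data handling behind these hints can be sketched outside SYSTAT. The Python below is only an illustration: all values are made up, the function names are hypothetical, and it assumes the saved loadings file holds one row per component and one column per element (which is what the need to TRANSPOSE suggests). It shows the case selection, the shared axis limits, and why transposing gives you one labelled row per element.

```python
# Made-up scores: (AREA$, score on component 1, score on component 2)
scores = [
    ("Northern", -1.2, 0.4),
    ("Central", 0.3, -0.9),
    ("Southern", 1.8, 1.1),
    ("Moundville", 0.2, -0.7),
    ("Moundville", 1.6, 0.9),
]


def select_cases(rows, exclude_area):
    """Mimic SELECT AREA$<>"...": keep every row not from exclude_area."""
    return [row for row in rows if row[0] != exclude_area]


def within_limits(rows, limits):
    """Check that every score falls inside the shared axis limits."""
    low, high = limits
    return all(low <= value <= high for _, pc1, pc2 in rows for value in (pc1, pc2))


# First plot: geological specimens only; second plot: all specimens.
geological = select_cases(scores, exclude_area="Moundville")

# Use the same limits for both score plots so they can be compared directly.
SCORE_LIMITS = (-4.0, 4.0)
print(within_limits(geological, SCORE_LIMITS), within_limits(scores, SCORE_LIMITS))

# Made-up loadings as assumed to be saved: one row per component,
# one column per element (only three elements shown).
elements = ["Na", "Ca", "Sc"]
loadings_by_component = [
    [0.62, 0.55, -0.31],  # component 1
    [-0.20, 0.41, 0.77],  # component 2
]


def transpose_with_labels(names, rows):
    """Turn component rows into one labelled row per variable."""
    return [(name, *values) for name, values in zip(names, zip(*rows))]


# Each entry is (label, loading on component 1, loading on component 2),
# which is the shape needed to plot one labelled point per element.
loadings_by_element = transpose_with_labels(elements, loadings_by_component)
for row in loadings_by_element:
    print(row)
```

After transposing, each element is a single point with its own label, which is why the loadings plot needs the transposed file rather than the file as saved.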
Datasets and command files for this exercise (right-click to download):