Installing PCL Analysis (Menu interface)

UNIX

1) Make sure Perl is installed in /usr/bin/perl

a. If it isn’t, go to http://www.perl.com/ to find out how to install Perl

2) Download and unpack the UNIX distribution

3) Go to the directory containing PCL_Analysis and type:

chmod u+x PCL_Analysis.pl

4) Run the program by typing:

./PCL_Analysis.pl

5) If you want to use cluster, knnimpute, classminer, or svd, go to the Stanford Microarray Database software page at http://genome-www5.stanford.edu/MicroArray/SMD/restech.html and download the appropriate programs. Place the executables in your working directory and give them exactly the following names; PCL Analysis needs these names for the programs to work properly.

CLUSTER: cluster

KNNIMPUTE: knnimpute

SVD (Contact Gavin Sherlock): svd.out

CLASSMINER (Contact Olga Troyanskaya): classminer
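A quick way to confirm that the helper programs are in place is to check the working directory for the names listed above. The following is a minimal sketch, not part of PCL Analysis; the function name missing_helpers is hypothetical, and the check simply assumes each program must be a regular, executable file.

```python
import os

# Expected executable names, taken from the list above.
REQUIRED_NAMES = {
    "CLUSTER": "cluster",
    "KNNIMPUTE": "knnimpute",
    "SVD": "svd.out",
    "CLASSMINER": "classminer",
}


def missing_helpers(directory="."):
    """Return the helper program names that are absent or not executable."""
    missing = []
    for name in REQUIRED_NAMES.values():
        path = os.path.join(directory, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing
```

Calling missing_helpers() from the directory containing PCL_Analysis.pl returns an empty list when all four programs are present; only the programs you intend to use need to be installed.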

WINDOWS

1) Install ActiveState Perl

a. Go to http://www.activestate.com/Products/ActivePerl/

b. Click “Download”, then “Next”, then “MSI” (next to Windows)

c. When prompted, choose “Open” and follow the installation instructions

2) Download and unpack PCL_Analysis (Windows distribution)

3) To install clustering software:

a. Install the software from:

http://genome-www5.stanford.edu/MicroArray/SMD/restech.html

b. Put the cluster program in the same directory as PCL_Analysis and rename it to “cluster.exe” if necessary

4) cluster.exe, PCL_Analysis.pl and PCL_Analysis.pm must all be in the same directory

5) To run the software, double click on PCL_Analysis.pl

(if this doesn’t work, make sure that Perl is associated with the .pl file extension).

MACINTOSH (OS X)

1) Follow the UNIX installation instructions

Author contact info:

John Isaac Murray

murray@genome.stanford.edu

To inquire about KNNimpute or Classminer

Olga Troyanskaya:

oft@princeton.edu

To inquire about XCluster or svd.out

Gavin Sherlock

sherlock@genome.stanford.edu

Using PCL Analysis (Menu interface)

The menu interface is the simplest method for using PCL Analysis. Run the program “PCL_Analysis.pl” from the UNIX or Mac OS X command line, or double click on “PCL_Analysis.pl” in Windows. A menu screen listing the numbered commands will appear.

Each “command” is accessed by typing its number, followed by the enter key. In most cases, the commands are self-explanatory (for example “Load data from a file”). Unless otherwise specified, files must be in “PCL” format:

http://genetics.stanford.edu/~sherlock/cluster.html#formats

Descriptions of individual menu options

Options for adding data or annotations

1	Load Data From File	Prompts the user for a filename. The file must be in extended PCL format (see Table 1). The data is loaded into memory and then can be manipulated using the various other functions. The first column should contain a unique identifier that specifies the identity of the data (e.g. gene) on each row.
4	Add Additional Data	Used when a dataset is already in memory, and the user wants to add additional experiments from another file to the current dataset – the file must be in extended PCL format and will be added after the last column of the current data. The first column unique identifiers should be the same as for the data already in memory.
19	Insert From File at Specified Column	This is like “Add Additional Data” (4), but the user can specify an arbitrary column number after which to insert the data, or insert the data at the beginning of the dataset (by specifying column ‘-1’).
17	Add Blank Columns	Allows the user to add blank columns (data columns containing no data values) after the specified columns, which is useful during visualization for making the boundaries of related experiment sets apparent.
40	Load Annotation from File	Allows the user to add additional annotations for each gene (from a file). The first column of the file should contain the same identifiers as the data already in memory. These new annotations do not have to be unique (for example they could contain Gene Ontology category names).
60	Remove Annotation Column	Allows the user to remove an annotation column that is no longer needed
47	Move Experiments	Allows the user to move a set of data columns to a different position in the dataset
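To make the file layout expected by “Load Data From File” concrete, here is a minimal sketch of a reader for the strict PCL layout. It assumes three leading columns (unique identifier, NAME, GWEIGHT) followed by the data columns, and a second row holding the EWEIGHT values. The extended PCL format described in Table 1 may carry additional annotation or score columns that this sketch does not handle, and the function name read_pcl is hypothetical.

```python
import csv


def read_pcl(handle):
    """Parse a strict PCL file into (experiment_names, eweights, rows).

    rows maps each unique identifier to a list of floats, with None
    standing in for blank (missing) data points.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    experiments = header[3:]  # columns after ID, NAME, GWEIGHT
    eweight_row = next(reader)
    eweights = [float(v) for v in eweight_row[3:]]
    rows = {}
    for fields in reader:
        if not fields:
            continue  # skip empty lines
        values = fields[3:]
        # Pad short rows so every gene has one entry per experiment.
        values += [""] * (len(experiments) - len(values))
        rows[fields[0]] = [float(v) if v.strip() else None for v in values]
    return experiments, eweights, rows
```

Because the rows are keyed by the first column, a duplicated identifier would silently overwrite an earlier row, which is one reason the first column needs to be unique.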

Options for printing information about the current dataset

2	Print All Experiment Names	Prints the names of each experiment, and their column numbers, one per line. This is useful in determining the column number corresponding to a specific experiment
3	Print One Experiment Name
6	Print Annotation Names	Prints the names of each annotation column, and their column numbers
7	Print One Annotation Name
26	Print Score Column Names	Prints the names of each score column, and their column numbers
30	Print Number of Experiments	Prints the total number of experiments in the dataset
31	Print Number of Genes	Prints the total number of rows (usually genes) in the dataset (useful after filtering)

Options for changing array cluster weighting information

5	Set all EWEIGHTs to 1	Sets the EWEIGHT of every data column to 1 (see Table 1 for the definition of EWEIGHT)
18	Change some EWEIGHTs	Allows the user to set the EWEIGHTs for a specified set of data columns to a user-specified number
21	Set EWEIGHTs to 1 / magnitude of column vector	Automatically normalizes the EWEIGHT values to 1 / the square root of the sum-squared expression values for each data column. This has the effect of equalizing the impact of each data column on clustering (normally data columns with large expression values have the greatest impact in clustering)
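The weighting used by option 21 can be written out directly: each column's EWEIGHT becomes 1 divided by the square root of the sum of its squared values. The sketch below illustrates that arithmetic only. The function name is hypothetical, blanks are represented as None, and the fallback weight of 1 for an all-zero or empty column is an assumption rather than documented behavior.

```python
import math


def magnitude_eweights(columns):
    """Return 1 / sqrt(sum of squares) for each data column.

    Each column is a list of floats; None marks a missing value and
    is skipped.
    """
    weights = []
    for column in columns:
        sum_sq = sum(v * v for v in column if v is not None)
        # Assumption: a column with zero magnitude keeps a weight of 1.
        weights.append(1.0 / math.sqrt(sum_sq) if sum_sq > 0 else 1.0)
    return weights
```

For example, a column holding 3 and 4 has magnitude 5 and receives a weight of 0.2, so columns with large expression values are weighted down in proportion to their size.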

Options for transforming the data

12	Zero Transform (or center) (mean)	The “zero transform” functions allow the user to subtract, for each gene, the average expression in one set of experiments from the expression in each experiment of another set. Examples include subtracting the average of the zero-time-point experiments from all experiments in a time course (“zero transforming”) and subtracting the average of a set of related experiments from each of those experiments (“centering”). This option subtracts the mean.
25	Zero Transform (or center) (median)	Like Zero Transform (mean), except that it subtracts the median expression in one set of experiments from each experiment of another set
10	SVD Center the data	Requires that the program svd.out be in the current directory. It uses this program to subtract the most prominent eigengene (derived from a singular value decomposition of the data) from the data matrix, which usually gives a result similar to mean centering. See (Alter et al., 2000) for more information on SVD. This requires a full data matrix (no missing values).
24	Negate Data for arrays	Flips the sign of each specified data column (useful for fluor-reversed data)
16	KNN Impute data	Requires that the program knnimpute be in the current directory. It uses knnimpute (Troyanskaya et al., 2001) to estimate values for any missing datapoints in the dataset (it is recommended that these values be reverted – see next entry – when they are no longer needed and that they not be used to draw direct inferences)
15	Revert Missing Values	Looks in a user specified PCL file (which should have the same genes and experiments as the file currently in memory) and for each blank datapoint in the file, it makes the equivalent point in memory blank
41	Average by annotation column	Creates a new dataset containing one row for each unique annotation in the user-specified annotation column; the data in each row of the new dataset is the average (mean) of all rows in the original dataset with that annotation
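The two zero transform options (12 and 25) differ only in the average they subtract. The following sketch shows the operation on a single gene row. The function and parameter names are hypothetical, columns are referred to by zero-based index, and leaving a row unchanged when the baseline set has no data is an assumption.

```python
from statistics import mean, median


def zero_transform(row, baseline_cols, target_cols, average=mean):
    """Subtract the average of row[baseline_cols] from row[target_cols].

    row is a list of floats with None for missing values. Pass
    average=median for the median variant.
    """
    baseline = [row[i] for i in baseline_cols if row[i] is not None]
    if not baseline:
        return list(row)  # assumption: nothing to subtract, leave unchanged
    offset = average(baseline)
    out = list(row)
    for i in target_cols:
        if out[i] is not None:
            out[i] = out[i] - offset
    return out
```

Passing the same indices for both sets gives “centering”; passing only the zero-time-point columns as the baseline gives a “zero transform” of a time course.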

Options for normalizing the experiments

48	Mean Center Arrays	Subtracts the mean expression for all genes in an array from each data point in that array (normalizing in log space)
49	Median Center Arrays	Subtracts the median expression for all genes in an array from each data point in that array (normalizing in log space)
50	Mean Normalize (divide) Arrays	Divides each expression value by the mean for that array (normalizing in ratio or linear space)
51	Median Normalize (divide) Arrays	Divides each expression value by the median for that array (normalizing in ratio or linear space)
52	Log Transform	Log transform each data value (linear space → log space)
53	Exponential Transform	Exponentially transform each data value (log space → linear space)
54	Linear Transform	Multiply each data value by a constant
57	Round up low data	Makes all data values below a specified constant equal to that constant. Useful for zero or negative values in linear or ratio space, which cannot otherwise be log-transformed
42	Average Experiments	Creates a new data column containing the average data from the specified experiments.
43	Average Time Courses	The user specifies two sets of data columns (which are directly comparable, e.g. replicates), and the program creates an equal number of new data columns containing the average from each pair of samples from the two input sets
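The array-level operations above are simple per-column arithmetic. This sketch illustrates three of them on a single data column: median centering (49), rounding up low data (57), and a log transform (52). The function names are hypothetical, and the base-2 logarithm is an assumption, since the base used by PCL Analysis is not stated here.

```python
import math
from statistics import median


def median_center_array(column):
    """Subtract the column median from every present value (log space)."""
    present = [v for v in column if v is not None]
    if not present:
        return list(column)
    m = median(present)
    return [None if v is None else v - m for v in column]


def round_up_low(column, floor):
    """Raise every present value below floor up to floor."""
    return [None if v is None else max(v, floor) for v in column]


def log2_transform(column):
    """Log transform each present value; all values must be positive."""
    return [None if v is None else math.log2(v) for v in column]
```

Applying round_up_low before log2_transform is what makes a column containing zero or negative linear-space values safe to transform.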

Options for generating and computing on gene scores

38	Generate Score using script
27	Generate sum-squared score	Makes a new score column containing the sum of the squares of the expression values in a specified set of columns for each gene
32	Make classminer score	Requires that the program classminer (Troyanskaya et al., 2002) is in the current directory. Calculates the significance of any differences in the expression in one set of experiments versus the other experiments, using a rank-sum test, and puts the p-values in a new score column
34	Make 2-class classminer score	Same as classminer score, but looks at the difference between one arbitrary set of experiments and a second arbitrary set of data (rather than all the remaining experiments)
35	Make SingleCorrelation score	Requires that the program singlecorrelation be in the current directory. The user specifies the name of a small PCL file, and for each data row in the new file, the program creates a new score column in the file in memory containing the Pearson correlation between that row and each gene in the dataset.
39	Load Score from file	User specifies a file with the same unique first column as the current data and program creates a new score column and fills it with a specified column of the file.
45	Change score to annotation	Turns a score column into an annotation column
46	Change annotation to score	Turns an annotation column into a score column. The default when loading an extended PCL file is for all the columns before the GWEIGHT column to be annotation columns, so this is needed to allow the user to load a file with scores
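As an illustration of what a score column holds, the sum-squared score (option 27) for one gene can be computed as below. This is a sketch with a hypothetical function name; skipping missing values (None) is an assumption about how blanks are treated.

```python
def sum_squared_score(row, columns):
    """Sum the squares of the values in the given columns of one gene row."""
    return sum(row[i] ** 2 for i in columns if row[i] is not None)
```

Genes with large positive or negative expression in the chosen columns receive high scores, which can then be used with the score-based filters.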

Options for filtering the data

9	Filter by percent good data	Removes all genes that do not have at least a specified percent of their datapoints present
13	Remove Experiments	Removes the specified data columns
14	Keep subset of experiments	Keeps only the specified data columns
55	Filter n fold in m arrays	Removes all genes that do not have at least m experiments in which the absolute value of the expression is greater than a specified value (n)
56	Filter outside specified bounds	Removes all genes within or outside of two arbitrary boundary values (for example keep genes with expression greater than 500 or less than 20)
28	Pick ## of genes by score rank	Picks the top ## genes based on the highest values in the specified score column
29	Pick genes above a score	Removes all genes that do not have a score greater than the specified value in a specified score column
33	Pick genes below a score	Removes all genes that do not have a score less than the specified value in a specified score column
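Two of the filters above, “Filter by percent good data” (9) and “Filter n fold in m arrays” (55), can be sketched as simple predicates on a gene row. The function names are hypothetical, rows are lists of floats with None for missing values, and the treatment of boundary cases (for example, whether exactly the specified percent passes) is an assumption.

```python
def passes_percent_good(row, min_percent):
    """True if at least min_percent of the row's data points are present."""
    if not row:
        return False
    present = sum(1 for v in row if v is not None)
    return 100.0 * present / len(row) >= min_percent


def passes_n_fold_in_m(row, n, m):
    """True if at least m values have an absolute value greater than n."""
    hits = sum(1 for v in row if v is not None and abs(v) > n)
    return hits >= m


def filter_genes(rows, min_percent, n, m):
    """Keep only the genes that pass both filters."""
    return {
        gene: row
        for gene, row in rows.items()
        if passes_percent_good(row, min_percent) and passes_n_fold_in_m(row, n, m)
    }
```

In the menu interface the two filters are separate commands applied one after the other; they are combined here only to keep the example short.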

Options for saving data

S or s	Save the data	Saves all of the data from memory to an extended .PCL formatted file (see Table 1) with a user-specified filename. The column order is Annotations, then Scores, then GWEIGHT, then Data. All annotations, scores, and data are saved.
59	Save in cluster-ready format	Saves the data from memory in “strict .PCL” format (see Table 1), suitable for input into Gavin Sherlock’s XCluster software (Sherlock, 1999)
11	Save a subset of experiments	Saves data from a user-specified subset of the experiments to an extended .PCL file
22	Save a subset of genes	Saves data for only the genes in a specified genelist file (containing one gene identifier per line) to an extended .PCL file
44	Save genes not in genelist	Saves data for genes that are NOT in a specified genelist file to an extended .PCL file
23	Save in GCT format	Saves all the data in .GCT format (see Table 2)

Other options

20	Cluster genes	Requires that the program cluster be in the current directory. Makes a hierarchical cluster (.cdt, .gtr files) from the data in the specified set of columns (see http://rana.lbl.gov/EisenSoftware.htm for information on Cluster and TreeView).
36	Randomize Rows	Randomizes the order of the data values in each row
37	Randomize Columns	Randomizes the order of the data values in each column
58	Run batch file	Runs a batch file, which contains a sequence of commands, as they would be entered manually into the program, one per line.
M	Print Menu
Q	Quit

Using PCL Analysis (Descriptions of functions)

Individual functions and syntaxes will be described here eventually.