Installing PCL Analysis (Menu interface)



UNIX

1) Make sure Perl is installed in /usr/bin/perl

a. If it isn’t, go to to find out how to install Perl

2) Download and unpack the UNIX distribution

3) Go to the directory containing PCL_Analysis and type:

chmod u+x

4) Run the program by typing:

5) If you want to use cluster, knnimpute, classminer or svd, go to the Stanford Microarray Database software page at and download the appropriate programs. Place the executables in your working directory and give them the following names so that PCL Analysis can find them:


CLUSTER:                                  cluster
KNNIMPUTE:                                knnimpute
SVD (Contact Gavin Sherlock):             svd.out
CLASSMINER (Contact Olga Troyanskaya):    classminer




Windows

1) Install ActiveState Perl

a. Go to

b. Click “Download”, then “Next”, then “MSI” (next to Windows)

c. When prompted, choose “Open” and follow the installation instructions

2) Download and unpack PCL_Analysis (Windows distribution)

3) To install clustering software:

a. Install the software from:,

b. Put the cluster program in the same directory as PCL_Analysis and rename it to “cluster.exe” if necessary

4) cluster.exe, and must all be in the same directory

5) To run the software, double click on

(if this doesn’t work, make sure that Perl is associated with the .pl file extension).



Mac OSX

1) Follow the UNIX installation instructions



Author contact info:

John Isaac Murray


To inquire about KNNimpute or Classminer

Olga Troyanskaya:


To inquire about XCluster or svd.out

Gavin Sherlock


Using PCL Analysis (Menu interface)


The menu interface is the simplest method for using PCL Analysis.  Run the program “” from the UNIX or Mac OSX command line or double click on “” in Windows.  The following menu screen will appear:




Each “command” is accessed by typing its number, followed by the enter key. In most cases, the commands are self-explanatory (for example “Load data from a file”).  Unless otherwise specified, files must be in “PCL” format:


Descriptions of individual menu options


Options for adding data or annotations




Load Data From File

Prompts the user for a filename.  The file must be in extended PCL format (see Table 1).  The data is loaded into memory and then can be manipulated using the various other functions.  The first column should contain a unique identifier that specifies the identity of the data (e.g. gene) on each row.
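To make the file layout concrete, here is a minimal Python sketch of reading such a file. It is not the program's own loader. It assumes a tab-delimited file whose first row is a header, whose second row is the EWEIGHT row, and in which the columns to the left of GWEIGHT hold the unique identifier and annotations while the columns to the right hold the data. These layout details and the name parse_extended_pcl are assumptions made for illustration; Table 1 is the authority on the format.

```python
import csv
import io

def parse_extended_pcl(text):
    """Split tab-delimited PCL text into annotation names, experiment names, EWEIGHTs and rows.

    Assumed layout (hypothetical, not taken from Table 1): row 1 is the header,
    row 2 is the EWEIGHT row, and every later row is one gene. Columns left of
    GWEIGHT hold the unique identifier and annotations; columns right of it
    hold the data.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    table = [fields for fields in reader if fields]
    header, eweight_row, body = table[0], table[1], table[2:]
    g = header.index("GWEIGHT")  # boundary between annotations and data

    def to_number(cell):
        # A blank cell is treated as a missing datapoint.
        return float(cell) if cell.strip() else None

    rows = {}
    for fields in body:
        fields = fields + [""] * (len(header) - len(fields))  # pad short rows
        rows[fields[0]] = {
            "annotations": fields[1:g],
            "gweight": to_number(fields[g]),
            "data": [to_number(cell) for cell in fields[g + 1:len(header)]],
        }
    eweight_row = eweight_row + [""] * (len(header) - len(eweight_row))
    eweights = [to_number(cell) for cell in eweight_row[g + 1:len(header)]]
    return header[1:g], header[g + 1:], eweights, rows
```

The rows are keyed by the first-column identifier, which is why that column must be unique.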


Add Additional Data

Used when a dataset is already in memory, and the user wants to add additional experiments from another file to the current dataset – the file must be in extended PCL format and will be added after the last column of the current data.  The first column unique identifiers should be the same as for the data already in memory.


Insert From File at Specified Column

This is like “Add Additional Data” (4), but the user can specify an arbitrary column number after which to insert the data, or insert the data at the beginning of the dataset (by specifying column ‘-1’).


Add Blank Columns

Allows the user to add blank columns (data columns containing no data values) after the specified columns, which is useful during visualization for making the boundaries of related experiment sets apparent.


Load Annotation from File

Allows the user to add additional annotations for each gene (from a file).  The first column of the file should contain the same identifiers as the data already in memory.  These new annotations do not have to be unique (for example they could contain Gene Ontology category names).


Remove Annotation Column

Allows the user to remove an annotation column that is no longer needed


Move Experiments

Allows the user to move a set of data columns to a different position in the dataset



Options for printing information about the current dataset




Print All Experiment Names

Prints the names of each experiment, and their column numbers, one per line.  This is useful in determining the column number corresponding to a specific experiment


Print One Experiment Name



Print Annotation Names

Prints the names of each annotation column, and their column numbers


Print One Annotation Name



Print Score Column Names

Prints the names of each score column, and their column numbers


Print Number of Experiments

Prints the total number of experiments in the dataset


Print Number of Genes

Prints the total number of rows (usually genes) in the dataset (useful after filtering)





Options for changing array cluster weighting information




Set all EWEIGHTs to 1

Sets the EWEIGHT of every data column to 1 (see Table 1 for the definition of EWEIGHT)


Change some EWEIGHTs

Allows the user to set the EWEIGHTs for a specified set of data columns to a user-specified number


Set EWEIGHTs to 1 / magnitude of column vector

Automatically normalizes the EWEIGHT values to 1 / the square root of the sum-squared expression values for each data column.  This has the effect of equalizing the impact of each data column on clustering (normally data columns with large expression values have the greatest impact in clustering)
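The weighting rule can be written out as a short Python sketch. This is an illustration of the formula only, not the program's code; the function name magnitude_eweights, the use of None for missing values, and the fallback weight for an all-zero column are assumptions.

```python
import math

def magnitude_eweights(columns):
    """Return 1 / sqrt(sum of squared values) for each data column.

    Missing values (None) are skipped. Assumption: a column whose values are
    all zero or missing keeps a weight of 1 to avoid dividing by zero.
    """
    weights = []
    for column in columns:
        norm = math.sqrt(sum(v * v for v in column if v is not None))
        weights.append(1.0 / norm if norm > 0 else 1.0)
    return weights
```

A column holding the values 3 and 4 has magnitude 5 and so receives an EWEIGHT of 0.2, while a column of small values receives a larger weight.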




Options for transforming the data




Zero Transform (or center) (mean)

The “zero transform” functions allow the user to subtract, for each gene, the average expression in one set of experiments from the expression in each experiment of another set.  For example, the user can subtract the average of the zero-time-point experiments from all experiments in a time course (“zero transforming”), or subtract the average of a set of related experiments from each of those experiments (“centering”).  This option subtracts the mean.


Zero Transform (or center) (median)

Like Zero Transform (mean), except that it subtracts the median expression of one set of experiments from each experiment in another set
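The two zero transform options differ only in the statistic used, as the following Python sketch shows for a single gene. The function name zero_transform, the column-index arguments, and the handling of rows with no reference data are assumptions for illustration and do not describe the program's internals.

```python
from statistics import mean, median

def zero_transform(row, reference_cols, target_cols, center=mean):
    """Subtract the mean (or median) of reference_cols from each of target_cols.

    row is one gene's list of values, with None for missing datapoints.
    Assumption: if the gene has no data in the reference columns, the row is
    returned unchanged.
    """
    reference = [row[i] for i in reference_cols if row[i] is not None]
    if not reference:
        return list(row)
    baseline = center(reference)
    out = list(row)
    for i in target_cols:
        if out[i] is not None:
            out[i] = out[i] - baseline
    return out
```

Passing the same columns as reference and target gives “centering”; passing only the zero-time-point columns as reference gives a “zero transform”. Passing center=median makes the result less sensitive to a single outlying experiment.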


SVD Center the data

Requires that the program svd.out be in the current directory.  It uses this program to subtract the most prominent eigengene (derived from a singular value decomposition of the data) from the data matrix; the result is usually similar to mean centering.  See (Alter et al., 2000) for more information on SVD.  This option requires a full data matrix (one with no missing values)


Negate Data for arrays

Flips the sign of each specified data column (useful for fluor-reversed data)


KNN Impute data

Requires that the program knnimpute be in the current directory.  It uses knnimpute (Troyanskaya et al., 2001) to estimate values for any missing datapoints in the dataset.  It is recommended that these values be reverted (see next entry) when they are no longer needed, and that they not be used to draw direct inferences


Revert Missing Values

Looks in a user-specified PCL file (which should have the same genes and experiments as the file currently in memory) and, for each blank datapoint in that file, makes the equivalent point in memory blank
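The reverting step amounts to using the original file as a mask. A minimal Python sketch follows, assuming both datasets are held as dictionaries mapping a gene identifier to a list of values with None for blanks; the name revert_missing and this data layout are hypothetical.

```python
def revert_missing(current, reference):
    """Blank every datapoint in `current` that is blank in `reference`.

    Both arguments map gene identifier -> list of values (None = blank).
    Assumption: genes absent from the reference file are left unchanged.
    """
    reverted = {}
    for gene, data in current.items():
        ref = reference.get(gene)
        if ref is None:
            reverted[gene] = list(data)
            continue
        reverted[gene] = [None if r is None else v for v, r in zip(data, ref)]
    return reverted
```

Applied after KNN imputation with the pre-imputation file as the reference, this removes the estimated values again and leaves the measured ones untouched.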


Average by annotation column

Creates a new dataset containing one row for each unique annotation in the user-specified annotation column; the data in each row of the new dataset is the average (mean) of all rows in the original dataset with that annotation
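As an illustration, averaging by annotation is a group-by followed by a column-wise mean. The Python sketch below assumes rows are given as (annotation, data) pairs with None for missing values and that missing values are left out of each mean; the function name and these choices are assumptions, not the program's documented behavior.

```python
from collections import defaultdict

def average_by_annotation(rows):
    """Collapse rows that share an annotation into one row of column-wise means.

    rows is a list of (annotation, data_list) pairs. Returns a dictionary
    mapping each unique annotation to its averaged data list.
    """
    groups = defaultdict(list)
    for annotation, data in rows:
        groups[annotation].append(data)
    averaged = {}
    for annotation, members in groups.items():
        columns = []
        for values in zip(*members):
            present = [v for v in values if v is not None]
            columns.append(sum(present) / len(present) if present else None)
        averaged[annotation] = columns
    return averaged
```

With Gene Ontology category names as the annotation, the result has one row per category rather than one row per gene.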



Options for normalizing the experiments




Mean Center Arrays

Subtracts the mean expression for all genes in an array from each data point in that array (normalizing in log space)


Median Center Arrays

Subtracts the median expression for all genes in an array from each data point in that array (normalizing in log space)


Mean Normalize (divide) Arrays

Divides each expression value by the mean for that array (normalizing in ratio or linear space)


Median Normalize (divide) Arrays

Divides each expression value by the median for that array (normalizing in ratio or linear space)
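The four array normalization options above combine two choices: the statistic (mean or median) and the operation (subtract in log space, divide in ratio or linear space). A minimal Python sketch for one array follows; the function name normalize_array and its parameters are hypothetical.

```python
from statistics import mean, median

def normalize_array(column, stat=median, mode="center"):
    """Center (subtract) or divide one array's values by its mean or median.

    column is a list of values for one array, with None for missing values,
    which are ignored when computing the statistic.
    """
    present = [v for v in column if v is not None]
    s = stat(present)
    if mode == "center":
        return [None if v is None else v - s for v in column]
    if mode == "divide":
        return [None if v is None else v / s for v in column]
    raise ValueError("mode must be 'center' or 'divide'")
```

For example, normalize_array(values, stat=mean, mode="center") corresponds to Mean Center Arrays and normalize_array(values, stat=median, mode="divide") to Median Normalize (divide) Arrays.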


Log Transform

Log transform each data value (linear space → log space)


Exponential Transform

Exponentially transform each data value (log space → linear space)


Linear Transform

Multiply each data value by a constant


Round up low data

Makes all data values below a specified constant equal to that constant.  Useful for negative data in linear or ratio space which cannot be log-transformed otherwise
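The reason for rounding up before a log transform can be seen in a short Python sketch. The function names are hypothetical, and base 2 is an assumption made for the example; this document does not state which base the program uses.

```python
import math

def round_up_low(values, floor):
    """Raise every value below `floor` to `floor`; missing values (None) stay missing."""
    return [None if v is None else max(v, floor) for v in values]

def log_transform(values):
    """Base-2 log of each value (the base is an assumption made for this sketch)."""
    return [None if v is None else math.log2(v) for v in values]
```

Calling log_transform directly on a list containing -5.0 raises an error, because the logarithm of a negative number is undefined. Calling round_up_low(values, 1.0) first replaces such values with 1.0, which then log-transforms to 0.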


Average Experiments

Creates a new data column containing the average data from the specified experiments.


Average Time Courses

The user specifies two sets of data columns (which are directly comparable, e.g. replicates), and the program creates an equal number of new data columns containing the average from each pair of samples from the two input sets
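The pairing can be sketched in Python for one gene. The name average_time_courses and the rule for a pair with one missing value (use the value that is present) are assumptions; the program may treat missing values differently.

```python
def average_time_courses(row, set_a, set_b):
    """Average column set_a[i] with column set_b[i] for each i, for one gene.

    row is a list of values (None = missing); set_a and set_b are equal-length
    lists of column indexes. Assumption: if only one value of a pair is
    present it is used as-is, and if both are missing the result is None.
    """
    if len(set_a) != len(set_b):
        raise ValueError("the two sets must contain the same number of columns")
    averaged = []
    for a, b in zip(set_a, set_b):
        pair = [v for v in (row[a], row[b]) if v is not None]
        averaged.append(sum(pair) / len(pair) if pair else None)
    return averaged
```

Two replicate time courses of three points each therefore yield three new columns, one per time point.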



Options for generating and computing on gene scores




Generate Score using script



Generate sum-squared score

Makes a new score column containing the sum of the squares of the expression values in a specified set of columns for each gene
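As a worked example, the sum-squared score for one gene can be computed as below. The function name is hypothetical, and skipping missing values is an assumption.

```python
def sum_squared_score(row, cols):
    """Sum of squared expression values over the chosen columns of one gene's row.

    Missing values (None) are skipped (an assumption of this sketch).
    """
    return sum(row[i] ** 2 for i in cols if row[i] is not None)
```

For log-ratio data this score is large for genes that change strongly in either direction across the chosen experiments, which makes it a convenient input for the score-based filters.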


Make classminer score

Requires that the program classminer (Troyanskaya et al., 2002) is in the current directory.  Calculates the significance of any differences in the expression in one set of experiments versus the other experiments, using a rank-sum test, and puts the p-values in a new score column


Make 2-class classminer score

Same as classminer score, but looks at the difference between one arbitrary set of experiments and a second arbitrary set of data (rather than all the remaining experiments)


Make SingleCorrelation score

Requires that the program singlecorrelation be in the current directory.  The user specifies the name of a small PCL file, and for each data row in that file, the program creates a new score column in the dataset in memory containing the Pearson correlation between that row and each gene in the dataset.
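For reference, the Pearson correlation between a query row and one gene can be computed as in the Python sketch below. This is a plain textbook implementation, not the singlecorrelation program; the function name and the pairwise handling of missing values are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length rows, using only positions present in both.

    Returns None when fewer than two shared datapoints exist or when either
    row is constant over the shared positions.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)
```

Computing pearson(query_row, gene_row) for every gene in the dataset gives one score column per query row.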


Load Score from file

The user specifies a file with the same unique first-column identifiers as the current data, and the program creates a new score column filled with a specified column of that file.


Change score to annotation

Turns a score column into an annotation column


Change annotation to score

Turns an annotation column into a score column.  The default when loading an extended PCL file is for all the columns before the GWEIGHT column to be annotation columns, so this is needed to allow the user to load a file with scores


Options for filtering the data



Filter by percent good data

Removes all genes that do not have at least a specified percent of their datapoints present
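A minimal Python sketch of this filter, assuming the dataset is a dictionary mapping gene identifier to a list of values with None for missing datapoints (the function name and data layout are hypothetical):

```python
def filter_percent_present(rows, min_percent):
    """Keep only genes with at least min_percent of their datapoints present."""
    kept = {}
    for gene, data in rows.items():
        present = sum(v is not None for v in data)
        if data and 100.0 * present / len(data) >= min_percent:
            kept[gene] = data
    return kept
```

With a cutoff of 75, a gene measured in 3 of 4 experiments is kept and a gene measured in 2 of 4 is removed.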


Remove Experiments

Removes the specified data columns


Keep subset of experiments

Keeps only the specified data columns


Filter n fold in m arrays

Removes all genes that do not have at least m experiments in which the absolute value of the expression is greater than a specified threshold (n)
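The rule can be stated compactly in Python. This sketch assumes the same dictionary layout (gene identifier to list of values, None for missing); the function name is hypothetical.

```python
def filter_n_in_m(rows, n, m):
    """Keep genes with at least m datapoints whose absolute value exceeds n."""
    return {
        gene: data
        for gene, data in rows.items()
        if sum(1 for v in data if v is not None and abs(v) > n) >= m
    }
```

With n = 2 and m = 2 on log-ratio data, a gene is kept only if it is more than 2 units away from zero, in either direction, in at least two experiments.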


Filter outside specified bounds

Removes all genes within or outside of two arbitrary boundary values (for example keep genes with expression greater than 500 or less than 20)


Pick ## of genes by score rank

Picks the top ## genes based on the highest values in the specified score column


Pick genes above a score

Removes all genes that do not have a score greater than the specified value in a specified score column


Pick genes below a score

Removes all genes that do not have a score less than the specified value in a specified score column
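The three score-based filters above can be sketched in Python, assuming scores are held as a dictionary mapping gene identifier to a number. The function names are hypothetical, and how the program breaks ties is not stated here.

```python
def top_by_score(scores, count):
    """Return the `count` gene identifiers with the highest scores, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:count]

def above_score(scores, cutoff):
    """Return the genes whose score is strictly greater than cutoff."""
    return {gene for gene, score in scores.items() if score > cutoff}

def below_score(scores, cutoff):
    """Return the genes whose score is strictly less than cutoff."""
    return {gene for gene, score in scores.items() if score < cutoff}
```

When the score column holds p-values, below_score is usually the one wanted, since small p-values indicate significance.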



Options for saving data


S or s

Save the data

Saves all of the data from memory to an extended .PCL formatted file (see Table 1) with a user-specified filename.  The column order is Annotations, then Scores, then GWEIGHT, then Data.  All annotations, scores and data are saved.


Save in cluster-ready format

Saves the data from memory in “strict .PCL” format (see Table 1), suitable for input into Gavin Sherlock’s XCluster software (Sherlock, 1999)


Save a subset of experiments

Saves data from a user-specified subset of the experiments to an extended .PCL file


Save a subset of genes

Saves data for only the genes in a specified genelist file (containing one gene identifier per line) to an extended .PCL file


Save genes not in genelist

Saves data for genes that are NOT in a specified genelist file to an extended .PCL file
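The two genelist options are complements of each other, as this Python sketch illustrates. It assumes the genelist is plain text with one identifier per line and the dataset is a dictionary keyed by gene identifier; the function name is hypothetical and no file is written.

```python
def split_by_genelist(rows, genelist_text):
    """Split rows into (genes in the genelist, genes not in the genelist).

    genelist_text holds one gene identifier per line; blank lines and
    surrounding whitespace are ignored (an assumption of this sketch).
    """
    wanted = {line.strip() for line in genelist_text.splitlines() if line.strip()}
    in_list = {gene: data for gene, data in rows.items() if gene in wanted}
    not_in_list = {gene: data for gene, data in rows.items() if gene not in wanted}
    return in_list, not_in_list
```

Identifiers in the genelist must match the first-column identifiers exactly for a gene to be selected.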


Save in GCT format

Saves all the data in .GCT format (see Table 2)




Other options



Cluster genes

Requires that the program cluster be in the current directory.  Makes a hierarchical cluster (.cdt, .gtr files) from the data in the specified set of columns. (see for information on cluster and treeview)


Randomize Rows

Randomizes the order of the data values in each row


Randomize Columns

Randomizes the order of the data values in each column
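The difference between the two randomization options is the axis along which values are shuffled. A minimal Python sketch follows, assuming the data is a list of rows; the function names and the optional rng parameter (used to make a run repeatable) are hypothetical.

```python
import random

def randomize_rows(matrix, rng=random):
    """Shuffle the values within each row independently (each gene keeps its own values)."""
    return [rng.sample(row, len(row)) for row in matrix]

def randomize_columns(matrix, rng=random):
    """Shuffle the values within each column independently (each experiment keeps its own values)."""
    shuffled = [rng.sample(list(column), len(column)) for column in zip(*matrix)]
    return [list(row) for row in zip(*shuffled)]
```

Randomizing rows preserves each gene's distribution of values while breaking its relationship to the experiments; randomizing columns preserves each experiment's distribution while breaking its relationship to the genes. Either can serve as a null dataset for comparison with real results.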


Run batch file

Runs a batch file, which contains a sequence of commands, as they would be entered manually into the program, one per line.


Print Menu









Using PCL Analysis (Descriptions of functions)


Individual functions and syntaxes will be described here eventually.