Discriminative Margin Clustering

Kamesh Munagala, Rob Tibshirani and Patrick O. Brown

Code
    C code for discriminative margin clustering. The I/O consists of the following files:

    1. "tumor.txt" contains the gene expression data for the tumor samples. Column j of line i holds the expression value of the gene with ID j-1 in the tumor with ID i-1. Please ensure there are no missing values.

    2. "normal.txt" contains the gene expression data for the normal samples in the same format. Column j of line i holds the expression value of the gene with ID j-1 in the normal sample with ID i-1. Please ensure there are no missing values.

    3. "tumornames.txt" contains the list of names of the tumors, one name in each line. Line number i contains the name of tumor with ID i-1.

    4. Command line input: Number of tumors, Number of normal samples, Number of genes.

    5. Standard Output: The PostScript output representing the tree. Please save it to <filename>.ps, and use ps2pdf to convert it to PDF format.

    6. "cluster.txt": Output of the gene lists for the various cluster nodes. For each cluster node, we indicate the IDs of the tumors at its leaf nodes, the combined margin of the cluster, and the list of gene IDs along with their weights in the feature set at that node.

    The software requires the CPLEX libraries. CPLEX is a commercially available linear optimization package from ILOG.
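
    The matrix layout described for "tumor.txt" and "normal.txt" could be read with a routine along the following lines. This is a minimal sketch, not the distributed code: the name read_matrix is hypothetical, and it assumes values are separated by spaces or tabs, since the delimiter is not stated above. It rejects any line with fewer or more values than the stated number of genes, which is one way to catch missing values before clustering.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not from the distributed code): reads a rows x cols
 * matrix of expression values into out[i * cols + j], where i is the sample
 * ID and j is the gene ID (both 0-based, matching the file description).
 * Assumption: values are separated by spaces or tabs.
 * Returns 0 on success, -1 if any line has a missing, extra, or
 * non-numeric value. */
int read_matrix(FILE *fp, int rows, int cols, double *out)
{
    assert(fp != NULL && out != NULL && rows > 0 && cols > 0);
    for (int i = 0; i < rows; i++) {
        int c;
        for (int j = 0; j < cols; j++) {
            /* Skip blanks within the line, but stop at a line break. */
            do { c = fgetc(fp); } while (c == ' ' || c == '\t');
            if (c == '\n' || c == '\r' || c == EOF)
                return -1;              /* line ended early: missing value */
            ungetc(c, fp);
            if (fscanf(fp, "%lf", &out[i * cols + j]) != 1)
                return -1;              /* token is not a number */
        }
        /* The rest of the line must be empty. */
        do { c = fgetc(fp); } while (c == ' ' || c == '\t' || c == '\r');
        if (c != '\n' && c != EOF)
            return -1;                  /* more values than genes */
    }
    return 0;
}
```

    A caller would open the file with fopen, allocate rows * cols doubles, and treat a return value of -1 as a malformed input file.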

    C code for finding expanded feature sets. The I/O consists of the following files:

    1. "tumor.txt" contains the gene expression data for the tumor samples. Column j of row i holds the expression value of the gene with ID j-1 in the tumor with ID i-1. Please ensure there are no missing values.

    2. "normal.txt" contains the gene expression data for the normal samples in the same format. Column j of row i holds the expression value of the gene with ID j-1 in the normal sample with ID i-1. Please ensure there are no missing values.

    3. "node.txt" contains a list of sets of tumor IDs (one set on each line) for which we need expanded feature sets. For example: "1 2<newline>4 5<newline>9 15"  would indicate 3 sets of tumors, (1,2), (4,5) and (9,15).

    4. "val.txt" contains the combined margin for each of the sets in "node.txt", one margin value on each line. "node.txt" and "val.txt" therefore have the same number of lines.

    5. Command line input: Number of tumors, Number of normal samples, Number of genes.

    6. Standard Output: For each set of tumors, the expanded feature set produced by the quadratic programming method. We output the ID of the gene along with its weight. The output can be saved to a text file.
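
    The "node.txt" format from item 3 can be parsed one line at a time. The sketch below is illustrative only: parse_id_set is a hypothetical name, and it assumes IDs are non-negative integers separated by whitespace, as in the example "1 2<newline>4 5<newline>9 15".

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper (not from the distributed code): parses one line of
 * "node.txt" into ids[].
 * Returns the number of IDs found, or -1 on a malformed token, a negative
 * or out-of-range ID, or more than max_ids entries. */
int parse_id_set(const char *line, int *ids, int max_ids)
{
    assert(line != NULL && ids != NULL && max_ids > 0);
    int count = 0;
    const char *p = line;
    for (;;) {
        char *end;
        errno = 0;
        long v = strtol(p, &end, 10);   /* skips leading whitespace */
        if (end == p)
            break;                      /* no further number on this line */
        if (errno == ERANGE || v < 0 || v > INT_MAX || count == max_ids)
            return -1;
        ids[count++] = (int)v;
        p = end;
    }
    /* Anything left over must be whitespace. */
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')
        p++;
    return (*p == '\0') ? count : -1;
}
```

    A caller could read "node.txt" and "val.txt" in step, one line from each per set, and report an error if one file ends before the other.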

    This software also requires the CPLEX set of libraries. The code contains a parameter EPS, which specifies how much the margin is relaxed from the maximum margin. The default value is 0.4, but it can be changed by editing the line "#define EPS 0.4".
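
    Editing the source for every EPS value can be avoided with a standard preprocessor guard. This is a possible variation, not something the distributed code is stated to do: with the guard in place, EPS keeps its default of 0.4 unless another value is supplied at compile time, for example through the compiler's -D option. The accessor margin_relaxation is a hypothetical name used only to make the sketch self-contained.

```c
#include <assert.h>

/* Possible variation on the EPS definition (an assumption, not the
 * distributed code): keep the default unless the build overrides it. */
#ifndef EPS
#define EPS 0.4   /* default value stated in the documentation */
#endif

/* Hypothetical accessor so the value in effect can be inspected. */
double margin_relaxation(void)
{
    /* Assumption: a relaxation amount should not be negative. */
    assert(EPS >= 0.0);
    return EPS;
}
```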

    The maximum allowed number of tissues is 300, and the maximum number of genes is 15000.
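
    The three command-line values and the limits above can be checked before any data is read. The following is a sketch under stated assumptions rather than the programs' actual argument handling: parse_count and parse_dims are hypothetical names, the arguments are assumed to be positional in the order listed (tumors, normal samples, genes), and the 300-tissue limit is applied to tumors and normal samples combined, which is the stricter reading of that sentence.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define MAX_TISSUES 300    /* limit stated in the documentation */
#define MAX_GENES   15000  /* limit stated in the documentation */

/* Hypothetical helper: converts s to an integer in [1, max].
 * Returns -1 if s is not a clean decimal number in that range. */
static int parse_count(const char *s, long max)
{
    char *end;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0' || v < 1 || v > max)
        return -1;
    return (int)v;
}

/* Hypothetical helper: reads <tumors> <normals> <genes> from argv[1..3].
 * Assumption: the tissue limit covers tumors and normals combined.
 * Returns 0 on success, -1 on a bad argument count or value. */
int parse_dims(int argc, char **argv, int *tumors, int *normals, int *genes)
{
    assert(argv != NULL && tumors != NULL && normals != NULL && genes != NULL);
    if (argc != 4)
        return -1;
    int t = parse_count(argv[1], MAX_TISSUES);
    int n = parse_count(argv[2], MAX_TISSUES);
    int g = parse_count(argv[3], MAX_GENES);
    if (t < 0 || n < 0 || g < 0 || t + n > MAX_TISSUES)
        return -1;
    *tumors = t;
    *normals = n;
    *genes = g;
    return 0;
}
```

    Calling parse_dims(argc, argv, &t, &n, &g) at the top of main and exiting on -1 would reject inputs that exceed the stated limits before any file is opened.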