搜档网
当前位置:搜档网 › Using Formal Concept Analysis for Microarray Data Comparison

Using Formal Concept Analysis for Microarray Data Comparison

Using Formal Concept Analysis for Microarray Data Comparison
Using Formal Concept Analysis for Microarray Data Comparison

July15,200612:7Proceedings Trim Size:9.75in x6.5in Choi-Microarray USING FORMAL CONCEPT ANALYSIS FOR MICROARRAY DATA

COMPARISON

V.CHOI?

Department of Computer Science,

Virginia Tech,

660McBryde Hall,

Blacksburg,VA24061,USA

E-mail:vchoi@https://www.sodocs.net/doc/b618057245.html,

Y.HUANG

Department of Computer Science,

Rutgers University,

110Frelinghuysen Road,

Piscataway,NJ08854,USA

E-mail:yahuang@https://www.sodocs.net/doc/b618057245.html,

https://www.sodocs.net/doc/b618057245.html,M,D.POTTER,https://www.sodocs.net/doc/b618057245.html,UBENBACHER,K.DUCA

Virginia Bioinformatics Institute,

Washington Street,MC0477

Virginia Tech,Blacksburg,VA24061

Blacksburg,VA24060,USA

E-mail:{vlam,reinhard,kduca}@https://www.sodocs.net/doc/b618057245.html,

Microarray technologies,which can measure tens of thousands of gene expression values simultane-

ously in a single experiment,have become a common research method for biomedical researchers.

Computational tools to analyze microarray data for biological discovery are needed.In this paper,we

investigate the feasibility of using Formal Concept Analysis(FCA)as a tool for microarray data analy-

sis.The method of FCA builds a(concept)lattice from the experimental data together with additional

biological information.For microarray data,each vertex of the lattice corresponds to a subset of genes

that are grouped together according to their expression values and some biological information relating

to gene function.The lattice structure of these gene sets might re?ect biological relationships in the

dataset.Similarities and differences between experiments can then be investigated by comparing their

corresponding lattices according to various graph measures.We apply our method to microarray data

derived from in?uenza infected mouse lung tissue and healthy controls.Our preliminary results show

the promise of our method as a tool for microarray data analysis.

?Corresponding author

1

July15,200612:7Proceedings Trim Size:9.75in x6.5in Choi-Microarray

2

1.Introduction

Microarray technologies,which can measure tens of thousands of gene expression values simultaneously in a single experiment,across different conditions and over time,have been widely used in biomedical research.They have found many applications,such as classi?-cation of tumors,assigning functions to previously unannotated genes,grouping genes into functional pathways etc(see16for a review).A large collection of database is available in the public domain(e.g.see8,15).A wealth of methods13,5have been proposed for analyz-ing these datasets to gain biological insights.A main method for analyzing these microar-ray data is based on clustering,which groups set of genes,and/or groups of experimental conditions,that exhibit similar expression patterns.These include single clustering algo-rithms,such as hierarchical clustering,k-means,self-organizing map(SOM)algorithms (see10for a review and references therein);and biclustering algorithms(see12and refer-ences therein).However,the challenge to derive useful knowledge from microarray data still remains.For example,see3for a recent biclustering algorithm that is based on non-smooth non-negative matrix factorization.In this paper,we propose another method that is based on Formal Concept Analysis(FCA)18,7as an alternative to the clustering approach.

Our Approach.The method of FCA builds a(concept)lattice from the experimental data together with additional biological information.Each vertex of the lattice corresponds to a subset of genes that are grouped together according to their expression values and some biological information relating to gene function.See Section2for the background of FCA.The lattice structure of these gene sets might re?ect biological relationships in the dataset.Similarities and differences between experiments can then be investigated by comparing their corresponding lattices according to various graph measures.In the high level description,our method consists of the following three main steps:

(1)Build a binary relation(cross-table)for each experiment.The objects of the binary

relation are genes;and there are two types of attributes:gene expression attributes

and biological attributes.The gene expression attributes are obtained by a dis-

cretization procedure on gene expression values.The biological attributes can be

any biological properties relating to gene function.

(2)Construct a Galois/concept lattice for each experiment’s binary relation using the

ef?cient Galois/concept lattice algorithm described in4.

(3)De?ne a distance measure and compare the lattices.

Note that the biological attributes of genes are invariant/constant for all experiments and they can be preprocessed.The ability to integrate these constant biological attributes is one of the advantages of our method over clustering methods.This is because the constant in-formation will be canceled out in clustering methods and thus do not add any contributions.

Related work of using FCA for microarray data https://www.sodocs.net/doc/b618057245.html,ing FCA for microarray data comparison was proposed in D.Potter’s thesis14.Based on the framework proposed there,we more rigorously develop each step.In particular,we more carefully discretize the

July15,200612:7Proceedings Trim Size:9.75in x6.5in Choi-Microarray

3 gene expression values to suit our purposes,namely,close gene expression values should share the same attribute;and distance measure is also more rigorously de?ned and bet-ter results are obtained.Also,our concept lattice construction algorithm is very ef?cient (within1second)while our data sets were too large for the program in14to handle.We should also mention that using FCA or Concept lattice approach to mine microarray data were also studied in1,2.The goal there was to extract local patterns in the microarray data and no biological attributes were employed.

Outline.The paper is organized as follows.In Section2,we review some background and notation on FCA.In Section3,we describe our method in details.In Section4,we describe our data and present our preliminary results applying our method to the data.We conclude with future work in Section5.

2.Background on FCA

Formal Concept Analysis(FCA)is a method that is based on lattice theory for the analysis of binary relational data.It was introduced by Rudolf Wille in1980s.Since its introduction, FCA7has found many applications in data mining,knowledge discovery and machine learning etc18.

The input of FCA consists of a triple(O,M,I),called context,where O= {g1,g2,...,g n}is a set of n elements,called objects;M={1,2,...,m}is a set of m elements,called attributes;and I?O×M is a binary relation.The context is often represented by a cross-table as shown in Figure1.A set X?O is called an object set,and

a set J?M is called an attribute set.Following the convention,we write an object set

{a,c,e}as ace,and an attribute set{1,3,4}as134.

For i∈M,denote the adjacency list of i by nbr(i)={g∈O:(g,i)∈I}.Similarly, for g∈O,denote the adjacency list of g by nbr(g)={i∈M:(g,i)∈I}.

De?nition2.1.The function attr:2O?→2M maps a set of objects to their common attributes:attr(X)=∩g∈X nbr(g),for X?O.The function obj:2M?→2O maps a set of attributes to their common objects:obj(J)=∩j∈J nbr(j),for J?M.

It is easy to check that for X?O,X?obj(attr(X)),and for J?M,J?attr(obj(J)).

De?nition2.2.An object set X?O is closed if X=obj(attr(X)).An attribute set J?M is closed if J=attr(obj(J)).

The composition of obj and attr induces a Galois connection between2O and2M.

Readers are referred to7for properties of the Galois connection.

De?nition2.3.A pair C=(A,B),with A?O and B?M,is called a concept if A=attr(B)and B=obj(A).

For a concept C=(A,B),by de?nition,both A and B are closed.The object set A is called the extent of C,written as A=ext(C),and the attribute set B is called the intent of

July 15,200612:7Proceedings Trim Size:9.75in x 6.5in Choi-Microarray

4

C ,and written as B =int (C ).The set of all concepts of the context (O ,M ,I )is denoted by B (O ,M ,I )or simply B when the context is understood.

Let (A 1,B 1)and (A 2,B 2)be two concepts in B .Observe that if A 1?A 2,then B 2?B 1.We order the concepts in B by the following relation ?:

(A 1,B 1)?(A 2,B 2)??A 1?A 2(B 2?B 1).

It is not dif?cult to see that the relation ?is a partial order on B .In fact,L =is a complete lattice and it is known as the concept or Galois lattice of the context (O ,M ,I ).For C,D ∈B with C ?D ,if for all E ∈B such that C ?E ?D implies that E =C or E =D ,then C is called the successor a (or lower neighbor )of D ,and D is called the predecessor (or upper neighbor )of C .The diagram representing an ordered set (where only successors/predecessors are connected by edges)is called a Hasse diagram (or a line diagram).See Figure 1for an example of the line diagram of a Galois lattice.

When the binary relation is represented as a bipartite graph (see Figure 1),each concept corresponds to a maximal bipartite clique (or maximal biclique).There is also a one-one correspondence of a closed itemset 17studied in data mining and a concept in FCA.The one-one correspondence of all these terminologies –concepts in FCA,maximal bipartite cliques in theoretical computer science,and closed itemsets in data mining –was known,e.g.17.There is extensive work of the related problems in these three communities,see 4for related literature.The current fastest algorithm given in 4takes O ( a ∈ext (C )|cnbr (a )|)polynomial delay for each concept C ,where cnbr (a )is the reduced adjacency list of a .Readers are referred to 4for the details.1234a

x x b

x x x c

x x d x x (abc,1)(ac,13)(abcd, ?)(?, 1234)

(bd,24) (b,124) a

1b

2c

3d 4Figure 1.Left,a context (O ,M ,I )with O ={a,b,c,d }and M ={1,2,3,4}.The cross ×indicates a pair in the relation I .Middle,the bipartite graph corresponding to the context.Right,the corresponding Galois/concept lattice.

a Some authors called this as immediate successor.

July15,200612:7Proceedings Trim Size:9.75in x6.5in Choi-Microarray

5

3.Methods

3.1.Building Binary Relations

In this section,we describe how to the construct the context(O,M,I)for each experiment.

Here the object set O consists of a set of genes.There are two types of attributes in the attribute set M:biological attributes and gene expression attributes.For a gene g∈O and an attribute a∈M,(g,a)∈I if g has the attribute a.

3.1.1.Biological Attributes

Any gene function related properties can be used as biological attributes.For instance, one can use protein motif families as such attributes:a gene has the attribute if its cor-responding protein belongs to the motif family.As an example,the protein motif family oxidoreductases can be such an attribute,and a gene has this attribute if it is one of the oxidoreductases.There are many other possible biological attributes,such as functional characteristics of a gene,chromosomal location of a gene,known association with disease states etc.

3.1.2.Discretization of Gene Expression Values

The data obtained from a microarray experiment consists of a set of genes and each gene has

a gene expression value.The gene expression values are continuous real numbers.In order

to represent microarray data in a binary relation,it requires to discretize the continuous gene expression values into a?nite set of values that correspond to attributes.Intuitively, we would like to discretize the gene expression values such that two close values share the same attribute.The straightforward method will be dividing the gene expression values according to large“gaps”.First,we sort all the expression values in the increasing order.

Let the ordered values be y1

Recall that our idea is to discretize the gene expression values such that close gene expression values share the same attribute(or a subinterval).However,our even parti-tion might not achieve this goal,for example,two close genes might belong to the same subinterval in one experiment but belong to two consecutive subintervals in another exper-iment.To overcome this problem,instead of assigning the gene expression value to only one subinterval,if the gene expression value from one of the even subintervals is within 50%of the neighbor subinterval,we assign both subintervals to this gene expression value.

July15,200612:7Proceedings Trim Size:9.75in x6.5in Choi-Microarray

6

We illustration the discretization procedure by an example in Figure2.In this?gure, t=5and s=4.That is,we?rst partition the gene expression values into?ve subintervals I1,I2,...,I5according to the four largest gaps.And then partition the largest subinter-val(I2in this example)into four even subintervals.There are total of eight subintervals:

I1,I2

1,I2

2

,I2

3

,I2

4

,I3,I4,I5,and each subinterval in the order corresponds to one gene

expression attribute a i,for i=1to8.The gene expression value that falls in the subin-tervals I1(I3,I4,I5respectively)is assigned to its corresponding attribute a1(a6,a7,a8 respectively).The gene expression value that falls in I2is assigned to one or two consec-utive attributes depending the region it falls.If it falls in the subintervals J ii+1(50%from the two neighboring subintervals)as shown in Figure2,then it is assigned to two attributes a i and a i+1.For example,if a gene expression value falls in the subinterval J23then it gets both attributes a2and a3.

I1I2I3I4

a1a2a3a4a5a6a7a8

I5

J23J34J45

Figure2.Discretization example.We partition the sorted gene expression values into?ve subintervals, I1,I2,...,I5according to the four largest gaps.We further partition the largest subinterval(I2in this example) into four even subintervals.There are total of eight disjoint subintervals and each subinterval corresponds to one gene expression attribute a i.The gene expression value that falls in the subintervals I1(I3,I4,I5respectively)is assigned to its corresponding attribute a1(a6,a7,a8respectively).The gene expression value that falls in I2is assigned to one or two consecutive attributes depending on the region it falls.If it falls in the subintervals J ii+1, then it is assigned to two attributes a i and a i+1,for i=2,3,4.

3.2.Concept Lattices Construction

Once we have the binary relations,we then build a concept lattice for each binary relation, using the ef?cient algorithm described in4.

3.3.Distance Measure for Lattices Comparison

Given two lattices L1=(V1,E1)and L2=(V2,E2),there are many possible distance measures that one can de?ne to measure the similarities or differences of these two lattices. The simplest distance maybe is the one based on common subgraphs.Recall that each vertex in our lattice is a concept and it is labeled by a subset of genes and a subset of attributes.For our purpose,we will ignore all attributes,that is,each vertex is labeled by its object set of genes only.See Figure3for an example.A vertex v is called a common vertex if v appears in both V1and V2.Let V C=V1∩V2be the set of common vertices. For u,v∈V C,if e={u,v}is in both E1and E2.Then e is called a common edge.Let E C be the set of common edges.The distance distance(L1,L2)is then de?ned by

|L1\L2|+|L2\L1|

|L1∪L2|

July15,200612:7Proceedings Trim Size:9.75in x6.5in Choi-Microarray

7 where L1\L2=(V1\V C)∪(E1\E C),L2\L1=(V2\V C)∪(E2\E C)and L1∪L2=L1∪(L2\L1).

abcdefg

abcd bcd bfg cd bc bg

c b

?

abcdefg

abcd bceg cd ac bc bg

c b

?

https://www.sodocs.net/doc/b618057245.html,ttices Comparison.How similar/different are these two lattices?

Since the data is not perfect,we can relax our requirement of the de?nition of a common vertex.Instead of requiring the exact matching of the gene sets of the vertices,v1and v2 are considered the same if their gene sets share more thanξof the maximum size of the two gene sets,i.e.,|obj(v1)∩obj(v2)|≥ξmax(|obj(v1)|,|obj(v2)|).

Many other possible distance measures will be investigated in the future.For example, spectral distance or maximal common sublattice distance that were also mentioned in14.

4.Our Experiments

4.1.Our Data

Our microarray data11were derived from the lung tissue of mouse under four different con-ditions:(1)Control:the mouse was normal and healthy;(2)Flu:the mouse was infected by in?uenza(H3N1);(3)Smoking:the mouse was forced to smoke for four consecutive days,with nine packs cigarette per day;(4)SmokeFlu:the mouse was both infected with ?u and smoking.For each condition,the gene expression values on6different time points —6hours,20hours,30hours,48hours,72hours and96hours—were measured.At each time point,there were three replicates,which were used to clean up the noise in the data. Also,the expression values were measured on probes and several probes can corresponding to a same https://www.sodocs.net/doc/b618057245.html,ing our clean-up procedure(to be described in the full version of this paper),one gene expression value was obtained for each gene for each sample.There are total of11,051genes for all24samples(=6time points×4conditions).

4.2.Applying Our Method to the Data

In this section,we describe the parameters used in our method when applying to the above data.

Biological Attributes.We used the protein motif families obtained from PROSITE9as our biological attributes.In particular,for our gene sets,we used the stand-alone tools

July15,200612:7Proceedings Trim Size:9.75in x6.5in Choi-Microarray

8

from PROSITE6and identi?ed21PROSITE families for our gene set.That is,we have21 biological attributes.Note these biological attributes are experiment independent and thus they only need to be computed once for all experiments.

Discretization of Gene Expression Values.We performed the discretization procedure for the gene expression values of each sample.The parameters t was set to5and s was set to4.That is,there were total of8gene expression attributes.We have tested different parameter values and found that increasing the values(and thus number of attributes)did not signi?cantly change the results.

After discretization,we had total of24binary relations over the11051genes and the 29attributes.We then applied our lattice construction algorithm on these binary relations to construct the corresponding lattices.Each lattice consists of about530vertices and1500 https://www.sodocs.net/doc/b618057245.html,ing the program in4,it took less than one second to construct each lattice on a Pentium IV3.0GHz computer with2.0G memory running under Fedora Core2Linux OS.

Distance for Comparing Lattices.The parameter in de?ning common verticesξwas set to70%.That is,v1∈V1and v2∈V2,v1and v2were considered the same if|obj(v1)∩obj(v2)|≥ξmax(|obj(v1)|,|obj(v2)|),whereξ=70%.We have tested various relaxation of This value seems to be best in our current application,that is,it clearly separates different conditions(see Section4.3for details).

4.3.Our Results

First,using Control sample at6hours as a reference,we computed the distances between all other Control samples and all Flu samples to this reference sample.The results are shown in Figure4.Since the distance measure is not transitive,we also computed all distances when one of the Control samples was taken as a reference.See Figure5.The results showed that there was a clear separation between Flu samples from Control samples (regardless which time point of Control was taken as the reference).This is a encouraging result and we are currently investigating what substructures in the lattices that contribute to the differences which might shed some new biological insights.The results of comparing other2conditions(Smoking/SmokeFlu)is shown in Figure6(in the appendix).

5.Conclusion and Future Work

Our current preliminary results showed the promise of using FCA approach for microarray data analysis.The distance measure we employed is quite basic and it has not utilized the properties of the lattice structure.Currently,we are investigating other possible distance measures,such as spectral distance,and distance based on maximal common sublattice.

These distances will take advantage of the lattice structures and might provide better dis-tinction to aid analyzing the differences between experiments.Beside the global lattice comparison,we will also investigate the local structures of the lattices.These local struc-tures,or sublattices,can be obtained by context decomposition or lattice decomposition.A study of sublattices may assist identi?cation of particular biological pathways or substruc-

July15,200612:7Proceedings Trim Size:9.75in x6.5in Choi-Microarray

9

Figure4.Distance to Control sample at6hours(the reference sample).Each data point represents the distance from a sample(Control shown in red,Flu shown in green)to the reference sample.There are small distances from other Control samples to the reference sample.And there is a clear separation between Flu samples from the Control sample at6hours.

tures of functional importance.

References

1.J Besson,C Robardet,J-F Boulicaut.Constraint-Based Mining of Formal Concepts in Transac-

tional Data.PAKDD2004:615-624.

2.J Besson,C Robardet,J-F Boulicaut,S.Rome.Constraint-based concept mining and its applica-

tion to microarray data analysis.Intell.Data Anal.9(1):59-82(2005).

3.P.Carmona-Saez,R.D.Pascual-Marqui,F.Tirado,J.M Carazo and A.Pascual-Montano.Biclus-

tering of gene expression data by non-smooth non-negative matrix factorization.BMC Bioinfor-

matics2006,7:78.

4.V.Choi.Faster Algorithms for Constructing a Galois/Concept Lattice.Submitted to CLA’06.

Available at arXiv:cs.DM/0602069.

5.R.C.Deonier,S.Tavare,https://www.sodocs.net/doc/b618057245.html,putational Genome Analysis:An Introduction.

Springer,2005.

6. A.Gattiker,E.Gasteiger and A.Bairoch.ScanProsite:a reference implementation of a PROSITE

scanning tool Applied Bioinformatics1:107-108(2002).

7. B.Ganter,R.Wille.Formal Concept Analysis:Mathematical Foundations.Springer Verlag,1996

(Germany version),1999(English version).

8.LL Hsiao,F Dangond,T Yoshida,R Hong,RV Jensen,J Misra,et al.A compendium of gene

expression in normal human tissues.Physiol Genomics.20017:97–104.

July15,200612:7Proceedings Trim Size:9.75in x6.5in Choi-Microarray 10

Figure5.Distance to Control at each time point.There is a clear separation between Flu samples from Control samples regardless which time point of Control is taken as the reference..

9.N.Hulo,A.Bairoch,V.Bulliard,L.Cerutti,et al.The PROSITE database.Nucleic Acids Res.

34:D227-D230(2006).

10. A.Kjersti.Microarray data mining:a survey.SAMBA/02/01.Availalble at

http://nr.no/?les/samba/smbi/microarraysurvey.pdf

https://www.sodocs.net/doc/b618057245.html,m,K.Duca.Mouse gene expression data(Unpublish data).

12.S.C.Madeira and A.L.Oliveira.Biclustering Algorithms for Biological Data Analysis:A Sur-

vey.IEEE/ACM Transactions on Computational Biology and Bioinformatics,VOL1,NO.1,

pp.24-45January-March2004.

13.G.Piatetsky-Shapiro and P.Tamayo.Microarray Data Mining:Facing the Challenges.SIGKDD

Explorations,Dec2003.

14. D.P.Potter.A combinatorial approach to scienti?c exploration of gene expression data:An

integrative method using Formal Concept Analysis for the comparative analysis of microarray

data.Thesis dissertation,Department of Mathematics,Virginia Tech,August2005.

15.R Shyamsundar,YH Kim,JP Higgins,K Montgomery,M Jorden,et al.A DNA microarray

survey of gene expression in normal human tissues.Genome Biol2005,6:R22.

16.RB.Stoughton.Applications of DNA Microarrays in Biology.Annu Rev Biochem.2004

17.M.J.Zaki,M.Ogihara.Theoretical foundations of association rules.Proc.3rd ACM SIGMOD

Workshop on Research Issues in Data Mining and Knowledge Discovery,p1–7,1998.

18.A Formal Concept Analysis Homepage.https://www.sodocs.net/doc/b618057245.html,/fca/fca.html

July15,200612:7Proceedings Trim Size:9.75in x6.5in Choi-Microarray

11

Figure6.Distance to Control from Flu/Smoking/SmokeFlu.

相关主题