BRAINMAP: Preliminary Studies for a Navigation Support System to Explore Graphs of Document Correlations

Luís F. S. Teixeira¹, Rita A. Ribeiro², Gabriel P. Lopes¹ and Ricardo Raminhos³
¹ CITI, Dep. Informática, FCT/UNL, 2829-516 Caparica, Portugal
² CA3-Uninova, Campus FCT/UNL, 2829-516 Caparica, Portugal
³ ViaTecla, Estrada da Algazarra, 72, 2810-013 Almada, Portugal

[email protected], [email protected], [email protected], [email protected]

Keywords: Complex Networks, Weighted Network, Navigation Support System, Graphs, Document Correlations, Jaccard Similarity Metric, Cosine Similarity Metric.

Abstract: Today's overwhelming amount of information available inside companies and corporations can make searching and browsing for a specific topic within a large collection of documents a very hard job. It is therefore of paramount importance to develop tools that ease the retrieval of specific information and support users exploring corporate intranets (composed of several hundreds of gigabytes of documents). In this work we present preliminary studies aimed at building a navigation support system to explore graphs of document correlations, using concepts from the weighted complex network field to correlate documents based solely on their unstructured textual content.

    1 INTRODUCTION

Nowadays, corporate intranets are gaining more and more importance in the corporate work environment by providing easier exchange of information and knowledge and a faster learning experience. This exchange of knowledge and information may improve workers' efficiency and consequently increase a company's competitive advantage.

In this work we present the results of our preliminary studies towards building a navigation support system to explore graphs of document correlations. Our proposed approach aims at allowing users to search a corporate intranet and at presenting related information in a way that lets them assess the relevance and similarity of different documents or pages and the relations between them.

The rest of this paper is structured as follows: section two describes our design approach and implementation methodology, with an illustrative example of how the method works; in section three we present preliminary results and their evaluation; in section four we present related work; and in section five we present our conclusions and future work. Acknowledgements appear in the last section.

2 NSS PROPOSED DESIGN

The objective of the Navigation Support System (NSS) under development is to allow users to explore information in an intranet and freely mine the correlations among the documents it contains. To develop a versatile NSS we propose a method with four general steps, briefly described below.

1. Data preparation: Company or corporation documentation comes in a large variety of formats, ranging from emails to project reports, study reports, etc. To deal with this, normalization is required: all documentation is transformed into plain text (txt) files.

2. Document representation: Documents have to be represented in some form; we represent each of them as a bag of words.

3. Calculate the weights of document words: To be able to mine any kind of relation between documents, their words (as we are using the bag-of-words representation) must be weighed in some way.


4. Mine document correlations: Our approach to this step is to mine document relations where they do not explicitly exist, based solely on the documents' textual content, requiring no semantic knowledge, and applying similarity metrics.

Now let's go into greater detail on how to deal with the steps mentioned above. Step one is relatively straightforward to address: any programming language allows us to transform documents in a given format into a simple text file. Apache Tika, for example, is a Java toolkit for extracting content from a variety of document formats. Step two tackles document representation. We used the bag-of-words representation of the documents in the collection (corpus) under study, as it is frequently used in works of this nature (Hulth and Megyesi, 2006). In step three, document words are scored, assigning the best scores to the best words, i.e., the best topic descriptors of the document content receive the best scoring values. According to the work of (Luís F. S. Teixeira et al., 2012), the best metrics to accomplish this are the ones based on Tf-Idf (Term Frequency - Inverse Document Frequency) and Phi-Square.
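For illustration, the snippet below is a minimal sketch of step one using the tika Python bindings for Apache Tika (our choice of bindings; the paper only names the Java toolkit), with a hypothetical file name:

```python
# A minimal sketch of step 1 (data preparation), assuming the `tika`
# Python bindings for Apache Tika are installed (pip install tika).
from tika import parser

def to_plain_text(path):
    """Extract the textual content of a document in any Tika-supported format."""
    parsed = parser.from_file(path)  # dict with 'content' and 'metadata' keys
    return (parsed.get("content") or "").strip()

# Hypothetical usage: normalize one document of the collection.
# text = to_plain_text("project_report.pdf")
```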

The Tf-Idf metric, used by (Salton and Buckley, 1987), gives relevance to a term occurring in a document in relation to the other terms in that document, as well as to those in the other documents belonging to the corpus. It is composed of two parts, Tf (Term Frequency) and Idf (Inverse Document Frequency). Formally, Tf-Idf for a term t in a document d_j may be defined as in equations (1), (2) and (3).

$\mathrm{TfIdf}(t, d_j) = P(t, d_j) \cdot \mathrm{Idf}(t)$    (1)

$P(t, d_j) = \dfrac{\mathrm{freq}(t, d_j)}{N_{d_j}}$    (2)

$\mathrm{Idf}(t) = \log \dfrac{\lVert D \rVert}{\lVert \{ d_j : t \in d_j \} \rVert}$    (3)

Notice that, in (1), instead of using the usual term frequency factor we use the probability $P(t, d_j)$, whose definition is shown in equation (2). There, $\mathrm{freq}(t, d_j)$ denotes the frequency of term t (word) in document d_j, and $N_{d_j}$ refers to the number of words contained in document d_j. The total number of documents included in the corpus is given by $\lVert D \rVert$. The use of a probability in (1) normalizes the Tf-Idf metric, making it independent of the size of the document under consideration.
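As a minimal sketch of equations (1)-(3), assuming each document has already been reduced to a bag of words, one could compute the normalized Tf-Idf scores as follows (all names are ours):

```python
import math
from collections import Counter

def tf_idf_matrix(corpus):
    """Return one {term: Tf-Idf score} dict per document.
    Uses P(t, d) = freq(t, d) / N_d (eq. 2) and
    Idf(t) = log(||D|| / ||{d : t in d}||) (eq. 3)."""
    n_docs = len(corpus)                      # ||D||
    doc_freq = Counter()                      # in how many documents each term occurs
    for doc in corpus:
        doc_freq.update(set(doc))
    matrix = []
    for doc in corpus:
        n_words = len(doc)                    # N_d, normalizes for document size
        freq = Counter(doc)
        matrix.append({t: (f / n_words) * math.log(n_docs / doc_freq[t])
                       for t, f in freq.items()})
    return matrix
```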

The Phi-Square metric (Everitt and Skrondal, 2010) is a variant of the well-known Chi-Square measure, allowing a normalization of the results obtained with Chi-Square, and is given by the following expression:

$\phi^2 = \dfrac{1}{M} \cdot \dfrac{N \, (AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}$    (4)

Here M represents the total number of terms present in the corpus (i.e., the sum of the number of terms of each document belonging to the collection). The letter A represents the number of times term t and document d co-occur; B the number of times term t occurs outside document d; C the number of times document d occurs without term t; D the number of times neither document d nor term t occurs; and N the total number of documents. At this moment experiments are done with Tf-Idf; Phi-Square is also planned to be used and the results compared.
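A sketch of equation (4), under our reading of the counts A-D described above (the counts themselves would come from the frequency machinery discussed next):

```python
def phi_square(A, B, C, D, N, M):
    """Equation (4): Phi-Square, a Chi-Square variant normalized by M.
    A: co-occurrences of term t and document d; B: occurrences of t
    outside d; C: occurrences of d without t; D: occurrences of neither;
    N: total number of documents; M: total number of terms in the corpus."""
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    if denominator == 0:
        return 0.0
    return (N * (A * D - C * B) ** 2 / denominator) / M
```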

In order to count the frequencies of the words in the documents, and therefore in the corpus, we used a suffix array (Yamamoto and Church, 2001). The use of a suffix array allows us to rapidly quantify word frequency in documents and clearly identify which document each occurrence comes from, enabling us to calculate the Tf-Idf and Phi-Square values.
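As an illustration only (a word-level toy, not the authors' implementation; Yamamoto and Church (2001) cover the efficient all-substrings case), a suffix array over a token list can answer frequency queries by binary search:

```python
def build_suffix_array(tokens):
    # Toy O(n^2 log n) construction; real systems use faster algorithms.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def term_frequency(tokens, sa, term):
    """Occurrences of a single word, found by binary search for the
    contiguous block of suffixes whose first token equals `term`."""
    def bound(strict):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            first = tokens[sa[mid]]
            if first < term or (strict and first == term):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return bound(strict=True) - bound(strict=False)

# tokens = "health care and health policy".split()
# sa = build_suffix_array(tokens)
# term_frequency(tokens, sa, "health")  # -> 2
```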

The scoring of words has two purposes. The first is to allow a user to query for a word and get a list of documents ranked by the score of the searched word in each document where it appears. The second is that word scoring allows the creation of a graph that represents a network of correlated documents. That is the fourth and final step, and it is described in the following sub-section.

    2.1 The Core of NSS

The fourth and final step is the core of the NSS: generating relationships between documents based on the scores of their words. To clarify the concepts involved in this step, we first introduce the notions of crisp and fuzzy relations and their application to graph theory and complex networks. In the studied literature, the terms network and graph are used interchangeably; in this work we follow the definition presented in (Mihalcea and Radev, 2011), where network indicates a relationship between objects, in our case free-text documents, and graph a relationship generated through an automatic process.


According to (Zimmermann, 2001), crisp relations are relations of a dichotomous type, as in binary logic, where a statement can be true or false and nothing else. The same applies in classical set theory, where an element either belongs to a set or not. The main difference of fuzzy or weighted graphs compared with crisp ones is that any element belongs to a set with a membership value within the interval [0,1]. Further, from a crisp or fuzzy bipartite graph it is possible to create fuzzy weighted graphs where each element has a degree of belongingness.

Other important concepts involved in this step are graphs and weighted graphs. A graph (Gross and Yellen, 2004) is a Boolean relation between two objects or entities of the same type. Formally, we define $G = (V, E)$ as a finite relation between entities, where $V$ is the set of nodes $\{v_1, \ldots, v_n\}$ and $E$ is the set of arcs $\{e_{ij}\}$, in which $e_{ij}$ represents the arc between nodes $v_i$ and $v_j$. Further, each arc takes either the value 0 or 1, meaning the existence or not of an arc: $e_{ij} \in \{0, 1\}$.

Conversely, a weighted graph (Lee, 2005) is defined as $G = (V, E)$, where $V$ is the set of nodes $\{v_1, \ldots, v_n\}$ and $E$ is the set of arcs $\{e_{ij}\}$ in the relation $V \times V$. In this case each $e_{ij}$ takes a value between 0 and 1, $e_{ij} \in [0, 1]$, these values being the weights of the arcs, which express the level of correlation of each relation.

At this moment we need to introduce the notion of similarity or proximity between elements of a corpus, to learn how related they are and to construct the weighted graph. Similarity measures between objects, in our case documents of unstructured content, can be classified as geometric or set-theoretical (Miyamoto, 1990). Geometric measures are based on distance measures and represent proximity between objects. Set-theoretical similarity measures, instead, are based on operations such as union and intersection and translate the degree to which two objects are equal. One of the most used set-theoretical similarity measures is the Jaccard index or Jaccard similarity measure (Miyamoto, 1990),

defined by:

$J(X, Y) = \dfrac{\lvert X \cap Y \rvert}{\lvert X \cup Y \rvert}$    (5)

Since the Jaccard similarity measure is flexible, simple to implement, and easily generalized to weighted graphs (also called fuzzy graphs), we chose it; but any other similarity measure could have been used (see, for example, (Shyi-Ming et al., 1995) for an overview of similarity measures). Most correlation operations for determining the relationships between documents use either the cosine function (Amit et al., 1996) or the Jaccard metric (Sharma et al., 2007). An overview of similarity measure functions for document correlation can also be found in the work of (Spertus et al., 2005).

Further, by using the minimum (min) and maximum (max) operators from the T-norms (Zimmermann, 2001), the generalization of the Jaccard measure to the fuzzy interval [0,1] is done with the following formulation (Rocha et al., 2005):

$J(d_i, d_j) = \dfrac{\sum_k \min(w_{ik}, w_{jk})}{\sum_k \max(w_{ik}, w_{jk})}$    (6)

Here the indexes $i$ and $j$ stand for rows of the Tf-Idf matrix (rows stand for documents). $w_{ik}$ denotes the value in column $k$ (the Tf-Idf score of a word) of row $i$ (a document); the definition of $w_{jk}$ is similar, but for row $j$ (another document).

In summary, in this step we start with the initial Tf-Idf matrix, which holds the scores of the words per document, and by using the generalized Jaccard similarity (equation 6) we construct the correlation matrix, which corresponds to the fuzzy (weighted) network. At this moment experiments are done with the Jaccard metric; the use of, and comparison with, the cosine metric is planned.
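A minimal sketch of equation (6) and of the construction of the correlation matrix from the Tf-Idf matrix (names are ours; rows are assumed to share the same word-column order):

```python
def fuzzy_jaccard(row_i, row_j):
    """Generalized (fuzzy) Jaccard of equation (6): the sum of
    column-wise minima over the sum of column-wise maxima."""
    numerator = sum(min(a, b) for a, b in zip(row_i, row_j))
    denominator = sum(max(a, b) for a, b in zip(row_i, row_j))
    return numerator / denominator if denominator else 0.0

def correlation_matrix(tfidf_rows):
    """One fuzzy Jaccard value per document pair: the adjacency
    matrix of the fuzzy (weighted) network."""
    return [[fuzzy_jaccard(r_i, r_j) for r_j in tfidf_rows]
            for r_i in tfidf_rows]
```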

2.2 Illustrative Example

In this sub-section we illustrate how the proposed system would work for a corpus of two documents. Consider two documents composed of the following bags of words.

Doc1 = [word1, word2, word3, word4]; Doc2 = [word1, word2, word3, word4].

Then we generate a matrix containing the Tf-Idf values of the words in their documents.

Table 1 - The Tf-Idf matrix for the example documents.

            word1   word2   word3   word4
    Doc 1   0.7     0.3     0.5     0.4
    Doc 2   0.6     0.8     0.4     0.6

Now we want to find the Jaccard value between Doc 1 and Doc 2 in Table 1. To calculate the correlation between rows 1 and 2, we start by setting i = 1 and j = 2 in equation (6), determining the maximum and minimum for each word.


For instance, in column 1 (word1) the maximum of (0.7, 0.6) is 0.7 and the minimum is 0.6. We do this for all words and then, using equation (6), we calculate the Jaccard measure:

$J(\mathrm{Doc}_1, \mathrm{Doc}_2) = \dfrac{0.6 + 0.3 + 0.4 + 0.4}{0.7 + 0.8 + 0.5 + 0.6} = \dfrac{1.7}{2.6} \approx 0.65$

The result of 0.65 depicts the correlation between Doc 1 and Doc 2 using the Jaccard metric: Doc 1 is similar to Doc 2 to a degree of 0.65. Notice that 0 means completely dissimilar and 1 means total similarity (the latter only happens when correlating a document with itself).
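Plugging the rows of Table 1 into the fuzzy_jaccard sketch from section 2.1 reproduces this value:

```python
doc1 = [0.7, 0.3, 0.5, 0.4]                  # Table 1, Doc 1
doc2 = [0.6, 0.8, 0.4, 0.6]                  # Table 1, Doc 2
print(round(fuzzy_jaccard(doc1, doc2), 2))   # 1.7 / 2.6 -> 0.65
```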

    3 RESULTS AND EVALUATION

The results presented were obtained on a randomly selected subset of documents from the Reuters corpus (Lewis et al., 2004). For a first experiment, as there are several topic classes (above 120) within this corpus, a random class selector was developed in order to give us five classes from the corpus. Each of the selected classes is originally composed of several thousands of documents. Due to time limitations, only a subset of the documents of each class was used (150 in this paper).

Experiments were performed as follows. After treating and transforming the XML files into plain text files, query words were selected randomly, and one of them ("health") was chosen to perform a search in the system. All words with length greater than three characters were processed and used; we did not use any stop-word list. Then, querying the system with the selected query word ("health"), a list ranked by the importance of the query word is returned.

For each ranked list returned, the first 4 documents of that list were chosen, and then, for each of these documents, the 10, 20 and 30 most correlated documents were retrieved by the system. For evaluating the results obtained, we chose to use the precision metric.

In this work we consider precision as the intersection between the documents retrieved by the system (correlated documents) and the documents from the same class as the document selected from the initial ranked list, divided by the number of retrieved correlated documents returned by the system. In the following formulation, read #relv.docs as the number of relevant documents, i.e., the number of correlated documents from the same class as the document selected from the ranked list, and #retr.docs as the number of retrieved correlated documents.

$\mathrm{Precision} = \dfrac{\#\mathrm{relv.docs}}{\#\mathrm{retr.docs}}$    (7)
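Equation (7) amounts to the following check over the retrieved set (a sketch; names are ours):

```python
def precision(retrieved, same_class):
    """Equation (7): the fraction of retrieved correlated documents
    that share the class of the selected document."""
    retrieved, same_class = set(retrieved), set(same_class)
    return len(retrieved & same_class) / len(retrieved) if retrieved else 0.0

# E.g., first row of Table 4: all 10 of the 10 most correlated
# documents share the class, so precision is 10 / 10 = 1.0.
```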

In Table 2 we can see the names of the four best ranked documents (column Documents) and the classes (column Class of the document) each document is assigned to in the Reuters corpus. In this case each document has exactly one class assigned. In the third column (#Docs in class), we show the number of documents that compose those classes.

Table 2 - Class information about the returned ranked-list documents for the search "health".

    Documents           Class of the document   #Docs in class
    11373newsML.txt     Economics               150
    108664newsML.txt    Health                  150
    101083newsML.txt    Health                  150
    14095newsML.txt     Health                  150

In Table 3 we show the number of correlated documents that belong to the same class as the document selected from the ranked list (column Documents). The correlated documents are returned when consulting the graph resulting from the analysis of the Jaccard matrix. The column #10 depicts the number of documents of the same class as the one in column Documents among the 10 most correlated documents to the one selected from the ranked list. An equivalent reading can be made for columns #20 and #30.

Table 3 - Number of correlated documents that belong to the same class as the selected document, for the 10, 20 and 30 most correlated documents.

    Documents           #10   #20   #30
    11373newsML.txt     10    16    18
    108664newsML.txt    3     6     10
    101083newsML.txt    5     8     9
    14095newsML.txt     8     11    15

As expected, for each of the initially selected documents the number of correlated documents in the same class grows as we observe more related documents retrieved. The precision obtained for these results can be seen in the following table.


In Table 4 we present the precision obtained, where we can see that it decreases as the number of correlated documents returned increases. Notice also that the precision for the 10 and 20 most correlated documents is, on average, above 0.5.

Table 4 - Precision results for the correlated documents that belong to the same class as the selected document, for the 10, 20 and 30 most correlated documents.

    Documents           Prec.#10   Prec.#20   Prec.#30
    11373newsML.txt     1          0.80       0.60
    108664newsML.txt    0.3        0.3        0.33
    101083newsML.txt    0.5        0.4        0.3
    14095newsML.txt     0.8        0.55       0.5
    Avg.                0.65       0.51       0.43

    4 RELATED WORK

In (Viji, 2002), term and document correlation is addressed using similarities between terms and documents. The author uses term vectors to represent web documents resulting from a query to a search engine, then computes the cosine between the term vectors of two different web documents; the resulting angle gives the degree of similarity between the documents. This is then used to produce a spring-based physical model for projecting the correlation between documents in space.

In (Xiangfeng et al., 2008), the topic of creating semantic relations between documents is also addressed. In this case the authors first create a domain knowledge background of association rules at the keyword level, then apply those association rules to generate and calculate the documents' semantic relations and their strengths at the document level. In (Klose et al., 2000), document correlation is established using a self-organizing map (SOM) to cluster documents by topic.

Our work mainly differs from the above in that we perform searches within an intranet using unstructured text documents. Moreover, with the development of an NSS in mind, it will be designed and implemented with the user as the basis of the design, i.e., the user will not have to adapt to the interface; the interface will be designed to suit the user.

There have been attempts at visualizing the documents retrieved in a search result set in an intuitive manner. This visualization is taken into consideration by (Viji, 2002) and (Klose et al., 2000). In the first work, the author projects a graph of correlated web documents in space; in the second, the authors represent the relationships between documents as a SOM (self-organizing map), visually allowing the user to observe the documents within a cluster and their neighbors.

5 CONCLUSIONS AND FUTURE WORK

The results obtained in the experiments were not entirely encouraging, in the sense that precision was not as high as we were expecting, although the average precision for the 10 and 20 most correlated documents was above 0.50.

The precision values for the document 11373newsML.txt must be noticed: all of its 10 most correlated documents belong to the same class, giving the maximum precision possible in the experiment. This result is noteworthy considering it concerns a text classified as dealing with economic matters, after a query on "health". Visual inspection of the correlated documents returned by the system, despite the not-so-high precision, showed similar contents that would allow a user to navigate through the graph and possibly reach a document that otherwise would not be found, e.g., one that does not contain the initially searched word. It must be said that a better planned validation task must be performed. Instead of querying the system with a random word, we might have selected the words with the greatest scores (Luís F. S. Teixeira et al., 2012) from the corpus documents.

As future work, we plan to use the Phi-Square metric with the purpose of comparing its results with the ones obtained with Tf-Idf, using the Jaccard similarity metric. We also plan to apply the cosine similarity metric instead of Jaccard and compare the results obtained with both statistical metrics.

It is our objective to develop a functional prototype application that proves the concept of the Navigation Support System, supporting corporate intranet users in searching for keywords and/or documents, and also enabling the user to navigate across a network of related documents. We believe this type of application can be very useful for exploring and finding knowledge and information within the massive databases of corporations.


Moreover, it can help novices learn about the business in a faster and more user-friendly way. The development of this prototype will be accomplished using the most recent ergonomics metrics in the area of software tool development, to improve user experience and efficiency. We will also consider new media, such as touch-enabled devices, a new challenge for designing user interfaces without losing any functionality. In line with user experience, and with the objective of producing fully usable interfaces, studies such as user surveys and user experience monitoring will be taken into account.

Other test scenarios will be taken into consideration. For instance, in the development of code following an aspect-oriented methodology, when a developer has to find where in the development workspace a particular aspect (e.g., security) is used, or where this aspect has impact, a navigation support system as we propose can be an added value. A lawyer's office would also provide a good corpus to test our approach.

Another future approach to this study is to help select correlated documents to aid in document translation tasks, by selecting correlated documents that are already translated and not necessarily parallel to the current document (i.e., not translations of each other).

    ACKNOWLEDGEMENTS

The work on this paper was developed in the context of the project BrainMap, led by the company ViaTecla (Portugal) in collaboration with UNINOVA and the University of Évora (both in Portugal). This project is financed by the Portuguese QREN (Quadro de Referência Estratégico Nacional), Programa Operacional de Lisboa.

This work was also supported by the Portuguese Foundation for Science and Technology (FCT/MCTES) through the funded research projects ISTRION (ref. PTDC/EIA-EIA/114521/2009) and VIP-ACCESS (ref. PTDC/PLP/71142/2006).

    REFERENCES

AMIT, S., CHRIS, B. & MANDAR, M. 1996. Pivoted document length normalization. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Zurich, Switzerland: ACM.

EVERITT, B. S. & SKRONDAL, A. 2010. The Cambridge Dictionary of Statistics. Cambridge University Press.

GROSS, J. L. & YELLEN, J. 2004. Handbook of Graph Theory (Discrete Mathematics and Its Applications). CRC Press.

HULTH, A. & MEGYESI, B. B. 2006. A study on automatically extracted keywords in text categorization. Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. Sydney, Australia: Association for Computational Linguistics.

KLOSE, A., NÜRNBERGER, A., KRUSE, R., HARTMANN, G. & RICHARDS, M. 2000. Interactive text retrieval based on document similarities. Physics and Chemistry of the Earth, Part A: Solid Earth and Geodesy, 25, 649-654.

LEE, K. 2005. Fuzzy Graph and Relation. First Course on Fuzzy Theory and Applications. Springer Berlin / Heidelberg.

LEWIS, D. D., YANG, Y., ROSE, T. G. & LI, F. 2004. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5, 361-397.

LUÍS F. S. TEIXEIRA, GABRIEL PEREIRA LOPES & RITA A. RIBEIRO. 2012. An Extensive Comparison of Metrics for Automatic Extraction of Key Terms. In: JOAQUIM FILIPE & ANA FRED, eds. Proceedings of the 4th International Conference on Agents and Artificial Intelligence (ICAART 2012), February 6-8, 2012, Algarve, Portugal. SciTePress Science and Technology Publications, 55-63.

MIHALCEA, R. & RADEV, D. 2011. Graph-based Natural Language Processing and Information Retrieval. New York, NY: Cambridge University Press.

MIYAMOTO, S. 1990. Fuzzy Sets in Informational Retrieval and Cluster Analysis. Springer.

ROCHA, L. M., SIMAS, T., RECHTS, A., GIACOMO, M. D. & LUCE, R. E. 2005. MyLibrary at LANL: proximity and semi-metric networks for a collaborative and recommender Web service. In: The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2005). IEEE Press.

SALTON, G. & BUCKLEY, C. 1987. Term Weighting Approaches in Automatic Text Retrieval. Cornell University.

SHARMA, A., PUJARI, A. K. & PALIWAL, K. K. 2007. Intrusion detection using text processing techniques with a kernel based similarity measure. Computers & Security, 26, 488-495.


SHYI-MING, C., MING-SHIOW, Y. & PEI-YUNG, H. 1995. A comparison of similarity measures of fuzzy values. Fuzzy Sets and Systems, 72, 79-89.

SPERTUS, E., SAHAMI, M. & BUYUKKOKTEN, O. 2005. Evaluating similarity measures: a large-scale study in the Orkut social network. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. Chicago, Illinois, USA: ACM.

VIJI, S. 2002. Term and Document Correlation and Visualization for a Set of Documents. Stanford University.

XIANGFENG, L., GUONING, L. & SHIJUN, L. 2008. Generating Associated Relation between Documents. In: High Performance Computing and Communications, 2008 (HPCC '08), 10th IEEE International Conference on, 25-27 Sept. 2008, 831-836.

YAMAMOTO, M. & CHURCH, K. W. 2001. Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus. Computational Linguistics, 27, 1-30.

ZIMMERMANN, H.-J. 2001. Fuzzy Set Theory and its Applications. Springer.