WWW.THESES.XLIBX.INFO
FREE ELECTRONIC LIBRARY - Theses, dissertations, documentation
 

«Abstract In Cross-Language Information Retrieval (CLIR), queries in one language retrieve relevant documents in other languages. Machine-Readable ...»


3.3 Two-Phase Method

To reduce the ambiguity of the EM method while loosening the inherent restrictions of the FM method, we introduce a method for Arabic-English CLIR that uses some, but not all, of the translations of a given Arabic term. The underlying assumption behind the Two-Phase method is that $f^{-1}(f(x)) = x$; namely, the translation of the translation of a term should yield the original term. If this condition holds, the translation is valid and does not introduce drift or noise.

Let A represent the original Arabic terms.

Let E represent the translated English terms of A using the Every-Match method.

Let A’ represent the translated Arabic terms of E using the Every-Match method.

Then, the Two-Phase method can be implemented as follows.

Translate the original Arabic terms A into the English terms E using the Every-Match method via an Arabic-English machine-readable dictionary.

Translate the English terms E into the Arabic terms A’ using the Every-Match method via an English-Arabic machine-readable dictionary.

Return the original Arabic terms A and the translated Arabic terms A’ to their infinitive form.

A candidate English term in E is one that yields its original Arabic term, based on the comparison between A and A’.

In the rare case when the original terms do not yield a candidate translation term, the following modification is incorporated into the algorithm:

1. If an English term in E does not yield its original Arabic term in A, find the synonyms of the English term and translate them using the Every-Match method; each synonym whose translation matches the original Arabic term in A is selected as a candidate translation.

2. If neither the English term nor its synonyms yield the original term, use the first-match term in E as the candidate translation.

The Two-Phase method is illustrated in Figure 1. Circled terms are assumed to be the most appropriate translations of the original Arabic terms. For term a2 in the original Arabic terms, the English translations are e3 and e4. Neither e3 nor e4 yields the Arabic term a2 in the original Arabic term set. To overcome this situation, we find the synonyms of each term that does not yield the original term after the second phase. For example, because e3 does not yield the original Arabic term a2, the synonyms of e3 are translated into Arabic, and every synonym that yields the original Arabic term is chosen as a candidate translation. The First-Match approach is applied when no synonym yields the original Arabic term.
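The round-trip check and its two fallbacks can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments: the function names, the `normalize` hook (standing in for the reduction to infinitive form), and the toy dictionaries with symbolic terms such as a2, e3, and s3 are all hypothetical.

```python
def every_match(term, dictionary):
    """Every-Match: return all translations listed for a term."""
    return list(dictionary.get(term, []))


def two_phase(term, ar_en, en_ar, synonyms=None, normalize=lambda t: t):
    """Return candidate English translations of one Arabic term.

    ar_en / en_ar are hypothetical dict-based dictionaries;
    normalize stands in for the reduction to infinitive form.
    """
    synonyms = synonyms or {}
    target = normalize(term)
    english = every_match(term, ar_en)  # phase 1: A -> E

    def round_trips(en_term):
        # phase 2: E -> A', then compare A' with the original term
        return any(normalize(a) == target for a in every_match(en_term, en_ar))

    candidates = [e for e in english if round_trips(e)]
    if candidates:
        return candidates

    # Modification 1: try the synonyms of the failed English terms.
    for e in english:
        for s in synonyms.get(e, []):
            if round_trips(s) and s not in candidates:
                candidates.append(s)
    if candidates:
        return candidates

    # Modification 2: fall back to the first match in E.
    return english[:1]


# Toy data with symbolic terms (all hypothetical).
AR_EN = {"a1": ["e1", "e2"], "a2": ["e3", "e4"], "a3": ["e5"]}
EN_AR = {"e1": ["a1"], "e2": ["a9"], "e3": ["a7"], "e4": ["a8"],
         "e5": ["a6"], "s3": ["a2"]}
SYNONYMS = {"e3": ["s3"]}

for arabic_term in ("a1", "a2", "a3"):
    print(arabic_term, two_phase(arabic_term, AR_EN, EN_AR, SYNONYMS))
```

In the toy data, a1 keeps only the translation that round-trips, a2 is resolved through a synonym of e3, and a3 falls back to its first match.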


In Table 4, we illustrate an example of the original Arabic query (مصباح ضوئي وهاج), as translated by the Two-Phase method.


Table 4. Terms of the original Arabic query and the Two-Phase (TP) technique

As shown in Tables 1 and 4, the Two-Phase method removes 13 terms from all possible translations found in the dictionary.

The term "burner" results from translating the original Arabic term (ضوئي) using the machine-readable dictionary. This term is a noise term, since it is irrelevant to the original query. Similarly, the terms "brightness gleam glow illumination white-hot brilliant bright resplendent dazzling glittering glistening sparkling" are filtered out, reducing the number of extraneous terms. The terms retained by the TP method are a subset of the terms translated by the EM method. There are overlaps between the term translations of the FM and TP methods: in most cases, the first match in the dictionary is retained in the translation process of the TP method. The average query length after translation using the TP method is 6 and 12 terms for TREC7 and TREC9, respectively.

3.4 Experimental Approach

Some of the Arabic complexities that impact the query term translation are described in Section 3.4.1. In Section 3.4.2, we describe the resources that we used to conduct the experiments.

3.4.1 Pre-processing of the Source Arabic Terms

Unlike in English, nouns in Arabic can be masculine or feminine. Nouns can be definite, as in (المعلم), or indefinite, as in (معلم); adding the prefix (ال) makes the difference. Plurals in Arabic are of three kinds: the masculine plural, the feminine plural, and the broken plural. The plural is formed via suffixes or via pattern modification of the noun. In the first case, the suffix ~een for the accusative and genitive (معلمين) or ~oon for the nominative (معلمون) is appended to the masculine noun, while ~aat (معلمات) is appended to form the feminine plural. The letter "h" is attached to the end of the word to form the singular feminine noun (معلمة). The dual is formed by adding "ان" or "ين" at the end of the noun, as in (معلمان). In the third case, often referred to as broken plurals, the pattern of the singular noun is dramatically altered. These plurals can be recognized from their patterns; there are 27 patterns for most of the broken plurals.

Another kind of suffixation involves the personal pronouns. A personal pronoun can appear in an isolated form or as a suffix attached to a noun, verb, or preposition. Certain suffixes are attached to the end of words to make them possessive. The attached suffix can be one letter; for example, in (بيتي) the letter "ي" is attached to the end of the word (بيت) to form "my house" in English. For the plural, two letters are attached to the end of the word: the letters "هم" for the masculine (بيتهم) and the letters "هن" for the feminine (بيتهن). These are the most common modifications to Arabic words.





Dictionaries do not store every form of regular words. Most dictionary entries are stored in singular form, except for words that are usually used in the plural, such as (كماليات), which means "luxuries" in English. Verbs are stored in the perfect form.

Therefore, before matching the Arabic terms in the dictionary, some of the nouns must be returned to their singular form by removing all suffixes and prefixes. The procedure of removing the affixes is performed when the process of matching fails to find the source terms in the dictionary.
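A minimal sketch of this lookup-then-strip procedure is given below. It is an illustration under stated assumptions rather than the pre-processing actually used: the affix lists cover only the prefix and suffixes mentioned in this section, the function names are hypothetical, and the three-entry dictionary is toy data. Broken plurals, which change the word pattern, are not handled.

```python
# Affixes mentioned in this section only (an assumed, incomplete list).
PREFIXES = ("ال",)
SUFFIXES = ("ون", "ين", "ات", "ان", "هم", "هن", "ة", "ي")


def candidate_stems(term):
    """Return the term itself first, then progressively stripped forms."""
    stems = [term]
    bases = [term]
    for prefix in PREFIXES:
        if term.startswith(prefix) and len(term) - len(prefix) >= 2:
            bases.append(term[len(prefix):])
    for base in bases:
        if base not in stems:
            stems.append(base)
        for suffix in SUFFIXES:
            if base.endswith(suffix) and len(base) - len(suffix) >= 2:
                stem = base[:-len(suffix)]
                if stem not in stems:
                    stems.append(stem)
    return stems


def lookup_with_affix_stripping(term, dictionary):
    """Try an exact match first; strip affixes only if that fails."""
    for stem in candidate_stems(term):
        if stem in dictionary:
            return stem, dictionary[stem]
    return None


# Toy dictionary (hypothetical entries for illustration).
TOY_DICTIONARY = {
    "معلم": ["teacher"],
    "بيت": ["house"],
    "كماليات": ["luxuries"],  # stored in the plural, so it matches as-is
}

for word in ("المعلمون", "بيتهم", "كماليات"):
    print(word, lookup_with_affix_stripping(word, TOY_DICTIONARY))
```

Because the unmodified term is always tried first, entries that the dictionary stores in the plural are matched before any suffix is removed.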

To conduct the Two-Phase method as described in Section 3.3, the verbs are returned to their infinitive form. The infinitive form is a noun that is derived from the verb without reference to time. In our example, it becomes (كتابة), "writing" in English. This infinitive form is used as the basis of comparison in the Two-Phase method.

Similarly, for English-Arabic CLIR, the source English queries are normalized to match them in the dictionary. For example, the terms “performing” and “performance” are normalized to “perform”.

3.4.2 Experimental Environment

For Arabic-English CLIR, we evaluated all three dictionary-based approaches using our search engine AIRE (Chowdhury et al., 2000) on both the commonly used 2 GB subset of the TIPSTER (Disks 4 and 5) collection and the 10 GB web data from TREC.

Each of these collections contains over 500,000 documents. To obtain a set of standard test queries, we manually translated the English TREC topics into Arabic TREC topics.

The collections used in the Text Retrieval Conference (TREC) have three distinct parts: the documents, the topics, and the relevance judgments. For queries, we used a human translation of the TREC7 (topics 351-400) and TREC9 (topics 451-500) queries as our original Arabic queries. Since in practice most queries are only a few words long, we used the title representation of topics 351-400 and 451-500.

A native Arabic speaker manually translated the 100 queries from English into Arabic, and we used these translated versions as our original Arabic queries issued against the TREC English collection. The Arabic queries were translated back into English by means of dictionaries. This approach is often used in dictionary-based CLIR studies (Pirkola, 1998). To assess the effectiveness of the translated queries, our Arabic-English CLIR system compares their results to the performance of monolingual information retrieval. The dictionary provides words and some phrases as keyword entries. Phrase-based translations were used as appropriate. In the translation process, we first match the phrases in the query against the phrases in the dictionary; if a match is found, the result is retained. If not, word-by-word translation is performed by applying the proposed dictionary-based methods.
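The phrase-first strategy described above can be sketched as follows. The greedy longest-match scan, the `max_phrase_len` limit, the function names, and the symbolic toy entries are assumptions made for illustration; the text does not specify how phrases are located in the query.

```python
def translate_query(tokens, phrase_dict, translate_word, max_phrase_len=3):
    """Translate a tokenized query, preferring phrase entries.

    phrase_dict maps space-joined source phrases to translations;
    translate_word is any word-level method (EM, FM, or TP).
    """
    output = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest phrase starting at position i first.
        for n in range(min(max_phrase_len, len(tokens) - i), 1, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in phrase_dict:
                output.extend(phrase_dict[phrase])
                i += n
                matched = True
                break
        if not matched:
            # No phrase entry: fall back to word-by-word translation.
            output.extend(translate_word(tokens[i]))
            i += 1
    return output


# Toy data with symbolic terms (all hypothetical).
PHRASES = {"w1 w2": ["p12"]}
WORDS = {"w1": ["x1"], "w2": ["x2"], "w3": ["x3a", "x3b"]}


def first_match(word):
    """First-Match used here as the word-level method."""
    return WORDS.get(word, [])[:1]


print(translate_query(["w1", "w2", "w3"], PHRASES, first_match))
print(translate_query(["w2", "w1"], PHRASES, first_match))
```

Passing the word-level method as a function keeps the phrase step independent of whether Every-Match, First-Match, or Two-Phase handles the remaining words.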

For English-Arabic CLIR, we used the Arabic collection of 383,872 documents provided by the Linguistic Data Consortium (LDC). TREC provided 25 topics in three parallel languages: Arabic, English, and French. Our focus is on English-Arabic CLIR, so we used the English queries as our source queries against the Arabic collection.

The titles of the TREC Arabic topics are used.

We chose the Al-Mawrid Arabic-English and English-Arabic Dictionary (aDawliah Universal Electronics) for the translation process. Al-Mawrid is a bilingual dictionary with two sections: English-Arabic, which has more than 100,000 entries, and Arabic-English, which has more than 67,000 entries. It is considered the most comprehensive and accurate Arabic-English bilingual dictionary, and it is the official dictionary used by the United Nations (UN) as well as most academic institutions. It is designed specifically for human readers, so we converted a portion of Al-Mawrid to a transfer dictionary suitable for information retrieval. The dictionary includes single words and some multiword expressions as keyword entries. Extracting the term lists from the dictionary involved removing a large amount of excess information, such as examples and descriptions.

3.5 Results

Using the TREC data and queries described earlier, we evaluated our Arabic-English CLIR approaches. In all cases, the queries translated from Arabic to English resulted in lower retrieval accuracy (as measured by average precision and recall) than the original English queries. The results using the original and the translated queries for the titles of TREC topics 351-400 and 451-500 are shown in Tables 5 and 6. As shown, for both data sets, the Every-Match method consistently performed the poorest, while the Two-Phase method was consistently the best. Note that no relevance feedback was used in any of the runs.


In Table 7, we summarize the interpretation of the statistical significance tests for our experiments. The evaluation is conducted using the paired t-test (Wonnacott and Wonnacott, 1990). The obtained α values demonstrate that the performance differences of the TP and FM methods over the EM method are significant at the 99% confidence level for both the TREC-7 and TREC-9 datasets. The performance differences between the TP and FM methods are less significant: they hold at the 86% (α = 0.1404) and 89% (α = 0.1090) confidence levels for the TREC-7 and TREC-9 datasets, respectively.
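For reference, the paired t statistic over per-query scores can be computed as in the sketch below. The function name is hypothetical and the score lists are made-up toy values, not results from these experiments; converting the statistic to an α value additionally requires the t distribution with n - 1 degrees of freedom, which is left out here.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-query scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both runs must cover the same queries")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two paired observations are required")
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation
    if std_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (std_diff / math.sqrt(n))
    return t, n - 1


# Made-up per-query average precision values for two runs.
run_a = [0.30, 0.20, 0.40, 0.10]
run_b = [0.20, 0.10, 0.20, 0.10]

t_value, dof = paired_t_statistic(run_a, run_b)
print(f"t = {t_value:.3f} with {dof} degrees of freedom")
```

The test is paired because both runs are scored on the same queries, so only the per-query differences enter the statistic.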


As shown in Tables 8 and 9, the Two-Phase method again outperforms the Every-Match and First-Match methods at the top 5, 10, 15, 20, and 30 retrieved documents. A comparison of the retrieval performance of the three runs is shown in Figures 2 and 3. As shown, the Two-Phase approach outperforms all the other methods. At the higher-precision, lower-recall levels (recall up to 0.3), the difference between the Two-Phase method and the other methods is even more noticeable. Since it is unrealistic to expect users to read many retrieved documents that are written in a language other than their native language, the higher-precision region is of greater interest. Therefore, the higher precision obtained by the Two-Phase method is even more significant. As measured by average precision, the First-Match method improves effectiveness over the Every-Match method by 33.7% and 42.9% for TREC topics 351-400 and TREC topics 451-500, respectively. The Two-Phase method outperforms the First-Match method by 3.8% and 6.5% for TREC topics 351-400 and TREC topics 451-500, respectively.
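The two measures quoted in this paragraph, precision at a document cutoff and relative improvement between runs, can be written compactly as below. The function names, document identifiers, and scores are hypothetical and serve only to show the arithmetic.

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant_ids)
    return hits / k


def relative_improvement(baseline, improved):
    """Percentage improvement of one score over a baseline score."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# Made-up ranking and relevance judgments.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d4", "d9"}

print(precision_at_k(ranking, relevant, 5))
print(round(relative_improvement(0.10, 0.15), 1))
```

Dividing by k rather than by the number of documents actually returned follows the usual convention, so a run that returns fewer than k documents is penalized.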

Table 10 summarizes the results of using the FM method for English-Arabic CLIR. The FM method achieved 60.2% of the monolingual run for English-Arabic CLIR.

Applying the Two-Phase method in the English-Arabic direction requires finding a generalized form of the English terms to increase the chances of matching. In fact, the characteristics of Arabic words make the Two-Phase method more practical, since words in Arabic are primarily based on three-letter roots. The three-letter roots dramatically simplify the formation of word classes, making it practical to use the Two-Phase method.


As described in Table 11, feedback after translation improved effectiveness by 16.5% over the FM method without post-translation expansion. The differences between the MRD on titles and post-translation expansion of the translated Arabic queries are statistically significant at the 95% confidence level.

Since it is not realistic for foreign users to read many retrieved documents, we demonstrate the effects on the precision-recall measure for the MRD and the MRD with post-translation expansion approaches at lower levels of recall, up to 1000 documents retrieved. In Table 12, column one corresponds to MRD without expansion. As shown in column three, the MRD augmented with query expansion via PRF outperforms the MRD without expansion.



