
«Abstract In Cross-Language Information Retrieval (CLIR), queries in one language retrieve relevant documents in other languages. Machine-Readable ...»


4. MT-based approach

We explore the retrieval effectiveness of Machine Translation (MT) systems for Arabic-English and English-Arabic Cross-Language Information Retrieval (CLIR), as well as which factors affect performance, and to what extent. As mentioned in Section 2.2, one of the approaches being tested for CLIR uses existing machine translation systems to translate the queries or the documents automatically from one language to another. The basic task of any machine translation system is to analyze the source text, including morphological, syntactic, and semantic analysis using bilingual dictionaries or special-purpose lexicons, and then to generate the target-language text. A machine translation strategy for CLIR therefore allows researchers to take advantage of the extensive research on machine translation and the availability of commercial products.

There are two basic approaches to MT-based CLIR: translating the documents or translating the queries. The drawbacks of document translation, as compared to query translation, are the extensive processing required to translate very large amounts of data and, in the case of multiple query languages, the need to duplicate the documents in all of the query languages. Oard (1998) discussed query translation and concluded that it is less costly than translating the documents. MT thus offers an obvious approach to query translation.

Many researchers criticize the MT-based CLIR approach. Their criticisms mostly stem from the fact that the current translation quality of MT is poor. In particular, typical search terms lack the context the MT system needs to perform correct syntactic and semantic analysis of the source text. Another reason is that MT systems are expensive to develop, and their application degrades retrieval efficiency (run time performance) because of the cost of the linguistic analysis. Radwan and Fluhr (1995) compared the retrieval effectiveness of French-English CLIR using the SYSTRAN machine translation system with the effectiveness of their EMIR dictionary-based query translation. They determined that EMIR was more effective than MT-based query translation using SYSTRAN.

Other researchers, in contrast, showed that machine translation approaches could achieve reasonable effectiveness. Jones et al. (1999) showed that full disambiguation by an MT system outperforms dictionary lookup methods that include several terms as candidates in the query. Also, many participants in the TREC-8 CLIR track (Braschler et al., 1999) concluded that MT-based CLIR is an effective strategy. Another advantage of using MT systems for CLIR is that if L1-L2 and L2-L3 MT systems are available, it is possible to construct an L1-L3 CLIR system without developing an L1-L3 MT system, where L1, L2, and L3 are three different languages (Kwok, 1999).
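The pivot idea attributed to Kwok (1999) above can be illustrated with a minimal sketch. It assumes each MT system can be modelled as a function from a list of terms to a list of terms; the lookup tables, the French/English/German language choice, and all names below are hypothetical stand-ins for illustration, not part of any system discussed here.

```python
from typing import Callable, Dict, List

# A translator maps a list of source terms to a list of target terms.
Translator = Callable[[List[str]], List[str]]


def make_lookup_translator(table: Dict[str, str]) -> Translator:
    """Build a toy word-by-word translator (a stand-in for a real MT system)."""
    def translate(terms: List[str]) -> List[str]:
        # Terms missing from the table are passed through unchanged.
        return [table.get(term, term) for term in terms]
    return translate


def compose(first: Translator, second: Translator) -> Translator:
    """Chain an L1-L2 translator and an L2-L3 translator into an L1-L3 translator."""
    def pivot(terms: List[str]) -> List[str]:
        return second(first(terms))
    return pivot


# Hypothetical toy tables: French -> English -> German.
fr_to_en = make_lookup_translator({"arbre": "tree", "croissance": "growth"})
en_to_de = make_lookup_translator({"tree": "Baum", "growth": "Wachstum"})
fr_to_de = compose(fr_to_en, en_to_de)

print(fr_to_de(["croissance", "arbre"]))
```

In practice, errors made by the first system are carried into the second, so a chained system of this kind would be expected to be no better than its weaker link.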

Our experiments provide insight into the performance of the MT-based query translation approach on the large document collection described in Section 3.4.2. The machine translation systems that we adapted for our experiments are commercial products designed to assist humans by automatically translating full sentences, or even paragraphs. MT systems can also be applied, with higher accuracy, when the query terms are formulated as phrases. However, experience shows that users typically prefer to give isolated words or, at best, short phrases to an information retrieval system. We therefore also consider short queries, taken from the titles of the TREC-7, TREC-9, and Arabic TREC-10 topics, to experiment with this situation.

4.1 Experimental Approach

At present, no benchmark data are available for Arabic-English CLIR. To provide a means of comparing our efforts with future Arabic-English CLIR efforts, we used readily available English benchmark document collections and provide our Arabic queries, a translation of the National Institute of Standards and Technology (NIST) Text Retrieval Conference (TREC) queries, on our web site at www.ir.iit.edu. We used these 100 translated versions as our original Arabic queries issued against the TREC English collection. The Arabic queries were translated back to English using the ALKAFI MT system. Indexing is done using the Porter and K-stem algorithms after eliminating the stop-words. Similarly, querying is done after stemming and eliminating the stop-words of the translated target English queries. The ALKAFI Arabic-English MT system is a commercial system developed by CIMOS Corporation, and it is the first Arabic to English machine translation system.
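The order of operations described above (remove stop-words, then stem, for both documents and translated queries) can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list is a small hypothetical sample, and `crude_stem` is a deliberately simple suffix stripper standing in for the Porter and K-stem algorithms, which it does not reproduce.

```python
import re
from typing import List

# Hypothetical, very small stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "of", "for", "is", "a", "an", "in", "and", "to", "that"}

# Crude suffix list; this is NOT the Porter or K-stem algorithm.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if suffix == "s" and word.endswith("ss"):
            continue  # leave words such as "class" untouched
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> List[str]:
    """Lowercase and tokenize, drop stop-words, then stem the remaining terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("The rates of growth for pine trees"))
```

Applying the same function to documents at indexing time and to translated queries at search time keeps both sides in the same term space, which is the point of the symmetric treatment described in the text.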

Arabic text is usually not vocalized, so ALKAFI can add vowels internally. Sometimes, however, the user must vocalize some consonants to help ALKAFI with lexical and syntactic analysis. Vocalization is a crucial step, since word sense depends on vocalization and on word position in context. The system attempts to analyze words in context and then builds semantic relations. The English text is then generated by a transfer method according to English grammar rules. ALKAFI uses five dictionaries:

The TREC queries (or topics in the TREC vernacular) consist of three fields: title, description, and narrative. The title is considered short; it consists of one, two or three concept terms. In Table 13, we illustrate an example of the original Arabic title and its translation. The description field is of medium length; it consists of one or two sentences.

In Table 14, we provide an example of the description field and its translation. The longest part is the narrative field; in Table 15, we show an example of the narrative field and its translation using the ALKAFI MT system. To measure the effectiveness of an MT system for CLIR, we experimented with all three query types to determine the effect of query length (short, medium, and long) on the performance of the MT-based method for CLIR.


Table 15. The narrative of the original Arabic query and the translated English query using the ALKAFI MT system

For English-Arabic CLIR, we conducted the experiments using the Al-Mutarjim Al-Arabey English-to-Arabic commercial system (ATA Software Technology).

The titles of the source English queries are translated to Arabic by the Al-Mutarjim Al-Arabey MT system. The average length of the titles of the Arabic TREC topics is 6.2 words. The minimum translation speed is 1000 words per minute on a system that meets only the basic hardware requirements. The translation of the title of query AR23 using the Al-Mutarjim Al-Arabey system is shown in Table 16.


Table 16. English query terms and their translation using Al-Mutarjim Al-Arabey MT system

4.2 Results

We use three performance measures. The first is the recall-precision score at 11 standard points. In CLIR systems, given the expense of translation, a user is most likely to be interested in only the top few retrieved documents. Thus, we provide measures for the top n retrieved documents. We also provide the overall average precision of each run. We evaluate the effects of the MT system in Arabic-English CLIR.
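The three measures named above can be sketched as follows. This is a minimal illustration using the standard textbook definitions (precision at n, non-interpolated average precision, and interpolated precision at the 11 recall points 0.0 to 1.0); the document identifiers and relevance judgments are hypothetical, and the exact evaluation software used for the reported runs is not specified in the text.

```python
from typing import List, Sequence, Set


def precision_at(ranked: Sequence[str], relevant: Set[str], n: int) -> float:
    """Fraction of the top-n retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:n] if doc in relevant) / n


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def eleven_point_precision(ranked: Sequence[str], relevant: Set[str]) -> List[float]:
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    points = []  # (recall, precision) observed at each relevant document
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    curve = []
    for i in range(11):
        level = i / 10
        # Interpolated precision: best precision at any recall >= this level.
        candidates = [p for r, p in points if r >= level - 1e-9]
        curve.append(max(candidates, default=0.0))
    return curve


# Hypothetical ranked list and relevance judgments for one query.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}

print(precision_at(ranked, relevant, 5))
print(average_precision(ranked, relevant))
print(eleven_point_precision(ranked, relevant))
```

Averaging each of these values over all queries in a topic set gives per-run figures of the kind reported in the tables and figures of this section.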

As described earlier, we used both the TREC-7 and TREC-9 topics and TREC-9 collections. For TREC-7, as shown in Table 17, the machine translation achieved 61.8%, 64.7%, and 60.2% of the monolingual average precision for the title, description, and narrative fields, respectively. The 11-point average recall-precision for TREC-7 topics is shown in Figures 4, 5, and 6 for the title, description, and narrative fields, respectively. As shown, the MT-based approach is more effective on descriptions than on titles and narratives. In each figure, we also illustrate the “ideal” system score, which is represented by the monolingual query. At the higher precision-lower recall levels, the difference is even more noticeable. The effectiveness on titles is degraded because the ALKAFI machine translation system is designed to perform best on well-formed sentences, or at least on a sequence of words that forms a context. However, the titles of topics 351-400 are all three words or less; thus, no substantive context is formed.
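The percentages reported for each field are read here as the ratio of the MT-based run's average precision to that of the corresponding monolingual run. A minimal sketch of that calculation follows; the function name and the two average-precision values are hypothetical and are not taken from the tables.

```python
def percent_of_monolingual(clir_ap: float, mono_ap: float) -> float:
    """Express a cross-language run's average precision as a percentage of the monolingual run's."""
    if mono_ap <= 0:
        raise ValueError("monolingual average precision must be positive")
    return 100.0 * clir_ap / mono_ap


# Hypothetical average-precision values for one field (not from the tables).
mt_title_ap = 0.15
mono_title_ap = 0.25
print(round(percent_of_monolingual(mt_title_ap, mono_title_ap), 1))
```

Reporting the ratio rather than the raw score makes runs on different topic sets comparable, since the monolingual run absorbs differences in topic difficulty.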

For the narrative run results shown in Figure 6, the MT system is unable to preserve its accuracy when extra, potentially noisy, terms are present in the source query. The greater the number of source query terms, other than keywords or words that strongly disambiguate the query, the greater the performance degradation of a CLIR system. These additional, potentially noisy, terms do not provide a strong basis for the source query. The ALKAFI MT system, however, is still capable of maintaining 60.2% of the monolingual retrieval effectiveness. At the higher precision-lower recall levels, the narrative run is more effective than the title run. At the higher recall levels (up to 0.8), the title run is more effective than the narrative run. As measured by average precision, there is only a slight difference between the narrative and the title runs. It is not surprising that the narrative run is strictly worse in accuracy than the description run, since the MT system achieves its best performance on the shortest sequence of words that still provides a full context.


In Table 19, we illustrate the average precision of TREC-9 topics. Our CLIR approach using the ALKAFI MT system achieves 58.4%, 57.1%, and 53.4% for title, description, and narrative fields, respectively. The 11-point average recall-precision for TREC-9 topics is shown in Figures 7, 8, and 9 for the title, description, and narrative fields, respectively. Again, the “ideal” monolingual run is likewise illustrated in each figure.


In Table 20, we illustrate the results up to 1000 documents retrieved for TREC-9. As shown, the description run again consistently outperforms both the title and narrative runs. However, as shown in Table 20, the percentage of degradation of the title run from the “ideal” monolingual title run is less than that of the description run. This result is seemingly inconsistent with the results obtained for the machine translation title run for queries 351-400, as presented in Table 17. The reason behind this seeming contradiction is that the titles of queries 451-500 are actually quite long. The average title length for queries 351-400 is 2.72 words per query, while the average length for queries 451-500 is 3.46 words. This 27% difference in query length was sufficient to let our MT system form a proper context for many more queries in the TREC-9 query set than in the TREC-7 query set. This is especially so considering that the TREC-9 query set had 16 queries with 4 or more words, as compared to only 6 queries of similar length in the TREC-7 query set. For example, the title of query number 482 is:

–  –  –

The translated query using ALKAFI MT system is:

“Where is he possible that I find the rates of the growth for the tree of the pine?”

This query provides a full context, which allows the ALKAFI machine translation system to produce its most accurate translation. Adding more context to that query does not help the MT system translate it more accurately.

Finally, for completeness, we provide a brief overview of efficiency results. In Table 21, we summarize the efficiency (run time performance) of the ALKAFI MT system when translating the title, description, and narrative fields of the TREC-7 and TREC-9 topics.


The narrative fields as described in Tables 18 and 20, which represent the long queries, are less effective than the description fields, which represent the medium-length queries. According to these findings, the fewer the terms in the original query, provided they still form a context that yields an unambiguous representation, the better both the running time and the retrieval effectiveness. As presented in Tables 21 and 22, the total running times for the description and narrative runs of TREC-7 are 1990.25 and 6194.79 seconds, respectively. The narrative run thus takes 211% longer than the description run. This additional running time degrades the efficiency of our CLIR system without any improvement in effectiveness. These findings are consistent with those for the TREC-9 topics and collection, as presented in Tables 21 and 22.
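As a check on the arithmetic, the two TREC-7 running times quoted above (1990.25 and 6194.79 seconds) can be compared directly. The worked ratio below uses only those two figures and shows how the 211% figure relates the longer time to the shorter one:

```latex
\[
\frac{6194.79}{1990.25} \approx 3.11,
\qquad
\frac{6194.79 - 1990.25}{1990.25} \approx 2.11
\]
```

In other words, the longer run takes about 311% of the shorter run's time, which is about 211% more.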

The description runs take 340% more time than the title runs on the TREC-7 dataset. In return, the description run is more effective than the title run. Thus, choosing a few terms that form a full context achieves better accuracy at the expense of efficiency, a trade-off whose merits are application dependent.

Similar findings exist for the TREC-9 queries.

As shown in Table 23, the MT system achieved 70.2% of the monolingual retrieval effectiveness. The MT system is able to preserve its accuracy because most of the titles of the Arabic topics are long enough to form a context.


The post-translation expansion technique improved the performance by 15.6%; the difference between the MT and MT+post runs is statistically significant at the 98% confidence level. Table 25 describes the runs at lower recall levels, up to 1000 retrieved documents. As shown, MT with query expansion after translation consistently outperforms MT without query expansion.


5. Conclusions

Our results demonstrate the potential of Arabic-English and English-Arabic CLIR.

Automatic dictionary translation is cost-effective compared to other methods such as parallel corpora and Latent Semantic Indexing (LSI). The resources needed are readily available. The ambiguity introduced by the Every-Match (EM) method yields poor effectiveness; it achieved roughly half of the performance of monolingual retrieval.

The factor affecting this is the transfer of too many senses that are inappropriate to the source query.

It is common for a single word to have several translations, some with different senses.
