On Bidirectional English-Arabic Search

M. Aljlayl, O. Frieder, & D. Grossman

Information Retrieval Laboratory

Illinois Institute of Technology

{aljlayl, frieder, grossman}@ir.iit.edu


Abstract

In Cross-Language Information Retrieval (CLIR), queries in one language retrieve relevant documents in other languages. Machine-Readable Dictionaries (MRD) and Machine Translation (MT) systems are important resources for query translation in CLIR. We investigate the use of MT systems and MRD for Arabic-English and English-Arabic CLIR. The translation ambiguity associated with these resources is the key problem. We present three methods of query translation using a bilingual dictionary for Arabic-English CLIR. First, we present the Every-Match (EM) method. This method yields ambiguous translations because it adds many extraneous terms to the original query. To disambiguate query translation, we present the First-Match (FM) method, which takes the first match in the dictionary as the candidate term. Finally, we present the Two-Phase (TP) method. We show that good retrieval effectiveness can be achieved without complex resources using the Two-Phase method for Arabic-English CLIR. We also empirically evaluate the effectiveness of the Arabic-English MT approach using short, medium, and long queries of TREC-7 and TREC-9 topics and collections, and investigate the effect of query length on the quality of MT-based CLIR. English-Arabic CLIR is evaluated via MRD and English-Arabic MT. Post-translation query expansion is used to de-emphasize the extraneous terms introduced by the MRD and MT for English-Arabic CLIR.

1. Introduction

With the rapid growth of the Internet, the World Wide Web (WWW) has become one of the most popular media for disseminating multilingual Web pages. Automatic mediation of access to foreign Web pages is becoming an increasingly important problem, and the importance of CLIR grows with it. Arabic-English CLIR is the retrieval of English documents based on queries formulated in Arabic. Conversely, English-Arabic CLIR is the retrieval of Arabic documents based on queries formulated in English.

In a dictionary-based approach, translation is performed by looking up the query terms in a bilingual dictionary and forming a target query from one or more translations per query term. We investigate three such approaches based on Machine-Readable Dictionaries (MRD) for Arabic-English CLIR. The Every-Match method considers all the translations found in the bilingual dictionary. This leads to ambiguous translation because it introduces extraneous terms into the target query, and it yields relatively poor effectiveness. The second method is the First-Match method. Instead of considering all the target language equivalents in the bilingual dictionary, we use the first match as the candidate translation of the source query term. This approach takes advantage of the fact that dictionaries typically present translations in the order of their common use. The First-Match method ignores some of the less common translations of the source term and thus potentially improves retrieval effectiveness. The third method is the Two-Phase method. This method considers all the translations found in the bilingual dictionary as candidate terms and then removes those candidates that do not return the original source query term.
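The Two-Phase filter described above can be illustrated with a minimal sketch. It assumes that a candidate "returns" its source term when the source term appears among the candidate's translations in a reverse (English-Arabic) dictionary; this reading, the function name two_phase, and the toy dictionaries with romanized stand-ins for Arabic terms are all assumptions made for illustration, not the authors' implementation or data.

```python
# Hypothetical toy dictionaries; keys are romanized stand-ins for Arabic terms.
AR_EN = {
    "misbah": ["lamp", "light", "bulb"],
    "wahhaj": ["incandescent", "glowing", "blazing"],
}
EN_AR = {
    "lamp": ["misbah"],
    "light": ["daw", "nur"],
    "bulb": ["misbah", "basala"],
    "incandescent": ["wahhaj"],
    "glowing": ["mutawahhij"],
    "blazing": ["multahib"],
}


def two_phase(query_terms, forward, backward):
    """Keep only the candidates whose reverse lookup returns the source term."""
    kept = []
    for term in query_terms:
        # Phase 1: every translation of the source term is a candidate.
        for candidate in forward.get(term, []):
            # Phase 2: drop candidates that do not lead back to the source term.
            if term in backward.get(candidate, []):
                kept.append(candidate)
    return kept


result = two_phase(["misbah", "wahhaj"], AR_EN, EN_AR)
print(result)
```

In this toy data, "light", "glowing", and "blazing" are dropped because their reverse entries do not contain the original source term.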

We also empirically evaluate the effectiveness of the Arabic-English MT-based approach using short, medium, and long queries of TREC topics and collections. The effect of query length on the quality of MT-based CLIR is likewise investigated.

English-Arabic CLIR is evaluated via an MRD and an English-Arabic MT system. Post-translation expansion is used to de-emphasize the extraneous terms introduced by the MRD and MT. We found that post-translation query expansion significantly improves the performance of English-Arabic CLIR.

Arabic, one of the six official languages of the United Nations, is the mother tongue of 300 million people (Egyptian Demographic Center, 2000). Unlike Latin-based scripts, Arabic is written from right to left. The Arabic alphabet consists of 28 letters. As discussed in (Tayli and Al-Salamah, 1990), the Arabic alphabet can be extended to ninety elements by additional shapes, marks, and vowels. Most Arabic words are morphologically derived from a list of roots. The root is the bare verb form; it can be triliteral, quadriliteral, or pentaliteral. Most of these roots are made up of three consonants. Arabic words are classified into nouns (including adjectives and adverbs), verbs, and particles. In formal writing, Arabic sentences are delimited by commas and periods, as in English.

In Section 2, we review prior work in CLIR and, specifically, in Arabic information retrieval. The proposed dictionary-based methods for Arabic-English and English-Arabic CLIR are presented in Section 3. The effects of the MT-based approach on Arabic-English and English-Arabic CLIR are investigated in Section 4. We conclude our study in Section 5.

2. Prior work

We begin with an overview of prior work done in Arabic information retrieval. We continue with a review of other prior CLIR efforts because some of this prior work can be easily adapted to Arabic language processing, and in fact, part of our work includes this adaptation.

2.1 Arabic Information Retrieval

In the MICRO-AIR system (Al-Kharashi and Evens, 1994), using only document titles, the authors compared three options for indexing: words, stems, and roots. Three similarity measures were used: the cosine measure, the Dice coefficient, and the Jaccard coefficient. The results of these experiments showed that using roots as index terms was more efficient than using words or stems. A similar study was conducted by Abu-Salem et al. (1999). The authors attempted to improve the effectiveness of Arabic information retrieval by weighting a query term according to the importance of its word, stem, and root in the collection. The weights were calculated using standard tf-idf measures. The proposed method, called mixed stemming, showed an improvement over the word indexing method under both the binary and tf-idf weighting schemes. Improvements over the stem indexing approach were noted only in the case of binary weighting.
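For reference, the three similarity measures named above can be sketched in their common binary (set-based) forms. Whether MICRO-AIR used these exact forms or weighted variants is not stated here, so the formulas, function names, and the romanized example terms below are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Binary cosine: |A & B| / sqrt(|A| * |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def dice(a, b):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard coefficient: |A & B| / |A | B|."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical index terms (romanized roots) for a query and a document title.
query_terms = {"ktb", "drs"}
title_terms = {"ktb", "qra", "drs", "elm"}

for measure in (cosine, dice, jaccard):
    print(measure.__name__, round(measure(query_terms, title_terms), 4))
```

All three measures reward shared index terms and differ only in how they normalize for the sizes of the two term sets.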

Hasnah (1996) investigated full-text processing and passage retrieval for Arabic documents and concluded that passage retrieval improves retrieval precision. Beesley (1998) described a morphological analyzer for Modern Standard Arabic words. These were Arabic monolingual retrieval efforts only; no cross-lingual experiments were performed. Recent Arabic monolingual and English-Arabic CLIR resources can be found on the TREC web site (TREC, 2001).

2.2 Cross-Language Information Retrieval (CLIR)

The rapid growth of the Internet has created worldwide multilingual document collections. Accordingly, IR research has begun to pay attention to CLIR systems. In CLIR, either documents or queries are translated. The research has focused on the accuracy of query translation since document translation is computationally expensive (Hull and Grefenstette, 1996).

Machine Translation (MT) systems seek to translate queries from one human language to another by using context. Disambiguation in MT systems is based on syntactic analysis. User queries, however, usually consist of a few words without proper syntactic structure (Pirkola, 1998). Therefore, the performance of current MT systems on general-language translation makes MT less than satisfactory for CLIR (Radwan and Fluhr, 1995; Hull and Grefenstette, 1996). A study by Oard (1998), however, confirmed that machine translation yields reasonable effectiveness in the case of long queries.

In corpus-based methods, queries are translated on the basis of terms that are extracted from parallel or comparable document collections. Dunning and Davis (1993) suggested parallel and aligned corpus techniques. They used a Spanish-English parallel corpus and evolutionary programming for query translation (Davis and Dunning, 1995). Landauer and Littman (1990) introduced another method, for which no query translation is required. Their method, called Cross-Language Latent Semantic Indexing (CL-LSI), requires a parallel corpus. Unlike parallel collections, comparable collections are aligned based on a similar theme (Sheridan and Ballerini, 1996).

Dictionary-based methods perform query translation by looking up terms in a bilingual dictionary and building a target language query from some or all of the translations. Dictionary-based translation is very practical given the increasing availability of machine-readable bilingual dictionaries (MRD). Moreover, the topic coverage of this technique is less limited than that of a parallel corpus, since a dictionary typically contains a wider variety of terms than a parallel corpus (Adriani and Croft, 1997). Ballesteros and Croft (1996) developed several methods using MRD for Spanish-English CLIR. The first experiment was designed to test the effects of word-by-word translation using the MRD on retrieval performance. Each query word was replaced by the corresponding word or words in the dictionary. The average precision dropped by 50-60%. The reason for the low effectiveness is that many noise terms were added. To improve the effectiveness, they introduced the notion of pre-translation and post-translation methods.

Ballesteros and Croft (1997) also investigated the effect of phrasal translation in improving effectiveness. In their study, they investigated the role of phrases in query translation via local context analysis (LCA) (Xu and Croft, 1996), which uses global and local document analysis, and via local feedback (LF). They concluded that combining pre- and post-translation expansion is more effective and improves precision and recall. As an extension of (Ballesteros and Croft, 1997), Ballesteros and Croft (1998) proposed new methods to disambiguate term translations obtained via MRD. A Co-Occurrence statistics (CO) method was used to resolve the ambiguity. They assumed that correct translations of query terms co-occur in target language documents, whereas incorrect translations tend not to co-occur. A combined approach of pre- and post-translation expansion yielded better effectiveness.
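The co-occurrence assumption can be made concrete with a small sketch. The exact statistic used by Ballesteros and Croft is not given in this text, so the sketch substitutes a plain count of documents in which two candidate translations appear together; the toy documents, candidate lists, and function names are hypothetical.

```python
from itertools import combinations, product

# Hypothetical target-language collection, one set of terms per document.
DOCS = [
    {"incandescent", "light", "bulb", "energy"},
    {"light", "bulb", "lamp", "filament"},
    {"onion", "bulb", "garden"},
    {"blazing", "fire", "forest"},
]

# Hypothetical candidate translations for two source query terms.
CANDIDATES = {
    "source_term_1": ["bulb", "onion"],
    "source_term_2": ["incandescent", "blazing"],
}


def cooccurrence(a, b, docs):
    """Number of documents that contain both terms."""
    return sum(1 for doc in docs if a in doc and b in doc)


def disambiguate(candidates, docs):
    """Pick one translation per source term so that pairwise co-occurrence is maximal."""
    source_terms = list(candidates)
    best_combo, best_score = None, -1
    for combo in product(*(candidates[t] for t in source_terms)):
        score = sum(cooccurrence(a, b, docs) for a, b in combinations(combo, 2))
        if score > best_score:
            best_combo, best_score = combo, score
    return dict(zip(source_terms, best_combo))


chosen = disambiguate(CANDIDATES, DOCS)
print(chosen)
```

The exhaustive search over combinations is only practical for short queries; it is used here to keep the idea visible rather than to suggest an efficient implementation.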

Pirkola (1998) studied the effects of query structure and setup in a dictionary-based approach. The effectiveness of English queries against English documents was compared to that of translated Finnish queries. Pirkola used a general dictionary and a domain-specific (medical) dictionary. Hull and Grefenstette (1996) performed experiments at Xerox to build a multilingual IR system and to understand the factors that drive effectiveness. Expressed as a percentage of the effectiveness of the original English queries, the results were 59% for the automatic word-based dictionary and 68% for the manual word-based dictionary.

Our initial efforts in Arabic-English CLIR are described in (Aljlayl and Frieder, 2001). Here, we extend the results presented there and also address the reverse problem of English-Arabic CLIR. Given the sheer number of previous efforts using MRD and MT approaches, one tends to believe in their practicality. Furthermore, their topic coverage is wider than that of a parallel corpus. The effectiveness of these methods depends on the ability to choose the right term from many possible terms.

3. Dictionary-based Approaches

Unlike prior efforts, ours target the Arabic language. We adapt some of the prior dictionary-based CLIR approaches, particularly those of Ballesteros and Croft, to Arabic and develop an additional approach for Arabic-English CLIR. It is common for a single word to have several translations, some with very different senses. Because removing the resulting noise terms improves retrieval performance, we designed and implemented three dictionary-based query translation methods for Arabic-English CLIR and one approach for English-Arabic CLIR.

3.1 Every-Match Method

The Every-Match (EM) method is designed to study the effects of simple word-by-word translation on retrieval performance and to determine the factors that produce these effects. The Arabic queries are translated word by word via an MRD. Dictionary definitions often provide many senses for a single word. In this method, we retain every possible translation when more than one alternative is present in the term list of the MRD: we replace each term with every exact term match in the bilingual term list (Oard, 1998).

For example, query number 468 (incandescent light bulb), after translation into Arabic, appears as (مصباح ضوئي وهاج). We now apply the EM method to this query. The Arabic query words are replaced by their English equivalents. As shown in Table 1, simple dictionary translation via the MRD yields ambiguous translations. The number of word senses in the query clearly increases when each Arabic word is replaced by all of its English equivalents.

The average query length after translation via the EM method is 10 terms for TREC-7 and 28 terms for TREC-9.

Table 1. Terms of the original Arabic query and the result of the Every-Match (EM) method
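As a rough illustration of the Every-Match replacement step, the following sketch expands each source term into all of its listed translations. The toy term list, its romanized keys, and the choice to keep out-of-dictionary terms untranslated are assumptions for illustration and are not taken from the dictionary used in the experiments.

```python
# Hypothetical toy term list; keys are romanized stand-ins for the Arabic query words.
TOY_MRD = {
    "misbah": ["lamp", "light", "bulb"],
    "dawi": ["light", "optical", "luminous"],
    "wahhaj": ["incandescent", "glowing", "blazing"],
}


def every_match(query_terms, mrd):
    """Replace each source term with every translation listed for it."""
    target_query = []
    for term in query_terms:
        # Assumption: a term missing from the dictionary is kept untranslated.
        target_query.extend(mrd.get(term, [term]))
    return target_query


source_query = ["misbah", "dawi", "wahhaj"]
em_query = every_match(source_query, TOY_MRD)
print(len(source_query), "source terms ->", len(em_query), "target terms")
```

Even in this small example the query triples in length, which mirrors the growth in average query length reported above.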

3.2 First-Match Method

In the First-Match (FM) method (Oard, 1998; Ballesteros and Croft, 1997; Ballesteros and Croft, 1998), only the first matching translation per query term is retained instead of all of the listed translations. In Table 2, we illustrate an example of the Arabic query (مصباح ضوئي وهاج) and the translations obtained using the FM method. As illustrated, in this case, the translations obtained by the FM method appear more precise than those obtained via the EM approach. The terms retained by the FM method are a subset of the terms retained by the EM method.
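A minimal sketch of the First-Match selection follows, again over a hypothetical toy term list with romanized keys. It assumes the dictionary lists translations in order of common use and keeps out-of-dictionary terms untranslated; neither the data nor the function name comes from the original experiments.

```python
# Hypothetical toy term list; translations are assumed to be ordered by common use.
TOY_MRD = {
    "misbah": ["lamp", "light", "bulb"],
    "dawi": ["light", "optical", "luminous"],
    "wahhaj": ["incandescent", "glowing", "blazing"],
}


def first_match(query_terms, mrd):
    """Keep only the first listed translation of each source term."""
    target_query = []
    for term in query_terms:
        translations = mrd.get(term)
        # Assumption: a term missing from the dictionary is kept untranslated.
        target_query.append(translations[0] if translations else term)
    return target_query


source_query = ["misbah", "dawi", "wahhaj"]
fm_query = first_match(source_query, TOY_MRD)

# Every FM term also appears among the Every-Match translations.
all_translations = {t for term in source_query for t in TOY_MRD[term]}
print(fm_query, set(fm_query) <= all_translations)
```

The final check reflects the subset relationship between the FM and EM term sets stated above.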

