3.3 Two-Phase Method

To reduce the ambiguity of the EM method while loosening the inherent restrictions of the FM method, we introduce a method for Arabic-English CLIR that uses some, but not all, of the translations of a given Arabic term. The underlying assumption behind the Two-Phase method is that $f^{-1}(f(x)) = x$, namely, the translation of the translation of a term should yield the original term. If this condition holds, the translation is valid and does not introduce drift or noise.

Let A represent the original Arabic terms.

Let E represent the translated English terms of A using the Every-Match method.

Let A’ represent the translated Arabic terms of E using the Every-Match method.

Then, the Two-Phase method can be implemented as follows.

1. Translate the original Arabic terms A into the English terms E using the Every-Match method via an Arabic-English machine-readable dictionary.

2. Translate the English terms E into the Arabic terms A’ using the Every-Match method via an English-Arabic machine-readable dictionary.

3. Return the original Arabic terms A and the translated Arabic terms A’ to their infinitive form.

4. Select as a candidate each English term of E that yields its original Arabic term, based on the comparison between A and A’.

In the rare case when the original terms do not yield a candidate translation term, the following modification is incorporated into the algorithm:

1. If an English term in E does not yield its original Arabic term in A, find the synonyms of the English term and translate them using the Every-Match method. Each translated synonym that matches the original Arabic term in A is selected as a candidate translation.

2. If neither the English term nor its synonyms yield the original term, use the first matching term in E as the candidate translation.

The Two-Phase method is illustrated in Figure 1. Circled terms are assumed to be the most appropriate translations of the original Arabic terms. For term a2 in the original Arabic terms, the English translations are e3 and e4. Neither e3 nor e4 yields the Arabic term a2 in the original Arabic term set. To overcome this situation, we find the synonyms of each term that does not yield the original term after the second phase. For example, since e3 does not yield the original Arabic term a2, the synonyms of e3 are translated to Arabic, and every synonym that yields the original Arabic term is chosen as a candidate translation. The first-match approach is applied when no synonym yields the original Arabic term.
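The procedure above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the function names, the dictionary-of-lists representation of the two machine-readable dictionaries, the `synonyms` table, and the `normalize` hook (standing in for the reduction to infinitive form) are all assumptions made for the example. It also assumes that the synonym and first-match fallbacks are triggered only when no English term round-trips to the original Arabic term.

```python
def every_match(term, dictionary):
    """Every-Match: return all translations listed for a term (possibly none)."""
    return list(dictionary.get(term, []))


def two_phase(arabic_term, ar_en, en_ar, synonyms=None, normalize=lambda t: t):
    """Return candidate English translations for one Arabic term.

    ar_en, en_ar : hypothetical dict-of-lists stand-ins for the two dictionaries
    synonyms     : hypothetical English synonym table (term -> list of synonyms)
    normalize    : placeholder for returning a term to its infinitive form
    """
    synonyms = synonyms or {}
    source = normalize(arabic_term)

    # Phase 1: Arabic -> English using Every-Match.
    english = every_match(arabic_term, ar_en)

    # Phase 2: English -> Arabic, then compare against the original term.
    def round_trips(english_term):
        return any(normalize(a) == source
                   for a in every_match(english_term, en_ar))

    candidates = [e for e in english if round_trips(e)]
    if candidates:
        return candidates

    # Fallback 1: try the synonyms of the English terms that failed.
    for e in english:
        for syn in synonyms.get(e, []):
            if round_trips(syn) and syn not in candidates:
                candidates.append(syn)
    if candidates:
        return candidates

    # Fallback 2: keep the first match from the dictionary.
    return english[:1]


# Toy data mirroring the a2 / e3 / e4 situation described for Figure 1.
AR_EN = {"a1": ["e1", "e2"], "a2": ["e3", "e4"], "a3": ["e5"]}
EN_AR = {"e1": ["a1"], "e2": ["a9"], "e3": ["a7"], "e4": ["a8"],
         "s3": ["a2"], "e5": ["a6"]}
SYNONYMS = {"e3": ["s3"]}

if __name__ == "__main__":
    for term in ("a1", "a2", "a3"):
        print(term, two_phase(term, AR_EN, EN_AR, SYNONYMS))
```

In the toy data, a1 keeps only the translation that maps back to it, a2 falls back to a synonym of e3, and a3 falls back to its first match.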

In Table 4, we illustrate an example of the original Arabic query (مصباح ضوئي وهاج), as translated by the Two-Phase method.

Table 4. Terms of the original Arabic query, and Two-Phase (TP) technique

As shown in Tables 1 and 4, the Two-Phase method removes 13 terms from all possible translations found in the dictionary.

The term “burner” results from translating the original Arabic term (ضوئي) using the machine-readable dictionary. It is a noise term, since it is irrelevant to the original query. Similarly, the terms “brightness, gleam, glow, illumination, white-hot, brilliant, bright, resplendent, dazzling, glittering, glistening, sparkling” are filtered out, reducing the number of extraneous terms. The terms retained by the TP method are a subset of the terms produced by the EM method. The term translations produced by the FM and TP methods overlap: in most cases, the first match in the dictionary is retained by the TP method. The average query length after translation using the TP method is 6 and 12 terms for TREC7 and TREC9, respectively.

3.4 Experimental Approach

Some of the Arabic complexities that impact the query term translation are described in Section 3.4.1. In Section 3.4.2, we describe the resources that we used to conduct the experiments.

3.4.1 Pre-processing of the Source Arabic Terms

Unlike in English, nouns in Arabic can be masculine or feminine. Nouns can be definite, as in (المعلم), or indefinite, as in (معلم); adding the prefix (ال) makes the difference. There are three kinds of plurals in Arabic: the masculine plural, the feminine plural, and the broken plural. The plural is formed via suffixes or via pattern modification of the noun. In the first case, the suffix ~een for the accusative and genitive (معلمين) or ~oon for the nominative (معلمون) is appended to the masculine noun. In the second case, ~aat is appended to form the feminine plural (معلمات); the letter “h” is attached to the end of the word to form the singular feminine noun (معلمة). The dual is formed by adding "ان" or "ين" at the end of the noun, as in (معلمان). In the third case, often referred to as broken plurals, the pattern of the singular noun is dramatically altered. We can recognize these plurals from their patterns. There are 27 patterns for most of the broken plurals.

Another kind of suffixation involves the personal pronouns. A personal pronoun can appear in isolated form or as a suffix attached to a noun, verb, or preposition. Certain suffixes are attached to the end of words to make them possessive. The attached suffix can be one letter, for example (بيتي), where the letter "ي" is attached to the end of the word (بيت) to form “my house” in English. For the plural, two letters are attached to the end of the word: the letters "هم" for the masculine (بيتهم) and the letters "هن" for the feminine (بيتهن). These are the most common modifications to Arabic words.

Dictionaries do not store every form of regular words. Most of the dictionary entries are stored in singular form, except for words that are usually used in the plural, like (كماليات), which means “luxuries” in English. The verbs are stored in perfect form.

Therefore, before matching the Arabic terms in the dictionary, some of the nouns must be returned to their singular form by removing all suffixes and prefixes. The procedure of removing the affixes is performed when the process of matching fails to find the source terms in the dictionary.
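The lookup-then-strip behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the pre-processing actually used: the affix lists contain only the prefix and suffixes mentioned in this section, the names `PREFIXES`, `SUFFIXES`, and `find_dictionary_form` are hypothetical, the dictionary is modelled as a plain set of headwords, and broken plurals are not handled.

```python
# Affixes taken from the examples in this section (an assumed, incomplete list).
PREFIXES = ("ال",)                                   # definite article
SUFFIXES = ("ون", "ين", "ات", "ان", "هم", "هن", "ة", "ي")


def find_dictionary_form(term, headwords, min_stem=2):
    """Return the headword matching `term`, stripping affixes only if needed."""
    # Try the surface form first; affixes are removed only when this fails.
    if term in headwords:
        return term

    forms = [term]
    for prefix in PREFIXES:
        if term.startswith(prefix) and len(term) - len(prefix) >= min_stem:
            forms.append(term[len(prefix):])

    # Strip one suffix from the surface form and from each prefix-stripped form.
    for form in list(forms):
        for suffix in SUFFIXES:
            if form.endswith(suffix) and len(form) - len(suffix) >= min_stem:
                forms.append(form[:-len(suffix)])

    for form in forms[1:]:
        if form in headwords:
            return form
    return None  # no matching headword found


# Hypothetical miniature headword list for illustration.
HEADWORDS = {"معلم", "بيت", "كماليات"}

if __name__ == "__main__":
    for word in ("المعلم", "معلمون", "معلمات", "بيتهم", "كماليات"):
        print(word, "->", find_dictionary_form(word, HEADWORDS))
```

Because the surface form is tried first, a word stored in the plural, such as (كماليات), is matched directly and never stripped.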

To conduct the Two-Phase method as described in Section 3.3, the verbs are returned to their infinitive form. The infinitive form is a noun that is derived from the verb and carries no tense. In our example, it becomes (كتابة), in English “writing”. This infinitive form is used as the basis of comparison in the Two-Phase method.

Similarly, for English-Arabic CLIR, the source English queries are normalized to match them in the dictionary. For example, the terms “performing” and “performance” are normalized to “perform”.

3.4.2 Experimental Environment

For Arabic-English CLIR, we evaluated all three dictionary-based approaches using our search engine AIRE (Chowdhury et al., 2000) on both the commonly used 2 GB subset of the TIPSTER (Disks 4 and 5) collection and the 10 GB web data from TREC.

Each of these collections contains over 500,000 documents. To obtain a set of standard test queries, we manually translated the English TREC topics into Arabic TREC topics.

The collections used in the Text Retrieval Conference (TREC) have three distinct parts: the documents, the topics, and the relevance judgments. For queries, we used a human translation of the TREC7 (topics 351-400) and TREC9 (topics 451-500) queries as our original Arabic queries. Since in practice most queries are only a few words long, we used the title representation of topics 351-400 and 451-500.

A native Arabic speaker manually translated the 100 queries from English into Arabic, and we used these translated versions as our original Arabic queries issued against the TREC English collection. The Arabic queries were then translated back to English by means of dictionaries. This approach is often used in dictionary-based CLIR studies (Pirkola, 1998). To assess the effectiveness of the translated queries, we compare their results to the performance of monolingual information retrieval. The dictionary provides words and some phrases as keyword entries. Phrase-based translations were used as appropriate. In the translation process, we first try to match the phrases in the query to the phrases in the dictionary; if a match is found, the result is retained. If not, word-by-word translation is performed by applying the proposed dictionary-based methods.
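The phrase-first translation step can be sketched as below. The sketch assumes a greedy longest-match scan over the query tokens, which the text does not specify; `translate_query`, `phrase_dict`, `translate_word`, and `max_len` are hypothetical names, and `translate_word` stands for whichever word-level method (EM, FM, or TP) is being applied.

```python
def translate_query(tokens, phrase_dict, translate_word, max_len=3):
    """Translate a tokenized query: dictionary phrases first, then single words.

    phrase_dict    : hypothetical mapping from a space-joined phrase to a list
                     of translations
    translate_word : callable returning a list of translations for one word
    max_len        : longest phrase (in tokens) to look for; an assumed limit
    """
    translated = []
    i = 0
    while i < len(tokens):
        matched = False
        # Assumption: prefer the longest phrase that starts at position i.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in phrase_dict:
                translated.extend(phrase_dict[phrase])
                i += n
                matched = True
                break
        if not matched:
            # No phrase entry: fall back to word-by-word translation.
            translated.extend(translate_word(tokens[i]))
            i += 1
    return translated


if __name__ == "__main__":
    phrases = {"a b": ["AB"]}            # toy phrase entry
    word_level = lambda w: [w.upper()]   # toy stand-in for EM/FM/TP
    print(translate_query(["a", "b", "c"], phrases, word_level))
```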

For English-Arabic CLIR, we used the Arabic collection of 383,872 documents provided by the Linguistic Data Consortium (LDC). TREC provided 25 topics in three parallel languages: Arabic, English, and French. Our focus is on English-Arabic CLIR, so we used the English queries as our source queries against the Arabic collection. The titles of the TREC Arabic topics were used.

We chose the Al-Mawrid Arabic-English and English-Arabic Dictionary (aDawliah Universal Electronics) for the translation process. Al-Mawrid is a bilingual dictionary with two sections: English-Arabic, which has more than 100,000 entries, and Arabic-English, which has more than 67,000 entries; it is considered the most comprehensive and accurate Arabic-English bilingual dictionary. Al-Mawrid is the official dictionary used by the United Nations (UN) as well as most academic institutions. It is designed for human readers. We therefore converted a portion of Al-Mawrid to a transfer dictionary suitable for information retrieval. The dictionary includes single words and some multiword expressions as keyword entries. Extracting the term lists from the dictionary involved removing a large amount of excess information, such as examples and descriptions.

3.5 Results

Using the TREC data and queries described earlier, we evaluated our Arabic-English CLIR approaches. In all cases, the queries translated from Arabic to English resulted in lower retrieval accuracy (as measured by average precision and recall) than the original English queries. The results using the original and the translated queries for the titles of TREC topics 351-400 and 451-500 are shown in Tables 5 and 6. As shown, for both data sets, the Every-Match method consistently performed the poorest, while the Two-Phase method was consistently the best. Note that no relevance feedback was used in any of the runs.

In Table 7, we summarize the interpretation of the statistical significance tests for our experiments. The evaluation is conducted using the paired t-test (Wonnacott, R. and Wonnacott, T., 1990). The obtained α values demonstrate that the performance differences of the TP and FM methods over the EM method are significant at a 99% confidence level for both the TREC-7 and TREC-9 datasets. The performance differences between the TP and FM methods are less significant: they hold at an 86% (α = 0.1404) and an 89% (α = 0.1090) confidence level for the TREC-7 and TREC-9 datasets, respectively.
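For reference, the paired t-test compares two runs query by query: with per-query differences d_i, the statistic is the mean difference divided by its standard error, t = mean(d) / (s_d / sqrt(n)), with n - 1 degrees of freedom. The sketch below computes only this statistic using the Python standard library; the function name and the per-query scores are hypothetical and are not values from Tables 5 to 7. Converting t into an α value additionally requires the Student t distribution, which is omitted here.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for two runs scored on the same queries."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both runs must be scored on the same queries")
    if len(scores_a) < 2:
        raise ValueError("at least two queries are needed")

    # Per-query differences between the two runs.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")

    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    # Hypothetical per-query average precision for two runs (illustration only).
    run_a = [0.5, 0.6, 0.7]
    run_b = [0.4, 0.4, 0.4]
    print(paired_t_statistic(run_a, run_b))
```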


As shown in Tables 8 and 9, the Two-Phase method again outperforms the Every-Match and the First-Match methods at the top 5, 10, 15, 20, and 30 retrieved documents. A comparison of the retrieval performance of the three runs is shown in Figures 2 and 3. As shown, the Two-Phase approach outperforms all the other methods. At the higher-precision, lower-recall levels (recall up to 0.3), the difference between the Two-Phase method and the other methods is even more noticeable. Since it is unrealistic to expect the user to read many retrieved documents that are expressed in a language other than the user’s native language, the higher-precision region is of greater interest. Therefore, the higher precision obtained by the Two-Phase method in this region is of particular practical importance. As measured in average precision, the First-Match method improves the effectiveness over the Every-Match method by 33.7% and 42.9% for TREC topics 351-400 and TREC topics 451-500, respectively. The Two-Phase method outperforms the First-Match method by 3.8% and 6.5% for TREC topics 351-400 and TREC topics 451-500, respectively.
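The two measures used in this comparison, precision at a document cutoff and relative improvement between runs, can be sketched as follows. The function names and all input values are hypothetical illustrations; they are not the figures reported in Tables 5 to 9.

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


def relative_improvement(new_score, baseline_score):
    """Percentage improvement of new_score over baseline_score."""
    if baseline_score == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (new_score - baseline_score) / baseline_score


if __name__ == "__main__":
    # Hypothetical ranking and relevance judgments for a single query.
    ranking = ["d1", "d2", "d3", "d4", "d5"]
    relevant = {"d1", "d3", "d9"}
    print(precision_at_k(ranking, relevant, 5))

    # Hypothetical average-precision values for two runs.
    print(relative_improvement(0.15, 0.10))
```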

Table 10 summarizes the results of using the FM method for English-Arabic CLIR. The FM method achieved 60.2% of the monolingual run for English-Arabic CLIR.

Applying the Two-Phase method in the English-Arabic direction requires finding a generalized form of the English terms to increase the chances of a match. In fact, the characteristics of Arabic words make the Two-Phase method more practical, since words in Arabic are primarily based on three-letter roots. These three-letter roots dramatically simplify the formation of word classes, making it practical to use the Two-Phase method.

As described in Table 11, feedback after translation improved the effectiveness by 16.5% over the FM method without post-translation expansion. The differences between the machine-readable dictionary (MRD) run on titles and the post-translation expansion of the translated Arabic queries are statistically significant at the 95% confidence level.

Since it is not realistic to expect foreign users to read many retrieved documents, we demonstrate the effects on the precision-recall measure for the MRD and the MRD with post-translation expansion approaches at lower levels of recall, up to 1000 documents retrieved. In Table 12, column one corresponds to MRD without expansion. As shown in column three, the MRD augmented with query expansion via PRF outperforms the MRD without expansion.


