«Abstract In Cross-Language Information Retrieval (CLIR), queries in one language retrieve relevant documents in other languages. Machine-Readable ...»

To reduce the number of extraneous terms, the First-Match (FM) technique was evaluated for Arabic-English and English-Arabic CLIR. For Arabic-English CLIR, this approach achieved 68.9% and 64.7% of English-only (monolingual) retrieval using the titles of TREC topics 351-400 and TREC topics 451-500, respectively. The drawback of this method is that many terms related to the original query may be ignored. Therefore, we proposed a new method for Arabic-English CLIR, called the Two-Phase method.
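The contrast between keeping every dictionary translation (Every-Match, EM) and keeping only the first one (First-Match, FM) can be sketched as follows. This is a minimal illustration, not the system evaluated in this study: the toy dictionary, the transliterated entries, and the function names are hypothetical, and passing untranslatable terms through unchanged is an assumption.

```python
from typing import Dict, List

# Hypothetical toy Arabic-English dictionary (transliterated, illustrative only).
TOY_AR_EN: Dict[str, List[str]] = {
    "kitab": ["book", "letter", "scripture"],
    "bayt": ["house", "home", "verse"],
}


def every_match(query: List[str], mrd: Dict[str, List[str]]) -> List[str]:
    """Every-Match: keep all translations listed for each source term."""
    translated: List[str] = []
    for term in query:
        # Assumption: a term with no dictionary entry is passed through as-is.
        translated.extend(mrd.get(term) or [term])
    return translated


def first_match(query: List[str], mrd: Dict[str, List[str]]) -> List[str]:
    """First-Match: keep only the first translation listed for each source term."""
    return [(mrd.get(term) or [term])[0] for term in query]


em_query = every_match(["kitab", "bayt"], TOY_AR_EN)
fm_query = first_match(["kitab", "bayt"], TOY_AR_EN)
print(em_query)  # ['book', 'letter', 'scripture', 'house', 'home', 'verse']
print(fm_query)  # ['book', 'house']
```

In this toy case FM yields a much shorter query, but it also drops "home", which illustrates how related terms can be lost.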

In the Two-Phase (TP) method, we discard every candidate translation that does not retranslate to the original Arabic query term. This method achieved 71.5% and 69.0% of monolingual retrieval using the titles of TREC topics 351-400 and TREC topics 451-500, respectively. The TP method yields a 38% and 52% improvement over the Every-Match (EM) method on TREC topics 351-400 and TREC topics 451-500, respectively. It also yields a 4% and 7% improvement over the First-Match (FM) method on the same two topic sets. The improvement of TP over EM was statistically significant at a confidence level greater than 99% for both TREC-7 and TREC-9. Over the FM method, the corresponding confidence levels were 86% and 89% for TREC-7 and TREC-9, respectively. In this study, we showed that eliminating unrelated terms with the Two-Phase method can significantly reduce the ambiguity associated with dictionary translation. We also conducted initial experiments with a commercial MT-based Arabic-English CLIR system and found its performance inferior to that of the FM and TP methods.
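The retranslation filter behind the Two-Phase method can be sketched as below. This is a simplified illustration under stated assumptions: the two toy dictionaries and the function name `two_phase` are hypothetical, and a source term whose candidates all fail the check simply contributes nothing, since the text does not specify how that case is handled.

```python
from typing import Dict, List

# Hypothetical toy dictionaries (transliterated, illustrative only).
TOY_AR_EN: Dict[str, List[str]] = {
    "kitab": ["book", "letter", "scripture"],
}
TOY_EN_AR: Dict[str, List[str]] = {
    "book": ["kitab"],
    "letter": ["risala", "harf"],
    "scripture": ["kitab"],
}


def two_phase(
    query: List[str],
    src_to_tgt: Dict[str, List[str]],
    tgt_to_src: Dict[str, List[str]],
) -> List[str]:
    """Keep a candidate translation only if it translates back to the source term."""
    translated: List[str] = []
    for term in query:
        # Phase 1: collect every candidate translation of the source term.
        candidates = src_to_tgt.get(term, [])
        # Phase 2: retain candidates whose back-translations include the source term.
        translated.extend(c for c in candidates if term in tgt_to_src.get(c, []))
    return translated


print(two_phase(["kitab"], TOY_AR_EN, TOY_EN_AR))  # ['book', 'scripture']
```

Here "letter" is dropped because none of its back-translations in the toy dictionary is the original term, while both remaining candidates are kept, so the result sits between the FM and EM outputs in size.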

We also evaluated MT-based Arabic-English CLIR and found that query length affects the performance of the MT system. The evaluation was conducted using the ALKAFI system and two standard TREC collections and topics. To explore the effect of context on translation quality, we experimented with various query lengths.

We studied the effects of using the Al-Mutarjim Al-Arabey MT system and an MRD for English-Arabic CLIR, applying query expansion after translation. We found that post-translation query expansion via pseudo-relevance feedback (PRF) is consistently more effective for both the MT and MRD approaches.
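Post-translation expansion via PRF can be sketched as follows: the translated query is run once, the top-ranked documents are assumed relevant, and their most frequent new terms are appended to the query. This is a generic illustration rather than the configuration used in this study; the function name, the parameters `top_docs` and `top_terms`, the toy documents, and the use of raw term frequency for term selection are all assumptions.

```python
from collections import Counter
from typing import List, Sequence


def prf_expand(
    query_terms: Sequence[str],
    ranked_docs: Sequence[Sequence[str]],
    top_docs: int = 2,
    top_terms: int = 2,
) -> List[str]:
    """Append the most frequent non-query terms from the top-ranked documents."""
    counts: Counter = Counter()
    for doc in ranked_docs[:top_docs]:
        # Assumption: documents are already tokenized; query terms are not re-added.
        counts.update(t for t in doc if t not in query_terms)
    expansion = [term for term, _ in counts.most_common(top_terms)]
    return list(query_terms) + expansion


# Hypothetical ranked result list for the translated query ["price"].
ranked = [
    ["oil", "price", "market", "oil"],
    ["oil", "market", "export"],
    ["football"],
]
print(prf_expand(["price"], ranked))  # ['price', 'oil', 'market']
```

Practical systems usually weight candidate terms (for example by a tf-idf style score) instead of counting raw occurrences; raw counts are used here only to keep the sketch short.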

The experimental results indicate that the fewer source terms needed to form a context, the better the retrieval accuracy and efficiency. However, the problem of semantics persists because of the complexity of Arabic grammar. Without some level of semantic representation, MT systems cannot achieve high-quality translation, because they cannot differentiate between cases that are lexically and syntactically ambiguous. Accordingly, a well-formed source query enables the MT system to achieve its best accuracy.

A possible extension of our work is to expand the original source query using PRF for Arabic-English CLIR, in order to emphasize the context of the source query, and to find a term threshold for the TP method. Another extension is to apply the Two-Phase method using a parallel corpus or a combination of an MRD and a parallel corpus.

