«Abstract The paper presents the partially automatically annotated and fully manually validated Bulgarian-English Sentence- and Clause-Aligned Corpus. ...»
Non-straightforward alignment patterns account for a considerable number of 0:1 (7.05%) and 1:2 (9.12%) clause alignments in Bulgarian-English, whereas the reverse types amount to just 1.95% (1:0) and 2.51% (2:1), respectively. These results suggest a stronger tendency towards 1:N (N > 1) correspondences in the Bulgarian-to-English direction than in the English-to-Bulgarian direction. Factors behind this trend include different segmentation into clauses, as in the case of participial constructions versus participial clauses, and the rendition of prepositional phrases as clauses or vice versa.
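As an illustration of how such percentages can be derived, the following minimal sketch counts alignment types over a list of aligned clause groups. The data layout (pairs of Bulgarian and English clause-id lists) and the names `alignment_type` and `type_distribution` are assumptions made for this example; they do not describe the actual BulEnAC format or tooling, and the toy data are invented.

```python
from collections import Counter


def alignment_type(src_ids, tgt_ids):
    """Return an 'm:n' label for one aligned group of clauses."""
    return f"{len(src_ids)}:{len(tgt_ids)}"


def type_distribution(beads):
    """Map each alignment type to its share (in percent) of all groups."""
    counts = Counter(alignment_type(src, tgt) for src, tgt in beads)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Toy data, invented for illustration; not taken from the BulEnAC.
# Each item pairs Bulgarian clause ids with English clause ids.
beads = [
    (["bg1"], ["en1"]),          # 1:1
    (["bg2"], ["en2", "en3"]),   # 1:2
    ([], ["en4"]),               # 0:1, an English clause with no counterpart
    (["bg3"], ["en5"]),          # 1:1
]

dist = type_distribution(beads)
for label in sorted(dist):
    print(label, f"{dist[label]:.2f}%")
```

Comparing the shares of 1:2 and 2:1 (or 0:1 and 1:0) in such a distribution is one simple way to quantify the directional tendency described above.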
4.2 Annotation of clause relations
The BulEnAC is supplied with partial syntactic annotation that includes:
(i) delimiting the sentence and clause boundaries;
(ii) identifying the type of relation (subordination or coordination) between the clauses in a sentence;
(iii) identifying the linguistic markers that introduce clauses – conjunctions, adverbs, pronouns, punctuation marks, etc.
A clause relation is defined between a pair of clauses. We were interested in the type of relation between the clauses, the ordering of clauses that stand in a given relation, the position of the conjunction, and language-specific clause-to-clause ordering constraints. With respect to the relation, each clause in the pair is identified as either main or subordinate, with at least one being main. In this paper the term main is used in a broader sense that encompasses both the meaning of an independent clause and that of a superordinate clause. Thus, main (N) denotes either a clause with equal status as the other member of the pair or one that is superordinate to it. Subordinate (S) status is assigned to a clause that is syntactically subordinate to the other member of the pair.
The status of the clauses is defined with respect to a particular clause relation and is therefore relative. Consequently, the relationship between a pair of coordinated clauses is N_N both when the clauses are independent and when they are subordinate, cf. Example (6) for independent and Example (7) for dependent clauses. In the case of coordinated subordinate clauses, the dependent status of the pair is denoted by the relation N_S established between their superordinate clause and the first of the subordinate clauses (7b).
Example 6
(a) [N1 I usually forget things,] [N2 but_{N1_N2} I remembered it!]
(b) [N1 He asked her] [S if_{N1_S} he could pick her up on the morning of the experiment] [N2 and_{N1_N2} she agreed gratefully.]
Example 7
(a) [1 Dutch police authorities said] [2 they were illegal immigrants] [3 and would be deported.]
(b) [1 N Dutch police authorities said] [2 S ====_{N_S} they were illegal immigrants]
(c) [2 N1 they were illegal immigrants] [3 N2 and_{N1_N2} would be deported.]
A syntactically subordinate clause that is superordinate to another clause has the status main with respect to it. For instance, in (8a) clause 2 is subordinate to the matrix clause, clause 1 (8b), and is a main clause with respect to clause 3 (8c):
Example 8
(a) [1 This Regulation does not go beyond] [2 what is necessary] [3 to achieve those objectives.]
(b) [1 N This Regulation does not go beyond] [2 S what_{N_S} is necessary]
(c) [2 N ...what is necessary] [3 S to_{N_S} achieve those objectives.]
In the languages under consideration the following three clause ordering models cover almost all the cases: N_N, N_S and S_N.
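The relative nature of clause status can be made concrete with a small data structure. The sketch below is only one possible encoding, assuming that a relation stores the two clause numbers in linear order together with the status each clause has within that relation; the class `ClauseRelation` and its methods are hypothetical names introduced for illustration and are not part of the corpus annotation scheme. The ordering label it produces writes the subordinate-before-main order as S_N.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClauseRelation:
    """A relation between two clauses, listed in their linear order."""
    first: int           # number of the clause that comes first in the sentence
    second: int          # number of the clause that comes second
    first_status: str    # "N" (main) or "S" (subordinate) within this relation
    second_status: str

    def __post_init__(self):
        statuses = (self.first_status, self.second_status)
        if any(s not in ("N", "S") for s in statuses):
            raise ValueError("status must be 'N' or 'S'")
        # The annotation requires at least one main clause per pair.
        if "N" not in statuses:
            raise ValueError("at least one clause in a pair must be main (N)")

    def status_of(self, clause: int) -> str:
        """Status of a clause relative to this particular relation."""
        if clause == self.first:
            return self.first_status
        if clause == self.second:
            return self.second_status
        raise KeyError(clause)

    def ordering(self) -> str:
        """Ordering model of the pair: 'N_N', 'N_S' or 'S_N'."""
        return f"{self.first_status}_{self.second_status}"


# Example 8: clause 2 is subordinate to clause 1 (8b) but main
# with respect to clause 3 (8c).
rel_8b = ClauseRelation(first=1, second=2, first_status="N", second_status="S")
rel_8c = ClauseRelation(first=2, second=3, first_status="N", second_status="S")

print(rel_8b.status_of(2), rel_8c.status_of(2))
print(rel_8b.ordering(), rel_8c.ordering())
```

Because the status is stored on the relation rather than on the clause, the same clause can carry different statuses in different relations, which mirrors the treatment of clause 2 in Example 8.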
4.3 More on translational asymmetries
Translational asymmetries also stem from differences in information distribution, lexical and grammatical choices, reordering of the clauses with respect to each other, and reordering of constituents across clause boundaries. In this section, we point out two types of asymmetry, concerning the internal structure of clauses and their relative order within the sentence.
A frequent pattern found in the corpus is the selection of verbs with different types of complements motivated by grammatical structure, lexical choice or other factors. In the aligned sentences in Example (9) the choice of the Bulgarian verb nastoyavam (insist) as the translation equivalent of the English object-control verb urge predetermines the difference in the structure of the matrix and the subordinate clause in the two languages – in (9a) Croatia is the object of the main clause, whereas its counterpart Harvatska is the subject of the subordinate clause in (9b).
Another frequent asymmetry is a difference in the order of the clauses in a sentence. For instance, in Example (10), the English clauses follow the N_S order (10a), whereas the Bulgarian translation has the reverse order, S_N (10b).
Example 10 (a) [N She had to make a detour] [S to get to the stove.]
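A reversal of this kind can be detected mechanically once each aligned clause pair carries an ordering label. The following is a minimal sketch under that assumption; the function names `is_reversed` and `count_reversals` are hypothetical, the subordinate-before-main order is written S_N, and the input pairs are invented for illustration rather than drawn from the corpus.

```python
def is_reversed(src_order: str, tgt_order: str) -> bool:
    """True when the target label is the source label in reverse order."""
    src = src_order.split("_")
    tgt = tgt_order.split("_")
    # Symmetric labels such as N_N read the same in both directions,
    # so they are never counted as reversals.
    return src != tgt and src == tgt[::-1]


def count_reversals(pairs):
    """Count aligned relation pairs whose clause order is reversed."""
    return sum(1 for src, tgt in pairs if is_reversed(src, tgt))


# Invented (English, Bulgarian) ordering labels for three aligned relations.
# The first pair follows the pattern described for Example 10.
pairs = [("N_S", "S_N"), ("N_N", "N_N"), ("N_S", "N_S")]

n_reversed = count_reversals(pairs)
print(n_reversed)
```

Pairs flagged in this way are natural candidates for the clause reordering discussed in connection with alignment and translation errors.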
Translational asymmetries are a systemic phenomenon and account for inter-lingual variation in grammatical structure, lexicalisation patterns, etc. At the same time, they often give rise to wrong alignments, mistranslations, and other errors. Therefore, the successful identification of such phenomena, together with their proper description and treatment, is a prerequisite for improving the accuracy of alignment and translation models.
5 Conclusion and applications
The development of the Bulgarian-English Sentence- and Clause-Aligned Corpus is a considerable advance towards establishing a general framework for syntactic annotation and multilingual alignment, as well as for building significantly larger parallel annotated corpora. The manual annotation and/or validation has ensured the high quality of the corpus annotation and has made it applicable as a training resource for various NLP tasks. As the goal was to explore the influence of clause alignment, further levels of alignment were only partially attempted as a technique enhancing the alignment method.
The quality of the manual clause splitting, relation type annotation and alignment was guaranteed by inter-annotator agreement. Each annotator made at least two passes over each Bulgarian and English file, one of them performed after the final revision of the annotation conventions. Clause segmentation was additionally validated at the stage of clause alignment.
The NLP applications of the BulEnAC encompass at least three interrelated areas:
(i) developing methods for automatic clause splitting and alignment;
(ii) developing methods for clause reordering to improve the training data for SMT;
(iii) word and phrase alignment.
These lines of research will facilitate the creation of large-scale syntactically and semantically annotated corpora. In the field of the humanities, the corpus is a valuable resource for studies in lexical semantics, comparative syntax, translation studies, language learning, and cross-linguistic studies.
The BulEnAC will be made accessible to the scholarly community through the unified multilingual search interface of the Bulgarian National Corpus (see footnote 11).
6 Acknowledgements
The present paper was prepared within the project Integrating New Practices and Knowledge in Undergraduate and Graduate Courses in Computational Linguistics (BG051PO001-3.3.06-0022), implemented with the financial support of the Human Resources Development Operational Programme 2007-2013, co-financed by the European Social Fund of the European Union. The Institute for Bulgarian Language takes full responsibility for the content of the present paper, and under no conditions can the conclusions made in it be considered an official position of the European Union or the Ministry of Education, Youth and Science of the Republic of Bulgaria.
11 http://search.dcl.bas.bg