Non-straightforward alignment patterns account for a considerable number of 0:1 (7.05%) and 1:2 (9.12%) clause alignments in Bulgarian-English, whereas the reverse types amount to just 1.95% (1:0) and 2.51% (2:1), respectively. These results suggest a stronger tendency towards 1:N (N>1) correspondences in the Bulgarian-to-English direction than in the English-to-Bulgarian one. Factors behind this trend include differences in clause segmentation, as in the case of participial constructions versus participial clauses, and the rendition of prepositional phrases as clauses or vice versa.
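To make the alignment-type notation concrete, the following minimal sketch shows how shares such as those above could be tallied from a list of alignment units. The function names and the toy clause identifiers are hypothetical and are not taken from the corpus tooling; the toy data does not reproduce the reported figures.

```python
from collections import Counter

def alignment_type(src_clauses, tgt_clauses):
    """Return an 'm:n' label: m source clauses aligned to n target clauses."""
    return f"{len(src_clauses)}:{len(tgt_clauses)}"

def type_distribution(alignments):
    """Map each alignment type to its share (in percent) of all alignment units."""
    counts = Counter(alignment_type(s, t) for s, t in alignments)
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 2) for label, n in counts.items()}

# Toy alignment units with hypothetical clause ids (illustration only, not corpus data).
toy = [
    (["bg1"], ["en1"]),          # 1:1
    (["bg2"], ["en2", "en3"]),   # 1:2
    ([], ["en4"]),               # 0:1, a target clause with no source counterpart
    (["bg3"], ["en5"]),          # 1:1
]
dist = type_distribution(toy)
```

Comparing the distribution computed in one direction with the one computed in the reverse direction is what exposes the directional tendency discussed above.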

4.2 Annotation of clause relations

The BulEnAC is supplied with partial syntactic annotation that includes:

(i) delimiting the sentence and clause boundaries;

(ii) identifying the type of relation (subordination or coordination) between the clauses in a sentence;

(iii) identifying the linguistic markers that introduce clauses – conjunctions, adverbs, pronouns, punctuation marks, etc.

A clause relation is defined between a pair of clauses. We were interested in the type of relation between the clauses, the ordering of clauses that stand in a given relation, the position of the conjunction, and language-specific clause-to-clause ordering constraints. With respect to the relation, each clause in the pair is identified as either main or subordinate, with at least one being main. In this paper the term main is used in a broader sense that encompasses both the meaning of an independent clause and that of a superordinate clause. Thus, main (N) denotes either a clause with a status equal to that of the other member of the pair or one that is superordinate to it. Subordinate (S) status is assigned to a clause that is syntactically subordinate to the other member of the pair.

The status of the clauses is defined with respect to a particular clause relation and is therefore relative. Consequently, the relationship between a pair of coordinated clauses is N_N in both cases, whether the clauses are independent or subordinate, cf. Example (6) for independent and Example (7) for dependent clauses. In the case of coordinated subordinate clauses, the dependent status of the pair is denoted by the relation N_S established between their superordinate and the first of the subordinate clauses (7b).
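The pairwise, relative nature of clause status can be illustrated with a small data structure. This is a sketch under stated assumptions, not the annotation format of the BulEnAC: the class name `ClauseRelation`, its fields and the `kind` property are hypothetical. It only encodes what the text states, namely that each clause in a pair is N or S, that at least one must be N, and that a relation is either coordination or subordination. The clauses are those of Example (7).

```python
from dataclasses import dataclass

MAIN, SUB = "N", "S"

@dataclass(frozen=True)
class ClauseRelation:
    """A relation between two clauses, listed in sentence order (hypothetical schema)."""
    first: str
    second: str
    first_status: str
    second_status: str
    marker: str = ""  # conjunction or other introducing marker; empty if none is overt

    def __post_init__(self):
        statuses = (self.first_status, self.second_status)
        if any(s not in (MAIN, SUB) for s in statuses):
            raise ValueError("status must be 'N' or 'S'")
        if MAIN not in statuses:
            raise ValueError("at least one clause in a relation must be main")

    @property
    def kind(self):
        """Two main clauses form a coordination; otherwise the relation is subordination."""
        both_main = self.first_status == self.second_status == MAIN
        return "coordination" if both_main else "subordination"

# The same clause is S in one relation and N in another, so status is relative.
rel_b = ClauseRelation("Dutch police authorities said",
                       "they were illegal immigrants", MAIN, SUB)
rel_c = ClauseRelation("they were illegal immigrants",
                       "and would be deported.", MAIN, MAIN, marker="and")
```

Storing the status on the relation rather than on the clause is the design choice that lets "they were illegal immigrants" be subordinate in `rel_b` and main in `rel_c` without contradiction.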

Example 6 (a) [N1 I usually forget things,] [N2 butN1_N2 I remembered it!]

(b) [N1 He asked her] [S ifN1_S he could pick her up on the morning of the experiment] [N2 andN1_N2 she agreed gratefully.]

Example 7 (a) [1 Dutch police authorities said] [2 they were illegal immigrants] [3 and would be deported.]

(b) [1 N Dutch police authorities said] [2 S ====N_S they were illegal immigrants]

(c) [2 N1 they were illegal immigrants] [3 N2 andN1_N2 would be deported.]

A syntactically subordinate clause that is superordinate to another clause has the status main with respect to it. For instance, in (8a) clause 2 is subordinate to the matrix clause, clause 1 (8b), and a main clause with respect to clause 3 (8c):

Example 8 (a) [1 This Regulation does not go beyond] [2 what is necessary] [3 to achieve those objectives.]

(b) [1 N This Regulation does not go beyond] [2 S whatN_S is necessary]

(c) [2 N ...what is necessary] [3 S toN_S achieve those objectives.]

In the languages under consideration the following three clause ordering models cover almost all the cases: N_N, N_S and _SN.
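The three ordering models can be read off the statuses of two related clauses taken in sentence order. The sketch below assumes exactly that reading of the labels (N_N for main followed by main, N_S for main followed by subordinate, _SN for subordinate followed by main); the function name `ordering_model` is hypothetical.

```python
# Ordering models named in the text, keyed by the statuses of the two clauses
# in linear (sentence) order. The mapping itself is an assumption about the labels.
MODELS = {
    ("N", "N"): "N_N",
    ("N", "S"): "N_S",
    ("S", "N"): "_SN",
}

def ordering_model(first_status, second_status):
    """Return the ordering model for two related clauses, or None if not covered."""
    return MODELS.get((first_status, second_status))

model_6a = ordering_model("N", "N")   # Example (6a): coordinated independent clauses
model_8b = ordering_model("N", "S")   # Example (8b): matrix clause, then subordinate
model_sn = ordering_model("S", "N")   # subordinate clause preceding its main clause
```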

4.3 More on translational asymmetries

Translational asymmetries stem also from different information distribution, lexical and grammatical choices, reordering of the clauses with respect to each other and (cross-clause boundary) reordering of constituents. In this section, we point out two types of asymmetry concerning the internal structure of clauses and their relative order within the sentence.

A frequent pattern found in the corpus is the selection of verbs with different types of complements motivated by grammatical structure, lexical choice or other factors. In the aligned sentences in Example (9) the choice of the Bulgarian verb nastoyavam (insist) as the translation equivalent of the English object-control verb urge predetermines the difference in the structure of the matrix and the subordinate clause in the two languages – in (9a) Croatia is the object of the main clause, whereas its counterpart Harvatska is the subject of the subordinate clause in (9b).

–  –  –

Another frequent asymmetry is a different order of the clauses in a sentence. For instance, in Example (10) the English clauses follow the N_S order (10a), which is reversed in the Bulgarian translation, _SN (10b).

Example 10 (a) [N She had to make a detour] [S to get to the stove.]

–  –  –

Translation asymmetries represent a systemic phenomenon and account for the inter-lingual variations in grammatical structure, lexicalisation patterns, etc. At the same time, they often give rise to wrong alignments, mistranslations, and other errors. Therefore, the successful identification of such phenomena and their proper description and treatment is a prerequisite for improving the accuracy of alignment and translation models.
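One simple way to identify clause-order asymmetries is to compare the ordering models of aligned clause relations on the two sides. The sketch below is an illustration under assumptions, not the procedure used for the corpus: it presumes each aligned relation is already labelled with an ordering model (N_N, N_S or _SN) per language, and the function name and input format are hypothetical.

```python
from collections import Counter

def order_asymmetries(aligned_models):
    """Return (index, source_model, target_model) for aligned relations whose order differs."""
    return [
        (i, src, tgt)
        for i, (src, tgt) in enumerate(aligned_models)
        if src != tgt
    ]

# Hypothetical input: one (English model, Bulgarian model) pair per aligned relation.
aligned = [
    ("N_S", "_SN"),   # the pattern described for Example (10)
    ("N_N", "N_N"),   # same order on both sides
    ("N_S", "N_S"),
]
flagged = order_asymmetries(aligned)

# Tally which reorderings occur, e.g. as candidates for clause-reordering rules.
reordering_counts = Counter((src, tgt) for _, src, tgt in flagged)
```

Relations flagged in this way are candidates for separate treatment in alignment or for clause reordering before training a translation model.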

5 Conclusion and applications

The development of the Bulgarian-English Sentence- and Clause-Aligned Corpus is a considerable advance towards establishing a general framework for syntactic annotation and multilingual alignment, as well as for building significantly larger parallel annotated corpora. The manual annotation and/or validation has ensured the high quality of the corpus annotation and has made it applicable as a training resource for various NLP tasks. As the goal was to explore the influence of clause alignment, further levels of alignment were only partially attempted as a technique enhancing the alignment method.

The quality of the manual clause splitting, relation type annotation and alignment was guaranteed by inter-annotator agreement. Each annotator made at least two passes over each Bulgarian and English file, one of them performed after the final revision of the annotation conventions. Clause segmentation was additionally validated at the stage of clause alignment.

The NLP applications of the BulEnAC encompass at least three interrelated areas: (i) developing methods for automatic clause splitting and alignment; (ii) developing methods for clause reordering to improve the training data for SMT [6]; (iii) word and phrase alignment. These lines of research will facilitate the creation of large-scale syntactically and semantically annotated corpora. In the field of the humanities the corpus is a valuable resource for studies in lexical semantics, comparative syntax, translation studies, language learning, and cross-linguistic studies.

The BulEnAC will be made accessible to the scholarly community through the unified multilingual search interface of the Bulgarian National Corpus (footnote 11).

6 Acknowledgements

The present paper was prepared within the project Integrating New Practices and Knowledge in Undergraduate and Graduate Courses in Computational Linguistics (BG051PO001-3.3.06-0022) implemented with the financial support of the Human Resources Development Operational Programme 2007-2013 co-financed by the European Social Fund of the European Union. The Institute for Bulgarian Language takes full responsibility for the content of the present paper and under no conditions can the conclusions made in it be considered an official position of the European Union or the Ministry of Education, Youth and Science of the Republic of Bulgaria.

