
Bulgarian-English Sentence- and Clause-Aligned Corpus


Svetla Koeva, Borislav Rizov, Ekaterina Tarpomanova,

Tsvetana Dimitrova, Rositsa Dekova, Ivelina Stoyanova,

Svetlozara Leseva, Hristina Kukova, Angel Genov

Department of Computational Linguistics,

Institute for Bulgarian Language, BAS

52 Shipchenski Prohod Blvd., 1113 Sofia, Bulgaria

E-mail: {svetla, boby, katja, cvetana}@dcl.bas.bg,

{rosdek, iva, zarka, hristina, angel}@dcl.bas.bg


Abstract

The paper presents the partially automatically annotated and fully manually validated Bulgarian-English Sentence- and Clause-Aligned Corpus. The discussion covers the motivation behind the corpus development, the structure and content of the corpus, illustrated by statistical data, the segmentation and alignment strategy and the tools used in the corpus processing. The paper sketches the principles of clause annotation adopted in the creation of the corpus and addresses some issues related to interlingual asymmetry. The paper concludes with an outline of some applications of the corpus in the field of computational linguistics.

1 Introduction and motivation

Although parallel texts can be aligned at various levels (word, phrase, clause, sentence), clause alignment has proved to have advantages over sentence and word alignment in certain NLP tasks. Many of the challenges encountered in parallel text processing are related to (i) sentence length and complexity, (ii) the number of clauses in a sentence and (iii) their relative order, so clause segmentation and alignment can significantly help in handling them. This observation is based on the linguistic fact that differences in word order and phrase structure across languages are better captured and formalised at clause level than at sentence level. As a result, monolingual and parallel text processing at clause level facilitates automatic linguistic analysis, parsing, translation, and other NLP tasks.

Consequently, this strand of research has attracted growing interest with regard to machine translation (MT). Clause-aligned corpora have been successfully employed in the training of models based on clause-to-clause translation and clause reordering in Statistical Machine Translation (SMT) – see [1] for syntax-based German-to-English SMT; [9] for English-to-Japanese phrase-based SMT; [2] for Japanese-to-English SMT; [8] for English-Hindi SMT, among others. Clause alignment has also been suggested for translation equivalent extraction within the example-based machine translation framework [7].

The Bulgarian-English Sentence- and Clause-Aligned Corpus (BulEnAC) was created as a training and evaluation data set for automatic clause alignment in the task of exploring the effect of clause reordering on the performance of SMT [6].

The paper is organised as follows. Section 2 describes the structure, content and format of the BulEnAC and the annotation tool. Section 3 summarises the approach to sentence identification and alignment. Section 4 outlines the approach to clause splitting and alignment followed by a discussion on the principles of clause annotation. Section 5 addresses the possible applications of the corpus.

2 Structure of the BulEnAC

2.1 Basic structure

The BulEnAC is an excerpt from the Bulgarian-English Parallel Corpus – a part of the Bulgarian National Corpus (BulNC) of approximately 280.8 million tokens and 8.2 million sentences for Bulgarian and 283.1 million tokens and 8.9 million sentences for English. The Bulgarian-English Parallel Corpus has been processed at several levels: tokenisation, sentence splitting, lemmatisation. The processing has been performed using the Bulgarian language processing chain [5] for the Bulgarian part and Apache OpenNLP¹ with pre-trained modules for the English part².

The BulEnAC consists of 366,865 tokens altogether. The Bulgarian texts comprise 176,397 tokens in 14,667 sentences, with an average sentence length of 12.02 words. The English part totals 190,468 tokens and 15,718 sentences (12.11 words per sentence). The number of clauses in a sentence averages 1.67 for Bulgarian compared with 1.85 clauses per sentence for English.
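The reported totals can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and tokens per sentence serves as a stand-in for the reported words-per-sentence averages, so small differences in the last digit are to be expected.

```python
# Figures quoted in Section 2.1; the variable names are illustrative only.
bg_tokens, bg_sentences = 176_397, 14_667
en_tokens, en_sentences = 190_468, 15_718

# Size of the whole corpus in tokens.
total_tokens = bg_tokens + en_tokens

# Tokens per sentence, used here as an approximation of words per sentence.
bg_tokens_per_sentence = bg_tokens / bg_sentences
en_tokens_per_sentence = en_tokens / en_sentences

print(total_tokens)
print(f"{bg_tokens_per_sentence:.2f} {en_tokens_per_sentence:.2f}")
```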

The text samples are distributed in five broad categories, called ‘styles’. A style is a general complex text category that combines the notions of register, mode, and discourse and describes the intrinsic characteristics of texts in relation to the external, sociolinguistic factors, such as the function of the communication act.

Clause-aligned corpora typically contain a limited number of sentences and cover a particular style, domain or genre³, such as biomedical texts [3], legal texts [4], etc.

¹ http://opennlp.apache.org/

² The OpenNLP implementations used in the development of the BulEnAC were made by Ivelina Stoyanova.

³ The further subdivision of the styles includes categorisation into domains (e.g., Administrative: Economy, Law, etc.) and genres (e.g., Fiction: novel, poem, etc.).

The goal in creating the corpus was to cover diverse styles so as to be able to make judgments on the performance of the alignment methods across different text types. As a result, the corpus consists of the following categories: Administrative texts (20.5%), Fiction (21.35%), Journalistic texts (37.13%), Science (11.16%) and Informal/Fiction (9.84%). Figure 1 shows a comparison of the average sentence length across styles for the two languages.

Figure 1: Average length of Bulgarian and English sentences (in terms of number of clauses) across the different styles.

2.2 Format of the Corpus

The files of the corpus are stored in a flat XML format. The words in the text are represented as a sequence of XML elements of the type word. Each word element is defined by a set of attributes that correspond to different annotation levels:

1. Lexical level (lemmatisation) – the attributes w and l denote the word form and the lemma, respectively.

2. Syntactic (sentence level) – the combination of two attributes, e=True and sen=senID, denotes the end of each sentence and the corresponding id of the sentence in the corpus.

3. Syntactic (clause level) – the attribute cl corresponds to the id of the clause in which the word occurs.

4. Syntactic (applied only to conjunctions) – the attribute cl2 is used for conjunctions and other words and phrases that connect two clauses⁴, and denotes the id of the clause to which the current clause is connected. The attribute m defines the type of the relation between the two clauses cl and cl2 (coordination or subordination), the direction of the relation (in the case of subordination) and the position of the conjunction with respect to the clauses. The inter-clausal relations are discussed in more detail in Section 4.2.

5. Alignment – the attributes sen_al and cl_al define sentence and clause alignment, respectively. Corresponding sentences/clauses in the two parallel texts are assigned the same id.

⁴ For brevity and simplicity such words and phrases are also termed ‘conjunctions’.

Example (1) shows the basic format of the corpus files.

Example 1 The EU says Romania needs reforms.

<word cl="864" cl_al="6c8f" l="the" w="The"/>
<word cl="864" l="eu" w="EU"/>
<word cl="864" l="say" w="says"/>
<word cl="865" cl2="864" cl_al="19f" l="PUNCT" m="N_S" w="===="/>
<word cl="865" l="Romania" w="Romania"/>
<word cl="865" l="need" w="needs"/>
<word cl="865" e="True" l="reform" sen="bc90" w="reforms."/>

Empty words (w="====") are artificial elements introduced at the beginning of a new clause when the conjunction is not explicit or the clauses are connected by means of a punctuation mark. For simplicity of annotation, punctuation marks are not identified as independent tokens but are attached to the preceding token.
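A minimal Python sketch of reading this format is given below. It assumes that the word elements are wrapped in a single root element (called text here, a hypothetical name, since the enclosing element is not shown above). The function name is likewise illustrative. It groups word forms by the cl attribute and skips the artificial empty words.

```python
import xml.etree.ElementTree as ET

# Example (1) from the text, wrapped in a hypothetical <text> root element.
SAMPLE = """<text>
  <word cl="864" cl_al="6c8f" l="the" w="The"/>
  <word cl="864" l="eu" w="EU"/>
  <word cl="864" l="say" w="says"/>
  <word cl="865" cl2="864" cl_al="19f" l="PUNCT" m="N_S" w="===="/>
  <word cl="865" l="Romania" w="Romania"/>
  <word cl="865" l="need" w="needs"/>
  <word cl="865" e="True" l="reform" sen="bc90" w="reforms."/>
</text>"""

EMPTY_WORD = "===="  # artificial element marking an implicit conjunction


def words_by_clause(xml_text):
    """Map each clause id (attribute cl) to the word forms it contains."""
    clauses = {}
    for word in ET.fromstring(xml_text).iter("word"):
        forms = clauses.setdefault(word.get("cl"), [])
        if word.get("w") != EMPTY_WORD:
            forms.append(word.get("w"))
    return clauses


clauses = words_by_clause(SAMPLE)
for clause_id, forms in clauses.items():
    print(clause_id, " ".join(forms))
```

Because every word carries its own clause id, no nesting has to be resolved while reading the file; a single pass over the word elements is enough.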

The flat XML format is more suitable for the representation of discontinuous clauses than a hierarchical one; at the same time it is powerful enough to represent the annotation and to encode the syntactic hierarchy between pairs of clauses through the clause relation type.
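Both points can be illustrated with a short sketch that works on plain attribute dictionaries rather than XML. The first function collects the relation encoded by cl, cl2 and m, using the attribute values of Example (1). The second flags clauses whose words are interrupted by another clause. The function names and the second token sequence are invented for illustration and do not come from the corpus.

```python
def clause_relations(words):
    """Collect (clause, linked clause, relation type) triples from cl, cl2, m."""
    return [(w["cl"], w["cl2"], w["m"]) for w in words if "cl2" in w]


def discontinuous_clauses(words):
    """Return ids of clauses whose words are interrupted by another clause."""
    closed, found, previous = set(), [], None
    for word in words:
        clause_id = word["cl"]
        if clause_id != previous:
            # Re-entering a clause that was already left means it is split.
            if clause_id in closed and clause_id not in found:
                found.append(clause_id)
            if previous is not None:
                closed.add(previous)
            previous = clause_id
    return found


# Attribute values taken from Example (1).
example_1 = [
    {"cl": "864", "w": "The"}, {"cl": "864", "w": "EU"}, {"cl": "864", "w": "says"},
    {"cl": "865", "cl2": "864", "m": "N_S", "w": "===="},
    {"cl": "865", "w": "Romania"}, {"cl": "865", "w": "needs"},
    {"cl": "865", "w": "reforms."},
]

# Invented sequence: clause "1" is split around the embedded clause "2".
invented = [
    {"cl": "1", "w": "The"}, {"cl": "1", "w": "report"},
    {"cl": "2", "w": "that"}, {"cl": "2", "w": "arrived"},
    {"cl": "1", "w": "was"}, {"cl": "1", "w": "late."},
]

print(clause_relations(example_1))
print(discontinuous_clauses(invented))
```

In a hierarchical encoding the split clause would need two separate elements or a cross-reference; in the flat encoding it is simply two runs of words sharing one clause id.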

2.3 The Annotation Tool

The manual sentence and clause alignment, as well as the verification and post-editing of the automatically performed alignment, were carried out with a specially designed tool – ClauseChooser⁵. It supports two kinds of operating modes: a monolingual one intended for manual editing and annotation of each part of the parallel corpus, and a multilingual one that allows annotators to align the parallel units.

The monolingual mode includes: (i) sentence splitting; (ii) clause splitting; (iii) correction of wrong splitting (merging of split sentences/clauses); (iv) annotation of conjunctions; and (v) identification of the type of relation between pairs of connected clauses. Figure 2 shows the monolingual mode of ClauseChooser used for sentence and clause segmentation and annotation of clause relations. After having been segmented in the bottom left pane, the clauses are listed to the right. The type of relation for each pair of syntactically linked clauses is selected with the grey buttons N_N, N_S, etc.

⁵ ClauseChooser was developed at the Department of Computational Linguistics by Borislav Rizov.


The multilingual mode uses the output of the monolingual sentence and clause splitting and supports: (i) manual sentence alignment; (ii) manual clause alignment.

3 Sentence segmentation and alignment

Both the Bulgarian and the English parts of the corpus were automatically sentence-split and sentence-aligned. The sentence segmentation of the Bulgarian part was performed with the BG Sentence Splitter. The tool identifies the sentence boundaries in a raw Bulgarian text using regular rules and a lexicon [5]. The English part was sentence-split using an implementation of an OpenNLP⁶ pre-trained model.
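The internals of the BG Sentence Splitter are not described here. Purely as an illustration of how boundary rules and a lexicon can interact, the following toy sketch splits on sentence-final punctuation followed by an uppercase letter, unless the token ending at that point is listed as an abbreviation. The abbreviation list, the boundary rule and the sample sentence are assumptions of this example, not the tool's actual resources.

```python
import re

# Hypothetical toy lexicon of abbreviations that end in a full stop.
ABBREVIATIONS = {"Dr.", "Blvd.", "etc.", "e.g."}


def split_sentences(text):
    """Split after ., ! or ? when an uppercase letter follows,
    unless the token ending at that point is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]+(?=\s)", text):
        end = match.end()
        following = text[end:].lstrip()
        if not following or not following[0].isupper():
            continue  # rule: a new sentence starts with an uppercase letter
        if text[start:end].split()[-1] in ABBREVIATIONS:
            continue  # lexicon: the full stop belongs to an abbreviation
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "Dr. Ivanova arrived at 52 Shipchenski Prohod Blvd. yesterday. She was late."
result = split_sentences(sample)
print(result)
```

Since str.isupper also recognises Cyrillic capitals, the same rule applies to Bulgarian text, although a real splitter would need a far richer lexicon and rule set.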

Sentence alignment was carried out automatically using HunAlign⁷, and manually verified by experts.

The dominant sentence alignment pattern is 1:1, which stands for one-to-one correspondences in the two languages. The 0:1 and 1:0 alignments designate that a sentence in one of the languages is either not translated, or is merged with another sentence. Table 1 shows the distribution of the sentences in the corpus across alignment types. The category ‘other’ covers patterns with low frequency, such as 1:3, 3:1, 2:2, etc.
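Given that aligned units share an id (Section 2.2), alignment patterns can be tallied by counting how many units on each side carry the same id. The sketch below is a minimal illustration under two assumptions that the text does not confirm: an unaligned sentence carries an id that occurs on one side only, and patterns are written Bulgarian:English. The ids are invented.

```python
from collections import Counter


def alignment_patterns(bg_ids, en_ids):
    """Count n:m patterns, where n Bulgarian and m English units share an id."""
    bg_counts, en_counts = Counter(bg_ids), Counter(en_ids)
    patterns = Counter()
    for align_id in bg_counts.keys() | en_counts.keys():
        # A Counter returns 0 for ids that are missing on one side.
        patterns[f"{bg_counts[align_id]}:{en_counts[align_id]}"] += 1
    return patterns


bg_ids = ["a1", "a2", "a2", "a3"]  # one alignment id per Bulgarian sentence
en_ids = ["a1", "a2", "a4"]        # one alignment id per English sentence
patterns = alignment_patterns(bg_ids, en_ids)
print(patterns)
```

The same function applies to clause alignment if the ids are read from cl_al instead of sen_al.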

⁶ http://opennlp.apache.org/

⁷ http://mokk.bme.hu/resources/hunalign/


4 Clause segmentation and alignment

A pre-trained OpenNLP parser⁸ was used to determine the clause boundaries in the English part, followed by manual expert post-editing. The Bulgarian sentences were split into clauses manually. Clause segmentation is a language-dependent task that should be performed in compliance with the specific syntactic rules and the established grammar tradition and annotation practices for the respective languages. This approach ensures the authenticity of the annotation decisions and helps in outlining actual language-specific issues of multilingual alignment.

4.1 Clause alignment

After clause segmentation took place, the parallel clauses in the English and the Bulgarian texts were manually aligned. Alignment was performed only between clauses located within pairs of corresponding sentences.
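Restricting alignment to corresponding sentences keeps the set of candidate clause pairs small. The alignment in the corpus was done manually; the sketch below only illustrates how a tool could enumerate candidates under the same restriction, assuming the sentence alignment is available as pairs of clause-id lists. The ids and the function name are invented.

```python
from itertools import product


def candidate_clause_pairs(aligned_sentences):
    """Yield (Bulgarian clause, English clause) candidates without ever
    crossing the boundaries of an aligned sentence pair."""
    for bg_clauses, en_clauses in aligned_sentences:
        yield from product(bg_clauses, en_clauses)


# Each entry pairs the clauses of a Bulgarian sentence with the clauses
# of its aligned English sentence (invented ids).
aligned_sentences = [
    (["bg1"], ["en1", "en2"]),
    (["bg2", "bg3"], ["en3"]),
]
pairs = list(candidate_clause_pairs(aligned_sentences))
print(pairs)
```

With this restriction the example yields four candidates, whereas pairing every Bulgarian clause with every English clause would yield nine.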

The prevalent alignment pattern for clauses is also 1:1. However, due to some distinct syntactic properties of the languages involved, the different lexical choices, ‘information packaging’ patterns, etc., various asymmetries arise. The non-straightforward alignments have proved to be considerably more pronounced at clause than at sentence level, as reflected in the higher frequency of clause alignment patterns of the type 1:0, 1:N and N:M (N, M > 1), and the greater number of patterns that are represented by a considerable number of instances (Table 2).
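The pattern types named above can be written down as a small classifier over the two clause counts. This is only a sketch of the taxonomy; the category labels and the function name are chosen for this illustration, and grouping every pattern with a zero on one side under a single label is an assumption.

```python
def pattern_category(n, m):
    """Classify an n:m clause alignment into the types discussed in the text."""
    if n < 0 or m < 0 or (n == 0 and m == 0):
        raise ValueError("an alignment needs at least one clause")
    if n == 0 or m == 0:
        return "no correspondence (1:0 or 0:1)"
    if n == 1 and m == 1:
        return "one-to-one (1:1)"
    if n == 1 or m == 1:
        return "one-to-complex (1:N or N:1, N > 1)"
    return "complex-to-complex (N:M, N, M > 1)"


# Patterns mentioned in Section 4.1: 1:0, 2:1, 3:1 and 3:2.
for counts in [(1, 0), (2, 1), (3, 1), (3, 2)]:
    print(counts, pattern_category(*counts))
```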

1:0 and 0:1 alignments are found where a clause in one language does not have a correspondence in the other. For instance, in Example (2) the clause he said (2a) is not translated into Bulgarian (2b)⁹.

Example 2
(a) [ La Guardia, step on it! ], [ he said. ]
(b) [ La Guardia, po-barzo! ]
    [ La Guardia, quick-ADV;COMP! ]

1:N and N:1 patterns (N > 1) stand for alignments where a given clause corresponds to a complex of (two or more) clauses. A systemic asymmetry is represented by the participial -ing and -ed clauses in English – clause 2 in (3a) – and their Bulgarian counterparts. Bulgarian lacks non-finite clauses; therefore, syntactic units that are headed by non-finite verbs are treated as participial constructions (the boldface part of the sentence in (3b)). In Example (3), the different clause structure of the English and the Bulgarian sentences leads to a 2:1 alignment.

⁸ http://opennlp.apache.org/

⁹ The Bulgarian examples are transliterated and glossed. We adopted word-by-word glossing with the following abbreviations (cf. Leipzig Glossing Rules, http://www.eva.mpg.de/lingua/pdf/LGR08.02.05.pdf): N – noun; ADJ – adjective; ADV – adverb; PTCP – participle; PST – past; PRS – present; SG – singular; PL – plural; ACC – accusative; COMP – comparative; DEF – definite.


Another frequent pattern is illustrated in Example (4). The two subordinate clauses, marked in the sentence as clauses 2 and 3 in (4a), are translated as prepositional phrases PP2 and PP3, respectively¹⁰. As a result, the Bulgarian translation of the 3-clause English sentence consists of a single clause (4b); hence the alignment pattern is 3:1.


Alignments of the type N:M (N, M > 1) represent complex-to-complex correspondence and are relatively rare (0.84% of the clauses, Table 2). Example (5) illustrates an alignment pattern of the type 3:2. The English matrix clause 1 in (5a) is translated into Bulgarian (5b) by means of clause 1 and the part of clause 2 in boldface. The object of the English clause 1 measures (BG: merki) is the subject of the Bulgarian subordinate clause 2 da badat vzeti merki... (EN: for measures to be taken...) that roughly corresponds to the prepositional phrase in the English counterpart for measures. On the other hand, the subordinate clauses 2 and 3 in the English sentence are rendered as the prepositional phrase PP in Bulgarian (5b).

¹⁰ The phrase labels are given for expository purposes. The clause-aligned corpus does not include annotation of phrasal categories.

