VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
HAI-LONG TRIEU
BILINGUAL SENTENCE ALIGNMENT
BASED ON SENTENCE LENGTH AND
WORD TRANSLATION
MASTER THESIS OF INFORMATION TECHNOLOGY
Hanoi - 2014
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
HAI-LONG TRIEU
BILINGUAL SENTENCE ALIGNMENT
BASED ON SENTENCE LENGTH AND
WORD TRANSLATION
Major: Computer Science
Code: 60 48 01
MASTER THESIS OF INFORMATION TECHNOLOGY
SUPERVISOR: PhD. Phuong-Thai Nguyen
Hanoi - 2014
ORIGINALITY STATEMENT
"I hereby declare that this submission is my own work and, to the best of my knowledge, it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at the University of Engineering and Technology (UET) or any other educational institution, except where due acknowledgement is made in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged."
Signed ........................................................................
Acknowledgements
I would like to thank my advisor, PhD. Phuong-Thai Nguyen, not only for his supervision but also for his enthusiastic encouragement, sound suggestions, and the knowledge he has given me throughout my Master's course. I would also like to express my deep gratitude to M.A. Phuong-Thao Thi Nguyen of the Institute of Information Technology, Vietnam Academy of Science and Technology, who provided valuable data for my evaluation process. I would like to thank PhD. Van-Vinh Nguyen for examining my work and offering advice, and M.A. Kim-Anh Nguyen and M.A. Truong Van Nguyen for their help and comments on my work, especially M.A. Kim-Anh Nguyen for supporting and checking some issues in my research.
In addition, I would like to express my thanks to the lecturers and professors of the Faculty of Information Technology, University of Engineering and Technology (UET), Vietnam National University, Hanoi, who have taught and helped me throughout my time at UET.
Finally, I would like to thank my family and friends for their support, sharing, and confidence throughout my study.
Abstract
Sentence alignment plays an important role in machine translation. It is an essential task in processing parallel corpora, which are ample and substantial resources for natural language processing. Before these abundant materials can be put to useful applications, parallel corpora first have to be aligned at the sentence level.
This process maps sentences in texts of the source language to their corresponding units in texts of the target language. Parallel corpora aligned at the sentence level become a useful resource for a number of applications in natural language processing, including Statistical Machine Translation, word disambiguation, and cross-language information retrieval. This task also helps to extract structural information and derive statistical parameters from bilingual corpora.
A number of algorithms with different approaches have been proposed for sentence alignment. They may be classified into a few major categories. First, there are methods based on the similarity of sentence lengths, which can be measured in words or characters. These methods are simple but effective for language pairs that have a high similarity in sentence lengths. The second set of methods is based on word correspondences or lexicons. These methods take lexical information about the texts into account, matching content in the texts or using cognates. An external dictionary may be used, so these methods are more accurate but slower than the first ones. There are also hybrid methods that combine the advantages of the first two approaches and therefore obtain alignments of quite high quality.
In this thesis, I summarize general issues related to sentence alignment, evaluate approaches proposed for this task, and focus on the hybrid method, especially the proposal of Moore (2002), an effective method with high performance in terms of precision. After analyzing the limits of this method, I propose an algorithm using a new feature, bilingual word clustering, to improve the quality of Moore's method. The baseline method (Moore, 2002) is introduced through an analysis of its framework, and I describe the advantages as well as the weaknesses of this approach. In addition, I describe the background knowledge, the algorithm of bilingual word clustering, and the new feature used in sentence alignment. Finally, the experiments performed in this research are presented, together with evaluations that demonstrate the benefits of the proposed method.
Keywords: sentence alignment, parallel corpora, natural language processing, word
clustering.
Table of Contents
ORIGINALITY STATEMENT ........................................................................................ 3
Acknowledgements ............................................................................................................. 4
Abstract ............................................................................................................................... 5
Table of Contents ................................................................................................................ 6
List of Figures ..................................................................................................................... 9
List of Tables ..................................................................................................................... 10
CHAPTER ONE Introduction ........................................................................................ 11
1.1. Background.............................................................................................................. 11
1.2. Parallel Corpora ....................................................................................................... 12
1.2.1. Definitions ....................................................................................................... 12
1.2.2. Applications..................................................................................................... 12
1.2.3. Aligned Parallel Corpora ............................................................................... 12
1.3. Sentence Alignment ................................................................................................ 12
1.3.1. Definition......................................................................................................... 12
1.3.2. Types of Alignments ........................................................................................ 12
1.3.3. Applications..................................................................................................... 15
1.3.4. Challenges ....................................................................................................... 15
1.3.5. Algorithms ....................................................................................................... 16
1.4. Thesis Contents ....................................................................................................... 16
1.4.1. Objectives of the Thesis................................................................................... 16
1.4.2. Contributions................................................................................................... 17
1.4.3. Outline ............................................................................................................. 17
1.5. Summary.................................................................................................................. 18
CHAPTER TWO Related Works ................................................................................... 19
2.1. Overview ................................................................................................................. 19
2.2. Overview of Approaches ......................................................................................... 19
2.2.1. Classification................................................................................................... 19
2.2.2. Length-based Methods .................................................................................... 19
2.2.3. Word Correspondences Methods .................................................................... 21
2.2.4. Hybrid Methods ............................................................................................... 21
2.3. Some Important Problems ....................................................................................... 22
2.3.1. Noise of Texts .................................................................................................. 22
2.3.2. Linguistic Distances ........................................................................................ 22
2.3.3. Searching......................................................................................................... 23
2.3.4. Resources ........................................................................................................ 23
2.4. Length-based Proposals ........................................................................................... 23
2.4.1. Brown et al., 1991 ........................................................................................... 23
2.4.2. Vanilla: Gale and Church, 1993 ..................................................................... 24
2.4.3. Wu, 1994 ......................................................................................................... 27
2.5. Word-based Proposals ............................................................................................. 27
2.5.1. Kay and Roscheisen, 1993 .............................................................................. 27
2.5.2. Chen, 1993 ...................................................................................................... 27
2.5.3. Melamed, 1996 ................................................................................................ 28
2.5.4. Champollion: Ma, 2006 .................................................................................. 29
2.6. Hybrid Proposals ..................................................................................................... 30
2.6.1. Microsoft’s Bilingual Sentence Aligner: Moore, 2002 ................................... 30
2.6.2. Hunalign: Varga et al., 2005 .......................................................................... 31
2.6.3. Deng et al., 2007 ............................................................................................. 32
2.6.4. Gargantua: Braune and Fraser, 2010 ............................................................ 33
2.6.5. Fast-Champollion: Li et al., 2010 ................................................................... 34
2.7. Other Proposals ....................................................................................................... 35
2.7.1. Bleu-align: Sennrich and Volk, 2010 .............................................................. 35
2.7.2. MSVM and HMM: Fattah, 2012 ..................................................................... 36
2.8. Summary.................................................................................................................. 37
CHAPTER THREE Our Approach ............................................................................... 39
3.1. Overview ................................................................................................................. 39
3.2. Moore's Approach ................................................................... 39
3.2.1. Description ...................................................................................................... 39
3.2.2. The Algorithm.................................................................................................. 40
3.3. Evaluation of Moore's Approach ............................................................ 42
3.4. Our Approach .......................................................................................................... 42
3.4.1. Framework ...................................................................................................... 42
3.4.2. Word Clustering .............................................................................................. 43
3.4.3. Proposed Algorithm ........................................................................................ 45
3.4.4. An Example ..................................................................................................... 49
3.5. Summary.................................................................................................................. 50
CHAPTER FOUR Experiments ..................................................................................... 51
4.1. Overview ................................................................................................................. 51
4.2. Data.......................................................................................................................... 51
4.2.1. Bilingual Corpora ........................................................................................... 51
4.2.2. Word Clustering Data ..................................................................................... 53
4.3. Metrics ..................................................................................................................... 54
4.4. Discussion of Results .............................................................................................. 54
4.5. Summary.................................................................................................................. 57
CHAPTER FIVE Conclusion and Future Work........................................................... 58
5.1. Overview ................................................................................................................. 58
5.2. Summary.................................................................................................................. 58
5.3. Contributions ........................................................................................................... 58
5.4. Future Work............................................................................................................. 59
5.4.1. Better Word Translation Models..................................................................... 59
5.4.2. Word-Phrase ................................................................................................... 59
Bibliography ...................................................................................................................... 60
List of Figures
Figure 1.1. A sequence of beads (Brown et al., 1991). .................................................... 13
Figure 2.1. Paragraph length (Gale and Church, 1993). .................................................. 25
Figure 2.2. Equation in dynamic programming (Gale and Church, 1993) ...................... 26
Figure 2.3. A bitext space in Melamed's method (Melamed, 1996). .............................. 29
Figure 2.4. The method of Varga et al., 2005 .................................................................. 31
Figure 2.5. The method of Braune and Fraser, 2010 ....................................................... 33
Figure 2.6. Sentence Alignment Approaches Review. .................................................... 38
Figure 3.1. Framework of sentence alignment in our algorithm. ..................................... 43
Figure 3.2. An example of Brown's cluster algorithm ..................................................... 44
Figure 3.3. English word clustering data ......................................................................... 44
Figure 3.4. Vietnamese word clustering data ................................................................... 44
Figure 3.5. Bilingual dictionary ....................................................................................... 46
Figure 3.6. Looking up the probability of a word pair ..................................................... 47
Figure 3.7. Looking up in a word cluster ......................................................................... 48
Figure 3.8. Handling in the case: one word is contained in dictionary ............................ 48
Figure 4.1. Comparison in Precision ................................................................................ 55
Figure 4.2. Comparison in Recall .................................................................................... 56
Figure 4.3. Comparison in F-measure .............................................................................. 57
List of Tables
Table 1.1. Frequency of alignments (Gale and Church, 1993) ....................................... 14
Table 1.2. Frequency of beads (Ma, 2006) ..................................................................... 14
Table 1.3. Frequency of beads (Moore, 2002) ................................................................ 14
Table 1.4. An entry in a probabilistic dictionary (Gale and Church, 1993) ................... 15
Table 2.1. Alignment pairs (Sennrich and Volk, 2010) .................................................. 36
Table 4.1. Training data-1 ............................................................................................... 51
Table 4.2. Topics in Training data-1 ............................................................................... 52
Table 4.3. Training data-2 ............................................................................................... 52
Table 4.4. Topics in Training data-2 ............................................................................... 52
Table 4.5. Input data for training clusters ....................................................................... 53
Table 4.6. Topics for Vietnamese input data to train clusters ........................................ 53
Table 4.7. Word clustering data sets. .............................................................................. 54
CHAPTER ONE
Introduction
1.1. Background
Parallel corpora play an important role in a number of tasks such as machine translation, cross-language information retrieval, word sense disambiguation, bilingual lexicography, automatic translation verification, and automatic acquisition of knowledge about translation. Building a parallel corpus, therefore, helps connect the languages under consideration [1, 5, 7, 12-13, 15-16].
Parallel texts, however, are useful only once they are sentence-aligned. A parallel corpus is first collected from various sources, and the translated segments forming it are usually very large, often on the order of entire documents, which makes learning word correspondences highly ambiguous. The solution is to reduce this ambiguity by first decreasing the size of the segments within each pair, a task known as sentence alignment [7, 12-13, 16].
Sentence alignment is a process that maps sentences in the text of the source language
to their corresponding units in the text of the target language [3, 8, 12, 14, 20]. This task
is the work of constructing a detailed map of the correspondence between a text and its
translation (a bitext map) [14]. This is the first stage for Statistical Machine Translation.
With aligned sentences, we can perform further analyses such as phrase and word alignment, bilingual terminology extraction, and collocation extraction, as well as other applications [3, 7-9, 17]. Efficient and powerful sentence alignment algorithms,
therefore, become increasingly important.
A number of sentence alignment algorithms have been proposed [1, 7, 9, 12, 15, 17,
20]. Some of these algorithms are based on sentence length [3, 8, 20]; some use word
correspondences [5, 11, 13-14]; some are hybrid of these two approaches [2, 6, 15, 19].
Additionally, there are also some other outstanding methods for this task [7, 17]. For details of these sentence alignment algorithms, see Sections 2.4, 2.5, 2.6, and 2.7.
I propose an improvement to an effective hybrid algorithm [15] that is used in
sentence alignment. For details of our approach, see Section 3.4. I also conduct experiments
to illustrate my research. For details of the corpora used in our experiments, see Section
4.2. For results and discussions of experiments, see Sections 4.4, 4.5.
In the rest of this chapter, I describe some issues related to the sentence alignment
task. In addition to this, I introduce objectives of the thesis and our contributions. Finally,
I describe the structure of this thesis.
1.2. Parallel Corpora
1.2.1. Definitions
Parallel corpora are a collection of documents which are translations of each other
[16]. Aligned parallel corpora are collections of pairs of sentences where one sentence is a
translation of the other [1].
1.2.2. Applications
Bilingual corpora are an essential resource in multilingual natural language processing
systems. This resource helps to develop data-driven natural language processing
approaches. This also contributes to applying machine learning to machine translation
[15-16].
1.2.3. Aligned Parallel Corpora
A parallel text provides maximum utility once it is sentence-aligned [13]. This makes the task of aligning parallel corpora of considerable interest, and a number of approaches have been proposed and developed to resolve this issue.
1.3. Sentence Alignment
1.3.1. Definition
Sentence alignment is the task of extracting pairs of sentences that are translations of
one another from parallel corpora. Given a pair of texts, this process maps sentences in
the text of the source language to their corresponding units in the text of the target
language [3, 8, 13].
1.3.2. Types of Alignments
Aligning sentences amounts to finding a sequence of alignments. This section provides some further definitions of "alignment" as well as issues related to it.
Brown et al., 1991, assumed that every parallel corpus can be aligned in terms of a
sequence of minimal alignment segments, which they call “beads”, in which sentences
align 1-to-1, 1-to-2, 2-to-1, 1-to-0, 0-to-1.
Figure 1.1. A sequence of beads (Brown et al., 1991).
Groups of sentence lengths are circled to show the correct alignment. Each of the groupings is called a bead, and each number shows the length of a sentence in the bead. In Figure 1.1, "17e" means the sentence length (17 words) of an English sentence, and "19f" means the sentence length (19 words) of a French sentence. There is a sequence of beads as follows:
An ef-bead (one English sentence aligned with one French sentence) followed by
An eff-bead (one English sentence aligned with two French sentences) followed by
An e-bead (one English sentence) followed by
A ¶e¶f bead (one English paragraph and one French paragraph).
An alignment, then, is simply a sequence of beads that accounts for the observed
sequences of sentence lengths and paragraph markers [3].
There are quite a number of bead types, but it is possible to consider only some of them, including 1-to-1 (one sentence of the source language aligned with one sentence of the target language), 1-to-2 (one sentence of the source language aligned with two sentences of the target language), etc. Brown et al., 1991 [3] consider the beads 1-to-1, 1-to-0, 0-to-1, 1-to-2, 2-to-1, and paragraph beads (¶e, ¶f, ¶ef), because their method considers alignments by paragraphs. Moore, 2002 [15] considers only five of these beads, 1-to-1, 1-to-0, 0-to-1, 1-to-2, and 2-to-1, each of which is named as follows:
1-to-1 bead (a match)
1-to-0 bead (a deletion)
0-to-1 bead (an insertion)
1-to-2 bead (an expansion)
2-to-1 bead (a contraction)
The common information related to this is the frequency of beads. Table 1.1 shows
frequencies of types of beads proposed by Gale and Church, 1993 [8].
Table 1.1. Frequency of alignments (Gale and Church, 1993)

Category      Frequency   Prob(match)
1-1           1167        0.89
1-0 or 0-1    13          0.0099
2-1 or 1-2    117         0.089
2-2           15          0.011
Total         1312        1.00
Meanwhile, the frequencies reported by Ma, 2006 [13] are illustrated in Table 1.2:

Table 1.2. Frequency of beads (Ma, 2006)

Category     Frequency   Percentage
1-1          1306        89.4%
1-0 or 0-1   93          6.4%
1-2 or 2-1   60          4.1%
Others       2           0.1%
Total        1461
Table 1.3 describes the frequencies of bead types reported by Moore, 2002 [15]:

Table 1.3. Frequency of beads (Moore, 2002)

Category   Percentage
1-1        94%
1-2        2%
2-1        2%
1-0        1%
0-1        1%
Total      100%
Generally, the 1-to-1 bead is by far the most frequent type in almost all corpora, accounting for around 90% of alignments, whereas each of the other types accounts for only a few percent.
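Distributions like those in Tables 1.1 through 1.3 can be reproduced by counting bead types in a reference alignment. A minimal sketch, assuming the alignment is given as a list of (source-count, target-count) pairs:

```python
from collections import Counter

def bead_distribution(beads):
    """Percentage of each bead type in a list of (src_count, tgt_count) pairs."""
    labels = Counter(f"{s}-{t}" for s, t in beads)
    total = sum(labels.values())
    return {label: round(100 * n / total, 1) for label, n in labels.items()}

# A toy reference alignment: nine matches and one expansion.
print(bead_distribution([(1, 1)] * 9 + [(1, 2)]))  # → {'1-1': 90.0, '1-2': 10.0}
```

Run over a real gold-standard alignment, this reproduces table rows such as "1-1: 89.4%".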
1.3.3. Applications
Sentence alignment is an important topic in Machine Translation and an essential first step for Statistical Machine Translation. It is also the first stage in extracting structural and semantic information and deriving statistical parameters from bilingual corpora [17, 20]. Moreover, it is the first step toward constructing a probabilistic dictionary (Table 1.4) for use in word alignment for machine translation, or a bilingual concordance for use in lexicography.
Table 1.4. An entry in a probabilistic dictionary (Gale and Church, 1993)

English   French   Prob(French|English)
the       le       0.610
the       la       0.178
the       l'       0.083
the       les      0.023
the       ce       0.013
the       il       0.012
the       de       0.009
the       à        0.007
the       que      0.007
1.3.4. Challenges
Although this process might seem very easy, it has some important challenges which
make the task difficult [9]:
The sentence alignment task is non-trivial because sentences do not always align 1-to-1. At times a single sentence in one language may be translated as two or more sentences in the other language. The input text also affects accuracy: the performance of sentence alignment algorithms decreases significantly when the input data becomes very noisy, where noisy data means that there are more 1-0 and 0-1 alignments. For example, 89% of the alignments in an English-French corpus are 1-1 (Gale and Church, 1993), and 1-0 and 0-1 alignments make up only 1.3% of that corpus, whereas the UN Chinese-English corpus (Ma, 2006) also has about 89% 1-1 alignments but 6.4% 1-0 or 0-1 alignments. Although some methods work very well on clean data, their performance degrades quickly as data becomes noisy [13].
In addition, it is difficult to achieve perfectly accurate alignments even if the texts are easy and "clean". For instance, the success of an alignment program may decline dramatically when it is applied to a novel or a philosophy text, even though the same program gives excellent results on a scientific text.
Alignment performance also depends on the languages of the corpus. For example, an algorithm based on cognates (words in language pairs that resemble each other phonetically) is likely to work better for English-French than for English-Hindi, because there are fewer cognates for English-Hindi [1].
1.3.5. Algorithms
A sentence alignment program is called “ideal” if it is fast, highly accurate, and
requires no special knowledge about the corpus or the two languages [2, 9, 15]. A
common requirement for sentence alignment approaches is the achievement of both high
accuracy and minimal consumption of computational resources [2, 9]. Furthermore, a
method for sentence alignment should also work in an unsupervised fashion and be
language pair independent in order to be applicable to parallel corpora in any language
without requiring a separate training set. A method is unsupervised if it induces an alignment model directly from the data set to be aligned. Meanwhile, language-pair independence means that an approach requires no specific knowledge about the languages of the parallel texts to be aligned.
1.4. Thesis Contents
This section introduces the organization of contents in this thesis including: objectives,
our contributions, and the outline.
1.4.1. Objectives of the Thesis
In this thesis, I report the results of my study of sentence alignment and of the approaches proposed for this task. In particular, I focus on Moore's method (2002), an outstanding method with a number of advantages. I also introduce a new feature, word clustering, which can be applied to this task to improve alignment accuracy. I examine this proposal in experiments and compare the results with those of the baseline method to demonstrate the advantages of my approach.
1.4.2. Contributions
My main contributions are as follows:
Evaluating methods in sentence alignment and introducing an algorithm that improves Moore's method.
Using a new feature, word clustering, to improve alignment accuracy. This contributes complementary strategies to the sentence alignment problem.
1.4.3. Outline
The rest of the thesis is organized as follows:
Chapter 2 – Related Works
In this chapter I introduce some recent research on sentence alignment. In order to give a general view of the methods proposed to deal with this problem, an overall presentation of sentence alignment methods is provided. The methods are classified into several types, and each method is presented by describing its algorithm along with related evaluations.
Chapter 3 – Our Approach
This chapter describes the method I propose to improve Moore's method for sentence alignment. It begins with an analysis and evaluation of Moore's method. The major content of this chapter is the framework of the proposed method, an algorithm using bilingual word clustering. An example is given to illustrate the approach clearly.
Chapter 4 – Experiments
This chapter presents the experiments performed with my approach. The data corpora used in the experiments are described in full. The results of the experiments, as well as discussions of them, are clearly described in order to evaluate my approach against the baseline method.
Chapter 5 – Conclusions and Future Work
In this last chapter, the advantages and limitations of my work are summarized in a general conclusion. In addition, some research directions are suggested for improving the current model in the future.
Finally, the references list the published research that this work draws on.
1.5. Summary
This chapter has introduced my research work. I have given background information about parallel corpora and sentence alignment, defined the relevant terms and alignment types, and discussed some initial problems related to sentence alignment algorithms. In addition, an outline of the research work in this thesis has been provided, together with a brief mention of proposed future work.
CHAPTER TWO
Related Works
2.1. Overview
This chapter introduces some research on sentence alignment from recent years, together with evaluations of these approaches. A number of problems related to this task are also discussed: the factors that affect the performance of alignment algorithms, and the searching strategies and resources used by each method. Evaluations of each algorithm are included to give a general view of its advantages and weaknesses.
Section 2.2 provides an overview of sentence alignment approaches, and Section 2.3 discusses some important problems. Section 2.4 introduces and evaluates the primary length-based proposals. Section 2.5 introduces and evaluates proposals based on word correspondences. Hybrid proposals, along with evaluations of each of them, are presented in Section 2.6. Some other notable approaches to this task are introduced in Section 2.7. Section 2.8 concludes this chapter.
2.2. Overview of Approaches
2.2.1. Classification
Since the first approaches proposed in the 1990s, a number of publications on sentence alignment with different techniques have appeared.
Among the various sentence alignment algorithms that have been proposed, there are three widespread approaches, based respectively on a comparison of sentence lengths, on lexical correspondences, and on a combination of the first two.
There are also some other techniques, such as methods based on the BLEU score, support vector machines, and hidden Markov model classifiers.
2.2.2. Length-based Methods
Length-based approaches model the relationship between the lengths of sentences that are mutual translations. Length is measured in characters or words of a sentence. In these approaches, the semantics of the text are not considered; statistical methods are used instead of the content of the texts. In other words, these methods consider only the lengths of sentences in order to make alignment decisions.
These methods are based on the fact that longer sentences in one language tend to be
translated into longer sentences in the other language, and that shorter sentences tend to
be translated into shorter sentences. A probabilistic score is assigned to each proposed
correspondence of sentences, based on the scaled difference of lengths of the two
sentences (in characters) and the variance of this difference. Two random variables, l1 and l2, denote the lengths of the two sentences under consideration; they are assumed to be independent and identically distributed with a normal distribution [8].
Given two parallel texts ST (source text) and TT (target text), the goal of this task is to find the alignment A with the highest probability:

    max_A P(A, ST, TT)

In order to estimate this probability, the aligned text is decomposed into a sequence of aligned sentence beads, where each bead is assumed to be independent of the others.
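To make the search for max_A P(A, ST, TT) concrete, the following sketch combines a Gale-and-Church-style length cost with a dynamic-programming search over five bead types. The constants C and S2 and the bead priors are illustrative assumptions, not values from this thesis:

```python
import math

# Assumed model constants: expected target characters per source character,
# and the variance contributed per character (illustrative values).
C, S2 = 1.0, 6.8

# Bead priors, roughly following the frequencies in Table 1.1 (assumed split).
PRIORS = {(1, 1): 0.89, (1, 0): 0.005, (0, 1): 0.005,
          (2, 1): 0.045, (1, 2): 0.045}

def length_cost(l1, l2):
    """-log probability that spans of l1 and l2 characters are translations."""
    if l1 == 0 and l2 == 0:
        return 0.0
    mean = (l1 + l2 / C) / 2
    delta = (l2 - l1 * C) / math.sqrt(mean * S2)
    cdf = 0.5 * (1 + math.erf(abs(delta) / math.sqrt(2)))  # standard normal CDF
    return -math.log(max(2 * (1 - cdf), 1e-12))            # two-tailed prob

def align(src, tgt):
    """Dynamic-programming search for the lowest-cost sequence of beads."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), p in PRIORS.items():
                if i < di or j < dj or cost[i - di][j - dj] == INF:
                    continue
                l1 = sum(len(s) for s in src[i - di:i])
                l2 = sum(len(t) for t in tgt[j - dj:j])
                c = cost[i - di][j - dj] - math.log(p) + length_cost(l1, l2)
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (di, dj)
    beads, (i, j) = [], (n, m)     # trace the best path back to the origin
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((di, dj))
        i, j = i - di, j - dj
    return beads[::-1]

print(align(["aaaa", "bbbbbbbb"], ["cccc", "dddddddd"]))  # → [(1, 1), (1, 1)]
```

The -log(p) terms implement the bead priors, so the total cost of a path is the negative log of the probability of one candidate alignment A; minimizing it maximizes P(A, ST, TT) under this toy model.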
Algorithms of this type were first proposed by Brown et al., 1991 and Gale and Church, 1993. These approaches use sentence-length statistics to model the relationship between groups of sentences that are translations of each other. Wu (Wu, 1994) also uses the length-based method, applying the algorithm proposed by Gale and Church, and further uses lexical cues from a corpus-specific bilingual lexicon to improve alignment.
The methods of this type are based solely on the lengths of sentences, so they require almost no prior knowledge. Furthermore, they are highly accurate despite their simplicity, and they can also run at high speed. When aligning texts whose languages are similar or have a high length correlation, such as English, French, and German, these approaches are especially useful and work remarkably well. They also perform fairly well when the input text is clean, as in the Canadian Hansards corpus [3]. The Gale and Church algorithm is still widely used today, for instance to align Europarl (Koehn, 2005).
Nevertheless, these methods are not robust, since they use only sentence length information. They are no longer reliable if there is too much noise in the input bilingual texts. As shown in (Chen, 1993) [5], the accuracy of sentence-length-based methods decreases drastically when aligning texts containing small deletions or free translations; they can easily misalign small passages because they ignore word identities.
The algorithm of Brown et al. requires corpus-dependent anchor points while the method
20
- Xem thêm -