iAligner: a tool for syntax-based intra-language text alignment

Talk: Chiara Palladino and Tariq Youssef (Leipzig), “iAligner: a tool for syntax-based intra-language text alignment”.

Date: Tuesday, 29 November 2016

Time: starting at 17:00 c.t. (i.e. 17:15)

Venue: DAI, Wiegandhaus, Podbielskiallee 69-71, D-14195 Berlin

Abstract

The detection of textual variants is a crucial step in Classical Philology. It represents both the first stage of collation and the preliminary phase for recognising quotation and text reuse in the indirect tradition [1]. Because digital tools can improve the mechanical stage of textual comparison, the interaction between automated processes and traditional philological methods is very promising in this case [2].

iAligner performs pairwise, intra-language, syntax-based automatic alignment [3] on Ancient Greek, Latin and English, and it is now being tested on other languages. Texts are aligned at line or sentence level, at any length chosen by the user. They are then converted to vectors of single tokens, and pairwise alignment is performed with the Needleman-Wunsch algorithm [4]. The user can set additional language-dependent criteria for further refinement, according to the purpose of the alignment: non-alphabetical characters and diacritics can be ignored, the alignment can be made case-sensitive, and the Levenshtein distance metric [5] can be applied to adjust the tolerance threshold.
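The steps described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not iAligner's actual code: the function names, the whitespace tokenisation, the scoring values (match=1, mismatch=-1, gap=-1) and the way the Levenshtein threshold is folded into the match test are all hypothetical choices made for the example.

```python
import unicodedata


def normalize(token, ignore_case=True, ignore_diacritics=True, ignore_punct=True):
    """Apply the optional normalisations to one token (option names are assumed)."""
    if ignore_diacritics:
        # Decompose characters, then drop the combining marks (accents, breathings).
        decomposed = unicodedata.normalize("NFD", token)
        token = "".join(c for c in decomposed if not unicodedata.combining(c))
    if ignore_punct:
        token = "".join(c for c in token if c.isalnum())
    if ignore_case:
        token = token.lower()
    return token


def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def align(text_a, text_b, max_dist=0, match=1, mismatch=-1, gap=-1, **options):
    """Needleman-Wunsch global alignment over whitespace-separated tokens.

    Two tokens count as a match when the Levenshtein distance between their
    normalised forms is at most max_dist (an assumed reading of the
    'tolerance threshold'). Returns a list of (token_a, token_b) pairs,
    with None marking a gap.
    """
    a, b = text_a.split(), text_b.split()
    na = [normalize(t, **options) for t in a]
    nb = [normalize(t, **options) for t in b]

    def sub(i, j):
        return match if levenshtein(na[i], nb[j]) <= max_dist else mismatch

    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i - 1, j - 1),  # align the two tokens
                score[i - 1][j] + gap,                    # gap in text_b
                score[i][j - 1] + gap,                    # gap in text_a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i - 1, j - 1):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    return pairs[::-1]


if __name__ == "__main__":
    print(align("the quick brown fox", "the brown fox"))
    # [('the', 'the'), ('quick', None), ('brown', 'brown'), ('fox', 'fox')]
```

With max_dist=1, spelling variants such as "virumque" and "uirumque" would be scored as matches instead of mismatches; raising the threshold makes the alignment more tolerant at the cost of more false pairings.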

The current workflow allows direct upload of texts in CSV or TXT format, and it also provides a simple workspace for copy-pasting texts directly into the interface. Future work will add further input options, extending to XML and JSON formats.

The resulting alignment can then be visualized: the current graphic layout shows the portions of aligned and non-aligned text, with a color code indicating the length of the text (dark grey), the gaps (yellow), words aligned on the basis of case (light green), complete alignment (dark green) and non-aligned tokens (red). Additional criteria, if indicated by the user, have a color code as well: tokens aligned according to Levenshtein distance, for example, are indicated in turquoise green. A specific section also isolates the longest common substring.
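As a rough illustration of the legend above, the following Python sketch assigns each aligned pair to one of the described categories and extracts the longest common run of tokens. It is a sketch under assumptions rather than the tool's implementation: the function names are hypothetical, "aligned on the basis of case" is read here as "identical once case is ignored", and the longest common substring is computed over tokens, although the tool may compute it over characters.

```python
# Colors as listed in the abstract; the category keys are assumed labels.
LEGEND = {
    "complete": "dark green",
    "case": "light green",
    "levenshtein": "turquoise green",
    "gap": "yellow",
    "non-aligned": "red",
}


def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def classify(tok_a, tok_b, max_dist=1):
    """Return the legend category for one aligned pair (None marks a gap)."""
    if tok_a is None or tok_b is None:
        return "gap"
    if tok_a == tok_b:
        return "complete"
    if tok_a.lower() == tok_b.lower():
        return "case"  # assumed reading: equal apart from letter case
    if levenshtein(tok_a.lower(), tok_b.lower()) <= max_dist:
        return "levenshtein"
    return "non-aligned"


def longest_common_run(tokens_a, tokens_b):
    """Longest contiguous run of tokens shared by both sequences."""
    best_len, best_end = 0, 0
    prev = [0] * (len(tokens_b) + 1)
    for i in range(1, len(tokens_a) + 1):
        cur = [0] * (len(tokens_b) + 1)
        for j in range(1, len(tokens_b) + 1):
            if tokens_a[i - 1] == tokens_b[j - 1]:
                # Extend the run that ended at the previous token pair.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return tokens_a[best_end - best_len:best_end]


if __name__ == "__main__":
    pairs = [("Arma", "arma"), ("virumque", "uirumque"), ("cano", None)]
    for tok_a, tok_b in pairs:
        category = classify(tok_a, tok_b)
        print(tok_a, tok_b, category, LEGEND[category])
```

The order of the checks matters: an exact match is tested before the case-insensitive one, and the Levenshtein tolerance is only consulted when both fail, so each pair receives exactly one color.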

The aim of the tool is to facilitate various types of textual comparison for several purposes. For editorial practice, it allows the detection of manuscript variants across several witnesses of the same text, as well as comparison across extended texts, such as editions or OCR outputs. It also facilitates the detection of non-literal variants, for example in instances of text reuse in the indirect tradition. Results so far encourage applications in scholarly editorial practice and in larger crowdsourcing efforts for the detection of a large number of variants.

Future stages of development include the establishment of a workflow for automatic alignment of OCR outputs for post-correction, Multiple Sequence Alignment (MSA) and a wider variety of visualization options, including graphic layouts based on a reference text [6]. The export format of the resulting aligned files will also be extended to CSV, XML, TXT and JSON.

The tool is available as a web service at http://ialignment.com/ or as Python and PHP code in the GitHub repositories (https://github.com/OpenGreekAndLatin/ILA_python and https://github.com/OpenGreekAndLatin/ILA_php).

Bibliography

[1] M. West, Textual criticism and editorial technique, Stuttgart, 1973.

[2] R. H. Dekker, D. van Hulle, G. Middell, V. Neyt, J. van Zundert, “Computer-supported collation of modern manuscripts: CollateX and the Beckett Digital Manuscript Project”, in Digital Scholarship in the Humanities, Oxford, Mar 2014.

[3] F. Makedon, M. Owen, C. Owen, J. Ford, C. Metaxaki-Kossionides, and T. Steinberg, “HEAR HOMER: A multimedia-data access remote prototype for ancient texts”, in Proceedings of ED-MEDIA ’98, Freiburg, 1998.

[4] S. B. Needleman and C. D. Wunsch, “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins”, Journal of Molecular Biology, 48(3): 443-453, 1970.

[5] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals”, Soviet Physics Doklady, 10(8): 707-710, 1966.

[6] S. Jänicke, M. Büchler and G. Scheuermann, “Improving the Layout for Text Variant Graphs”, in VisLR workshop at LREC Conference 2014, Reykjavik, Iceland.