# OpusTools - the Perl package OpusTools-perl is a collection of tools and scripts to manipulate parallel data collected in [OPUS](http://opus.nlpl.eu/). There are tools for reading, converting and processing the data in various ways. Note that there is also an alterantive Python package called [OpusTools](https://github.com/Helsinki-NLP/OpusTools) that provides similar functionalities but both of these packages do not cover the same kind of tasks even though there is some overlap. ## Installing ``` perl Makefile.pl make all make install ``` ### Requirements The following Perl libraries are required * Module::Install * Archive::Zip * DB_File * HTML::Entities * Lingua::Sentence * Ufal::UDPipe * XML::Parser * XML::Writer ## Tools and their usage The package includes a number of tools that can be used on the command-line. Tools for reading and processing data: * **opus-read**: read and filter sentence aligned corpora * **opus-cat**: read files from zipped OPUS corpus file collections * **opus-udpipe**: parse OPUS corpora with UDPipe Tools related to alignment: * **opus-merge-align**: merge sentence alignment files (delete duplicates) * **opus-pivoting**: create transitive sentence links via a pivot language * **opus-pt2dic**: extract a rough bilingual dictionary from SMT phrase-tables * **opus-pt2dice**: extract a bilingual dictionary with DICE scores * **opus-split-align**: get alignments per document from a sentence alignment file * **opus-swap-align**: swap the sentence alignment Conversion tools: * **moses2opus**: convert aligned plain text files to OPUS format * **opus2moses**: extract aligned plain text files from OPUS files * **tmx2moses**: convert TMX files into aligned plain text files * **tmx2opus**: convert TMX files into OPUS format * **xml2opus**: add sentence boundary markup to arbitrary XML files * **opus2text**: extract plain text from OPUS XML files * **opus2multi**: make a multiparallel corpus using a pivot language * **opus-iso639**: convert between ISO639 standards Admin tools: * **opus-index**: create CWB indeces from OPUS corpora * **opus-make-website**: generate corpus websites ### opus-read Read aligned sentences from OPUS corpora: ``` opus-read [OPTIONS] align-file.xml ``` Command-line options: ``` -c ........... set a link threshold -d ........... set home directory for aligned XML documents -h ................. print simple HTML -l ................. print links (filter mode) -m ........... print max alignments -n ......... get only documents that match the regex -N ......... skip all documents that match the regex -o ........... set a threshold for time overlap (subtitle data) -r ....... release (default = latest) -s ........ require source sentences to match -t ........ require target sentences to match -S ........... maximum number of source sentence in alignments -T ........... maximum number of target sentence in alignments -SN ........... number of source sentence in alignments -TN ........... number of target sentence in alignments ``` "opus-read" is a simple script to read sentence alignments stored in XCES align format and prints the aligned sentences to STDOUT. It requires monolingual alignments (ascending order, no crossing links) of sentences in linked XML files. Linked XML files are specified in the "toDoc" and attributes (see below). ``` .... ``` Several parameters can be set to filter the alignments and to print only certain types of alignments. `opus-read` can also be used to filter the XCES alignment files and to print the remaining links in the same XCES align format. Use the "-l" flag to enable this mode. Example usage: ``` # read sentence alignments and print aligned sentences opus-read align-file.xml opus-read align-file.xml.gz opus-read corpusname/lang-pair opus-read -d corpusname lang-pair opus-read -d corpusname -s srclang -t trglang # print alignments with alignment certainty > LinkThr=0 opus-read -c 0 align-file.xml # print alignments with max 2 source sentences and 3 target sentences opus-read -S 2 -T 3 align-file.xml # print aligned sentences marked as 'de' (source) and 'en' (target) # (this only works if sentences are marked with languages: # for example, in the German XML file: ...) opus-read -s de -t en align-file.xml # wrap aligned sentences in simple HTML opus-read -h align-file.xml # print max 10 alignments opus-read -m 10 align-file.xml # specify home directory of aligned XML files opus-read -d /path/to/xml/files align-file.xml # print XCES align format of all 1:1 sentence alignments opus-read -S 1 -T 1 -l align-file.xml ``` ### opus-udpipe opus-udpipe runs OPUS data through UDPipe and produces OPUS compatible XML. ``` opus-upipe [OPTIONS] < input.xml > output.xml ``` Command-line options: ``` -l ......... language ID (ISO639-1) -m ....... path to udpipe models -v ........ model version -D .................. print model dir (and stop) -L .................. list supported languages -M .................. list UDPipe models ``` Option -M can be combined with -D and -L/-l to get various kinds of combined output. ### opus-index A tool for indexing OPUS data with the Corpus Work Bench (CWB). It extracts sentences, positional attributes (such as POS tags) and structural markup. It also converts sentence alignment information and prepares the vertical format that can be imported by the CWB tools. This tool is mainly for internal use within the OPUS server environment. Command-line options: ``` -a lang.... list of aligned languages (optional, space separated) -o ........ overwrite existing data (deletes entire data directory!!) -y ........ assumes yes (doesn't prompt before deleting data dir!) -s ........ skip conversion via recode (used for OO) -m dir .... directory for temporary data (otherwise /tmp/BITEXTINDEXER...) -i depth .. min depth for finding alignment file (0 otherwise) -u pattern allowed structural patterns -p pattern allowed positional patterns -U pattern disallowed structural patterns -P pattern disallowed positional patterns -M ........ skip creating monolingual index files -A ........ skip creating alignment files -k ........ keep temp file for cwb encoding -e enc .... use character encoding enc -C ........ convert only (don't run indexing and registring) ```