
Spellchecker for Xaraya or other UTF-8 XML files

Posted by: Ferenc Veres on February 20, 2005 05:42:51 PM +00:00

This Perl script combines two other scripts and lets you spellcheck UTF-8 encoded XML files. It was designed for spellchecking Xaraya CMS translations, but you can use it for other UTF-8 XML files too. The source code is pre-configured for checking Xaraya files (e.g. the default XML node names).

Spellchecking Xaraya:

If you translate the system on your national Xaraya site (NN.xaraya.com), you spellcheck Xaraya by checking a downloaded local copy of the language pack: run this program on the local files, then fix the errors it finds in the Translations module online on the NLS site.

This may sound a bit odd, but it keeps the advantages you already have on the NLS site, such as co-operation, BitKeeper push and so on. Believe me, the work is very simple and quick this way.

To help with the translation, the temporary TXT files use a filename that refers to the real template file, so you can identify which page to load in the Translations module. The name of the file is always displayed at the top of the spellchecker window (assuming you use ispell).

To spellcheck a whole module, run a "find" command on your Linux system, because this "lazy" script can spellcheck only a single file at a time (the Unix philosophy...).

find modules/articles -name \*.xml -exec xml_utf8_spellcheck.pl {} \;

Theoretically you could also save the changes back to the file directly, if you want to fix a local copy of the language pack.

Download

xml_utf8_spellcheck_1.0.zip

Original man page

(99% of it was written by the author of xml_spellcheck):

NAME

xml_utf8_spellcheck

SYNOPSIS

xml_utf8_spellcheck [options] <files>

DESCRIPTION

xml_utf8_spellcheck lets you spell check the content of an XML file. It extracts the text (the content of elements and optionally of attributes), decodes utf8 to latin1/2, calls a spell checker on it and then recreates the XML document.
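The charset step can be tried on its own. The script does its conversion internally in Perl; the sketch below only imitates that step with the standard iconv tool, and the function names to_spell_charset and from_spell_charset are hypothetical. It assumes the text contains only characters that exist in the target charset (iso-8859-2 here, the documented default of --spell_charset).

```shell
# Convert UTF-8 text on stdin to the charset handed to the spellchecker.
# The default mirrors the script's documented default, iso-8859-2.
to_spell_charset() {
    iconv -f UTF-8 -t "${1:-ISO-8859-2}"
}

# Convert the spellchecked text back to UTF-8 before it goes into the XML.
from_spell_charset() {
    iconv -f "${1:-ISO-8859-2}" -t UTF-8
}

# Round trip with a Hungarian sample word.
printf '%s\n' 'árvíztűrő' | to_spell_charset | from_spell_charset
```

If the round trip does not return the sample word unchanged, the chosen charset cannot represent some character in the text, which is the situation where a different --spell_charset value would be needed.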

OPTIONS

Note that all options can be abbreviated to the first letter. These are the original options of xml_spellcheck.pl.
--conf <configuration_file>
Gets the options from a configuration file. NOT IMPLEMENTED YET.
--spellchecker <spellchecker>
The command to use for spell checking, including any options. By default "ispell -d magyar" is used.
--backup-extension <extension>
By default the original file is saved with a ".bak" extension. This option changes the extension.
--attributes
Spell check attribute content. By default attribute values are NOT spell checked. NOT IMPLEMENTED YET.
--exclude_elements <list_of_excluded_elements>
A list of elements that should not be spell checked.
--include_elements <list_of_included_elements>
A list of elements that should be spell checked (by default all elements are spell checked).
"--exclude_elements" and "--include_elements" are mutually exclusive.
--pretty_print <optional_pretty_print_style>
A pretty print style for the document, as defined in XML::Twig. If the option is provided without a value then the "indented" style is used.
--spell_charset <character_encoding_name>
The encoding of the temporary file which is passed to the spellchecker.
Default is iso-8859-2.
--version
Display the tool version and exit
--help
Display help message and exit
--man
Display longer help message and exit

EXAMPLES

To spellcheck one single file:

xml_utf8_spellcheck.pl my-utf8-file.xml

To spellcheck one complete directory use:

find modules/articles -type f -name \*.xml -exec xml_utf8_spellcheck.pl {} \;
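Because the spellchecker is interactive, it can help to preview which files find will visit before starting a session. The following is a minimal sketch and not part of the script: the function names list_xml_files and check_xml_files, and the optional second argument that swaps in another command, are assumptions made for illustration.

```shell
# List the XML files that would be checked, without starting the spellchecker.
list_xml_files() {
    find "$1" -type f -name '*.xml' -print
}

# Run the checker on each file, one at a time. The optional second argument
# is a hypothetical override, so the loop can be tried safely with "echo".
check_xml_files() {
    find "$1" -type f -name '*.xml' -exec "${2:-xml_utf8_spellcheck.pl}" {} \;
}
```

For example, "list_xml_files modules/articles" shows the candidates, "check_xml_files modules/articles echo" is a dry run, and "check_xml_files modules/articles" starts the real spellcheck. Using find with -exec, rather than piping file names into a loop, keeps the terminal attached to the spellchecker.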

BUGS

The "<" to "&lt;" replacement also happens inside CDATA sections, which is a serious bug.
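To see why this matters: inside a CDATA section the characters are taken literally, so an escaped "&lt;" is not read back as "<". The sketch below is only an illustration with sed, not the script's actual code, and the function name escape_lt is hypothetical.

```shell
# Blanket escaping, applied without checking where the text came from.
escape_lt() {
    sed 's/</\&lt;/g'
}

# Text taken from a normal element: escaping is needed and harmless.
printf '%s\n' 'a < b' | escape_lt

# The same text taken from inside <![CDATA[ ... ]]>: writing the escaped
# form back into the CDATA section changes the content, because a parser
# returns the four characters "&lt;" there instead of "<".
printf '<![CDATA[%s]]>\n' "$(printf '%s\n' 'a < b' | escape_lt)"
```

Until the bug is fixed, it is probably safest to review files that contain CDATA sections by hand after spellchecking, or to restore them from the backup copy.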

TODO

--conf option
--attribute option

PRE-REQUISITE

XML::Twig, Getopt::Long, Pod::Usage, File::Temp. XML::Twig requires XML::Parser.

SEE ALSO

XML::Twig

COPYRIGHT AND DISCLAIMER

This program is Copyright 2005 by Ferenc Veres
Original xml_spellcheck is Copyright 2003 by Michel Rodriguez

This program is free software; you can redistribute it and/or modify it under the terms of the Perl Artistic License or the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

If you do not have a copy of the GNU General Public License, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

AUTHOR

Integration of the two scripts: Ferenc Veres <lionNO@SPAMnetngine.NOhu>

Original xml_spellcheck.pl: Michel Rodriguez <mirodNO@SPAMxmltwig.NOcom>
Original utf8spell.pl: No author name is given, sorry. (License: PD)

xml_utf8_spellcheck is available at http://lion.xaraya.hu/news/71
