
Spellchecker for Xaraya or other UTF-8 XML files

Posted by: Ferenc Veres on February 20, 2005 05:42:51 PM +00:00

This Perl script combines two other scripts and lets users spellcheck UTF-8 encoded XML files. It was designed to spellcheck Xaraya CMS translations, but you can use it for other UTF-8 XML files too. The source code is pre-configured for checking Xaraya files (e.g. the default XML node names).

Spellchecking Xaraya:

If you use your national Xaraya site for translating the system, you spellcheck Xaraya by checking a downloaded local copy of the language pack and fixing the errors online. That is, you run this program on the local files and then correct the errors it finds in the Translations module on the NLS site.

This may sound a bit odd, but it preserves the advantages you already have on the NLS site, such as co-operation, BitKeeper push and so on. Believe me, the work is very simple and quick this way.

To help the translation work, the temporary TXT files use a filename that refers to the real template file, so you can identify which page to load in the Translations module. The name of the file is always displayed at the top of the spellchecker window (assuming you use ispell).
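The exact naming scheme is not spelled out here, so the following is only a sketch of how a temporary file name could be derived from a template path. The function name `temp_name_for`, the example path and the rule of replacing "/" with "_" are assumptions for illustration, not the script's actual behaviour.

```shell
# Hypothetical helper: derive a flat TXT file name from an XML path.
# The "/" to "_" rule is an assumption, not taken from the script.
temp_name_for() {
    # $1 = path of the XML file, e.g. modules/articles/user-display.xml
    flat=$(printf '%s' "$1" | tr '/' '_')
    # Drop the .xml suffix and add .txt instead.
    printf '%s.txt\n' "${flat%.xml}"
}
```

With this rule, `temp_name_for modules/articles/user-display.xml` prints `modules_articles_user-display.txt`, which still shows which template the text came from.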

To spellcheck a whole module, run a "find" command on your Linux system, because this "lazy" script can spellcheck only a single file at a time (the Unix philosophy).

find modules/articles -name \*.xml -exec xml_utf8_spellcheck {} \;

Theoretically you could also save the changes back to the file directly, if you want to fix a local copy of the language pack.
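If you want to check several modules in one go, the find call above can be wrapped in a small shell function. This is a minimal sketch: the function name `spellcheck_modules` and the `SPELLCHECK_CMD` variable are made up for this example, and it assumes the script is installed on your PATH as xml_utf8_spellcheck.

```shell
# Assumption: the script is on PATH under this name; override if needed.
SPELLCHECK_CMD=${SPELLCHECK_CMD:-xml_utf8_spellcheck}

# Hypothetical wrapper: run the checker on every XML file of each
# module directory given as an argument, one file at a time.
spellcheck_modules() {
    for module_dir in "$@"; do
        find "$module_dir" -type f -name '*.xml' -exec "$SPELLCHECK_CMD" {} \;
    done
}
```

A call such as `spellcheck_modules modules/articles modules/roles` (the second directory is only an example) would then walk through both modules file by file.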


Original man page

(99% of it was written by the author of xml_spellcheck):




xml_utf8_spellcheck [options] <files>


xml_utf8_spellcheck lets you spell check the content of an XML file. It extracts the text (the content of elements and optionally of attributes), decodes it from UTF-8 to Latin-1 or Latin-2, calls a spell checker on it and then recreates the XML document.
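The character set step can be tried out by hand with iconv. The sketch below is not how the Perl script does it internally; it only shows the UTF-8 to ISO-8859-2 round trip on the command line. The function names and the `SPELL_CHARSET` variable are assumptions for this example.

```shell
# Assumption: ISO-8859-2 (Latin-2) is the charset the spell checker expects.
SPELL_CHARSET=${SPELL_CHARSET:-ISO-8859-2}

# Convert UTF-8 text on stdin to the spell checker's charset.
to_spell_charset() {
    iconv -f UTF-8 -t "$SPELL_CHARSET"
}

# Convert the checked text on stdin back to UTF-8.
from_spell_charset() {
    iconv -f "$SPELL_CHARSET" -t UTF-8
}
```

For example, `printf 'sz\303\266veg' | to_spell_charset | from_spell_charset` should give back the original UTF-8 bytes, while the intermediate form uses a single byte per accented letter.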


Note that all options can be abbreviated to the first letter. These are the original options of xml_spellcheck.
--conf <configuration_file>
Gets the options from a configuration file. NOT IMPLEMENTED YET.
--spellchecker <spellchecker>
The command to use for spell checking, including any option. By default "ispell -d magyar" is used.
--backup-extension <extension>
By default the original file is saved with a ".bak" extension. This option changes the extension.
Spell check attribute content. By default attribute values are NOT spell checked. NOT YET IMPLEMENTED
--exclude_elements <list_of_excluded_elements>
A list of elements that should not be spell checked
--include_elements <list_of_included_elements>
A list of elements that should be spell checked (by default all elements are spell checked).
"--exclude_elements" and "--include_elements" are mutually exclusive.
--pretty_print <optional_pretty_print_style>
A pretty print style for the document, as defined in XML::Twig. If the option is provided without a value then the "indented" style is used.
--spell_charset <character_encoding_name>
The encoding of the temporary file which is passed to the spellchecker.
Default is iso-8859-2.

Display the tool version and exit
Display help message and exit
Display longer help message and exit
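If you call the script from your own wrapper, the rule that "--exclude_elements" and "--include_elements" cannot be combined can be checked before the spell checker starts. A minimal sketch, assuming only the long option forms (or prefixes of them such as --e and --i) are used; the function name `check_element_options` is hypothetical.

```shell
# Hypothetical pre-flight check for a wrapper script.
# Returns 1 if both element filters appear among the arguments.
check_element_options() {
    seen_include=0
    seen_exclude=0
    for arg in "$@"; do
        case $arg in
            --i*) seen_include=1 ;;   # any argument starting with --i counts as --include_elements
            --e*) seen_exclude=1 ;;   # any argument starting with --e counts as --exclude_elements
        esac
    done
    if [ "$seen_include" -eq 1 ] && [ "$seen_exclude" -eq 1 ]; then
        echo '"--exclude_elements" and "--include_elements" are mutually exclusive' >&2
        return 1
    fi
    return 0
}
```

A wrapper could run `check_element_options "$@" || exit 1` before handing the same arguments on to the script.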


To spellcheck one single file:

xml_utf8_spellcheck my-utf8-file.xml

To spellcheck one complete directory use:

find modules/articles -type f -name \*.xml -exec xml_utf8_spellcheck {} \;


The replacement of "<" with "&lt;" also happens inside CDATA sections, which is a serious bug.


Not yet implemented:
--conf option
--attribute option


Required Perl modules: XML::Twig, Getopt::Long, Pod::Usage, File::Temp. XML::Twig requires XML::Parser.




This program is Copyright 2005 by Ferenc Veres
Original xml_spellcheck is Copyright 2003 by Michel Rodriguez

This program is free software; you can redistribute it and/or modify it under the terms of the Perl Artistic License or the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

If you do not have a copy of the GNU General Public License write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.


Integrated 2 scripts together: Ferenc Veres <lionNO@SPAMnetngine.NOhu>

Original xml_spellcheck: Michel Rodriguez <mirodNO@SPAMxmltwig.NOcom>
Original Has no author name marked, sorry. (License: PD)

xml_utf8_spellcheck is available at

