This Perl script integrates two other scripts and lets users spellcheck UTF-8 encoded XML files. It was designed for spellchecking Xaraya CMS translations, but you can use it for other UTF-8 XML files too. The source code is pre-configured for checking Xaraya files (e.g. the default XML node names).
If you use your national Xaraya site (NN.xaraya.com) to translate the system, you spellcheck Xaraya by checking a downloaded local copy of the language pack and fixing the errors online. That is, you run this program on the local files and then correct the errors it finds in the Translations module on the NLS site.
This may sound a bit odd, but it preserves the advantages you already have on the NLS site, such as co-operation, BitKeeper push and so on. Believe me, the work is very simple and quick this way.
To help with the translation, the temporary TXT files use a filename that refers to the real template file, so you can identify which page to load in the Translations module. The name of the file is always displayed at the top of the spellchecker window (assuming you use ispell).
To spellcheck a whole module, run a "find" command on your Linux system, because this "lazy" script can spellcheck only a single file at a time (the Unix philosophy...).
find modules/articles -name \*.xml -exec xml_utf8_spellcheck.pl {} \;
Theoretically you could also save the changes back to the file directly, if you want to fix a local copy of the language pack.
The manual follows (99% of it was written by the author of xml_spellcheck):
xml_utf8_spellcheck
xml_utf8_spellcheck [options] <files>
xml_utf8_spellcheck lets you spell check the content of an XML file. It extracts the text (the content of elements and optionally of attributes), decodes UTF-8 to Latin-1/2, calls a spell checker on it and then recreates the XML document.
Note that all options can be abbreviated to their first letter. These are the original options of xml_spellcheck.pl.
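The extract-and-transcode step described above can be sketched roughly as follows. This is a Python illustration only, not the script's actual Perl code (which is built on XML::Twig). The function names extract_text, to_spell_charset and from_spell_charset and the sample document are made up for the example, and the sketch covers element text only, not the final step of writing corrections back into the XML.

```python
import xml.etree.ElementTree as ET

def extract_text(xml_bytes):
    """Collect the non-blank text of all elements, in document order."""
    root = ET.fromstring(xml_bytes)
    return [t.strip() for t in root.itertext() if t.strip()]

def to_spell_charset(text, charset="iso-8859-2"):
    """Encode text into the 8-bit charset the spell checker expects."""
    return text.encode(charset)

def from_spell_charset(data, charset="iso-8859-2"):
    """Decode the (possibly corrected) spell checker text back to Unicode."""
    return data.decode(charset)

# Hypothetical sample input, not a real Xaraya language file.
SAMPLE = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<strings><string>árvíztűrő</string><string>tükörfúrógép</string></strings>"
).encode("utf-8")

words = extract_text(SAMPLE)
latin2 = [to_spell_charset(w) for w in words]

# Each accented letter takes two bytes in UTF-8 but one in ISO-8859-2.
print(len(words[0].encode("utf-8")), len(latin2[0]))  # prints "13 9"

# Decoding restores the original Unicode strings.
restored = [from_spell_charset(b) for b in latin2]
```

The Hungarian letters ő and ű exist in ISO-8859-2 but not in ISO-8859-1, which is presumably why a Latin-2 charset suits a Hungarian ispell dictionary.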
--conf <configuration_file>
Gets the options from a configuration file. NOT IMPLEMENTED YET.
--spellchecker <spellchecker>
The command to use for spell checking, including any options. By default "ispell -d magyar" is used.
--backup-extension <extension>
By default the original file is saved with a ".bak" extension. This option changes the extension.
--attributes
Spell check attribute content. By default attribute values are NOT spell checked. NOT YET IMPLEMENTED.
--exclude_elements <list_of_excluded_elements>
A list of elements that should not be spell checked.
--include_elements <list_of_included_elements>
A list of elements that should be spell checked (by default all elements are spell checked).
"--exclude_elements" and "--include_elements" are mutually exclusive.
--pretty_print <optional_pretty_print_style>
A pretty print style for the document, as defined in XML::Twig. If the option is provided without a value then the "indented" style is used.
--spell_charset <character_encoding_name>
The encoding of the temporary file which is passed to the spellchecker. Default is iso-8859-2.
--version
Display the tool version and exit.
--help
Display help message and exit.
--man
Display longer help message and exit.
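How the include and exclude lists might narrow down what gets checked can be sketched like this. It is a Python illustration under assumptions, not the script's Perl code: the function select_elements and the tag names in the sample document are hypothetical, and the real script may match elements differently.

```python
import xml.etree.ElementTree as ET

def select_elements(root, include=None, exclude=None):
    """Return the elements whose text should be spell checked.

    include and exclude are mutually exclusive, mirroring the two options.
    With neither given, every element is selected.
    """
    if include and exclude:
        raise ValueError("include and exclude are mutually exclusive")
    selected = []
    for elem in root.iter():
        if include and elem.tag not in include:
            continue
        if exclude and elem.tag in exclude:
            continue
        selected.append(elem)
    return selected

# Hypothetical document; the tag names are made up for the example.
DOC = ET.fromstring(
    "<entries><entry><key>title</key><text>Hello world</text></entry></entries>"
)

only_text = [e.tag for e in select_elements(DOC, include={"text"})]
no_keys = [e.tag for e in select_elements(DOC, exclude={"key"})]
print(only_text)  # prints "['text']"
print(no_keys)    # prints "['entries', 'entry', 'text']"
```

Rejecting the combination up front keeps the behaviour unambiguous, since an element named in both lists would otherwise have no obvious treatment.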
To spellcheck one single file:
xml_utf8_spellcheck.pl my-utf8-file.xml
To spellcheck one complete directory use:
find modules/articles -type f -name \*.xml -exec xml_utf8_spellcheck.pl {} \;
The "<" : "&lt;" replacement happens inside CDATA sections too, which is a serious bug.
Not yet implemented: the --conf option and the --attributes option.
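Why the CDATA bug listed above is serious can be shown with a small sketch. Python is used for illustration only; this is not the script's code, and naive_escape is a hypothetical stand-in for the replacement step. Inside a CDATA section "<" is already literal text, so replacing it there changes the content instead of protecting it.

```python
import xml.etree.ElementTree as ET

def naive_escape(text):
    """Replace every "<" with "&lt;", with no regard for context."""
    return text.replace("<", "&lt;")

# A CDATA section holding a literal "<".
payload = "if a < b"
original = "<msg><![CDATA[" + payload + "]]></msg>"
before = ET.fromstring(original).text

# Apply the replacement to the CDATA payload, as the bug would.
damaged = "<msg><![CDATA[" + naive_escape(payload) + "]]></msg>"
after = ET.fromstring(damaged).text

print(before)  # prints "if a < b"
print(after)   # prints "if a &lt; b"
```

The damaged document is still well-formed, so no parser complains; the translated string simply ends up containing the literal characters "&lt;".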
Requires XML::Twig, Getopt::Long, Pod::Usage and File::Temp. XML::Twig requires XML::Parser.
See also: XML::Twig
This program is Copyright 2005 by Ferenc Veres
Original xml_spellcheck is Copyright 2003 by Michel Rodriguez
This program is free software; you can redistribute it and/or modify it under the terms of the Perl Artistic License or the GNU General Public License as published by the Free Software Foundation either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
If you do not have a copy of the GNU General Public License write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
Integrated 2 scripts together: Ferenc Veres <lionNO@SPAMnetngine.NOhu>
Original xml_spellcheck.pl: Michel Rodriguez <mirodNO@SPAMxmltwig.NOcom>
Original utf8spell.pl: Has no author name marked, sorry. (License: PD)
xml_utf8_spellcheck is available at http://lion.xaraya.hu/news/71
Ferenc Veres