HTMLStripper Version 0.4 Details
The HTMLStripper*1
is a quick, spur-of-the-moment hack by our CEO to ease the problem of
finding, defining, and filtering the words to include
in our web pages' "keywords" meta tags, and of creating the indexes of the compiled help for the offline,
setup, and user guides that ship with our products.
In the year since its first public release (2023), our CEO has not only redesigned the application's internals
but has also continued to extend and improve the application's functionality.
What's it Good for?
Apart from its intended purpose, it already proved useful in the very first tests,
during which we discovered that the word lists are a real help in detecting typographical
errors, something it wasn't developed for. It is therefore highly likely that, even in its
present, raw, and incomplete state, it will prove useful not only in proofreading but also
in porting HTML/XML*2
content into other file formats (for example, TeX), in research, in journalism, and in several other
areas we haven't thought of.
Against our normal policy, we have therefore decided to publish it in its current, preliminary
state, without all the planned features and functionality, and to pick up some of the ideas that users
(i.e. you) may have.
What it Does and How it Works
The HTMLStripper takes an HTML or XML file as input and strips it of all scripts, comments, and tags.
That is, it removes everything except the text that a user normally sees when the
file is displayed in a browser. From this text it then generates a list of the words that occur in it,
together with the number of occurrences of each word.
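The stripping and counting steps described above can be sketched in a few lines of Python. This is only an illustration and not the HTMLStripper's actual implementation: the names TextExtractor, strip_markup, and count_words are hypothetical, and the rule that a "word" is an unbroken run of letters is an assumption.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the content of <script> and <style>.

    Comments never reach handle_data (the parser routes them to
    handle_comment), so they are dropped without extra code.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def strip_markup(source):
    """Return the readable text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(source)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def count_words(text):
    """Count each word; a word is assumed to be a run of letters."""
    return Counter(re.findall(r"[^\W\d_]+", text))


if __name__ == "__main__":
    # Hypothetical input, not the example file discussed on this page.
    sample = (
        "<html><head><title>Demo Page</title>"
        "<script>var x = 1;</script></head>"
        "<body><!-- hidden --><h1>Hello World Hello</h1></body></html>"
    )
    text = strip_markup(sample)
    print(text)
    print(count_words(text))
```

Note that this sketch keeps the document title in the output because the title is an ordinary text node; only script and style content is suppressed explicitly.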
A very simple example would be the Hyper Text Markup Language (HTML) file whose source code
is given below:
<html>
Opened in a web browser, this source code would merely display the text
"Hello Example Stoelzel Software Tech. World !" in very large letters.
From this source code the HTMLStripper will produce the following plain text output:
"SST HTMLStripper Example, HTML Source File Hello Example Stoelzel Software Tech. World !" (without the quotation marks).
In other words, the HTML document title followed by the unformatted text that is displayed in a web browser.
At the same time the HTMLStripper also generates two lists of the words that occur in the
stripped text: an alphabetically sorted plain text list and a likewise alphabetically sorted
comma separated list that includes the number of occurrences of each word.
The number of occurrences in the second list can then be interpreted as the rank or relevance
of each word.
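As a rough illustration of the two lists, the following Python sketch builds both from a table of word counts. The function names are hypothetical, and the case-insensitive sort order and the exact line layout are assumptions; only the "Word, Occurrences" header is taken from this page.

```python
from collections import Counter


def plain_word_list(counts):
    """One word per line, sorted alphabetically (case-insensitive here)."""
    return "\n".join(sorted(counts, key=str.casefold))


def csv_word_list(counts):
    """Comma separated list of each word with its number of occurrences."""
    lines = ["Word, Occurrences"]
    for word in sorted(counts, key=str.casefold):
        lines.append(f"{word}, {counts[word]}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical counts, not the output of the HTMLStripper itself.
    counts = Counter(["World", "Hello", "Hello", "demo"])
    print(plain_word_list(counts))
    print(csv_word_list(counts))
```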
Apart from being used within the HTMLStripper to create a Microsoft Compiled Help .hhk index file
or a keyword meta tag, this information can also be saved to various plain text files
or simply copied into other applications for further processing.
Of course, the above example uses only 11 different words in the entire readable text
(i.e. the text that is normally displayed in a web browser), so manually removing any words that
may not be wanted in a meta tag or index is no problem. In "real" files, however, a text may use
more than a thousand different words, and having to remove such words as "and", "my", "the", etc.
by hand would be a nuisance, so the HTMLStripper also has an integrated filter. This filter is invoked
simply by providing a plain text list of the words to ignore/suppress in the generated word lists.
If the list is omitted, the HTMLStripper lists all words; otherwise it lists only those that do not
appear in the provided list.
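The filter behaviour can be pictured with the following Python sketch. It is not the HTMLStripper's own code: the function names are hypothetical, and both the one-word-per-line layout of the ignore list and the case-insensitive comparison are assumptions.

```python
def load_ignore_list(text):
    """Parse an ignore list; assumes one word per line, blank lines skipped."""
    return {line.strip().casefold() for line in text.splitlines() if line.strip()}


def apply_filter(counts, ignore=None):
    """Return the word counts without the ignored words.

    With no ignore list (None or empty), every word is kept.
    """
    if not ignore:
        return dict(counts)
    return {word: n for word, n in counts.items() if word.casefold() not in ignore}


if __name__ == "__main__":
    # Hypothetical data for illustration only.
    counts = {"The": 3, "and": 5, "Stripper": 2, "my": 1}
    ignore = load_ignore_list("and\nmy\nthe\n")
    print(apply_filter(counts))          # no filter: all words are listed
    print(apply_filter(counts, ignore))  # filter: ignored words are dropped
```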
Without a filter, the comma separated list
that is generated for the above example is:
Word, Occurrences
Various other HTMLStripper features then make it a simple task
to produce a Microsoft Compiled Help .hhk index file,
a keyword meta tag, or both from this data. For further information on how
this is achieved, please refer to
Demo Video 1.
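For readers who want to turn the word counts into a keyword meta tag themselves, a minimal Python sketch follows. It does not reproduce how the HTMLStripper does this: the function name, the limit parameter, and the choice to rank words by their number of occurrences are all assumptions made for illustration.

```python
from html import escape


def keywords_meta_tag(counts, limit=10):
    """Build a keywords meta tag from the most frequent words.

    Words are ranked by occurrences (highest first); ties are broken
    alphabetically. Both choices are assumptions of this sketch.
    """
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0].casefold()))
    keywords = ", ".join(word for word, _ in ranked[:limit])
    return f'<meta name="keywords" content="{escape(keywords, quote=True)}">'


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(keywords_meta_tag({"Hello": 2, "World": 1, "Demo": 1}, limit=2))
```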
Known Issues
Footnotes
Document/Contents version 1.00 Page/URI last updated on 22.12.2024
Copyright © Stoelzel Software Technologie (SST) 2010 - 2017
Suggestions and comments mail to: webmaster@stoelzelsoftwaretech.com