HTMLStripper Version 0.4 Details

The HTMLStripper^*1 is a quick, spur-of-the-moment hack by our CEO to ease the task of finding, defining, and filtering the words to include in our web pages' "keywords" meta tags, and of creating the indexes of the compiled help for the offline, setup, and user guides that ship with our products. In the year since its first public release (2023), our CEO has not only redesigned the application's internals but has also continued to extend its functionality.

What's it Good for?

Apart from its intended purpose, it already proved useful in the very first tests, during which we discovered that the word lists are a real help in detecting typographical errors, something it wasn't developed for. It is therefore highly likely that, even in its present, raw, and incomplete state, it will prove useful not only in proofreading, but also in porting HTML/XML^*2 content into other file formats (for example, TeX), in research, in journalism, and in several other areas we haven't thought of.

Against our normal policy, we have therefore decided to publish it in its current, preliminary state, without all the planned features and functionality, and to pick up some of the ideas that users (i.e. you) may have.

What it Does and How it Works

The HTMLStripper takes an HTML or XML file as input and strips it of all scripts, comments, and tags. That is, it removes everything except the text that a user normally sees when the file is displayed in a browser. From this text it then generates a list of the words that occur in it, together with the number of occurrences of each word.

A very simple example is the following Hypertext Markup Language (HTML) file, whose source code is given below:

      <html>
      <head>
      <title>SST HTMLStripper Example, HTML Source File</title>
      <meta name="author" content="Administrator">
      <meta name="generator" content="Ulli Meybohms HTML EDITOR">
      </head>
      <body text="#000000" bgcolor="#FFFFFF" link="#FF0000" alink="#FF0000" vlink="#FF0000">
      <font size="+3">Hello Example Stoelzel Software Tech. World !</font>
      </body>
      </html>

Opened in a web browser, this source code merely displays the text
Hello Example Stoelzel Software Tech. World !
in very large letters.

From this source code, the HTMLStripper produces the following plain text output (shown here in quotation marks that are not part of the output):
"SST HTMLStripper Example, HTML Source File
Hello Example Stoelzel Software Tech. World !"

In other words, the output consists of the HTML document title and the unformatted text that is displayed in a web browser. At the same time, the HTMLStripper also generates two lists of the words that occur in the stripped text: an alphabetically sorted plain text list and a likewise alphabetically sorted, comma-separated list that includes the number of occurrences of each word. The number of occurrences in the second list can then be interpreted as the rank or relevance of each word. Apart from being used within the HTMLStripper to create a Microsoft Compiled Help .hhk index file or a keyword meta tag, this information can also be saved to various plain text files or simply copied into other applications for further processing.
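The stripping step described above can be approximated in a few lines of Python. This is only a minimal sketch built on the standard library's `html.parser` module, not the HTMLStripper's actual implementation; the names `TextExtractor` and `strip_markup` are hypothetical, and the decision to treat `script` and `style` content as invisible is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of a document and drops everything else."""

    # Assumption: the content of these elements is never shown to the user.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments never arrive here; the parser routes them to
        # handle_comment, which does nothing by default.
        if self.skip_depth == 0 and data.strip():
            # Collapse runs of whitespace inside the text node.
            self.parts.append(" ".join(data.split()))


def strip_markup(source):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(source)
    parser.close()
    return "\n".join(parser.parts)
```

Putting each text node on its own line is a simplification: text that is split by inline tags ends up on separate lines, which does not matter for the word lists but may differ from what the HTMLStripper itself writes.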

Of course, the above example uses only 11 different words in the entire readable text (i.e. the text that is normally displayed in a web browser), so manually removing any words that are not wanted in a meta tag or index is no problem. In "real" files, however, a text may use more than a thousand different words, and having to remove such words as "and", "my", "the", etc. by hand would be a nuisance, so the HTMLStripper also has an integrated filter. This filter is invoked simply by providing a plain text list of the words to ignore/suppress in the generated word lists. If the list is omitted, the HTMLStripper lists all words; otherwise, it lists only those words that do not appear in the provided list.

In the former case (i.e. without a filter), the comma-separated list that is generated for the above example is

      Word, Occurrences
      Example, 2
      File, 1
      Hello, 1
      HTML, 1
      HTMLStripper, 1
      Software, 1
      Source, 1
      SST, 1
      Stoelzel, 1
      Tech, 1
      World, 1
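The counting and filtering described above can be illustrated with a short Python sketch. It is not the HTMLStripper's own code: the function names `count_words` and `to_csv` are hypothetical, and it assumes that a "word" is a run of letters, that counting is case-sensitive, that sorting ignores case, and that the ignore list is matched without regard to case.

```python
import re
from collections import Counter

# Assumption: a "word" is any run of letters (no digits, no underscores).
WORD = re.compile(r"[^\W\d_]+")


def count_words(text, ignore=()):
    """Count each distinct word in text, skipping the words in ignore."""
    ignored = {word.casefold() for word in ignore}
    words = (w for w in WORD.findall(text) if w.casefold() not in ignored)
    return Counter(words)


def to_csv(counts):
    """Render the counts as an alphabetically sorted, comma-separated list."""
    lines = ["Word, Occurrences"]
    for word in sorted(counts, key=str.casefold):
        lines.append(f"{word}, {counts[word]}")
    return "\n".join(lines)


text = (
    "SST HTMLStripper Example, HTML Source File\n"
    "Hello Example Stoelzel Software Tech. World !"
)
print(to_csv(count_words(text)))
print(to_csv(count_words(text, ignore=["example", "file"])))
```

With these assumptions, the first call prints the same twelve lines as the list above; the second call prints the same list without the "Example" and "File" entries.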

Various other HTMLStripper features then make it a simple task to produce a Microsoft Compiled Help .hhk index file, a keyword meta tag, or both from this data. For further information on how this is achieved, please refer to the Preliminary User Guide or simply watch the 20-minute demo video below.
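As a rough illustration of the keyword meta tag step, the following Python sketch turns a word/occurrence mapping into a `keywords` meta tag. The function name `keywords_meta_tag` and the `limit` parameter are hypothetical, and ranking by number of occurrences is an assumption based on the "rank or relevance" interpretation mentioned earlier; the HTMLStripper itself may select keywords differently.

```python
from html import escape


def keywords_meta_tag(counts, limit=10):
    """Build a keywords meta tag from a mapping of word -> occurrences."""
    # Rank by descending count, then alphabetically for a stable order.
    ranked = sorted(
        counts.items(),
        key=lambda item: (-item[1], item[0].casefold()),
    )
    keywords = [word for word, _ in ranked[:limit]]
    # Escape the attribute value so quotes or ampersands cannot break the tag.
    content = escape(", ".join(keywords), quote=True)
    return f'<meta name="keywords" content="{content}">'


print(keywords_meta_tag({"Example": 2, "World": 1, "File": 1}, limit=2))
# prints: <meta name="keywords" content="Example, File">
```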

Demo Video 1

In some cases it may be desirable or necessary to play the above video in an external application (i.e. in an application other than the browser). You can do so by clicking on the download button to the right of this text.

Known Issues

•	The current version may only produce faultless output files if the source/input file is an ANSI encoded file.
•	Automatic conversion of the input file from UTF-8 or other character sets to ANSI (or in future, Unicode) has not been implemented yet. This means that to reliably process an input file that is not encoded in an ANSI character set, it may be necessary to convert it manually into an ANSI encoded file first. This has to be performed in a third party application, for example, Notepad (aka the Windows Editor).
•	The entire user interface is still one big construction site.
•	The "Delete" Menu Item and Tool Bar Button will only delete the items in the Distinct Words and Keywords List Views, but not the text in other controls. Nonetheless, any text marked/highlighted as selected can be deleted in all controls by pressing the "Del" key on the keyboard.
•	The standard Windows keyboard shortcuts Ctrl + C and Ctrl + V cannot be used to copy text from, or insert text into, the HTML Source File Combo Box.
•	To open the URL specified in the HTML Source File Combo Box in the integrated browser, the Open URL Menu Item in the View Menu or the corresponding button in the Button Tool Bar has to be used.
•	The integrated browser is not a current-generation (i.e. year 2023) browser. Many pages of websites that do not support older-generation browsers will not be displayed correctly and/or will cause numerous JavaScript errors.
•	The HTML/XML page that is open in the integrated browser can only be saved by means of the Save As Menu Item or Tool Bar Button.
•	When the HTML page in the integrated browser is saved to disk, only the HTML source code is saved; images, scripts, style sheets, etc. are not saved with it. Although this suffices to subsequently process the saved page/file, the page will obviously not be displayed correctly in a browser.
•	The application can be closed without any warning/notification that one or more modified files haven't been saved.
•	The number of references/links in the "Num. Link Targets" column of the Keywords List View is not updated.
•	The offsets of the logged tag positions aren't correct.
•	And, presumably, many more that we haven't found yet or have simply omitted from this list.
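The manual conversion to ANSI mentioned in the list above can also be scripted instead of being done in Notepad. The following Python sketch is one possible way to do it and is not part of the HTMLStripper; the function name `convert_to_ansi` is hypothetical, and it assumes that "ANSI" means the Western European Windows code page (`cp1252`), which may differ on other systems.

```python
from pathlib import Path


def convert_to_ansi(src, dst, source_encoding="utf-8-sig", ansi_encoding="cp1252"):
    """Re-encode a text file, e.g. from UTF-8 to an ANSI code page."""
    # Work on bytes so that the original line endings are preserved.
    # "utf-8-sig" also strips a leading byte order mark if one is present.
    text = Path(src).read_bytes().decode(source_encoding)
    # Characters without an equivalent in the target code page are written
    # as numeric character references (for example, &#955;), which browsers
    # still render correctly.
    data = text.encode(ansi_encoding, errors="xmlcharrefreplace")
    Path(dst).write_bytes(data)
```

A call such as `convert_to_ansi("page-utf8.html", "page-ansi.html")` (both file names are placeholders) writes the converted copy and leaves the source file untouched. If the page declares its character set in a meta tag, that declaration may need to be adjusted by hand afterwards.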

Footnotes

*1	... or TagStripper, or TagBuster, or ? We haven't decided on a final product name yet and you are welcome to make suggestions.
*2	Strictly speaking, HTML and XML are related but distinct markup languages: classic HTML was defined as an SGML application, and only XHTML is an XML application. Since both mark up text with tags in much the same way, the HTMLStripper handles both and is, in effect, a general tag stripper.