Stoelzel Software Technologie SST
         
         
 
Please note: this application is still a prototype!
The HTMLStripper is still in a very early stage of development. Version 0.3 is a preliminary and very rudimentary implementation, not a fully developed and tested product.
Although it is stable, many of its features are only partly functional, do not work correctly, or have yet to be implemented.
Nonetheless, it can already be used to a certain, very limited extent.
       
HTMLStripper Version 0.3

Details
       
The HTMLStripper*1 is a quick, very recent, spur-of-the-moment hack by our CEO. It is meant to ease two tasks: finding and filtering the words to include in the "keywords" meta tags of our web pages, and creating the indexes of the compiled help for the offline, setup, and user guides that ship with our products.
What's it Good for?
Apart from what it is intended for, it has already proven its use in the very first tests, during which we discovered that the word lists are a real help in detecting typographical errors, something it wasn't developed for. It is therefore highly likely that, even in its present, raw, and incomplete state, it will prove useful not only in proofreading, but also in porting HTML/XML*2 content into other file formats (for example, TeX), in research, in journalism, and in several other things we haven't thought of.
Against our normal policy, we have therefore decided to publish it in its current, preliminary state, without all the planned features and functionality, and to pick up some of the ideas that users (i.e. you) may have.
What it Does and How it Works
The HTMLStripper takes an HTML or XML file as input and strips it of all scripts, comments, and tags. That is, it removes everything except the text that a user normally sees when the file is displayed in a browser. From this text it then generates a list of the words that occur in it, together with the number of occurrences of each word.
A very simple example is the Hyper Text Markup Language (HTML) file whose source code is given below:
<html>
<head>
<title>SST HTMLStripper Example, HTML Source File</title>
<meta name="author" content="Administrator">
<meta name="generator" content="Ulli Meybohms HTML EDITOR">
</head>
<body text="#000000" bgcolor="#FFFFFF" link="#FF0000" alink="#FF0000" vlink="#FF0000">
<font size="+3">Hello Example Stoelzel Software Tech. World !</font>
</body>
</html>
Opened in a web browser, this source code would merely display the text
Hello Example Stoelzel Software Tech. World !,
in very large letters.
From this source code, the HTMLStripper will produce the following plain text output:
"SST HTMLStripper Example, HTML Source File
Hello Example Stoelzel Software Tech. World !" (without the quotation marks).
In other words, the output consists of the HTML document title and the unformatted text that is displayed in a web browser. At the same time, the HTMLStripper generates two lists of the words that occur in the stripped text: an alphabetically sorted plain text list, and a likewise alphabetically sorted comma-separated list that includes the number of occurrences of each word. The number of occurrences in the second list can be interpreted as the rank or relevance of each word. Apart from being used in the HTMLStripper to create a Microsoft Compiled Help .hhk index file or a keyword meta tag, this information can also be saved to various plain text files or simply copied into other applications for further processing.
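The stripping step described above can be sketched in a few lines of Python. This is only an illustration built on the standard library's html.parser module, not the HTMLStripper's actual implementation; the names TextExtractor, strip_html, and EXAMPLE_HTML are hypothetical, and the choice to skip the contents of script and style elements is an assumption.

```python
from html.parser import HTMLParser

EXAMPLE_HTML = """<html>
<head>
<title>SST HTMLStripper Example, HTML Source File</title>
<meta name="author" content="Administrator">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<font size="+3">Hello Example Stoelzel Software Tech. World !</font>
</body>
</html>"""


class TextExtractor(HTMLParser):
    """Collects text content and drops tags, comments, scripts, and styles."""

    SKIPPED = {"script", "style"}  # assumption: their contents are never visible text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments are reported through handle_comment, which is not
        # overridden here, so they are silently discarded.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_html(source: str) -> str:
    """Return the text of an HTML document, one text run per line."""
    parser = TextExtractor()
    parser.feed(source)
    parser.close()
    return "\n".join(parser.chunks)


print(strip_html(EXAMPLE_HTML))
```

For the shortened example document in EXAMPLE_HTML, this prints the document title on the first line and the body text on the second, matching the plain text output quoted above.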
Of course, the above example uses only 11 different words in its entire readable text (i.e. the text that is normally displayed in a web browser), so manually removing any words that are not wanted in a meta tag or index is no problem. In "real" files, however, a text may use more than a thousand different words, and having to remove such words as "and", "my", "the", etc. by hand would be a nuisance, so the HTMLStripper also has an integrated filter. This filter is invoked simply by providing a plain text list of the words to ignore/suppress in the generated word lists. If no such list is provided, the HTMLStripper lists all words; otherwise it lists only those words that do not appear in the provided list.
In the former case (i.e. without a filter), the comma-separated list generated for the above example is:
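The effect of the filter can be illustrated with a short Python sketch. It is not taken from the HTMLStripper itself: the function names are hypothetical, and two details are assumptions, namely that the ignore list is separated by whitespace and that matching ignores upper and lower case.

```python
def parse_ignore_list(text):
    """Split a plain text ignore list into words (assumed whitespace-separated)."""
    return text.split()


def filter_words(counts, ignore_words=None):
    """Return the word counts without the ignored words.

    With no ignore list (or an empty one), every word is kept.
    """
    if not ignore_words:
        return dict(counts)
    ignored = {word.casefold() for word in ignore_words}
    return {word: n for word, n in counts.items() if word.casefold() not in ignored}


counts = {"Hello": 1, "the": 3, "And": 2, "World": 1}
ignore = parse_ignore_list("and\nmy\nthe\n")

print(filter_words(counts))          # no filter: all four words are kept
print(filter_words(counts, ignore))  # "the" and "And" are suppressed
```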
Word, Occurrences
, -1
Example, 2
File, 1
Hello, 1
HTML, 1
HTMLStripper, 1
Software, 1
Source, 1
SST, 1
Stoelzel, 1
Tech, 1
World, 1
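A list of this shape can be reproduced with the following Python sketch. It rests on assumptions that the page does not confirm: a word is taken to be a run of letters, words that differ only in case are counted separately, and sorting ignores case. The function names are hypothetical, and the row with the empty word and the count -1 shown above is not reproduced.

```python
import re
from collections import Counter

STRIPPED_TEXT = (
    "SST HTMLStripper Example, HTML Source File\n"
    "Hello Example Stoelzel Software Tech. World !"
)


def count_words(text):
    """Count every run of letters; digits and punctuation separate words."""
    return Counter(re.findall(r"[^\W\d_]+", text))


def comma_separated_list(counts):
    """Format the counts as 'Word, Occurrences' lines, sorted alphabetically."""
    lines = ["Word, Occurrences"]
    for word in sorted(counts, key=str.casefold):
        lines.append(f"{word}, {counts[word]}")
    return "\n".join(lines)


print(comma_separated_list(count_words(STRIPPED_TEXT)))
```

Sorting with str.casefold is what places "HTML" before "Software" and "SST" before "Stoelzel"; a plain sort would put all upper-case words first.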
Various other HTMLStripper features then make it a simple task to produce a Microsoft Compiled Help .hhk index file, a keyword meta tag, or both from this data. For further information on how this is achieved, please refer to the Preliminary User Guide.
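As a rough illustration of the last step, the following Python sketch turns word counts into a keyword meta tag. It does not describe how the HTMLStripper does this: the function name, the limit parameter, and the idea of ranking by number of occurrences before cutting off the list are assumptions made for the example. The .hhk index file is not covered here.

```python
from html import escape


def keywords_meta_tag(counts, limit=10):
    """Build a keywords meta tag from the most frequent words.

    Words are ranked by descending count; ties are broken alphabetically.
    """
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0].casefold()))
    keywords = ", ".join(word for word, _ in ranked[:limit])
    return f'<meta name="keywords" content="{escape(keywords, quote=True)}">'


print(keywords_meta_tag({"Example": 2, "World": 1, "Hello": 1}, limit=2))
# prints <meta name="keywords" content="Example, Hello">
```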
Known Issues
The current version can be relied on to produce faultless output files only if the source/input file is an ANSI encoded file.
Automatic conversion of the input file from UTF-8 or other character sets to ANSI (or, in future, Unicode) has not been implemented yet. This means that to reliably process an input file that is not encoded in an ANSI character set, it may be necessary to convert it manually into an ANSI encoded file first. This has to be done in a third-party application, for example, Notepad (aka the Windows Editor).
The entire user interface is still one big construction site.
The "Delete" Menu Item and Tool Bar Button will only delete the items in the Distinct Words and Keywords List Views, but not the text in other controls. Nonetheless, any text marked/highlighted as selected can be deleted in all controls by pressing the "Del" key on the keyboard.
The standard Windows keyboard shortcuts Ctrl + C and Ctrl + V cannot be used to copy text from, or paste text into, the HTML Source File Combo Box.
To open the URL specified in the HTML Source File Combo Box in the integrated browser, the Open URL Menu Item in the View Menu or the corresponding button in the Button Tool Bar has to be used.
The integrated browser is not a current-generation (i.e. year 2023) browser. Many pages of websites that do not support older-generation browsers will not be displayed correctly and/or will cause numerous JavaScript errors.
The HTML/XML page that is open in the integrated browser can only be saved by means of the Save As Menu Item or Tool Bar Button.
When the HTML page in the integrated browser is saved to disk, only the HTML source code is saved, without any images, scripts, style sheets, etc. Although this suffices to subsequently process the saved file, the file will obviously not be displayed correctly in a browser.
The application can be closed without any warning/notification that one or more modified files haven't been saved.
The number of references/links in the "Num. Link Targets" column of the Keywords List View is not updated.
The offsets of the logged tag positions aren't correct.
And presumably many more that we haven't found yet or have simply omitted from this list.
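Regarding the encoding issue at the top of this list: the manual conversion can also be scripted. The following Python sketch assumes that "ANSI" means the Windows-1252 code page used on Western European Windows systems; other systems use other ANSI code pages. The function name is hypothetical, and characters that have no Windows-1252 equivalent are replaced by a question mark.

```python
def utf8_to_ansi(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as Windows-1252 (assumed to be the ANSI code page).

    A leading byte order mark is dropped; unencodable characters become '?'.
    """
    text = data.decode("utf-8-sig")
    return text.encode("cp1252", errors="replace")


print(utf8_to_ansi("Stölzel".encode("utf-8")))
```

To convert a file, read it in binary mode, pass the bytes through utf8_to_ansi, and write the result to a new file in binary mode, so that the original file stays untouched.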
Footnotes
*1 ... or TagStripper, or TagBuster, or ? We haven't decided on a final product name yet and you are welcome to make suggestions.
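*2 Strictly speaking, the Hyper Text Markup Language (HTML) is not an implementation of the Extensible Markup Language (XML): classic HTML was defined as an application of SGML, and only XHTML is an XML-based reformulation of it. Both languages mark up content with tags, however, so the HTMLStripper processes either and could equally be called an XML stripper.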




Document/Contents version 1.00
Page/URI last updated on 19.10.2023
 
Copyright © Stoelzel Software Technologie (SST) 2010 - 2017
Suggestions and comments mail to:
webmaster@stoelzelsoftwaretech.com