codemadness.org

       README: expand README - webdump - HTML to plain-text converter for webpages
 (HTM) git clone git://git.codemadness.org/webdump
       ---
 (DIR) commit 1232b5b3d77c458704341ac436ff4230a3077007
 (DIR) parent bff9fbe51c0f5f5ac37a46deca1016bb56834dac
 (HTM) Author: Hiltjo Posthuma <hiltjo@codemadness.org>
       Date:   Sun, 15 Oct 2023 13:47:16 +0200
       
       README: expand README
       
       Describe the scope and trade-offs a bit more clearly, because webdump is quite
       limited.
       
       Diffstat:
         M README                              |      42 ++++++++++++++++++++++++++++---
       
       1 file changed, 39 insertions(+), 3 deletions(-)
       ---
 (DIR) diff --git a/README b/README
       @@ -34,11 +34,17 @@ Example:
                curl -s "$url" | webdump -s 'main' -8 -a -i -l -r -b "$url" | less -R
        
        
       +Yes, all these option flags look ugly; a shell script wrapper could be used :)
       +
       +
        Goals / scope
        -------------
        
       -The tool will only render HTML to stdout, similarly to links -dump or
       -lynx -dump but simpler and more secure.
       +The main goal is to use it for converting HTML mails to plain-text and to
       +convert HTML content in RSS feeds to plain-text.
       +
       +The tool will only convert HTML to stdout, similarly to links -dump or lynx
       +-dump but simpler and more secure.
        
        - HTML and XHTML will be supported.
        - There will be some workarounds and quirks for broken and legacy HTML code.
       @@ -46,8 +52,11 @@ lynx -dump but simpler and more secure.
        - No remote resources which are part of the HTML will be downloaded:
          images, video, audio, etc. But these may be visible as a link reference.
        - Data will be written to stdout. Intended for plain-text or a text terminal.
       -- No support for Javascript, CSS, frame rendering or forms.
       +- No support for Javascript, CSS, frame rendering or form processing.
        - No HTTP or network protocol handling: HTML data is read from stdin.
       +- Listings for references and some options to extract them in a list that is
       +  usable for scripting. Some references are: link anchors, images, audio, video,
       +  HTML (i)frames, etc.
        
        
        Features
       @@ -62,6 +71,33 @@ Features
        - Export link references and resources to a TAB-separated format.
        
        
       +Trade-offs
       +----------
       +
       +All software has trade-offs.
       +
       +webdump processes HTML in a single pass. It does not buffer the full DOM tree,
       +although due to the nature of HTML/XML some parts, like attributes, need to be
       +buffered.
       +
       +Rendering tables in webdump is very limited. Twibright Links has really nice
       +table rendering. However, implementing a similar feature in the current design
       +of webdump would make the code much more complex. Twibright Links processes a
       +full DOM tree and processes the tables in multiple passes (to measure the table
       +cells), etc. Of course, tables can also be nested, or be used in (older web)
       +pages that use HTML tables for layout.
       +
       +These trade-offs and preferences are chosen for now. They may change in the
       +future.  Fortunately there are the usual good suspects for HTML to plain-text
       +conversion (each with their own chosen trade-offs of course):
       +
       +For example:
       +
       +- Twibright Links
       +- lynx
       +- w3m
       +
       +
        Examples
        --------