hngopher.com

       Show HN: I made a tool to convert images of tables to CSV
        
       Author : aperrin
       Score  : 104 points
       Date   : 2021-03-09 19:58 UTC (3 hours ago)
        
       | luplex wrote:
       | This is similar to WebPlotDigitizer, which helps you extract data
       | from graphs:
       | 
       | https://automeris.io/WebPlotDigitizer/index.html
        
         | aperrin wrote:
         | Hi! Thank you for sharing this, it's a great tool I bumped
         | into when searching for an image-to-CSV converter. But it seems
         | to work with graphs only, if I'm not mistaken.
        
           | luplex wrote:
           | Yes, your tool is a welcome addition!
        
       | ohazi wrote:
       | I had been meaning to find or write a tool like this for ages --
       | oftentimes the only place where you can find pinout information
       | for a chip is a table buried on page 7xx of a massive PDF
       | datasheet. Trying to create a symbol for, e.g., a 200+ ball BGA is
       | _awful_.
        
       | aperrin wrote:
       | Hi! I couldn't find a tool like this when I needed it, so I made
       | one as a Python beginner's project. Hope you'll find it useful.
       | :-)
        
       | roussanoff wrote:
       | A similar tool:
       | 
       | https://github.com/eihli/image-table-ocr
        
       | vmchale wrote:
       | That's pretty neat.
        
       | cosmotic wrote:
       | How fast is it? Does it work with rotated images? How about
       | multiple tables per image?
        
         | cosmotic wrote:
         | What about handwriting?
        
         | aperrin wrote:
         | The program runs with Python and Tesseract. It is quite fast
         | (less than one second for a table of 100 numbers), though I
         | never tested it with larger tables. It detects numbers from an
         | image of a table, which is expected to be unrotated and
         | cropped so that only the table is visible. So, in order to
         | process multiple tables per image, one needs to create an
         | image for each table. This program is rather simple, I must say.
         | ;-)
         | 
         | As for handwriting, I think Tesseract can handle the
         | recognition if the writing is clear, but the table still needs
         | to fulfill the assumptions above. Also, the pre-processing can't
         | remove much noise, so that can be a problem too!
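
A rough sketch of how OCR output could end up as CSV, not aperrin's actual code. It assumes the OCR step (Tesseract can report word positions) has already produced tokens as (text, x, y) tuples with pixel coordinates. The function name `tokens_to_csv` and the `row_tolerance` parameter are made up for illustration.

```python
import csv
import io


def tokens_to_csv(tokens, row_tolerance=10):
    """Group (text, x, y) tokens into rows by y position and emit CSV text.

    Hypothetical helper: tokens whose y differs from the first token of the
    current row by at most row_tolerance pixels are put on the same row.
    """
    rows = []  # each entry is [reference_y, [(x, text), ...]]
    for text, x, y in sorted(tokens, key=lambda t: t[2]):
        if rows and abs(y - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for _, cells in rows:
        # Left-to-right order within a row.
        writer.writerow([text for _, text in sorted(cells)])
    return out.getvalue()
```

This only works for a cropped, unrotated table, as described in the comment above, and it does not handle empty cells, since a missing token simply shifts the rest of the row left.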
        
       | technicolorwhat wrote:
       | Is there also a solution for automatic border detection? Last
       | year I tried reading bank statements, which were scanned slips.
       | Unfortunately they didn't have any borders, which made it super
       | difficult to extract content. Would be cool if someone could make
       | something for this :) I thought it would be easy, but I racked my
       | brain over it for several days until I gave up.
        
         | boogies wrote:
         | https://github.com/eihli/image-table-ocr seems to automatically
         | find tables within larger images, IDK if it works without
         | borders though.
        
           | eihli wrote:
           | The logic for detecting a table is to get rid of everything
           | but vertical lines over a certain length, save that in one
           | image, then get rid of everything but horizontal lines of a
           | certain length, save that image. Then overlay the two and
           | take the bounding rectangle. So you don't need the table to
           | have a border as long as you have vertical and horizontal
           | lines and they extend far enough to encompass all the data
           | you need.
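
The logic eihli describes can be sketched in plain Python on a binary pixel grid (1 = ink, 0 = background). This is only an illustration under that assumption; the linked project may implement it differently, for example with image-processing library calls. The names `long_runs_mask` and `table_bounding_box` are hypothetical.

```python
def long_runs_mask(grid, min_len, horizontal=True):
    """Keep only ink pixels that belong to a straight run of >= min_len."""
    rows, cols = len(grid), len(grid[0])
    mask = [[0] * cols for _ in range(rows)]
    outer, inner = (rows, cols) if horizontal else (cols, rows)
    for i in range(outer):
        start = None
        # Iterate one step past the end so a run touching the edge is closed.
        for j in range(inner + 1):
            ink = j < inner and (grid[i][j] if horizontal else grid[j][i])
            if ink and start is None:
                start = j
            elif not ink and start is not None:
                if j - start >= min_len:
                    for k in range(start, j):
                        if horizontal:
                            mask[i][k] = 1
                        else:
                            mask[k][i] = 1
                start = None
    return mask


def table_bounding_box(grid, min_len):
    """Overlay long horizontal and vertical runs, return their bounding box.

    Returns (top, left, bottom, right) with inclusive indices, or None if
    no run is long enough.
    """
    h = long_runs_mask(grid, min_len, horizontal=True)
    v = long_runs_mask(grid, min_len, horizontal=False)
    points = [
        (r, c)
        for r in range(len(grid))
        for c in range(len(grid[0]))
        if h[r][c] or v[r][c]
    ]
    if not points:
        return None
    rs = [r for r, _ in points]
    cs = [c for _, c in points]
    return min(rs), min(cs), max(rs), max(cs)
```

Short strokes such as text or specks never reach `min_len`, so they drop out before the overlay. That is why ruled lines matter here, and why a table with no lines at all (like the bank slips mentioned above) would return None.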
        
         | adflux wrote:
         | Azure FormRecognizer API
        
       | spudwaffle wrote:
       | It would be cool if you could add a license to this!
        
         | aperrin wrote:
         | Done, thank you for the tip! ;-)
        
       | leeoniya wrote:
       | also https://github.com/tabulapdf/tabula-java
        
       | arathore wrote:
       | Great project! I've had success using camelot-py
       | (https://camelot-py.readthedocs.io) to extract tabular data from
       | PDFs (for images, I use imagemagick to convert those to PDF). If
       | your table has borders, the default method (lattice) works quite
       | well. For non-bordered tables there is the 'stream' option, but
       | it usually requires a bit more preprocessing to get usable
       | results.
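
For anyone who wants to try the camelot-py route, a minimal sketch follows. It assumes camelot-py is installed and the image has already been converted to a PDF. The helper names `choose_flavor` and `pdf_tables_to_csv` and the output file naming are made up for illustration; only `camelot.read_pdf` with its `pages` and `flavor` arguments comes from the library.

```python
def choose_flavor(has_borders):
    """Pick the camelot parsing method: 'lattice' for ruled tables,
    'stream' for tables without borders."""
    return "lattice" if has_borders else "stream"


def pdf_tables_to_csv(pdf_path, has_borders=True, pages="1"):
    """Extract every table camelot finds and write each one to a CSV file.

    Hypothetical wrapper; returns the list of CSV paths written.
    """
    import camelot  # third-party (camelot-py), imported lazily

    tables = camelot.read_pdf(
        pdf_path, pages=pages, flavor=choose_flavor(has_borders)
    )
    written = []
    for i, table in enumerate(tables):
        out_path = f"{pdf_path}.table{i}.csv"
        # Each extracted table exposes its cells as a pandas DataFrame.
        table.df.to_csv(out_path, index=False, header=False)
        written.append(out_path)
    return written
```

As noted in the comment above, expect the 'stream' results to need more cleanup than 'lattice' ones.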
        
       ___________________________________________________________________
       (page generated 2021-03-09 23:00 UTC)