[HN Gopher] Show HN: I made a tool to convert images of tables t...
___________________________________________________________________
Show HN: I made a tool to convert images of tables to CSV

Author : aperrin
Score  : 104 points
Date   : 2021-03-09 19:58 UTC (3 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| luplex wrote:
| This is similar to WebPlotDigitizer, which helps you extract data
| from graphs:
|
| https://automeris.io/WebPlotDigitizer/index.html

| aperrin wrote:
| Hi! Thank you for sharing this, it's a great tool I bumped
| into when searching for an image to CSV converter. But it seems
| to work with graphs only, if I'm not mistaken.

| luplex wrote:
| Yes, your tool is a welcome addition!

| ohazi wrote:
| I had been meaning to find or write a tool like this for ages --
| oftentimes the only place where you can find pinout information
| for a chip is a table buried on page 7xx of a massive PDF
| datasheet. Trying to create a symbol for, e.g., a 200+ ball BGA is
| _awful_.

| aperrin wrote:
| Hi! I couldn't find a tool like that when I needed it, so I made
| this as a Python beginner's project. Hope you'll find it useful.
| :-)

| roussanoff wrote:
| A similar tool:
|
| https://github.com/eihli/image-table-ocr

| vmchale wrote:
| That's pretty neat.

| cosmotic wrote:
| How fast is it? Does it work with rotated images? How about
| multiple tables per image?

| cosmotic wrote:
| What about handwriting?

| aperrin wrote:
| The program runs with Python and Tesseract. It is quite fast
| (less than one second for a table of 100 numbers), though I
| never tested it with larger tables. It detects numbers from an
| image of a table, which is expected to be unrotated and cropped
| so that only the table is visible. So, in order to process
| multiple tables per image, one needs to create an image for
| each table. This program is rather simple I must say.
| ;-)
|
| As for the handwriting, I think Tesseract can handle the
| recognition if the writing is good, but the table needs to
| fulfill the expected assumptions. Also, the pre-processing can't
| get rid of a lot of noise, so that can be a problem too!

| technicolorwhat wrote:
| Is there also a solution for automatic border detection? Last
| year I tried reading bank statements, which were scanned slips.
| Unfortunately they didn't have any borders, which made it super
| difficult to extract content. Would be cool if someone could make
| something for this :) I thought it would be easy, but I racked my
| brain over it for several days until I gave up.

| [deleted]

| boogies wrote:
| https://github.com/eihli/image-table-ocr seems to automatically
| find tables within larger images, IDK if it works without
| borders though.

| eihli wrote:
| The logic for detecting a table is to get rid of everything
| but vertical lines over a certain length, save that in one
| image, then get rid of everything but horizontal lines over a
| certain length, save that image. Then overlay the two and
| take the bounding rectangle. So you don't need the table to
| have a border as long as you have vertical and horizontal
| lines and they extend far enough to encompass all the data
| you need.

| adflux wrote:
| Azure FormRecognizer API

| spudwaffle wrote:
| It would be cool if you could put a license on this!

| aperrin wrote:
| Done, thank you for the tip! ;-)

| leeoniya wrote:
| also https://github.com/tabulapdf/tabula-java

| arathore wrote:
| Great project! I've had success using camelot-py
| (https://camelot-py.readthedocs.io) to extract tabular data from
| PDFs (for images, I use imagemagick to convert those to PDF). If
| your table has borders, the default method (lattice) works quite
| well. For non-bordered tables there is the 'stream' option, but
| it usually requires a bit more preprocessing to get usable
| results.
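
The table-detection logic eihli describes in the thread (keep only long vertical lines, keep only long horizontal lines, overlay the two, take the bounding rectangle) can be sketched in a few lines. This is only an illustration under stated assumptions, not the code from image-table-ocr: it works on a tiny binary grid in pure Python rather than on a real scanned image, and the names `keep_long_runs`, `table_bounding_box`, `demo_image` and the `min_len` threshold are hypothetical.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]  # 1 = dark (ink) pixel, 0 = background


def keep_long_runs(rows: Grid, min_len: int) -> Grid:
    """Keep only horizontal runs of 1s that are at least min_len pixels long."""
    out = [[0] * len(row) for row in rows]
    for y, row in enumerate(rows):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                if x - start >= min_len:
                    for i in range(start, x):
                        out[y][i] = 1
            else:
                x += 1
    return out


def transpose(grid: Grid) -> Grid:
    """Swap rows and columns so vertical runs can be handled as horizontal ones."""
    return [list(col) for col in zip(*grid)]


def table_bounding_box(image: Grid, min_len: int) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) of the overlaid long lines, or None."""
    horizontal = keep_long_runs(image, min_len)
    vertical = transpose(keep_long_runs(transpose(image), min_len))
    xs, ys = [], []
    for y in range(len(image)):
        for x in range(len(image[y])):
            # Overlay: a pixel counts if it survived in either line image.
            if horizontal[y][x] or vertical[y][x]:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


def demo_image() -> Grid:
    """A made-up 12x10 image: a ruled grid plus some short, text-like marks."""
    height, width = 10, 12
    img = [[0] * width for _ in range(height)]
    for y in (2, 5, 7):          # horizontal rules
        for x in range(2, 10):
            img[y][x] = 1
    for x in (2, 6, 9):          # vertical rules
        for y in range(2, 8):
            img[y][x] = 1
    for x in range(0, 3):        # short mark outside the table, too short to keep
        img[0][x] = 1
    img[3][4] = 1                # a speck of "text" inside a cell
    return img


print(table_bounding_box(demo_image(), min_len=5))  # (2, 2, 9, 7)
```

The short marks are dropped because they never form a run of `min_len` pixels in either direction, so only the ruled lines contribute to the rectangle. On real scans the same idea would more likely be done with an image library's morphological operations, and `min_len` would need tuning to the image resolution.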

___________________________________________________________________
(page generated 2021-03-09 23:00 UTC)