[HN Gopher] TaBERT: A new model for understanding queries over t...
       ___________________________________________________________________
        
       TaBERT: A new model for understanding queries over tabular data
        
       Author : speculator
       Score  : 89 points
       Date   : 2020-07-03 17:31 UTC (5 hours ago)
        
 (HTM) web link (ai.facebook.com)
 (TXT) w3m dump (ai.facebook.com)
        
       | abhgh wrote:
       | Does anyone know how it relates/compares to Google's TaPaS? [1] I
       | notice this paper doesn't refer to it.
       | 
       | [1] https://ai.googleblog.com/2020/04/using-neural-networks-
       | to-f...
        
       | neeeeees wrote:
       | Seems similar to this work out of Salesforce a few years ago:
       | https://www.salesforce.com/blog/2017/08/salesforce-research-...
        
       | philprx wrote:
       | Git repo or it doesn't exist ;-)
       | 
       | Seriously, if this is not available, what are the alternatives?
       | 
        | I've seen some NLP + storage projects in the past, but I
        | don't recall their names. (Even if only remotely connected,
        | there was something to convert PDFs into machine-readable
        | data.)
       | 
       | Is this AwesomeNLP https://github.com/keon/awesome-nlp a good
       | starting point there?
        
         | jhj wrote:
          | It's in the paper:
         | 
         | https://github.com/facebookresearch/tabert
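          | 
          | From the README, usage looks roughly like this (I'm going
          | from memory, so treat the exact names and signatures as
          | approximate rather than as the definitive API):
          | 
          |     from table_bert import Table, Column, TableBertModel
          | 
          |     # Placeholder path: pretrained weights aren't
          |     # distributed with the repo, so bring your own.
          |     model = TableBertModel.from_pretrained(
          |         'path/to/pretrained/model/checkpoint.bin')
          | 
          |     # A table is typed columns plus rows of values.
          |     table = Table(
          |         id='List of countries by GDP',
          |         header=[
          |             Column('Nation', 'text',
          |                    sample_value='United States'),
          |             Column('GDP', 'real',
          |                    sample_value='21,439,453'),
          |         ],
          |         data=[['United States', '21,439,453'],
          |               ['China', '14,342,903']],
          |     ).tokenize(model.tokenizer)
          | 
          |     context = 'show me countries ranked by GDP'
          | 
          |     # Jointly encode the utterance and the table.
          |     context_encoding, column_encoding, info = model.encode(
          |         contexts=[model.tokenizer.tokenize(context)],
          |         tables=[table])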
        
           | IfOnlyYouKnew wrote:
           | No pretrained models, though, unfortunately.
        
       | j4ah4n wrote:
        | Does the following mean that one can map/train to runtimes
        | that give proper results based on the underlying data?
       | 
       | "A representative example is semantic parsing over databases,
       | where a natural language question (e.g., "Which country has the
       | highest GDP?") is mapped to a program executable over database
       | (DB) tables."
       | 
       | Could it be thought of in the same fashion as Resolvers in
       | GraphQL integrated into BERT?
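        | 
        | Concretely, I picture the mapping as something like this toy
        | example (the table, values, and SQL here are my own
        | invention, not from the paper):
        | 
        |     import sqlite3
        | 
        |     # Hypothetical toy table; names and numbers are invented.
        |     conn = sqlite3.connect(':memory:')
        |     conn.execute('CREATE TABLE countries (name TEXT, gdp REAL)')
        |     conn.executemany('INSERT INTO countries VALUES (?, ?)',
        |                      [('United States', 21.43e12),
        |                       ('China', 14.34e12),
        |                       ('Japan', 5.08e12)])
        | 
        |     # The semantic parser's job: question in, program out.
        |     question = 'Which country has the highest GDP?'
        |     predicted_sql = ('SELECT name FROM countries '
        |                      'ORDER BY gdp DESC LIMIT 1')
        | 
        |     # Executing the predicted program yields the answer.
        |     print(conn.execute(predicted_sql).fetchone()[0])
        |     # -> United States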
        
       | ianhorn wrote:
       | I'd like to see something that could do this, handling the
       | awfulness of real world tabular data. "What country has the
       | highest GDP? Okay, which table has GDP? Is it the country_gdp
       | table? No, that's an old one that hasn't been written to in 3
        | years. Ah, here it is, but you need to join against
        | `geopolitics`, but first dedup the Crimea data, since it's
        | showing up in two places and we can't remember why it got
        | written to twice there. Also, you need to exclude June 21
        | because we had an outage on the Brazil data that day. What do
        | you mean some of the country_id rows are NULL?" And so on. I
        | dream that someday there's a
       | solution for that. That's a looooong ways away, I'd bet.
        
         | mlthoughts2018 wrote:
          | This could be seriously addressed with a configurable rule
          | system similar to email filters + search. You'd store any
          | metadata factors you want to consider in a companion index
          | that allows complex filtering or decision tree splits;
          | then, by introspecting SQL-like data sources, you can
          | follow key and type relationships to determine what's
          | joinable.
         | 
          | Perhaps it could output several potential answers at the
          | end, each explaining the "pathway" it chose to use (filters
          | / decision tree splits + the graphical path through keys /
          | joinable types in the underlying data), and allow the user
          | to select one or more results that they believe are valid
          | pathways of criteria, or tweak individual filters and joins
          | in the listed pathway for a given result.
          | 
          | I think this would offer a lot more value than trying to
          | get a full natural language interface that "just works" on
          | complex filtering conditions, where getting just one answer
          | back (instead of seeing the variety of pathways the system
          | could choose and what influence each step has on the end
          | result) leaves too many cases where the ML system fails
          | with unrealistic results.
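          | 
          | A rough sketch of the shape I have in mind (every table,
          | column, and rule here is invented for illustration):
          | 
          |     from dataclasses import dataclass
          | 
          |     @dataclass
          |     class TableMeta:
          |         name: str
          |         columns: dict      # column name -> type
          |         last_written: str  # ISO date, for freshness rules
          | 
          |     # Toy companion metadata index over the warehouse.
          |     CATALOG = [
          |         TableMeta('country_gdp',
          |                   {'country_id': 'int', 'gdp': 'float'},
          |                   last_written='2017-05-01'),
          |         TableMeta('geopolitics',
          |                   {'country_id': 'int', 'gdp': 'float',
          |                    'region': 'str'},
          |                   last_written='2020-07-01'),
          |     ]
          | 
          |     # A rule, like an email filter: drop stale tables.
          |     def is_fresh(t, cutoff='2019-01-01'):
          |         return t.last_written >= cutoff
          | 
          |     # Joinability from key name + type relationships.
          |     def joinable(a, b):
          |         return any(c in b.columns
          |                    and a.columns[c] == b.columns[c]
          |                    for c in a.columns)
          | 
          |     # Emit each candidate "pathway" for the user to vet.
          |     for t in CATALOG:
          |         if is_fresh(t) and 'gdp' in t.columns:
          |             partners = [o.name for o in CATALOG
          |                         if o is not t and joinable(t, o)]
          |             print(f'pathway: {t.name}, join via {partners}')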
        
         | lacker wrote:
          | The tough thing is that a common failure mode of many
          | modern AI solutions is output that looks superficially
          | correct but doesn't actually map correctly onto the real
          | world. When you want a table of data, the danger seems high
          | that the table will look correct but won't actually be
          | accurate. The problem here is about keeping sloppy data out
          | of your table, which is tough for a statistical AI.
         | 
         | So yeah, I would expect this to be a long ways away.
        
           | ianhorn wrote:
           | For more interesting problems than "which country has the
           | highest GDP?", it's about more than just sloppy data. If you
           | want to include any covariates, how do you know which ones to
           | include? You could try to include everything predictive, but
           | then you'll use the client margin column to predict client
           | revenue or something. Or you'll control for a column causally
           | downstream, biasing your estimates, like estimating revenue
           | differences and controlling for page views in an experiment
           | that affects page views. There's so much that we just don't
           | include in our databases that's crucial to using them, and
           | it's not just about sloppiness.
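            | 
            | To make the page-views example concrete, here's a toy
            | simulation (all numbers invented): the treatment only
            | affects revenue through page views, so "controlling" for
            | page views makes a real effect vanish.
            | 
            |     import numpy as np
            | 
            |     rng = np.random.default_rng(0)
            |     n = 100_000
            | 
            |     # Treatment adds ~5 page views; each view adds ~3
            |     # revenue, so the true total effect is ~15.
            |     treated = rng.integers(0, 2, n)
            |     page_views = 10 + 5 * treated + rng.normal(0, 2, n)
            |     revenue = 3 * page_views + rng.normal(0, 5, n)
            | 
            |     # Naive difference in means recovers ~15.
            |     naive = (revenue[treated == 1].mean()
            |              - revenue[treated == 0].mean())
            | 
            |     # Regressing on [1, treated, page_views] "controls"
            |     # for the mediator: the treatment coefficient
            |     # collapses to ~0, even though the effect is real.
            |     X = np.column_stack([np.ones(n), treated, page_views])
            |     beta = np.linalg.lstsq(X, revenue, rcond=None)[0]
            | 
            |     print(f'naive: {naive:.2f}, controlled: {beta[1]:.2f}')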
        
           | djohnston wrote:
            | tbf that problem is also quite tough for data scientists;
            | the model doesn't need to be flawless, just better.
        
         | mmsimanga wrote:
          | Metabase[0] lets you build the question step by step: you
          | pick the table, pick the filter, pick the aggregation, and
          | so on. I have been working in BI long enough to know that
          | even this isn't going to answer all questions. It is
          | cleaning data, filtering out stuff that shouldn't be in the
          | data set but somehow is, and other issues like that which
          | make it difficult to automate. There is usually a bunch of
          | things that aren't documented. I will be watching this with
          | interest.
         | 
         | [0]https://www.metabase.com/
        
           | riku_iki wrote:
            | The problem TaBERT can supposedly solve is the speed of
            | this process: you can have hundreds of tables, each with
            | dozens of columns and many unobvious relationships, and
            | constructing a query manually can take a lot of effort
            | even with UI automation. The idea is that you have a
            | system which creates the query instantly from a short
            | description.
        
         | vosper wrote:
         | This was my experience trying to work with the Johns Hopkins
         | COVID data.
         | 
         | I don't know if Johns Hopkins became the canonical data source
         | because they were amongst the first to have public data and
         | charts, but honestly I was kinda surprised at the low quality,
         | coming from a group called "Center for Systems Science and
         | Engineering". Their data was far harder to use than it needed
         | to be, even months into the pandemic.
         | 
        | Fortunately there were a handful of other projects dedicated
        | to making it sane and resolving the inconsistencies,
        | unreconciled changes in format, etc. That was really helpful.
        
           | chmullig wrote:
            | Which projects do you think have particularly
            | high-quality, easy-to-work-with data? I was using the
            | Johns Hopkins data, but it's so messy...
        
       | runawaybottle wrote:
       | I thought Google already did something similar?
       | 
       | Are we entering deep copycat culture?
        
       ___________________________________________________________________
       (page generated 2020-07-03 23:00 UTC)