[HN Gopher] TaBERT: A new model for understanding queries over t...
___________________________________________________________________

TaBERT: A new model for understanding queries over tabular data

Author : speculator
Score  : 89 points
Date   : 2020-07-03 17:31 UTC (5 hours ago)

(HTM) web link (ai.facebook.com)
(TXT) w3m dump (ai.facebook.com)

| abhgh wrote:
| Does anyone know how it relates/compares to Google's TaPaS? [1]
| I notice this paper doesn't refer to it.
|
| [1] https://ai.googleblog.com/2020/04/using-neural-networks-
| to-f...
|
| neeeeees wrote:
| Seems similar to this work out of Salesforce a few years ago:
| https://www.salesforce.com/blog/2017/08/salesforce-research-...
|
| philprx wrote:
| Git repo or it doesn't exist ;-)
|
| Seriously, if this is not available, what are the alternatives?
|
| I've seen some NLP + storage projects in the past, but I don't
| recall them. (Even remotely connected, there was something to
| convert PDFs into machine-readable data.)
|
| Is this AwesomeNLP https://github.com/keon/awesome-nlp a good
| starting point there?
|
| jhj wrote:
| It's in the paper:
|
| https://github.com/facebookresearch/tabert
|
| IfOnlyYouKnew wrote:
| No pretrained models, though, unfortunately.
|
| j4ah4n wrote:
| Does the following mean that one can map/train to runtimes that
| give proper results based on the underlying data _results_?
|
| "A representative example is semantic parsing over databases,
| where a natural language question (e.g., "Which country has the
| highest GDP?") is mapped to a program executable over database
| (DB) tables."
|
| Could it be thought of in the same fashion as Resolvers in
| GraphQL integrated into BERT?
|
| ianhorn wrote:
| I'd like to see something that could do this while handling the
| awfulness of real-world tabular data. "What country has the
| highest GDP? Okay, which table has GDP? Is it the country_gdp
| table? No, that's an old one that hasn't been written to in 3
| years.
| Ah here it is, but you need to join against `geopolitics`,
| but first dedup the Crimea data, since it's showing up in two
| places and we can't remember why it got written twice there.
| Also, you need to exclude June 21 because we had an outage on
| the Brazil data that day. What do you mean some of the
| country_id rows are NULL?" And so on. I dream that someday
| there's a solution for that. That's a looooong ways away, I'd
| bet.
|
| mlthoughts2018 wrote:
| This could be seriously addressed with a configurable rule
| system similar to email filters + search. You have to store any
| of the metadata factors you want to consider in a companion
| index that allows complex filtering or decision-tree splits;
| then, for introspection of SQL-like data sources, you can
| follow key and type relationships to determine what's joinable.
|
| Perhaps output several potential answers at the end, each
| explaining the "pathway" it chose to use (filters / decision-
| tree splits + the graphical path through keys / joinable types
| in the underlying data), and allow the user to select one or
| more results that they believe are valid pathways of criteria,
| or perhaps tweak individual filters and joins in the listed
| pathway for a given result.
|
| I think this would offer a lot more value than trying to get a
| full natural-language interface that "just works" on complex
| filtering conditions, where getting just one answer back
| (instead of seeing the variety of pathways the system could
| choose and what influence each step has on the end result)
| entails too many cases where the ML system fails with
| unrealistic results.
|
| lacker wrote:
| The tough thing is that a common failure mode of many modern
| AI solutions is output that looks superficially correct but
| doesn't actually map correctly to the real world. When you
| want a table of data, the danger seems high that the table
| looks correct but isn't actually accurate.
| The problem here is about keeping sloppy data out of your
| table, which is tough for a statistical AI.
|
| So yeah, I would expect this to be a long ways away.
|
| ianhorn wrote:
| For more interesting problems than "which country has the
| highest GDP?", it's about more than just sloppy data. If you
| want to include any covariates, how do you know which ones to
| include? You could try to include everything predictive, but
| then you'll use the client margin column to predict client
| revenue or something. Or you'll control for a column causally
| downstream, biasing your estimates, like estimating revenue
| differences while controlling for page views in an experiment
| that affects page views. There's so much that we just don't
| include in our databases that's crucial to using them, and
| it's not just about sloppiness.
|
| djohnston wrote:
| tbf that problem is also quite tough for data scientists; the
| model doesn't need to be flawless, just better.
|
| mmsimanga wrote:
| Metabase[0] lets you ask the question step by step. You pick
| the table, pick the filter, pick the aggregation and so on. I
| have been working in BI long enough to know that even this
| isn't going to answer all questions. It is cleaning data,
| filtering out stuff that shouldn't be in the data set but
| somehow is, and other issues like that which make it difficult
| to automate. There is usually a bunch of things that aren't
| documented. I will be watching this with interest.
|
| [0] https://www.metabase.com/
|
| riku_iki wrote:
| The problem which TaBERT can supposedly solve is the speed of
| this process: you can have hundreds of tables, with dozens of
| columns and many unobvious relationships, and constructing a
| query manually can take lots of effort even using UI
| automation. The idea is that you have a system which creates
| the query instantly from a short description.
|
| vosper wrote:
| This was my experience trying to work with the Johns Hopkins
| COVID data.
|
| I don't know if Johns Hopkins became the canonical data source
| because they were amongst the first to have public data and
| charts, but honestly I was kinda surprised at the low quality,
| coming from a group called "Center for Systems Science and
| Engineering". Their data was far harder to use than it needed
| to be, even months into the pandemic.
|
| Fortunately there were a handful of other projects dedicated
| to making it sane and resolving the inconsistencies,
| unreconciled changes in format, etc. That was really helpful.
|
| chmullig wrote:
| Which projects do you think have particularly high-quality,
| easy-to-work-with data? I was using that but it's so messy...
|
| runawaybottle wrote:
| I thought Google already did something similar?
|
| Are we entering deep copycat culture?
___________________________________________________________________
(page generated 2020-07-03 23:00 UTC)