[HN Gopher] Fugue: A unified interface for distributed computing
___________________________________________________________________

Fugue: A unified interface for distributed computing

Author : duck
Score  : 64 points
Date   : 2023-03-27 05:46 UTC (1 day ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| antman wrote:
| How does this differ from Ibis, which is also one of its
| included backends? Ibis also has other backends, e.g. DuckDB, so
| what is the constraint on accessing DuckDB through a Python
| interface, as in: fugue->ibis->duckdb?
|
| Seems very interesting, but I can't tell what its scope is, or
| its comparative advantages over Ibis or dbt or others.

  | kvnkho wrote:
  | Hi antman, thanks for the question. I will list some
  | differences, but will answer the first question first. The
  | Fugue -> Ibis -> DuckDB example is a bit weird. Yes, it can be
  | done, but it's not practical (as you can tell). There may be
  | some overlap at times, but I do think the projects differ in
  | scope (more below).
  |
  | The Ibis integration is more about accessing data that already
  | lives in various data stores. For example, we also use it under
  | the hood for our recently released BigQuery integration:
  | https://fugue-tutorials.readthedocs.io/tutorials/integration...
  |
  | On to differences:
  |
  | 1. We guarantee consistency between backends. NULL handling can
  | differ depending on the backend. For example, Pandas joins NULL
  | with NULL while Spark doesn't. So if you prototype locally on
  | Pandas and then scale to Spark, we guarantee the same results.
  | Fugue is 100% unit tested and the backends go through the same
  | test suite.
  |
  | 2. Ibis is Pythonic for SQL backends. We embrace SQL, but
  | understand its limitations. FugueSQL is an enhanced SQL dialect
  | that can invoke Python code. FugueSQL can be the first-class
  | grammar instead of being sandwiched by Python code. Fugue's
  | Python API and SQL API are 1:1 in capability.
  |
  | 3.
  | Opinionated here, but we don't want users to learn any new
  | language. Ibis is a new way to express things; we just want to
  | extend the capabilities of what people already know (SQL,
  | native Python, and Pandas). Fugue can also be adopted
  | incrementally, meaning it can be used for just one portion of
  | your workflow.
  |
  | 4. Roadmap-wise, we think the optimal solutions will be a mix
  | of different tools. A clear one is pre-aggregating data with
  | DuckDB, and then using Pandas for further processing.
  | Similarly, can we preprocess in Snowflake and do machine
  | learning in Spark? Fugue is working on connecting these
  | different systems to enable cross-platform workloads.
  |
  | There may be more information for you here:
  | https://fugue-tutorials.readthedocs.io/tutorials/integration...

| cmarschner wrote:
| Fugue is really cool as it makes a complicated thing simple. The
| learning curve is a few minutes (provided that you have an
| installation that works, of course).

| crabbone wrote:
| I want to cry... why distributed programming _in Python_?
|
| I mean, I understand that a lot of people want Pandas to run
| faster, and distributing computation would help... but come on!
| There needs to be a line in the sand that a sane person would not
| cross. It doesn't matter how much effort you put into making
| Python distributed, it's not going to work unless you have the
| power to control the language itself.
|
| You know, I just thought of a metaphor for this. Maybe not 100%,
| but you'll get the idea. Suppose you live in a small country,
| with a population close to 10M. And you want to buy pants. Well,
| you want to buy them from an offline store, so you don't get all
| the choices you have by shopping online. You don't care about
| the price; you can afford to buy one pair of pants for the price
| of ten. You just want a specific kind. Let's say you want JNCO
| jeans.
|
| Tough luck.
| You go to every store that sells pants, but JNCO is just too
| niche for retailers in a small country to bother importing. So
| every store that wants to make a profit buys the exact same
| model of jeans that was advertised in last year's fashion
| catalogue. You can pay a little more or a little less for better
| or worse material, but all the pants on offer have the same
| trashy design. You just cannot stand them.
|
| This is what's happening with Python. It's trash. But it's the
| model from last year's fashion catalogue. So any program you
| write, any problem you solve must be in Python, or else you lose
| the competition before you even enter the contest.

| [deleted]

| whinvik wrote:
| This looks great, but as someone who has been doing lots of Spark
| lately, I feel like this will further worsen the development
| process.
|
| The problem with Spark is that there's a lot of magic that
| happens underneath, and being able to debug and figure out issues
| is pretty hairy. If we put this on top, then there are 2 levels
| of magic that we will have to go through to figure out what is
| going on.
|
| I wish someone would work on this aspect.

  | kvnkho wrote:
  | Hi whinvik, we agree that development in Spark is hard, and
  | that is part of the motivation for Fugue. Spark code couples
  | the distributed orchestration and business logic together.
  |
  | By keeping your code in native Python or Pandas, it will be
  | much easier to develop, debug, and maintain the business logic
  | because your tracebacks will be in native Python. Fugue then
  | takes it to Spark when you are ready to scale.

    | whinvik wrote:
    | I appreciate your response, but that is not what I was
    | getting at. I understand that with this you only have to
    | write Pandas and then not worry about scaling.
    |
    | First, I think PySpark syntax is much better than the
    | insanity that is Pandas, but if you really like Pandas then
    | you can always use Pandas UDFs, which Spark supports.
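To make the pattern under discussion concrete, here is a minimal sketch; it is not taken from the thread or from Fugue's documentation. It shows business logic written as a plain pandas function that can be checked locally. The function name `add_share_of_total`, the column names, and the sample data are all hypothetical. An engine (for example a Spark pandas UDF via `applyInPandas`, or Fugue) would then be responsible for calling such a function on each partition; that step is assumed and not shown here.

```python
import pandas as pd


def add_share_of_total(df: pd.DataFrame) -> pd.DataFrame:
    """Business logic only: each row's share of the frame's total amount.

    Hypothetical example function. It knows nothing about Spark or any
    other engine, so a failure here produces an ordinary pandas traceback.
    """
    out = df.copy()
    out["share"] = out["amount"] / out["amount"].sum()
    return out


# Local check: emulate "apply per partition" with a plain pandas groupby.
sample = pd.DataFrame(
    {"store": ["a", "a", "b"], "amount": [1.0, 3.0, 2.0]}
)
result = pd.concat(
    [add_share_of_total(part) for _, part in sample.groupby("store")],
    ignore_index=True,
)
print(result)
```

Nothing in this sketch addresses the tuning questions raised in the thread (join strategy, spills, memory); it only illustrates where the business logic would live.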
    |
    | But let's say that writing only in Pandas is the preferred
    | way. Now comes the magic part. How do I know that it is
    | using the best join? Will it optimize for spills? Will there
    | be OOMs? These are the things we need to worry about, and
    | they often force us to go deep inside Spark magic.
    |
    | Now if there's another level of magic, which is Pandas to
    | Spark transpiling as I imagine you do here, then I have even
    | less of an idea how to tune it.
    |
    | Again, I appreciate you are solving a specific problem in a
    | nice way, but I feel like we are actually making the problem
    | even more complicated.

| AndrewKemendo wrote:
| This looks really well architected, and your documentation is
| really easy to read; it's structured logically and simply, which
| is refreshing.
|
| It also looks pretty powerful, though I admit I haven't used it
| yet and haven't used Dask or Ray, but it's making me go look into
| popping Spark back up and learning this.
|
| Kudos on what looks to be a really well built product.

  | kvnkho wrote:
  | Hi AndrewKemendo, Fugue co-author here. Thanks for the kind
  | words! We do put a lot of effort into our documentation. Always
  | happy to chat about potential use cases if you want. My contact
  | info is in my profile.

| dscape wrote:
| Me and the Decipad team have been following Fugue for a while -
| would love a chat

  | kvnkho wrote:
  | Yeah let's chat! My contact info is in my profile.
___________________________________________________________________
(page generated 2023-03-28 23:01 UTC)