[HN Gopher] Big data is dead
       ___________________________________________________________________
        
       Big data is dead
        
       Author : munchor
       Score  : 554 points
       Date   : 2023-02-07 16:34 UTC (6 hours ago)
        
 (HTM) web link (motherduck.com)
 (TXT) w3m dump (motherduck.com)
        
       | glogla wrote:
        | I agree with a lot of the sentiments of the MotherDuck people,
        | but boy are they loud and proud for someone who never delivered
        | anything more than blog posts and a vague promise to somehow
        | exploit the MIT-licensed DuckDB.
        | 
        | Meanwhile, boilingdata.com, for example, seems to have already
        | done that - by using AWS Lambda + DuckDB as a distributed
        | compute engine, which I can't decide is awesome, deranged, or
        | both.
        
         | mytherin wrote:
         | We (the DuckDB team) are very happy working together with
         | MotherDuck in a close partnership [1].
         | 
         | [1] https://duckdblabs.com/news/2022/11/15/motherduck-
         | partnershi...
        
       | singularity2001 wrote:
       | Big Data lives on in LLMs.
        
       | KaiserPro wrote:
        | big data isn't big anymore.
        | 
        | 1) 10 years ago, having access to 300 TB of data that could
        | sustain 10 GB/s of throughput would require something like
        | two racks of disks with some SSD cache and junk.
        | 
        | 2) people thought hadoop was a good idea
        | 
        | 3) people assumed that everything could be solved with
        | map-reduce
        | 
        | 4) machine learning was much less of a thing.
        | 
        | 5) people realised that postgres does virtually everything that
        | mongo claimed it could.
        | 
        | 6) people realised that cassandra was a very expensive way to
        | make a write-only database.
        | 
        | I gave a talk about using big data, and basically at the time
        | the best definition I could come up with was "anything that's
        | too big to reasonably fit in one computer - so think four
        | 60-disk direct-attached SAS boxes".
        | 
        | Most of the time people were chasing the stuff for the CV,
        | rather than actually stopping to think if it was a good idea
        | (think k8s two years ago, chatGPT now, chat bots in 2020). Most
        | businesses just wanted metrics, and instead of building metrics
        | into the app, they decided to boil the ocean by parsing
        | unstructured logs.
        | 
        | Not surprisingly, it turned to shit pretty quickly. Nowadays
        | people are much better at building metrics generation directly
        | into apps, so it's much easier to plot and correlate stuff.
        
       | jl6 wrote:
       | To add to the "the real issue is..." pile:
       | 
       | Most orgs collect the data that is easy to collect, and they are
       | extremely lucky if that happens to be the data that enables the
       | insights they desire. When the data they really _need_ looks too
       | hard to get, the org tries to compensate by collecting more of
       | the easy stuff, and hoping that if blood can't be squeezed out of
       | a stone, maybe it can be squeezed out of 100bn stones.
        
       | meindnoch wrote:
       | Good riddance.
        
       | itamarst wrote:
        | This is an excellent summary, but it glosses over part of the
        | problem (perhaps because the author has an obvious, and often
        | quite good, solution, namely DuckDB).
       | 
        | The implicit problem is that even if the dataset fits in memory,
        | the software processing that data often uses more RAM than the
        | machine has. And unlike using too much CPU, which just slows you
        | down, using too much memory means your process is either dead or
        | so slow it may as well be. It's _really easy_ to use way too
        | much memory with e.g. Pandas. And there are three ways to
        | approach this:
       | 
       | * As mentioned in the article, throw more money at the problem
       | with cloud VMs. This gets expensive at scale, and can be a pain,
       | and (unless you pursue the next two solutions) is in some sense a
       | workaround.
       | 
       | * Better data processing tools: Use a smart enough tool that it
       | can use efficient query planning and streaming algorithms to
       | limit data usage. There's DuckDB, obviously, and Polars; here's a
       | writeup I did showing how Polars uses much less memory than
       | Pandas for the same query:
       | https://pythonspeed.com/articles/polars-memory-pandas/
       | 
       | * Better visibility/observability: Make it easier to actually see
       | where memory usage is coming from, so that the problems can be
       | fixed. It's often very difficult to get good visibility here,
       | partially because the tooling for performance and memory is often
       | biased towards web apps, that have different requirements than
       | data processing. In particular, the bottleneck is _peak_ memory,
       | which requires a particular kind of memory profiling.
       | 
       | In the Python world, relevant memory profilers are pretty new.
       | The most popular open source one at this point is Memray
       | (https://bloomberg.github.io/memray/), but I also maintain Fil
        | (https://pythonspeed.com/fil/). Both can give you visibility
        | into sources of memory usage that were previously painfully
        | difficult to get. On the commercial side, I'm working on
       | https://sciagraph.com, which does memory and also performance
       | profiling for Python data processing applications, and is
       | designed to support running in development but also in
       | production.
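        | 
        | To make the "peak memory" point concrete, here's a minimal
        | sketch using only Python's stdlib tracemalloc (not one of the
        | profilers above; the numbers are illustrative). Both versions
        | compute the same result, but the eager one materializes a full
        | list, so only peak tracking reveals the difference:

```python
import tracemalloc

def peak_mib(fn):
    """Run fn and return the peak traced allocation in MiB."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 2**20

N = 1_000_000
# Eager: builds the whole list in memory before summing.
eager = lambda: sum([x * x for x in range(N)])
# Streaming: a generator yields one value at a time; peak stays tiny.
streaming = lambda: sum(x * x for x in range(N))

print(f"eager peak:     {peak_mib(eager):.1f} MiB")
print(f"streaming peak: {peak_mib(streaming):.3f} MiB")
```

        | Average memory use looks similar for both; it's the peak that
        | kills the process, which is why ordinary profilers miss it.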
        
       | H8crilA wrote:
       | Big data starts somewhere around a petabyte, maybe a bit lower
       | than that. That's when you need some serious, dedicated systems.
       | But as always everyone wants to (pretend to) do what the big
       | players do.
        
       | travisgriggs wrote:
        | I've made anecdotal observations similar to this over the last
        | 10 years. I work in AgTech. A big push for a while here has been
        | "more and more data". Sensor-the-heck out of your farm, and
       | We'll Tell You Things(tm).
       | 
       | Most of what we as an industry are able to tell growers is stuff
        | they already know or suspect. There is the occasional surprise
        | or "Aha" moment where some correlation becomes apparent, but the
       | thing about these is that once they've been observed and
       | understood, the value of ongoing observation drops rapidly.
       | 
       | A great example of this is soil moisture sensors. Every farmer
       | that puts these in goes geek-crazy for the first year or so. It's
       | so cool to see charts that illustrate the effect of their
       | irrigation efforts. They may even learn a little and make some
       | adjustments. But once those adjustments and knowledge have been
       | applied, it's not like they really need this ongoing telementry
       | as much anymore. They'll check periodically (maybe) to continue
       | to validate their new assumptions, but 3 years later, the probes
       | are often forgotten and left to rot, or reduced in count.
        
         | jschveibinz wrote:
         | This analysis reminds me of the big interest in the use of
          | hyperspectral imaging for agriculture. The idea was that
          | greater spectral resolution (greater than Landsat's) would
          | result in more interesting information. Agriculture was one
          | of the applications. But once you did find the interesting
          | stuff, you
         | no longer needed a hyperspectral sensor. You could just look at
         | one spot with a much lower cost sensor.
         | 
         | So hyperspectral, like big data, is useful up front. But in the
         | end, much simpler tools and algorithms will solve the problem
         | on a continuing basis.
        
         | barathr wrote:
         | Classic paper on soil moisture sensors (from 2010!) -- the
         | title says it all:
         | 
         | "Mate, we don't need a chip to tell us the soil's dry"
         | 
         | https://doi.org/10.1145/1858171.1858211
        
         | calvinmorrison wrote:
         | On the flip side, there's some great action coming from data
         | insights. Look at Strella Biotech - they're putting sensors in
         | sealed warehouses to detect spoilage for certain vegetables and
         | fruits. That's something that can have great returns with just
         | a few IoT devices and a novel sensor.
        
         | didip wrote:
         | I've been telling folks, storing everything all the time is
         | wasteful, a better alternative is:
         | 
          | 1. Keep the full raw data for a short period of time, at most
          | 1 month.
          | 
          | 2. Downsample what you need for a longer period of time
          | (5-10% of the full data).
         | 
         | 3. Aggregate your metrics on a yearly basis to save money and
         | compute costs.
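          | 
          | As a sketch of steps 2 and 3 (a hypothetical metric stream,
          | stdlib only; names and the 5% ratio are just for
          | illustration):

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical raw stream: one reading per hour for two years.
start = datetime(2021, 1, 1)
raw = [(start + timedelta(hours=i), float(i % 100))
       for i in range(2 * 365 * 24)]

# Step 2: downsample - keep every 20th sample (5% of the raw data).
downsampled = raw[::20]

# Step 3: aggregate to one average per year for long-term retention;
# the raw and downsampled sets can then be expired on schedule.
by_year = defaultdict(list)
for ts, value in downsampled:
    by_year[ts.year].append(value)
yearly_avg = {year: sum(vs) / len(vs) for year, vs in by_year.items()}
```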
        
         | ff317 wrote:
         | I tend to think the problem is the "random digging for
         | correlations" part.
         | 
         | Having tons of data is a Good Thing, so long as you can afford
         | the marginal cost of gathering and managing all that data so
         | that it's ready at hand when you need it later.
         | 
         | It's how you use the data that makes all the difference. If
         | you're facing an issue you don't understand at all, don't go
         | digging for random correlations in your mountain of data to
         | find an explanation.
         | 
         | Think like a scientist: you need a valid hypothesis first! Once
         | you have a hypothesis about what your issue might plausibly be,
         | then you make a prediction: "If I'm right, I suspect our Foobar
         | data will show very low values of Xyzzy around 3AM every
         | weekday night". Only then do you go look at that specific data
         | to confirm or refute the hypothesis. If you don't get a
         | confirmation, you need to go back to hypothesizing and
         | predicting before you look again. You can't prove causation by
         | merely correlating data.
        
           | counters wrote:
           | > It's how you use the data that makes all the difference. If
           | you're facing an issue you don't understand at all, don't go
           | digging for random correlations in your mountain of data to
           | find an explanation.
           | 
           | Absolutely. But in my experience, there's this massive trend
           | across the tech world that flat out rejects the value of
           | domain/subject matter expertise. Instead, all you need is an
           | engineer who can throw some ML at the uncurated mountain of
           | data your organization has collected. Little to no value is
           | placed on the resources that can frame an actionable
           | hypothesis, even though the entire value proposition arises
           | from this exercise!
           | 
            | Maybe I'm just jaded. I end up wasting a lot of time
            | redirecting data scientists and engineers down more
            | appropriate pathways - time that could be saved if the
            | problem they're solving were just brought to my attention
            | earlier. Sorry, I understand you
           | spent two weeks shoe-horning dataset X into our analysis
           | system for your work, but it's invalid for the question
           | you're asking - use dataset Y instead, and you'll have an
           | answer in an hour or two.
        
         | PeterisP wrote:
         | Fine-grained measurement is useful when you have options for
         | fine-grained action.
         | 
         | You don't need a chip to tell you that the soil is dry, but if
         | you can use that chip to regulate drip irrigation that can
         | apply substantially different flow to different plants, then
         | you can get a not-too-much, not-too-little watering even if you
         | have a big variation in conditions.
         | 
         | You don't need a big analysis to acknowledge that everybody
         | knows that a particular competitor has lower or higher prices
         | and adjust your pricing; but doing that continuously on a per-
         | product basis does require data and analysis.
        
           | ryguyrg wrote:
           | Agreed. But how many executives will agree to take these
           | fine-grained actions to achieve value from the data? How many
           | data teams are able to build up a strong-enough argument to
           | convince them?
           | 
           | I've worked on many product-led-growth initiatives in the
           | software industry. The software industry is probably the
           | biggest 'believer in data' there is -- many scientific-
           | forward minds who understand the value. However, even in the
           | software industry, it's really hard to convince folks that if
           | you make 5 improvements that net 1% conversion gain each, you
           | can dramatically improve revenue.
        
         | chudi wrote:
          | Most of the time this story is true, but think of it this
          | way: the person using the system was an expert on the
          | subject. If you can replace the expert with a person who just
          | looks at a graph from time to time to know whether to
          | irrigate the soil, that's a different thing. Most data or ML
          | tools show the client, as an expert, something they already
          | know, but the true power of these tools is to give them to a
          | non-expert user and get roughly the same level of proficiency.
        
         | e12e wrote:
          | But isn't this the essence of industrialization and
          | automation? Measure, adjust the process; repeat until the
          | feedback loop is stable - document and keep doing the thing
          | that works, over and over?
         | 
         | If you want Toyota style continuous improvement you would need
         | to improve in new areas of the process / new metrics, most of
         | the time?
        
         | azubinski wrote:
         | Oh, those soil moisture sensors, they are so fascinating.
         | 
          | I spent a number of exciting years developing a
          | high-frequency soil impedance scanner and finally understood
          | why I was doing it. To confirm the obvious :)
        
         | 0xdeadbeefbabe wrote:
         | > goes geek-crazy for the first year or so
         | 
         | The problem is that they don't stay geek-crazy?
        
         | ladyattis wrote:
         | I think there's a problem at the heart of the matter,
         | specifically the idea that the act of measurement is in itself
         | powerful when in point of fact that this isn't universally the
         | case. As the old adage goes: "garbage in, garbage out." Even
         | more troubling, there is a physical limit to our ability to
          | model what we measure. Take the retina: it has around a
          | million light receptors, and even if you assume each has only
          | two valid states, you're left with around 2^1,000,000
          | (roughly 10^300,000) possible states to process, so good luck
          | with that. Same thing
         | applies to whatever firms are measuring and what they think is
         | conveying relevant information as they'll have similarly
         | exponential increases if they don't filter out the vast
         | majority of irrelevant data points and states.
        
         | gffrd wrote:
         | I've observed the same in manufacturing ... and fitness
         | trackers a la FitBit.
         | 
         | There's initial value from training yourself on what something
         | looks/feels like ... but diminishing returns after that.
         | Whether there is more value to be found doesn't seem to matter.
         | 
          | Factories would sensor up, go nuts with data, find one or two
          | major insights, tire of data, and then just continue
          | operating how they were before ... but with a few new
          | operational tools in their quiver.
         | 
         | Same is true of fitness trackers: you excitedly get one, learn
         | how much you really are sitting(!), adjust your patterns, time
         | passes ... then one day you realize you haven't put it on for a
         | week. It stays in the drawer.
         | 
          | Only when they're threatened with ruin will people make
          | changes to the standard way of doing things. This is actually
          | ... not bad! Continuity is important, and this is kind of a
          | subconscious gating function to prevent deviation from a
          | proven way of working. So the change has to be so compelling
          | or so pressing that they're forced to make it.
         | 
         | While we think things change overnight in this world, they
         | generally take awhile ... stay patient ... it's worth it.
        
           | [deleted]
        
           | pradn wrote:
           | I went on a diet a few years ago. I obsessively recorded
           | every food I ate in MyFitnessPal. To this day, I know roughly
           | how many calories pretty much everything I eat is. So, I've
           | learned from the process and don't need the process as much
           | any more. (I'm kidding about that - it's easy to
           | underestimate how much you eat, and an extra 200 cal a day
           | adds up over the years.)
        
           | carabiner wrote:
           | I used to track sooo much health and fitness data... Then I
           | realized it mostly wasn't actionable, or at least, I wasn't
           | altering my decisions based on it. The answer was always,
           | "more training." So I stopped.
        
         | esel2k wrote:
          | Very interesting. I left AgTech last year but had similar
          | experiences - even worse, often the single most prominent
          | use-case was to satisfy some painful but necessary
          | documentation of ag inputs (chemicals, seeds, fertiliser) to
          | get subsidies. Real inputs from data? Nah!
        
       | guardiangod wrote:
        | There is literally a post on the front page about ChatGPT, and
        | Microsoft and Google are preparing to duke it out starting in
        | the _next 2 days_ over big-data-generated 'chat' results.
        | 
        | Big data was never going to be useful to even medium-size
        | enterprises, unless anyone can get public access to PBs of
        | data, but that doesn't mean big data is dead. ChatGPT is
        | literally changing how schools will test their students, for a
        | start.
       | 
       | Maybe what the author is trying to say is 'small-scale big data
       | is dead, but big data chugs on.'
        
         | gardenhedge wrote:
         | That is what the author said. From the article: "Big Data is
         | real, but most people may not need to worry about it."
        
         | hn_throwaway_99 wrote:
         | > Maybe what the author is trying to say is 'small-scale big
         | data is dead, but big data chugs on.'
         | 
         | That's pretty much exactly what the author says in the article.
        
         | zmmmmm wrote:
          | Yes, this occurred to me as well. The counter-narrative here
          | is that the story of the last 2-3 years has been
          | breakthroughs in AI coming about mostly by scaling up network
          | sizes and training data sets by 5 orders of magnitude or so.
          | 
          | I guess the takeaway, however, is still that regular
          | businesses really just can't play in this game and should not
          | assume they have big data until that fact asserts itself out
          | of necessity, rather than the other way around.
        
         | eppp wrote:
         | I kind of doubt they trained chatgpt on petabytes of
         | application logs and web server logs. Is keeping all of this
         | crap even useful for more than a small amount of time at this
         | scale?
         | 
         | Actual good information will always be useful, most of this
         | "big data" seems to be the equivalent of recording background
         | static.
        
         | tootie wrote:
         | That's a completely different topic. "Data" is obviously a
         | pretty generic term and "large sets of data" are going to be
         | more and more relevant to the world in general. What he's
         | talking about is the Big Data trend in industry specifically
         | around Business Intelligence (BI). That is, collecting as much
         | data as possible on your users to optimize your product
         | experience and profits. Tracking clicks, purchases, form
         | dropoffs, email opens, ad impressions. It's mostly going to be
         | first-party data (ie what did they do with our own products and
         | content).
         | 
         | ChatGPT and the like are not going to get much use from that
         | kind of data and instead are looking at a giant corpus of text
         | and images scraped from a variety of public sources to infer
         | what humans might think sounds smart. It's possible the two
         | worlds will meet, but that's probably not what's going to be
         | announced this week.
        
         | miguelazo wrote:
          | >ChatGPT is literally changing how schools will test their
          | students, for a start.
         | 
         | Sure, instead of schools checking for plagiarism from other
         | students' papers using turnitin.com, they'll check for
         | plagiarism using ChatGPT tools that scan for known output from
         | their industrial-scale amalgamation of plagiarized materials.
         | Big whoop.
        
           | humanizersequel wrote:
           | All math homework through the high school level is now as
           | simple as figuring out how to describe it to ChatGPT (or
           | maybe ChatGPT 2.0 for particularly tricky examples). Paper-
           | writing is now a matter of figuring out how to rephrase LLM
           | output in your own words to get around any watermarking or
           | pattern detection.
        
             | eganist wrote:
              | Wolfram Alpha has been around for math cheats (and people
             | like me who just needed a more visual representation to
             | learn) for a while now. Including proof of work.
        
               | LarryMullins wrote:
                | Years before Wolfram Alpha, we had TI-89s with computer
                | algebra systems for cheating your way through high
                | school math.
        
               | eganist wrote:
               | Oh yeah, it's why most of my classes restricted us to
               | TI-83s. The TI-89 was restricted in schools to basically
               | calc and above, and the TI-92 was just banned. Lol
        
               | LarryMullins wrote:
               | In my school TI-89s were unrestricted, but I think that
               | was mostly because teachers were only trained with 83s
               | and assumed the 89 had equivalent capabilities. The SAT
               | permitting the 89 probably had something to do with it
               | too, since the 92 was banned (because qwerty as I
               | understand it.)
        
             | frgtpsswrdlame wrote:
             | You can't figure out how to describe a math problem
             | correctly to ChatGPT without already knowing the solution.
        
           | bhhaskin wrote:
            | Or just require students to use software that keeps track
            | of version history, or require spyware to be installed.
        
             | SoftTalker wrote:
             | Go back to longhand. Even if they are plagiarizing, they
             | might learn something from the exercise of rewriting by
             | hand.
        
               | suddenclarity wrote:
               | That's solved by putting a pen on your 3d printer.
        
           | Loughla wrote:
           | I mean, that's certainly part of it. But you're also seeing
           | very fast restructuring of individual courses (because
           | programs overall will definitely take years to restructure
           | because higher education moves so damned slow) to account for
           | these tools.
           | 
           | In the small institution I am currently working with, the
           | English courses, in one week, integrated chatgpt as a tool
           | for students to work with. It's part of the collaborative
           | idea building and development process now for every student
           | enrolled in creative writing and writing analysis classes,
           | and that happened in one week. I cannot stress enough how
           | unbelievably fast that is for higher ed. That's faster than
           | light speed.
           | 
           | And we're not even that well resourced. I have to imagine
           | there are other examples where it's more than just running
           | through a bot to scan for known outputs.
        
             | stcroixx wrote:
             | Sounds like they simply panicked and threw something
             | together with little thought or preparation, what, right in
             | the middle of an actual course? And they want to charge
             | kids for this kind of 'expert instruction'? I'd be pissed
             | as a student.
        
               | Loughla wrote:
               | What a weird assumption you made there. In what way does
               | what I wrote sound panicked? Because it was a week? Yes
                | it was fast, but it was a massive effort by the entire
                | English faculty.
               | 
               | It's integrated into existing assignments, modifying
               | processes that are super well established already. It was
                | like integrating a new person into the class. Also, it
                | was before the semester, so the students literally saw
                | nothing weird; that was another strange assumption for
               | you to make.
               | 
               | Just a tip, and don't read tone in this statement, but
               | don't assume things. 9/10 times you're going to be
               | incorrect. It's much better to ask questions, instead of
               | making statements with question marks at the end of them.
        
               | pcthrowaway wrote:
               | > don't assume things. 9/10 times you're going to be
               | incorrect
               | 
               | Isn't that... an assumption?
        
               | Loughla wrote:
               | No, it's an assertion. It's like the bastard cousin of an
               | assumption, in that it's only incorrect 8.67 times out of
               | 10.
        
               | stcroixx wrote:
               | I simply don't buy that they had anywhere close to enough
               | time to integrate this into an existing curriculum in a
               | way that would meet the high standards paying students
               | should have for a university education. I don't have to
               | ask how an English department became experts in using a
               | just released AI in the classroom in a week because I
               | don't think that's what they actually achieved.
        
               | Loughla wrote:
               | Again, stop putting words in my mouth, please. I never
               | said they became experts. I said they integrated it into
               | existing processes. I said that they were doing something
               | other than just scanning for chatgpt hits like plagiarism
               | checkers.
               | 
               | >It's part of the collaborative idea building and
               | development process now for every student enrolled in
               | creative writing and writing analysis classes
               | 
               | >It's integrated into existing assignments, modifying
               | processes that are super well established already. It was
               | like integrating a new person into the class.
               | 
               | I don't honestly know how else to say that. I
               | legitimately do not know how to help you understand what
               | I'm saying.
        
             | hgsgm wrote:
              | Wow, instead of learning and being creative, just write
              | out whatever the generic, safe, whitewashed chatbot says?
             | 
             | Why pay for the college?
        
           | scottyah wrote:
           | > industrial-scale amalgamation of plagiarized materials
           | 
           | More Big Data!
        
           | criddell wrote:
           | Is it crazy to think that instead of stepping up in the war
           | against AI we instead try to figure out a way to teach kids
           | assuming they will use AIs?
        
             | SoftTalker wrote:
             | Are we trying to produce adults who are able to think
             | critically and creatively, and who reach their full
             | intellectual potential, or are we trying to produce adults
             | who can push a few buttons and blindly believe what the
             | machine tells them?
        
               | NineStarPoint wrote:
                | While likely not how it would work out in practice, you
                | would hope that with better tools would also come
               | higher standards. If you expect more complex, more
               | thorough, and/or less error-prone output from students
               | using AI then you don't necessarily have to lower how
               | much critical and creative insight they need to have.
               | Like the difference in a test that does and doesn't allow
               | calculators, you always have to fit the assignments to
               | the tools that are used for them.
        
               | ilyt wrote:
               | Definitely the second one
        
               | [deleted]
        
               | nickersonm wrote:
               | > In fact, [writing] will introduce forgetfulness into
               | the soul of those who learn it: they will not practice
               | using their memory because they will put their trust in
               | writing, which is external and depends on signs that
               | belong to others, instead of trying to remember from the
               | inside, completely on their own.
               | 
               | - attributed to Socrates by Plato. c.399-347 BCE.
               | "Phaedrus."
        
               | drowsspa wrote:
               | Pretty sure there's a fallacy named after this whole "hey
               | this is just exactly like before so we have nothing to be
               | concerned about".
        
               | nickersonm wrote:
               | The point is that all technology is a tool. Whether it be
               | writing, calculators, or various narrow AI software. We
               | can either bemoan the loss of a now-less-useful skill
               | (memorization, long division, longform writing), or learn
               | how to use these tools to better achieve our goals.
        
             | qup wrote:
             | The average human hates changes of things they've grown
             | used to.
             | 
             | People are very attached to school being just like it was
             | when they went.
        
           | maliker wrote:
           | It appears that it's hard to detect AI generated content.
           | E.g. true detection rates are only around 25% and there are
           | also techniques to further mask output [1].
           | 
           | [1] https://www.nbcnews.com/tech/innovation/chatgpt-can-help-
           | foo...
        
           | penjelly wrote:
           | What if teachers ask ChatGPT how best to test students,
           | despite the existence of a tool like ChatGPT enabling
           | cheating?
        
         | pphysch wrote:
         | OpenAI and Google are clearly in the 1% as TFA describes.
        
         | snarf21 wrote:
         | Big Data drives the most profitable and society-bending
         | changes of all time, just to serve us better ads.
        
           | ryguyrg wrote:
           | Okay, Google as a company and as a product is definitely in
           | the top 1%, or top 0.0001%, where big data drives profitable
           | and society-bending changes ;-)
        
         | epicureanideal wrote:
         | I wonder how long until training today's ChatGPT costs $1000
         | of AWS compute. 10 years?
         | 
         | At that point, does it keep scaling or is there an S curve
         | where 100x more data and compute only leads to a 2x
         | improvement?
        
           | thfuran wrote:
           | I think you're underestimating the scale by a few orders of
           | magnitude.
        
           | the8472 wrote:
           | > At that point, does it keep scaling or is there an S curve
           | where 100x more data and compute only leads to a 2x
           | improvement?
           | 
           | Careful with the scales. A 2x improvement could mean going
           | from 80% of human performance to 160%, or from a 10% error
           | rate to 5% (which again crosses into superhuman territory on
           | some tasks). Those last few bits are the critical ones.
        
           | pphysch wrote:
           | We would need to see incredible advances in energy efficiency
           | for that to happen.
        
             | simplyluke wrote:
             | We are! You don't even have to reach for fusion
             | potentially becoming commercial technology to show it.
             | Solar is already approaching $0.03/kilowatt-hour and will
             | likely be half of that by the end of the decade. Energy
             | getting very cheap, coupled with computing capacity
             | continuing to go way up, is going to enable lots of
             | interesting new technologies beyond LLMs.
        
         | airstrike wrote:
         | _> ChatGPT is literally changing how school will test their
         | students, for a start._
         | 
         | Here's a novel idea: test students using pen and paper?
        
           | 0cf8612b2e1e wrote:
           | Teachers assign better scores to papers with better
           | penmanship. I forget how strong the effect was, but using a
           | keyboard does help equalize some biases.
        
             | politician wrote:
             | Scan the hand-written work back into a digital format and
             | present the results to the teachers for evaluation in their
             | preferred typeface.
        
             | albert_e wrote:
             | Keyboards and screens in an exam hall then?
        
             | 0xdeadbeefbabe wrote:
             | Why equalize that bias?
        
               | slaymaker1907 wrote:
               | Because there are a plethora of disabilities which make
               | neat penmanship difficult.
        
             | MonkeyMalarky wrote:
             | Typewriters for everyone!
        
           | Mizza wrote:
           | Or, preferably, admit that testing wasn't a good idea to
           | begin with and focus on optimizing children for learning, not
           | test-taking.
        
             | taftster wrote:
             | I don't even think we're just talking about children
             | either. Test taking in academia (university and above)
             | could stand a much needed fresh look.
             | 
             | I am hopeful that a change happens in academia to prepare
             | students for jobs, which is why they are going to school in
             | the first place. Yes, students need to learn how to
             | "think", but really they are wanting to get the technical
             | skills to perform their duties more than anything.
             | 
             | We have placed too much credence in traditional academia
             | that is not useful to the average person or job. College is
             | a "game" for most students, and they put up going through
             | the motions of testing, etc. for the sake of the diploma at
             | the end.
             | 
             | I hope we're going to enter a new era of what college means
             | for those looking to get something different out of it.
        
           | throwmeout123 wrote:
           | In Italy we test with pen and paper, and the quality of
           | Italian schools is abysmal. The point is not the test; it is
           | the quality of the teachers. They are the only hope for
           | forming good humans, not some standardized test.
        
       | winterismute wrote:
       | The database was the key technology of the 2001-2011 decade: it
       | allowed companies to store massive amounts of data in an
       | organized way, so that they could provide basic functionality
       | (search, monitoring) to users. Statistical learning has been the
       | key "technology" from 2011 to today: it allowed companies, which
       | had stored massive amounts of data, to feed predictions back to
       | users. I think AR/computer graphics will be the key technology
       | of the next decade: it will allow users to interact directly and
       | seamlessly with the insights produced by ML systems, and
       | possibly feed information back.
        
       | idlewords wrote:
       | Pretty funny to see this when every other headline on this site
       | is about how large language models are about to revolutionize
       | dentistry, beekeeping, etc.
        
       | nerpderp82 wrote:
       | Big Data was whatever someone couldn't handle in a spreadsheet or
       | on their laptop using R.
       | 
       | This paper is 8 years old and it was somewhat obvious then.
       | 
       | Scalability! But at what COST?
       | https://www.usenix.org/system/files/conference/hotos15/hotos...
       | 
       | A big single machine can handle 98% of people's data reduction
       | needs. This has always been true. Just because your laptop only
       | has 16GB doesn't mean you need a Hadoop (or Spark, or Snowflake)
       | cluster.
       | 
       | And it was always in the best interest of the Big Data vendors
       | and cloud vendors to say "collect it all and analyze it on our
       | platform".
       | 
       | The future of data analysis is doing it at the point of use and
       | incorporating it into your system directly. Your actionable
       | insights should be ON your grafana dashboard seconds after the
       | event occurred.
        
         | angry_moose wrote:
         | My experience with "Big Data" is it was something that couldn't
         | be handled in a spreadsheet or on their laptop using R because
         | it was so inefficiently coded.
         | 
         | I got sucked into "weekly key metric takes over 14 hours to run
         | on our multi-node kubernetes cluster" a while back. I'm not
         | sure how many nodes it actually used, nor did I really care.
         | 
         | Digging into it, the python code ingested about ~50GB of
         | various files, made well over a dozen copies of everything,
         | leaving the whole thing extremely memory starved. I replaced
         | almost all of the program with some "grep | sed | awk | sed |
         | grep" abomination that stripped about 98% of the unnecessary
         | info first and it ran in under 2 minutes on my laptop. I
         | probably should have tightened it up more but I was more than
         | happy to wash my hands of the whole thing by that point.
         | 
         | Instead of improving the code, they just kept tossing more
         | compute at it. Still heard all kinds of grumbling about
         | os.system('grep | sed | awk | sed | grep') not being "pythonic"
         | and "bad practice"; but not enough that they actually bothered
         | to fix it.
        
           | nerpderp82 wrote:
           | That is one of the selling points of Hadoop, you can write
           | garbage code and scale your way out of any problem, turning
           | the $$$ knob up to more nodes.
        
             | angry_moose wrote:
             | Yeah, that's why I got involved (I was infrastructure at
             | the time) - how can we throw more hardware at it as the
             | kubernetes setup they had wasn't cutting it.
             | 
             | One of the "data scientists" point blank said in a meeting
             | "My time is too valuable to be spent optimizing the code, I
             | should be solving problems. We can always just buy more
             | hardware".
             | 
             | Admittedly the last little bit of analysis was pretty cool,
             | but >>99% of that runtime was massaging all of the data
             | into a format that allowed the last step to happen.
        
         | mywittyname wrote:
         | You can do a petabytes of analysis with regular old BigQuery
         | just as easily as you can analyze megabytes of data. This
         | solves the scalability issue for a lot of companies, IMHO.
        
           | nerpderp82 wrote:
           | I agree, BQ is a gem on GCP. You pay for storage (or not, you
           | can use federated queries) and don't pay anything when you
           | aren't using it. The ability to dynamically scale
           | reservations is pretty nice as well.
        
       | mejakethomas wrote:
       | So what I'm hearing is it's not the size of your data that
       | matters, it's how you use it?
        
       | zmmmmm wrote:
       | To be honest, I slightly disagree about data size. I think the
       | big data is there to be had, the real story is that data science
       | itself has not panned out to provide the business value that
       | people asserted would come from it. Data volumes haven't risen
       | more because in the end, it turns out most of the things
       | businesses need to know are easily ascertainable from much
       | smaller data and their ability to action even these smaller very
       | obvious things is already saturated.
       | 
       | It doesn't help that we've shifted into a climate where hoarding
       | data comes with a huge regulatory and compliance price tag, not
       | to mention risk. But if the value was there we would do it, so
       | this is not the primary driver.
        
       | papito wrote:
       | First they came for the sacred microservices, now they are after
       | Big Data. What. Is. Happening.
       | 
       | Don't get me wrong, I love it. It's about time people got off
       | these stupid and shockingly expensive bandwagons.
        
       | heisenbit wrote:
       | Sampling has proven extremely useful. Pi can be approximated
       | with it, and nuclear bombs were designed with statistical
       | methods.
       | Flame graphs based on stack samples are used to optimize servers.
       | Government does planning with it. Management does its thing by
       | wandering around.
       | 
       | It usually does not take many data points for an actionable
       | insight and most actions then will invalidate small details in
       | old data anyhow. Better to start every round with fresh eyes.
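The pi example is a nice illustration of how far modest sampling goes — a minimal Monte Carlo sketch (sample count and seed chosen arbitrarily):

```python
import random

def estimate_pi(n_samples, seed=0):
    """Estimate pi from the fraction of random points in the unit
    square that land inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = sum(
        1
        for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / n_samples

# A few hundred thousand samples already give roughly 3 correct digits.
print(estimate_pi(200_000))
```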
        
       | edpichler wrote:
       | I believe we are living in the "emotional era", so data is
       | being ignored and 'feelings' come first when making decisions or
       | creating processes. This is happening not only in companies but
       | in our current society in general.
        
         | mordechai9000 wrote:
         | Perhaps I'm somewhat cynical, but I believe this is a feature
         | of the human condition, not an attribute of our age in
         | particular. Reason and analysis are tools that are used to
         | justify what we already believe.
        
           | maxfurman wrote:
           | Agreed! The so-called "Age of Reason" was the anomaly, and
           | probably not that much more reasonable than our own time.
        
         | tootie wrote:
         | I think there's absolutely a place for this. I often think of
         | the old Henry Ford quote about people wanting faster horses.
         | Data and
         | analytics are great for optimization, but sometimes you need to
         | trust your gut and give people something they didn't ask for to
         | have a breakthrough.
        
       | alexpetralia wrote:
       | I am writing an essay series on this topic: last-mile analytics
       | and how an abundance of data must be ultimately converted into
       | (measurably correct) action.
       | 
       | If anyone wants to follow along, the series is here!
       | 
       | https://alexpetralia.com/2023/01/19/working-with-data-from-s...
        
         | blakeburch wrote:
         | That looks like a huge undertaking, but kudos for taking the
         | time. I'll be following along. Totally agree that all data
         | should be tied to the business value that it's driving.
         | 
         | Unfortunately, I've found that many data teams focus more on
         | making the data clean and available. They never drive the
         | conversation about what actions are being taken with the data.
         | That leads to them being treated as cost centers. Wrote a
         | similar post about my perspective on it -
         | https://bytesdataaction.substack.com/p/transform-your-data-t...
         | 
         | I'd love to chat about the space more with you if you're
         | interested! Email in bio.
        
         | imachine1980_ wrote:
         | This sounds like "sane planning, sensible tomorrow", a book
         | for Al Gore.
        
       | moooo99 wrote:
       | I feel like big data has rarely delivered in most organizations. My
       | own experience working in large orgs largely supports the point
       | that collected data is rarely queried. But this is rarely due to
       | a lack of interest, it is mostly because a) nobody really has a
       | great overview over what even is collected b) even if you
       | know/assume something is collected, you usually have no idea
       | where c) if you find the data, there is a decent chance that it
       | is in some sort of weird format that requires a ton of processing
       | to be usable.
       | 
       | This has been - to varying extents - my own experience working in
       | large organizations that don't have tech as their core business.
       | 
       | Although there are some successful data analysis projects, the
       | potential of the collected data remains largely underutilized.
        
       | blipvert wrote:
       | Listen to "Reason"
        
       | danuker wrote:
       | I agree with many of the points here.
       | 
       | My cheap no-name old laptop SSD writes at 170MB/s.
       | 
       | A customer has a name, address, email and order. Let's say 200
       | bytes for each. That means I can write 850,000 new customers
       | per second, far outside my personal marketing reach.
       | 
       | My disk is 240GB, which means I can store data for 1.2 billion
       | customers. It'll take a while until I become that successful.
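A quick sanity check of the arithmetic above, using the same 170 MB/s, 200-byte, and 240 GB figures:

```python
write_speed = 170e6   # bytes/second, sequential SSD write
record_size = 200     # bytes per customer record
disk_size = 240e9     # bytes of SSD capacity

customers_per_second = write_speed / record_size
customers_on_disk = disk_size / record_size
print(int(customers_per_second))  # 850000 records written per second
print(int(customers_on_disk))     # 1200000000 records stored
```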
        
         | hinkley wrote:
         | One of my "computers are really fucking fast" experiments,
         | almost a decade ago, was when I was trying to do a histogram
         | plot of a function that I was 98% sure was terribly broken. It
         | was expected to give a uniform distribution so I figured I'll
         | just plot a bunch of values into a 2d space and then convert it
         | to a greyscale image.
         | 
         | At first I tried to puzzle out a good sampling strategy to make
         | sure I didn't bias the output, then on a whim I tried 2^32
         | samples and went to lunch. It took something like half an
         | hour to do 4 billion samples. It took me a couple of tries to
         | figure out how to squeeze 4 billion points into a graph, so I
         | ran it a few more times, but the results showed a very
         | distinct banding
         | pattern that confirmed that the problem was every bit as bad as
         | I suspected, which was a blocking issue for our release. A
         | couple of hours well spent, running through an 'intractable
         | problem' that really wasn't.
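The experiment above can be sketched in pure Python (grid size and sample count here are arbitrary): bin (x, y) samples into a 2-D grid of counts and render it as greyscale; a genuinely uniform source yields roughly equal counts per cell, so banding from a broken generator stands out immediately.

```python
import random

def histogram_grid(points, size=32):
    """Bin unit-square (x, y) points into a size x size grid of counts."""
    grid = [[0] * size for _ in range(size)]
    for x, y in points:
        gx = min(int(x * size), size - 1)
        gy = min(int(y * size), size - 1)
        grid[gy][gx] += 1
    return grid

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(100_000)]
grid = histogram_grid(points)
counts = [c for row in grid for c in row]
# ~97.7 samples are expected per cell; a tight min/max spread means
# "uniform", while visible structure would confirm a broken generator.
print(min(counts), max(counts))
```

To get the greyscale image, each count would just be rescaled to 0-255 and dumped as pixels — the banding the commenter saw is exactly non-uniform rows or columns in this grid.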
        
         | lostmsu wrote:
         | > 170MB/s
         | 
         | That is not random access speed. For random access my
         | relatively high-performance SSD only does 42MB/s reading and
         | 80MB/s writing.
        
           | danuker wrote:
           | Indeed there's probably some caching going on.
        
         | ravivooda wrote:
         | It never sat well with me that none of the production services
         | could leverage my local computation and storage power. I don't
         | need to store my contacts on a remote server that could index
         | my contacts when mixed with every other contact in a single
         | table. That's a blatantly oversimplified example but you get
         | the gist.
        
           | ilyt wrote:
             | Developing apps as local-mostly, with remote being "just
             | storage", might have been an interesting approach, but so
             | much stuff moved from native apps to webshit, and browsers
             | still don't even have decent data management.
        
             | ravivooda wrote:
             | Well said! I wonder if Web3 (or a zero-trust solution)
             | could solve such a problem, where you provide your service
             | to run in a special container.
        
         | tomwheeler wrote:
         | Presumably the "order" you mention is a primary key to another
         | table, likely one that references the individual items that
         | make up that order, so the data will be much larger than you
         | estimate.
         | 
         | It will grow larger still if you include web logs from your
         | e-commerce site and event data from your mobile app so that you
         | can correlate these orders with items that customers considered
         | but ultimately didn't buy. How will your laptop and SSD perform
         | when you then build a user-item matrix to generate product
         | recommendations for each of those 1.2 billion customers?
         | 
         | While plenty of organizations unnecessarily use Big Data tools
         | to store and analyze relatively small amounts of data, there
         | are plenty of customers with enough data to require them. I've
         | seen plenty of them firsthand.
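To put rough numbers on the user-item matrix point (the catalogue size and event count here are hypothetical): a dense matrix over 1.2 billion customers dwarfs any laptop SSD, while the raw interaction list is orders of magnitude smaller.

```python
users = 1_200_000_000
items = 100_000                  # hypothetical catalogue size
interactions = 50_000_000_000    # hypothetical ~40 events per user

dense_bytes = users * items * 4   # one float32 per (user, item) pair
sparse_bytes = interactions * 12  # ~12 bytes per (user, item, value) triple

print(dense_bytes / 1e12)   # 480.0 TB: never fits a 240 GB laptop SSD
print(sparse_bytes / 1e12)  # 0.6 TB: big, but tractable on one server
```

The spread between the two is why recommendation workloads store sparse interaction logs rather than the dense matrix — and also why, at this scale, the "it fits on my laptop" argument stops applying.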
        
           | ilyt wrote:
           | That's still well within 1U server with some RAM and bunch of
           | NVMes reach
        
           | 0xB31B1B wrote:
           | There are functionally fewer than 1000 organizations that
           | currently require distributed compute for data analysis. You
           | can get off the shelf AWS units with 1000 cores, terabytes of
           | ram and storage, etc. The cost of compute has decreased
           | faster than the amount of data we have to store and process.
           | What we used to do with spark jobs we can do with python on a
           | single box.
        
             | pocket_cheese wrote:
             | This is not true. Any column-store database (BigQuery,
             | Redshift, Snowflake) implements distributed compute behind
             | the scenes. When analysts and business intelligence people
             | have a query return in 3 seconds instead of 15 seconds,
             | it's actually huge. Not just in the aggregate time saved,
             | but in creating a quick feedback loop for testing
             | hypotheses. This is especially true considering that most
             | analyst type people look at data as aggregates across some
             | dimension (e.g. sales per month , unique visitors per
             | region, etc...)
             | 
             | These types of questions are orders of magnitude faster
             | with a distributed backend.
        
               | glogla wrote:
               | Yup.
               | 
               | I was just playing with some data from our manufacturing
               | system, about 30 GB. I pulled the data to my laptop (very
               | expensive Apple one) and while it fits on my disk just
               | fine, it took about 15 minutes to download.
               | 
               | I imported it to ClickHouse which took a while due to
               | figuring out whatever compression and LowCardinality()
               | and so on. I ran a query and it took ClickHouse about 15
               | seconds. DuckDB pointed to the parquet files on my SSD
               | took 19 seconds to do the same. Our big data tool took 2
               | seconds, while working with data directly in cloud
               | storage.
               | 
               | Now of course this is entirely unfair - the big data
               | thingie has over twenty times more CPUs than my laptop,
               | and cloud storage is also quite fast when accessed from
               | many machines at once. If I ran ClickHouse or DuckDB on
               | a 100-CPU machine with a terabyte of RAM it might still
               | have turned out faster.
               | 
               | But this experiment (I was thinking of using some of the
               | new fancy tech to serve interactive applications with
               | less latency) made me realize that big data is still a
               | thing. This was a sample - one building from one site,
               | which we have quite a few of.
        
               | ryguyrg wrote:
               | I'd love to understand the shape of this data and some of
               | the types of queries you're performing. It would be very
               | helpful as we build our product here at motherduck.
               | 
               | I have no doubt that there are situations where the cloud
               | will be faster, especially when provisioned for max usage
               | [which many companies do not]. However, there are a lot
               | of these situations even where the local machine can
               | supplement the cloud resources [think re decisions a
               | query planner can make].
               | 
               | Feel free to reach out at ryan at motherduck if you want
               | to chat more.
        
             | threeseed wrote:
             | Let's assume your completely made-up 1000 organisations
             | claim is true.
             | 
             | Right now I work for one of them: a global investment bank.
             | 
             | Within that organisation we have at least 100+ Spark
             | clusters across the organisation doing distributed compute.
             | And at least in our teams we have tight SLAs where a simple
             | Python script simply can't deliver the results quickly
             | enough. Those jobs underpin 10s of billions of dollars in
             | revenue and so for us money is not important, performance
             | is.
             | 
             | So 1000 x 100 = 100,000 teams, all of whom I speak for,
             | disagree with you.
        
             | doug_durham wrote:
             | Citations please? That's a pretty bold statement to make in
             | the face of observed reality.
        
               | beckingz wrote:
               | Even if this is off by two orders of magnitude and it's
               | only 100,000 companies that need distributed compute,
               | that means that almost all companies just need a single
               | large computer.
               | 
               | Looking at the distribution of companies by employee
               | count and assuming that data scales with employee count
               | (dangerous assumption, but probably true enough on
               | average), that means that companies don't need
               | distributed compute until they get several hundred
               | employees. [0]
               | 
               | [0] https://www.statista.com/statistics/487741/number-of-
               | firms-i...
        
               | fho wrote:
               | https://yourdatafitsinram.net/
        
               | threeseed wrote:
               | This is such a lazy response.
               | 
               | I/O performance is just one of many characteristics that
               | impact performance and from experience the one you least
               | need to worry about. RAID 0 across multiple high-end NVME
               | drives with OS file caching is going to be more than fast
               | enough for most use cases.
               | 
               | The issue is running out of CPU performance and being
               | able to seamlessly scale up/down compute with live
               | running workloads.
        
               | beckingz wrote:
               | A large computer is radically CPU overprovisioned for
               | most workloads.
        
           | deltarholamda wrote:
           | Don't forget the cool JS library you included to track mouse
           | movements so you can optimize your UI to make sure Important
           | Money Making Things are easily clickable.
           | 
           | That's 8.4 hojillion megabytes per second right there.
        
       | juujian wrote:
       | > Most data is rarely queried
       | 
       | Right on point. In the past I have been obsessed with big data,
       | looking for insights. Then I realized that a medium-sized
       | specific data set is always better than a gargantuan general big
       | data monster. There are so many applications in my field where
       | only outliers matter anyway, and everything is very
       | "centralized" to a few relevant observations. So the only thing
       | about big data is that you maybe throw away 99.9% of the data
       | right away and then you have some observations that you actually
       | care about. There is soooo much data out there that is just
       | noise, and so little that I actually care about. And that's why I
       | still end up hand collecting stuff every now and then.
        
       | donretag wrote:
       | My personal definition of Big Data has always been when you
       | gather/store data without having a planned use for it. Do we need
       | this data? Don't know, let's just store it for now.
       | 
       | The article does allude to this definition when it states that
       | "Most data is rarely queried". We have become data hoarders.
       | Technology has made it easy (and relatively cheap) to store data,
       | but the ideas of what to do with this data have not scaled in
       | comparison.
        
       | ThereIsNoWorry wrote:
       | Big Data is dead? Seems well and alive to me. If you're not a big
       | company with big customers, it never affected you to begin with.
        
         | dig1 wrote:
         | Big Data is far from dead. On the contrary, people (on most
         | daily projects) are now more mindful of Big Data's liabilities
         | and benefits (infrastructure cost vs. what you get from it),
         | thanks to the experience of the projects that failed. But many
         | analytics companies are thriving.
         | 
         | Also, using BigQuery as a metric of how Big Data is used is,
         | IMHO, wrong. Real analytics companies usually have custom
         | solutions because BigQuery is too expensive for any serious
         | usage unless you are Google.
        
       | zzzeek wrote:
       | > The most surprising thing that I learned was that most of the
       | people using "Big Query" don't really have Big Data.
       | 
       | wow, ya think? Must have been eye opening to see all those
       | customers with a few million rows thinking they had "Big data"
       | huh?
        
       | ryadh wrote:
       | While I get that they're sometimes useful to trigger debate, I
       | don't really subscribe to very bold statements.
       | 
       | We are drowning in data, it's all around us. Information overload
       | is real. Data enables most of our daily digital experiences, from
       | operational data to insights in the form of user facing
       | analytics. Data systems are the backbone of the digital life.
       | 
       | It is an ocean, and it's all about the vessel you pick to
       | navigate it. I don't believe the vessel should dictate the size
       | of the ocean; it's simply constrained by its capabilities. The
       | trick is to pick the right vessel for the job, whether you want
       | to go fast, go far, or fish for insights (ok, I need to stop
       | pushing this metaphor).
       | 
       | This visionary paper from Michael Stonebraker (2005) predicted
       | it quite accurately and I think it is still relevant:
       | https://cs.brown.edu/~ugur/fits_all.pdf
       | 
       | Databases come in various flavours and the "trends" are simply a
       | reflection of what the current era needs
       | 
       | Disclaimer: I work at ClickHouse
        
         | fuziontech wrote:
         | 100% agree. One of the biggest assets we had at <driver and
         | rider marketplace app> was the data we collected. We built
         | models on it that would determine how markets were run and
         | whether drivers and passengers were safe. These were key
         | features that enabled us to bring a quality service to
         | customers (over ye ol' taxi). The same applied to the
         | autonomous cars, bikes, and scooters. We used data to improve
         | placement of vehicles to help us anticipate and meet demand. It
         | was insane how much data we used to build these models.
         | 
         | To say big data is dead sounds to me like someone desperate for
         | eyeballs.
         | 
         | I do think there is a huge opportunity for DuckDB - running
         | analytics on 'not quite big data' is a market that has always
         | existed and is arguably growing. I've seen way too many people
         | trying to use Postgres for analyzing 10 Billion row tables and
         | people booting up an EMR cluster to hit the same 10 Billion
         | rows. There is a huge sweet spot for DuckDB here where you can
         | grab a slice of the data you are interested in, take it home
         | and slice and dice it as you please on your local computer. I
         | did this just this weekend on DuckDB _and_ ClickHouse!
         | 
         | Disclaimer: I work at a company that is entirely based on
         | ClickHouse.
        
           | vgt wrote:
           | Didn't know that Posthog is based on CH these days.
           | Interesting!
        
         | spopejoy wrote:
         | I guess the article title is a "bold statement" but maybe the
         | biggest insight in there is that people don't think hard enough
         | about throwing old data away, and it hurts them. This is a
         | life raft for those drowning in data and is more "bold"
         | organizationally, as it actually takes a certain kind of
         | courage to realize you should just throw stuff away instead
         | of succumbing to the false comfort that "hey, you never know
         | when you might need it".
         | 
         | Weirdly, something similar happens to codebases, specifically
         | unit tests and test fixtures that outlive all of their
         | original programmers: nobody understands what's actually
         | being tested, and before each release the team loses
         | days/weeks hammering away to "fix the test". The only
         | solution is to throw it away, but good luck getting most
         | teams to ever do that, because of the false comfort they get
         | -- even though that fixture is now just testing itself and
         | not protecting you from any actual bugs.
         | 
         | I mean, how often does Netflix need to look at viewing habits
         | from 2015? Summarize and throw it away.
        
       | alluro2 wrote:
       | I'm quite surprised by the data sizes mentioned in the
       | article, and wondering if I'm missing something... We are a
       | very small 2yo
       | company, handling route optimization and delivery management /
       | field service. Even with our very small number of customers,
       | their relatively small sizes (e.g. number of "tasks" per day),
       | being very early in development in terms of data that we collect
       | - our database containing just customer data for 2 years is
       | ~100GB, which I previously considered small. If we collected
       | useful user metrics, had more elaborate analytics, location
       | tracking history, etc., I would expect it to be at least 3x
       | that.
       | 
       | We don't use any "BigData" products yet, as there wasn't any need
       | for them, even though we provide full search and a relatively
       | nice and rich set of analytics over all the data. Yet, based
       | on the
       | article, we're way above most of the companies relying heavily on
       | such tools. Confusing.
        
       | ankrgyl wrote:
       | I love DuckDB and am cheering for MotherDuck, but I think
       | bragging about how fast you can query small data is really no
       | different than bragging about big data. In reality, big data's
       | success _is not_ about data volume. It's about enabling people
       | to effectively collaborate on data and share a single source of
       | truth.
       | 
       | I don't know much about MotherDuck's plans, but I hope they're
       | focused on making it as easy to collaborate on "small data" as
       | Snowflake/etc. have made it to collaborate on "big data".
        
       | miguelazo wrote:
       | On to the next hype theme(AI)!
        
       | luckydata wrote:
       | It's kinda weird to read this. The whole argument is "we didn't
       | have databases that could handle the sizes and use cases
       | emerging, we worked on the problem for 20 years and now it's no
       | biggie".
       | 
       | Mission accomplished more than big data is dead IMHO.
        
       | andix wrote:
       | I see it all the time: people develop applications that will
       | never ever get a database size of over 100GB and are using big
       | data databases or distributed cloud databases. Often queries only
       | hit a small subset of the data (one customer, one user). So you
       | could easily fit everything into one SQL database.
       | 
       | Using any of the traditional SQL databases takes away a lot of
       | complications. You can do transactions, you can query whatever
       | you want, ...
       | 
       | And if the database may get up to 1TB, still no problem with SQL.
       | If you exceed that, you may need a professional ops team for your
       | database and a few giant servers, but they should easily be able
       | to go up to 10 TB, offload some queries to secondary servers, ...
        
         | primax wrote:
         | I think a key driver of this is not having to use SQL. I like
         | DynamoDB and EdgeDB because I can use a more modern and
         | reasonable language to interact with the database.
        
           | 0xB31B1B wrote:
           | It's really difficult to do any kind of analysis without
           | relational queries. The standard way you do this is to have
           | an app datastore in DDB, and an ETL job that pipes your data
           | into some data warehouse env.
        
           | andix wrote:
           | That's a good point, I also think that there should be some
           | modern alternative to SQL. I really like how you can query
           | databases with LinqPad (c#) and how it renders it into a
           | nested table tree. All relations are clickable/expandable, so
           | if you find something interesting in your result set, you can
           | just expand additional rows from other tables. In the
           | background it just creates SQL via an ORM, and more than
           | once I have copied that generated SQL into a view.
           | 
           | But LINQPad is not that useful unless you get the Pro
           | version; only then do you get code completion. So it's not
           | really the answer to the problem.
        
         | [deleted]
        
         | [deleted]
        
         | tootie wrote:
         | I think a lot of data tech has come full circle and is now
         | mostly just relational databases. Our org is invested in
         | Redshift,
         | which lets us mostly pay as we go. The DB itself is just a
         | Postgres facade on scalable storage with some native connectors
         | to file stores and third-parties. After rolling over our stack
         | like three times, we're now just dumping tons of raw data into
         | staging tables, then creating views on top of them. It's 97%
         | raw SQL with a smattering of python for clunky extractions. And
         | we're now true believers in ELT vs ETL.
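         | 
         | In ELT terms (the table and column names below are
         | hypothetical), that pattern is roughly:
         | 
         |     -- Load: land the raw data untouched in a staging table.
         |     CREATE TABLE staging_events AS
         |     SELECT * FROM raw_source_dump;
         | 
         |     -- Transform: shape it later, inside the warehouse, as
         |     -- views over the raw rows.
         |     CREATE VIEW daily_signups AS
         |     SELECT date_trunc('day', created_at) AS day, count(*) AS n
         |     FROM staging_events
         |     WHERE event_type = 'signup'
         |     GROUP BY 1;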
        
           | threeseed wrote:
           | Redshift with S3 storage is no different to Spark SQL with S3
           | storage.
           | 
           | Both are distributed compute. Except that Spark allows you to
           | mix/match code with SQL.
        
       | mikepk wrote:
       | We need to re-think how to make data _useful_. The fact that the
       | value hasn't materialized after decades of attempts, billions of
       | dollars, and lots of tools and technology suggests that
       | our core assumptions and patterns are wrong.
       | 
       | This post doesn't go far enough. It challenges the assumption
       | that everyone's data is "big data" or that every company's data
       | will eventually grow to be big data. I agree that "big data" was
       | the wrong model. We also need to challenge that all data should
       | be stored in one place (warehouse, lake, lakehouse). We need to
       | challenge that one tool can be used for every data need. We need
       | to challenge how we build systems both from a technology and
       | people standpoint. We need to embrace that the problems and needs
       | of companies _are always changing_.
       | 
       | We are living with conceptual inertia. Many of our patterns are
       | an evolution from the 70's and 80's and the first relational
       | databases. It's time to rethink how we "do data" from first
       | principles.
        
         | blakeburch wrote:
         | The problem is that no tool alone can make data useful. It
         | requires human ingenuity to come up with a theory, gather the
         | required data, then test and verify the theory.
         | 
         | We've gotten to a point where the first and last step get
         | skipped. Business leaders see other companies doing interesting
         | things with data, so the answer must be "gather all the data"!
         | Internal teams end up focused on gathering the data without the
         | context of how it might be used.
         | 
         | We need to train data teams to not focus on the data as the
         | product. Instead, they should be responsible for driving
         | business actions. Gathering and cleaning the data should
         | just be a byproduct of that activity.
        
       | pier25 wrote:
       | > _Are you in the big data one percent?_
       | 
       | Exactly, and I'd go further.
       | 
       | Are you in the perf/scale/data one percent?
       | 
       | So many people worry about scaling when in reality 99% of web
       | apps will never reach above 100 reqs/s.
       | 
       | I've been in web dev for 20+ years. Only once, while working
       | for a big international corporate client, did I have to worry
       | about traffic spikes. And that was just for one of their
       | multiple web apps.
        
       | CommieBobDole wrote:
       | It's not dead, it's just entered the plateau of productivity.
       | Where people use it for whatever it's useful for and don't try to
       | solve every problem with it just because it's the cool new thing.
        
       | taftster wrote:
       | This post was great. Highly recommend reading it through. It
       | gets really good when the author hits "Data is a Liability".
       | 
       |  _> An alternate definition of Big Data is "when the cost of
       | keeping data around is less than the cost of figuring out what to
       | throw away."_
       | 
       | This is exactly it. It's way too hard to go through and make
       | decisions about what to throw away. In many respects, companies
       | are the ultimate hoarders and can't fathom throwing any data away,
       | Just In Case.
       | 
       | Really appreciated the post overall. Very insightful.
       | 
       | As an anecdote to this article, when business folks have come up
       | to me and asked about storing their data in a Big Data facility,
       | I have never found the justification to recommend it. Like, if
       | your data can fit into RAM, what exactly are we talking about Big
       | Data for?
        
       | nemo44x wrote:
       | Why would I use DuckDB instead of Clickhouse or similar? Is it
       | just because I want to have the database embedded in my app and
       | not connect to a server?
        
         | tylerhannan wrote:
         | One great reason to use DuckDB was when ClickHouse took up too
         | much memory on Parquet files.
         | 
         | https://github.com/ClickHouse/ClickHouse/issues/45741#issuec...
         | helps with that though.
         | 
         | Also, clickhouse-local exists
         | https://clickhouse.com/blog/extracting-converting-querying-l...
         | as a thing.
         | 
         | But, yes, when I think of DuckDB...I think embedded use
         | cases...i'm also not a power user.
         | 
         | I also think of this very much as a 'horses for courses' or
         | 'different strokes for different folks' sort of scenario. There
         | is, naturally, overlap because 'analytical data.' But also,
         | there is naturally overlap with R and this giant scary mess of
         | data-munging Perl code I maintain for a side project.
         | 
         | The DuckDB team, the MotherDuck team, the ClickHouse team...we
         | all want your experience interacting with data to be amazing.
         | In some scenarios, ClickHouse is better. In some scenarios,
         | DuckDB. I'm biased (as I work for ClickHouse in DevRel), but I
         | <3 ClickHouse.
         | 
         | Try both. Pick the one that is best for you. Then...you
         | know...tell the other(s) why so that we all can get better at
         | what we do.
        
           | nemo44x wrote:
           | Thanks but I'm looking for specific use cases. Like I get
           | SQLite. And I get Clickhouse. But I just don't get why I'd
           | use DuckDB specifically. I'm sure it's awesome and super
           | useful but I have a gap in my understanding.
        
       | wizwit999 wrote:
       | Perhaps this is true for business data (though I'm skeptical of
       | the claims), but, for example, for security data, this isn't true
       | at all. Collecting cloud, identity, SaaS, and network logs/data
       | can easily exceed hundreds of terabytes. That's a big reason
       | why we're building Matano as a data lake for security.
       | 
       | It seems an odd pitch in general to say, hey my product
       | specifically performs poorly on large datasets.
        
         | simonw wrote:
         | Sounds like you're in the "Big Data One-Percenter" category
         | described at the very bottom of the article.
        
         | CobrastanJorji wrote:
         | On the contrary, identifying what your product is explicitly
         | not aiming to do is extremely helpful. "Big" adds a lot of
         | complexity and pain; most people don't need it; our product
         | avoids that complexity and pain and is the best choice for most
         | people. Seems like a good, simple pitch, and all it requires is
         | the humility to say that your solution isn't the best for some
         | use cases.
        
       | cmollis wrote:
       | we regularly run audits on over 12 years of customer order
       | histories. This requires scanning about 40TB of data and
       | growing. They used to jump through hoops on the Oracle cluster
       | just to get data out for one customer. We pushed all of the order
       | history into s3 parquet using Spark and I can query this in about
       | 20 seconds using Spark or Presto. It's now streamed through
       | Kafka and Spark Structured Streaming, so it's up to date in
       | about 3
       | minutes. The click-bait-y title notwithstanding, I get that not
       | all data is 'big' and duckdb (and datafusion, polars, etc) is
       | probably great for certain use-cases but what I work on every day
       | can't be done on a single machine.
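       | 
       | A hedged sketch of what that kind of audit query looks like
       | (the bucket path, table, and column names are invented; Spark
       | SQL can query Parquet paths directly):
       | 
       |     -- One customer's audit touches only the columns and
       |     -- partitions it needs, not the whole 40TB.
       |     SELECT order_id, order_date, total
       |     FROM parquet.`s3://warehouse/order-history/`
       |     WHERE customer_id = 42
       |       AND order_date >= DATE '2011-01-01';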
        
       | sgt101 wrote:
       | Looks at 15 hr Spark job (running since this morning)
       | 
       | Sighs...
        
       | CobrastanJorji wrote:
       | Tableau's "Medium Data" April Fools Day ad from several years ago
       | still rings amazingly true.
        
       | lucidguppy wrote:
       | Some of Mongo's leveling off is due to the adoption of good
       | JSONB columns in Postgres.
       | 
       | mongo's got sharding out of the box - which is nice - but you
       | have to get your key right or it will suck.
       | 
       | Also no one should want to host a mongo db - unless that's your
       | business.
        
         | threeseed wrote:
         | MongoDB grew revenue 52.8% in the previous financial year [1].
         | 
         | And if there is any levelling off it's going to be because of
         | the move towards cloud managed options e.g. Snowflake,
         | DocumentDB rather than because PostgreSQL decided to add JSONB
         | support.
         | 
         | [1]
         | https://www.macrotrends.net/stocks/charts/MDB/mongodb/revenu...
        
       | gesman wrote:
       | A customer pays a data analytics vendor to tackle a bunch of
       | their [low quality, big size] data.
       | 
       | If you have no tangible capability to do the above, asking the
       | customer "ARE YOU IN THE BIG DATA ONE PERCENT?" will be the
       | quickest way out the door.
        
       | andreygrehov wrote:
       | With all the LLM craziness, this is just the beginning. How else
       | are they going to train all those models? I'm not an expert, just
       | imho.
        
       | hinkley wrote:
       | It's always kind of amazed me how closely Big Data was followed
       | by the KonMari method and it really seems like the nerds were not
       | paying attention to that at all. Or just happy to take a paycheck
       | from people who weren't paying attention.
       | 
       | Hoarding is not a winning strategy.
        
       | fijiaarone wrote:
       | Somewhere along the line people were tricked into thinking that
       | logging was data, and that we needed to turn up every trace
       | log to 11 on every production system.
       | 
       | Logs are where data goes to die.
        
       | posharma wrote:
       | We're going to reach a point where we might say the same thing
       | about large language models. Fine-tuned LMs (based off of
       | their large parents) are going to be the bread and butter.
        
       | LeanderK wrote:
       | Who has ever believed those claims? There's a common saying
       | "garbage in, garbage out" about what happens with all those fancy
       | models if the data quality is not high. That's really
       | independent of dataset size. There's no magic insight you get
       | just because your dataset is bigger. You need a quality
       | analyst to handle your data, regardless of its size.
       | 
       | Also, who thought their company would cease to function because
       | surely they will hit google-scale dataset-sizes in the near
       | future? Impossible for most except the biggest of the biggest
        
       | articsputnik wrote:
       | I love DuckDB's simplicity and think it will solve many problems.
       | Still, transitioning from a local single file DB to concurrent
       | updates and serving it online will be different. I'm curious
       | about what MotherDuck will come up with to solve DuckDB at scale.
       | 
       | I love use cases like the Rill Data
       | (https://youtube.com/watch?v=XvP2-dJ4nVM), where you can suddenly
       | run analytics with a single cmd line prompt and see your data
       | just instantly visualized. Such use cases are only possible
       | because of the "small data" approach that DuckDB takes.
        
       | poorman wrote:
       | This entire post reads like "you probably don't actually have big
       | data".
       | 
       | What about the blockchains that have to keep data around
       | forever, with high throughput, and need to expose it quickly?
       | Are you saying they should delete parts of the data in the
       | chain?
       | 
       | Seriously, I've spent my career working on big data systems, and
       | while the answer is sometimes "yes you need to delete your data",
       | I don't think that's going to always work.
        
         | PeterisP wrote:
         | And what about these blockchains? The full history of Bitcoin
         | blockchain is less than 500 GB, so for any analysis just getting
         | a machine with a terabyte of RAM is both simpler and cheaper
         | (once you include dev+ops time) than doing any horizontal
         | scaling across multiple machines with "Big Data" approaches.
         | 
         | "You probably don't actually have big data" is a very valid
         | point, not that many organizations do - most businesses haven't
         | generated enough actionable data in their lifetime to need more
         | than a single beefy machine without ever deleting data.
        
       | jacobsenscott wrote:
       | nosql is dead, client side SPAs are dead. Nice to see the
       | complexity pendulum swinging back to the correct side again.
       | Curious what the merchants of complexity will reach for next. Are
       | applets going to be the new hot thing?
        
       | low_tech_punk wrote:
       | Long live Big Model, I guess? Instead of independent data
       | warehouses, we are now moving towards a few centralized companies
       | using supercomputers in physical data centers. The "winner takes
       | all" effect will only increase as the trend goes on.
        
       | morelisp wrote:
       | To the extent "Big Data" originally meant, and is still often
       | claimed to mean, "data beyond what fits on a single
       | [process/RAM/disk/etc]",
       | it's always been strange to me how much it's identified with
       | analytics pipelines doing largely trivial transformations
       | producing ultra-expensive "BI" pablum.
       | 
       | Yes, thank goodness that part is dead. But meanwhile - we've
       | still got more _actual data_ than ever to store, and ever-tighter
       | deadlines on finding and delivering it. If we can get back to
       | that and let the PySpark bootcampers fade away, maybe things can
       | get a little better for once.
       | 
       | In other words:
       | 
       |  _Even when querying giant tables, you rarely end up needing to
       | process very much data. Modern analytical databases can do column
       | projection to read only a subset of fields, and partition pruning
       | to read only a narrow date range. They can often go even further
       | with segment elimination to exploit locality in the data via
       | clustering or automatic micro partitioning. Other tricks like
       | computing over compressed data, projection, and predicate
       | pushdown are ways that you can do less IO at query time. And less
       | IO turns into less computation that needs to be done, which turns
       | into lower costs and latency._
       | 
       | Big data is "dead" because data engineers (the programming ones,
       | not the analysts-in-all-but-title) spent a ton of effort building
       | DBs with new techniques that scale better than before, with other
       | storage patterns than before. Someone still has to write and
       | maintain those! And it would be even better if those tools and
       | techniques could escape the half dozen major data cloud companies
       | and be more directly accessible to the average small team.
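       | 
       | Those IO-saving tricks are easy to poke at yourself. A hedged
       | sketch (the file and column names are invented here; DuckDB's
       | EXPLAIN shows whether the filter gets pushed into the Parquet
       | scan):
       | 
       |     -- Column projection: only user_id and event_date are read.
       |     -- Predicate pushdown: row groups whose min/max statistics
       |     -- exclude the date can be skipped entirely.
       |     EXPLAIN
       |     SELECT user_id, count(*)
       |     FROM read_parquet('events/*.parquet')
       |     WHERE event_date = DATE '2023-02-01'
       |     GROUP BY user_id;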
        
       | Flatcircle wrote:
       | Seems like just yesterday, every business magazine's cover story
       | was about "big data." Wonder what the next batch of business buzz
       | words will be?
        
       | fdgsdfogijq wrote:
       | "For more than a decade now, the fact that people have a hard
       | time gaining actionable insights from their data has been blamed
       | on its size."
       | 
       | The real issue is that business people usually ignore what the
       | data says. Wading through data takes a huge amount of thought,
       | which is in short supply. Data Scientists are commonly
       | disregarded by VPs in large corporations, despite the claims
       | about being "data driven". Most corporate decision making is
       | highly political, the needs of/what's best for the business is
       | just one parameter in a complex equation.
        
         | gymbeaux wrote:
         | Replace "data scientists" with "software engineers" and you
         | have another accurate insight. They don't want to listen to us
         | about how to write software or derive value from data.
        
         | hn_throwaway_99 wrote:
         | Agree with some of what you've said, but disagree with a lot:
         | 
         | > Most corporate decision making is highly political, the needs
         | of/what's best for the business is just one parameter in a
         | complex equation.
         | 
         | 100%. Individual humans are emotional creatures with their own
         | wants and needs, and it's important to understand how
         | organizational incentives drive decision making.
         | 
         | > Data Scientists are commonly disregarded by VPs in large
         | corporations, despite the claims about being "data driven".
         | 
         | This has not been my experience, though. The more common
         | thing I've seen is that sometimes the data is boring and
         | doesn't really
         | show much actionable insight, but as everyone wants to justify
         | their job, I've seen data scientists come up with really
         | questionable conclusions that fell apart on further inspection
         | (call it "p-hacking the enterprise").
         | 
         | Plus, a lot of this data in these data warehouses is _messy_.
         | Oftentimes data scientists are siloed at the end of the
         | process, but then you get "garbage in/garbage out" results,
         | where there is some bug in data tracking that isn't understood
         | until it's too late. Much better in my opinion to have data
         | engineers and data scientists work much more closely with
         | product engineering teams up front so they can help ensure the
         | data they collect is accurate.
        
         | mritchie712 wrote:
         | Size isn't the real problem, it's time.
         | 
         | Are you going to take the time / money to set up a warehouse,
         | get all the data into with an ETL product, set up dbt or some
         | other transformation layer, set up a BI tool and build the
         | reports and dashboards, etc.
         | 
         | Regardless the size of your data, you still need to get it in
         | one place and model it in a way that makes it actually usable.
        
           | didgetmaster wrote:
           | Exactly. It isn't just the time to set up all the data in a way
           | that makes the right query possible. It is also having
           | queries fast enough to be able to run a vast number of them
           | in order to find what you are looking for (or even things you
           | were not looking for).
           | 
           | https://didgets.substack.com/p/data-science-and-serendipity
        
           | olyjohn wrote:
           | Can't we just give it to that one IT guy down in the
           | basement?
        
             | mritchie712 wrote:
             | Hey, I used to be that guy (and still am).
        
         | tracerbulletx wrote:
         | I find a lot of organizations don't have the discipline to
         | harness whatever power their data may have. Sure collect
         | everything, but god forbid you have any sort of data
         | governance, or spend a single minute manually tagging,
         | organizing, or validating it. Then they try to make
         | shitty ML models or products out of it, but don't care if the
         | models actually work or not, just that they have AI now. Then a
         | year later when the model has provided no value they are like,
         | oh well big data is worthless I guess.
        
           | hef19898 wrote:
           | Palantir, you have to have Palantir.
           | 
           | Oh, and a bunch of data scientists with zero domain knowledge
           | for whatever data they are analyzing, preferably with PhDs in
           | maths, but some ML background will do. And agile, because of
           | course all those Palantir dashboards can only be developed
           | using agile.
           | 
           | Once all is said and done, zero insight was created but a
           | whole lot of consultants, contractors and project managers
           | have been paid handsomely, while some higher ups can now put
           | "implemented agile and big data at X" on their LinkedIn
           | profiles.
        
         | snapetom wrote:
         | Agree on the size not being the issue. I transitioned from a
         | data engineering manager to a data product manager at a new
         | company. You know how much data a typical customer generates?
         | Less than 1TB a year.
         | 
         | I told my VP that the engineering foul ups in the current
         | product are easily fixable. Standard tooling and patterns exist
         | to re-architect and solve the bottlenecks. What is much harder
         | is finding a data architect who can make sense of the
         | complex data and make sure there is good value for our
         | customers.
         | 
         | Guess what position I don't have on the team, and won't have
         | due to budget issues.
        
         | aicharades wrote:
         | deleted- length
        
           | LeanderK wrote:
           | Of course it won't, and it's really ironic that you reply
           | with the next hype.
        
         | slt2021 wrote:
         | I used to joke that Data Scientists exist not to uncover
         | insights or provide analysis, but merely to provide factoids
         | that confirm senior management's prior beliefs.
         | 
         | I did several experiments, and noticed that whenever I produced
         | analysis that was in line with what management expected - my
         | analysis was praised and widely disseminated. Nobody would even
         | question data completeness, quality, whatever. They would pick
         | some flashy metric like a percentage and run around with it.
         | 
         | Whenever my analysis contradicted them - there was intense
         | scrutiny of the numbers, data quality, etc., and even after
         | answering all questions and concerns - the analysis would be
         | tossed away as non-actionable/useless/etc.
         | 
         | if you want to succeed as a Data Scientist and be praised by
         | management - you got to provide data analysis that supports
         | management's ideas (however wrong or ineffective they might be).
         | 
         | Data Scientist's job is to launder management's intuition using
         | quantitative methods :)
        
           | zeagle wrote:
           | Huh. I've not thought of it as laundering, but I think you've
           | basically summarised consulting in healthcare. Pay to
           | legitimize and push through a pre-existing idea (eg let's
           | close down a few ERs) or a delusion (e.g. lean, we don't need
           | a waiting room) and say it was recommended by consultants to
           | stakeholders and the public.
        
             | saltcured wrote:
             | Right, the more appropriate analogy is "parallel
             | construction"...
        
             | slt2021 wrote:
             | All consulting is like that: Partners/MDs at consulting
             | companies meet with boards/CEOs to get a rough idea of
             | what they want/need, and quickly negotiate a consulting
             | engagement contract to create a PowerPoint with all the
             | evidence and analysis gathered to support the CEO's
             | initial idea.
             | 
             | This is the only reason why a 60+ slide PowerPoint deck
             | can cost several million dollars.
        
           | dylan604 wrote:
           | how is this really different from any other aspect of life?
           | Very few people really like to be told counter information,
           | and it is always easier when providing data that aligns with
           | the current group think. Doesn't matter if it is business,
           | politics, or really anything. Being the outlier trying to
           | change the direction of things is a struggle.
        
           | sva_ wrote:
           | > I used to joke that Data Scientists exist not to uncover
           | insights or provide analysis, but merely to provide factoids
           | that confirm senior management's prior beliefs.
           | 
           | Also someone to blame if it doesn't work out
        
           | sidpatil wrote:
           | > Data Scientist's job is to launder management's intuition
           | using quantitative methods :)
           | 
           | https://www.youtube.com/watch?v=kAichhoZrKs
        
           | alar44 wrote:
           | I mean, that makes sense does it not? If you're confirming
           | something people already had a hunch about, why would they
           | challenge it? And if it does go against their belief, they
           | are going to want to make sure the data is correct before
           | they change the course of the ship.
        
           | pjmorris wrote:
           | > I used to joke that Data Scientists exist not to uncover
           | insights or provide analysis, but merely to provide factoids
           | that confirm senior management's prior beliefs.
           | 
           | So, a synonym for 'consultant?' :)
        
             | didgetmaster wrote:
             | In the news business, if your story or opinion backs up the
             | preconceived notions of the investigative reporter, then you
             | are a 'source'; otherwise you are a 'conspiracy theorist'.
        
               | suction wrote:
               | [dead]
        
             | stronglikedan wrote:
             | A consultant with the _data_ to back up their claims!
        
             | happymellon wrote:
             | My experience with consultants normally ended up with them
             | asking why they were there and what report they should
             | present to upper management.
             | 
             | I've always used them as "independent 3rd parties" who were
             | listened to.
        
               | EricE wrote:
               | "A prophet is not welcome in his hometown"
               | 
               | This has been going on for millennia, unfortunately.
        
               | I_complete_me wrote:
               | Here's my take on this not listening to the "expert":
               | 
               | A few years ago there was a problem with storm-water
               | infiltration into my (elderly at the time) mother's
               | property from her neighbor. I, being a dutiful son and a
               | civil engineer, investigated it and came up with the
               | probable cause, the likely effects of non-action and the
               | most cost-effective solution. I presented it to my mother
               | in the most layman-like terms that I could. She said
               | she'd think about it - meaning she'd refer, i.e. defer,
               | to her daughters. In the meantime I had a very
               | layman-like chat with my mother's carer and told her the
               | situation in layman's terms. The carer listened and said
               | that what I said made total sense to her. Later on, one
               | of my sisters accosted me and stated that it was
               | completely obvious what the problem and the solution was
               | - "even the carer could see it". Human foreheads don't
               | have the real estate for where my eyebrows wanted to
               | ascend.
               | 
               | My advice is to consider whether the message should be
               | separated from the messenger somehow.
        
               | neuronic wrote:
               | As a consultant with roots in backend dev, I fully
               | understand the scrutiny that we receive because
               | unfortunately, it is often very warranted... It feels a
               | bit refreshing to read your comment and see someone
               | articulate what I am trying to convey to my clients. I am
               | a tool, and yes, this pun is intended.
        
               | jacobn wrote:
               | This sounds a lot like how my kids will listen to a
               | teacher/coach, but not their parents...
        
               | rxhernandez wrote:
               | Which is similar to how a lot of parents won't listen to
               | their kids but will listen to the coach, teacher, or
               | priest.
        
             | strbean wrote:
             | If Data Scientists are essentially in-house management
             | consultants, I wonder which is cheaper?
        
               | slt2021 wrote:
               | This could be a reason why the Data Scientist job title
               | exploded in recent years: every middle manager could
               | afford one/two/a few headcounts of data scientists to
               | produce analysis that advances that middle manager's
               | corporate agenda (more growth, empire building,
               | expansion into certain de-novo areas, etc).
               | 
               | The recent tech layoffs are the other side of that
               | growth: cheap money is gone and companies are forced to
               | stick to core competencies and shut down growth plans.
        
             | htrp wrote:
             | All symptoms of the same problem... you can hire McKinsey
             | to confirm your priors, massage the data to confirm your
             | priors, or anything in between.
        
           | disqard wrote:
           | An interesting essay that echoes these same sentiments:
           | 
           | https://ryxcommar.com/2022/11/27/goodbye-data-science/
        
           | monero-xmr wrote:
           | Also a big reason McKinsey and BCG exists - provide cover for
           | business plans intended by management to protect them from
           | shareholder lawsuits. My friend did a sojourn at McKinsey and
           | 6 months of his life was producing PowerPoints and memos
           | backing up an expansion into the APAC region. It was already
           | happening but he was providing all manner of business
           | justification for board meetings and whatnot.
        
           | adasdasdas wrote:
           | Oh, the experiment didn't go as expected? Rerun it 5 more
           | times with minor tweaks. It's definitely not p-hacking ;).
        
             | data-ottawa wrote:
             | I've been there. We wanted to release a feature, and it
             | kept coming back with issues that made it perform much
             | worse than control; after 5 or so iterations with bug
             | fixes it came back positive.
             | 
             | It took a lot of analysis and time to convince higher-ups
             | that we weren't just p-hacking, but at least they were
             | concerned about that.
        
           | peatmoss wrote:
           | > Data Scientist's job is to launder management's intuition
           | using quantitative methods
           | 
           | Ouch. This is savage, but sadly correct in many cases.
           | 
           | HOWEVER, to play devil's advocate here, I've also seen
           | corporate data scientists overstate the conclusions /
           | generalizability of their analysis. I've also seen data
           | scientists fall prey to believing that their analysis proves
           | what _should_ be done, rather than what is likely to happen.
           | 
           | The role of an executive or decision maker is to apply a
           | normative lens to problems. The role of the data scientist /
           | economist / whatever is to reduce the uncertainty that an
           | action will have the desired effect.
        
           | steveBK123 wrote:
           | Yes. I worked in the data org of a moderately sized financial
           | firm's tech org. The tech org claimed to be hugely
           | data-driven. It was in the org mottos and all of that.
           | 
           | Nonetheless, the CTO went on a multi-year, 10s of millions of
           | dollars, huge data tech stack & staffing reorg shake up...
           | with really zero data points explaining the driver, or what
           | we would measure to determine it was successful.
           | 
           | So it became a self-referential decision that we are
           | successful by doing what he decided, and we are doing it
           | because he decided it.
        
           | throwaway15908 wrote:
           | Sounds like a good way to weed out middle management ;)
        
           | xkcd1963 wrote:
           | Reminds me of the book "Bullshit Jobs"
        
           | karmajunkie wrote:
           | "Data launderer" would be a good job title...
        
             | Phrenzy wrote:
             | That would imply the data is clean when they are finished.
             | GIGO
        
             | hermitcrab wrote:
             | Most data is very dirty.
        
             | wolf550e wrote:
             | The data is not laundered. Preconceived ideas and biases
             | are laundered and given scientific sounding justifications.
        
               | e12e wrote:
               | Concept Confirmer, Bias Booster?
        
               | barbecue_sauce wrote:
               | Context Provider.
        
               | e12e wrote:
               | Affirmation Artificier?
        
               | MonkeyClub wrote:
               | Assessment Assurance, for double the bang.
        
           | prometheus76 wrote:
           | I agree with you. I call Data Scientists "soothsayers for the
           | Pharaoh".
        
           | chongli wrote:
           | Same goes for economists and the politicians who sponsor
           | them, just as it did for the astrologers and their patron
           | kings.
        
           | out_of_step wrote:
           | This phenomenon is true to varying degrees in academic
           | medicine (maybe all of academia) as well - personally have
           | seen excellent data and methods disregarded when they don't
           | confirm existing agendas. The choice for the researcher can
           | become one of burning out trying to do good work and getting
           | nowhere, or acquiescing and only presenting data that is
           | uncontroversial. This is a huge existential threat to
           | knowledge advancement.
        
           | terry_y_comb wrote:
           | Indeed, confirmation bias happens to almost everyone.
        
           | m463 wrote:
           | I wonder what a data scientist could really find out about
           | executive (over?) compensation. employee compensation.
           | working from home. office cubicle size and layout. tool
           | expenditure for employees vs productivity.
        
             | ProjectArcturis wrote:
             | How would you measure productivity at scale?
        
           | tpoacher wrote:
           | sadly same with academia and funding sources
        
           | koolba wrote:
           | > if you want to succeed as a Data Scientist and be praised
           | by management - you got to provide data analysis that
           | supports management's ideas (however wrong or ineffective they
           | might be).
           | 
           | > Data Scientist's job is to launder management's intuition
           | using quantitative methods :)
           | 
           | It's no different from the days when grey-bearded wise men
           | would read the stars and weave a tale about the great glory
           | that awaits the king if he proceeds with whatever he already
           | wants to do.
           | 
           | The beards might be a bit shorter or nonexistent, but the
           | story hasn't changed.
        
             | Guthur wrote:
             | And the alternative is to use the data as bones, throw it
             | up in the air and let it tell you what to do?
        
               | red-iron-pine wrote:
               | And you'd better hope the bones actually say something
               | useful.
               | 
               | I was the infra lead on a data lake project and got to
               | take part all the way to breaking down the data and
               | turning it into PowerBI reports. The result was "sell
               | more", and to
               | clients who marketing already identified, years ago, as
               | whales.
               | 
               | There were some interesting other insights, esp. w/r/t to
               | niche products that sold around weird dates (Easter,
               | Memorial Day, 4th July -- but not obvious gift days like
               | Valentines or X-Mas), but it led to a lot of "you're
               | doing it wrong!" recriminations and follow up projects.
        
               | bronson wrote:
               | Absolutely. If you don't like what K-Means is telling
               | you, change a variable and re-run. (that's one great
               | thing about business data: there's no shortage of
               | variables! True, there's usually a shortage of
               | independent variables, but fixing that is difficult and
               | underfunded).
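That "change a variable and re-run" sensitivity is easy to demonstrate. A hand-rolled sketch (standard library only; the data is pure uniform noise, so any "segments" found are artifacts of initialization):

```python
import random

def kmeans(points, k, seed, iters=20):
    """Plain Lloyd's-algorithm k-means; results depend on the seed."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: (p[0] - centroids[c][0]) ** 2
                                      + (p[1] - centroids[c][1]) ** 2)
            clusters[nearest].append(p)
        # recompute centroids (keep the old one if a cluster empties)
        centroids = [(sum(p[0] for p in cl) / len(cl),
                      sum(p[1] for p in cl) / len(cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return sorted(centroids)

# Uniform noise: there are no real clusters, yet k-means dutifully
# reports k of them -- and the answer can shift with the seed.
data_rng = random.Random(0)
data = [(data_rng.random(), data_rng.random()) for _ in range(500)]
run_a = kmeans(data, k=3, seed=1)
run_b = kmeans(data, k=3, seed=2)
print(run_a)
print(run_b)  # compare: different inits can yield different "segments"
```

The algorithm never refuses to answer, which is exactly what makes "re-run until you like it" tempting.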
        
           | SkyBelow wrote:
           | This isn't just "Data Scientist" but scientist as well. The
           | more a finding is in contradiction, either with existing
           | scientific consensus or even with just popular culture, the
           | more the science is criticized. I've seen unequal criticism
           | based on how much people wanted the results to be true/false
           | and even after responding to the criticism I've seen people
           | just ignore science they don't like.
           | 
           | The skepticism isn't the problem; the unequal application of
           | it, the potential to harm careers, and the chilling effect as
           | people wise up to how best to meet their own personal goals
           | are.
        
           | timbaboon wrote:
           | As a data scientist at a large corporate I find this is often
           | the push... but I resist every time and tell people what they
           | don't want to hear. Maybe I'm playing this whole corporate
           | ladder thing wrong :/
        
           | dukeofdoom wrote:
           | That's like a portrait artist who finds success by painting
           | people more beautiful than they really are, vs. a starving
           | one who paints them true to life out of a sense of artistic
           | integrity. Reminds me of how Garth Brooks started doing metal
           | after becoming a country music star.
        
           | eska wrote:
           | This would match what psychologists say about humans in
           | general: we feel first, then we use our brain to justify that
           | feeling. We're not rational beings.
        
             | AnIdiotOnTheNet wrote:
             | We totally are, it's just that rationality is a tool, not a
             | guide. If you want to work out the truth, rationality will
             | help you do that, but if instead you want to justify a
             | decision you already made, well, it'll help you do that
             | too.
        
               | galangalalgol wrote:
               | Hypotheses don't come from rationality either; they
               | result from well-informed intuition. All of the formality
               | of science is about tricking ourselves into discovering
               | that our intuition is wrong, using a rational series of
               | steps, even when everything in our nature pushes us to
               | use that ability to reason to do the opposite.
        
             | diognesofsinope wrote:
             | I think the answer is simpler: people care about their
             | careers and their family first. Think, "If the data says
             | something that gets in the way of my career well I don't
             | care about the data."
             | 
             | Had the same problem when I was an economics researcher --
             | publication bias for what stakeholders want to hear (often
             | the government) is rampant because that's where funding for
             | the economics department mostly comes from.
        
             | trieste92 wrote:
             | Or is it that psychologists _feel_ that we aren't rational
             | and use reason to justify this?
             | 
             | It isn't clear to me how the grounds for realizing the
             | theory are reconcilable with its conclusion.
        
             | moremetadata wrote:
             | That's because psychologists don't understand, or choose to
             | ignore, how chemistry influences our personalities and
             | emotions. An extremely simple example from the same
             | medical/health profession is the use of SSRIs to make
             | people feel happy. The legal system recognises how
             | chemicals influence our feelings, given the laws that
             | exist on illegal drugs or drink driving.
             | 
             | The definition of rational is being informed enough to know
             | what said chemicals will do in the short term and long term
             | in order to make an informed decision, but then I'm
             | reminded we don't get taught any of the above unless we
             | specialise at a uni, so most people can't make any sort of
             | informed decision.
        
             | ArjenM wrote:
             | A whole industry of emotional branding is thriving,
             | systematically overloading our brains so that it hurts to
             | even think differently in the moment.
             | 
             | We are accepting a whole lot of assumptions every day.
        
           | remus wrote:
           | I think this depends a lot on the org. In a place I used to
           | work we collected and analysed a lot of data which convinced
           | management to significantly change the spec of the product
           | and spend a lot more time and effort on testing, because the
           | product was being used in unexpected ways.
           | 
           | I would say it was a very engineering driven org however, so
           | if you could present compelling data it could go a long way.
        
         | boh wrote:
         | The other thing to consider is that much data simply has
         | nothing of value. Part of the marketing of big data is the
         | almost fairy-tale belief that "insights" exist in any data set
         | if you just look hard enough.
        
           | ryguyrg wrote:
           | Correct, much data has no value. The cost for storing the
           | data, maintaining the data [in the day-and-age of privacy
           | requirements especially], and combing through the data is
           | often much greater than the value obtained from the data
           | itself.
           | 
           | The expertise we need in the industry is people who
           | understand applications in-and-out and make great decisions
           | about what data is worth keeping for present and future
           | applications, and what data needs to be kept, but only in
           | aggregate (or anonymized, which reduces maintenance costs).
        
         | revolvingocelot wrote:
         | It's absolutely this. "Decision-based evidence-making" is what
         | I've seen it called.
        
           | hgsgm wrote:
           | "We make decisions based on data, so let's use mine."
        
             | midasuni wrote:
             | Take random words that Brent Spiner has said and mash them
             | into a video.
        
         | Consultant32452 wrote:
         | I don't want you to tell me what the data says, I want to tell
         | you what the data says and you go find data that confirms it.
         | 
         | kthnxbye
        
         | capableweb wrote:
         | Is that what's happening at Amazon as well? They seem to be
         | losing more and more track of the "Customer Obsession"
         | schtick.
        
           | bluedays wrote:
           | Their mission hasn't changed. They're still obsessed with
           | customers, just not the way you think.
        
           | aintgonnatakeit wrote:
           | They are encouraging their customers to have a bias toward
           | action. Away from that asshole Bezos.
        
           | fuzzylightbulb wrote:
           | "customer obsession" was always at the mercy of the real
           | obsession: "making money hand over fist". The former will
           | ALWAYS lose out to the latter given enough cycles.
        
             | [deleted]
        
             | pphysch wrote:
             | Yeah, "customer obsession" really just means "market share
             | / growth obsession" which is a means to (eventually) making
             | monopoly profits. Which Amazon seems to have achieved.
        
               | LarryMullins wrote:
               | All the amazon corporate values are derived from making
               | profit. Two-pizza teams? More like "three slices isn't
               | frugal", one or two should be enough for you.
        
               | ethbr0 wrote:
               | One may follow the other, but not vice versa.
               | 
               | It's a pretty strong argument to say that Microsoft under
               | Gates was technically obsessed, but that really faltered
               | under Ballmer.
               | 
               | Microsoft continued to win profits, but they made major
               | strategic missteps that cost them revenue.
               | 
               | Amazon feels like it's going down the same path:
               | empowering the tree-gazers without remembering that the
               | forest also matters.
        
               | [deleted]
        
         | ren_engineer wrote:
         | "data" is usually just twisted to make leadership look good or
         | justify what decision they wanted to make anyway. Analytics
         | data is sliced at arbitrary time periods to make growth in
         | whatever metric look good, certain subsets are just removed,
         | etc.
         | 
         | doesn't help that most of this data goes through multiple
         | layers of BS where each person is putting it through filters to
         | make themselves look better. And a good chunk of people don't
         | have enough understanding of stats to understand when they are
         | being tricked
        
         | commandlinefan wrote:
         | > a huge amount of thought
         | 
         | Which itself both takes time and is wildly unpredictable -
         | neither of which plays well with today's Taylorist management
         | schemes.
        
       | ralph84 wrote:
       | Big Data got replaced by Big Parameters.
        
         | hgsgm wrote:
         | Parameters come from data.
        
       | therealbilly wrote:
       | I think server hardware solved the big data issue. The stuff we
       | have now can blitz through data in the blink of an eye. For
       | national governments like our own, mainframes still have a place.
       | For me personally, I don't even talk about big data anymore.
        
       | revskill wrote:
       | The main goal of Big Data, as I see it, is to profile
       | performance and metrics: number of user registrations, number of
       | converted users, ...
        
       | lern_too_spel wrote:
       | People don't want to deal with having to rearchitect when their
       | workload does not fit on a single instance. Yes, optimize for the
       | small data case, but if you build a product that can handle only
       | the small data case, you have a tough sell.
        
       | [deleted]
        
       | rvieira wrote:
       | What about IoT?
        
       | blakeburch wrote:
       | Great post and really resonates with my experience. Good to have
       | some confirmation that most organizations aren't using their
       | large swaths of data.
       | 
       | Although I don't think most organizations are blaming lack of
       | actionable insights on the data size. It's the lack of
       | prioritizing data usage over data accessibility. We need to be
       | teaching data people business levers and teaching business people
       | data levers.
       | 
       | Data should be a byproduct of an actionable idea that you want to
       | execute. It shouldn't exist until you have that experiment in
       | mind.
        
       | bfrog wrote:
       | This reminds me of a great blog post by Frank McSherry
       | (Materialize, timely dataflow, etc) talking about how using the
       | right tools on a laptop could beat out a bunch of these JVM
       | distributed querying tools because... data locality basically.
       | 
       | https://github.com/frankmcsherry/blog/blob/master/posts/2015...
        
       | cmrdporcupine wrote:
       | From about 2008/2009/2010 or so on there was perhaps an over-
       | emphasis on specialized tools for the mass acquisition of streams
       | of data. Maybe in large part due to the explosion of $$ in ad-
       | tech. Some people had legitimately insane click/impression
       | streams -- I worked at a couple companies like that. Development
       | of DBs based on LSM trees or other write-specialized storage
       | structures became important. Existing relational databases
       | weren't particularly well built for this stuff. This was part of,
       | but not the whole story behind, the NoSQL thing. People were
       | willing to go completely denormalized in order to gain some
       | advantage or ability here. It helped that much of the data looked
       | at was of perhaps little structural complexity.
       | 
       | In the meantime SSD storage took off, so the IOPS from a stock
       | drive have skyrocketed, business domains for large data sets have
       | broadened beyond click/impression streams, and the challenge now
       | is not "can I store all this data?" but "WTH do I do with it?"
       | 
       | Regardless of quantity of data, structuring and analysis and
       | querying of said data remains paramount. The challenge for
       | anybody working with data is to represent and extract knowledge.
       | I remain convinced that logic -- first order logic and its
       | offshoot in the relational model -- remains the best tool for
       | reasoning about knowledge. Codd's prognostications on data from
       | the 1970s are still profound.
       | 
       | I think we're in a space now where we can turn our attention to
       | knowledge management, not just accumulating streams of
       | unstructured data. The challenge in a business is to discover and
       | capture the rules and relationship in data. SQL is an existing
       | but poor tool for this, based on some of the concepts in the
       | relational model but tossing them together in a relatively
       | uncomposable and awkward way (though it remains better than the
       | dog's breakfast of "NoSQL" alternatives that were tossed together
       | for a while there.)
       | 
       | My employer is working in this space, I think they have a really
       | good product: https://relational.ai/
        
       | cubefox wrote:
       | This is a bit ironic given that generative AI models like GPT-3
       | and Dall-E only work because they were trained on very large
       | datasets.
        
       | hugesniff wrote:
       | "Very often when a data warehousing customer moves from an
       | environment where they didn't have separation of storage and
       | compute into one where they do have it, their storage usage grows
       | tremendously..."
       | 
       | Can someone explain why this is the case? Is it due to more
       | replications or maintaining more indices?
        
       | datan3rd wrote:
       | Detailed web event telemetry is where I have seen the "biggest"
       | data, not application-generated data. Orders, customers, products
       | will always be within reasonable limits. Generating 100s of
       | events (and their associated properties) for every single
       | page/app view to track impressions, clicks, scrolls, page-quality
       | measurements can get you to billions of rows and TBs of data
       | pretty quickly for a moderately popular site. Convincing
       | technical leaders to delete old, unused data has been difficult;
       | convincing product owners to instrument fewer events is even
       | harder.
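A back-of-the-envelope check of why per-view telemetry dwarfs application data (all figures hypothetical, chosen only to match the "moderately popular site" scale above):

```python
# All figures are hypothetical; substitute your own traffic numbers.
page_views_per_day = 1_000_000   # a moderately popular site
events_per_view = 200            # impressions, clicks, scrolls, ...
bytes_per_event = 100            # a compact row with a few properties

rows_per_year = page_views_per_day * events_per_view * 365
tb_per_year = rows_per_year * bytes_per_event / 1e12

print(f"{rows_per_year:,} event rows/year")   # 73,000,000,000
print(f"{tb_per_year:.1f} TB/year (raw)")     # 7.3 TB/year
```

Orders, customers, and products stay in the millions of rows; telemetry alone reaches tens of billions, even before compression or indexing is considered.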
        
       | AaronBBrown wrote:
       | The truth is that most "big data" problems aren't big and can
       | often be solved with awk and xargs.
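In the same spirit (sketched here in Python rather than awk; the order log is invented), a streaming group-by needs no cluster: memory stays proportional to the number of distinct groups, not the size of the file.

```python
import csv
import io
from collections import Counter

# Hypothetical "big data" job: revenue per country from an order log.
# Replace the StringIO with open("orders.csv") for a real file; the
# loop reads one row at a time, so file size barely matters.
orders_csv = io.StringIO(
    "country,amount\n"
    "US,10.50\n"
    "DE,7.25\n"
    "US,3.00\n"
)

revenue = Counter()
for row in csv.DictReader(orders_csv):
    revenue[row["country"]] += float(row["amount"])

print(revenue.most_common())  # [('US', 13.5), ('DE', 7.25)]
```

Sharding such a job across files with xargs, as the comment suggests, parallelizes it with no framework at all.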
        
       | zX41ZdbW wrote:
       | My presentation from FOSDEM 2023 is very sympathetic to the "Big
       | data is dead" statement:
       | https://www.youtube.com/watch?v=JlcI2Vfz_uk
       | 
       | It is about using modern tools (ClickHouse) for data engineering
       | without the fluff - when you can take whatever dataset or data
       | stream and make what you need without the need for complex
       | infrastructure.
       | 
       | Nevertheless, the statement "big data is dead" is short-sighted,
       | and I don't entirely share that opinion.
       | 
       | For example, here is one of ClickHouse's use-case:
       | 
       | > Main cluster is 110PB nvme storage, 100k+ cpu cores, 800TB ram.
       | The uncompressed data size on the main cluster is 1EB.
       | 
       | And when you have this sort of data for realtime processing, no
       | other technology can help you.
        
       | fredliu wrote:
       | The title might be hyperbole (intentionally), but the
       | observations are more or less in line with what I experienced
       | through a few of the Big Data initiatives over the years in
       | different enterprise environments (although I have reservations
       | about the "1%er" comment). To me, Big Data was never about how
       | "big" the data was, but more about the tools/system/practice
       | needed to overcome the limitation of the previous generation.
       | From that perspective, yes, the "monolith" may be having a
       | "coming back" for now due to the improvement of underlying single
       | node performance. But I do think data sizes will keep growing,
       | and everything needed to make Big Data work will still be there
       | when the pendulum swings back to where a single node can't
       | handle it anymore.
        
       | freedude wrote:
       | "Among customers who were using the service heavily, the median
       | data storage size was much less than 100 GB"
       | 
       | Eye-opening. Especially when combined with a recent quote from
       | Satya Nadella, "First, as we saw customers accelerate their
       | digital spend during the pandemic, we're now seeing them optimize
       | their digital spend to do more with less."
       | 
       | Conclusion: SaaS is easy to drop off in downturns. Just as easy
       | as it is to buy initially.
        
       | carlineng wrote:
       | MotherDuck has been making the rounds with a big funding
       | announcement [1], and a lot of posts like this one. As a life-
       | long data industry person, I agree with nearly all of what Jordan
       | and Ryan are saying. It all tracks with my personal experience on
       | both the customer and vendor side of "Big Data".
       | 
       | That being said, what's the product? The website says
       | "Commercializing DuckDB", but that doesn't give much of an idea
       | of what they're offering. DuckDB is already super easy to use out
       | of the box, so what's their value-add? It's still a super young
       | company, so I'm sure all that is being figured out as we speak,
       | but if any MotherDuckers are on here, I'd love to hear more about
       | the actual thing that you're building.
       | 
       | [1]: https://techcrunch.com/2022/11/15/motherduck-secures-
       | investm...
        
         | dangwhy wrote:
         | > DuckDB is already super easy to use out of the box, so what's
         | their value-add?
         | 
         | I think this is the analytics equivalent of edge computing.
         | Instead of one big cluster crunching numbers:
         | 
         | 1. User requests bunch of analytics
         | 
         | 2. Server assembles a duckdb file
         | 
         | 3. Sends this down to users laptop
         | 
         | 4. User runs local queries on the duckfile
         | 
         | 5. Go to step 1 for more analytics
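          | 
          | The flow above could be sketched roughly like so. This is a
          | speculative illustration of the pattern, not MotherDuck's
          | actual design, and it uses Python's built-in sqlite3 as a
          | stand-in for DuckDB (whose Python API follows the same
          | connect/execute pattern); all file, table, and column names
          | are made up:

```python
import sqlite3

# "Server" side (hypothetical): assemble a self-contained database file
# holding pre-computed analytics for one user. sqlite3 stands in for
# DuckDB here; with DuckDB you would open the file with duckdb.connect()
# in much the same way.
server_con = sqlite3.connect("user_analytics.db")
server_con.execute("CREATE TABLE daily_revenue (day TEXT, revenue REAL)")
server_con.executemany(
    "INSERT INTO daily_revenue VALUES (?, ?)",
    [("2023-02-01", 1200.50), ("2023-02-02", 980.25)],
)
server_con.commit()
server_con.close()

# "Client" side: the same file, shipped down to the laptop, is queried
# locally -- no round trip to a central cluster for each query.
client_con = sqlite3.connect("user_analytics.db")
total = client_con.execute(
    "SELECT SUM(revenue) FROM daily_revenue"
).fetchone()[0]
print(total)  # 2180.75
client_con.close()
```

          | The point of the pattern is step 4: once the file is on the
          | laptop, every follow-up query is local and essentially free.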
        
         | jtigani wrote:
         | We're being a bit hand-wavy with the offering while we're in
         | "build" mode, because we don't want to sell vaporware. DuckDB
         | is easy to use out of the box, but so is Postgres, and there
         | are plenty of folks building interesting cloud services using
         | Postgres, from Aurora to Neon. And as many people will point
         | out, DuckDB is not a data platform on its own.
         | 
         | For a preview of what we're doing, on the technical side, a
          | couple of our engineers gave a talk at DuckCon last week in
          | Brussels; it is on YouTube here:
         | https://www.youtube.com/watch?v=tNNaG7e8_n8
         | 
         | (for context I'm the author of this blog post and co-founder of
         | MotherDuck)
        
         | danielmarkbruce wrote:
         | Deliberately speculating so someone will correct it: I'd guess
         | they'll make a bunch of enterprise tools to do things like:
          | enable access and sync the data in a way that complies with
          | various policies, encrypt/tokenize/hide certain columns,
         | monitor queries, ensure data is encrypted at rest, stuff like
         | that.
         | 
          | Assuming the above is true: I'll bet the reason they aren't so
         | loud about exactly what they are doing is they want to get a
         | head start on it. In theory anyone can build this stuff around
         | DuckDB. From a marketing perspective the clever thing to do
         | would be drive up usage of DuckDB while they build out all this
         | functionality and then the minute corporates start seeing
         | problems with their people using it (compliance etc), they have
         | the solutions.
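          | 
          | To make the column-hiding idea concrete, here is one flavor
          | of it in plain SQL: analysts query a view that masks PII
          | instead of the raw table. This is purely illustrative, not a
          | MotherDuck feature; table and column names are invented, and
          | Python's built-in sqlite3 is used so the sketch is
          | self-contained:

```python
import sqlite3

# Raw table holds sensitive data; the view exposes a masked version.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER, email TEXT, balance REAL)")
con.execute("INSERT INTO customers VALUES (1, 'a@example.com', 42.0)")
con.execute("""
    CREATE VIEW customers_masked AS
    SELECT id, '***' AS email, balance FROM customers
""")

# Analysts would be granted access to the view, never the raw table.
row = con.execute("SELECT * FROM customers_masked").fetchone()
print(row)  # (1, '***', 42.0)
```

          | The hard enterprise part isn't this SQL, of course; it's the
          | access control, auditing, and policy management wrapped
          | around it.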
        
           | carlineng wrote:
           | I'd wager you're right. All the "boring" stuff that's
           | actually very complicated/difficult, and without which no
           | large enterprise will adopt a technology.
        
             | threeseed wrote:
             | Especially since enterprise companies hate the idea of
             | shifting large amounts of highly sensitive company data
             | onto commonly lost and misplaced work laptops.
             | 
             | If you're going to do that you better have your security
             | and governance on point.
        
             | vgt wrote:
             | Shoot me a note and I'm happy to fill you in!
             | 
             | (PM at MotherDuck)
        
               | carlineng wrote:
               | Would love to! How do I get in touch? My contact info is
               | in my profile.
        
               | vgt wrote:
               | done!
        
               | danielmarkbruce wrote:
               | Is there a reason you can't post it here?
        
       | anon223345 wrote:
       | Long live big data!
        
       | Agingcoder wrote:
       | I remember the big data craze. People had very little data and
       | low quality at that so they had a data problem before they had a
       | big data one!
        
         | mejakethomas wrote:
         | Yes! This!!!
         | 
         | Volume != Quality
        
       | kthejoker2 wrote:
       | So the argument is you can do everything with an OLAP Database
       | because we shrunk "Big Data" back inside RAM?
       | 
       | K, good luck!
        
       | spaintech wrote:
        | Not that big data is dead; more like real-time data is coming
        | to life, but you need the old stuff around to make a buck or
        | two... Well, that's my view. LLMs and transformer-model
        | techniques are making data more relevant than ever. If you are
        | a business, well, you are in for a "now real" digital
        | transformation.
        | 
        | Making data the centerpiece of your business could mean that
        | the effectiveness of your business processes increases by
        | several orders of magnitude. Funny thing is, you will not use
        | someone else's model, unless you are building a chatbot to
        | infer; you will need to build your own model, trained on your
        | own business processes, to be successful.
       | 
       | Consider a bank, here is my prediction of expected outcomes:
       | 
        | Enhanced Customer Experience: The system can act as a virtual
        | banking assistant, providing customers with instant access to
        | their account information, real-time transactions, and balance
        | updates. The system can also answer customer inquiries and
        | provide relevant information, improving the overall customer
        | experience.
        | 
        | Improved Fraud Detection: The system can monitor the bank's
        | financial transactions in real time and identify any potential
        | fraud, helping the bank reduce its exposure to financial
        | losses.
        | 
        | Automated Loan Processing: The system can analyze loan
        | applications, credit scores, and other relevant data to
        | approve or reject loan applications in real time, reducing the
        | time and effort required for manual loan processing.
        | 
        | Personalized Marketing: The system can analyze customer
        | behavior, transaction history, and demographic information to
        | provide personalized marketing and cross-selling
        | opportunities, increasing the bank's revenue and customer
        | loyalty.
        | 
        | Real-Time Insights: The system can provide real-time insights
        | into the bank's financial performance, customer behavior, and
        | market trends, enabling the bank to make informed decisions
        | and respond to market changes quickly.
       | 
       | What is interesting to me is, this is just the beginning of what
       | could be...
        
         | mr_tristan wrote:
         | Yeah, I've noticed more applications just need to focus on
         | making sense of raw information really quickly, but usually
         | don't need an archive to make decisions.
         | 
         | There are lots of interesting things that can happen with "big
         | streaming" than necessarily "big data". Like, cybersecurity is
          | evolving to monitoring and reacting to what everyone's machine is
         | doing in the last 15 minutes, instead of having a huge database
         | of hashes you trust. But not a ton of things really utilize
         | what happened, say, 10 years ago on people's machines.
         | 
          | There are definitely some things that can use massive archives of
         | old data, but I have found far, far fewer things that would
         | benefit from it, and often that comes with some very big
         | maintenance hassles. Most of the time, you can just set data
         | retention to 30 days and be done.
        
         | threeseed wrote:
         | I assume you've never actually worked at a bank.
         | 
         | They've been working to implement your ideas for decades and
         | none of it requires LLMs or any machine learning techniques.
         | Basic old ETL is more than sufficient.
         | 
         | The issue is that (a) the calculations they need to perform are
          | complex and take time to run, (b) there are financial
          | regulations that weave their way through those systems, and
          | (c) there is a lot of legacy code, especially in the core
          | ledger system, which "just works" and people are reluctant
          | to touch.
         | 
         | That said depending on your bank you can get real-time account
         | activity, loan approvals in < 5 minutes etc.
        
           | spaintech wrote:
            | Well, that is an understatement. I do agree with you that
            | banks have been trying to fix decades-old applications.
            | 
            | But in this process, you don't need ETL, nor all the
            | process and development, to accomplish these ideas.
            | Conceptually, the system builds itself (it learns how to
            | treat the data), quite revealing and near real time.
            | Assuming you account for security and privacy, you
            | basically shift your input into the data stream and, using
            | natural language, get the data output you need, not clunky
            | apps.
            | 
            | Imagine I just log in, and say:
            | 
            | me> How much do I have?
            | bank> You have $100
            | me> Please send $50 to 1003
            | bank> Are you sure? Please add your security code to
            | confirm
            | 
            | bla bla...
            | 
            | All this with little intervention.
            | 
            | Banks spend hundreds of man-hours developing a lacking
            | application while delivering a very poor customer
            | experience. They spend millions on running decades-old
            | applications because it is so expensive to change them...
            | and thus the circle continues...
            | 
            | I'm really excited to see databases disappear conceptually:
            | data entry, mostly all of that, just disappears... I will
            | ask my chatbot for a statement, get personal investment
            | advice, and classify all my purchases and see where my
            | wife has been spending all my money, all from the comfort
            | of my phone.
            | 
            | It's a brave new world we are waking up to, and that to me
            | is exciting. Coming from having helped several major banks
            | build their infrastructure, it's just a boost to talk
            | about something fresh: no more hypervisors, core counts,
            | db licenses, etc. OK, I'll concede it's pretty much the
            | same old, just the mnemonics will be different... How many
            | GPUs, how quickly can you spin up a container, how fast is
            | your S3 datastore... oh wait, there is that circle
            | again... >:D
        
             | threeseed wrote:
              | So you're not actually talking about back-end systems
              | but about the front-end.
             | 
             | In that case, chat-bots have existed for years and
             | consumers largely don't like them.
             | 
             | In your scenario you can transfer money in a few clicks
             | rather than having to write out an entire conversation.
        
       ___________________________________________________________________
       (page generated 2023-02-07 23:00 UTC)