[HN Gopher] Big data is dead ___________________________________________________________________ Big data is dead Author : munchor Score : 554 points Date : 2023-02-07 16:34 UTC (6 hours ago) (HTM) web link (motherduck.com) (TXT) w3m dump (motherduck.com) | glogla wrote: | I agree with a lot of the sentiments of the MotherDuck people, | but boy are they loud and proud for someone who never delivered | anything more than blog posts and a vague promise to somehow exploit | the MIT-licensed DuckDB. | | Meanwhile, for example, boilingdata.com seems to have already done | that - by using AWS Lambda + DuckDB as a distributed compute engine, | which I can't decide if it's awesome, deranged, or both. | mytherin wrote: | We (the DuckDB team) are very happy working together with | MotherDuck in a close partnership [1]. | | [1] https://duckdblabs.com/news/2022/11/15/motherduck- | partnershi... | singularity2001 wrote: | Big Data lives on in LLMs. | KaiserPro wrote: | big data isn't big anymore. | | 1) 10 years ago, having access to 300tb of data that could | sustain 10 gigabytes/s of throughput would require something like | two racks of disks with some SSD cache and junk. | | 2) people thought hadoop was a good idea | | 3) people assumed that everything could be solved with map-reduce | | 4) machine learning was much less of a thing. | | 5) people realised that postgres does virtually everything that | mongo claimed it could. | | 6) people realised that cassandra was a very expensive way to | make a write-only database. | | I gave a talk about using big data, and basically at the time the | best definition I could come up with was "anything that's too big | to reasonably fit in one computer, so think four 60-disk direct- | attached SAS boxes". | | Most of the time people were chasing the stuff for the CV, rather | than actually stopping to think if it was a good idea (think k8s | two years ago, chatGPT now, chat bots in 2020). 
Most businesses | just wanted metrics, and instead of building metrics into the | app, they decided to boil the ocean by parsing unstructured logs. | | Not surprisingly, it turned to shit pretty quick. Nowadays people | are much better at building metrics generation directly into | apps, so it's much easier to plot and correlate stuff. | jl6 wrote: | To add to the "the real issue is..." pile: | | Most orgs collect the data that is easy to collect, and they are | extremely lucky if that happens to be the data that enables the | insights they desire. When the data they really _need_ looks too | hard to get, the org tries to compensate by collecting more of | the easy stuff, and hoping that if blood can't be squeezed out of | a stone, maybe it can be squeezed out of 100bn stones. | meindnoch wrote: | Good riddance. | itamarst wrote: | This is an excellent summary, but it glosses over part of the | problem (perhaps because the author has an obvious, and often | quite good, solution, namely DuckDB). | | The implicit problem is that even if the dataset fits in memory, | the software processing that data often uses more RAM than the | machine has. And unlike using too much CPU, which just slows you | down, using too much memory means your process is either dead or | so slow it may as well be. It's _really easy_ to use way too much | memory with e.g. Pandas. And there are three ways to approach this: | | * As mentioned in the article, throw more money at the problem | with cloud VMs. This gets expensive at scale, can be a pain, | and (unless you pursue the next two solutions) is in some sense a | workaround. | | * Better data processing tools: use a tool smart enough to apply | efficient query planning and streaming algorithms to | limit memory usage. 
There's DuckDB, obviously, and Polars; here's a | writeup I did showing how Polars uses much less memory than | Pandas for the same query: | https://pythonspeed.com/articles/polars-memory-pandas/ | | * Better visibility/observability: make it easier to actually see | where memory usage is coming from, so that the problems can be | fixed. It's often very difficult to get good visibility here, | partially because the tooling for performance and memory is often | biased towards web apps, which have different requirements than | data processing. In particular, the bottleneck is _peak_ memory, | which requires a particular kind of memory profiling. | | In the Python world, relevant memory profilers are pretty new. | The most popular open source one at this point is Memray | (https://bloomberg.github.io/memray/), but I also maintain Fil | (https://pythonspeed.com/fil/). Both can give you visibility into | sources of memory usage that were previously painfully difficult to | get. On the commercial side, I'm working on | https://sciagraph.com, which does memory and also performance | profiling for Python data processing applications, and is | designed to support running in development but also in | production. | H8crilA wrote: | Big data starts somewhere around a petabyte, maybe a bit lower | than that. That's when you need some serious, dedicated systems. | But as always, everyone wants to (pretend to) do what the big | players do. | travisgriggs wrote: | I've made anecdotal observations similar to this over the last | 10 years. I work in AgTech. A big push for a while here has been | "more and more data". Sensor-the-heck out of your farm, and | We'll Tell You Things(tm). | | Most of what we as an industry are able to tell growers is stuff | they already know or suspect. 
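A minimal sketch of the streaming idea itamarst describes above, in plain Python - not DuckDB or Polars themselves, just the one-pass pattern their engines exploit (the CSV data and column names here are made up):

```python
import csv
import io

# A small in-memory CSV standing in for a file too big for RAM
# (hypothetical "id,value" schema).
data = "id,value\n" + "\n".join(
    f"sensor_{i % 3},{i * 0.5}" for i in range(1000)
)

def mean_value(fileobj):
    """One pass over the rows, O(1) memory: the table is never
    materialized, unlike a naive pandas.read_csv of the whole file."""
    total = 0.0
    count = 0
    for row in csv.DictReader(fileobj):
        total += float(row["value"])
        count += 1
    return total / count

print(mean_value(io.StringIO(data)))  # -> 249.75
```

Query planners in DuckDB and Polars apply this kind of streaming (plus predicate pushdown) automatically, which is why peak memory can stay far below the dataset size.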
There is the occasional surprise or | "Aha" moment where some correlation becomes apparent, but the | thing about these is that once they've been observed and | understood, the value of ongoing observation drops rapidly. | | A great example of this is soil moisture sensors. Every farmer | that puts these in goes geek-crazy for the first year or so. It's | so cool to see charts that illustrate the effect of their | irrigation efforts. They may even learn a little and make some | adjustments. But once those adjustments and that knowledge have been | applied, it's not like they really need this ongoing telemetry | as much anymore. They'll check periodically (maybe) to continue | to validate their new assumptions, but 3 years later, the probes | are often forgotten and left to rot, or reduced in count. | jschveibinz wrote: | This analysis reminds me of the big interest in the use of | hyperspectral imaging for agriculture. The idea was that the greater | spectral resolution (greater than Landsat's) would yield more | interesting information. Agriculture was one of the | applications. But once you did find the interesting stuff, you | no longer needed a hyperspectral sensor. You could just look at | one spot with a much lower cost sensor. | | So hyperspectral, like big data, is useful up front. But in the | end, much simpler tools and algorithms will solve the problem | on a continuing basis. | barathr wrote: | Classic paper on soil moisture sensors (from 2010!) -- the | title says it all: | | "Mate, we don't need a chip to tell us the soil's dry" | | https://doi.org/10.1145/1858171.1858211 | calvinmorrison wrote: | On the flip side, there's some great action coming from data | insights. Look at Strella Biotech - they're putting sensors in | sealed warehouses to detect spoilage for certain vegetables and | fruits. That's something that can have great returns with just | a few IoT devices and a novel sensor. 
| didip wrote: | I've been telling folks, storing everything all the time is | wasteful, a better alternative is: | | 1. Keep the raw full data for short period of time, at most 1 | month. | | 2. Downsample what you need for longer period of time (5-10% of | the full data). | | 3. Aggregate your metrics on a yearly basis to save money and | compute costs. | ff317 wrote: | I tend to think the problem is the "random digging for | correlations" part. | | Having tons of data is a Good Thing, so long as you can afford | the marginal cost of gathering and managing all that data so | that it's ready at hand when you need it later. | | It's how you use the data that makes all the difference. If | you're facing an issue you don't understand at all, don't go | digging for random correlations in your mountain of data to | find an explanation. | | Think like a scientist: you need a valid hypothesis first! Once | you have a hypothesis about what your issue might plausibly be, | then you make a prediction: "If I'm right, I suspect our Foobar | data will show very low values of Xyzzy around 3AM every | weekday night". Only then do you go look at that specific data | to confirm or refute the hypothesis. If you don't get a | confirmation, you need to go back to hypothesizing and | predicting before you look again. You can't prove causation by | merely correlating data. | counters wrote: | > It's how you use the data that makes all the difference. If | you're facing an issue you don't understand at all, don't go | digging for random correlations in your mountain of data to | find an explanation. | | Absolutely. But in my experience, there's this massive trend | across the tech world that flat out rejects the value of | domain/subject matter expertise. Instead, all you need is an | engineer who can throw some ML at the uncurated mountain of | data your organization has collected. 
Little to no value is | placed on the resources that can frame an actionable | hypothesis, even though the entire value proposition arises | from this exercise! | | Maybe I'm just jaded. I end up wasting a lot of time trying | to re-direct data scientists and engineers down more | appropriate pathways than if the problem they're solving was | just brought to my attention earlier. Sorry, I understand you | spent two weeks shoe-horning dataset X into our analysis | system for your work, but it's invalid for the question | you're asking - use dataset Y instead, and you'll have an | answer in an hour or two. | PeterisP wrote: | Fine-grained measurement is useful when you have options for | fine-grained action. | | You don't need a chip to tell you that the soil is dry, but if | you can use that chip to regulate drip irrigation that can | apply substantially different flow to different plants, then | you can get a not-too-much, not-too-little watering even if you | have a big variation in conditions. | | You don't need a big analysis to acknowledge that everybody | knows that a particular competitor has lower or higher prices | and adjust your pricing; but doing that continuously on a per- | product basis does require data and analysis. | ryguyrg wrote: | Agreed. But how many executives will agree to take these | fine-grained actions to achieve value from the data? How many | data teams are able to build up a strong-enough argument to | convince them? | | I've worked on many product-led-growth initiatives in the | software industry. The software industry is probably the | biggest 'believer in data' there is -- many scientific- | forward minds who understand the value. However, even in the | software industry, it's really hard to convince folks that if | you make 5 improvements that net 1% conversion gain each, you | can dramatically improve revenue. 
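The compounding behind ryguyrg's point above is easy to understate: five separate +1% conversion improvements multiply rather than add, and across repeated funnel stages or release cycles the gap widens. A quick sketch:

```python
# Five independent +1% conversion improvements compound
# multiplicatively: (1.01)^5, not 1 + 5*0.01.
combined = 1.0
for gain in [0.01] * 5:
    combined *= 1 + gain

print(f"{(combined - 1) * 100:.2f}%")  # -> 5.10%
```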
| chudi wrote: | Most of the time this story is true, but think of it this way: the | person who was using the system was an expert on the subject. | If you can replace the expert with a person just looking at a | graph from time to time to know whether you have to irrigate the | soil, that's a different thing. Most of the data or ML tools show | us something that the client, as an expert, already knows, but | the true power of these tools is that you can give them to a non-expert | user and get roughly the same level of proficiency. | e12e wrote: | But isn't this the essence of industrialization and automation? | Measure, adjust the process; repeat until the feedback loop is stable - | document and keep doing the thing that works, over and over? | | If you want Toyota-style continuous improvement, you would need | to improve in new areas of the process / new metrics, most of | the time? | azubinski wrote: | Oh, those soil moisture sensors, they are so fascinating. | | I spent a number of exciting years developing a high-frequency | soil impedance scanner and finally understood why I was doing | it. To confirm the obvious :) | 0xdeadbeefbabe wrote: | > goes geek-crazy for the first year or so | | The problem is that they don't stay geek-crazy? | ladyattis wrote: | I think there's a problem at the heart of the matter, | specifically the idea that the act of measurement is in itself | powerful, when in point of fact this isn't universally the | case. As the old adage goes: "garbage in, garbage out." Even | more troubling, there is a physical limit to our ability to | model what we measure. Take the retina: it has around a million | light receptors, and even if you assume they have only two | valid states, you're left with around 10^300,000 possible | states to process, so good luck with that. 
Same thing | applies to whatever firms are measuring and what they think is | conveying relevant information, as they'll have similarly | exponential increases if they don't filter out the vast | majority of irrelevant data points and states. | gffrd wrote: | I've observed the same in manufacturing ... and fitness | trackers a la FitBit. | | There's initial value from training yourself on what something | looks/feels like ... but diminishing returns after that. | Whether there is more value to be found doesn't seem to matter. | | Factories would sensor up, go nuts with data, find one or two | major insights, tire of the data, and then just continue operating | how they were before ... but with a few new operational tools | in their quiver. | | Same is true of fitness trackers: you excitedly get one, learn | how much you really are sitting(!), adjust your patterns, time | passes ... then one day you realize you haven't put it on for a | week. It stays in the drawer. | | Only when they're threatened with ruin will people make | changes to the standard way of doing things. This is actually | ... not bad! Continuity is important, and this is kind of a | subconscious gating function to prevent deviation from a proven | way of working. So, the change has to be so compelling or so | pressing that they're forced to make it. Not a bad thing. | | While we think things change overnight in this world, they | generally take a while ... stay patient ... it's worth it. | [deleted] | pradn wrote: | I went on a diet a few years ago. I obsessively recorded | every food I ate in MyFitnessPal. To this day, I know roughly | how many calories pretty much everything I eat is. So, I've | learned from the process and don't need the process as much | any more. (I'm kidding about that - it's easy to | underestimate how much you eat, and an extra 200 cal a day | adds up over the years.) | carabiner wrote: | I used to track sooo much health and fitness data... 
Then I | realized it mostly wasn't actionable, or at least, I wasn't | altering my decisions based on it. The answer was always | "more training." So I stopped. | esel2k wrote: | Very interesting. I left AgTech last year but had similar | experiences, even worse: often the single most prominent | use-case was following some painful but necessary documentation of | ag inputs (chemicals, seeds, fertiliser) to get subsidies. Real | insights from data? Nah! | guardiangod wrote: | There is literally a post on the front page about ChatGPT, and Microsoft | and Google are preparing to duke it out starting in the _next 2 | days_ over big-data-generated 'chat' results. | | Big data was never going to be useful to even medium-size | enterprises, unless anyone can get public access to PBs of data, | but that doesn't mean big data is dead. ChatGPT is literally | changing how schools will test their students, for a start. | | Maybe what the author is trying to say is 'small-scale big data | is dead, but big data chugs on.' | gardenhedge wrote: | That is what the author said. From the article: "Big Data is | real, but most people may not need to worry about it." | hn_throwaway_99 wrote: | > Maybe what the author is trying to say is 'small-scale big | data is dead, but big data chugs on.' | | That's pretty much exactly what the author says in the article. | zmmmmm wrote: | Yes, this occurred to me as well. The counter-narrative here is | that the story of the last 2-3 years has been breakthroughs | in AI that have come about mostly by scaling up network | sizes and training data sets by 5 orders of magnitude or | so. | | I guess the takeaway, however, is still that regular businesses | really just can't play in this game and should not assume | they have big data until that fact asserts itself out of | necessity, rather than the other way around. | eppp wrote: | I kind of doubt they trained chatgpt on petabytes of | application logs and web server logs. 
Is keeping all of this | crap even useful for more than a small amount of time at this | scale? | | Actual good information will always be useful, most of this | "big data" seems to be the equivalent of recording background | static. | tootie wrote: | That's a completely different topic. "Data" is obviously a | pretty generic term and "large sets of data" are going to be | more and more relevant to the world in general. What he's | talking about is the Big Data trend in industry specifically | around Business Intelligence (BI). That is, collecting as much | data as possible on your users to optimize your product | experience and profits. Tracking clicks, purchases, form | dropoffs, email opens, ad impressions. It's mostly going to be | first-party data (ie what did they do with our own products and | content). | | ChatGPT and the like are not going to get much use from that | kind of data and instead are looking at a giant corpus of text | and images scraped from a variety of public sources to infer | what humans might think sounds smart. It's possible the two | worlds will meet, but that's probably not what's going to be | announced this week. | miguelazo wrote: | >ChatGPT is literally changing how school will test their | students, for a start. | | Sure, instead of schools checking for plagiarism from other | students' papers using turnitin.com, they'll check for | plagiarism using ChatGPT tools that scan for known output from | their industrial-scale amalgamation of plagiarized materials. | Big whoop. | humanizersequel wrote: | All math homework through the high school level is now as | simple as figuring out how to describe it to ChatGPT (or | maybe ChatGPT 2.0 for particularly tricky examples). Paper- | writing is now a matter of figuring out how to rephrase LLM | output in your own words to get around any watermarking or | pattern detection. 
| eganist wrote: | Wolfram alpha has been around for math cheats (and people | like me who just needed a more visual representation to | learn) for a while now. Including proof of work. | LarryMullins wrote: | Years before wolfram alpha, we had TI-89s with computer | algebra systems for cheating your way through highschool | math. | eganist wrote: | Oh yeah, it's why most of my classes restricted us to | TI-83s. The TI-89 was restricted in schools to basically | calc and above, and the TI-92 was just banned. Lol | LarryMullins wrote: | In my school TI-89s were unrestricted, but I think that | was mostly because teachers were only trained with 83s | and assumed the 89 had equivalent capabilities. The SAT | permitting the 89 probably had something to do with it | too, since the 92 was banned (because qwerty as I | understand it.) | frgtpsswrdlame wrote: | You can't figure out how to describe a math problem | correctly to ChatGPT without already knowing the solution. | bhhaskin wrote: | Or just require students to use software that keeps track of | version history or require spyware installed. | SoftTalker wrote: | Go back to longhand. Even if they are plagiarizing, they | might learn something from the exercise of rewriting by | hand. | suddenclarity wrote: | That's solved by putting a pen on your 3d printer. | Loughla wrote: | I mean, that's certainly part of it. But you're also seeing | very fast restructuring of individual courses (because | programs overall will definitely take years to restructure | because higher education moves so damned slow) to account for | these tools. | | In the small institution I am currently working with, the | English courses, in one week, integrated chatgpt as a tool | for students to work with. It's part of the collaborative | idea building and development process now for every student | enrolled in creative writing and writing analysis classes, | and that happened in one week. 
I cannot stress enough how | unbelievably fast that is for higher ed. That's faster than | light speed. | | And we're not even that well resourced. I have to imagine | there are other examples where it's more than just running | things through a bot to scan for known outputs. | stcroixx wrote: | Sounds like they simply panicked and threw something | together with little thought or preparation, what, right in | the middle of an actual course? And they want to charge | kids for this kind of 'expert instruction'? I'd be pissed | as a student. | Loughla wrote: | What a weird assumption you made there. In what way does | what I wrote sound panicked? Because it was a week? Yes, | it was fast, but it was a massive effort by the entire | English faculty. | | It's integrated into existing assignments, modifying | processes that are super well established already. It was | like integrating a new person into the class. Also, it | was before the semester started, so the students literally saw | nothing weird; that was another strange assumption for | you to make. | | Just a tip, and don't read tone into this statement, but | don't assume things. 9/10 times you're going to be | incorrect. It's much better to ask questions instead of | making statements with question marks at the end of them. | pcthrowaway wrote: | > don't assume things. 9/10 times you're going to be | incorrect | | Isn't that... an assumption? | Loughla wrote: | No, it's an assertion. It's like the bastard cousin of an | assumption, in that it's only incorrect 8.67 times out of | 10. | stcroixx wrote: | I simply don't buy that they had anywhere close to enough | time to integrate this into an existing curriculum in a | way that would meet the high standards paying students | should have for a university education. I don't have to | ask how an English department became expert in using a | just-released AI in the classroom in a week, because I | don't think that's what they actually achieved. 
| Loughla wrote: | Again, stop putting words in my mouth, please. I never | said they became experts. I said they integrated it into | existing processes. I said that they were doing something | other than just scanning for chatgpt hits like plagiarism | checkers. | | >It's part of the collaborative idea building and | development process now for every student enrolled in | creative writing and writing analysis classes | | >It's integrated into existing assignments, modifying | processes that are super well established already. It was | like integrating a new person into the class. | | I don't honestly know how else to say that. I | legitimately do not know how to help you understand what | I'm saying. | hgsgm wrote: | Wow, instead of learning and being creative, just write | out whatever the generic safe whitewashed chatbot says? | | Why pay for college? | scottyah wrote: | > industrial-scale amalgamation of plagiarized materials | | More Big Data! | criddell wrote: | Is it crazy to think that instead of stepping up the war | against AI, we try to figure out a way to teach kids | assuming they will use AIs? | SoftTalker wrote: | Are we trying to produce adults who are able to think | critically and creatively, and who reach their full | intellectual potential, or are we trying to produce adults | who can push a few buttons and blindly believe what the | machine tells them? | NineStarPoint wrote: | While likely not how it would work out in practice, | you would hope that with better tools would also come | higher standards. If you expect more complex, more | thorough, and/or less error-prone output from students | using AI, then you don't necessarily have to lower how | much critical and creative insight they need to have. | Like the difference between a test that does and doesn't allow | calculators: you always have to fit the assignments to | the tools that are used for them. 
| ilyt wrote: | Definitely the second one | [deleted] | nickersonm wrote: | > In fact, [writing] will introduce forgetfulness into | the soul of those who learn it: they will not practice | using their memory because they will put their trust in | writing, which is external and depends on signs that | belong to others, instead of trying to remember from the | inside, completely on their own. | | - attributed to Socrates by Plato. c.399-347 BCE. | "Phaedrus." | drowsspa wrote: | Pretty sure there's a fallacy named after this whole "hey | this is just exactly like before so we have nothing to be | concerned about". | nickersonm wrote: | The point is that all technology is a tool. Whether it be | writing, calculators, or various narrow AI software. We | can either bemoan the loss of a now-less-useful skill | (memorization, long division, longform writing), or learn | how to use these tools to better achieve our goals. | qup wrote: | The average human hates changes of things they've grown | used to. | | People are very attached to school being just like it was | when they went. | maliker wrote: | It appears that it's hard to detect AI generated content. | E.g. true detection rates are only around 25% and there are | also techniques to further mask output [1]. | | [1] https://www.nbcnews.com/tech/innovation/chatgpt-can-help- | foo... | penjelly wrote: | what if teachers ask chatgpt how best to test students | despite the existence of a tool like chatgpt enabling | cheating | pphysch wrote: | OpenAI and Google are clearly in the 1% as TFA describes. | snarf21 wrote: | Big Data drives the most profitable and society bending changes | of all time, just to serve us better Ads. | ryguyrg wrote: | Okay, Google as a company and as a product is definitely in | the top 1%, or top 0.0001% where big data drives profitable | and bending changes ;-) | epicureanideal wrote: | I wonder how long until training today's ChatGPT will cost | $1000 of AWS compute. 10 years? 
| | At that point, does it keep scaling or is there an S curve | where 100x more data and compute only leads to a 2x | improvement? | thfuran wrote: | I think you're underestimating the scale by a few orders of | magnitude. | the8472 wrote: | > At that point, does it keep scaling or is there an S curve | where 100x more data and compute only leads to a 2x | improvement? | | Careful with the scales. 2x improvement could be interpreted | from 80% human performance to 160% human performance. Or | going from 10% error rate to 5% (which again crosses into | superhuman territory on some tasks). Those last few bits are | the critical ones. | pphysch wrote: | We would need to see incredible advances in energy efficiency | for that to happen. | simplyluke wrote: | We are! Don't even have to reach for fusion potentially | being commercial technology to show it. Solar is already | approaching $0.03/kilowatt hour and likely to be half of | that by the end of the decade. Energy getting very cheap | coupled with computing capacity continuing to go way up is | going to enable lots of interesting new technologies beyond | LLMs | airstrike wrote: | _> ChatGPT is literally changing how school will test their | students, for a start._ | | Here's a novel idea: test students using pen and paper? | 0cf8612b2e1e wrote: | Teachers assign better scores to papers with better | penmanship. I forget how strong the effect was, but using a | keyboard does help equalize some biases. | politician wrote: | Scan the hand-written work back into a digital format and | present the results to the teachers for evaluation in their | preferred typeface. | albert_e wrote: | Keyboards and screens in an exam hall then? | 0xdeadbeefbabe wrote: | Why equalize that bias? | slaymaker1907 wrote: | Because there are a plethora of disabilities which make | neat penmanship difficult. | MonkeyMalarky wrote: | Typewriters for everyone! 
| Mizza wrote: | Or, preferably, admit that testing wasn't a good idea to | begin with and focus on optimizing children for learning, not | test-taking. | taftster wrote: | I don't even think we're just talking about children | either. Test taking in academia (university and above) | could stand a much-needed fresh look. | | I am hopeful that a change happens in academia to prepare | students for jobs, which is why they are going to school in | the first place. Yes, students need to learn how to | "think", but really they want to get the technical | skills to perform their duties more than anything. | | We have placed too much credence in a traditional academia that | isn't useful to the average person or average job. College is | a "game" for most students, and they put up with going through | the motions of testing, etc. for the sake of the diploma at | the end. | | I hope we're entering a new era of what college means | for those looking to get something different out of it. | throwmeout123 wrote: | In Italy we test with pen and paper, and the quality of | Italian schools is abysmal. The point is not the test, it's the | quality of the teachers. They are the only hope for forming good | humans, not some standardized test. | winterismute wrote: | The database was the key technology of the 2001-2011 decade: it | allowed companies to store massive amounts of data in an organized | way, so that they could provide basic functionality (search, | monitoring) to users. Statistical learning has been the key | "technology" from 2011 to today: it allows companies, which have | stored massive amounts of data, to feed predictions back to users. | I think AR/computer graphics will be the key technology of the | next decade: it will allow users to interact directly and | seamlessly with the insights produced by ML systems, and possibly | feed information back. 
| idlewords wrote: | Pretty funny to see this when every other headline on this site | is about how large language models are about to revolutionize | dentistry, beekeeping, etc. | nerpderp82 wrote: | Big Data was whatever someone couldn't handle in a spreadsheet or | on their laptop using R. | | This paper is 8 years old, and it was somewhat obvious then. | | Scalability! But at what COST? | https://www.usenix.org/system/files/conference/hotos15/hotos... | | A big single machine can handle 98% of people's data reduction | needs. This has always been true. Just because your laptop only | has 16GB doesn't mean you need a Hadoop (or Spark, or Snowflake) | cluster. | | And it was always in the best interest of the BD vendors and | Cloud vendors to say, "collect it all" and analyze on/or using | our platform. | | The future of data analysis is doing it at the point of use and | incorporating it into your system directly. Your actionable | insights should be ON your grafana dashboard seconds after the | event occurred. | angry_moose wrote: | My experience with "Big Data" is it was something that couldn't | be handled in a spreadsheet or on a laptop using R because | it was so inefficiently coded. | | I got sucked into "weekly key metric takes over 14 hours to run | on our multi-node kubernetes cluster" a while back. I'm not | sure how many nodes it actually used, nor did I really care. | | Digging into it, the python code ingested about ~50GB of | various files, made well over a dozen copies of everything, | leaving the whole thing extremely memory-starved. I replaced | almost all of the program with some "grep | sed | awk | sed | | grep" abomination that stripped out about 98% of the unnecessary | info first, and it ran in under 2 minutes on my laptop. I | probably should have tightened it up more, but I was more than | happy to wash my hands of the whole thing by that point. | | Instead of improving the code, they just kept tossing more | compute at it. 
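(A sketch of angry_moose's filter-first pattern in pure Python, for anyone who finds shelling out un-pythonic - the log format and field names here are made up:

```python
import io

# Made-up log stream: only a small fraction of lines carry the
# metric we want, so drop the rest before doing any real parsing.
log = io.StringIO(
    "DEBUG cache warmup\n"
    "METRIC latency_ms=120\n"
    "DEBUG gc pause\n"
    "METRIC latency_ms=80\n"
)

def latencies(lines):
    """Generator pipeline: the cheap filter runs first (the 'grep'),
    the parse runs only on survivors (the 'awk'). Nothing is copied,
    so memory stays flat no matter how big the input is."""
    for line in lines:
        if line.startswith("METRIC"):
            yield float(line.split("=", 1)[1])

vals = list(latencies(log))
print(sum(vals) / len(vals))  # -> 100.0
```

The point is the same either way: strip the irrelevant 98% in a single streaming pass before anything expensive touches the data.)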
Still heard all kinds of grumbling about | os.system('grep | sed | awk | sed | grep') not being "pythonic" | and "bad practice"; but not enough that they actually bothered | to fix it. | nerpderp82 wrote: | That is one of the selling points of Hadoop, you can write | garbage code and scale your way out of any problem, turning | the $$$ knob up to more nodes. | angry_moose wrote: | Yeah, that's why I got involved (I was infrastructure at | the time) - how can we throw more hardware at it as the | kubernetes setup they had wasn't cutting it. | | One of the "data scientists" point blank said in a meeting | "My time is too valuable to be spent optimizing the code, I | should be solving problems. We can always just buy more | hardware". | | Admittedly the last little bit of analysis was pretty cool, | but >>99% of that runtime was massaging all of the data | into a format that allowed the last step to happen. | mywittyname wrote: | You can do petabytes of analysis with regular old BigQuery | just as easily as you can analyze megabytes of data. This | solves the scalability issue for a lot of companies, IMHO. | nerpderp82 wrote: | I agree, BQ is a gem on GCP. You pay for storage (or not, you | can use federated queries) and don't pay anything when you | aren't using it. The ability to dynamically scale | reservations is pretty nice as well. | mejakethomas wrote: | So what I'm hearing is it's not the size of your data that | matters, it's how you use it? | zmmmmm wrote: | To be honest, I slightly disagree about data size. I think the | big data is there to be had, the real story is that data science | itself has not panned out to provide the business value that | people asserted would come from it. Data volumes haven't risen | more because in the end, it turns out most of the things | businesses need to know are easily ascertainable from much | smaller data and their ability to action even these smaller very | obvious things is already saturated.
| | It doesn't help that we've shifted into a climate where hoarding | data comes with a huge regulatory and compliance price tag, not | to mention risk. But if the value was there we would do it, so | this is not the primary driver. | papito wrote: | First they came for the sacred microservices, now they are after | Big Data. What. Is. Happening. | | Don't get me wrong, I love it. It's about time people got off | these stupid and shockingly expensive bandwagons. | heisenbit wrote: | Sampling has proven extremely useful. Pi can be approximated with | it, as were nuclear bombs designed using statistical methods. | Flame graphs based on stack samples are used to optimize servers. | Government does planning with it. Management does its thing by | wandering around. | | It usually does not take many data points for an actionable | insight, and most actions then will invalidate small details in | old data anyhow. Better to start every round with fresh eyes. | edpichler wrote: | I believe we are living in the "emotional era", so data is being | ignored and 'feelings' come first when making decisions or | creating processes. This is happening not only in companies but | in our current society in general. | mordechai9000 wrote: | Perhaps I'm somewhat cynical, but I believe this is a feature | of the human condition, not an attribute of our age in | particular. Reason and analysis are tools that are used to | justify what we already believe. | maxfurman wrote: | Agreed! The so-called "Age of Reason" was the anomaly, and | probably not that much more reasonable than our own time. | tootie wrote: | I think there's absolutely a place for this. I often think of the | old Henry Ford quote about people wanting faster horses. Data and | analytics are great for optimization, but sometimes you need to | trust your gut and give people something they didn't ask for to | have a breakthrough.
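heisenbit's sampling point is easy to demonstrate. Here is a sketch (plain Python, hypothetical numbers) that approximates Pi from nothing but random samples:

```python
# Monte Carlo estimate of pi: sample random points in the unit square
# and count the fraction that land inside the quarter circle.
import random

def estimate_pi(samples: int, seed: int = 42) -> float:
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / samples

print(estimate_pi(100_000))  # roughly 3.14
```

A hundred thousand samples, trivial on any laptop, already land within a fraction of a percent of the true value; that is the whole argument for sampling over exhaustive scans.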
| alexpetralia wrote: | I am writing an essay series on this topic: last-mile analytics | and how an abundance of data must be ultimately converted into | (measurably correct) action. | | If anyone wants to follow along, the series is here! | | https://alexpetralia.com/2023/01/19/working-with-data-from-s... | blakeburch wrote: | That looks like a huge undertaking, but kudos for taking the | time. I'll be following along. Totally agree that all data | should be tied to the business value that it's driving. | | Unfortunately, I've found that many data teams focus more on | making the data clean and available. They never drive the | conversation about what actions are being taken with the data. | That leads to them being treated as cost centers. Wrote a | similar post about my perspective on it - | https://bytesdataaction.substack.com/p/transform-your-data-t... | | I'd love to chat about the space more with you if you're | interested! Email in bio. | imachine1980_ wrote: | This sounds like "Sane Planning, Sensible Tomorrow," a book for | Al Gore | moooo99 wrote: | I feel like big data has rarely lived up to the hype in most | organizations. My own experience working in large orgs largely supports the point | that collected data is rarely queried. But this is rarely due to | a lack of interest, it is mostly because a) nobody really has a | great overview over what even is collected b) even if you | know/assume something is collected, you usually have no idea | where c) if you find the data, there is a decent chance that it | is in some sort of weird format that requires a ton of processing | to be usable. | | This has been - to varying extents - my own experience working in | large organizations that don't have tech as their core business. | | Although there are some successful data analysis projects, the | potential of the collected data remains largely underutilized. | blipvert wrote: | Listen to "Reason" | danuker wrote: | I agree with many of the points here.
| | My cheap no-name old laptop SSD writes with 170MB/s. | | A customer has a name, address, email and order. Let's say 200 | bytes for each. That means I can write 844000 new customers per | second, far outside my personal marketing reach. | | My disk is 240GB, which means I can store data for 1.2 billion | customers. It'll take a while until I become that successful. | hinkley wrote: | One of my "computers are really fucking fast" experiments, | almost a decade ago, was when I was trying to do a histogram | plot of a function that I was 98% sure was terribly broken. It | was expected to give a uniform distribution so I figured I'll | just plot a bunch of values into a 2d space and then convert it | to a greyscale image. | | At first I tried to puzzle out a good sampling strategy to make | sure I didn't bias the output, then on a whim I tried 2^32 | samples and went to lunch. It took something like a half an | hour to do 4 billion samples. Took me a couple times to figure | out how to squeeze 4k megapixels into a graph so I ran it a few | more times, but the results showed a very distinct banding | pattern that confirmed that the problem was every bit as bad as | I suspected, which was a blocking issue for our release. A | couple of hours well spent, running through an 'intractable | problem' that really wasn't. | lostmsu wrote: | > 170MB/s | | That is not random access speed. For random access my | relatively high-performance SSD only does 42MB/s reading and | 80MB/s writing. | danuker wrote: | Indeed there's probably some caching going on. | ravivooda wrote: | It never sat well with me that none of the production services | could leverage my local computation and storage power. I don't | need to store my contacts on a remote server that could index | my contacts when mixed with every other contact in a single | table. That's a blatantly oversimplified example but you get | the gist. 
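danuker's arithmetic above can be checked directly. A back-of-the-envelope sketch using his figures (170 MB/s sequential writes, 200-byte records, a 240 GB disk; the exact throughput figure depends on whether you count MB or MiB, and as lostmsu notes, random access would be several times slower):

```python
# Back-of-the-envelope: how far does one cheap laptop SSD go?
write_speed = 170_000_000    # bytes/s, sequential (random access is slower)
record_size = 200            # bytes per customer record
disk_size = 240_000_000_000  # bytes

customers_per_second = write_speed // record_size
customers_on_disk = disk_size // record_size

print(customers_per_second)  # 850000 new customer records per second
print(customers_on_disk)     # 1200000000, i.e. 1.2 billion records
```

The point survives any quibbling over units: a single commodity disk absorbs customer records orders of magnitude faster than almost any business can generate them.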
| ilyt wrote: | Developing apps as local-mostly with remote being "just | storage" might've been an interesting approach, but oh so much | stuff moved to webshit from native apps, and browsers still | don't even have decent data management. | ravivooda wrote: | Well said! I wonder if Web3 (or a zero trust solution) could | solve such a problem, where you provide a service that | can run in a special container | tomwheeler wrote: | Presumably the "order" you mention is a key into another | table, likely one that references the individual items that | make up that order, so the data will be much larger than you | estimate. | | It will grow larger still if you include web logs from your | e-commerce site and event data from your mobile app so that you | can correlate these orders with items that customers considered | but ultimately didn't buy. How will your laptop and SSD perform | when you then build a user-item matrix to generate product | recommendations for each of those 1.2 billion customers? | | While plenty of organizations unnecessarily use Big Data tools | to store and analyze relatively small amounts of data, there | are plenty of customers with enough data to require them. I've | seen plenty of them firsthand. | ilyt wrote: | That's still well within reach of a 1U server with some RAM and | a bunch of NVMes | 0xB31B1B wrote: | There are functionally fewer than 1000 organizations that | currently require distributed compute for data analysis. You | can get off the shelf AWS units with 1000 cores, terabytes of | ram and storage, etc. The cost of compute has decreased | faster than the amount of data we have to store and process. | What we used to do with spark jobs we can do with python on a | single box. | pocket_cheese wrote: | This is not true. Any column store database (bigquery, | Redshift, snowflake) implements distributed compute behind | the scenes.
When analysts/business intelligence people | have a query return in 3 seconds instead of 15 seconds, | it's actually huge. Not just in aggregate amount of time | saved, but in creating a quick feedback loop for testing | hypotheses. This is especially true considering that most | analyst type people look at data as aggregates across some | dimension (e.g. sales per month, unique visitors per | region, etc...) | | These types of questions are orders of magnitude faster | with a distributed backend. | glogla wrote: | Yup. | | I was just playing with some data from our manufacturing | system, about 30 GB. I pulled the data to my laptop (very | expensive Apple one) and while it fits on my disk just | fine, it took about 15 minutes to download. | | I imported it to ClickHouse which took a while due to | figuring out whatever compression and LowCardinality() | and so on. I ran a query and it took ClickHouse about 15 | seconds. DuckDB pointed at the parquet files on my SSD | took 19 seconds to do the same. Our big data tool took 2 | seconds, while working with data directly in cloud | storage. | | Now of course this is entirely unfair - the big data | thingie has over twenty times more CPUs than my laptop, | and cloud storage is also quite fast when accessed from | many machines at once. If I ran ClickHouse or DuckDB on a | 100 CPU machine with a terabyte of RAM it might have still | turned out faster. | | But this experiment (I was thinking of using some of the | new fancy tech to serve interactive applications with | less latency) made me realize that big data is still a | thing. This was a sample - one building from one site, | which we have quite a few of. | ryguyrg wrote: | I'd love to understand the shape of this data and some of | the types of queries you're performing. It would be very | helpful as we build our product here at motherduck.
| | I have no doubt that there are situations where the cloud | will be faster, especially when provisioned for max usage | [which many companies do not]. However, there are a lot | of these situations even where the local machine can | supplement the cloud resources [think re decisions a | query planner can make]. | | Feel free to reach out at ryan at motherduck if you want | to chat more. | threeseed wrote: | Let's assume your completely made-up 1000 organisations | claim is true. | | Right now I work for one of them: a global investment bank. | | Within that organisation we have at least 100+ Spark | clusters across the organisation doing distributed compute. | And at least in our teams we have tight SLAs where a simple | Python script simply can't deliver the results quickly | enough. Those jobs underpin 10s of billions of dollars in | revenue and so for us money is not important, performance | is. | | So 1000 x 100 = 100,000 teams, all of whom I speak for, | disagree with you. | doug_durham wrote: | Citations please? That's a pretty bold statement to make in | the face of observed reality. | beckingz wrote: | Even if this is off by two orders of magnitude and it's | only 100,000 companies that need distributed compute, | that means that almost all companies just need a single | large computer. | | Looking at the distribution of companies by employee | count and assuming that data scales with employee count | (dangerous assumption, but probably true enough on | average), that means that companies don't need | distributed compute until they get several hundred | employees. [0] | | [0] https://www.statista.com/statistics/487741/number-of- | firms-i... | fho wrote: | https://yourdatafitsinram.net/ | threeseed wrote: | This is such a lazy response. | | I/O performance is just one of many characteristics that | impact performance and from experience the one you least | need to worry about.
RAID 0 across multiple high-end NVME | drives with OS file caching is going to be more than fast | enough for most use cases. | | The issue is running out of CPU performance and being | able to seamlessly scale up/down compute with live | running workloads. | beckingz wrote: | A large computer is radically CPU overprovisioned for | most workloads. | deltarholamda wrote: | Don't forget the cool JS library you included to track mouse | movements so you can optimize your UI to make sure Important | Money Making Things are easily clickable. | | That's 8.4 hojillion megabytes per second right there. | juujian wrote: | > Most data is rarely queried | | Right on point. In the past I have been obsessed with big data, | looking for insights. Then I realized that a medium-sized | specific data set is always better than a gargantuan general big | data monster. There are so many applications in my field where | only outliers matter anyways, and everything is very | "centralized" to a few relevant observations. So the only thing | about big data is that you maybe throw away 99.9% of the data | right away and then you have some observations that you actually | care about. There is soooo much data out there that is just | noise, and so little that I actually care about. And that's why I | still end up hand collecting stuff every now and then. | donretag wrote: | My personal definition of Big Data has always been when you | gather/store data without having a planned use for it. Do we need | this data? Don't know, let's just store it for now. | | The article does allude to this definition when it states that | "Most data is rarely queried". We have become data hoarders. | Technology has made it easy (and relatively cheap) to store data, | but the ideas of what to do with this data have not scaled in | comparison. | ThereIsNoWorry wrote: | Big Data is dead? Seems alive and well to me. If you're not a big | company with big customers, it never affected you to begin with.
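juujian's "throw away 99.9% right away and keep the observations you care about" can be sketched in a few lines of stdlib Python (a toy, with made-up data): stream the values once, then retain only the outliers.

```python
# Keep only the observations that matter: discard the noise,
# retain values more than z standard deviations from the mean.
from statistics import mean, stdev

def keep_outliers(values, z=3.0):
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > z * s]

# 1000 observations, of which only two are interesting
data = [10.0] * 997 + [10.1, 250.0, -180.0]
print(keep_outliers(data))  # → [250.0, -180.0]
```

The result set is a tiny fraction of the input, which is exactly why the downstream analysis rarely needs "big data" tooling.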
| dig1 wrote: | Big Data is far from dead. On the contrary, people (on most | daily projects) are more mindful now wrt all Big Data | liabilities and benefits (infrastructure cost vs. what you get | from it) thanks to the experience of the failed ones. But many | analytics companies are thriving. | | Also, using BigQuery as a metric of how Big Data is used is, | IMHO, wrong. Real analytics companies usually have custom | solutions because BigQuery is too expensive for any serious | usage unless you are Google. | zzzeek wrote: | > The most surprising thing that I learned was that most of the | people using "Big Query" don't really have Big Data. | | wow, ya think? Must have been eye opening to see all those | customers with a few million rows thinking they had "Big data" | huh? | ryadh wrote: | While I get that they're sometimes useful to trigger debate, I | don't really subscribe to very bold statements. | | We are drowning in data, it's all around us. Information overload | is real. Data enables most of our daily digital experiences, from | operational data to insights in the form of user facing | analytics. Data systems are the backbone of digital life. | | It is an ocean and it's all about the vessel you pick to | navigate it. I don't believe that the vessel should dictate the | size of the ocean, it's simply constrained by its capabilities. | The trick is to pick the right vessel for the job, whether you | want to go fast, go far or fish for insights (ok, I need to stop | pushing on this metaphor ) | | This visionary paper from Michael Stonebraker (2005) predicted | it quite accurately and I think is still relevant: | https://cs.brown.edu/~ugur/fits_all.pdf | | Databases come in various flavours and the "trends" are simply a | reflection of what the current era needs | | Disclaimer: I work at ClickHouse | fuziontech wrote: | 100% agree. One of the biggest assets we had at <driver and | rider marketplace app> was the data we collected.
We built | models on it that would determine how markets were run and | whether drivers and passengers were safe. These were key | features that enabled us to bring a quality service to | customers (over ye ol' taxi). The same applied to the | autonomous cars, bikes, and scooters. We used data to improve | placement of vehicles to help us anticipate and meet demand. It | was insane how much data we used to build these models. | | To say big data is dead sounds to me like someone desperate for | eyeballs. | | I do think there is a huge opportunity for DuckDB - running | analytics on 'not quite big data' is a market that has always | existed and is arguably growing. I've seen way too many people | trying to use Postgres for analyzing 10 Billion row tables and | people booting up an EMR cluster to hit the same 10 Billion | rows. There is a huge sweet spot for DuckDB here where you can | grab a slice of the data you are interested in, take it home | and slice and dice it as you please on your local computer. I | did this just this weekend on DuckDB _and_ ClickHouse! | | Disclaimer: I work at a company that is entirely based on | ClickHouse. | vgt wrote: | Didn't know that Posthog is based on CH these days. | Interesting! | spopejoy wrote: | I guess the article title is a "bold statement" but maybe the | biggest insight in there is that people don't think hard enough | about throwing old data away, and it hurts them. This is a | liferaft for drowning in data and is more "bold" | organizationally, as it actually takes a certain kind of | courage to realize you should just throw stuff away instead of | succumbing to the false comfort that "hey you never know when | you might need it". | | Weirdly there's a similar thing that can happen to codebases, | specifically unit tests and test fixtures that outlive any of | their original programmers, nobody understands what's actually | being tested and before each release lose days/weeks hammering | to "fix the test".
The only solution is to throw it away, but | good luck getting most teams to ever do that, because of the | false comfort they get -- even though that fixture is now just | testing itself and not protecting you from any actual bugs. | | I mean how often does Netflix need to look at viewing habits | from 2015? Summarize and throw it away. | alluro2 wrote: | I'm quite surprised with data sizes mentioned in the article, and | wondering if I'm missing something... We are a very small 2yo | company, handling route optimization and delivery management / | field service. Even with our very small number of customers, | their relatively small sizes (e.g. number of "tasks" per day), | being very early in development in terms of data that we collect | - our database containing just customer data for 2 years is | ~100GB. Which I previously considered small, and if we collected | useful user metrics, had more elaborate analytics, location | tracking history etc, I would expect it to be at least 3x. | | We don't use any "BigData" products yet, as there wasn't any need | for them, even when we provide full search and a relatively nice | and rich set of analytics over all the data. Yet, based on the | article, we're way above most of the companies relying heavily on | such tools. Confusing. | ankrgyl wrote: | I love DuckDB and am cheering for MotherDuck, but I think | bragging about how fast you can query small data is really no | different than bragging about big data. In reality, big data's | success _is not_ about data volume. It's about enabling people | to effectively collaborate on data and share a single source of | truth. | | I don't know much about MotherDuck's plans, but I hope they're | focused on making it as easy to collaborate on "small data" as | Snowflake/etc. have made it to collaborate on "big data". | miguelazo wrote: | On to the next hype theme (AI)! | luckydata wrote: | It's kinda weird to read this.
The whole argument is "we didn't | have databases that could handle the sizes and use cases | emerging, we worked on the problem for 20 years and now it's no | biggie". | | Mission accomplished more than big data is dead IMHO. | andix wrote: | I see it all the time: people develop applications that will | never ever get a database size of over 100GB and are using big | data databases or distributed cloud databases. Often queries only | hit a small subset of the data (one customer, one user). So you | could easily fit everything into one SQL database. | | Using any of the traditional SQL databases takes away a lot of | complications. You can do transactions, you can query whatever | you want, ... | | And if the database may get up to 1TB, still no problem with SQL. | If you exceed that, you may need a professional OPs team for your | database and a few giant servers, but they should easily be able | to go up to 10 TB, offload some queries to secondary servers, ... | primax wrote: | I think a key driver of this is not having to use SQL. I like | DynamoDB and EdgeDB because I can use a more modern and | reasonable language to interact with the database. | 0xB31B1B wrote: | It's really difficult to do any kind of analysis without | relational queries. The standard way you do this is to have | an app datastore in DDB, and an ETL job that pipes your data | into some data warehouse env. | andix wrote: | That's a good point, I also think that there should be some | modern alternative to SQL. I really like how you can query | databases with LinqPad (c#) and how it renders it into a | nested table tree. All relations are clickable/expandable, so | if you find something interesting in your result set, you can | just expand additional rows from other tables. In the | background it just creates sql via an ORM, not only once I | more or less copy and pasted that generated sql into a view.
| | But LinqPad is not that useful if you don't get the pro | version; only then do you get code completion. So it's not really | the answer to the problem. | [deleted] | [deleted] | tootie wrote: | I think a lot of data tech has come full circle and is now mostly | just relational databases. Our org is invested in redshift | which lets us mostly pay as we go. The DB itself is just a | Postgres facade on scalable storage with some native connectors | to file stores and third-parties. After rolling over our stack | like three times, we're now just dumping tons of raw data into | staging tables, then creating views on top of them. It's 97% | raw SQL with a smattering of python for clunky extractions. And | we're now true believers in ELT vs ETL. | threeseed wrote: | Redshift with S3 storage is no different to Spark SQL with S3 | storage. | | Both are distributed compute. Except that Spark allows you to | mix/match code with SQL. | mikepk wrote: | We need to re-think how to make data _useful_. The fact that the | value hasn't materialized after decades of attempts, billions of | dollars, and lots of tools and technology points to the fact that | our core assumptions and patterns are wrong. | | This post doesn't go far enough. It challenges the assumption | that everyone's data is "big data" or that every company's data | will eventually grow to be big data. I agree that "big data" was | the wrong model. We also need to challenge that all data should | be stored in one place (warehouse, lake, lakehouse). We need to | challenge that one tool can be used for every data need. We need | to challenge how we build systems both from a technology and | people standpoint. We need to embrace that the problems and needs | of companies _are always changing_. | | We are living with conceptual inertia. Many of our patterns are | an evolution from the 70's and 80's and the first relational | databases. It's time to rethink how we "do data" from first | principles.
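andix's "one SQL database covers it" point works in miniature too. A sketch using Python's bundled sqlite3 as a stand-in for any traditional SQL database (DuckDB fills the same embedded role for analytical workloads) — transactions and ad-hoc queries with no cluster at all:

```python
# An ordinary embedded SQL database: transactions plus arbitrary
# queries, no servers. sqlite3 is a stdlib stand-in here.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
with con:  # a transaction: commits on success, rolls back on error
    con.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
    )
totals = con.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)  # → [('alice', 37.5), ('bob', 12.5)]
```

Swap the connect string for a file path and this scales comfortably into the tens-of-gigabytes range that, per the article, covers most workloads.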
| blakeburch wrote: | The problem is that no tool alone can make data useful. It | requires human ingenuity to come up with a theory, gather the | required data, then test and verify the theory. | | We've gotten to a point where the first and last steps get | skipped. Business leaders see other companies doing interesting | things with data, so the answer must be "gather all the data"! | Internal teams end up focused on gathering the data without the | context of how it might be used. | | We need to train data teams to not focus on the data as the | product. Instead, they should be responsible for driving | business actions. Gathering and cleaning the data should just be | a byproduct of that activity. | pier25 wrote: | > _Are you in the big data one percent?_ | | Exactly, and I'd go further. | | Are you in the perf/scale/data one percent? | | So many people worry about scaling when in reality 99% of web | apps will never reach above 100 reqs/s. | | I've been in web dev for 20+ years. Only once when working for a | big international corporate client I had to worry about traffic | spikes. And that was just for one of their multiple web apps. | CommieBobDole wrote: | It's not dead, it's just entered the plateau of productivity. | Where people use it for whatever it's useful for and don't try to | solve every problem with it just because it's the cool new thing. | taftster wrote: | This posting was great. Highly recommended reading through. It | gets really good when the author hits "Data is a Liability". | | _> An alternate definition of Big Data is "when the cost of | keeping data around is less than the cost of figuring out what to | throw away."_ | | This is exactly it. It's way too hard to go through and make | decisions about what to throw away. In many respects, companies | are the ultimate hoarders and can't fathom throwing any data | away, Just In Case. | | Really appreciated the post overall. Very insightful.
| | As an anecdote to this article, when business folks have come up | to me and asked about storing their data in a Big Data facility, | I have never found the justification to recommend it. Like, if | your data can fit into RAM, what exactly are we talking about Big | Data for? | nemo44x wrote: | Why would I use DuckDB instead of Clickhouse or similar? Is it | just because I want to have the database embedded in my app and | not connect to a server? | tylerhannan wrote: | One great reason to use DuckDB was when ClickHouse took up too | much memory on Parquet files. | | https://github.com/ClickHouse/ClickHouse/issues/45741#issuec... | helps with that though. | | Also, clickhouse-local exists | https://clickhouse.com/blog/extracting-converting-querying-l... | as a thing. | | But, yes, when I think of DuckDB...I think embedded use | cases...i'm also not a power user. | | I also think of this very much as a 'horses for courses' or | 'different strokes, different folks' sort of scenario. There | is, naturally, overlap because 'analytical data.' But also, | there is naturally overlap with R and this giant scary mess of | data-munging PERL code I maintain for a side project. | | The DuckDB team, the MotherDuck team, the ClickHouse team...we | all want your experience interacting with data to be amazing. | In some scenarios, ClickHouse is better. In some scenarios, | DuckDB. I'm biased (as I work for ClickHouse in DevRel), but I | <3 ClickHouse. | | Try both. Pick the one that is best for you. Then...you | know...tell the other(s) why so that we all can get better at | what we do. | nemo44x wrote: | Thanks but I'm looking for specific use cases. Like I get | SQLite. And I get Clickhouse. But I just don't get why I'd | use DuckDB specifically. I'm sure it's awesome and super | useful but I have a gap in my understanding. 
| wizwit999 wrote: | Perhaps this is true for business data (though I'm skeptical of | the claims), but, for example, for security data, this isn't true | at all. Collecting cloud, identity, SaaS, and network logs/data | can easily exceed hundreds of terabytes. A big reason why we're | building Matano as a data lake for security. | | It seems an odd pitch in general to say, hey my product | specifically performs poorly on large datasets. | simonw wrote: | Sounds like you're in the "Big Data One-Percenter" category | described at the very bottom of the article. | CobrastanJorji wrote: | On the contrary, identifying what your product is explicitly | not aiming to do is extremely helpful. "Big" adds a lot of | complexity and pain, most people don't do that, our product | avoids the complexity and pain and is the best choice for most | people. Seems like a good, simple pitch, and all it requires is | the humility to say that your solution isn't the best for some | use cases. | cmollis wrote: | we regularly run audits on over 12 years of customer order | histories. This requires scanning of about 40TB of data and | growing. They used to jump through hoops on the Oracle cluster | just to get data out for one customer. We pushed all of the order | history into s3 parquet using Spark and I can query this in about | 20 seconds using Spark or Presto. It's now streamed through kafka | and Spark structured streaming so it's up to date in about 3 | minutes. The click-bait-y title notwithstanding, I get that not | all data is 'big' and duckdb (and datafusion, polars, etc) is | probably great for certain use-cases but what I work on every day | can't be done on a single machine. | sgt101 wrote: | Looks at 15 hr Spark job (running since this morning) | | Sighs... | CobrastanJorji wrote: | Tableau's "Medium Data" April Fools Day ad from several years ago | still rings amazingly true. 
| lucidguppy wrote: | Some of mongo's leveling off is the adoption of good jsonb | columns in postgres. | | mongo's got sharding out of the box - which is nice - but you | have to get your key right or it will suck. | | Also no one should want to host a mongo db - unless that's your | business. | threeseed wrote: | MongoDB grew revenue 52.8% in the previous financial year [1]. | | And if there is any levelling off it's going to be because of | the move towards cloud managed options e.g. Snowflake, | DocumentDB rather than because PostgreSQL decided to add JSONB | support. | | [1] | https://www.macrotrends.net/stocks/charts/MDB/mongodb/revenu... | gesman wrote: | Customer pays data analytics vendor to tackle a bunch of their | [low quality, big size] data. | | If you have no tangible capabilities to do the above, asking the | customer "ARE YOU IN THE BIG DATA ONE PERCENT?" will be the | quickest way out of the door. | andreygrehov wrote: | With all the LLM craziness, this is just the beginning. How else | are they going to train all those models? I'm not an expert, just | imho. | hinkley wrote: | It's always kind of amazed me how closely Big Data was followed | by the KonMari method and it really seems like the nerds were not | paying attention to that at all. Or just happy to take a paycheck | from people who weren't paying attention. | | Hoarding is not a winning strategy. | fijiaarone wrote: | Somewhere along the line people were tricked into thinking that | logging was data, and that we needed to turn up every trace | log to 11 on every production system. | | Logs are where data goes to die. | posharma wrote: | We're going to reach a point where we might say the same thing | about large language models. Fine-tuned LMs (based off of their | large parents) are going to be the bread and butter. | LeanderK wrote: | Who has ever believed those claims?
There's a common saying | "garbage in, garbage out" about what happens with all those fancy | models if the data quality is not high. That's really independent | of dataset size. There's no magic insight you get because your | dataset is bigger. You need a quality analyst to handle your | data, regardless of its size. | | Also, who thought their company would cease to function because | surely they will hit google-scale dataset-sizes in the near | future? Impossible for most except the biggest of the biggest. | articsputnik wrote: | I love DuckDB's simplicity and think it will solve many problems. | Still, transitioning from a local single file DB to concurrent | updates and serving it online will be different. I'm curious | about what MotherDuck will come up with to solve DuckDB at scale. | | I love use cases like the Rill Data | (https://youtube.com/watch?v=XvP2-dJ4nVM), where you can suddenly | run analytics with a single cmd line prompt and see your data | just instantly visualized. Such use cases are only possible | because of the "small" data approach that DuckDB tries. | poorman wrote: | This entire post reads like "you probably don't actually have big | data". | | What do these blockchains that have to keep data around | forever, with high throughput, and need to expose it quickly do? | Are you saying they should delete parts of data in the chain? | | Seriously, I've spent my career working on big data systems, and | while the answer is sometimes "yes you need to delete your data", | I don't think that's always going to work. | PeterisP wrote: | And what about these blockchains? The full history of the Bitcoin | blockchain is less than 500GB, so for any analysis just getting | a machine with a terabyte of RAM is both simpler and cheaper | (once you include dev+ops time) than doing any horizontal | scaling across multiple machines with "Big Data" approaches.
| | "You probably don't actually have big data" is a very valid | point, not that many organizations do - most businesses haven't | generated enough actionable data in their lifetime to need more | than a single beefy machine without ever deleting data. | jacobsenscott wrote: | nosql is dead, client side SPAs are dead. Nice to see the | complexity pendulum swinging back to the correct side again. | Curious what the merchants of complexity will reach for next. Are | applets going to be the new hot thing? | low_tech_punk wrote: | Long live Big Model, I guess? Instead of independent data | warehouses, we are now moving towards a few centralized companies | using supercomputer in physical data centers. The "winner takes | all" effect will only increase as the trend goes on. | morelisp wrote: | To the extent "Big Data" originally and is still often claimed to | mean "data beyond what fits on a single [process/RAM/disk/etc]", | it's always been strange to me how much it's identified with | analytics pipelines doing largely trivial transformations | producing ultra-expensive "BI" pablum. | | Yes, thank goodness that part is dead. But meanwhile - we've | still got more _actual data_ than ever to store, and ever-tighter | deadlines on finding and delivering it. If we can get back to | that and let the PySpark bootcampers fade away, maybe things can | get a little better for once. | | In other words: | | _Even when querying giant tables, you rarely end up needing to | process very much data. Modern analytical databases can do column | projection to read only a subset of fields, and partition pruning | to read only a narrow date range. They can often go even further | with segment elimination to exploit locality in the data via | clustering or automatic micro partitioning. Other tricks like | computing over compressed data, projection, and predicate | pushdown are ways that you can do less IO at query time. 
And less | IO turns into less computation that needs to be done, which turns | into lower costs and latency._ | | Big data is "dead" because data engineers (the programming ones, | not the analysts-in-all-but-title) spent a ton of effort building | DBs with new techniques that scale better than before, with other | storage patterns than before. Someone still has to write and | maintain those! And it would be even better if those tools and | techniques could escape the half dozen major data cloud companies | and be more directly accessible to the average small team. | Flatcircle wrote: | Seems like just yesterday, every business magazine's cover story | was about "big data." Wonder what the next batch of business | buzzwords will be? | fdgsdfogijq wrote: | "For more than a decade now, the fact that people have a hard | time gaining actionable insights from their data has been blamed | on its size." | | The real issue is that business people usually ignore what the | data says. Wading through data takes a huge amount of thought, | which is in short supply. Data Scientists are commonly | disregarded by VPs in large corporations, despite the claims | about being "data driven". Most corporate decision making is | highly political, the needs of/what's best for the business is | just one parameter in a complex equation. | gymbeaux wrote: | Replace "data scientists" with "software engineers" and you | have another accurate insight. They don't want to listen to us | about how to write software or derive value from data. | hn_throwaway_99 wrote: | Agree with some of what you've said, but disagree with a lot: | | > Most corporate decision making is highly political, the needs | of/what's best for the business is just one parameter in a | complex equation. | | 100%. Individual humans are emotional creatures with their own | wants and needs, and it's important to understand how | organizational incentives drive decision making.
| | > Data Scientists are commonly disregarded by VPs in large | corporations, despite the claims about being "data driven". | | This has not been my experience, though. The more common thing | I've seen is that, sometimes data is boring and doesn't really | show much actionable insight, but as everyone wants to justify | their job, I've seen data scientists come up with really | questionable conclusions that fell apart on further inspection | (call it "p-hacking the enterprise"). | | Plus, a lot of this data in these data warehouses is _messy_. | Oftentimes data scientists are siloed at the end of the | process, but then you get "garbage in/garbage out" results, | where there is some bug in data tracking that isn't understood | until it's too late. Much better in my opinion to have data | engineers and data scientists work much more closely with | product engineering teams up front so they can help ensure the | data they collect is accurate. | mritchie712 wrote: | Size isn't the real problem, it's time. | | Are you going to take the time / money to set up a warehouse, | get all the data into it with an ETL product, set up dbt or some | other transformation layer, set up a BI tool and build the | reports and dashboards, etc.? | | Regardless of the size of your data, you still need to get it in | one place and model it in a way it's actually usable. | didgetmaster wrote: | Exactly. It isn't just time to set up all the data in a way | that makes the right query possible. It is also having | queries fast enough to be able to run a vast number of them | in order to find what you are looking for (or even things you | were not looking for). | | https://didgets.substack.com/p/data-science-and-serendipity | olyjohn wrote: | Can't we just give it to that one IT guy down in the | basement? | mritchie712 wrote: | Hey, I used to be that guy (and still am).
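mritchie712's setup cost is exactly what the "small data" tooling attacks: at these sizes the whole warehouse/ETL/BI chain can collapse into a few lines of code. A minimal sketch, using the stdlib sqlite3 module as a stand-in for DuckDB (which runs the same kind of SQL directly over CSV/Parquet files); the table and numbers are invented:

```python
import csv
import io
import sqlite3

# A tiny in-memory "warehouse": no ETL product, no dbt, no BI tool.
raw = io.StringIO("customer,amount\nacme,120\nacme,80\nglobex,40\n")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
rows = [(r["customer"], int(r["amount"])) for r in csv.DictReader(raw)]
con.executemany("INSERT INTO orders VALUES (?, ?)", rows)

# The "report": total spend per customer, biggest first.
top = con.execute(
    "SELECT customer, SUM(amount) FROM orders"
    " GROUP BY customer ORDER BY 2 DESC"
).fetchall()
print(top)  # [('acme', 200), ('globex', 40)]
```

The point isn't that sqlite replaces a warehouse, but that for sub-100GB datasets the iteration loop didgetmaster describes (run many queries, fast, to find what you didn't know to look for) needs almost no infrastructure.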
| tracerbulletx wrote: | I find a lot of organizations don't have the discipline to | harness whatever power their data may have. Sure collect | everything, but god forbid you have any sort of data | governance, or spend a single resource minute of time manually | tagging or organizing or validating it. Then they try to make | shitty ML models or products out of it, but don't care if the | models actually work or not, just that they have AI now. Then a | year later when the model has provided no value they are like, | oh well big data is worthless I guess. | hef19898 wrote: | Palantir, you have to have Palantir. | | Oh, and a bunch of data scientists with zero domain knowledge | for whatever data they are analyzing, preferably with PhDs in | maths, but some ML background will do. And agile, because of | course all those Palantir dashboards can only be developed | using agile. | | Once all is said and done, zero insight was created but a | whole lot of consultants, contractors and project managers | have been paid handsomely, while some higher-ups can now put | "implemented agile and big data at X" on their LinkedIn | profiles. | snapetom wrote: | Agree on the size not being the issue. I transitioned from a | data engineering manager to a data product manager at a new | company. You know how much data a typical customer generates? | Less than 1TB a year. | | I told my VP that the engineering foul-ups in the current | product are easily fixable. Standard tooling and patterns exist | to re-architect and solve the bottlenecks. What is much harder | is finding a data architect to make sense of the complex data and make | sure there is good value for our customers. | | Guess what position I don't have on the team, and won't have | due to budget issues.
| aicharades wrote: | deleted- length | LeanderK wrote: | of course it won't and it's really ironic that you reply with | the next hype | slt2021 wrote: | I used to joke that Data Scientists exist not to uncover | insights or provide analysis, but merely to provide factoids | that confirm senior management's prior beliefs. | | I did several experiments, and noticed that whenever I produced | analysis that was in line with what management expected - my | analysis was praised and widely disseminated. Nobody would even | question data completeness, quality, whatever. They would pick | some flashy metric like a percentage and run around with it. | | Whenever my analysis contradicted them, there was so much scrutiny | of the numbers, data quality, etc., and even after answering all | questions and concerns, the analysis would be tossed away as | non-actionable/useless/etc. | | if you want to succeed as a Data Scientist and be praised by | management - you've got to provide data analysis that supports | management's ideas (however wrong or ineffective they might be). | | Data Scientist's job is to launder management's intuition using | quantitative methods :) | zeagle wrote: | Huh. I've not thought of it as laundering, but I think you've | basically summarised consulting in healthcare. Pay to | legitimize and push through a pre-existing idea (e.g. let's | close down a few ERs) or a delusion (e.g. lean, we don't need | a waiting room) and say it was recommended by consultants to | stakeholders and the public. | saltcured wrote: | Right, the more appropriate analogy is "parallel | construction"... | slt2021 wrote: | all consulting is like that, Partners/MDs at consulting | companies meet with Board/CEOs to get a rough idea of what | they want/need, and quickly negotiate a consulting | engagement contract to create a PowerPoint with all the | evidence and analysis gathered that supports the CEO's initial | idea.
| | This is the only reason why a 60+ slide PowerPoint | deck can cost several million dollars | dylan604 wrote: | How is this really different from any other aspect of life? | Very few people really like to be told counter information, | and it is always easier when providing data that aligns with | the current groupthink. Doesn't matter if it is business, | politics, or really anything. Being the outlier trying to | change the direction of things is a struggle. | sva_ wrote: | > I used to joke that Data Scientists exist not to uncover | insights or provide analysis, but merely to provide factoids | that confirm senior management's prior beliefs. | | Also someone to blame if it doesn't work out | sidpatil wrote: | > Data Scientist's job is to launder management's intuition | using quantitative methods :) | | https://www.youtube.com/watch?v=kAichhoZrKs | alar44 wrote: | I mean, that makes sense does it not? If you're confirming | something people already had a hunch about, why would they | challenge it? And if it does go against their belief, they | are going to want to make sure the data is correct before | they change the course of the ship. | pjmorris wrote: | > I used to joke that Data Scientists exist not to uncover | insights or provide analysis, but merely to provide factoids | that confirm senior management's prior beliefs. | | So, a synonym for 'consultant?' :) | didgetmaster wrote: | In the news business, if your story or opinion backs up the | preconceived notions of the investigative reporter then you | are a 'source'; otherwise you are a 'conspiracy theorist'. | suction wrote: | [dead] | stronglikedan wrote: | A consultant with the _data_ to back up their claims! | happymellon wrote: | My experience with consultants normally ended up with them | asking why they are there and what report should they | present to upper management. | | I've always used them as "independent 3rd parties" who were | listened to.
| EricE wrote: | "A prophet is not welcome in his hometown" | | This has been going on for millennia, unfortunately. | I_complete_me wrote: | Here's my take on this not listening to the "expert": | | A few years ago there was a problem with storm-water | infiltration into my (elderly at the time) mother's | property from her neighbor. I, being a dutiful son and a | civil engineer, investigated it and came up with the | probable cause, the likely effects of non-action and the | most cost-effective solution. I presented it to my mother | in the most layman-like terms that I could. She said | she'd think about it - meaning she'd refer - i.e. defer - | to her daughters. In the meantime I had a very layman-like | chat with my mother's carer and told her the | situation in layman's terms. The carer listened and said | that what I said made total sense to her. Later on, one | of my sisters accosted me and stated that it was | completely obvious what the problem and the solution were | - "even the carer could see it". Human foreheads don't | have the real estate for where my eyebrows wanted to | ascend. | | My advice is to consider whether the message should be | separated from the messenger somehow. | neuronic wrote: | As a consultant with roots in backend dev, I fully | understand the scrutiny that we receive because | unfortunately, it is often very warranted... It feels a | bit refreshing to read your comment and see someone | articulate what I am trying to convey to my clients. I am | a tool, and yes, this pun is intended. | jacobn wrote: | This sounds a lot like how my kids will listen to a | teacher/coach, but not their parents... | rxhernandez wrote: | Which is similar to how a lot of parents won't listen to | their kids but will listen to the coach, teacher, or | priest. | strbean wrote: | If Data Scientists are essentially in-house management | consultants, I wonder which is cheaper?
| slt2021 wrote: | This could be a reason why Data Scientist as a job title | exploded in recent years, every middle manager could afford | one/two/few headcounts of data scientists to produce | analysis that advances that middle manager's corporate | agenda (more growth, empire building, expansion to | certain de-novo areas, etc). | | Recent tech layoffs are the other side of that growth, | when cheap money is gone and companies are forced to stick | to core competencies and shut down growth plans | htrp wrote: | All symptoms of the same problem..... you can hire McKinsey | to confirm your priors, massage the data to confirm your | priors, or anything in between. | disqard wrote: | An interesting essay that echoes these same sentiments: | | https://ryxcommar.com/2022/11/27/goodbye-data-science/ | monero-xmr wrote: | Also a big reason McKinsey and BCG exist - provide cover for | business plans intended by management to protect them from | shareholder lawsuits. My friend did a sojourn at McKinsey and | 6 months of his life were spent producing PowerPoints and memos | backing up an expansion to the APAC region. It was already | happening, but he was providing all manner of business | justification for board meetings and whatnot. | adasdasdas wrote: | Oh the experiment didn't go as expected? Rerun 5 more times | with minor tweaks. It's definitely not p-hacking ;). | data-ottawa wrote: | I've been there, we wanted to release a feature, it kept | coming back with issues that made it perform much worse | than control, after 5 or so iterations with bug fixes it | came back positive. | | It took a lot of analysis and time to clarify to higher-ups | that we weren't just p-hacking, but at least they were | concerned about that. | peatmoss wrote: | > Data Scientist's job is to launder management's intuition | using quantitative methods | | Ouch. This is savage, but sadly correct in many cases.
| | HOWEVER, to play devil's advocate here, I've also seen | corporate data scientists overstate the conclusions / | generalizability of their analysis. I've also seen data | scientists fall prey to believing that their analysis proves | what _should_ be done, rather than what is likely to happen. | | The role of an executive or decision maker is to apply a | normative lens to problems. The role of the data scientist / | economist / whatever is to reduce the uncertainty that an | action will have the desired effect. | steveBK123 wrote: | Yes. I worked in the data org of a moderately sized financial | firm's tech org. The tech org claimed to be hugely data | driven. It was in the org mottos and all of that. | | Nonetheless, the CTO went on a multi-year, 10s of millions of | dollars, huge data tech stack & staffing reorg shake-up... | with really zero data points explaining the driver, or what | we would measure to determine it was successful. | | So it became a self-referential decision that we are | successful by doing what he decided, and we are doing it | because he decided it. | throwaway15908 wrote: | Sounds like a good way to weed out middle management ;) | xkcd1963 wrote: | Reminds me of the book "Bullshit Jobs" | karmajunkie wrote: | "Data launderer" would be a good job title... | Phrenzy wrote: | That would imply the data is clean when they are finished. | GIGO | hermitcrab wrote: | Most data is very dirty. | wolf550e wrote: | The data is not laundered. Preconceived ideas and biases | are laundered and given scientific-sounding justifications. | e12e wrote: | Concept Confirmer, Bias Booster? | barbecue_sauce wrote: | Context Provider. | e12e wrote: | Affirmation Artificier? | MonkeyClub wrote: | Assessment Assurance, for double the bang. | prometheus76 wrote: | I agree with you. I call Data Scientists "soothsayers for the | Pharaoh".
| chongli wrote: | Same goes for economists and the politicians who sponsor | them, just as it did for the astrologers and their patron | kings. | out_of_step wrote: | This phenomenon is true to varying degrees in academic | medicine (maybe all of academia) as well - personally have | seen excellent data and methods disregarded when they don't | confirm existing agendas. The choice for the researcher can | become one of burning out trying to do good work and getting | nowhere, or acquiesce and only present data that is | uncontroversial. Huge existential threat to knowledge | advancement. | terry_y_comb wrote: | Indeed, confirmation bias happens to almost everyone. | m463 wrote: | I wonder what a data scientist could really find out about | executive (over?) compensation. employee compensation. | working from home. office cubicle size and layout. tool | expenditure for employees vs productivity. | ProjectArcturis wrote: | How would you measure productivity at scale? | tpoacher wrote: | sadly same with academia and funding sources | koolba wrote: | > if you want to succeed as a Data Scientist and be praised | by management - you got to provide data analysis that | supports managements ideas (however wrong or ineffective they | might be). | | > Data Scientist's job is to launder management's intuition | using quantitative methods :) | | It's no different than the days when grey bearded wisemen | would read the stars and weave a tale about the great glory | that awaits the king if he proceeds with whatever he already | wants to do. | | The beards might be a bit shorter or nonexistent, but the | story hasn't changed. | Guthur wrote: | And the alternative is to use the data as bones, throw it | up in the air and let it tell you what to do? | red-iron-pine wrote: | And you'd better hope the bones actually say something | useful. | | I was the infra lead on a data lake project and got take | part all the way to breaking down the data and turning | into PowerBI reports. 
The result was "sell more", to | clients whom marketing had already identified, years ago, as | whales. | | There were some interesting other insights, esp. w/r/t | niche products that sold around weird dates (Easter, | Memorial Day, 4th July -- but not obvious gift days like | Valentines or X-Mas), but it led to a lot of "you're | doing it wrong!" recriminations and follow-up projects. | bronson wrote: | Absolutely. If you don't like what K-Means is telling | you, change a variable and re-run. (that's one great | thing about business data: there's no shortage of | variables! True, there's usually a shortage of | independent variables, but fixing that is difficult and | underfunded). | SkyBelow wrote: | This isn't just "Data Scientist" but scientist as well. The | more a finding is in contradiction, either with existing | scientific consensus or even with just popular culture, the | more the science is criticized. I've seen unequal criticism | based on how much people wanted the results to be true/false | and even after responding to the criticism I've seen people | just ignore science they don't like. | | The skepticism isn't the problem; the unequal application of | it, the potential to harm careers, and the chilling effect as | people wise up to how best to meet their own personal goals are. | timbaboon wrote: | As a data scientist at a large corporate I find this is often | the push... but I resist every time and tell people what they | don't want to hear. Maybe I'm playing this whole corporate | ladder thing wrong :/ | dukeofdoom wrote: | That's like a portrait artist that finds success by painting | people more beautiful than they really are vs a starving one | that paints them true to life due to a sense of artistic | integrity. Reminds me how Garth Brooks started doing metal | after becoming a country music star. | eska wrote: | This would match what psychologists say about humans in | general: we feel first, then we use our brain to justify that | feeling.
We're not rational beings. | AnIdiotOnTheNet wrote: | We totally are, it's just that rationality is a tool, not a | guide. If you want to work out the truth, rationality will | help you do that, but if instead you want to justify a | decision you already made, well, it'll help you do that | too. | galangalalgol wrote: | Hypotheses don't come from rationality either; they | result from well-informed intuition. All of the formality | of science is about tricking ourselves into discovering | our intuition is wrong using a rational series of steps | even when everything in our nature is to use that ability | to reason to do the opposite. | diognesofsinope wrote: | I think the answer is simpler: people care about their | careers and their family first. Think, "If the data says | something that gets in the way of my career, well, I don't | care about the data." | | Had the same problem when I was an economics researcher -- | publication bias for what stakeholders want to hear (often | the government) is rampant because that's where funding for | the economics department mostly comes from. | trieste92 wrote: | or is it that psychologists _feel_ that we aren't rational | and use reason to justify this? | | it isn't clear to me how the grounds for realizing the | theory are reconcilable with its conclusion | moremetadata wrote: | That's because psychologists don't understand, or choose to | ignore, how chemistry influences our personalities and | emotions. An extremely simple example from the same | medical/health profession is the use of SSRIs to make | people feel happy. The legal system recognises how | chemicals influence our feelings because of the laws that | exist on illegal drugs or drink driving.
| | The definition of rational is being informed enough to know | what said chemicals will do in the short term and long term | in order to make an informed decision, but then I'm | reminded we don't get taught any of the above unless we | specialise at a Uni, so most people can't make any sort of | informed decision. | ArjenM wrote: | A whole industry of emotional branding is thriving, | systematically overloading our brains so that it hurts to even | think differently in the moment. | | We are accepting a whole lot of assumptions every day. | remus wrote: | I think this depends a lot on the org. In a place I used to | work we collected and analysed a lot of data which convinced | management to significantly change the spec of the product | and spend a lot more time and effort on testing, because the | product was being used in unexpected ways. | | I would say it was a very engineering-driven org however, so | if you could present compelling data it could go a long way. | boh wrote: | The other thing to consider is that the data may simply have nothing of | value. Part of the marketing of big data is the almost fairy-tale | belief in "insights" existing in any data set if you just | look hard enough. | ryguyrg wrote: | Correct, much data has no value. The cost for storing the | data, maintaining the data [in the day-and-age of privacy | requirements especially], and combing through the data is | often much greater than the value obtained from the data | itself. | | The expertise we need in the industry is people who | understand applications in-and-out and make great decisions | on what data is worth keeping for present and future | applications. And what data needs to be kept, but only in | aggregates (or anonymized, which reduces costs of | maintenance). | revolvingocelot wrote: | It's absolutely this. "Decision-based evidence-making" is what | I've seen it called. | hgsgm wrote: | "We make decisions based on data, so let's use mine."
| midasuni wrote: | Take random words that Brent Spiner has said and mash them | into a video. | Consultant32452 wrote: | I don't want you to tell me what the data says, I want to tell | you what the data says and you go find data that confirms it. | | kthnxbye | capableweb wrote: | Is that what's happening at Amazon as well? They seemingly | are losing more and more track of the "Customer Obsession" | schtick. | bluedays wrote: | Their mission hasn't changed. They're still obsessed with | customers, just not the way you think. | aintgonnatakeit wrote: | They are encouraging their customers to have a bias toward | action. Away from that asshole Bezos. | fuzzylightbulb wrote: | "customer obsession" was always at the mercy of the real | obsession: "making money hand over fist". The former will | ALWAYS lose out to the latter given enough cycles. | [deleted] | pphysch wrote: | Yeah, "customer obsession" really just means "market share | / growth obsession" which is a means to (eventually) making | monopoly profits. Which Amazon seems to have achieved. | LarryMullins wrote: | All the amazon corporate values are derived from making | profit. Two-pizza teams? More like "three slices isn't | frugal", one or two should be enough for you. | ethbr0 wrote: | One may follow the other, but not vice versa. | | It's a pretty strong argument to say that Microsoft under | Gates was technically obsessed, but that really faltered | under Ballmer. | | Microsoft continued to win profits, but they made major | strategic missteps that cost them revenue. | | Amazon feels like it's going down the same path: | empowering the tree-gazers without remembering that the | forest also matters. | [deleted] | ren_engineer wrote: | "data" is usually just twisted to make leadership look good or | justify what decision they wanted to make anyway. Analytics | data is sliced at arbitrary time periods to make growth in | whatever metric look good, certain subsets are just removed, | etc.
| | doesn't help that most of this data goes through multiple | layers of BS where each person is putting it through filters to | make themselves look better. And a good chunk of people don't | have enough understanding of stats to understand when they are | being tricked. | commandlinefan wrote: | > a huge amount of thought | | Which itself both takes time and is wildly unpredictable - | neither of which plays well with today's Taylorist management | schemes. | ralph84 wrote: | Big Data got replaced by Big Parameters. | hgsgm wrote: | Parameters come from data. | therealbilly wrote: | I think server hardware solved the big data issue. The stuff we | have now can blitz through data in the blink of an eye. For | national governments like our own, mainframes still have a place. | For me personally, I don't even talk about big data anymore. | revskill wrote: | The main goal of Big Data as I see it is to profile performance and | metrics: number of user registrations, number of converted | users,... | lern_too_spel wrote: | People don't want to deal with having to rearchitect when their | workload does not fit on a single instance. Yes, optimize for the | small data case, but if you build a product that can handle only | the small data case, you have a tough sell. | [deleted] | rvieira wrote: | What about IoT? | blakeburch wrote: | Great post and really resonates with my experience. Good to have | some confirmation that most organizations aren't using their | large swaths of data. | | Although I don't think most organizations are blaming lack of | actionable insights on the data size. It's the lack of | prioritizing data usage over data accessibility. We need to be | teaching data people business levers and teaching business people | data levers. | | Data should be a byproduct of an actionable idea that you want to | execute. It shouldn't exist until you have that experiment in | mind.
| bfrog wrote: | This reminds me of a great blog post by Frank McSherry | (Materialize, timely dataflow, etc) talking about how using the | right tools on a laptop could beat out a bunch of these JVM | distributed querying tools because... data locality basically. | | https://github.com/frankmcsherry/blog/blob/master/posts/2015... | cmrdporcupine wrote: | From about 2008/2009/2010 or so on there was perhaps an over- | emphasis on specialized tools for the mass acquisition of streams | of data. Maybe in large part due to the explosion of $$ in ad- | tech. Some people had legitimately insane click/impression | streams -- I worked at a couple companies like that. Development | of DBs based on LSM trees or other write-specialized storage | structures became important. Existing relational databases | weren't particularly well built for this stuff. This was part of, | but not the whole story with the whole NoSQL thing. People were | willing to go completely denormalized in order to gain some | advantage or ability here. It helped that much of the data looked | at was of perhaps little structural complexity. | | In the meantime SSD storage took off, so the IOPS from a stock | drive have skyrocketed, business domains for large data sets have | broadened beyond click/impression streams, and the challenge now | is not "can I store all this data" it's "WTH do I do with it?" | | Regardless of quantity of data, structuring and analysis and | querying of said data remains paramount. The challenge for | anybody working with data is to represent and extract knowledge. | I remain convinced that logic -- first order logic and its | offshoot in the relational model -- remains the best tool for | reasoning about knowledge. Codd's prognostications on data from | the 1970s are still profound. | | I think we're in a space now where we can turn our attention to | knowledge management, not just accumulating streams of | unstructured data. 
The challenge in a business is to discover and
| capture the rules and relationships in data. SQL is an existing
| but poor tool for this, based on some of the concepts in the
| relational model but tossing them together in a relatively
| uncomposable and awkward way (though it remains better than the
| dog's breakfast of "NoSQL" alternatives that were tossed
| together for a while there.)
|
| My employer is working in this space, I think they have a
| really good product: https://relational.ai/
| cubefox wrote:
| This is a bit ironic given that generative AI models like GPT-3
| and Dall-E only work because they were trained on very large
| datasets.
| hugesniff wrote:
| "Very often when a data warehousing customer moves from an
| environment where they didn't have separation of storage and
| compute into one where they do have it, their storage usage
| grows tremendously..."
|
| Can someone explain why this is the case? Is it due to more
| replication or maintaining more indices?
| datan3rd wrote:
| Detailed web event telemetry is where I have seen the "biggest"
| data, not application-generated data. Orders, customers,
| products will always be within reasonable limits. Generating
| 100s of events (and their associated properties) for every
| single page/app view to track impressions, clicks, scrolls, and
| page-quality measurements can get you to billions of rows and
| TBs of data pretty quickly for a moderately popular site.
| Convincing technical leaders to delete old, unused data has
| been difficult; convincing product owners to instrument fewer
| events is even harder.
| AaronBBrown wrote:
| The truth is that most "big data" problems aren't big and can
| often be solved with awk and xargs.
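As a sketch of what the awk + xargs approach above can look like in practice (file names, field layout, and data here are invented for illustration): each log file gets its own awk pass that emits partial per-key counts, and a final awk merges the partials.

```shell
# Toy event logs: one "user action" pair per line (invented data).
printf 'alice click\nbob view\nalice view\n' > events_1.log
printf 'bob click\nalice click\n' > events_2.log

# Fan out one awk per file (up to 4 in parallel), each emitting
# partial per-user counts, then merge the partials in a final awk.
ls events_*.log \
  | xargs -n 1 -P 4 awk '{ c[$1]++ } END { for (u in c) print u, c[u] }' \
  | awk '{ t[$1] += $2 } END { for (u in t) print u, t[u] }' \
  | sort
# prints:
# alice 3
# bob 2
```

The shape is the same partition/map/merge that the heavyweight frameworks automate; at the data sizes most shops actually have, the pipeline finishes before a cluster job would even be scheduled.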
| zX41ZdbW wrote:
| My presentation from FOSDEM 2023 is very sympathetic to the
| "Big data is dead" statement:
| https://www.youtube.com/watch?v=JlcI2Vfz_uk
|
| It is about using modern tools (ClickHouse) for data
| engineering without the fluff - when you can take whatever
| dataset or data stream and make what you need without the need
| for complex infrastructure.
|
| Nevertheless, the statement "big data is dead" is
| short-sighted, and I don't entirely share this opinion.
|
| For example, here is one of ClickHouse's use cases:
|
| > Main cluster is 110PB nvme storage, 100k+ cpu cores, 800TB
| ram. The uncompressed data size on the main cluster is 1EB.
|
| And when you have this sort of data for realtime processing, no
| other technology can help you.
| fredliu wrote:
| The title might be hyperbole (intentionally), but the
| observations are more or less in line with what I experienced
| through a few of the Big Data initiatives over the years in
| different enterprise environments (although I have reservations
| about the 1%er comment). To me, Big Data was never about how
| "big" the data was, but more about the tools/systems/practices
| needed to overcome the limitations of the previous generation.
| From that perspective, yes, the "monolith" may be making a
| "comeback" for now due to the improvement of underlying
| single-node performance. But I do think data sizes will keep
| growing, and everything needed to make Big Data work will still
| be there when the pendulum swings back to where a single node
| can't handle it anymore.
| freedude wrote:
| "Among customers who were using the service heavily, the median
| data storage size was much less than 100 GB"
|
| Eye-opening. Especially when combined with a recent quote from
| Satya Nadella: "First, as we saw customers accelerate their
| digital spend during the pandemic, we're now seeing them
| optimize their digital spend to do more with less."
|
| Conclusion: SaaS is easy to drop off in downturns.
Just as easy
| as it is to buy initially.
| carlineng wrote:
| MotherDuck has been making the rounds with a big funding
| announcement [1], and a lot of posts like this one. As a
| life-long data industry person, I agree with nearly all of what
| Jordan and Ryan are saying. It all tracks with my personal
| experience on both the customer and vendor side of "Big Data".
|
| That being said, what's the product? The website says
| "Commercializing DuckDB", but that doesn't give much of an idea
| of what they're offering. DuckDB is already super easy to use
| out of the box, so what's their value-add? It's still a super
| young company, so I'm sure all that is being figured out as we
| speak, but if any MotherDuckers are on here, I'd love to hear
| more about the actual thing that you're building.
|
| [1]: https://techcrunch.com/2022/11/15/motherduck-secures-
| investm...
| dangwhy wrote:
| > DuckDB is already super easy to use out of the box, so what's
| their value-add?
|
| I think this is the analytics equivalent of edge computing.
| Instead of one big cluster crunching numbers:
|
| 1. User requests a bunch of analytics
|
| 2. Server assembles a DuckDB file
|
| 3. Sends this down to the user's laptop
|
| 4. User runs local queries on the DuckDB file
|
| 5. Go to step 1 for more analytics
| jtigani wrote:
| We're being a bit hand-wavy with the offering while we're in
| "build" mode, because we don't want to sell vaporware. DuckDB
| is easy to use out of the box, but so is Postgres, and there
| are plenty of folks building interesting cloud services using
| Postgres, from Aurora to Neon. And as many people will point
| out, DuckDB is not a data platform on its own.
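The ship-a-database-file flow dangwhy describes above can be sketched in a few lines. This is a hypothetical illustration using Python's stdlib sqlite3 module as a stand-in for DuckDB (table, column, and file names are invented); the pattern is the same either way: the server materializes a small single-file extract, ships it down, and every subsequent query runs locally on the user's machine.

```python
import sqlite3

def build_extract(path):
    """'Server side': assemble a small single-file analytics extract."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE daily_orders (day TEXT, orders INTEGER)")
    con.executemany("INSERT INTO daily_orders VALUES (?, ?)",
                    [("2023-02-06", 120), ("2023-02-07", 95)])
    con.commit()
    con.close()

def query_locally(path):
    """'Laptop side': after the file is downloaded, queries are local."""
    con = sqlite3.connect(path)
    total = con.execute("SELECT sum(orders) FROM daily_orders").fetchone()[0]
    con.close()
    return total

build_extract("extract.db")
print(query_locally("extract.db"))  # 215
```

Step 5 (more analytics) is just another round trip: a new extract is built only when the user needs data the local file doesn't already contain.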
| | For a preview of what we're doing, on the technical side, a
| couple of our engineers gave a talk at DuckCon last week in
| Brussels; it is on YouTube here:
| https://www.youtube.com/watch?v=tNNaG7e8_n8
|
| (for context, I'm the author of this blog post and co-founder
| of MotherDuck)
| danielmarkbruce wrote:
| Deliberately speculating so someone will correct it: I'd guess
| they'll make a bunch of enterprise tools to do things like:
| enable access and sync the data in a way which complies with
| various policies, encrypt/tokenize/hide certain columns,
| monitor queries, ensure data is encrypted at rest, stuff like
| that.
|
| Assuming the above is true: I'll bet the reason they aren't so
| loud about exactly what they are doing is they want to get a
| head start on it. In theory anyone can build this stuff around
| DuckDB. From a marketing perspective the clever thing to do
| would be to drive up usage of DuckDB while they build out all
| this functionality, and then the minute corporates start seeing
| problems with their people using it (compliance etc.), they
| have the solutions.
| carlineng wrote:
| I'd wager you're right. All the "boring" stuff that's actually
| very complicated/difficult, and without which no large
| enterprise will adopt a technology.
| threeseed wrote:
| Especially since enterprise companies hate the idea of shifting
| large amounts of highly sensitive company data onto commonly
| lost and misplaced work laptops.
|
| If you're going to do that you better have your security and
| governance on point.
| vgt wrote:
| Shoot me a note and I'm happy to fill you in!
|
| (PM at MotherDuck)
| carlineng wrote:
| Would love to! How do I get in touch? My contact info is in my
| profile.
| vgt wrote:
| done!
| danielmarkbruce wrote:
| Is there a reason you can't post it here?
| anon223345 wrote:
| Long live big data!
| Agingcoder wrote:
| I remember the big data craze.
People had very little data, and low
| quality at that, so they had a data problem before they had a
| big data one!
| mejakethomas wrote:
| Yes! This!!!
|
| Volume != Quality
| kthejoker2 wrote:
| So the argument is you can do everything with an OLAP database
| because we shrunk "Big Data" back inside RAM?
|
| K, good luck!
| spaintech wrote:
| It's not that big data is dead, more that real-time data is
| coming to life, but you need the old stuff around to make a
| buck or two... Well, that's my view. LLMs and transformer-model
| techniques are making data more relevant than ever. If you are
| a business, well, you are in for a "now real" digital
| transformation.
|
| Making data the centerpiece of your business could mean that
| the effectiveness of your business processes increases by
| several orders of magnitude. Funny thing is, you will not use
| someone else's model - unless you are just building a chatbot
| to infer - but you will need to build your own model, trained
| on your own business processes, to be successful.
|
| Consider a bank; here is my prediction of expected outcomes:
|
| Enhanced Customer Experience: The system can act as a virtual
| banking assistant, providing customers with instant access to
| their account information, real-time transactions, and balance
| updates. The system can also answer customer inquiries and
| provide relevant information, improving the overall customer
| experience.
|
| Improved Fraud Detection: The system can monitor the bank's
| financial transactions in real-time and identify any potential
| fraud, helping the bank reduce its exposure to financial
| losses.
|
| Automated Loan Processing: The system can analyze loan
| applications, credit scores, and other relevant data to approve
| or reject loan applications in real-time, reducing the time and
| effort required for manual loan processing.
Personalized
| Marketing: The system can analyze customer behavior,
| transaction history, and demographic information to provide
| personalized marketing and cross-selling opportunities,
| increasing the bank's revenue and customer loyalty.
|
| Real-Time Insights: The system can provide real-time insights
| into the bank's financial performance, customer behavior, and
| market trends, enabling the bank to make informed decisions and
| respond to market changes quickly.
|
| What is interesting to me is, this is just the beginning of
| what could be...
| mr_tristan wrote:
| Yeah, I've noticed more applications just need to focus on
| making sense of raw information really quickly, but usually
| don't need an archive to make decisions.
|
| There are lots of interesting things that can happen with "big
| streaming" rather than "big data". Like, cybersecurity is
| evolving toward monitoring and reacting to what everyone's
| machine has been doing in the last 15 minutes, instead of
| having a huge database of hashes you trust. But not a ton of
| things really utilize what happened, say, 10 years ago on
| people's machines.
|
| There are definitely some things that can use massive archives
| of old data, but I have found far, far fewer things that would
| benefit from it, and often that comes with some very big
| maintenance hassles. Most of the time, you can just set data
| retention to 30 days and be done.
| threeseed wrote:
| I assume you've never actually worked at a bank.
|
| They've been working to implement your ideas for decades and
| none of it requires LLMs or any machine learning techniques.
| Basic old ETL is more than sufficient.
|
| The issue is that (a) the calculations they need to perform are
| complex and take time to run, (b) there are financial
| regulations that weave their way through those systems, and (c)
| there is a lot of legacy code, especially in the core ledger
| system, which "just works" and people are reluctant to touch.
| | That said, depending on your bank you can get real-time
| account activity, loan approvals in < 5 minutes, etc.
| spaintech wrote:
| Well, that is an understatement. I do agree with you that banks
| have been trying to fix decades-old applications.
|
| But in this process, you don't need ETL, nor all the process
| and development, to accomplish these ideas. Conceptually the
| idea builds itself (it learns how to treat the data) - quite
| revealing, and near real time. Provided you account for
| security and privacy, you basically shift your input into the
| data stream and, using natural language, get the data output
| you need - no clunky apps.
|
| Imagine I just log in and say:
|
| me> How much do I have?
| bank> You have $100
| me> Please send $50 to 1003
| bank> Are you sure? Please add your security code to confirm
|
| bla bla...
|
| All this with little intervention.
|
| Banks spend hundreds of man-hours developing a lacking
| application while delivering a very poor customer experience.
| They spend millions on running decades-old applications because
| it is so expensive to change them... and thus the circle
| continues...
|
| I'm really excited to see databases disappear conceptually -
| data entry, mostly all of that, just disappear... I will ask my
| chatbot for a statement, ask it for personal investment advice,
| and have it classify all my purchases and see where my wife has
| been spending all my money, all from the comfort of my phone.
|
| It's a brave new world we are waking up to; that to me is
| exciting. And coming from having helped several major banks
| build their infrastructure, it's just a boost to talk about
| something fresh - no more hypervisors, core counts, db
| licenses, etc. OK, I'll concede it's pretty much the same old,
| just the mnemonics will be different... How many GPUs, how
| quickly can you spin up a container, how fast is your S3
| datastore... oh wait, there is that circle again...
>:D
| threeseed wrote:
| So you're not actually talking about back-end systems but about
| the front-end.
|
| In that case, chat-bots have existed for years and consumers
| largely don't like them.
|
| In your scenario you can transfer money in a few clicks rather
| than having to write out an entire conversation.
___________________________________________________________________
(page generated 2023-02-07 23:00 UTC)