[HN Gopher] Qsv: Efficient CSV CLI Toolkit
___________________________________________________________________
Qsv: Efficient CSV CLI Toolkit

Author : s1291
Score  : 56 points
Date   : 2023-12-22 12:50 UTC (1 days ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| foehrenwald wrote:
| related: https://github.com/johnkerl/miller
|
| I am wondering who really uses these tools and for what since
| there are R and python data science tools available?

| snidane wrote:
| Out of core computations. While your python and R script will
| choke after reading a few hundred megs, my compiled binary cli
| will keep streaming through many such files with memory usage
| sitting somewhere near zero.

| mbreese wrote:
| That's just the effect of streaming IO vs reading in the file
| into memory all at once. That has nothing to do with the
| language you use, but with how you process the data.
|
| I keep multiple little Python scripts around to do things
| like sum lists of numbers (think extracting a column with
| awk, then calculating a sum). Compiled vs an interpreted
| script really doesn't matter. What matters is using the right
| algorithm for the job. R and Python data science libraries
| like to read in all of the data at once into one single data
| structure. That's the anti-pattern to avoid if at all
| possible.
|
| (But they are very handy for small datasets of complex
| calculations that require the entire dataset in memory.)

| hermitcrab wrote:
| Also: https://github.com/BurntSushi/xsv
| https://csvkit.readthedocs.io/en/latest/

| dima55 wrote:
| For simple analyses (i.e. what most people do most of the time)
| doing this on the commandline gets you there faster. I use
| vnlog (https://github.com/dkogan/vnlog/). By the time you fired
| up your editor to write your Python code, I already have
| analyses and plots ready.

| fbdab103 wrote:
| I write Python every day, but still use miller here and there.
| If I am doing a "simple" operation (eye of the beholder), being
| able to pipe it on the command line is great.
|
| To do a comparable amount of manipulation in Python takes a lot
| more boilerplate (imports, command line arguments,
| deity-can-we-default-to-Int64 already?, etc), plus you have to
| ensure you have a virtual environment with correct dependencies.
| Which is more or less standard numpy+pandas, but a single
| executable tool to do some data workup is always appreciated.
|
| I am never performance constrained, and I have been told that
| miller is one of the slower tools in this space, but I still
| reach for it due to its wide format support.

| dima55 wrote:
| An incomplete list of other similar tools:
| https://github.com/dkogan/vnlog/#description

| alchemist1e9 wrote:
| Here is a related but more obscure tool that can be
| surprisingly useful.
|
| http://hopper.si.edu/wiki/mmti/Starbase
|
| Their tbl format is so trivially close to standard csv that I
| just convert on the fly back and forth with tiny helper perl
| scripts.

| alchemist1e9 wrote:
| Wow! This looks like a really complete set of operations and
| extremely useful.

| snidane wrote:
| This looks great!
|
| Please consider removing any implicit network calls like the
| initial "Checking GitHub for updates...". This by itself will
| prevent people from adopting it or even trying it any further.
| This is similar to gnu parallel's --citation, which, albeit a
| small thing, will scare many people off.
|
| Consider adding pivot and unpivot operations. Mlr gets it quite
| right with syntax, but is unusable since it doesn't work in
| streaming mode and tries to load everything into memory, despite
| claiming otherwise.
|
| Consider adding a basic summing command. Sum is the most common
| data operation, which could warrant its own special optimized
| command, instead of offloading this to an external math
| processor like lua or python.
| Even better if this had a group by (-by) and window by (-over)
| capability. E.g. 'qsv sum col1,col2 -by col3,col4'. Brimdata's
| zq utility is the only one I know that does this quite right,
| but is quite clunky to use.
|
| Consider adding a laminate command. Essentially adding a new
| column with a constant. This probably could be achieved by a join
| with a file with a single row, but why not make this common
| operation easier to use?
|
| Consider the option to concatenate csv files with mismatched
| headers. cat rows or cat columns complains about the mismatch.
| One of the most common problems with handling csvs is schema
| evolution. I and many others would appreciate it if we could
| merge similar csvs together easily.
|
| Conversions to and from other standard formats would be
| appreciated (parquet, ion, fixed width lengths, avro, etc.).
| Other compression formats as well, especially zstd.
|
| It would be nice if the tool enabled embedding outputs of
| external commands easily. Lua and python builtin support is nice,
| but probably not sufficient. I'd like to be able to run a jq
| command on a single column and merge it back as another, for
| example.
|
| Inspiration:
| - csvquote: https://news.ycombinator.com/item?id=31351393
| - teip: https://github.com/greymd/teip

| quasarj wrote:
| Wait, who is scared off by parallel's --citation?

| fbdab103 wrote:
| I refuse to use parallel due to that obnoxiousness.
|
| At minimum, it is not installed by default, so it is already
| a negative to just using xargs. That it then puts that
| barrier in my way makes it an easy tool to skip.

| quasarj wrote:
| I just don't understand what barrier you are talking about.
| I just checked, it doesn't even whine at you when you use
| it, the help just notes that you should cite it if you
| publish a paper where you used it. And... anyone publishing
| papers knows about citation requirements lol. Anyone else
| can ignore it. What is this barrier?

| dima55 wrote:
| In addition to being annoying, it raises questions about
| whether it is free software or not. Some people care a
| whole lot about that. And some people have higher
| standards about being nagged. And lots and lots of time
| was spent discussing solutions, for instance:
| https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=915541

| quasarj wrote:
| Ah, I see, they have changed it (or possibly the version
| on my system has had the --will-cite patched out, as
| discussed in this bug).
|
| Okay, I accept your argument about Free Software.
| However, I find it interesting that it's a GNU project...
| they are generally the most hardline Free Software
| people.

| fbdab103 wrote:
| To slippery slope this, what happens if more tools start
| adopting this behavior? Curl now asks you to buy Daniel
| Stenberg a coffee on each use. Wget asks you to support
| Ukraine. Caddy wants you to invest in their startup. Each
| of which may come with their own `--ignore-annoyance-flag`
| I need to learn. The best I can do is vote with my feet.
|
| I also do not care for the citation requirement. I
| utilize tons of tools in my work which go unstated. I do
| not feel the need to cite Linux, DNS, htop, Make, Diet
| Coke, my Kinesis keyboard, etc. Sadly, reliable plumbing
| gets no respect. Especially for a tool which is more or
| less interchangeable with some shell scripting. Unless I
| am trying to shore up the references list, I am going to
| cite directly relevant work.
|
| At some point, you no longer need to note that your work
| was powered by electricity.

| jasonjayr wrote:
| Vim has solicited donations for Uganda since forever.

| dima55 wrote:
| You can get quite far by piping to other tools and/or using
| DSLs. Pivoting can almost certainly be done by the luau support
| in qsv (or `vnl-filter`, for instance). Summing and grouping is
| something that `datamash` does well (or qsv luau probably, or
| `vnl-filter --eval`).
| Adding a column once again can be done with luau or
| `vnl-filter`.
|
| Would you be more likely to use this tool if it had even more
| stuff in it requiring reading even more documentation? That's a
| genuine question.
___________________________________________________________________
(page generated 2023-12-23 23:00 UTC)
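
Two points from the thread lend themselves to a small illustration: mbreese's remark that streaming versus loading everything into memory matters more than the language, and snidane's proposed grouped sum ('qsv sum col1,col2 -by col3,col4', which is a suggested syntax in the comment, not a confirmed qsv command). The following is only a minimal sketch in Python under those assumptions. The function name `streaming_group_sum`, the column names, and the sample data are hypothetical and chosen for illustration.

```python
import csv
import io
from collections import defaultdict


def streaming_group_sum(lines, sum_cols, by_cols):
    """Sum sum_cols for each distinct combination of by_cols.

    Rows are consumed one at a time, so memory grows with the number
    of distinct groups rather than with the number of rows.
    (Hypothetical helper, not part of qsv or any other tool.)
    """
    totals = defaultdict(lambda: [0.0] * len(sum_cols))
    for row in csv.DictReader(lines):
        key = tuple(row[col] for col in by_cols)
        acc = totals[key]
        for i, col in enumerate(sum_cols):
            acc[i] += float(row[col])
    return dict(totals)


# Made-up sample input; a real run would pass an open file or sys.stdin.
sample = io.StringIO(
    "region,item,qty,price\n"
    "eu,a,1,2.5\n"
    "us,b,2,1.0\n"
    "eu,c,3,0.5\n"
)
result = streaming_group_sum(sample, ["qty", "price"], ["region"])
for key, sums in result.items():
    print(key, sums)
```

Because `csv.DictReader` accepts any iterable of lines, the same function works on a pipe, which is how it would slot into the command-line workflows discussed in the thread.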