[HN Gopher] Show HN: Igel - A CLI tool to run machine learning w...
___________________________________________________________________

Show HN: Igel - A CLI tool to run machine learning without writing code
Author : nidhaloff
Score : 286 points
Date : 2020-10-03 12:23 UTC (10 hours ago)
web link (github.com)

| djhaskin987 wrote:
| So a lot like Weka then. https://www.cs.waikato.ac.nz/ml/weka/

| iamflimflam1 wrote:
| Keep going with this, I think you are onto something.

| mxscho wrote:
| > A machine learning tool that allows you to train/fit, test and use models without writing code
|
| I recently had a discussion about the requirements that a text file format (like YAML) has to fulfill to be considered "code". :)

| nidhaloff wrote:
| Hi, and what was the result/conclusion of the discussion? I'm interested in your findings: is it considered code or not? :D

| mxscho wrote:
| Well, we came to the conclusion that there is no hard border, and therefore no good answer to that question.
|
| But we also agreed that whether it's code, a graphical user interface or a command line interface is not the most important factor in making a tool usable for a lay person. What's more important is that the entry point is easy, and that the complexity and flexibility are abstracted away in layers that do not have to be fully understood from the beginning, so that the learning curve is not too steep.
|
| Of course, my first post was not meant as criticism of the project, just some pseudo-philosophical thoughts that crossed my mind when reading that sentence. Sorry for being too off topic with that. :)

| jeroenjanssens wrote:
| Reminds me of SKLL: https://github.com/EducationalTestingService/skll

| marcinzm wrote:
| I feel like the places where a non-technical user would be building non-trivial models also have the money to pay for one of those commercial drag-and-drop GUI tools.
| ericpts wrote:
| Usually the hardest part of a learning pipeline is data gathering and cleaning; once it is in a suitable format (such that it is easy to create a structured CSV file), the training part is probably the easiest: just a few lines of Python code.

| kthejoker2 wrote:
| https://images.app.goo.gl/ZrvQDrMtKxbnMo2C9

| devaler wrote:
| And, arguably, data cleaning is the most overlooked part.

| crehn wrote:
| From a purely UX perspective, there's a huge difference between "no lines of code" and "a few lines of code".

| TheRealPomax wrote:
| All parts of a learning pipeline are hard if you want to do it right. Gathering, weeding, and binning your data is meticulous and hard work, and while "a single run" is trivial, _rerunning_ it over and over with new parameters or even a completely different model because the outcome made no sense whatsoever is not.
|
| If updating a YAML file and hitting "run" makes that other "hardest part of learning" easier: hurray!

| nidhaloff wrote:
| I agree. That's why some commonly used pre-processing methods were implemented in the stable release... and more is yet to come.

| howmayiannoyyou wrote:
| Terrific!
|
| Keep pursuing this and ignore critics. What you're doing is important b/c ML is just out of reach of a big percentage of developers and technical lay people. It will take time to get your approach right, but it will make a difference.
|
| As a suggestion - provide more real-world examples (e.g. business, sports, etc.) so that users can tinker with your samples as a pathway toward learning.
|
| Please don't give up on this. Great job.

| nidhaloff wrote:
| Hi, thanks a lot. I received positive interactions on GitHub from the community; however, your comment is the first encouraging feedback I've got here :D so I appreciate it.
|
| I will take your suggestion into consideration.
You are right, there should be more real-world examples that will help users get started and see how this can be useful.
|
| The thing is, I started the project two weeks ago, so it's still relatively new. I've been coding day and night because the idea got me excited. I published the first stable release this week. However, there are new features that will be implemented in the next releases.

| khimaros wrote:
| In case it isn't on your radar, there is also https://github.com/uber/ludwig which seems to have similar goals.

| nidhaloff wrote:
| Someone posted this tool earlier in the comments too. I was surprised, since I had never heard of it, and find it great!
|
| However, I think it is only for building deep learning models and does not have any general ML support, or am I missing something? If so, then that fact makes it very different from igel as a tool.

| nurettin wrote:
| What you're doing is creating a declarative syntax for applying machine learning tasks directly to data. This makes it learnable by machines, effectively teaching them how to do their own machine learning experiments. I think this project is greater than the sum of its parts.

| musingsole wrote:
| If the project is only 2 weeks old, all the more reason to ignore any critics. Particularly here, where people are likely to criticize a baby in the crib for not working on coding projects outside of naptime.

| reagent_finder wrote:
| Well, I mean that baby doesn't have a functional colon yet, so putting semicolons everywhere just makes perfect sense.

| crehn wrote:
| Totally agree. I sense ML in general can benefit tremendously from buttery-smooth UX, something it has typically lagged behind on.
|
| Keep it up Nidhal, you're doing a tremendous service. Don't let the snobs get to you.

| gavinray wrote:
| Agreed here. I am not involved in the ML space but have briefly toyed with PyTorch/TF/Sklearn.
|
| I see the value in having a CSV data dump and going "I wonder what happens if I run it through X," then a CLI command to find out.
|
| Would be neat if there were an adapter for SQLite too, IMO.

| rmbeard wrote:
| Combining it with bash and psql + csvkit + xsv will give you a powerful combination for data ingestion, wrangling and training all on the command line. This would seem to have clear benefits for fast development and prototyping.

| master_yoda_1 wrote:
| This is going too far; looks like nobody understands machine learning at hacker rank.

| henvic wrote:
| Well, it's probably almost the truth.

| lioeters wrote:
| Off topic, but I recently found myself using the phrase "pretty much exactly". I realized it's nonsense, because the first part contradicts the meaning of "exact". I vowed to never use that phrase again.
|
| I feel the same about "probably almost the truth" (not a criticism, just a thought) - unless truth is a range (100% true to 100% false) rather than a binary (either true or false).

| codetrotter wrote:
| Truth is a range though, for the vast majority of things.

| jkmcf wrote:
| This is the giant's shoulders I like to stand on!
|
| Need to see more projects abstracting away the hard stuff (I'm looking at you, GUI libraries!)

| nidhaloff wrote:
| Thanks for your feedback. Stay tuned, we are working on an integrated GUI tool written in Python too.

| foolfoolz wrote:
| i think long term this is the future of ML. it's like a database: every engineer needs to know when to use one and how. not every engineer needs to be able to write a database.

| toxik wrote:
| Coincidentally, poor understanding of your tools (especially databases) seems like a huge source of frustration and pain for anyone involved in software development.
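The "run the CSV through X and see what happens" experiment described above, hand-rolled, is roughly the following sketch. The file name, column names, and synthetic data are invented for illustration, and scikit-learn is assumed to be installed; igel's YAML config would stand in for this boilerplate.

```python
import csv
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Write a tiny synthetic CSV so the sketch is self-contained;
# in practice this would be an existing data dump.
random.seed(0)
with open("dump.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["age", "salary", "churned"])
    for _ in range(200):
        age = random.randint(20, 65)
        salary = random.randint(30_000, 120_000)
        churned = int(age < 30 and salary < 60_000)
        writer.writerow([age, salary, churned])

# "Run it through X": load the CSV, fit a model, score it.
with open("dump.csv") as f:
    rows = list(csv.DictReader(f))
X = [[float(r["age"]), float(r["salary"])] for r in rows]
y = [int(r["churned"]) for r in rows]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(f"test accuracy: {acc:.2f}")
```

Swapping `RandomForestClassifier` for another estimator is the "try the 6th algorithm" step the thread keeps coming back to; a config-driven tool just moves that swap into a text file.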
| foolfoolz wrote:
| you can have a poor understanding of your db and still build a billion dollar company

| toxik wrote:
| You _can_ win the lottery too; that statement asserts exactly nothing.

| IdiocyInAction wrote:
| This is already how most ML in production works, though. No one writes their own NN, optimizer or even linear regression, and for good reason.

| mcint wrote:
| Thank you for sharing!
|
| I was thinking about starting something like this, and had reached out to datasette [1] / Simon Willison for advice on starting and maintaining a project.
|
| [1]: https://simonwillison.net/2017/Nov/13/datasette/

| st1x7 wrote:
| "Automate everything" is a disingenuous claim. You're simply replacing a couple of lines of scikit-learn with a couple of lines of your CLI tool. There is pretty much no benefit to using this.

| nidhaloff wrote:
| Well, the user is writing a description in a human-readable format. Then the tool takes that description and runs the pipeline, from data reading and preprocessing to creating and evaluating the model. If this isn't automation, please define automation for me.
|
| Also, there are new features that I'm working on. The stable release was done this week.

| hobofan wrote:
| From what I can tell, it's a declarative framework (while most other common ones are imperative). And as generally with the tradeoffs of a declarative approach, if the data/model is easy, the baked-in assumptions require less input, while if it's more complex, your config files will be equally verbose and/or you will run into a wall. I don't see much automation there, just abstraction.

| nidhaloff wrote:
| Interesting opinion. Well, I must disagree on some points. First, yes, sure, the tool uses the declarative paradigm, which is the goal of the project. If you want to use ML without writing code, then you will certainly not want an imperative framework.
|
| Second, I must disagree that most other common frameworks are imperative. I would say it's a mix of declarative & imperative, but certainly not purely imperative.
|
| Finally, it's interesting how you see this as just an abstraction tool. I find other ML frameworks are more about abstraction, since you are focusing on building your model while all the details are hidden from you by the framework. Sure, igel is also about abstraction, but to say it's JUST abstraction? Hmm, I find that not quite right; instead, it's more about automating the stuff that you would otherwise write yourself using other frameworks.
|
| At the end of the day, we all have different opinions, and feedback is important ;)

| TheRealPomax wrote:
| There's most definitely a benefit - turning "writing the code yourself" into "updating a config file" is huge: I can now write code _that creates config files_, which is stupidly easy, instead of having to write code that writes code, which is stupidly hard.
|
| The title is a complete misnomer, but the project itself is perfectly useful. As a programmer, any programming I _don't_ have to do is time and money saved.

| tpetry wrote:
| To be really automatic, the only thing I should need to do is feed in a CSV, correct the suggested data types, and then run all algorithms on the data, with a report at the end of which has been the most effective and which I should further optimize.

| fractionalhare wrote:
| Sure, but shotgunning every statistical test and machine learning algorithm completely undermines your results, because the power and significance levels are not adjusted for many experiments. In the statistical setting this leads to spurious correlations, and in the machine learning setting it leads to overfitting. In either case the results have a high risk of not generalizing beyond the initial sample being analyzed.
|
| I'm not saying you're endorsing this, but it's basically antithetical to sound experimental design.
I don't think the author should pursue automatic anything when it comes to statistics, unless it's just a thin quality-of-life wrapper around other statistical primitives and libraries.

| tpetry wrote:
| There are, for example, multiple decision tree learners or rule learners. Each one has different semantics and works differently on the data. Just running every one and seeing which performs best is a completely normal approach.
|
| And with k-fold cross validation it's very hard to have overfitting.

| streetcat1 wrote:
| So over-fitting can be solved by:
|
| 1) Using cross validation/a validation set. 2) Regularisation. 3) Finding statistically significant features (e.g. chi square).
|
| Why can't this be done automatically? I.e., what is the human advantage?

| fractionalhare wrote:
| It's not that you intrinsically need a human, it's that doing this without human oversight requires being very careful not to make tricky mistakes.
|
| The nature of statistical significance (which underpins everything you've said) is that repeating many experiments reduces the confidence you should have in your results. Supposing each algorithm is an experiment and each experiment is independent, if you target a significance level of p = 0.05, you can expect to find 1 correlated feature out of every 20 you test just by chance.
|
| Can you automatically correct for this? Sure. But this is just one possible footgun. Are you confident you're avoiding them all? In theory automation could do an even better job than a human of avoiding the myriad statistical mistakes you could make, but in practice that requires significant upfront effort and expertise during the development process.
|
| At a certain point doing this automatically becomes analogous to rolling your own crypto. It's not quite an adversarial problem, but it's quite easy to screw up.
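The arithmetic behind that "1 out of every 20 just by chance" intuition is easy to check with the standard library alone; the Bonferroni correction shown at the end is one standard fix, not something the tool under discussion does:

```python
# With 20 independent tests at significance level p = 0.05,
# the expected number of false positives is 20 * 0.05 = 1.
alpha, n_tests = 0.05, 20
expected_false_positives = alpha * n_tests

# Probability of at least one false positive across all 20 tests.
p_any_false_positive = 1 - (1 - alpha) ** n_tests

# A Bonferroni correction shrinks the per-test threshold to compensate.
bonferroni_alpha = alpha / n_tests

print(f"expected false positives: {expected_false_positives:.1f}")  # 1.0
print(f"P(>=1 false positive):    {p_any_false_positive:.2f}")      # 0.64
print(f"Bonferroni threshold:     {bonferroni_alpha:.4f}")          # 0.0025
```

So even when no feature is truly correlated, a 20-algorithm shotgun has roughly a 64% chance of producing at least one "significant" result, which is exactly the footgun being described.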
|
| I agree that cross validating would work; that's what I was gesturing at when I was talking about making an assessment of the data and partitioning it. Either the provided sample should be partitioned for cross validation, or it should prompt the user for a second set.

| streetcat1 wrote:
| Correct. My point is that both humans and machines face the same issues. At least with a machine you get consistent errors (which do not cost you time), which you can decrease over time.
|
| With humans, you must make sure that the same human with the same skill set, who knows stats at a master's level, will always be there for your specific data and actually have the time to do the experiments.
|
| Also, I think that 95% of the users/consumers of machine learning are non-consumers - i.e., they do not have ANY access to any machine learning tech, and thus need to revert to guessing.
|
| So the ethical thing to do is actually give them some tool, even if it might not be optimal.

| nidhaloff wrote:
| This is a great feature! Thanks for the feedback.

| fractionalhare wrote:
| No, it's emphatically _not_ a great feature, and it's not clear to me the commenter was recommending that so much as making a nit. Please don't automate the process of choosing and running algorithms on a single sample of data; it's unsound experimental design that undermines your results. If you insist on doing it anyway, at minimum you will need to automate an initial assessment of the sample data to determine if it has a suitable size and distribution to allow you to adjust the significance of results for the number of tests you're running, and to partition the data into smaller subsamples.

| nidhaloff wrote:
| Hi, thanks for your comment. I actually understood that he meant something like a hyperparameter search/tuning using cross validation (at least that's what came to mind).

| fractionalhare wrote:
| Cross validation would be good!
I think if you build this in, you could automatically run a few heuristics to see if the data can be partitioned, or maybe just prompt the user for another sample of the data with the same distribution.

| tpetry wrote:
| Parameter tuning and algorithm selection! I just don't want to manually start 5 different runs of algorithms that I believe could work well on the data and manually compare the results. And maybe I was too lazy to run the 6th algorithm, which now performs much better.
|
| But to be sure, every test should be done with k-fold cross validation. Whether to split the training set should not be a choice left to the user. It's crucial that this is a must!

| mjgs wrote:
| Great idea - I've been waiting for a good CLI tool for machine learning. It saves the hassle of having to learn Python, and you can also use it with other existing shell tools.

| mk_chan wrote:
| I find it difficult to believe anyone who can use the models listed on the repository effectively would have any difficulty using scikit themselves.
|
| Abstracting scikit out into a configuration file only very slightly simplifies the actual code involved, but I can see this being useful for some non-technical users who don't care about the code and just know the ML terms.

| nidhaloff wrote:
| It's not that someone will have difficulty using sklearn. It's more about how clean the approach is if you have all your configs in a YAML file and can change things very easily/quickly and rerun an experiment. I work with data & ML models every day, and it became overwhelming when my codebase was large and I wanted to change small things and re-run an experiment. Also, it would be great to not lose much time writing that code in the first place (although it's easy to do) if you want a quick and dirty draft. The thing is, it is much cleaner if you have your preprocessing methods and model definition in one file.
However, there are other features that will be integrated soon, like a simple GUI built in Python.

| iamflimflam1 wrote:
| This is a good point; something that I've been struggling with in my own personal projects is keeping track of parameters as I tweak and play with hyper-parameters and model structures.
|
| A few parameters are fine - you can pull them out into constants - but you quickly end up with a lot of variables to keep track of.

| fakedang wrote:
| I remember how I first got interested in ML and DL. I did not know a lick of ML programming in Python or whatever other language was out there. I simply began by using Matlab's Neural Network and Machine Learning toolboxes and playing around with them. That turned into a real coding interest in Matlab, which carried over to Python, and so on and so forth. In a sense, I rediscovered programming because of those toolboxes.
|
| What you're doing is great stuff, and I hope it encourages a lot of folks to play around just as I did, just to get started.

| alphachloride wrote:
| I think it can be a useful tool for automation of very standardized ML tasks. However:
|
| It's a command line tool that is also intended for non-technical folks. I sense a contradiction.
|
| That doesn't even speak to the requirement of understanding all these ML algorithms so I can specify them in the config file, or understanding the YAML format, or data curation. At this point it would be easier to write the Python code - especially with scikit-learn, which is a very well-documented library.

| nidhaloff wrote:
| Hi, I want to clear up some points. First, it is not intended for non-technical folks; this was never claimed! However, even if it were, we are currently working on a GUI, which (non-technical) users can launch by writing a simple command in the terminal.
|
| Second, I'm a technical user; in fact this is my daily work, and we built this tool for reasons that are mentioned in the docs/readme, so you can check them out.
|
| Third, you mentioned understanding the YAML format. Really? YAML is about the most understandable format there is. I can never imagine a person not being able to learn YAML in 30 minutes at most.
|
| Finally, yes, sklearn is great and well documented, but have you checked how many libraries are out there that are basically wrappers to make it easier/more abstract to write sklearn code? You'll be surprised.
|
| As discussed in the official repo & docs, it is a much cleaner approach to gather your preprocessing & model definition parameters/configs in one human-readable file/place, where you can manipulate it easily, re-run experiments, generate drafts, and build proofs of concept as fast as possible, than to write code. At the end of the day, we all have different opinions; you can still write code, of course. The tools are there to help.

| alphachloride wrote:
| I am only going off the README which, as the other user pointed out, addresses technical _and_ non-technical people.
|
| So yes, this tool can have great utility. It adds an abstraction layer and removes busywork for repetitive programming tasks. However, the utility will be for users acquainted with the command line. Users who know what a config file is, or the data types, lists, and key-value relationships assumed by the YAML spec. Users will also have to know the different algorithms so they can populate the config. All of these things require technical knowledge.
|
| All of the above are things that us technical users take for granted, so a claim to cater to non-technical users must be evaluated from their perspective.
|
| I am not belittling your work - this is a good project, but it is currently targeting an audience too broad.
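For reference, the "one human-readable file" being debated looks roughly like the sketch below. The exact keys and supported values here are assumptions modeled on the project's README at the time, so consult igel's docs for the real schema:

```yaml
# igel.yaml - hypothetical example config
dataset:
  type: csv
  preprocess:
    missing_values: mean     # impute missing cells with the column mean
    scale:
      method: standard       # standardize numeric columns

model:
  type: classification
  algorithm: RandomForest
  arguments:
    n_estimators: 100

target:
  - churned
```

Whether editing a file like this counts as "non-technical" is exactly the disagreement in this subthread: there is no code, but the user still needs to know YAML syntax and what `RandomForest` and `n_estimators` mean.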
| Peritract wrote:
| The README says "The goal of the project is to provide machine learning for everyone, both technical and non technical users"; that definitely sounds as though it's intended for non-technical users.

| nidhaloff wrote:
| Well, " _both_ technical and non technical users", right?

| asimjalis wrote:
| What does "IGEL" stand for? I couldn't find it in the documentation.

| nidhaloff wrote:
| It's a German word and means hedgehog.
|
| It's funny - we were discussing a name for the project and wanted to make an abbreviation from some words that make sense, so we started throwing out ideas spontaneously. In the end we wanted to make an abbreviation of these words: "Init, Generate, Evaluate Machine Learning".
|
| IGEL made sense for us then, since it's a German word too. Easy to say, type and remember ;)

| phil294 wrote:
| Not sure if there is any deeper meaning hidden behind it, but it is the German word for hedgehog (pronounced like "eagle").

| asimjalis wrote:
| Interesting. That makes sense, since the logo is a hedgehog. What's the connection though, I wonder.

| hashmush wrote:
| Interesting - igel is Swedish for leech (Egel in German). The Swedish word for hedgehog is instead igelkott. In short, Egel = igel and Igel != igel... TIL

| JshWright wrote:
| I'm more familiar with a different "i-gel" (which does a similar thing for emergency airway management as this does for ML, allowing less-trained users to still achieve "advanced" results)
|
| https://www.intersurgical.com/info/igel

| eyeball wrote:
| https://pycaret.org/

| zatel wrote:
| This is so cool!
|
| I know the answer is to just write what I'm describing myself, but does anyone know of an existing way to find the best scikit-learn algorithm for a particular problem? Like, if I want to find the best regression fit, is there a way to just pass in the data and have it trained and tested on all of the regression algorithms in sklearn?
| My current workflow is to just pick a handful of algorithms that sound like they should be good for the problem at hand and try each of them manually. Igel seems like a step towards making this sort of thing possible, if another tool doesn't exist already.

| nidhaloff wrote:
| Hi, we should be careful with the feature you are talking about. The results from fitting every machine learning algorithm can be very misleading, and some models will probably overfit the data.
|
| So if you throw some data in, fit all machine learning models on it, and then compare the performance, you will probably get misleading values, since different models require different tuning approaches. It's not as easy as you make it sound; you can't just feed data (it also depends on the data) to models and expect to get the best model at the output.
|
| One approach I can think of here is to integrate cross validation and hyperparameter tuning with your suggestion. However, I can imagine that this could be computationally expensive. I will take it into consideration as an enhancement for the tool. Thanks for your feedback.

| craftinator wrote:
| Hey, I really appreciate your answer to this question. As I was reading the question, red flags started popping up in my mind about the risk of overfitting when using the ensemble approach, and I think your response was spot on for how an ML researcher would go about it! Most ML professionals I've talked to have been really against making a user-friendly ML suite because of how easy it is to misuse these algorithms.

| dint wrote:
| Triage is built for this: training and evaluating giant grids of models & hyperparameters using cross-validation. Similar to igel, it abstracts ML to config files and a CLI.
|
| It's designed for use in a public policy context, so it works best with:
|
| - binary classification problems (e.g. the evaluation module is designed for binary classification metrics)
| - problems that have a temporal component (the cross validation system makes some assumptions about this)
|
| https://dssg.github.io/triage/

| rabscuttler wrote:
| I think you're looking for something like AutoML by H2O [0]. There are a few similar offerings out there if you search around 'automl'.
|
| [0] https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html

| [deleted]

| sthatipamala wrote:
| Awesome! How do you compare this to https://github.com/uber/ludwig, which also has a YAML-based CLI for ML?

| nidhaloff wrote:
| Wow! This is great! I didn't know that such a tool exists; thanks for posting it here. The Python & AI community is moving really fast, it's crazy!
|
| However, it looks like the ludwig tool is about deep learning and not classical ML, or am I wrong? It looks like there is no support for ML models, or am I missing something?
|
| I didn't try it yet - I just read the getting started section - but it looks really great for training deep neural networks.

| kamhh94 wrote:
| "Non-technical" and "CLI tool" sound like an oxymoron. But if you hide the YAML config behind a UI, I guess it can pass for "non-technical".

| nidhaloff wrote:
| We're already working on a GUI that users can launch using a command. You can check the issues list.

| TheRealPomax wrote:
| Make it (double) clicking an icon on a desktop/app list, and you have a winner. The moment a terminal is needed, you've lost the non-technical crowd (and some of the technical crowd, even).

| desilinguist wrote:
| Great idea! We had a similar idea back in 2014 with SKLL [1]. We are still actively maintaining it, and it's definitely been helpful to many folks outside our organization over the years! Wishing you the best!
| | [1] https://github.com/EducationalTestingService/skll | joshspankit wrote: | Who else thought this was something that would turn AI loose on | your bash commands, and automate everything _in_ your CLI? | _frkl wrote: | If you are disappointed, there's mcfly, which does things with | your shell history and ML: https://github.com/cantino/mcfly :-) | tommica wrote: | This is really cool! ___________________________________________________________________ (page generated 2020-10-03 23:00 UTC)