[HN Gopher] Show HN: Igel - A CLI tool to run machine learning w...
       ___________________________________________________________________
        
       Show HN: Igel - A CLI tool to run machine learning without writing
       code
        
       Author : nidhaloff
       Score  : 286 points
       Date   : 2020-10-03 12:23 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | djhaskin987 wrote:
       | So a lot like weka then. https://www.cs.waikato.ac.nz/ml/weka/
        
       | iamflimflam1 wrote:
       | Keep going with this, I think you are onto something.
        
       | mxscho wrote:
       | > A machine learning tool that allows you to train/fit, test and
       | use models without writing code
       | 
       | I recently had a discussion about the requirements that a text
       | file format (like YAML) has to fulfill to be considered "code".
       | :)
        
         | nidhaloff wrote:
          | Hi, and what was the result/conclusion of the discussion? I'm
          | interested in your findings: is it considered code or not :D
        
           | mxscho wrote:
            | Well, we came to the conclusion that there is no hard border
            | and therefore no good answer to that question.
           | 
            | But we also agreed that whether it's code, a graphical user
            | interface, or a command line interface is not the most
            | important factor in making a tool usable for a lay person.
            | What's more important is that the entry point is easy, and
            | that the complexity and flexibility are abstracted away in
            | layers that do not have to be fully understood from the
            | beginning, so that the learning curve is not too steep.
           | 
            | Of course, my first post was not meant to be criticism of
            | the project, just some pseudo-philosophical thoughts that
            | crossed my mind when reading that sentence. Sorry for being
            | too off topic with that. :)
        
       | jeroenjanssens wrote:
       | Reminds me of SKLL:
       | https://github.com/EducationalTestingService/skll
        
       | marcinzm wrote:
        | I feel like the places where a non-technical user would be
        | building non-trivial models also have the money to pay for one of
        | those commercial drag-and-drop GUI tools.
        
       | ericpts wrote:
        | Usually the hardest part of a learning pipeline is data gathering
        | and cleaning; once the data is in a suitable format (such that it
        | is easy to create a structured CSV file), training is probably
        | the easiest step: just a few lines of Python code.
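        | 
        | For illustration, a minimal scikit-learn sketch of that training
        | step (file and column names here are made up):
        | 
        |     import pandas as pd
        |     from sklearn.ensemble import RandomForestClassifier
        |     from sklearn.model_selection import train_test_split
        | 
        |     # assumes a cleaned CSV with a "label" target column
        |     df = pd.read_csv("data.csv")
        |     X, y = df.drop(columns=["label"]), df["label"]
        |     X_tr, X_te, y_tr, y_te = train_test_split(
        |         X, y, test_size=0.2)
        |     model = RandomForestClassifier().fit(X_tr, y_tr)
        |     print(model.score(X_te, y_te))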
        
         | kthejoker2 wrote:
         | https://images.app.goo.gl/ZrvQDrMtKxbnMo2C9
        
         | devaler wrote:
         | And, arguably, data cleaning is the most overlooked part.
        
         | crehn wrote:
         | From a purely UX perspective, there's a huge difference between
         | "no lines of code" and "a few lines of code".
        
         | TheRealPomax wrote:
         | All parts of a learning pipeline are hard if you want to do it
         | right. Gathering, weeding, and binning your data is meticulous
         | and hard work, and while "a single run" is trivial, _rerunning_
         | it over and over with new parameters or even a completely
         | different model because the outcome made no sense whatsoever is
         | not.
         | 
         | If updating a YAML file and hitting "run" makes that other
         | "hardest part of learning" easier: hurray!
        
         | nidhaloff wrote:
          | I agree. That's why some commonly used pre-processing methods
          | were implemented in the stable release, and more is yet to
          | come.
        
       | howmayiannoyyou wrote:
       | Terrific!
       | 
       | Keep pursuing this and ignore critics. What you're doing is
       | important b/c ML is just out of reach of a big percentage of
       | developers and technical lay people. It will take time to get
       | your approach right, but it will make a difference.
       | 
        | As a suggestion - provide more real-world examples (e.g.
        | business, sports, etc.) so that users can tinker with your
        | samples as a pathway toward learning.
       | 
       | Please don't give up on this. Great job.
        
         | nidhaloff wrote:
          | Hi, thanks a lot. I received positive interactions on GitHub
          | from the community; however, your comment is the first
          | encouraging feedback I've got here :D so I appreciate it.
         | 
         | I will take your suggestion into consideration. You are right,
         | there should be more real-world examples that will help users
         | get started and see how this can be useful.
         | 
          | The thing is, I started the project two weeks ago, so it's
          | still relatively new. I've been coding day and night because
          | the idea got me excited. I published the first stable release
          | this week. However, there are new features that will be
          | implemented in the next releases.
        
           | khimaros wrote:
           | In case it isn't on your radar, there is also
           | https://github.com/uber/ludwig which seems to have similar
           | goals.
        
             | nidhaloff wrote:
              | Someone posted this tool earlier in the comments too. I was
              | surprised, since I had never heard of it, and it looks
              | great!
              | 
              | However, I think it is only for building deep learning
              | models and does not have any general ML support, or am I
              | missing something? If so, that fact makes it very different
              | from igel as a tool.
        
           | nurettin wrote:
           | What you're doing is creating a declarative syntax for
           | applying machine learning tasks directly to data. This makes
           | it learnable by machines, effectively teaching them how to do
           | their own machine learning experiments. I think this project
           | is greater than the sum of its parts.
        
           | musingsole wrote:
           | If the project is only 2 weeks old, all the more reason to
           | ignore any critics. Particularly here where people are likely
           | to criticize a baby in the crib for not working on coding
           | projects outside of naptime.
        
             | reagent_finder wrote:
             | Well, I mean that baby doesn't have a functional colon yet
             | so putting semicolons everywhere just makes perfect sense.
        
         | crehn wrote:
         | Totally agree. I sense ML in general can benefit tremendously
         | from buttery-smooth UX, something it has typically lagged
         | behind on.
         | 
         | Keep it up Nidhal, you're doing a tremendous service. Don't let
         | the snobs get to you.
        
         | gavinray wrote:
         | Agreed here. I am not involved in the ML space but have briefly
         | toyed with PyTorch/TF/Sklearn.
         | 
          | I see the value in having a CSV data dump, going "I wonder
          | what happens if I run it through X?", and then running a CLI
          | command to find out.
         | 
         | Would be neat if there was an adapter for SQLite too IMO.
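          | 
          | Until such an adapter exists, a small pandas shim would do the
          | trick: dump a SQLite table to CSV and feed that to the tool
          | (database, table, and file names here are hypothetical):
          | 
          |     import sqlite3
          |     import pandas as pd
          | 
          |     # export a SQLite table to the CSV format the CLI expects
          |     conn = sqlite3.connect("data.db")
          |     df = pd.read_sql_query("SELECT * FROM readings", conn)
          |     df.to_csv("readings.csv", index=False)
          |     conn.close()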
        
           | rmbeard wrote:
            | Combining it with bash and psql + csvkit + xsv will give you
            | a powerful combination for data ingestion, wrangling, and
            | training, all on the command line. This would seem to have
            | clear benefits for fast development and prototyping.
        
       | master_yoda_1 wrote:
        | This is going too far. It looks like nobody understands machine
        | learning at hacker rank.
        
         | henvic wrote:
         | Well, it's probably almost the truth.
        
           | lioeters wrote:
           | Off topic, but I recently found myself using the phrase
           | "pretty much exactly". I realized it's nonsense, because the
           | first part contradicts the meaning of "exact". I vowed to
           | never use that phrase again.
           | 
           | I feel the same about "probably almost the truth" (not a
           | criticism, just a thought) - unless truth is a range (100%
           | true to 100% false) rather than a binary (either true or
           | false).
        
             | codetrotter wrote:
             | Truth is a range though, for the vast majority of things.
        
       | jkmcf wrote:
       | This is the giant's shoulders I like to stand on!
       | 
       | Need to see more projects abstracting away the hard stuff (I'm
       | looking at you, GUI libraries!)
        
         | nidhaloff wrote:
          | Thanks for your feedback. Stay tuned, we are working on an
          | integrated GUI tool written in Python too.
        
       | foolfoolz wrote:
        | I think long term this is the future of ML. It's like a
        | database: every engineer needs to know when to use one and how.
        | Not every engineer needs to be able to write a database.
        
         | toxik wrote:
         | Coincidentally, poor understanding of your tools (especially
         | databases) seems like a huge source of frustration and pain for
         | anyone involved in software development.
        
           | foolfoolz wrote:
            | You can have a poor understanding of your DB and still build
            | a billion-dollar company.
        
             | toxik wrote:
              | You _can_ win the lottery too; that statement asserts
              | exactly nothing.
        
         | IdiocyInAction wrote:
          | This is already how most ML in production works, though. No one
          | writes their own NN, optimizer, or even linear regression, and
          | for good reasons.
        
       | mcint wrote:
       | Thank you for sharing!
       | 
       | I was thinking about starting something like this, and had
       | reached out to datasette [1] / Simon Willison for advice on
       | starting and maintaining a project.
       | 
       | [1]: https://simonwillison.net/2017/Nov/13/datasette/
        
       | st1x7 wrote:
       | "Automate everything" is a disingenuous claim. You're simply
       | replacing a couple of lines of scikit-learn with a couple of
       | lines of your CLI tool. There is pretty much no benefit to using
       | this.
        
         | nidhaloff wrote:
          | Well, the user is writing a description in a human-readable
          | format. Then the tool takes that description and runs the
          | pipeline, from data reading and preprocessing to creating and
          | evaluating the model. If this isn't automation, please define
          | automation for me.
          | 
          | Also, there are new features that I'm working on. The stable
          | release was done this week.
        
           | hobofan wrote:
            | From what I can tell, it's a declarative framework (while
            | most other common ones are imperative). And, as generally
            | with the tradeoffs of a declarative approach, if the
            | data/model is easy, the baked-in assumptions require less
            | input, while if it's more complex, your config files will be
            | equally verbose and/or you will run into a wall. I don't see
            | much automation there, just abstraction.
        
             | nidhaloff wrote:
              | Interesting opinion, but I must disagree on some points.
              | First, yes, the tool uses a declarative paradigm, which is
              | the goal of the project. If you want to use ML without
              | writing code, then you will certainly not want an
              | imperative framework.
              | 
              | Second, I must disagree that most other common frameworks
              | are imperative. I would say they are a mix of declarative
              | and imperative, but certainly not purely imperative.
              | 
              | Finally, it's interesting that you see this as just an
              | abstraction tool. I find other ML frameworks are more
              | about abstraction, since you are focusing on building your
              | model while all the details are hidden from you by the
              | framework. Sure, igel is also about abstraction, but to
              | say it's JUST abstraction doesn't seem quite right;
              | instead, it's more about automating the stuff that you
              | would otherwise write yourself using other frameworks.
              | 
              | At the end of the day, we all have different opinions, and
              | feedback is important ;)
        
         | TheRealPomax wrote:
         | There's most definitely a benefit - turning "writing the code
         | yourself" into "updating a config file" is huge: I can now
          | write code _that creates config files_, which is stupidly
         | easy, instead of having to write code that writes code, which
         | is stupidly hard.
         | 
         | The title is a complete misnomer, but the project itself is
         | perfectly useful. As a programmer, any programming I _don't_
         | have to do is time and money saved.
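          | 
          | For instance, a sketch of writing config files from code - a
          | hypothetical sweep that emits one YAML file per parameter
          | value (the config keys are invented for illustration, not
          | necessarily igel's actual schema):
          | 
          |     import yaml  # pip install pyyaml
          | 
          |     # one hypothetical config file per max_depth value
          |     for depth in (2, 4, 8):
          |         cfg = {"model": {"algorithm": "DecisionTree",
          |                          "arguments": {"max_depth": depth}}}
          |         with open(f"config_depth_{depth}.yaml", "w") as f:
          |             yaml.safe_dump(cfg, f)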
        
       | tpetry wrote:
        | To be really automatic, the only thing I should need to do is
        | feed in a CSV, correct the suggested data types, and then run
        | all algorithms on the data, with a report at the end of which
        | has been the most effective and which I should further optimize.
        
         | fractionalhare wrote:
         | Sure, but shotgunning every statistical test and machine
         | learning algorithm completely undermines your results, because
         | the power and significance levels are not adjusted for many
         | experiments. In the statistical setting this leads to spurious
         | correlations, and in the machine learning setting it leads to
         | overfitting. In either case the results have a high risk of not
         | generalizing beyond the initial sample being analyzed.
         | 
         | I'm not saying you're endorsing this, but it's basically
         | antithetical to sound experimental design. I don't think the
         | author should pursue automatic anything when it comes to
         | statistics, unless it's just a thin quality-of-life wrapper
         | around other statistical primitives and libraries.
        
           | tpetry wrote:
            | There are, for example, multiple decision tree learners and
            | rule learners. Each has different semantics and works
            | differently on the data. Just running every one and seeing
            | which performs best is a completely normal approach.
            | 
            | And with k-fold cross validation it's very hard to overfit.
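            | 
            | A minimal sketch of that "run every one with k-fold CV"
            | loop in scikit-learn (the estimator list is arbitrary):
            | 
            |     from sklearn.datasets import load_iris
            |     from sklearn.ensemble import RandomForestClassifier
            |     from sklearn.linear_model import LogisticRegression
            |     from sklearn.model_selection import cross_val_score
            |     from sklearn.tree import DecisionTreeClassifier
            | 
            |     X, y = load_iris(return_X_y=True)
            |     for est in (DecisionTreeClassifier(),
            |                 RandomForestClassifier(),
            |                 LogisticRegression(max_iter=1000)):
            |         # 5-fold cross validation for each candidate model
            |         scores = cross_val_score(est, X, y, cv=5)
            |         print(type(est).__name__, scores.mean())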
        
           | streetcat1 wrote:
            | So over-fitting can be solved by:
            | 
            | 1) Using cross validation / a validation set.
            | 2) Regularisation.
            | 3) Finding statistically significant features (e.g. chi
            | square).
            | 
            | Why can't this be done automatically? I.e., what is the
            | human advantage?
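            | 
            | For (3), scikit-learn can already do this automatically; a
            | minimal sketch (chi2 requires non-negative features):
            | 
            |     from sklearn.datasets import load_iris
            |     from sklearn.feature_selection import SelectKBest
            |     from sklearn.feature_selection import chi2
            | 
            |     X, y = load_iris(return_X_y=True)
            |     # keep the 2 features most associated with the target
            |     X_best = SelectKBest(chi2, k=2).fit_transform(X, y)
            |     print(X_best.shape)  # (150, 2)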
        
             | fractionalhare wrote:
             | It's not that you intrinsically need a human, it's that
             | doing this without human oversight requires being very
             | careful not to make tricky mistakes.
             | 
             | The nature of statistical significance (which underpins
             | everything you've said), is that repeating many experiments
             | reduces the confidence you should have in your results.
             | Supposing each algorithm is an experiment and each
             | experiment is independent, if you target a significance
             | level of p = 0.05, you can expect to find 1 correlated
             | feature out of every 20 you test just by chance.
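              | 
              | A quick simulation of that footgun, plus the standard
              | Bonferroni correction (everything here is pure noise, so
              | any "hit" is spurious by construction):
              | 
              |     import numpy as np
              |     from scipy.stats import pearsonr
              | 
              |     rng = np.random.default_rng(0)
              |     y = rng.normal(size=200)
              |     # 20 noise features; expect ~1 below p = 0.05 anyway
              |     pvals = [pearsonr(rng.normal(size=200), y)[1]
              |              for _ in range(20)]
              |     print(sum(p < 0.05 for p in pvals))       # often >= 1
              |     print(sum(p < 0.05 / 20 for p in pvals))  # usually 0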
             | 
             | Can you automatically correct for this? Sure. But this is
             | just one possible footgun. Are you confident you're
             | avoiding them all? In theory automation could do an even
             | better job than a human of avoiding the myriad statistical
             | mistakes you could make, but in practice that requires
             | significant upfront effort and expertise during the
             | development process.
             | 
             | At a certain point doing this automatically becomes
             | analogous to rolling your own crypto. It's not quite an
             | adversarial problem, but it's quite easy to screw up.
             | 
             | I agree that cross validating would work; that's what I was
             | gesturing to when I was talking about making an assessment
             | of the data and partitioning it. Either the provided sample
             | should be partitioned for cross validation, or it should
             | prompt the user for a second set.
        
               | streetcat1 wrote:
                | Correct. My point is that both humans and machines face
                | the same issues. At least with a machine you get
                | consistent errors (which do not cost you time) that you
                | can decrease over time.
               | 
               | With humans you must make sure that the same human with
               | the same skill set who knows stats at Master Level, will
               | always be there for your specific data and actually have
               | the time to do the experiments.
               | 
                | Also, I think that 95% of the potential users/consumers
                | of machine learning are non-consumers - i.e., they do
                | not have ANY access to machine learning tech and thus
                | have to resort to guessing.
                | 
                | So the ethical thing to do is to actually give them some
                | tool, even if it might not be optimal.
        
         | nidhaloff wrote:
          | This is a great feature! Thanks for the feedback.
        
           | fractionalhare wrote:
            | No, it's emphatically _not_ a great feature, and it's not
           | clear to me the commenter was recommending that so much as
           | making a nit. Please don't automate the process of choosing
           | and running algorithms on a single sample of data, it's
           | unsound experimental design that undermines your results. If
           | you insist on doing it anyway, at minimum you will need to
           | automate an initial assessment of the sample data to
           | determine if it has a suitable size and distribution to allow
           | you to adjust the significance of results for the number of
           | tests you're running, and partition the data into smaller
           | subsamples.
        
             | nidhaloff wrote:
              | Hi, thanks for your comment. I actually understood that he
              | meant something like a hyperparameter search/tuning using
              | cross validation (at least that's what came to my mind).
        
               | fractionalhare wrote:
               | Cross validation would be good! I think if you build this
               | in you could automatically run a few heuristics to see if
               | the data can be partitioned, or maybe just prompt the
               | user for another sample of the data with the same
               | distribution.
        
               | tpetry wrote:
                | Parameter tuning and algorithm selection! I just don't
                | want to manually start 5 different runs of algorithms
                | that I believe could work well on the data and manually
                | compare the results. And maybe I was too lazy to run the
                | 6th algorithm, which now performs much better.
                | 
                | But to be sure, every test should be done with k-fold
                | cross validation. The decision whether to split the
                | training set should not be left to the user; it's
                | crucial that this is a must!
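                | 
                | A sketch of that combination - algorithm/parameter
                | search with k-fold CV enforced by construction - via
                | scikit-learn's GridSearchCV:
                | 
                |     from sklearn.datasets import load_iris
                |     from sklearn.model_selection import GridSearchCV
                |     from sklearn.svm import SVC
                | 
                |     X, y = load_iris(return_X_y=True)
                |     # cv=5: every candidate is 5-fold cross validated
                |     grid = GridSearchCV(SVC(), {"C": [1, 10]}, cv=5)
                |     grid.fit(X, y)
                |     print(grid.best_params_, grid.best_score_)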
        
       | mjgs wrote:
        | Great idea - I've been waiting for a good CLI tool for machine
        | learning. It saves the hassle of having to learn Python, and it
        | can also be used with other existing shell tools.
        
       | mk_chan wrote:
       | I find it difficult to believe anyone who can use the models
       | listed on the repository effectively would have any difficulty
       | using scikit themselves.
       | 
        | Abstracting scikit out into a configuration file only very
        | slightly simplifies the actual code involved, but I can see this
        | being useful for some non-technical users who don't care about
        | the code and just know the ML terms.
        
         | nidhaloff wrote:
          | It's not that someone will have difficulty using sklearn. It's
          | more about how clean the approach is when you have all your
          | configs in a YAML file and can change things very easily and
          | quickly and rerun an experiment. I work with data & ML models
          | every day, and it becomes overwhelming when my codebase is
          | large and I want to change small things and re-run an
          | experiment. It would also be great not to lose much time
          | writing that code in the first place (although it's easy to
          | do) if you want a quick and dirty draft. The thing is, it is
          | much cleaner to have your preprocessing methods and model
          | definition in one file. However, there are other features that
          | will be integrated soon, like a simple GUI built in Python.
           | iamflimflam1 wrote:
           | This is a good point, something that I've been struggling
           | with in my own personal projects is keeping track of
           | parameters as I tweak and play with hyper-parameters and
           | model structures.
           | 
           | A few parameters are fine, you can pull them out into
           | constants, but you quickly end up with a lot of variables to
           | keep track of.
        
       | fakedang wrote:
        | I remember how I first got interested in ML and DL. I did not
        | know a lick of ML programming in Python or whatever other
        | language was out there. I simply began by using Matlab's Neural
        | Network and Machine Learning toolboxes and playing around with
        | them. That turned into a real coding interest in Matlab, which
        | carried over to Python, and so on and so forth. In a sense, I
        | rediscovered programming because of those toolboxes.
       | 
       | What you're doing is great stuff and I hope it encourages a lot
       | of folks to play around just as I had, just to get started.
        
       | alphachloride wrote:
       | I think it can be a useful tool for automation of very
       | standardized ML tasks. However:
       | 
       | It's a command line tool that is also intended for non-technical
       | folks. I sense a contradiction.
       | 
        | That doesn't even speak to the requirement of understanding all
        | these ML algorithms so I can specify them in the config file, or
        | understanding the YAML format, or data curation. At this point
        | it would be easier to write the Python code - especially with
        | scikit-learn, which is a very well-documented library.
        
         | nidhaloff wrote:
          | Hi, I want to clear up some points. First, it is not intended
          | for non-technical folks; this was never claimed! However, even
          | if it were, we are currently working on a GUI that
          | (non-technical) users can launch by typing a simple command in
          | the terminal.
          | 
          | Second, I'm a technical user; in fact, this is my daily work,
          | and we built this tool for reasons that were mentioned in the
          | docs/readme, so you can check it out.
          | 
          | Third, you mentioned understanding the YAML format. Really? I
          | mean, YAML is about the most understandable format there is. I
          | can't imagine that a person couldn't learn YAML in 30 minutes
          | at most.
          | 
          | Finally, yes, sklearn is great and well documented, but have
          | you checked how many libraries are out there that are
          | basically wrappers to make it easier and more abstract to
          | write sklearn code? You'll be surprised.
          | 
          | As discussed in the official repo & docs, it is a much cleaner
          | approach to gather your preprocessing & model definition
          | parameters/configs in one human-readable place, where you can
          | manipulate them easily, re-run experiments, generate drafts,
          | and build proofs of concept as fast as possible, than it is to
          | write code. At the end of the day, we all have different
          | opinions; you can still write code, of course. The tools are
          | there to help.
        
           | alphachloride wrote:
            | I am only going off the README, as the other user pointed
            | out, which addresses technical _and_ non-technical people.
            | 
            | So yes, this tool can have great utility. It adds an
            | abstraction layer and removes busywork for repetitive
            | programming tasks. However, the utility will be for users
            | acquainted with the command line: users who know what a
            | config file is, and the data types, lists, and key-value
            | relationships assumed by the YAML spec. Users will also have
            | to know the different algorithms so they can populate the
            | config. All of these things require technical knowledge.
            | 
            | All of the above are things we technical users take for
            | granted, so a claim to cater to non-technical users must be
            | evaluated from their perspective.
            | 
            | I am not belittling your work - this is a good project, but
            | it currently targets too broad an audience.
        
           | Peritract wrote:
           | The README says "The goal of the project is to provide
           | machine learning for everyone, both technical and non
           | technical users"; that definitely sounds as though it's
           | intended for non-technical users.
        
             | nidhaloff wrote:
             | Well, " __both __technical and non technical users " right?
        
       | asimjalis wrote:
       | What does "IGEL" stand for? I couldn't find it in the
       | documentation.
        
         | nidhaloff wrote:
          | It's a German word and means hedgehog.
          | 
          | It's funny: we were discussing a name for the project and
          | wanted to make an abbreviation from words that made sense, so
          | we started throwing out ideas spontaneously. In the end we
          | wanted to make an abbreviation of these words: "Init,
          | Generate, Evaluate Machine Learning".
          | 
          | IGEL made sense for us then, since it's a German word too.
          | Easy to say, type, and remember ;)
        
         | phil294 wrote:
          | Not sure if there is any deeper meaning hidden behind it, but
          | it is the German word for hedgehog (pronounced like "Eagle").
        
           | asimjalis wrote:
            | Interesting. That makes sense since the logo is a hedgehog.
            | What's the connection though, I wonder?
        
           | hashmush wrote:
           | Interesting, igel is Swedish for leech (Egel in German). The
           | Swedish word for hedgehog is instead igelkott. In short, Egel
           | = igel and Igel != igel... TIL
        
       | JshWright wrote:
       | I'm more familiar with a different "i-gel" (which does a similar
       | thing for emergency airway management as this does for ML,
       | allowing less trained users to still achieve "advanced" results)
       | 
       | https://www.intersurgical.com/info/igel
        
       | eyeball wrote:
       | https://pycaret.org/
        
       | zatel wrote:
       | This is so cool!
       | 
        | I know the answer is to just write what I'm describing myself,
        | but does anyone know of an existing way to find the best
        | scikit-learn algorithm for a particular problem? Like, if I want
        | to find the regression fit, is there a way to just pass in the
        | data and have it trained and tested on all of the regression
        | algorithms in sklearn? My current workflow is to pick a handful
        | of algorithms that sound like they should be good for the
        | problem at hand and try each one of them manually. Igel seems
        | like a step toward making this sort of thing possible, if
        | another tool doesn't exist already.
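        | 
        | In the meantime, a crude version of this is possible with
        | scikit-learn itself; a sketch that loops over every built-in
        | regressor (some need extra arguments, hence the try/except):
        | 
        |     from sklearn.datasets import load_diabetes
        |     from sklearn.model_selection import cross_val_score
        |     from sklearn.utils import all_estimators
        | 
        |     X, y = load_diabetes(return_X_y=True)
        |     for name, Reg in all_estimators(type_filter="regressor"):
        |         try:
        |             score = cross_val_score(Reg(), X, y, cv=5).mean()
        |             print(f"{name}: {score:.3f}")
        |         except Exception:
        |             pass  # skip estimators that need special setup
        | 
        | (The AutoML tools mentioned elsewhere in the thread do this far
        | more carefully.)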
        
         | nidhaloff wrote:
          | Hi, we should be careful with the feature you are talking
          | about. The results from running all machine learning
          | algorithms can be very misleading, and some models will
          | probably overfit the data.
          | 
          | So, if you throw in some data, fit all machine learning models
          | on it, and then compare the performance, you will probably get
          | misleading values, since different models require different
          | tuning approaches. It's not as easy as you make it sound; you
          | can't just feed data (it also depends on the data) to models
          | and expect to get the best model as the output.
         | 
         | One approach I can think of here is to integrate cross
         | validation and hyperparameter tuning with your suggestion.
         | However, I can imagine that this can be computationally
         | expensive. I will take it into consideration as an enhancement
          | for the tool. Thanks for your feedback.
        
           | craftinator wrote:
           | Hey, I really appreciate your answer to this question. As I
           | was reading the question, red flags started popping up in my
           | mind about the risk of overfitting when using the ensemble
           | approach, and I think your response was spot on for how an ML
           | researcher would go about it! Most ML professionals I've
           | talked to have been really against making a user friendly ML
           | suite because of how easy it is to misuse these algorithms.
        
         | dint wrote:
          | Triage is built for this: training and evaluating giant grids
          | of models & hyperparameters using cross-validation. Similar to
          | igel, it abstracts ML to config files and a CLI.
          | 
          | It's designed for use in a public policy context, so it works
          | best with:
          | 
          | - binary classification problems (e.g., the evaluation module
          |   is designed for binary classification metrics)
          | 
          | - problems that have a temporal component (the cross
          |   validation system makes some assumptions about this)
         | 
         | https://dssg.github.io/triage/
        
         | rabscuttler wrote:
          | I think you're looking for something like AutoML by H2O[0].
          | There are a few similar offerings out there if you search
          | around 'automl'.
         | 
         | [0] https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
        
       | [deleted]
        
       | sthatipamala wrote:
       | Awesome! How do you compare this to
       | https://github.com/uber/ludwig, which also has a YAML-based cli
       | for ML?
        
         | nidhaloff wrote:
          | Wow, this is great! I didn't know that such a tool existed;
          | thanks for posting it here. The Python & AI communities are
          | moving really fast, it's crazy!
          | 
          | However, it looks like the ludwig tool is about deep learning
          | and not general ML, or am I wrong? It looks like there is no
          | support for classical ML models, or am I missing something?
          | 
          | I haven't tried it yet, I just read the getting started
          | section, but it looks really great for training deep neural
          | networks.
        
       | kamhh94 wrote:
       | "non-technical" and "cli tool" sound like an oxymoron. But if you
       | hide the yaml config behind a UI i guess it can pass for "non
       | technical".
        
         | nidhaloff wrote:
          | I'm already working on a GUI that users can launch using a
          | command. You can check the issues list.
        
           | TheRealPomax wrote:
            | Make it (double-)clicking an icon on a desktop/app list, and
            | you have a winner. The moment a terminal is needed, you've
            | lost the non-technical crowd (and even some of the technical
            | crowd).
        
       | desilinguist wrote:
       | Great idea! We had a similar idea back in 2014 with SKLL[1]. We
       | are still actively maintaining it and it's definitely been
       | helpful to many folks outside our organization over the years!
       | Wishing you the best!
       | 
       | [1] https://github.com/EducationalTestingService/skll
        
       | joshspankit wrote:
       | Who else thought this was something that would turn AI loose on
       | your bash commands, and automate everything _in_ your CLI?
        
         | _frkl wrote:
         | If you are disappointed, there's mcfly, which does things with
         | your shell history and ML: https://github.com/cantino/mcfly :-)
        
       | tommica wrote:
       | This is really cool!
        
       ___________________________________________________________________
       (page generated 2020-10-03 23:00 UTC)