[HN Gopher] Gradient Free Optimizers: A collection of modern opt...
       ___________________________________________________________________
        
       Gradient Free Optimizers: A collection of modern optimization
       methods in Python
        
       Author : EntICOnc
       Score  : 176 points
       Date   : 2021-02-28 13:10 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | freemint wrote:
       | If you have an expensive but not high dimensional problem you
        | might want to try https://github.com/DLR-SC/sqpdfo . It is based
        | on Kriging and SQP, and also handles constraints.
        
       | klon wrote:
        | I am surprised CMA-ES is not included. It is the state-of-the-art
        | gradient-free optimization algorithm, last I checked.
       | 
       | https://en.wikipedia.org/wiki/CMA-ES
        
         | simonblanke wrote:
          | Evolution Strategy is included. The "covariance matrix
          | adaptation" part is what makes that algorithm work for
          | continuous search spaces, but Gradient-Free-Optimizers uses a
          | discrete search space.
        
       | morelandjs wrote:
       | This looks great! You have a lot of algorithms here. Which ones
       | do you find to be the most promising for general use and why?
        
         | simonblanke wrote:
          | I think Random Restart Hill Climbing is good if your objective
          | function evaluates fast (simple functions). The random restarts
          | keep it from getting permanently stuck in local optima, and it
          | does local search very well.
          | 
          | Bayesian Optimization is very good if your objective function
          | takes some time (> 1 second) to evaluate. If you look at the
          | gifs in the readme of Gradient-Free-Optimizers you will see how
          | fast it finds promising solutions. It is almost scary how good
          | it is (even compared to other SMBO methods).
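          | 
          | For a quick impression, here is a minimal sketch of the API
          | (based on the readme; a dict of numpy arrays defines the
          | search space, and higher scores are better):
          | 
          |     import numpy as np
          |     from gradient_free_optimizers import (
          |         RandomRestartHillClimbingOptimizer,
          |     )
          | 
          |     # score to maximize: a negated sphere function
          |     def objective(para):
          |         return -(para["x"] ** 2 + para["y"] ** 2)
          | 
          |     # each dimension is a discrete grid of candidates
          |     search_space = {
          |         "x": np.arange(-5, 5, 0.1),
          |         "y": np.arange(-5, 5, 0.1),
          |     }
          | 
          |     opt = RandomRestartHillClimbingOptimizer(search_space)
          |     opt.search(objective, n_iter=1000)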
        
           | morelandjs wrote:
           | Thanks, that's useful. My impression of Bayesian parameter
           | optimization using Gaussian processes is that it is quite
           | good when the data has a more or less constant correlation
           | length across the evaluation points as in your example.
           | 
            | When there are large correlation lengths in some regions of
            | the dataset and small correlation lengths elsewhere, it often
            | overshoots (or undershoots) the curvature of the
            | hypersurface.
        
           | protoplaid wrote:
           | Which algorithm would you recommend when the objective
           | function is noisy (and nondeterministic)?
           | 
           | For example the objective function is the "score" of a
           | particular stochastic simulation, which can be started with
           | varied initial random seed, or the result of a real physical
           | experiment, which is naturally stochastic (and expensive to
           | evaluate).
           | 
           | There is a tradeoff between getting a very accurate
           | estimation of the objective function + variance _of a single
           | point_ vs exploring other points. Is there a search algorithm
           | that somehow manages this tradeoff automatically?
           | 
            | Note: In the past I've used Tree-structured Parzen Estimators
            | (kernel density estimators), wasting 3-4 evaluations per
            | point, but I have a feeling it is sub-optimal. Is there an
            | "optimal" algorithm, like the _optimal algorithm_ for the
            | multi-armed bandit problem [1] (which is a similar problem)?
           | 
           | [1] https://en.wikipedia.org/wiki/Multi-armed_bandit
        
             | morelandjs wrote:
             | I'm wondering the same. I'd be concerned that Random
             | Restart Hill climbing would essentially chase random noise.
        
               | simonblanke wrote:
                | You could be right. I must confess that I have a
                | (probably) very narrow understanding of typical
                | optimization problems. Most of the objective functions I
                | optimize have machine learning algorithms in them (to
                | optimize hyperparameters). Depending on the evaluation,
                | those can have low noise.
                | 
                | If you like, you could provide other use cases and
                | applications for optimization algorithms.
        
             | plaidfuji wrote:
             | Bayesian Optim is designed for that case specifically. It
             | fits a surrogate model with uncertainty estimates and picks
             | the next point with an understanding of that uncertainty.
             | Look up the MEI acquisition function for more info.
             | 
              | Edit: BO does usually require some tuning for your use
              | case. Its acquisition function sometimes samples replicates
              | where there's high noise, especially if the first sample
              | looks particularly "good". There's usually a hyperparameter
              | that can be set to favor exploration vs. exploitation, i.e.
              | to favor non-replicate samples. But I am not aware of an
              | algo that can learn your preference along that axis.
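              | 
              | For intuition, expected improvement with an exploration
              | knob typically looks something like this (an illustrative
              | sketch, not this library's code; xi is the
              | exploration/exploitation hyperparameter):
              | 
              |     import numpy as np
              |     from scipy.stats import norm
              | 
              |     def expected_improvement(mu, sigma, f_best, xi=0.01):
              |         # larger xi demands more improvement before a
              |         # point looks attractive, favoring exploration
              |         imp = mu - f_best - xi
              |         z = np.divide(
              |             imp, sigma,
              |             out=np.zeros_like(imp),
              |             where=sigma > 0,
              |         )
              |         return np.where(
              |             sigma > 0,
              |             imp * norm.cdf(z) + sigma * norm.pdf(z),
              |             0.0,
              |         )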
        
               | protoplaid wrote:
                | Correct me if I'm wrong, but it seems the
                | bayesian_optimization.py optimizer in this library
                | assumes that the sampled points are exact, i.e. their
                | variance is zero. It doesn't seem to re-sample existing
                | points.
               | 
               | This will cause the algorithm to "chase random noise", as
               | morelandjs wrote below
        
               | plaidfuji wrote:
               | I would be very disappointed if that were the case.. no,
               | it looks like it's set up to capture variance. The BO
               | algo wraps an "Expected Improvement Optimizer":
               | 
               | https://github.com/SimonBlanke/Gradient-Free-
               | Optimizers/blob...
               | 
               | Which selects new points based on both the model's mean
               | estimate and its variance. See around line 58
        
               | protoplaid wrote:
               | line 62: exp_imp[sigma == 0.0] = 0.0
               | 
                | I'm afraid it never samples points more than once, since
                | it estimates already-sampled points as having variance
                | zero, and therefore no expected improvement.
                | 
                | IMHO that's wrong. The variance of a single _sample_
                | should be infinite (classical statistics), or similar to
                | the variance of nearby points (Bayesian + model), or some
                | pre-defined prior (not a great idea... I'd prefer some
                | automatic method). But not zero.
        
               | plaidfuji wrote:
                | Ah, good catch. So in the event the GPR predicts zero
                | variance, the optimizer says EI is zero and thus won't
                | sample again. That may depend on the settings of the
                | GPR... if I'm not mistaken, there are ways for a GPR to
                | model noise and not collapse to zero variance at every
                | sampled point.
               | 
               | Anyway, I guess I stand by my original suggestion that BO
               | is the best tool for gradient free optim with slow and
               | noisy fevals, but to my knowledge nobody has built a way
               | to dial in the hyper parameters automatically. And there
               | are quite a few. Entire companies exist for this purpose,
               | SigOpt comes to mind..
        
               | morelandjs wrote:
               | I haven't looked at his source, but usually there's a
               | white noise term in the covariance structure of the
               | Gaussian process regressor that adds some statistical
               | uncertainty at each evaluation point. So even when it
               | evaluates a point of parameter space the GPR is still
               | somewhat uncertain about the value of the optimization
               | function at that point. So it should balance exploration
               | versus exploitation taking that statistical uncertainty
               | into account.
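                | 
                | In sklearn this looks roughly like the following (a
                | sketch; the WhiteKernel term is the noise model):
                | 
                |     from sklearn.gaussian_process import (
                |         GaussianProcessRegressor,
                |     )
                |     from sklearn.gaussian_process.kernels import (
                |         RBF,
                |         WhiteKernel,
                |     )
                | 
                |     # WhiteKernel lets the GPR attribute part of the
                |     # signal to noise, so predicted variance at
                |     # sampled points stays above zero
                |     kernel = RBF(length_scale=1.0) + WhiteKernel(
                |         noise_level=1e-2
                |     )
                |     gpr = GaussianProcessRegressor(
                |         kernel=kernel,
                |         n_restarts_optimizer=5,
                |     )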
        
               | protoplaid wrote:
               | > especially if the first sample looks particularly
               | "good".
               | 
                | You've precisely described the problem: the algorithm
                | will get stuck on a point if the _first_ sample looks
                | good, because of the assumption of zero variance. It
                | stays there until it randomly hits a luckier sample (but
                | not necessarily a better point).
                | 
                | Another related problem is that the boundaries of the
                | parameter space can have a bad score (objective function)
                | but _very low variance_ (they're _always_ bad), which
                | misleads the search function into believing that the
                | interior points also have a very low variance, which is
                | incorrect.
                | 
                | If anyone knows of a library that handles those cases
                | correctly, without requiring user-defined priors for each
                | dimension, I'd be glad to hear about it.
        
         | optimalsolver wrote:
          | Not OP, but you can't go wrong with randomized/stochastic
          | hill-climbing. It's actually very competitive with complex,
          | fancy genetic algorithms:
         | 
         | https://core.ac.uk/download/pdf/37767541.pdf
        
       | uyt wrote:
       | For those who want to learn more, I really like the book
       | "Algorithms for Optimization":
       | https://algorithmsbook.com/optimization/#outline
       | 
        | It's a great intro because it covers a large breadth and gives
        | you a sense of which classes of techniques are useful for which
        | problems (rather than just going super deep into one particular
        | technique).
       | 
       | They also have a github in julia notebooks implementing most of
       | the ideas: https://github.com/sisl/algforopt-notebooks
        
       | yaroslavvb wrote:
        | SPSA seems to be missing. OpenAI's "evolution strategies" paper
        | used a form of SPSA.
       | https://en.wikipedia.org/wiki/Simultaneous_perturbation_stoc...
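        | 
        | The core of SPSA is a two-evaluation gradient estimate along a
        | single random perturbation direction. A minimal sketch for
        | minimization (illustrative names; real SPSA also decays the a
        | and c gains over time):
        | 
        |     import numpy as np
        | 
        |     def spsa_step(theta, f, a=0.01, c=0.01, rng=None):
        |         rng = rng or np.random.default_rng()
        |         # random +/-1 perturbation of all coordinates at once
        |         delta = rng.choice([-1.0, 1.0], size=theta.shape)
        |         # two evaluations estimate the whole gradient
        |         g_hat = (
        |             f(theta + c * delta) - f(theta - c * delta)
        |         ) / (2 * c * delta)
        |         return theta - a * g_hat  # descent-style update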
        
       | simonblanke wrote:
        | Also: Gradient-Free-Optimizers is basically just the
        | optimization backend for a much larger project of mine:
       | https://github.com/SimonBlanke/Hyperactive
        
       | amatic wrote:
       | Amazing collection! I'm learning about backprop in neural
       | networks and I'm reading how that kind of optimization may not be
       | biologically 'implementable'. Are non-gradient based optimizers
       | an alternative proposal for biological learning? How do they
       | compare to backprop?
        
       | MauranKilom wrote:
       | I'd recommend also implementing the DIRECT algorithm, which
       | balances local and global search very well (but consequently will
       | not refine as aggressively near the current [possibly only local]
       | optimum as other algorithms):
       | 
       | https://www.researchgate.net/profile/Donald-Jones-5/publicat...
       | 
       | You probably also want to include some advantages/disadvantages
       | of each algorithm. How robust against local minima is it? Up to
       | how many dimensions does it work well? How is the convergence
       | speed when started far from the optimum? Does it work well with
       | few function evaluations? Etc.
        
         | simonblanke wrote:
         | I will look into this algorithm. Thanks for the suggestion. I
         | have some basic explanations of the optimization techniques and
         | their parameters in a separate repository:
         | https://github.com/SimonBlanke/optimization-tutorial
         | 
         | But there is still a lot of work to be done.
        
       | asah wrote:
        | Given the rise of multi-core machines, parallel optimization
        | seems pretty important. Which of these support parallel
        | execution?
       | 
        | For this reason, I chose Mango for a recent project:
        | https://github.com/ARM-software/mango (a Bayesian optimizer with
        | built-in support for both parallel optimization, via
        | multiprocessing, and distributed execution, via Celery).
        
         | simonblanke wrote:
         | Gradient-Free-Optimizers is a lightweight optimization package
         | that serves as a backend for Hyperactive:
         | https://github.com/SimonBlanke/Hyperactive
         | 
          | Hyperactive can do parallel computing with multiprocessing,
          | joblib, or a custom wrapper function, and parallel computing
          | works with all of its optimization algorithms. You could even
          | pass a Gaussian process regressor from GPflow to the Bayesian
          | optimizer (in GFO and Hyperactive) to get GPU acceleration.
        
           | wtallis wrote:
           | Consider adding some affordance for parallel computing to
           | Gradient-Free-Optimizers by allowing the user to provide a
           | vectorized objective function instead of one that evaluates
           | only a single point in the search space per function call.
           | That leaves all the hard work of parallelization as an
           | exercise for the user, and gives the user the flexibility to
           | parallelize their objective function with whatever mechanism
           | they wish.
           | 
           | I have previously used this approach in a project where the
           | objective function contained a half-hour long simulation,
           | which was the bottleneck that made estimating a gradient
           | intractable. When the optimization algorithm gave a batch of
           | several points in the search space to evaluate, our objective
           | function could prepare and run several instances of the
           | simulation in parallel, and return when the whole batch was
           | complete. From this, it was easy for us to also distribute
           | simulation runs across several machines, without needing any
           | changes to the optimizers. We would not have been able to
           | easily achieve this with an optimization framework that tried
           | to directly manage parallelization, because the steps
           | necessary to prepare the input files for the simulation
           | software had to be done serially.
           | 
           | For that project we tried: DIRECT, several variants of
           | Nelder-Mead, and an evolutionary strategy. In hindsight, the
           | Nelder-Mead variants worked best; once we accumulated enough
           | simulation results it became clear that our objective
           | function was convex and pretty well-behaved in the region of
           | interest. Nelder-Mead was also trivial to extend to trying
           | several extra points per batch to ensure that each of our
           | several workstations had something to work on. (We didn't
           | have access to a large cluster, and Nelder-Mead wouldn't
           | generalize well to a large degree of parallelization in that
           | manner.)
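            | 
            | In case it helps, the interface was essentially this (a
            | sketch; run_simulation stands in for the half-hour
            | simulation):
            | 
            |     from multiprocessing import Pool
            | 
            |     def objective_batch(points):
            |         # points: a batch of parameter dicts proposed by
            |         # the optimizer; evaluate in parallel, return
            |         # scores in the same order
            |         with Pool() as pool:
            |             return pool.map(run_simulation, points)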
        
             | simonblanke wrote:
              | Your parallel computing approach sounds intriguing! Could
              | you provide an example script? I would like to look into
              | this. If you like, you could open an issue as a feature
              | request and provide a code snippet there.
        
       | dijereedan wrote:
       | What about genetic optimization?
        
         | simonblanke wrote:
         | Gradient-Free-Optimizers has a type of genetic algorithm:
         | Evolution Strategy
        
       | stilley2 wrote:
        | Cool! I'd consider adding CMA-ES [1], which as far as I know is
        | the gold standard for derivative-free optimization.
       | 
       | 1. https://en.m.wikipedia.org/wiki/CMA-ES
        
         | stilley2 wrote:
         | Ah nevermind I see this was addressed in another comment.
        
       | morelandjs wrote:
       | Getting a lot of these errors when using the GPR. Seems to be an
       | issue with the estimation of the noise level term. Scratching my
       | head on this one. Anyone familiar with what's going on? I tried
       | adding some random noise to the optimization function, and it had
       | the same issue.
       | 
       | ...python3.8/site-
       | packages/sklearn/gaussian_process/kernels.py:402:
       | ConvergenceWarning: The optimal value found for dimension 0 of
       | parameter k2__noise_level is close to the specified lower bound
       | 1e-05. Decreasing the bound and calling fit again may find a
       | better value.
        
         | simonblanke wrote:
          | Yeah, this is a warning from sklearn's Gaussian process
          | regressor. Sklearn probably wants you to increase the
          | "n_restarts_optimizer" parameter of the GPR, but in my
          | experience this warning does not correlate with bad results
          | from Gradient-Free-Optimizers.
          | 
          | Sklearn warnings are often difficult to silence. Just google
          | "sklearn suppress warnings" to get a code snippet for this
          | problem.
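          | 
          | For example, something like this usually works:
          | 
          |     import warnings
          |     from sklearn.exceptions import ConvergenceWarning
          | 
          |     # suppress sklearn's GPR convergence warnings
          |     warnings.filterwarnings(
          |         "ignore", category=ConvergenceWarning
          |     )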
        
           | morelandjs wrote:
           | thanks!
        
       | mik09 wrote:
        | Nice repo. I saw some comments expressing excitement and thought
        | of the first time I saw pymoo, tpot, and finally automl. There is
        | a lot of stuff lying around GitHub; we just need to search
        | better, e.g. using topics (tags) and the gazer repo, which kinda
        | page-ranks GitHub repos.
        
         | simonblanke wrote:
          | I had the same excitement for tpot! That project was the reason
          | I started creating mine.
        
       | plaidfuji wrote:
       | Be still my heart... could this be... the missing gradient-free
       | optim library for Python? Particle swarm, evolutionary, AND
       | Bayesian all in a single package? A no-nonsense, simple API with
       | strong enough defaults that you can basically just call .search()
       | with a function (I'm looking at you, nevergrad)??
       | 
       | I'll have to test it of course but this looks truly life
       | changing. A few clarifications, if the author happens to be
       | watching this thread:
       | 
        | - looks like the search space is "enumerated", i.e. I define the
        | grid spacing for each independent variable in search_space and it
        | puts together an outer product? In other words, this will never
        | be truly continuous?
       | 
       | - by the same token, constraints are applied by virtue of
       | specifying the endpoints of each grid in the search_space? Is it
       | possible for me to apply more complex constraints by generating
       | my own enumerated list of search_space points and feeding that
       | directly?
       | 
       | Seriously well done. Like I said above, this looks to me like an
       | immediate staple library.
        
         | optimalsolver wrote:
         | There's also Opytimizer [0] for almost every metaheuristic
         | optimization algorithm under the Sun.
         | 
         | [0] https://github.com/gugarosa/opytimizer
        
         | simonblanke wrote:
          | I am glad you like my project. The search space is not
          | continuous, but you can make the steps as small as you want,
          | e.g. np.arange(0, 10, 0.00001).
          | 
          | I am not sure I understand your second question. The entire
          | search space is always an N-dimensional cube. So you just
          | define a start point, an end point, and the step size in each
          | dimension.
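          | 
          | For example (following the readme; a fine grid approximates a
          | continuous dimension):
          | 
          |     import numpy as np
          | 
          |     # this single dimension already enumerates
          |     # 1,000,000 candidate values
          |     search_space = {
          |         "x": np.arange(0, 10, 0.00001),
          |     }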
        
           | plaidfuji wrote:
           | Nope, that answers the question. I was asking about cases
           | where the space is, for example, a triangle or something non-
           | cuboidal. Think constraints like x1 + x2 < 1 or something
           | like that.
           | 
           | What I'm wondering is if your library converts the
           | search_space to an enumerated list of candidate points
           | somewhere in the backend, and if there's a way for me to just
           | construct that list and pass it directly for more complex
           | cases.
        
           | zmk_ wrote:
            | They probably mean that you can search in n-dimensional
            | cubes, but what they'd like is a more general shape for the
            | grid, e.g., a restriction of the sort a+b<1.
        
             | dwiel wrote:
              | Another option in some cases is to just map the space to
              | something else. In this case, for example, x and y are
              | unconstrained and are what the optimizer sees, with a = x
              | and b = (1 - a) - abs(y).
              | 
              | Not always easy to do, but it can work in some cases if the
              | optimizer can't deal with the constraint natively.
        
             | simonblanke wrote:
              | Okay, that is interesting. You could realize that by
              | creating a restricted area in the search space: return
              | np.nan from the objective function for those cases.
              | Gradient-Free-Optimizers can handle np.nan and np.inf just
              | fine.
              | 
              | Maybe you could do something like:
              | 
              |     if a + b >= 1:
              |         return np.nan
              |     else:
              |         return score
        
               | ialyos wrote:
                | This is, at a high level, how one of our Google-internal
                | black-box optimizers behaves. The term used is
                | "infeasible region", but it's the same idea as using a
                | nan.
                | 
                | I would caution against using nan to always mean
                | infeasible. Instead, users should catch experiments
                | outside the feasible region and return a special
                | infeasible value. This will increase visibility into the
                | behavior of the optimizer, because it leaves nan to be
                | used for values inside the feasible region that are still
                | problematic (due to bugs, numerical instability, etc.).
        
               | unnah wrote:
                | If the volume of the feasible space is small, the
                | optimizer may have considerable trouble finding the
                | feasible region(s) in the domain. The simple solution is
                | to use a penalty function instead, i.e. add a term like
                | M*max(0, (a+b)-1) to the objective you are minimizing
                | (or subtract it if maximizing), with a sufficiently
                | large M. If the optimum is at the boundary of the
                | feasible region, you might still get a solution that is
                | slightly infeasible, which can be worked around with
                | more special-case logic...
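                | 
                | A sketch of the idea for the a+b<1 case above
                | (true_score is a stand-in for the real objective; the
                | sign assumes you are maximizing):
                | 
                |     # penalty weight, large enough to dominate
                |     M = 1e6
                | 
                |     def penalized_objective(para):
                |         # how far the constraint is violated
                |         violation = max(
                |             0.0, para["a"] + para["b"] - 1.0
                |         )
                |         return true_score(para) - M * violation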
        
               | [deleted]
        
               | abhgh wrote:
                | This would be like the barrier function [1] for
                | gradient-based methods, which produces some challenges
                | for those techniques. Do you know how this affects
                | gradient-free techniques?
               | 
               | [1] https://en.m.wikipedia.org/wiki/Barrier_function
        
       | aldanor wrote:
       | Looks very nice!
       | 
        | Two questions off the top of my head:
       | 
        | - Might be worth mentioning how Hyperactive compares to the tons
        | of other similar packages, like Hyperopt and Optunity? Also how
        | your GFO package differs from Optuna, opytimizer, and many
        | others. Having some benchmarks or performance stats (or even
        | just listing functionality differences) would be very helpful to
        | anyone who's already been using those.
       | 
        | - One of the best universal
        | 'works-out-of-the-box-most-of-the-time' global optimizers I've
        | used recently is DLib's:
        | http://blog.dlib.net/2017/12/a-global-optimization-algorithm... -
        | any chance of implementing something similar?
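        | 
        | For reference, the dlib call is about as minimal as it gets (a
        | rough sketch based on the blog post; check the docs for exact
        | signatures):
        | 
        |     import dlib
        | 
        |     # objective takes scalar arguments, one per dimension
        |     def objective(x0, x1):
        |         return (x0 - 0.5) ** 2 + (x1 - 0.25) ** 2
        | 
        |     # global minimum within bounds, 100 evaluations
        |     x, y = dlib.find_min_global(
        |         objective, [-10.0, -10.0], [10.0, 10.0], 100
        |     )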
        
         | simonblanke wrote:
          | Two very interesting questions! I should soon work on a
          | comparison of Hyperactive and GFO to other packages. If some
          | important features are missing, maybe I could add them.
          | 
          | I will also look into Dlib. If you like, you can open an
          | official issue in my repository. We could discuss why it is
          | useful and whether it should be implemented.
        
       | twic wrote:
        | For constraint programming solvers, there is an annual
        | competition to see which is best:
       | 
       | https://www.minizinc.org/challenge.html
       | 
       | Is there anything like this for gradient-free optimisers?
        
         | simonblanke wrote:
         | I thought about a table in the readme that shows some kind of
         | metric for each optimizer that describes its performance. I
         | will look into that.
        
       | twic wrote:
       | I love gradient-free optimisers - there are so many interesting
       | problems where there aren't analytical gradients available.
       | 
        | Am I right in thinking this library doesn't contain a simplex /
        | Nelder-Mead optimiser? To me that's the bread and butter of
        | gradient-free optimisation.
       | 
       | Also no Powell algorithms like BOBYQA?
       | 
       | nlopt has these and many more, and a Python wrapper:
       | 
       | https://nlopt.readthedocs.io/en/latest/NLopt_Algorithms/
       | 
       | So, are those omitted because this library is modern and those
       | are now considered out of date?
        
         | bloaf wrote:
         | Came here to ask about the Powell method. I find Powell is
         | typically what engineers come up with if asked to optimize
         | something "by hand" without a mathematical model of the system
         | they are optimizing.
         | 
         | https://en.wikipedia.org/wiki/Powell%27s_method
        
         | simonblanke wrote:
         | I am currently working on the Nelder-Mead optimiser. I will
         | also look into "BOBYQA". I am always searching for interesting
         | algorithms that my users need. If you have more suggestions you
         | can open an issue in the repository.
        
           | sleavey wrote:
           | I've used this, and it works nicely:
           | https://github.com/numericalalgorithmsgroup/pybobyqa. I'd be
           | happy if it were added to your project, then I could just use
           | yours and have access to a bunch of alternatives with the
           | same API.
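            | 
            | Usage is pleasantly minimal; roughly (a sketch from memory,
            | check the Py-BOBYQA docs for exact signatures):
            | 
            |     import numpy as np
            |     import pybobyqa
            | 
            |     # classic Rosenbrock test function (minimized)
            |     def rosenbrock(x):
            |         return (100.0 * (x[1] - x[0] ** 2) ** 2
            |                 + (1.0 - x[0]) ** 2)
            | 
            |     soln = pybobyqa.solve(
            |         rosenbrock, np.array([-1.2, 1.0])
            |     )
            |     print(soln.x)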
        
           | twic wrote:
            | That's good to hear. I don't have any other specific
            | requests - for me it seems impossible to know what will work
            | on my problems, so I just try things at random to see what
            | works!
            | 
            | I don't know the architecture of your project, but is there
            | any way it could include NLopt, and so make all of its
            | algorithms available? Some of them could be useful, and I
            | doubt it's a good use of your time to reimplement them all
            | from scratch.
        
       | optimalsolver wrote:
       | I hope to see gradient-free/black box optimization algorithms
       | gain more popularity in machine learning. I'm always surprised
       | when I see a review of ML optimization methods, and it's not even
       | mentioned that you can optimize an ML model without using the
       | gradient.
       | 
       | Also, love or hate Python, there's almost always a library for
       | whatever it is you want to do.
        
         | xyzzyz wrote:
         | > and it's not even mentioned that you can optimize an ML model
         | without using the gradient.
         | 
          | Can you, though? How do the outcomes compare to gradient-based
          | methods?
        
           | cgearhart wrote:
           | In general GFO is not a good alternative when there are lots
           | of model parameters or when it's cheap to get the gradient.
           | GFO _could_ be used to train models, but kinda in the same
           | way that you _could_ find your way to Cleveland by just
           | wandering around long enough. It sure would help a lot if you
           | had a compass.
        
       | shankr wrote:
        | Can you use the methods in this library to do constrained
        | optimization? Whenever I look for such a gradient-free
        | optimizer, it always comes with some constraint.
        
         | bbwwnn wrote:
          | One simple trick is doing a bijective transformation between
          | your constrained input space and the unconstrained optimizer
          | space. For example, if you have x>0 in your input space, you
          | take the input x from your optimizer and apply e^x to it. Or
          | if you have a<x<b you can use a + (b-a)*sigmoid(x). What
          | exactly you choose will depend on your prior on how your
          | function behaves.
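          | 
          | A sketch of both transformations (the sigmoid squashes an
          | unconstrained optimizer variable into (a, b)):
          | 
          |     import numpy as np
          | 
          |     def to_positive(x):
          |         return np.exp(x)  # maps all reals onto (0, inf)
          | 
          |     def to_interval(x, a, b):
          |         # sigmoid maps all reals onto (0, 1), then rescale
          |         return a + (b - a) / (1.0 + np.exp(-x))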
        
       | simonblanke wrote:
        | I just saw that my project got a lot of new stars on GitHub. A
        | surprise to be sure, but a welcome one.
        
       | batterylow wrote:
        | A book on Practical Evolutionary Algorithms with Python, for
        | anyone interested: https://datacrayon.com/shop/product/practical-
        | evolutionary-a...
        
       | jordigh wrote:
        | Is the Nelder-Mead, a.k.a. the _other_ simplex method, any good?
        | I remember most of these other methods from my optimisation
        | lessons, but I'm surprised to not see Nelder-Mead in this list.
        
         | simonblanke wrote:
         | I am currently working on the Nelder-Mead algorithm. I did not
         | realize that it is that popular. This gives me motivation to
         | implement it soon ;-)
         | 
         | If there are more "must have"-algorithms you could open an
         | issue in Gradient-Free-Optimizers.
        
           | jordigh wrote:
           | I like it because it was so easy to implement and it worked
           | well, unless your objective function had very narrow and
           | steep valleys.
           | 
           | I don't know if it's popular, heheh, it's been a while since
           | I've been in maths school, so that's why I'm asking.
        
       ___________________________________________________________________
       (page generated 2021-02-28 23:00 UTC)