[HN Gopher] Gradient Free Optimizers: A collection of modern opt...
___________________________________________________________________
Gradient Free Optimizers: A collection of modern optimization
methods in Python

Author : EntICOnc
Score  : 176 points
Date   : 2021-02-28 13:10 UTC (9 hours ago)

| freemint wrote:
| If you have an expensive but not high-dimensional problem you
| might want to try https://github.com/DLR-SC/sqpdfo . It is based
| on Kriging and SQP, and it also handles constraints.
| klon wrote:
| I am surprised CMA-ES is not included. It is the state-of-the-art
| gradient-free optimization algorithm, last I checked.
|
| https://en.wikipedia.org/wiki/CMA-ES
| simonblanke wrote:
| Evolution strategy is included. The "covariance matrix
| adaptation" is what makes this algorithm work in continuous
| search spaces, but gradient-free-optimizers uses a discrete
| search space.
| morelandjs wrote:
| This looks great! You have a lot of algorithms here. Which ones
| do you find to be the most promising for general use, and why?
| simonblanke wrote:
| I think Random Restart Hill Climbing is good if your objective
| function evaluates quickly (simple functions). It does not get
| stuck in local optima and also does local search very well.
|
| Bayesian Optimization is very good if your objective function
| takes some time (> 1 second) to evaluate. If you look at the
| gifs in the readme of Gradient-Free-Optimizers you will see how
| fast it finds promising solutions. It is almost scary how good
| it is (even compared to other SMBO methods).
| morelandjs wrote:
| Thanks, that's useful. My impression of Bayesian parameter
| optimization using Gaussian processes is that it is quite
| good when the data has a more or less constant correlation
| length across the evaluation points, as in your example.
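The correlation-length behaviour described here comes from the GP surrogate's kernel assuming one global length-scale. A minimal pure-numpy sketch of a GP posterior with an RBF kernel (not the library's implementation; `ell`, `gp_posterior`, and all other names are illustrative):

```python
import numpy as np

# Minimal GP posterior sketch with an RBF kernel and a single global
# length-scale `ell`. Because `ell` is one number for the whole
# domain, regions whose true correlation length differs from `ell`
# get over- or under-smoothed.

def rbf(a, b, ell=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(x_train, y_train, x_test, ell=1.0, noise=1e-6):
    K = rbf(x_train, x_train, ell) + noise * np.eye(len(x_train))
    K_s = rbf(x_train, x_test, ell)
    K_ss = rbf(x_test, x_test, ell)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, std

x_obs = np.array([0.0, 1.0, 2.0, 3.0])
y_obs = np.sin(x_obs)
mean, std = gp_posterior(x_obs, y_obs, np.array([1.5]))
```

The posterior std is near zero at observed points and grows between them, which is exactly the structure Bayesian optimization exploits.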
|
| When there are large correlation lengths in some regions of
| the dataset and small correlation lengths elsewhere, it often
| overshoots (or undershoots) the curvature of the hypersurface.
| protoplaid wrote:
| Which algorithm would you recommend when the objective
| function is noisy (and nondeterministic)?
|
| For example, the objective function is the "score" of a
| particular stochastic simulation, which can be started with a
| varied initial random seed, or the result of a real physical
| experiment, which is naturally stochastic (and expensive to
| evaluate).
|
| There is a tradeoff between getting a very accurate
| estimate of the objective function + variance _of a single
| point_ vs. exploring other points. Is there a search algorithm
| that somehow manages this tradeoff automatically?
|
| Note: In the past I've used Tree of Parzen Estimators (kernel
| density estimators), wasting 3-4 evaluations per point, but I
| have a feeling it is sub-optimal. Is there an "optimal"
| algorithm, like the _optimal algorithm_ for the multi-armed
| bandit problem[1] (which is a similar problem)?
|
| [1] https://en.wikipedia.org/wiki/Multi-armed_bandit
| morelandjs wrote:
| I'm wondering the same. I'd be concerned that Random
| Restart Hill Climbing would essentially chase random noise.
| simonblanke wrote:
| You could be right. I must confess that I have a
| (probably) very narrow understanding of typical
| optimization problems. Most of the objective functions I
| optimize have machine learning algorithms in them (to
| optimize hyperparameters). Depending on the evaluation,
| those can have low noise.
|
| If you like, you could provide other use cases and
| applications for optimization algorithms.
| plaidfuji wrote:
| Bayesian optimization is designed for that case specifically. It
| fits a surrogate model with uncertainty estimates and picks
| the next point with an understanding of that uncertainty.
| Look up the MEI acquisition function for more info.
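The acquisition-function idea can be sketched concretely. Below is a generic expected-improvement (EI) acquisition in plain numpy, a common choice related to the MEI variant mentioned here; `xi` is the kind of exploration/exploitation knob discussed further down, and all names are illustrative, not the library's API:

```python
import math
import numpy as np

# Generic expected-improvement acquisition for maximization. `mu` and
# `sigma` are the surrogate's posterior mean and standard deviation at
# candidate points, `y_best` the best observed value, and `xi` an
# exploration knob: larger `xi` favors uncertain candidates over
# exploiting the current best region.

def expected_improvement(mu, sigma, y_best, xi=0.01):
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    ei = np.zeros_like(mu)
    mask = sigma > 0.0   # EI is zero wherever the model claims certainty
    if not mask.any():
        return ei
    z = (mu[mask] - y_best - xi) / sigma[mask]
    cdf = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    ei[mask] = (mu[mask] - y_best - xi) * cdf + sigma[mask] * pdf
    return ei
```

Note that a candidate whose mean equals `y_best` still gets positive EI as long as its variance is positive, which is why variance modelling matters for noisy objectives.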
|
| Edit: BO does usually require some tuning for your use
| case. Its acquisition function sometimes samples replicates
| where there's high noise, especially if the first sample
| looks particularly "good". There's usually a hyperparameter
| that can be set to favor exploration vs. exploitation, i.e.
| to favor non-replicate samples. But I am not aware of an
| algorithm that can learn your preference along that axis.
| protoplaid wrote:
| Correct me if I'm wrong, but it seems the
| bayesian_optimization.py optimizer in this library
| assumes that the sampled points are exact, i.e. that their
| variance is zero. It doesn't seem to re-sample existing
| points.
|
| This will cause the algorithm to "chase random noise", as
| morelandjs wrote below.
| plaidfuji wrote:
| I would be very disappointed if that were the case.. no,
| it looks like it's set up to capture variance. The BO
| algo wraps an "Expected Improvement Optimizer":
|
| https://github.com/SimonBlanke/Gradient-Free-
| Optimizers/blob...
|
| which selects new points based on both the model's mean
| estimate and its variance. See around line 58.
| protoplaid wrote:
| line 62: exp_imp[sigma == 0.0] = 0.0
|
| I'm afraid it never samples points more than once, since
| it treats already-sampled points as points with variance
| zero, and hence no expected improvement.
|
| IMHO that's wrong. The variance of a single _sample_ should
| be infinite (classical statistics), or similar to the
| variance of nearby points (Bayesian + model), or some pre-
| defined prior (not a great idea... I'd prefer some
| automatic method). But not zero.
| plaidfuji wrote:
| Ah, good catch. So in the event the GPR predicts zero
| variance, the optimizer says EI is zero and thus won't
| sample again. That may depend on the settings of the
| GPR.. if I'm not mistaken there are ways for a GPR to model
| noise and not collapse to zero variance on every sampled
| point.
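The fix alluded to here, letting the GPR model noise, can be shown in a few lines: adding a white-noise "nugget" to the kernel matrix keeps the posterior variance at an already-sampled point positive, so EI there no longer collapses to zero. A pure-numpy sketch with an RBF kernel, independent of the library's code:

```python
import numpy as np

# The "nugget" effect in miniature: with a white-noise term added to
# the kernel matrix, the GP posterior variance at an already-sampled
# point stays positive instead of collapsing to zero, so an EI-style
# acquisition can still decide to re-sample it. RBF kernel, three 1-D
# training points; purely illustrative.

def posterior_std_at_sampled_point(noise):
    x = np.array([0.0, 1.0, 2.0])
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)  # RBF, ell = 1
    K_noisy = K + noise * np.eye(len(x))               # white-noise "nugget"
    k_star = K[:, 1]       # covariances between x = 1.0 and the training points
    var = K[1, 1] - k_star @ np.linalg.solve(K_noisy, k_star)
    return float(np.sqrt(max(var, 0.0)))
```

As `noise` goes to zero the posterior std at the sampled point collapses to zero (the behaviour criticized above); a finite noise term keeps it open.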
|
| Anyway, I guess I stand by my original suggestion that BO
| is the best tool for gradient-free optimization with slow and
| noisy fevals, but to my knowledge nobody has built a way
| to dial in the hyperparameters automatically. And there
| are quite a few. Entire companies exist for this purpose;
| SigOpt comes to mind..
| morelandjs wrote:
| I haven't looked at his source, but usually there's a
| white-noise term in the covariance structure of the
| Gaussian process regressor that adds some statistical
| uncertainty at each evaluation point. So even when it
| evaluates a point of parameter space, the GPR is still
| somewhat uncertain about the value of the optimization
| function at that point. So it should balance exploration
| versus exploitation taking that statistical uncertainty
| into account.
| protoplaid wrote:
| > especially if the first sample looks particularly
| "good".
|
| You've precisely described the problem: because of the
| assumption of zero variance, the algorithm will get stuck
| on a point if the _first_ sample looks good, until it
| randomly hits a luckier sample (but not necessarily a
| better point).
|
| Another related problem is that the boundaries of the
| parameter space can have a bad score (objective function)
| but _very low variance_ (they're _always_ bad), which
| misleads the search function into believing that the
| interior points also have a very low variance, which is
| incorrect.
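The failure mode described here is easy to reproduce in miniature: with a noisy objective, one lucky evaluation can make a bad point look better than the true optimum, while averaging a few replicates per point washes the luck out. A toy sketch; the objective and all names are invented for illustration:

```python
import random

# A noisy objective evaluated once can mislead an optimizer that
# assumes zero variance; averaging replicates shrinks the noise
# variance as 1/replicates, at the cost of extra evaluations.

def noisy_objective(x, rng):
    return -(x - 2.0) ** 2 + rng.gauss(0.0, 1.0)   # true optimum at x = 2

def score(x, rng, replicates=1):
    # Mean of `replicates` evaluations of the noisy objective.
    return sum(noisy_objective(x, rng) for _ in range(replicates)) / replicates
```

This replicate-averaging is essentially the "wasting 3-4 evaluations per point" workaround mentioned earlier in the thread; the open question raised there is whether an acquisition function can manage that budget automatically.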
|
| If anyone knows of a library that handles those cases
| correctly, without requiring user-defined priors for each
| dimension, I'd be glad to hear about it.
| optimalsolver wrote:
| Not OP, but you can't go wrong with randomized/stochastic
| hill-climbing. It's actually very competitive with complex,
| fancy genetic algorithms:
|
| https://core.ac.uk/download/pdf/37767541.pdf
| uyt wrote:
| For those who want to learn more, I really like the book
| "Algorithms for Optimization":
| https://algorithmsbook.com/optimization/#outline
|
| It's a great intro because it covers a large breadth and gives
| you a sense of what class of techniques are useful for what
| (rather than just going super deep into one particular
| technique).
|
| They also have a github of julia notebooks implementing most of
| the ideas: https://github.com/sisl/algforopt-notebooks
| yaroslavvb wrote:
| SPSA seems to be missing. OpenAI's "evolutionary strategies"
| paper used a form of SPSA.
| https://en.wikipedia.org/wiki/Simultaneous_perturbation_stoc...
| simonblanke wrote:
| Also: Gradient-Free-Optimizers is basically just the
| optimization backend for a much larger project of mine:
| https://github.com/SimonBlanke/Hyperactive
| amatic wrote:
| Amazing collection! I'm learning about backprop in neural
| networks and I'm reading how that kind of optimization may not be
| biologically 'implementable'. Are non-gradient-based optimizers
| an alternative proposal for biological learning? How do they
| compare to backprop?
| MauranKilom wrote:
| I'd recommend also implementing the DIRECT algorithm, which
| balances local and global search very well (but consequently will
| not refine as aggressively near the current [possibly only local]
| optimum as other algorithms):
|
| https://www.researchgate.net/profile/Donald-Jones-5/publicat...
|
| You probably also want to include some advantages/disadvantages
| of each algorithm. How robust against local minima is it?
| Up to how many dimensions does it work well? How is the
| convergence speed when started far from the optimum? Does it
| work well with few function evaluations? Etc.
| simonblanke wrote:
| I will look into this algorithm. Thanks for the suggestion. I
| have some basic explanations of the optimization techniques and
| their parameters in a separate repository:
| https://github.com/SimonBlanke/optimization-tutorial
|
| But there is still a lot of work to be done.
| asah wrote:
| Given the rise of multi-core machines, parallel optimization
| seems pretty important. Which of these support parallel
| execution?
|
| For this reason, I chose Mango for a recent project:
| https://github.com/ARM-software/mango (a Bayesian optimizer with
| built-in support for both parallel optimization (via
| multiprocessing) and distributed execution (via celery))
| simonblanke wrote:
| Gradient-Free-Optimizers is a lightweight optimization package
| that serves as a backend for Hyperactive:
| https://github.com/SimonBlanke/Hyperactive
|
| Hyperactive can do parallel computing with multiprocessing,
| joblib, or a custom wrapper function. Parallel computing can be
| done with all of its optimization algorithms. You could even pass
| a Gaussian process regressor like GPflow to the Bayesian
| Optimizer (GFO and Hyperactive) to get GPU acceleration.
| wtallis wrote:
| Consider adding some affordance for parallel computing to
| Gradient-Free-Optimizers by allowing the user to provide a
| vectorized objective function instead of one that evaluates
| only a single point in the search space per function call.
| That leaves all the hard work of parallelization as an
| exercise for the user, and gives the user the flexibility to
| parallelize their objective function with whatever mechanism
| they wish.
|
| I have previously used this approach in a project where the
| objective function contained a half-hour-long simulation,
| which was the bottleneck that made estimating a gradient
| intractable.
| When the optimization algorithm gave a batch of several
| points in the search space to evaluate, our objective
| function could prepare and run several instances of the
| simulation in parallel, and return when the whole batch was
| complete. From this, it was easy for us to also distribute
| simulation runs across several machines, without needing any
| changes to the optimizers. We would not have been able to
| easily achieve this with an optimization framework that tried
| to directly manage parallelization, because the steps
| necessary to prepare the input files for the simulation
| software had to be done serially.
|
| For that project we tried: DIRECT, several variants of
| Nelder-Mead, and an evolutionary strategy. In hindsight, the
| Nelder-Mead variants worked best; once we accumulated enough
| simulation results it became clear that our objective
| function was convex and pretty well-behaved in the region of
| interest. Nelder-Mead was also trivial to extend to trying
| several extra points per batch to ensure that each of our
| several workstations had something to work on. (We didn't
| have access to a large cluster, and Nelder-Mead wouldn't
| generalize well to a large degree of parallelization in that
| manner.)
| simonblanke wrote:
| Your parallel computing approach sounds intriguing! Could
| you provide an example script? I would like to look into
| this. If you like, you could open an issue as a feature
| request and provide a code snippet there.
| dijereedan wrote:
| What about genetic optimization?
| simonblanke wrote:
| Gradient-Free-Optimizers has a type of genetic algorithm:
| Evolution Strategy
| stilley2 wrote:
| Cool! I'd consider adding CMA-ES [1], which as far as I know
| is the gold standard for derivative-free optimization.
|
| 1. https://en.m.wikipedia.org/wiki/CMA-ES
| stilley2 wrote:
| Ah nevermind, I see this was addressed in another comment.
| morelandjs wrote:
| Getting a lot of these errors when using the GPR.
| Seems to be an issue with the estimation of the noise-level
| term. Scratching my head on this one. Anyone familiar with
| what's going on? I tried adding some random noise to the
| optimization function, and it had the same issue.
|
| ...python3.8/site-packages/sklearn/gaussian_process/kernels.py:402:
| ConvergenceWarning: The optimal value found for dimension 0 of
| parameter k2__noise_level is close to the specified lower bound
| 1e-05. Decreasing the bound and calling fit again may find a
| better value.
| simonblanke wrote:
| Yeah, this is a warning from sklearn's Gaussian process
| regressor. Sklearn probably wants you to increase the
| "n_restarts_optimizer" parameter of the GPR, but from my
| experience this warning does not correlate with bad results
| from Gradient-Free-Optimizers.
|
| Sklearn warnings are often difficult to silence. Just google
| "sklearn suppress warnings" to get a code snippet for this
| problem.
| morelandjs wrote:
| thanks!
| mik09 wrote:
| nice repo. i saw some comments expressing excitement and thought
| of the first time i saw pymoo, tpot, and finally automl. there is
| a lot of stuff lying around github; we just need to search
| better, e.g. using topics (tags) and the gazer repo, which kinda
| page-ranks github repos.
| simonblanke wrote:
| I had the same excitement for tpot! That project was the reason
| I started creating mine.
| plaidfuji wrote:
| Be still my heart... could this be... the missing gradient-free
| optim library for Python? Particle swarm, evolutionary, AND
| Bayesian all in a single package? A no-nonsense, simple API with
| strong enough defaults that you can basically just call .search()
| with a function (I'm looking at you, nevergrad)??
|
| I'll have to test it of course but this looks truly life
| changing. A few clarifications, if the author happens to be
| watching this thread:
|
| - looks like the search space is "enumerated", i.e.
| I define the grid spacing for each independent variable in
| search_space and it puts together an outer product? In other
| words, this will never be truly continuous?
|
| - by the same token, constraints are applied by virtue of
| specifying the endpoints of each grid in the search_space? Is it
| possible for me to apply more complex constraints by generating
| my own enumerated list of search_space points and feeding that
| in directly?
|
| Seriously well done. Like I said above, this looks to me like an
| immediate staple library.
| optimalsolver wrote:
| There's also Opytimizer [0], for almost every metaheuristic
| optimization algorithm under the Sun.
|
| [0] https://github.com/gugarosa/opytimizer
| simonblanke wrote:
| I am glad you like my project. The search space is not
| continuous, but you can make the steps as small as you want,
| like: np.arange(0, 10, 0.00001)
|
| I am not sure I understand your second question. The entire
| search space is always an N-dimensional cube. So you just define
| a start and end point and the step size in each dimension.
| plaidfuji wrote:
| Nope, that answers the question. I was asking about cases
| where the space is, for example, a triangle or something
| non-cuboidal. Think constraints like x1 + x2 < 1 or something
| like that.
|
| What I'm wondering is if your library converts the
| search_space to an enumerated list of candidate points
| somewhere in the backend, and if there's a way for me to just
| construct that list and pass it directly for more complex
| cases.
| zmk_ wrote:
| They probably mean that you can search in n-dimensional
| cubes, but what they'd like is to have more general shapes
| for the grid, e.g., to allow restrictions of the sort
| a + b < 1.
| dwiel wrote:
| Another option in some cases is to just map the space to
| something else. Like in this case: x and y are
| unconstrained and are what the optimizer sees, with a = x and
| b = (1 - a) - abs(y).
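That remapping can be written out directly: the optimizer explores an unconstrained (x, y) plane, and the objective internally converts each point to (a, b) so that a + b <= 1 holds by construction. A sketch; the function names and toy score are illustrative, not part of any library API:

```python
# Reparameterization trick for the constraint a + b <= 1: the
# optimizer never sees (a, b), only the unconstrained (x, y).

def to_constrained(x, y):
    a = x
    b = (1.0 - a) - abs(y)   # a + b = 1 - |y| <= 1 for every (x, y)
    return a, b

def objective(x, y):
    a, b = to_constrained(x, y)              # constraint can never be violated
    return -(a - 0.3) ** 2 - (b - 0.4) ** 2  # toy score to maximize
```

The price, as noted below, is that the mapping distorts the geometry of the search space, which some optimizers handle better than others.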
|
| Not always easy to do, but it can work in some cases if the
| optimizer can't deal with it natively.
| simonblanke wrote:
| Okay, that is interesting. You could realize that by creating
| a restricted area in the search space: return np.nan from
| the objective function for those cases. Gradient-Free-
| Optimizers can handle np.nan and np.inf just fine.
|
| Maybe you could do something like:
|
| if a + b >= 1: return np.nan
| else: return score
| ialyos wrote:
| This is, at a high level, how one of our Google-internal
| black-box optimizers behaves. The term used is
| "infeasible region", but it's the same idea as using a
| nan.
|
| I would caution against using nan to always mean
| infeasible. Instead, users should catch experiments
| outside the feasibility region and return a special
| infeasible value. This will increase visibility into the
| behavior of the optimizer, because it leaves nan to be
| used for values inside the constraint region that are
| still problematic (due to bugs, numerical instability,
| etc.)
| unnah wrote:
| If the volume of feasible space is small, the optimizer
| may have considerable trouble finding the feasible
| region(s) in the domain. The simple solution is to use a
| penalty function instead, i.e. add a penalty term like
| M*max(0, (a+b)-1) to the objective, with sufficiently
| large M. If the optimum is at the boundary of the feasible
| region, you might still get a solution that is slightly
| infeasible, which can be worked around with more special-
| case logic...
| [deleted]
| abhgh wrote:
| This would be like the barrier function [1] for gradient-
| based methods, which produces some challenges for those
| techniques. Do you know how this affects gradient-free
| techniques?
|
| [1] https://en.m.wikipedia.org/wiki/Barrier_function
| aldanor wrote:
| Looks very nice!
|
| Two questions off the top of my head:
|
| - Might be worth mentioning how Hyperactive compares to the
| tons of other similar packages, like Hyperopt, Optunity and a
| ton of others?
| Also, how your GFO package differs from Optuna, Opytimizer,
| and many others as well. Having some benchmarks or
| performance stats (or even just listing functionality
| differences) would be very helpful to anyone who's been already
| using those.
|
| - One of the best universal 'works-out-of-the-box-most-of-the-
| time' global optimizers I've used recently is DLib's:
| http://blog.dlib.net/2017/12/a-global-optimization-algorithm...
| - any chance of implementing something similar?
| simonblanke wrote:
| Two very interesting questions! I should soon work on a
| comparison of Hyperactive and GFO to other packages. If some
| important features are missing, maybe I could add them.
|
| I will also look into Dlib. If you like, you can open an
| official issue in my repository. We could discuss why it is
| useful and whether it should be implemented.
| twic wrote:
| For constraint-programming solvers, there is an annual
| competition to see which is best:
|
| https://www.minizinc.org/challenge.html
|
| Is there anything like this for gradient-free optimisers?
| simonblanke wrote:
| I thought about a table in the readme that shows some kind of
| metric for each optimizer that describes its performance. I
| will look into that.
| twic wrote:
| I love gradient-free optimisers - there are so many interesting
| problems where there aren't analytical gradients available.
|
| Am I right in thinking this library doesn't contain a simplex /
| Nelder-Mead optimiser? To me that's the bread and butter of
| gradient-free optimisation.
|
| Also no Powell algorithms like BOBYQA?
|
| nlopt has these and many more, plus a Python wrapper:
|
| https://nlopt.readthedocs.io/en/latest/NLopt_Algorithms/
|
| So, are those omitted because this library is modern and those
| are now considered out of date?
| bloaf wrote:
| Came here to ask about the Powell method.
| I find Powell is typically what engineers come up with if
| asked to optimize something "by hand" without a mathematical
| model of the system they are optimizing.
|
| https://en.wikipedia.org/wiki/Powell%27s_method
| simonblanke wrote:
| I am currently working on the Nelder-Mead optimiser. I will
| also look into BOBYQA. I am always searching for interesting
| algorithms that my users need. If you have more suggestions,
| you can open an issue in the repository.
| sleavey wrote:
| I've used this, and it works nicely:
| https://github.com/numericalalgorithmsgroup/pybobyqa. I'd be
| happy if it were added to your project, then I could just use
| yours and have access to a bunch of alternatives with the
| same API.
| twic wrote:
| That's good to hear. I don't have any other specific requests
| - for me it seems impossible to know what will work on my
| problems, so I just try things at random to see what works!
|
| I don't know the architecture of your project, but is there
| any way it could include NLopt, and so make all of its
| algorithms available? Some of them could be useful, and I
| doubt it's a good use of your time to reimplement them all
| from scratch.
| optimalsolver wrote:
| I hope to see gradient-free/black-box optimization algorithms
| gain more popularity in machine learning. I'm always surprised
| when I see a review of ML optimization methods and it's not
| even mentioned that you can optimize an ML model without using
| the gradient.
|
| Also, love or hate Python, there's almost always a library for
| whatever it is you want to do.
| xyzzyz wrote:
| > and it's not even mentioned that you can optimize an ML model
| without using the gradient.
|
| Can you, though? How do the outcomes compare to gradient-based
| methods?
| cgearhart wrote:
| In general, GFO is not a good alternative when there are lots
| of model parameters or when it's cheap to get the gradient.
| GFO _could_ be used to train models, but kinda in the same
| way that you _could_ find your way to Cleveland by just
| wandering around long enough. It sure would help a lot if you
| had a compass.
| shankr wrote:
| Can you use the methods in this library to do constrained
| optimization? Whenever I look for such a gradient-free
| optimizer, it always comes with some constraint.
| bbwwnn wrote:
| One simple trick is doing a bijective transformation between
| your constrained input space and the unconstrained optimizer
| space. For example, if you have x > 0 in your input space, you
| take the input from your optimizer and apply e^x to it. Or if
| you have a < x < b you can do a + (b-a)*sigmoid(x). What
| exactly you choose will depend on your prior on how your
| function behaves.
| simonblanke wrote:
| I just saw that my project got a lot of new stars on github. A
| surprise to be sure, but a welcome one.
| batterylow wrote:
| A book on Practical Evolutionary Algorithms with Python, for
| anyone interested: https://datacrayon.com/shop/product/practical-
| evolutionary-a...
| jordigh wrote:
| Is the Nelder-Mead, a.k.a. the _other_ simplex method, any good?
| I remember most of these other methods from my optimisation
| lessons, but I'm surprised to not see Nelder-Mead in this list.
| simonblanke wrote:
| I am currently working on the Nelder-Mead algorithm. I did not
| realize that it is that popular. This gives me motivation to
| implement it soon ;-)
|
| If there are more "must have" algorithms, you could open an
| issue in Gradient-Free-Optimizers.
| jordigh wrote:
| I like it because it was so easy to implement and it worked
| well, unless your objective function had very narrow and
| steep valleys.
|
| I don't know if it's popular, heheh, it's been a while since
| I've been in maths school, so that's why I'm asking.
___________________________________________________________________
(page generated 2021-02-28 23:00 UTC)