This is the mail archive of the gsl-discuss@sources.redhat.com mailing list for the GSL project.



Re: Multivariate minimization


Hi Robert.

In many applications, a local minimum is good enough. This is the case, for instance, with neural networks. It is not uncommon to use neural networks with more than 1000 numerical parameters and still use vector BFGS. You're perfectly right to say that GLOBAL optimization is a major problem (and in fact the no-free-lunch theorems show that even GAs cannot do anything about that), but this is not always what we want to do.

Fabrice

Robert G. Brown wrote:
On Fri, 22 Aug 2003, Przemyslaw Sliwa wrote:


Dear friends,

I have some questions regarding the multivariate minimisation routines.
I have to minimise a highly non-linear function depending on several
variables (about 5000). I want to use the vector BFGS method because I
have heard it is the best method for high dimensions. Is it correct???


Minimizing a "highly nonlinear function" depending on 5000 variables is
a nontrivial computational task and this process is the core nut that
must be cracked to solve many problems in physics or statistical
mechanics.  As in, for a rather large space of possible functions it is
not computationally possible.  To illustrate, suppose the variables are
the simplest possible variables -- binary variables with just two
states.  The space of all variable combinations is then a binary string
with 5000 digits, or 2^5000 values.  Being sloppy and estimating 2^10 =
10^3 as is common, this is 10^1500, a pretty big number.

Then imagine that the nonlinear function is 1 for all values of those
variables except for one where it is 0, its minimum -- perhaps the
number 00000011000011010001010...errm, forgive me if I don't write the
rest of the digits:-)

The function has no useful gradient -- the only way to find the minimum
is to do an exhaustive search, and even if we use a 1000-node beowulf to
search 10^10 possibilities a second, it still takes us 10^(1500-13)
seconds, which is the lifetime of many, many universes.  If we let every
atom in the universe be a computer that searches the possibilities at
10^43/second (which is the rough boundary, IIRC, of the quantum
granularity of time -- the Planck time) we STILL require the lifetime of
many, many universes to search.  In other words, we can't do this search
and never will be able to do this search and expect to find the minimum
unless we just happen to stumble onto it, and the probability of such a
stumble makes the probability of getting hit on the head by a meteor at
your desk today look big.
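The back-of-envelope numbers above are easy to check exactly; here is the arithmetic in a few lines of Python (the node count and evaluation rate are the same made-up figures as in the text):

```python
import math

# Size of the search space: 2^5000 binary configurations.
n_vars = 5000
log10_space = n_vars * math.log10(2)      # exact base-10 exponent, ~1505

# Hypothetical cluster: 1000 nodes, 10^10 evaluations/second each.
log10_rate = math.log10(1000 * 1e10)      # = 13

log10_seconds = log10_space - log10_rate
print(f"states    ~ 10^{log10_space:.0f}")
print(f"wall time ~ 10^{log10_seconds:.0f} seconds")
```

The sloppy 2^10 ~ 10^3 estimate slightly undershoots the exact exponent (1500 vs. ~1505), which only strengthens the point.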

This is one boundary, although of course your variables could have more
than 2 states and your function could have more than two values (which
just adds to the computation time required).

If your function has gradients and your variable and function spaces are
quasi-continuous (binary-encoded floats), you aren't really any better
off.  Sure, there are non-linear functions that look like a great big
well in 5000 dimensions that goes monotonically down to the minimum from
any starting point, but these are boring and almost infinitely unlikely.
Far MORE likely is that your minimization surface looks like the surface
of a sheet of paper that has been crumpled, or the surface of the earth
seen from on high.  There are hills and valleys everywhere, and many of
the valleys are completely surrounded by hills.  Here and there somebody
may have driven a narrow well in an otherwise nearly flat terrain.  Even
if the surface is over a ONE dimensional or TWO dimensional space,
finding the global minimum can range from computationally nontrivial to
nearly impossible (well, nowadays one might manage an exhaustive search
at binary granularity if the limits weren't too large).

Gradient methods, you see, are limited because one has to start IN the
right valley to be ABLE to go downhill to the right minimum, and there
can be a near infinity of valleys, mapping the problem right back to the
disjoint binary search problem we started with.  Once again, in most
complex cases of interest, solving for THE global minimum is simply
impossible.

Finding >>a<< minimum that is pretty low, a lot lower than a large
fraction of the surrounding terrain, is sometimes doable.  Methods like
simulated annealing and so forth are designed to perform searches that
can "hop" from valley to valley in search of a "good" valley that goes
down a lot lower than most of the valleys on the n-surface.  Even so,
for n=5000 dimensions this sort of thing would be nearly useless unless
the surface were relatively smooth and the valleys hierarchical.
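As an illustration of that valley-hopping idea, here is a minimal simulated-annealing sketch on a toy one-dimensional multi-well surface (Python; the cost function, cooling schedule, and parameters are all made up for illustration -- this is the general technique, not GSL's siman interface):

```python
import math, random

def cost(x):
    # Toy multi-well surface: many local minima, global minimum at x = 0.
    return x * x + 10.0 * (1.0 - math.cos(2.0 * math.pi * x))

def anneal(x0, t0=10.0, cooling=0.995, steps=5000, seed=42):
    rng = random.Random(seed)
    x, t = x0, t0
    best_x, best_e = x, cost(x)
    for _ in range(steps):
        x_new = x + rng.gauss(0.0, 1.0)           # propose a hop
        de = cost(x_new) - cost(x)
        # Always accept downhill moves; accept uphill with prob exp(-dE/T),
        # which is what lets the walker climb out of a bad valley.
        if de < 0 or rng.random() < math.exp(-de / t):
            x = x_new
            if cost(x) < best_e:
                best_x, best_e = x, cost(x)
        t *= cooling                              # geometric cooling
    return best_x, best_e

x, e = anneal(x0=4.3)
print(f"best x ~ {x:.3f}, cost ~ {e:.3f}")
```

In one dimension this works nicely; the text's point is that in 5000 dimensions the same hopping strategy only helps if the valleys are organized hierarchically.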

The best, nearly ONLY method I am aware of for searching spaces of this
sort of dimensionality for a reasonable optimum is genetic algorithm
optimization.  It works like this.  You need the right, or at least a
good, valley in order to start a gradient descent to the bottom, which
might be a decent minimum (defined as the best you can afford and one
hopefully representative of the One True Minimum in some way).  One way
to find such a starting point is to roll dice -- use Monte Carlo to
generate a 5000-vector and start going downhill.
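That "roll dice, then go downhill" loop is simple to sketch (Python; a toy rippled 2-D bowl stands in for the real 5000-dimensional function, and the descent is plain fixed-step gradient descent rather than BFGS):

```python
import math, random

def f(x, y):
    # Toy rippled bowl: many valleys, global minimum at the origin.
    return (x * x + y * y
            + 5.0 * math.sin(3.0 * x) ** 2 + 5.0 * math.sin(3.0 * y) ** 2)

def grad(x, y, h=1e-6):
    # Numerical gradient via central differences (fine for a sketch).
    return ((f(x + h, y) - f(x - h, y)) / (2 * h),
            (f(x, y + h) - f(x, y - h)) / (2 * h))

def descend(x, y, lr=0.01, steps=500):
    # Fixed-step gradient descent into whatever valley we started in.
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y, f(x, y)

rng = random.Random(1)
best = min((descend(rng.uniform(-5, 5), rng.uniform(-5, 5))
            for _ in range(20)),          # 20 Monte Carlo restarts
           key=lambda t: t[2])
print("best value found:", best[2])
```

Each restart lands in whichever valley it happens to roll into, which is exactly the weakness the next paragraph describes.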

However, if you bisect the range of each dimension and imagine that the
good valleys live in only one half of each dimension with only some of
the dimensions being "important" and the others having a nearly flat
effect, you see that the problem still maps into the binary search
problem and your chances of picking a good valley by chance are remote.
N MC-generated starting points only very slowly search the space, and
the information gained from each trial is wasted when a completely new
trial is generated.

A GA doesn't waste that information.  It recombines the variables of
many MC trials in new ways, using the most successful trials to date to
do so.  Consequently it covers the space at a rate (IIRC) of N^3
relative to the number of trials N.  Not a huge advantage compared to
2^5000, but important -- it gives you a chance at getting a LOT of the
most important variables at least in the general ballpark of what may be
a decent minimum.
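A minimal GA of that flavour, on a toy bitstring problem (Python; the fitness function, population size, and rates are all made up for illustration):

```python
import random

rng = random.Random(7)
N_BITS, POP, GENS = 60, 40, 80

def fitness(bits):
    # Toy objective: count of 1-bits (a stand-in for -cost of a real problem).
    return sum(bits)

def tournament(pop):
    # Pick the better of two random individuals: "most successful to date".
    a, b = rng.sample(pop, 2)
    return max(a, b, key=fitness)

def crossover(p1, p2):
    # Recombine the variables of two trials at a random cut point.
    cut = rng.randrange(1, N_BITS)
    return p1[:cut] + p2[cut:]

def mutate(bits, rate=0.01):
    return [b ^ 1 if rng.random() < rate else b for b in bits]

pop = [[rng.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP)]
start_best = max(map(fitness, pop))
for _ in range(GENS):
    elite = max(pop, key=fitness)          # keep the best trial unchanged
    pop = [elite] + [mutate(crossover(tournament(pop), tournament(pop)))
                     for _ in range(POP - 1)]
print("best fitness:", start_best, "->", max(map(fitness, pop)))
```

The point is that crossover reuses the information in good trials instead of discarding it, which is exactly what independent Monte Carlo restarts fail to do.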

So I'd recommend a GA followed by (or integrated with) the gradient
method of your choice, preferably one that can proceed with a lot of
variables.  Then cross your fingers and hope, because frankly even a GA
is going to really "use" only a few hundred at most of your variables in
most cases.  You'd almost do as well by determining the 100 most
important variables and restricting your space to them, unless as I
noted your function is really rather simple.

rgb



