Self−Organizing Map in Matlab: the SOM Toolbox - Aaltodoc
Publication 6
Self-Organizing Map in Matlab: the SOM Toolbox
Juha Vesanto, Johan Himberg, Esa Alhoniemi and Juha Parhankangas
In Proceedings of the Matlab DSP Conference 1999, Espoo, Finland, pp. 35-40, 1999.
Laboratory of Computer and Information Science, Helsinki University of Technology, Finland
Abstract
The Self-Organizing Map (SOM) is a vector quantization method which places the prototype vectors on a regular low-dimensional grid in an ordered fashion. This makes the SOM a powerful visualization tool. The SOM Toolbox is an implementation of the SOM and its visualization in the Matlab 5 computing environment. In this article, the SOM Toolbox and its usage are briefly presented. Its performance in terms of computational load is also evaluated and compared to a corresponding C program.
1. General
This article presents the (second version of the) SOM Toolbox, hereafter simply called the Toolbox, for the Matlab 5 computing environment by MathWorks, Inc. The SOM acronym stands for Self-Organizing Map (also called Self-Organizing Feature Map or Kohonen map), a popular neural network based on unsupervised learning [1]. The Toolbox contains functions for creation, visualization and analysis of Self-Organizing Maps. The Toolbox is available free of charge under the GNU General Public License from http://www.cis.hut.fi/projects/somtoolbox.
The Toolbox was born out of the need for a good, easy-to-use implementation of the SOM in Matlab for research purposes. In particular, the researchers responsible for the Toolbox work in the field of data mining, and therefore the Toolbox is oriented in that direction in the form of powerful visualization functions. However, people doing other kinds of research using the SOM will probably also find it useful, especially if they have not yet made a SOM implementation of their own in the Matlab environment. Since much effort has been put into making the Toolbox relatively easy to use, it can also be used for educational purposes.
The Toolbox (the basic package together with contributed functions) can be used to preprocess data, initialize and train SOMs using a range of different kinds of topologies, visualize SOMs in various ways, and analyze the properties of the SOMs and data, e.g. SOM quality, clusters on the map and correlations between variables. With data mining in mind, the Toolbox and the SOM in general are best suited for data understanding or survey, although they can also be used for classification and modeling.
2. Self-organizing map
A SOM consists of neurons organized on a regular low-dimensional grid, see Figure 1. Each neuron is a d-dimensional weight vector (prototype vector, codebook vector) where d is equal to the dimension of the input vectors. The neurons are connected to adjacent neurons by a neighborhood relation, which dictates the topology, or structure, of the map. In the Toolbox, topology is divided into two factors: local lattice structure (hexagonal or rectangular, see Figure 1) and global map shape (sheet, cylinder or toroid).
Figure 1. Neighborhoods (0, 1 and 2) of the centermost unit: hexagonal lattice on the left, rectangular on the right. The innermost polygon corresponds to the 0-neighborhood, the next to the 1-neighborhood and the outermost to the 2-neighborhood.
The SOM can be thought of as a net which is spread over the data cloud. The SOM training algorithm moves the weight vectors so that they span the data cloud and so that the map is organized: neighboring neurons on the grid get similar weight vectors. Two variants of the SOM training algorithm have been implemented in the Toolbox. In the traditional sequential training, samples are presented to the map one at a time, and the algorithm gradually moves the weight vectors towards them, as shown in Figure 2. In the batch training, the data set is presented to the SOM as a whole, and the new weight vectors are weighted averages of the data vectors. Both algorithms are iterative, but the batch version is much faster in Matlab since matrix operations can be utilized efficiently.
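The two update rules described above can be sketched in base Matlab as follows. This is only an illustration under assumptions that are not taken from the article: a tiny 2-by-3 rectangular sheet, a Gaussian neighborhood kernel, and hand-picked values for the radius and learning rate. It is not the Toolbox's own code.

```matlab
% Illustrative sketch only (base Matlab, no Toolbox functions).
D = [0 0; 0 1; 1 0; 1 1; 0.5 0.5; 0.2 0.8];   % n-by-d data matrix, one sample per row
[gx, gy] = meshgrid(1:3, 1:2);                 % 2-by-3 rectangular grid
G = [gx(:) gy(:)];                             % m-by-2 grid coordinates of the map units
m = size(G, 1);
n = size(D, 1);
d = size(D, 2);

% Arbitrary initial codebook (m-by-d); reusing the grid coordinates works only because d = 2 here.
M = repmat(mean(D, 1), m, 1) + 0.1 * (G - repmat(mean(G, 1), m, 1));

% Squared grid distances between all pairs of units, computed once and reused.
Ud = zeros(m);
for i = 1:m
    for j = 1:m
        Ud(i, j) = sum((G(i, :) - G(j, :)).^2);
    end
end
radius = 1;                                    % assumed neighborhood radius
H = exp(-Ud / (2 * radius^2));                 % assumed Gaussian neighborhood kernel, m-by-m

% Sequential step: move the BMU and its grid neighbors towards one sample x.
alpha = 0.5;                                   % assumed learning rate
x = D(1, :);
[~, b] = min(sum((M - repmat(x, m, 1)).^2, 2));            % index of the best matching unit
Mseq = M + alpha * repmat(H(:, b), 1, d) .* (repmat(x, m, 1) - M);

% Batch step: each new weight vector is a neighborhood-weighted average of all data vectors.
bmus = zeros(n, 1);
for k = 1:n
    [~, bmus(k)] = min(sum((M - repmat(D(k, :), m, 1)).^2, 2));
end
W = H(:, bmus);                                % m-by-n: kernel value between unit i and the BMU of sample k
Mbatch = (W * D) ./ repmat(sum(W, 2), 1, d);
```

The batch step reduces to one matrix product and one element-wise division, which is the kind of operation that Matlab executes efficiently.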
For a more complete description of the SOM and its implementation in Matlab, please refer to the book by Kohonen [1], and to the SOM Toolbox documentation.
Figure 2. Updating the best matching unit (BMU) and its neighbors towards the input sample marked with x. The solid and dashed lines correspond to the situation before and after updating, respectively.
3. Performance
The Toolbox can be downloaded for free from http://www.cis.hut.fi/projects/somtoolbox. It requires no other toolboxes, just the basic functions of Matlab (version 5.1 or later). The total disk space required for the Toolbox itself is less than 1 MB. The documentation takes a few MB more.
The performance tests were made on a machine with 3 GB of memory and eight 250 MHz R10000 CPUs (one of which was used by the test process) running the IRIX 6.5 operating system. Some tests were also performed on a workstation with a single 350 MHz Pentium II CPU, 128 MB of memory and the Linux operating system. The Matlab version in both environments was 5.3.
The purpose of the performance tests was only to evaluate the computational load of the algorithms. No attempt was made to compare the quality of the resulting mappings, primarily because there is no uniformly recognized “correct” method to evaluate it. The tests were performed with data sets and maps of different sizes, and three training functions: som_batchtrain, som_seqtrain and som_sompaktrain, the last of which calls the C program vsom to perform the actual training. This program is part of SOM_PAK [3], a free software package implementing the SOM algorithm in ANSI C.
Some typical computing times are shown in Table 1. As a general result, som_batchtrain was clearly the fastest. In IRIX it was up to 20 times faster than som_seqtrain and up to 8 times faster than som_sompaktrain. Median values were 6 times and 3 times, respectively. The advantage of som_batchtrain was greatest with larger data sets, while with a small set and large map it was actually slower. However, the latter case is very atypical, and can thus be ignored. In Linux, the smaller amount of memory clearly came into play: the margin between batch and the other training functions was halved.
The number of data samples clearly had a linear effect on the computational load. On the other hand, the number of map units seemed to have a quadratic effect, at least with som_batchtrain. Of course, an increase in input dimension also increased the computing times: about two- to threefold as the input dimension increased from 10 to 50. The most surprising result of the performance test was that, especially with large data sets and maps, som_batchtrain outperformed the C program (vsom used by som_sompaktrain). The reason is probably the fact that in SOM_PAK, distances between map units on the grid are always calculated anew when needed. In the SOM Toolbox, all these are calculated beforehand, as are many other required matrices.
Indeed, the major deficiency of the SOM Toolbox, and especially of the batch training algorithm, is its memory consumption. A rough lower-bound estimate of the amount of memory used by som_batchtrain is $8(5(m+n)d + 3m^2)$ bytes, where m is the number of map units, n is the number of data samples and d is the input space dimension. For a [3000 x 10] data matrix and 300 map units the amount of memory required is still moderate, in the order of 3.5 MB. But for a [30000 x 50] data matrix and 3000 map units, the memory requirement is more than 280 MB, the majority of which comes from the last term of the equation. The sequential algorithm is less extreme, requiring only one half or one third of this. SOM_PAK requires much less memory, about 20 MB for the [30000 x 50] case, and can operate with buffered data.
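As a quick check of the estimate, the formula can be evaluated for the two cases mentioned above. The helper name batch_mem_bytes is hypothetical, and 1 MB is assumed to mean 10^6 bytes.

```matlab
% Lower-bound memory estimate for som_batchtrain, using the formula from the text.
% m: number of map units, n: number of data samples, d: input dimension.
% batch_mem_bytes is a hypothetical helper name, not a Toolbox function.
batch_mem_bytes = @(m, n, d) 8 * (5 * (m + n) * d + 3 * m^2);

small = batch_mem_bytes(300, 3000, 10);      % [3000 x 10] data, 300 map units
large = batch_mem_bytes(3000, 30000, 50);    % [30000 x 50] data, 3000 map units
quad_share = 8 * 3 * 3000^2 / large;         % fraction contributed by the 3m^2 term

fprintf('small case: %.2f MB\n', small / 1e6);
fprintf('large case: %.0f MB\n', large / 1e6);
fprintf('share of the quadratic term in the large case: %.0f %%\n', 100 * quad_share);
```

Under these assumptions the small case comes to about 3.5 MB and the large case to 282 MB, of which roughly three quarters is due to the quadratic term in m.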
Table 1. Typical computing times. Data set size is given as [n x d] where n is the number of data samples and d is the input dimension.
data size     map units   batch     seq       sompak
IRIX
[300x10]      30          0.2 s     3.1 s     0.9 s
[3000x10]     300         7 s       54 s      17 s
[30000x10]    1000        5 min     19 min    9 min
[30000x50]    3000        27 min    5.7 h     75 min
Linux
[300x10]      30          0.3 s     2.7 s     1.9 s
[3000x10]     300         24 s      76 s      26 s
[30000x10]    1000        13 min    40 min    15 min
4. Use of SOM Toolbox
4.1. Data format
The kind of data that can be processed with the Toolbox is so-called spreadsheet or table data. Each row of the table is one data sample. The columns of the table are the variables of the data set. The variables might be the properties of an object, or a set of measurements taken at a specific time. The important thing is that every sample has the same set of variables. Some of the values may be missing, but the majority should be there. The table representation is a very common data format. If the available data does not conform to these specifications, it can usually be transformed so that it does.
The Toolbox can handle both numeric and categorical data, but only the former is utilized in the SOM algorithm. In the Toolbox, categorical data can be inserted into labels associated with each data sample. They can be considered as post-it notes attached to each sample. The user can check on them later to see what the meaning of some specific sample was, but the training algorithm ignores them. The function som_autolabel can be used to handle categorical variables. If the categorical variables need to be utilized in training the SOM, they can be converted into numerical variables using, e.g., mapping or 1-of-n coding [4].
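One possible way to do 1-of-n coding in base Matlab is sketched below. The example labels and variable names are assumptions made for illustration. The Toolbox itself is not involved.

```matlab
% Sketch of 1-of-n coding for one categorical variable (base Matlab only).
labels = {'setosa'; 'versicolor'; 'setosa'; 'virginica'};   % example values, one per sample
[cats, ~, idx] = unique(labels);          % cats: sorted distinct categories, idx: category index of each sample
n = numel(labels);
onehot = zeros(n, numel(cats));           % one binary column per category
onehot(sub2ind(size(onehot), (1:n)', idx(:))) = 1;
disp(cats');
disp(onehot);
```

Each row of onehot contains a single 1 in the column of its category. These columns could then be appended to the numerical data matrix before training.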
Note that for a variable to be “numeric”, the numeric representation must be meaningful: values 1, 2 and 4 corresponding to objects A, B and C should really mean that (in terms of this variable) B is between A and C, and that the distance between B and A is smaller than the distance between B and C. Identification numbers, error codes, etc. rarely have such meaning, and they should be handled as categorical data.
4.2. Construction of data sets
First, the data has to be brought into Matlab using, for example, the standard Matlab functions load and fscanf. In addition, the Toolbox has the function som_read_data, which can be used to read ASCII data files:
sD = som_read_data('data.txt');
The data is usually put into a so-called data struct, which is a Matlab struct defined in the Toolbox to group information related to a data set. It has fields for numerical data (.data), strings (.labels), as well as for information about the data set and the individual variables. The Toolbox utilizes many other structs as well, for example a map struct which holds all information related to a SOM. A numerical matrix can be converted into a data struct with: sD = som_data_struct(D). If the data only consists of numerical values, it is not actually necessary to use data structs at all. Most functions accept numerical matrices as well. However, if there are categorical variables, data structs have to be used. The categorical variables are converted to strings and put into the .labels field of the data struct as a cell array of strings.
4.3. Data preprocessing
Data preprocessing in general can be just about anything: simple transformations or normalizations performed on single variables, filters, calculation of new variables from existing ones. In the Toolbox, only the first of these is implemented as part of the package. Specifically, the function som_normalize can be used to perform linear and logarithmic scalings and histogram equalizations of the numerical variables (the .data field). There is also a graphical user interface tool for preprocessing data, see Figure 3.
Scaling of variables is of special importance in the Toolbox, since the SOM algorithm uses the Euclidean metric to measure distances between vectors. If one variable has values in the range [0,...,1000] and another in the range [0,...,1], the former will almost completely dominate the map organization because of its greater impact on the distances measured. Typically, one would want the variables to be equally important. The standard way to achieve this is to linearly scale all variables so that their variances are equal to one.
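The effect of scaling to unit variance can be illustrated with plain matrix operations. The sketch below uses base Matlab and made-up data, and it also centers each variable at zero, which is an assumption of this example. In the Toolbox the function described for this purpose is som_normalize.

```matlab
% Sketch of variance normalization in base Matlab (illustrative data, no Toolbox calls).
D = [0 0.1; 250 0.4; 500 0.5; 1000 0.9];    % two variables on very different scales
n = size(D, 1);
mu = mean(D, 1);                             % per-variable mean
sigma = std(D, 0, 1);                        % per-variable standard deviation
Dn = (D - repmat(mu, n, 1)) ./ repmat(sigma, n, 1);
disp(var(Dn, 0, 1));                         % both variances are now 1, up to rounding
```

After this transformation both variables contribute on a comparable scale to Euclidean distances.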
One of the advantages of using data structs instead of simple data matrices is that the structs retain information about the normalizations in the field .comp_norm. Using the function som_denormalize one can reverse the normalization to get the values in the original scale: sD = som_denormalize(sD). Also, one can apply exactly the same normalizations to other data sets.
All normalizations are single-variable transformations. One can apply one kind of normalization to one variable, and another type of normalization to another variable. Also, multiple normalizations can be applied one after the other to each variable. For example, consider a data set sD with three numerical variables. The user could apply a histogram equalization to the first variable, a logarithmic scaling to the third variable, and finally a linear scaling to unit variance to all three variables:
sD = som_normalize(sD,'histD',1);
sD = som_normalize(sD,'log',3);
sD = som_normalize(sD,'var',1:3);
The data does not necessarily have to be preprocessed at all before creating a SOM from it. However, in most real tasks preprocessing is important, perhaps even the most important part of the whole process [4].
Figure 3. Data set preprocessing tool.
Figure 4. SOM initialization and training tool.
4.4. Initialization and training
There are two initialization (random and linear) and two training (sequential and batch) algorithms implemented in the Toolbox. By default, linear initialization and the batch training algorithm are used. The simplest way to initialize and train a SOM is to use the function som_make, which does both using automatically selected parameters:
sM = som_make(sD);
The training is done in two phases: rough training with a large (initial) neighborhood radius and a large (initial) learning rate, and fine-tuning with a small radius and learning rate. If tighter control over the training parameters is desired, the respective initialization and training functions, e.g. som_batchtrain, can be used directly. There is also a graphical user interface tool for initializing and training SOMs, see Figure 4.
4.5. Visualization and analysis
There are a variety of methods to visualize the SOM. In the Toolbox, the basic tool is the function som_show. It can be used to show the U-matrix and the component planes of the SOM:
som_show(sM);
The U-matrix visualizes distances between neighboring map units, and thus shows the cluster structure of the map: high values of the U-matrix indicate a cluster border, while uniform areas of low values indicate the clusters themselves. Each component plane shows the values of one variable in each map unit. On top of these visualizations, additional information can be shown: labels, data histograms and trajectories.
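The idea behind the U-matrix can be illustrated with a simplified measure: for each unit, the mean distance between its weight vector and those of its grid neighbors. The sketch below assumes a rectangular sheet with 4-connected neighbors and a hand-made codebook. It is not the Toolbox's own U-matrix computation, which may differ in layout and detail.

```matlab
% Simplified U-matrix-like measure on a rectangular sheet (base Matlab, illustrative only).
rows = 2; cols = 4; d = 2;
M = zeros(rows, cols, d);                 % codebook arranged on the grid; left half stays at the origin
M(:, 3:4, :) = 5;                         % right half: a second tight cluster far from the first
U = zeros(rows, cols);
for r = 1:rows
    for c = 1:cols
        nb = [r-1 c; r+1 c; r c-1; r c+1];                    % candidate 4-connected neighbors
        ok = nb(:, 1) >= 1 & nb(:, 1) <= rows & nb(:, 2) >= 1 & nb(:, 2) <= cols;
        nb = nb(ok, :);                                       % keep only neighbors inside the sheet
        dist = zeros(size(nb, 1), 1);
        for k = 1:size(nb, 1)
            delta = M(r, c, :) - M(nb(k, 1), nb(k, 2), :);
            dist(k) = sqrt(sum(delta(:).^2));                 % Euclidean distance between weight vectors
        end
        U(r, c) = mean(dist);
    end
end
disp(U);
```

In this toy map the two middle columns get high values because they sit on the border between the two clusters, while the outer columns stay at zero.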
With the function som_vis much more advanced visualizations are possible. The function is based on the idea that the visualization of a data set simply consists of a set of objects, each with a unique position, color and shape. In addition, connections between objects, for example neighborhood relations, can be shown using lines. With som_vis the user is able to assign arbitrary values to each of these properties. For example, x-, y-, and z-coordinates, object size and color can each stand for one variable, thus enabling the simultaneous visualization of five variables. The different options are:
- the position of an object can be 2- or 3-dimensional
- the color of an object can be freely selected from the RGB cube, although typically indexed color is used
- the shape of an object can be any of the Matlab plot markers ('.', '+', etc.), a pie chart, a bar chart, a plot or even an arbitrarily shaped polygon, typically a rectangle or hexagon
- lines between objects can have arbitrary color, width and any of the Matlab line modes, e.g. '-'
- in addition to the objects, associated labels can be shown
For quantitative analysis of the SOM there are at the moment only a few tools. The function som_quality supplies two quality measures for a SOM: average quantization error and topographic error. However, using low-level functions, like som_neighborhood, som_bmus and som_unit_dists, it is easy to implement new analysis functions. Much research is being done in this area, and many new functions for the analysis will be added to the Toolbox in the future, for example tools for clustering and analysis of the properties of the clusters. New visualization functions for making projections and specific visualization tasks will also be added to the Toolbox.
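To make the two quality measures concrete, the sketch below computes them by hand for a toy map. It assumes commonly used definitions that the article does not spell out: quantization error as the mean distance from each sample to its best matching unit, and topographic error as the share of samples whose first and second best matching units are not adjacent on the grid. The 1-by-4 chain map and the data are made up. This is not the code of som_quality.

```matlab
% Sketch of average quantization error and topographic error on a 1-by-4 chain map.
% Unit i sits at grid position i; units 3 and 4 are deliberately out of order in data space.
M = [0 0; 1 0; 3 0; 2 0];                  % codebook, one weight vector per row
D = [0.1 0; 0.9 0.2; 2.6 0; 1.4 0.1];      % data, one sample per row
n = size(D, 1);
m = size(M, 1);
qerr = zeros(n, 1);
nonadjacent = false(n, 1);
for k = 1:n
    dist = sqrt(sum((M - repmat(D(k, :), m, 1)).^2, 2));   % distances from sample k to all units
    [sorted, order] = sort(dist);
    qerr(k) = sorted(1);                                   % distance to the best matching unit
    nonadjacent(k) = abs(order(1) - order(2)) > 1;         % 1st and 2nd BMU not neighbors on the chain
end
qe = mean(qerr);           % average quantization error
te = mean(nonadjacent);    % topographic error
fprintf('qe = %.4f, te = %.2f\n', qe, te);
```

A topographic error above zero here reflects the fold in the toy map: one sample has its two closest units at non-neighboring grid positions.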
4.6. Example
Here is a simple example of the usage of the Toolbox to make and visualize a SOM of a data set. As the example data, the well-known Iris data set is used [5]. This data set consists of four measurements from 150 Iris flowers: 50 Iris-setosa, 50 Iris-versicolor and 50 Iris-virginica. The measurements are length and width of sepal and petal leaves. The data is in an ASCII file, the first few lines of which are shown below. The first line contains the names of the variables. Each of the following lines gives one data sample beginning with numerical variables and followed by labels.
#n sepallen sepalwid petallen petalwid
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
...
The data set is loaded into Matlab and normalized. Before normalization, an initial statistical look at the data set would be in order, for example using variable-wise histograms. This information would provide an initial idea of what the data is about, and would indicate how the variables should be preprocessed. In this example, the variance normalization is used. After the data set is ready, a SOM is trained. Since the data set had labels, the map is also labeled using som_autolabel. After this, the SOM is visualized using som_show. The U-matrix is shown along with all four component planes. The labels of each map unit are also shown on an empty grid using som_addlabels. The values of components are denormalized so that the values shown on the colorbar are in the original value range. The visualizations are shown in Figure 5.
%% make the data
sD = som_read_data('iris.data');
sD = som_normalize(sD,'var');
%% make the SOM
sM = som_make(sD,'munits',30);
sM = som_autolabel(sM,sD,'vote');
%% basic visualization
som_show(sM,'umat','all','comp',1:4,...
         'empty','Labels','norm','d');
som_addlabels(sM,1,6);
From the U-matrix it is easy to see that the top three<br />
rows of the SOM form a very clear cluster. The<br />
labels show immediately that this cluster corresponds to<br />
the Setosa subspecies. The two other subspecies,<br />
Versicolor and Virginica, form the other cluster. The U-matrix<br />
shows no clear separation between them, but from<br />
the labels it seems that they correspond to two different<br />
parts of the cluster. From the component planes it can be<br />
seen that petal length and petal width are very closely<br />
related to each other. Some correlation also exists between<br />
them and sepal length. The Setosa subspecies exhibits<br />
small petals and short but wide sepals. The separating<br />
factor between Versicolor and Virginica is that the latter<br />
has bigger leaves.<br />
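The relations read visually from the component planes can also be checked numerically. The sketch below computes pairwise correlation coefficients between the columns of a codebook matrix. The matrix M is a hypothetical toy codebook with one row per map unit, not the trained map from this example, and how the prototype vectors are extracted from the map struct is left out because it depends on the Toolbox version.

```matlab
% Hypothetical toy codebook: rows are map units, columns are
% sepal length, sepal width, petal length, petal width.
M = [5.0 3.4 1.5 0.2;
     5.4 3.7 1.6 0.3;
     5.9 2.8 4.3 1.3;
     6.2 2.9 4.6 1.4;
     6.7 3.0 5.5 2.0];

% Pairwise linear correlation between the component planes.
R = corrcoef(M);

% Entry (3,4) relates petal length to petal width.
r_petal = R(3,4);
```

A value of r_petal close to 1 corresponds to two component planes that look nearly identical up to scaling.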
[Figure 5 panels: U-matrix, sepallength, sepalwidth, petallength, petalwidth, Labels. Map: SOM 06-Sep-1999, Size: 14 6]<br />
Figure 5. Visualization of the SOM of Iris data. U-matrix<br />
on top left, then component planes, and map unit<br />
labels on bottom right. The six figures are linked by<br />
position: in each figure, the hexagon in a certain position<br />
corresponds to the same map unit. In the U-matrix,<br />
additional hexagons exist between all pairs of neighboring<br />
map units. For example, the map unit in the top left corner has<br />
low values for sepal length, petal length and petal width, and a<br />
relatively high value for sepal width. The label associated<br />
with the map unit is 'se' (Setosa), and from the U-matrix it<br />
can be seen that the unit is very close to its neighbors.<br />
Component planes are very convenient when one has to<br />
visualize a lot of information at once. However, when only<br />
a few variables are of interest, scatter plots are much more<br />
efficient. Figures 6 and 7 show two scatter plots made<br />
using the som_grid function. Figure 6 shows the PCA projection<br />
of both the data and the map grid, and Figure 7<br />
visualizes all four variables of the SOM plus the<br />
subspecies information using three coordinates, marker<br />
size and marker color.<br />
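The PCA projection used for Figure 6 can be sketched in plain MATLAB as follows. This is an illustration under stated assumptions, not the Toolbox implementation: D and M are small hypothetical matrices with Iris-like values standing in for the data set and the map prototype vectors, and all variable names are made up for the example.

```matlab
% Hypothetical toy data (rows are samples) and map prototypes
% (rows are map units), four variables each.
D = [5.1 3.5 1.4 0.2;
     4.9 3.0 1.4 0.2;
     6.4 3.2 4.5 1.5;
     5.7 2.8 4.1 1.3;
     6.3 3.3 6.0 2.5;
     7.1 3.0 5.9 2.1];
M = [5.0 3.3 1.4 0.2;
     6.0 3.0 4.3 1.4;
     6.7 3.1 5.9 2.3];

% Eigen-decomposition of the data covariance matrix.
mu = mean(D, 1);
[V, E] = eig(cov(D));                   % eigenvectors are the columns of V
[~, order] = sort(diag(E), 'descend');  % largest eigenvalue first
V2 = V(:, order(1:2));                  % two leading eigenvectors

% Project the data and the prototypes to the same 2-D subspace.
Pd = (D - mu) * V2;   % 6-by-2 projected data
Pm = (M - mu) * V2;   % 3-by-2 projected map units
```

Because the prototypes are centered with the data mean and projected with the same eigenvectors, Pd and Pm share one coordinate system and can be drawn in a single scatter plot, with lines between neighboring map units showing the grid.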
Figure 6. Projection of the Iris data set to the<br />
subspace spanned by its two eigenvectors with the greatest<br />
eigenvalues. The three subspecies, Setosa, Versicolor and<br />
Virginica, have been plotted using different markers (x for Versicolor).<br />
The SOM grid has been projected to the same<br />
subspace. Neighboring map units are connected with lines.<br />
Labels associated with map units are also shown.<br />
Figure 7. The four variables and the subspecies<br />
information from the SOM. Three coordinates and marker<br />
size show the four variables. Marker color gives the<br />
subspecies: black for Setosa, dark gray for Versicolor and<br />
light gray for Virginica.<br />
5. Conclusions<br />
In this paper, the SOM Toolbox has been briefly<br />
introduced. The SOM is an excellent tool for the<br />
visualization of high-dimensional data [6]. As such it is<br />
most suitable for the data understanding phase of the<br />
knowledge discovery process, although it can be used for<br />
data preparation, modeling and classification as well.<br />
In future work, our research will concentrate on the<br />
quantitative analysis of SOM mappings, especially the<br />
analysis of clusters and their properties. New functions<br />
and graphical user interface tools will be added to the<br />
Toolbox to increase its usefulness in data mining.<br />
Outside contributions to the Toolbox are also welcome.<br />
It is our hope that the SOM Toolbox promotes the<br />
utilization of the SOM algorithm, in research as well as in<br />
industry, by making its best features more readily<br />
accessible.<br />
Acknowledgements<br />
This work has been partially carried out in the ‘Adaptive<br />
and Intelligent Systems Applications’ technology program<br />
of the Technology Development Center of Finland, and the<br />
EU-financed Brite/Euram project ‘Application of Neural<br />
Network Based Models for Optimization of the Rolling<br />
Process’ (NEUROLL). We would like to thank Mr. Mika<br />
Pollari for implementing the initialization and training<br />
GUI.<br />
References<br />
[1] Kohonen T. Self-Organizing Maps. Springer, Berlin, 1995.<br />
[2] Vesanto J., Alhoniemi E., Himberg J., Kiviluoto K.,<br />
Parviainen J. Self-Organizing Map for Data Mining in<br />
MATLAB: the SOM Toolbox. Simulation News Europe<br />
1999;25:54.<br />
[3] Kohonen T., Hynninen J., Kangas J., Laaksonen J.<br />
SOM_PAK: The Self-Organizing Map Program Package.<br />
Technical Report A31, Helsinki University of Technology,<br />
1996. http://www.cis.hut.fi/nnrc/nnrc-programs.html<br />
[4] Pyle D. Data Preparation for Data Mining. Morgan<br />
Kaufmann Publishers, San Francisco, 1999.<br />
[5] Anderson E. The Irises of the Gaspe Peninsula. Bull.<br />
American Iris Society 1935;59:2-5.<br />
[6] Vesanto J. SOM-Based Visualization Methods. Intelligent<br />
Data Analysis 1999;3:111-126.<br />
Address for correspondence.<br />
Juha Vesanto<br />
Helsinki University of Technology<br />
P.O.Box 5400, FIN-02015 HUT, Finland<br />
Juha.Vesanto@hut.fi<br />
http://www.cis.hut.fi/projects/somtoolbox