27.06.2013 Views

Self−Organizing Map in Matlab: the SOM Toolbox - Aaltodoc

Self−Organizing Map in Matlab: the SOM Toolbox - Aaltodoc

Self−Organizing Map in Matlab: the SOM Toolbox - Aaltodoc

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Publication 6<br />

<strong>Self−Organiz<strong>in</strong>g</strong> <strong>Map</strong> <strong>in</strong> <strong>Matlab</strong>: <strong>the</strong> <strong>SOM</strong> <strong>Toolbox</strong><br />

Juha Vesanto, Johan Himberg, Esa Alhoniemi and Juha<br />

Parhankangas<br />

In Proceed<strong>in</strong>gs of <strong>the</strong> <strong>Matlab</strong> DSP Conference 1999, Espoo,<br />

F<strong>in</strong>land, pp. 35−40, 1999.


Self-organiz<strong>in</strong>g map <strong>in</strong> <strong>Matlab</strong>: <strong>the</strong> <strong>SOM</strong> <strong>Toolbox</strong><br />

Juha Vesanto, Johan Himberg, Esa Alhoniemi and Juha Parhankangas<br />

Laboratory of Computer and Information Science, Hels<strong>in</strong>ki University of Technology, F<strong>in</strong>land<br />

Abstract<br />

The Self-Organiz<strong>in</strong>g <strong>Map</strong> (<strong>SOM</strong>) is a vector<br />

quantization method which places <strong>the</strong> prototype vectors on<br />

a regular low-dimensional grid <strong>in</strong> an ordered fashion.<br />

This makes <strong>the</strong> <strong>SOM</strong> a powerful visualization tool. The<br />

<strong>SOM</strong> <strong>Toolbox</strong> is an implementation of <strong>the</strong> <strong>SOM</strong> and its<br />

visualization <strong>in</strong> <strong>the</strong> <strong>Matlab</strong> 5 comput<strong>in</strong>g environment. In<br />

this article, <strong>the</strong> <strong>SOM</strong> <strong>Toolbox</strong> and its usage are shortly<br />

presented. Also its performance <strong>in</strong> terms of computational<br />

load is evaluated and compared to a correspond<strong>in</strong>g Cprogram.<br />

1. General<br />

This article presents <strong>the</strong> (second version of <strong>the</strong>) <strong>SOM</strong><br />

<strong>Toolbox</strong>, hereafter simply called <strong>the</strong> <strong>Toolbox</strong>, for <strong>Matlab</strong><br />

5 comput<strong>in</strong>g environment by MathWorks, Inc. The <strong>SOM</strong><br />

acronym stands for Self-Organiz<strong>in</strong>g <strong>Map</strong> (also called<br />

Self-Organiz<strong>in</strong>g Feature <strong>Map</strong> or Kohonen map), a popular<br />

neural network based on unsupervised learn<strong>in</strong>g [1]. The<br />

<strong>Toolbox</strong> conta<strong>in</strong>s functions for creation, visualization and<br />

analysis of Self-Organiz<strong>in</strong>g <strong>Map</strong>s. The <strong>Toolbox</strong> is<br />

available free of charge under <strong>the</strong> GNU General Public<br />

License from http://www.cis.hut.fi/projects/somtoolbox.<br />

The <strong>Toolbox</strong> was born out of need for a good,<br />

easy-to-use implementation of <strong>the</strong> <strong>SOM</strong> <strong>in</strong> <strong>Matlab</strong> for<br />

research purposes. In particular, <strong>the</strong> researchers<br />

responsible for <strong>the</strong> <strong>Toolbox</strong> work <strong>in</strong> <strong>the</strong> field of data<br />

m<strong>in</strong><strong>in</strong>g, and <strong>the</strong>refore <strong>the</strong> <strong>Toolbox</strong> is oriented towards that<br />

direction <strong>in</strong> <strong>the</strong> form of powerful visualization functions.<br />

However, also people do<strong>in</strong>g o<strong>the</strong>r k<strong>in</strong>ds of research us<strong>in</strong>g<br />

<strong>SOM</strong> will probably f<strong>in</strong>d it useful — especially if <strong>the</strong>y<br />

have not yet made a <strong>SOM</strong> implementation of <strong>the</strong>ir own <strong>in</strong><br />

<strong>Matlab</strong> environment. S<strong>in</strong>ce much effort has been put to<br />

make <strong>the</strong> <strong>Toolbox</strong> relatively easy to use, it can also be<br />

used for educational purposes.<br />

The <strong>Toolbox</strong> — <strong>the</strong> basic package toge<strong>the</strong>r with<br />

contributed functions — can be used to preprocess data,<br />

<strong>in</strong>itialize and tra<strong>in</strong> <strong>SOM</strong>s us<strong>in</strong>g a range of different k<strong>in</strong>ds<br />

of topologies, visualize <strong>SOM</strong>s <strong>in</strong> various ways, and<br />

analyze <strong>the</strong> properties of <strong>the</strong> <strong>SOM</strong>s and data, e.g. <strong>SOM</strong><br />

quality, clusters on <strong>the</strong> map and correlations between<br />

variables. With data m<strong>in</strong><strong>in</strong>g <strong>in</strong> m<strong>in</strong>d, <strong>the</strong> <strong>Toolbox</strong> and <strong>the</strong><br />

<strong>SOM</strong> <strong>in</strong> general is best suited for data understand<strong>in</strong>g or<br />

survey, although it can also be used for classification and<br />

model<strong>in</strong>g.<br />

2. Self-organiz<strong>in</strong>g map<br />

A <strong>SOM</strong> consists of neurons organized on a regular lowdimensional<br />

grid, see Figure 1. Each neuron is a ddimensional<br />

weight vector (prototype vector, codebook<br />

vector) where d is equal to <strong>the</strong> dimension of <strong>the</strong> <strong>in</strong>put<br />

vectors. The neurons are connected to adjacent neurons by<br />

a neighborhood relation, which dictates <strong>the</strong> topology, or<br />

structure, of <strong>the</strong> map. In <strong>the</strong> <strong>Toolbox</strong>, topology is divided<br />

to two factors: local lattice structure (hexagonal or<br />

rectangular, see Figure 1) and global map shape (sheet,<br />

cyl<strong>in</strong>der or toroid).<br />

2<br />

1<br />

0<br />

Figure 1. Neighborhoods (0, 1 and 2) of <strong>the</strong> centermost<br />

unit: hexagonal lattice on <strong>the</strong> left, rectangular on <strong>the</strong> right.<br />

The <strong>in</strong>nermost polygon corresponds to 0-, next to <strong>the</strong> 1-<br />

and <strong>the</strong> outmost to <strong>the</strong> 2-neighborhood.<br />

The <strong>SOM</strong> can be thought of as a net which is spread to<br />

<strong>the</strong> data cloud. The <strong>SOM</strong> tra<strong>in</strong><strong>in</strong>g algorithm moves <strong>the</strong><br />

weight vectors so that <strong>the</strong>y span across <strong>the</strong> data cloud and<br />

so that <strong>the</strong> map is organized: neighbor<strong>in</strong>g neurons on <strong>the</strong><br />

grid get similar weight vectors. Two variants of <strong>the</strong> <strong>SOM</strong><br />

tra<strong>in</strong><strong>in</strong>g algorithm have been implemented <strong>in</strong> <strong>the</strong> <strong>Toolbox</strong>.<br />

In <strong>the</strong> traditional sequential tra<strong>in</strong><strong>in</strong>g, samples are<br />

presented to <strong>the</strong> map one at a time, and <strong>the</strong> algorithm<br />

gradually moves <strong>the</strong> weight vectors towards <strong>the</strong>m, as<br />

shown <strong>in</strong> Figure 2. In <strong>the</strong> batch tra<strong>in</strong><strong>in</strong>g, <strong>the</strong> data set is<br />

presented to <strong>the</strong> <strong>SOM</strong> as a whole, and <strong>the</strong> new weight<br />

vectors are weighted averages of <strong>the</strong> data vectors. Both<br />

algorithms are iterative, but <strong>the</strong> batch version is much<br />

2<br />

1<br />

0


faster <strong>in</strong> <strong>Matlab</strong> s<strong>in</strong>ce matrix operations can be utilized<br />

efficiently.<br />

For a more complete description of <strong>the</strong> <strong>SOM</strong> and its<br />

implementation <strong>in</strong> <strong>Matlab</strong>, please refer to <strong>the</strong> book by<br />

Kohonen [1], and to <strong>the</strong> <strong>SOM</strong> <strong>Toolbox</strong> documentation.<br />

X<br />

BMU<br />

Figure 2. Updat<strong>in</strong>g <strong>the</strong> best match<strong>in</strong>g unit (BMU) and<br />

its neighbors towards <strong>the</strong> <strong>in</strong>put sample marked with x.<br />

The solid and dashed l<strong>in</strong>es correspond to situation before<br />

and after updat<strong>in</strong>g, respectively.<br />

3. Performance<br />

The <strong>Toolbox</strong> can be downloaded for free from<br />

http://www.cis.hut.fi/projects/somtoolbox. It requires no<br />

o<strong>the</strong>r toolboxes, just <strong>the</strong> basic functions of <strong>Matlab</strong> (version<br />

5.1 or later). The total diskspace required for <strong>the</strong> <strong>Toolbox</strong><br />

itself is less than 1 MB. The documentation takes a few<br />

MBs more.<br />

The performance tests were made <strong>in</strong> a mach<strong>in</strong>e with 3<br />

GBs of memory and 8 250 MHz R10000 CPUs (one of<br />

which was used by <strong>the</strong> test process) runn<strong>in</strong>g IRIX 6.5<br />

operat<strong>in</strong>g system. Some tests were also performed <strong>in</strong> a<br />

workstation with a s<strong>in</strong>gle 350 MHz Pentium II CPU, 128<br />

MBs of memory and L<strong>in</strong>ux operat<strong>in</strong>g system. The <strong>Matlab</strong><br />

version <strong>in</strong> both environments was 5.3.<br />

The purpose of <strong>the</strong> performance tests was only to<br />

evaluate <strong>the</strong> computational load of <strong>the</strong> algorithms. No<br />

attempt was made to compare <strong>the</strong> quality of <strong>the</strong> result<strong>in</strong>g<br />

mapp<strong>in</strong>gs, primarily because <strong>the</strong>re is no uniformly<br />

recognized “correct” method to evaluate it. The tests were<br />

performed with data sets and maps of different sizes, and<br />

three tra<strong>in</strong><strong>in</strong>g functions: som_batchtra<strong>in</strong>,<br />

som_seqtra<strong>in</strong> and som_sompaktra<strong>in</strong>, <strong>the</strong> last of<br />

which calls <strong>the</strong> C-program vsom to perform <strong>the</strong> actual<br />

tra<strong>in</strong><strong>in</strong>g. This program is part of <strong>the</strong> <strong>SOM</strong>_PAK [3],<br />

which is a free software package implement<strong>in</strong>g <strong>the</strong> <strong>SOM</strong><br />

algorithm <strong>in</strong> ANSI-C.<br />

Some typical comput<strong>in</strong>g times are shown <strong>in</strong> Table 1. As<br />

a general result, som_batchtra<strong>in</strong> was clearly <strong>the</strong><br />

fastest. In IRIX it was upto 20 times faster than<br />

som_seqtra<strong>in</strong> and upto 8 times faster than<br />

som_sompaktra<strong>in</strong>. Median values were 6 times and 3<br />

times, respectively. The som_batchtra<strong>in</strong> was<br />

especially faster with larger data sets, while with a small<br />

set and large map it was actually slower. However, <strong>the</strong><br />

latter case is very atypical, and can thus be ignored. In<br />

L<strong>in</strong>ux, <strong>the</strong> smaller amount of memory clearly came <strong>in</strong>to<br />

play: <strong>the</strong> marg<strong>in</strong>al between batch and o<strong>the</strong>r tra<strong>in</strong><strong>in</strong>g<br />

functions was halved.<br />

The number of data samples clearly had a l<strong>in</strong>ear effect<br />

on <strong>the</strong> computational load. On <strong>the</strong> o<strong>the</strong>r hand, <strong>the</strong> number<br />

of map units seemed to have a quadratic effect, at least<br />

with som_batchtra<strong>in</strong>. Of course, also <strong>in</strong>crease <strong>in</strong><br />

<strong>in</strong>put dimension <strong>in</strong>creased <strong>the</strong> comput<strong>in</strong>g times: about<br />

two- to threefold as <strong>in</strong>put dimension <strong>in</strong>creased from 10 to<br />

50. The most supris<strong>in</strong>g result of <strong>the</strong> performance test was<br />

that especially with large data sets and maps, <strong>the</strong><br />

som_batchtra<strong>in</strong> outperformed <strong>the</strong> C-program (vsom<br />

used by som_sompaktra<strong>in</strong>). The reason is probably<br />

<strong>the</strong> fact that <strong>in</strong> <strong>SOM</strong>_PAK, distances between map units<br />

on <strong>the</strong> grid are always calculated anew when needed. In<br />

<strong>SOM</strong> <strong>Toolbox</strong>, all <strong>the</strong>se are calculated beforehand.<br />

Likewise for many o<strong>the</strong>r required matrices.<br />

Indeed, <strong>the</strong> major deficiency of <strong>the</strong> <strong>SOM</strong> <strong>Toolbox</strong>, and<br />

especially of batch tra<strong>in</strong><strong>in</strong>g algorithm, is <strong>the</strong> expenditure<br />

of memory. A rough lower bound estimate of <strong>the</strong> amount<br />

of memory used by som_batchtra<strong>in</strong> is given by:<br />

8(5(m+n)d + 3m 2 ) bytes, where m is <strong>the</strong> number of<br />

map units, n is <strong>the</strong> number of data samples and d is <strong>the</strong><br />

<strong>in</strong>put space dimension. For [3000 x 10] data matrix and<br />

300 map units <strong>the</strong> amount of memory required is still<br />

moderate, <strong>in</strong> <strong>the</strong> order of 3.5 MBs. But for [30000 x 50]<br />

data matrix and 3000 map units, <strong>the</strong> memory requirement<br />

is more than 280 MBs, <strong>the</strong> majority of which comes from<br />

<strong>the</strong> last term of <strong>the</strong> equation. The sequential algorithm is<br />

less extreme requir<strong>in</strong>g only one half or one third of this.<br />

<strong>SOM</strong>_PAK requires much less memory, about 20 MBs for<br />

<strong>the</strong> [30000 x 50] case, and can operate with buffered data.<br />

Table 1. Typical comput<strong>in</strong>g times. Data set size is<br />

given as [n x d] where n is <strong>the</strong> number of data samples<br />

and d is <strong>the</strong> <strong>in</strong>put dimension.<br />

data size<br />

IRIX<br />

map units batch seq sompak<br />

[300x10] 30 0.2 s 3.1 s 0.9 s<br />

[3000x10] 300 7 s 54 s 17 s<br />

[30000x10] 1000 5 m<strong>in</strong> 19 m<strong>in</strong> 9 m<strong>in</strong><br />

[30000x50] 3000<br />

L<strong>in</strong>ux<br />

27 m<strong>in</strong> 5.7 h 75 m<strong>in</strong><br />

[300x10] 30 0.3 s 2.7 s 1.9 s<br />

[3000x10] 300 24 s 76 s 26 s<br />

[30000x10] 1000 13 m<strong>in</strong> 40 m<strong>in</strong> 15 m<strong>in</strong>


4. Use of <strong>SOM</strong> <strong>Toolbox</strong><br />

4.1. Data format<br />

The k<strong>in</strong>d of data that can be processed with <strong>the</strong><br />

<strong>Toolbox</strong> is so-called spreadsheet or table data. Each row<br />

of <strong>the</strong> table is one data sample. The columns of <strong>the</strong> table<br />

are <strong>the</strong> variables of <strong>the</strong> data set. The variables might be <strong>the</strong><br />

properties of an object, or a set of measurements measured<br />

at a specific time. The important th<strong>in</strong>g is that every sample<br />

has <strong>the</strong> same set of variables. Some of <strong>the</strong> values may be<br />

miss<strong>in</strong>g, but <strong>the</strong> majority should be <strong>the</strong>re. The table<br />

representation is a very common data format. If <strong>the</strong><br />

available data does not conform to <strong>the</strong>se specifications, it<br />

can usually be transformed so that it does.<br />

The <strong>Toolbox</strong> can handle both numeric and categorial<br />

data, but only <strong>the</strong> former is utilized <strong>in</strong> <strong>the</strong> <strong>SOM</strong> algorithm.<br />

In <strong>the</strong> <strong>Toolbox</strong>, categorial data can be <strong>in</strong>serted <strong>in</strong>to labels<br />

associated with each data sample. They can be considered<br />

as post-it notes attached to each sample. The user can<br />

check on <strong>the</strong>m later to see what was <strong>the</strong> mean<strong>in</strong>g of some<br />

specific sample, but <strong>the</strong> tra<strong>in</strong><strong>in</strong>g algorithm ignores <strong>the</strong>m.<br />

Function som_autolabel can be used to handle<br />

categorial variables. If <strong>the</strong> categorial variables need to be<br />

utilized <strong>in</strong> tra<strong>in</strong><strong>in</strong>g <strong>the</strong> <strong>SOM</strong>, <strong>the</strong>y can be converted <strong>in</strong>to<br />

numerical variables us<strong>in</strong>g, e.g., mapp<strong>in</strong>g or 1-of-n<br />

cod<strong>in</strong>g [4].<br />

Note that for a variable to be “numeric”, <strong>the</strong> numeric<br />

representation must be mean<strong>in</strong>gful: values 1, 2 and 4<br />

correspond<strong>in</strong>g to objects A, B and C should really mean<br />

that (<strong>in</strong> terms of this variable) B is between A and C, and<br />

that <strong>the</strong> distance between B and A is smaller than <strong>the</strong><br />

distance between B and C. Identification numbers, error<br />

codes, etc. rarely have such mean<strong>in</strong>g, and <strong>the</strong>y should be<br />

handled as categorial data.<br />

4.2. Construction of data sets<br />

First, <strong>the</strong> data has to be brought <strong>in</strong>to <strong>Matlab</strong> us<strong>in</strong>g, for<br />

example, standard <strong>Matlab</strong> functions load and fscanf.<br />

In addition, <strong>the</strong> <strong>Toolbox</strong> has function som_read_data<br />

which can be used to read ASCII data files:<br />

sD = som_read_data(‘data.txt’);<br />

The data is usually put <strong>in</strong>to a so-called data struct,<br />

which is a <strong>Matlab</strong> struct def<strong>in</strong>ed <strong>in</strong> <strong>the</strong> <strong>Toolbox</strong> to group<br />

<strong>in</strong>formation related to a data set. It has fields for numerical<br />

data (.data), str<strong>in</strong>gs (.labels), as well as for<br />

<strong>in</strong>formation about data set and <strong>the</strong> <strong>in</strong>dividual variables.<br />

The <strong>Toolbox</strong> utilizes many o<strong>the</strong>r structs as well, for<br />

example a map struct which holds all <strong>in</strong>formation related<br />

to a <strong>SOM</strong>. A numerical matrix can be converted <strong>in</strong>to a<br />

data struct with: sD = som_data_struct(D). If <strong>the</strong><br />

data only consists of numerical values, it is not actually<br />

necessary to use data structs at all. Most functions accept<br />

numerical matrices as well. However, if <strong>the</strong>re are<br />

categorial variables, data structs has be used. The<br />

categorial variables are converted to str<strong>in</strong>gs and put <strong>in</strong>to<br />

<strong>the</strong> .labels field of <strong>the</strong> data struct as a cell array of<br />

str<strong>in</strong>gs.<br />

4.3. Data preprocess<strong>in</strong>g<br />

Data preprocess<strong>in</strong>g <strong>in</strong> general can be just about<br />

anyth<strong>in</strong>g: simple transformations or normalizations<br />

performed on s<strong>in</strong>gle variables, filters, calculation of new<br />

variables from exist<strong>in</strong>g ones. In <strong>the</strong> <strong>Toolbox</strong>, only <strong>the</strong> first<br />

of <strong>the</strong>se is implemented as part of <strong>the</strong> package.<br />

Specifically, <strong>the</strong> function som_normalize can be used<br />

to perform l<strong>in</strong>ear and logarithmic scal<strong>in</strong>gs and histogram<br />

equalizations of <strong>the</strong> numerical variables (<strong>the</strong> .data<br />

field). There is also a graphical user <strong>in</strong>terface tool for<br />

preprocess<strong>in</strong>g data, see Figure 3.<br />

Scal<strong>in</strong>g of variables is of special importance <strong>in</strong> <strong>the</strong><br />

<strong>Toolbox</strong>, s<strong>in</strong>ce <strong>the</strong> <strong>SOM</strong> algorithm uses Euclidean metric<br />

to measure distances between vectors. If one variable has<br />

values <strong>in</strong> <strong>the</strong> range of [0,...,1000] and ano<strong>the</strong>r <strong>in</strong> <strong>the</strong> range<br />

of [0,...,1] <strong>the</strong> former will almost completely dom<strong>in</strong>ate <strong>the</strong><br />

map organization because of its greater impact on <strong>the</strong><br />

distances measured. Typically, one would want <strong>the</strong><br />

variables to be equally important. The standard way to<br />

achieve this is to l<strong>in</strong>early scale all variables so that <strong>the</strong>ir<br />

variances are equal to one.<br />

One of <strong>the</strong> advantages of us<strong>in</strong>g data structs <strong>in</strong>stead of<br />

simple data matrices is that <strong>the</strong> structs reta<strong>in</strong> <strong>in</strong>formation<br />

of <strong>the</strong> normalizations <strong>in</strong> <strong>the</strong> field .comp_norm. Us<strong>in</strong>g<br />

function som_denormalize one can reverse <strong>the</strong><br />

normalization to get <strong>the</strong> values <strong>in</strong> <strong>the</strong> orig<strong>in</strong>al scale: sD =<br />

som_denormalize(sD). Also, one can repeat <strong>the</strong><br />

exactly same normalizations to o<strong>the</strong>r data sets.<br />

All normalizations are s<strong>in</strong>gle-variable transformations.<br />

One can make one k<strong>in</strong>d of normalization to one variable,<br />

and ano<strong>the</strong>r type of normalization to ano<strong>the</strong>r variable.<br />

Also, multiple normalizations one after <strong>the</strong> o<strong>the</strong>r can be<br />

made for each variable. For example, consider a data set<br />

sD with three numerical variables. The user could do a<br />

histogram equalization to <strong>the</strong> first variable, a logarithmic<br />

scal<strong>in</strong>g to <strong>the</strong> third variable, and f<strong>in</strong>ally a l<strong>in</strong>ear scal<strong>in</strong>g to<br />

unit variance to all three variables:<br />

sD = som_normalize(sD,'histD',1);<br />

sD = som_normalize(sD,'log',3);<br />

sD = som_normalize(sD,'var',1:3);<br />

The data does not necessarily have to be preprocessed<br />

at all before creat<strong>in</strong>g a <strong>SOM</strong> us<strong>in</strong>g it. However, <strong>in</strong> most<br />

real tasks preprocess<strong>in</strong>g is important; perhaps even <strong>the</strong><br />

most important part of <strong>the</strong> whole process [4].


2<br />

1<br />

0<br />

−1<br />

Figure 3. Data set preprocess<strong>in</strong>g tool.<br />

Figure 4. <strong>SOM</strong> <strong>in</strong>itialization and tra<strong>in</strong><strong>in</strong>g tool.<br />

4.4. Initialization and tra<strong>in</strong><strong>in</strong>g<br />

There are two <strong>in</strong>itialization (random and l<strong>in</strong>ear) and<br />

two tra<strong>in</strong><strong>in</strong>g (sequential and batch) algorithms<br />

implemented <strong>in</strong> <strong>the</strong> <strong>Toolbox</strong>. By default l<strong>in</strong>ear<br />

<strong>in</strong>itialization and batch tra<strong>in</strong><strong>in</strong>g algorithm are used. The<br />

simplest way to <strong>in</strong>itialize and tra<strong>in</strong> a <strong>SOM</strong> is to use<br />

function som_make which does both us<strong>in</strong>g automatically<br />

selected parameters:<br />

sM = som_make(sD);<br />

The tra<strong>in</strong><strong>in</strong>g is done is two phases: rough tra<strong>in</strong><strong>in</strong>g with<br />

large (<strong>in</strong>itial) neighborhood radius and large (<strong>in</strong>itial)<br />

learn<strong>in</strong>g rate, and f<strong>in</strong>etun<strong>in</strong>g with small radius and<br />

learn<strong>in</strong>g rate. If tighter control over <strong>the</strong> tra<strong>in</strong><strong>in</strong>g<br />

parameters is desired, <strong>the</strong> respective <strong>in</strong>itialization and<br />

tra<strong>in</strong><strong>in</strong>g functions, e.g. som_batchtra<strong>in</strong>, can be used<br />

directly. There is also a graphical user <strong>in</strong>terface tool for<br />

<strong>in</strong>itializ<strong>in</strong>g and tra<strong>in</strong><strong>in</strong>g <strong>SOM</strong>s, see Figure 4.<br />

4.5. Visualization and analysis<br />

There are a variety of methods to visualize <strong>the</strong> <strong>SOM</strong>. In<br />

<strong>the</strong> <strong>Toolbox</strong>, <strong>the</strong> basic tool is <strong>the</strong> function som_show. It<br />

can be used to show <strong>the</strong> U-matrix and <strong>the</strong> component<br />

planes of <strong>the</strong> <strong>SOM</strong>:<br />

som_show(sM);<br />

The U-matrix visualizes distances between neighbor<strong>in</strong>g<br />

map units, and thus shows <strong>the</strong> cluster structure of <strong>the</strong> map:<br />

high values of <strong>the</strong> U-matrix <strong>in</strong>dicate a cluster border,<br />

uniform areas of low values <strong>in</strong>dicate clusters <strong>the</strong>mselves.<br />

Each component plane shows <strong>the</strong> values of one variable <strong>in</strong><br />

each map unit. On top of <strong>the</strong>se visualizations, additional<br />

<strong>in</strong>formation can be shown: labels, data histograms and<br />

trajectories.<br />

With function som_vis much more advanced<br />

visualizations are possible. The function is based on <strong>the</strong><br />

idea that <strong>the</strong> visualization of a data set simply consists of a<br />

set of objects, each with a unique position, color and<br />

shape. In addition, connections between objects, for<br />

example neighborhood relations, can be shown us<strong>in</strong>g<br />

l<strong>in</strong>es. With som_vis <strong>the</strong> user is able to assign arbitrary<br />

values to each of <strong>the</strong>se properties. For example, x-, y-, and<br />

z-coord<strong>in</strong>ates, object size and color can each stand for one<br />

variable, thus enabl<strong>in</strong>g <strong>the</strong> simultaneous visualization of<br />

five variables. The different options are:<br />

- <strong>the</strong> position of an object can be 2- or 3-dimensional<br />

- <strong>the</strong> color of an object can be freely selected from<br />

<strong>the</strong> RGB cube, although typically <strong>in</strong>dexed color is<br />

used<br />

- <strong>the</strong> shape of an object can be any of <strong>the</strong> <strong>Matlab</strong><br />

plot markers ('.','+', etc.), a pie chart, a bar


chart, a plot or even an arbitrarily shaped polygon,<br />

typically a rectangle or hexagon<br />

- l<strong>in</strong>es between objects can have arbitrary color,<br />

width and any of <strong>the</strong> <strong>Matlab</strong> l<strong>in</strong>e modes, e.g. '-'<br />

- <strong>in</strong> addition to <strong>the</strong> objects, associated labels can be<br />

shown<br />

For quantitative analysis of <strong>the</strong> <strong>SOM</strong> <strong>the</strong>re are at <strong>the</strong><br />

moment only a few tools. The function som_quality<br />

supplies two quality measures for <strong>SOM</strong>: average<br />

quantization error and topographic error. However, us<strong>in</strong>g<br />

low level functions, like som_neighborhood,<br />

som_bmus and som_unit_dists, it is easy to<br />

implement new analysis functions. Much research is be<strong>in</strong>g<br />

done <strong>in</strong> this area, and many new functions for <strong>the</strong> analysis<br />

will be added to <strong>the</strong> <strong>Toolbox</strong> <strong>in</strong> <strong>the</strong> future, for example<br />

tools for cluster<strong>in</strong>g and analysis of <strong>the</strong> properties of <strong>the</strong><br />

clusters. Also new visualization functions for mak<strong>in</strong>g<br />

projections and specific visualization tasks will be added<br />

to <strong>the</strong> <strong>Toolbox</strong>.<br />

4.6. Example<br />

Here is a simple example of <strong>the</strong> usage of <strong>the</strong> <strong>Toolbox</strong> to<br />

make and visualize a <strong>SOM</strong> of a data set. As <strong>the</strong> example<br />

data, <strong>the</strong> well-known Iris data set is used [5]. This data set<br />

consists of four measurements from 150 Iris flowers: 50<br />

Iris-setosa, 50 Iris-versicolor and 50 Iris-virg<strong>in</strong>ica. The<br />

measurements are length and width of sepal and petal<br />

leaves. The data is <strong>in</strong> an ASCII file, <strong>the</strong> first few l<strong>in</strong>es of<br />

which are shown below. The first l<strong>in</strong>e conta<strong>in</strong>s <strong>the</strong> names<br />

of <strong>the</strong> variables. Each of <strong>the</strong> follow<strong>in</strong>g l<strong>in</strong>es gives one<br />

data sample beg<strong>in</strong>n<strong>in</strong>g with numerical variables and<br />

followed by labels.<br />

#n sepallen sepalwid petallen petalwid<br />

5.1 3.5 1.4 0.2 setosa<br />

4.9 3.0 1.4 0.2 setosa<br />

...<br />

The data set is loaded <strong>in</strong>to <strong>Matlab</strong> and normalized.<br />

Before normalization, an <strong>in</strong>itial statistical look of <strong>the</strong> data<br />

set would be <strong>in</strong> order, for example us<strong>in</strong>g variable-wise<br />

histograms. This <strong>in</strong>formation would provide an <strong>in</strong>itial idea<br />

of what <strong>the</strong> data is about, and would <strong>in</strong>dicate how <strong>the</strong><br />

variables should be preprocessed. In this example, <strong>the</strong><br />

variance normalization is used. After <strong>the</strong> data set is ready,<br />

a <strong>SOM</strong> is tra<strong>in</strong>ed. S<strong>in</strong>ce <strong>the</strong> data set had labels, <strong>the</strong> map is<br />

also labeled us<strong>in</strong>g som_autolabel. After this, <strong>the</strong><br />

<strong>SOM</strong> is visualized us<strong>in</strong>g som_show. The U-matrix is<br />

shown along with all four component planes. Also <strong>the</strong><br />

labels of each map unit are shown on an empty grid us<strong>in</strong>g<br />

som_addlabels. The values of components are<br />

denormalized so that <strong>the</strong> values shown on <strong>the</strong> colorbar are<br />

<strong>in</strong> <strong>the</strong> orig<strong>in</strong>al value range. The visualizations are shown<br />

<strong>in</strong> Figure 5.<br />

%% make <strong>the</strong> data<br />

sD = som_read_data('iris.data');<br />

sD = som_normalize(sD,'var');<br />

%% make <strong>the</strong> <strong>SOM</strong><br />

sM = som_make(sD,'munits',30);<br />

sM = som_autolabel(sM,sD,'vote');<br />

%% basic visualization<br />

som_show(sM,’umat’,’all’,’comp’,1:4,...<br />

’empty’,’Labels’,’norm’,’d’);<br />

som_addlabels(sM,1,6);<br />

From <strong>the</strong> U-matrix it is easy to see that <strong>the</strong> top three<br />

rows of <strong>the</strong> <strong>SOM</strong> form a very clear cluster. By look<strong>in</strong>g at<br />

<strong>the</strong> labels, it is immediately seen that this corresponds to<br />

<strong>the</strong> Setosa subspecies. The two o<strong>the</strong>r subspecies<br />

Versicolor and Virg<strong>in</strong>ica form <strong>the</strong> o<strong>the</strong>r cluster. The Umatrix<br />

shows no clear separation between <strong>the</strong>m, but from<br />

<strong>the</strong> labels it seems that <strong>the</strong>y correspond to two different<br />

parts of <strong>the</strong> cluster. From <strong>the</strong> component planes it can be<br />

seen that <strong>the</strong> petal length and petal width are very closely<br />

related to each o<strong>the</strong>r. Also some correlation exists between<br />

<strong>the</strong>m and sepal length. The Setosa subspecies exhibits<br />

small petals and short but wide sepals. The separat<strong>in</strong>g<br />

factor between Versicolor and Virg<strong>in</strong>ica is that <strong>the</strong> latter<br />

has bigger leaves.<br />

U−matrix<br />

petallength<br />

d<br />

1.6<br />

1.4<br />

1.2<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

5.52<br />

4.64<br />

3.76<br />

2.88<br />

1.99<br />

sepallength<br />

d<br />

d<br />

7.09<br />

6.67<br />

6.26<br />

5.84<br />

5.43<br />

5.02<br />

<strong>Map</strong>: <strong>SOM</strong> 06−Sep−1999, Size: 14 6<br />

sepalwidth<br />

petalwidth Labels<br />

se se se se se se<br />

1.96 se se se se se se<br />

se se se se se<br />

1.58<br />

se<br />

ve<br />

se se<br />

1.2<br />

ve ve ve ve ve ve<br />

ve ve ve ve ve vi<br />

ve ve ve ve ve ve<br />

ve ve ve ve<br />

0.817 vi ve ve ve<br />

vi ve vi vi vi vi<br />

0.436 vi<br />

vi vi vi vi<br />

vi vi vi<br />

vi vi vi vi vi<br />

Figure 5. Visualization of <strong>the</strong> <strong>SOM</strong> of Iris data. Umatrix<br />

on top left, <strong>the</strong>n component planes, and map unit<br />

labels on bottom right. The six figures are l<strong>in</strong>ked by<br />

position: <strong>in</strong> each figure, <strong>the</strong> hexagon <strong>in</strong> a certa<strong>in</strong> position<br />

corresponds to <strong>the</strong> same map unit. In <strong>the</strong> U-matrix,<br />

additional hexagons exist between all pairs of neighbor<strong>in</strong>g<br />

map units. For example, <strong>the</strong> map unit <strong>in</strong> top left corner has<br />

low values for sepal length, petal length and width, and<br />

relatively high value for sepal width. The label associated<br />

with <strong>the</strong> map unit is 'se' (Setosa) and from <strong>the</strong> U-matrix it<br />

can be seen that <strong>the</strong> unit is very close to its neighbors.<br />

d<br />

3.7<br />

3.49<br />

3.27<br />

3.05<br />

2.84<br />

2.62


Component planes are very convenient when one has to<br />

visualize a lot of <strong>in</strong>formation at once. However, when only<br />

a few variables are of <strong>in</strong>terest scatter plots are much more<br />

efficient. Figures 6 and 7 show two scatter plots made<br />

us<strong>in</strong>g <strong>the</strong> som_grid function. Figure 6 shows <strong>the</strong> PCAprojection<br />

of both data and <strong>the</strong> map grid, and Figure 7<br />

visualizes all four variables of <strong>the</strong> <strong>SOM</strong> plus <strong>the</strong><br />

subspecies <strong>in</strong>formation us<strong>in</strong>g three coord<strong>in</strong>ates, marker<br />

size and marker color.<br />

3<br />

2<br />

1<br />

0<br />

−1<br />

−2<br />

se se<br />

se<br />

se<br />

se<br />

se<br />

se<br />

se<br />

se<br />

se<br />

se se<br />

se se<br />

se<br />

se se<br />

se<br />

se<br />

se<br />

ve<br />

vi<br />

ve<br />

veveve vi<br />

vi vi<br />

vi vi vi<br />

vi<br />

veveve<br />

ve vi vi<br />

ve ve ve<br />

ve ve<br />

ve ve<br />

ve<br />

ve<br />

vi<br />

vi vi<br />

vi<br />

vi vi<br />

ve vi<br />

ve vi<br />

ve veve<br />

vi ve<br />

ve ve<br />

−3<br />

−3 −2 −1 0 1 2 3 4<br />

Figure 6. Projection of <strong>the</strong> IRIS data set to <strong>the</strong><br />

subspace spanned by its two eigenvectors with greatest<br />

eigenvalues. The three subspecies have been plotted us<strong>in</strong>g<br />

different markers: IRU 6HWRVD x for Versicolor and ¸ IRU<br />

Virg<strong>in</strong>ica. The <strong>SOM</strong> grid has been projected to <strong>the</strong> same<br />

subspace. Neighbor<strong>in</strong>g map units connected with l<strong>in</strong>es.<br />

Labels associated with map units are also shown.<br />

7<br />

6<br />

5<br />

4<br />

3<br />

2<br />

1<br />

4<br />

3.5<br />

3<br />

2.5<br />

2<br />

4.5<br />

Figure 7. The four variables and <strong>the</strong> subspecies<br />

<strong>in</strong>formation from <strong>the</strong> <strong>SOM</strong>. Three coord<strong>in</strong>ates and marker<br />

size show <strong>the</strong> four variables. Marker color gives<br />

subspecies: black for Setosa, dark gray for Versicolor and<br />

light gray for Virg<strong>in</strong>ica.<br />

5<br />

5.5<br />

vi<br />

6<br />

6.5<br />

7<br />

7.5<br />

5. Conclusions<br />

In this paper, <strong>the</strong> <strong>SOM</strong> <strong>Toolbox</strong> has been shortly<br />

<strong>in</strong>troduced. The <strong>SOM</strong> is an excellent tool <strong>in</strong> <strong>the</strong><br />

visualization of high dimensional data [6]. As such it is<br />

most suitable for data understand<strong>in</strong>g phase of <strong>the</strong><br />

knowledge discovery process, although it can be used for<br />

data preparation, model<strong>in</strong>g and classification as well.<br />

In future work, our research will concentrate on <strong>the</strong><br />

quantitative analysis of <strong>SOM</strong> mapp<strong>in</strong>gs, especially<br />

analysis of clusters and <strong>the</strong>ir properties. New functions<br />

and graphical user <strong>in</strong>terface tools will be added to <strong>the</strong><br />

<strong>Toolbox</strong> to <strong>in</strong>crease its usefulness <strong>in</strong> data m<strong>in</strong><strong>in</strong>g. Also<br />

outside contributions to <strong>the</strong> <strong>Toolbox</strong> are welcome.<br />

It is our hope that <strong>the</strong> <strong>SOM</strong> <strong>Toolbox</strong> promotes <strong>the</strong><br />

utilization of <strong>SOM</strong> algorithm – <strong>in</strong> research as well as <strong>in</strong><br />

<strong>in</strong>dustry – by mak<strong>in</strong>g its best features more readily<br />

accessible.<br />

Acknowledgements<br />

This work has been partially carried out <strong>in</strong> ‘Adaptive<br />

and Intelligent Systems Applications’ technology program<br />

of Technology Development Center of F<strong>in</strong>land, and <strong>the</strong><br />

EU f<strong>in</strong>anced Brite/Euram project ‘Application of Neural<br />

Network Based Models for Optimization of <strong>the</strong> Roll<strong>in</strong>g<br />

Process’ (NEUROLL). We would like to thank Mr. Mika<br />

Pollari for implement<strong>in</strong>g <strong>the</strong> <strong>in</strong>itialization and tra<strong>in</strong><strong>in</strong>g<br />

GUI.<br />

References<br />

[1] Kohonen T. Self-Organiz<strong>in</strong>g <strong>Map</strong>s. Spr<strong>in</strong>ger, Berl<strong>in</strong>, 1995.<br />

[2] Vesanto J., Alhoniemi E., Himberg J., Kiviluoto K.,<br />

Parvia<strong>in</strong>en J. Self-Organiz<strong>in</strong>g <strong>Map</strong> for Data M<strong>in</strong><strong>in</strong>g <strong>in</strong><br />

MATLAB: <strong>the</strong> <strong>SOM</strong> <strong>Toolbox</strong>. Simulation News Europe<br />

1999;25:54.<br />

[3] Kohonen T., Hynn<strong>in</strong>en J., Kangas J., Laaksonen J.<br />

<strong>SOM</strong>_PAK: The Self-Organiz<strong>in</strong>g <strong>Map</strong> Program Package,<br />

Technical Report A31, Hels<strong>in</strong>ki University of Technology,<br />

1996, http://www.cis.hut.fi/nnrc/nnrc-programs.html<br />

[4] Pyle D. Data Preparation for Data M<strong>in</strong><strong>in</strong>g. Morgan<br />

Kaufman Publishers, San Francisco, 1999.<br />

[5] Anderson E. The Irises of <strong>the</strong> Gaspe Pen<strong>in</strong>sula. Bull.<br />

American Iris Society; 1935;59:2-5.<br />

[6] Vesanto J. <strong>SOM</strong>-Based Visualization Methods. Intelligent<br />

Data Analysis 1999;3:111-126.<br />

Address for correspondence.<br />

Juha Vesanto<br />

Hels<strong>in</strong>ki University of Technology<br />

P.O.Box 5400, FIN-02015 HUT, F<strong>in</strong>land<br />

Juha.Vesanto@hut.fi<br />

http://www.cis.hut.fi/projects/somtoolbox

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!